NetBSD Problem Report #32717

From raeburn@MIT.EDU  Fri Feb  3 08:30:46 2006
Return-Path: <raeburn@MIT.EDU>
Received: from biscayne-one-station.mit.edu (BISCAYNE-ONE-STATION.MIT.EDU [18.7.7.80])
	by narn.netbsd.org (Postfix) with ESMTP id 9A38963B86B
	for <gnats-bugs@gnats.netbsd.org>; Fri,  3 Feb 2006 08:30:45 +0000 (UTC)
Message-Id: <tx1oe1on8sv.fsf@mit.edu>
Date: Fri, 03 Feb 2006 03:30:40 -0500
From: Ken Raeburn <raeburn@MIT.EDU>
To: gnats-bugs@netbsd.org
Subject: alpha 3.0 install kernel doesn't see scsi disks
X-Send-Pr-Version: 3.95

>Number:         32717
>Category:       kern
>Synopsis:       alpha 3.0 install kernel doesn't see scsi disks
>Confidential:   no
>Severity:       non-critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Feb 03 08:35:00 +0000 2006
>Last-Modified:  Wed May 17 03:35:00 +0000 2006
>Originator:     Ken Raeburn
>Release:        NetBSD 3.0
>Organization:
	MIT
>Environment:
System: NetBSD 3.0 installation CD
Architecture: alpha
Machine: alpha
>Description:

I've got an XP1000 that's been running NetBSD 2.0 quite happily, aside
from occasional complaints of resource shortages from the siop(?)
driver, which I think I've reported already.  Otherwise, everything
seems fine.  The machine has two SCSI disks internally, and gets used
regularly as a build engine via cron jobs.

I burned the alpha 3.0 install cd image to a cd and booted from it.
It fails to recognize the disks, and thus can't update the machine.
The boot messages from this kernel (as much as were left in memory
when I got 2.0 up again) were:

u0 at mainbus0: ID 0 (primary), 21264A-9
cpu0: Architecture extensions: 307<PAT,MVI,CIX,FIX,BWX>
tsc0 at mainbus0: 21272 Core Logic Chipset, Cchip rev 0
tsc0: 4 Dchips, 1 memory bus of 32 bytes
tsc0: arrays present: 256MB, 1024MB, 0MB, 0MB, Dchip 0 rev 1
tsp0 at tsc0
pci0 at tsp0 bus 0
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
sio0 at pci0 dev 7 function 0: vendor 0x1080 product 0xc693 (rev. 0x00)
cypide0 at pci0 dev 7 function 1
cypide0: Cypress 82C693 IDE Controller (rev. 0x00)
cypide0: bus-master DMA support present
cypide0: primary channel wired to compatibility mode
cypide0: primary channel interrupting at isa irq 14
atabus0 at cypide0 channel 0
cypide1 at pci0 dev 7 function 2
cypide1: Cypress 82C693 IDE Controller (rev. 0x00)
cypide1: hardware does not support DMA
cypide1: primary channel wired to compatibility mode
cypide1: secondary channel interrupting at isa irq 15
atabus1 at cypide1 channel 0
vendor 0x1080 product 0xc693 (USB serial bus, interface 0x10) at pci0 dev 7 function 3 not configured
siop0 at pci0 dev 12 function 0: Symbios Logic 53c895 (ultra2-wide scsi)
siop0: using on-board RAM
siop0: interrupting at dec 6600 irq 36
scsibus0 at siop0: 16 targets, 8 luns per target
vga0 at pci0 dev 13 function 0: vendor 0x104c product 0x3d07 (rev. 0x01)
wsdisplay0 at vga0 (kbdmux ignored): console (80x25, vt100 emulation)
isa0 at sio0
lpt0 at isa0 port 0x3bc-0x3bf irq 7
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0 (mux ignored): console keyboard, using wsdisplay0
pms0 at pckbc0 (aux slot)
pckbc0: using irq 12 for aux slot
wsmouse0 at pms0 (mux ignored)
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
mcclock0 at isa0 port 0x70-0x71: mc146818 or compatible
tsp1 at tsc0
pci1 at tsp1 bus 0
pci1: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
tlp0 at pci1 dev 3 function 0: DECchip 21143 Ethernet, pass 4.1
tlp0: interrupting at dec 6600 irq 45
tlp0: DEC, Ethernet address 08:00:2b:87:0e:6d
tlp0: 10baseT, 10base2, 10base5, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
isp0 at pci1 dev 6 function 0: QLogic 1020 Fast Wide SCSI HBA
isp0: interrupting at dec 6600 irq 47
scsibus1 at isp0: 16 targets, 8 luns per target
ppb0 at pci1 dev 8 function 0: vendor 0x1011 product 0x0024 (rev. 0x03)
pci2 at ppb0 bus 2
pci2: i/o space, memory space enabled, rd/line, wr/inv ok
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
md0: internal 4650 KB image area
stray isa irq 14
atapibus0 at atabus0: 2 targets
cd0 at atapibus0 drive 1: <Compaq  CRD-8322B, 1999/02/11, 1.07> cdrom removable
scsibus0: waiting 2 seconds for devices to settle...
scsibus1: waiting 2 seconds for devices to settle...
cd0: 32-bit data port
cd0: drive supports PIO mode 4, DMA mode 2
cd0(cypide0:0:1): using PIO mode 4, DMA mode 2 (using DMA)
probe(siop0:0:0:0): request sense for a request sense ?
probe(siop0:0:0:0): request sense failed with error 22
probe(siop0:0:0:0): generic HBA error
probe(siop0:0:1:0): request sense for a request sense ?
probe(siop0:0:1:0): request sense failed with error 22
probe(siop0:0:1:0): generic HBA error
WARNING: can't figure what device matches "IDE 0 107 0 1 1 0 0"
root on md0a dumps on md0b
root file system type: ffs
WARNING: clock gained 46 days -- CHECK AND RESET THE DATE!


The 2.0 boot messages (taken from a reboot after failing to upgrade)
are:

consinit: not using prom console
Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004
    The NetBSD Foundation, Inc.  All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
    The Regents of the University of California.  All rights reserved.

NetBSD 2.0 (GENERIC) #0: Tue Nov 30 21:04:03 UTC 2004
	builds@build:/big/builds/ab/netbsd-2-0-RELEASE/alpha/200411300000Z-obj/big/builds/ab/netbsd-2-0-RELEASE/src/sys/arch/alpha/compile/GENERIC
COMPAQ Professional Workstation XP1000, 666MHz, s/n 4029DRSZ10
8192 byte page size, 1 processor.
total memory = 1280 MB
(1792 KB reserved for PROM, 1278 MB used by NetBSD)
avail memory = 1246 MB
mainbus0 (root)
cpu0 at mainbus0: ID 0 (primary), 21264A-9
cpu0: Architecture extensions: 307<PAT,MVI,CIX,FIX,BWX>
tsc0 at mainbus0: 21272 Core Logic Chipset, Cchip rev 0
tsc0: 4 Dchips, 1 memory bus of 32 bytes
tsc0: arrays present: 256MB, 1024MB, 0MB, 0MB, Dchip 0 rev 1
tsp0 at tsc0
pci0 at tsp0 bus 0
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
sio0 at pci0 dev 7 function 0: Contaq Microsystems 82C693 PCI-ISA Bridge (rev. 0x00)
cypide0 at pci0 dev 7 function 1
cypide0: Cypress 82C693 IDE Controller (rev. 0x00)
cypide0: bus-master DMA support present
cypide0: primary channel wired to compatibility mode
cypide0: primary channel interrupting at isa irq 14
atabus0 at cypide0 channel 0
cypide1 at pci0 dev 7 function 2
cypide1: Cypress 82C693 IDE Controller (rev. 0x00)
cypide1: hardware does not support DMA
cypide1: primary channel wired to compatibility mode
cypide1: secondary channel interrupting at isa irq 15
atabus1 at cypide1 channel 0
ohci0 at pci0 dev 7 function 3: Contaq Microsystems 82C693 PCI-ISA Bridge (rev. 0x00)
ohci0: interrupting at isa irq 10
ohci0: OHCI version 1.0, legacy support
usb0 at ohci0: USB revision 1.0
uhub0 at usb0
uhub0: Contaq Microsys OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
esiop0 at pci0 dev 12 function 0: Symbios Logic 53c895 (ultra2-wide scsi)
esiop0: using on-board RAM
esiop0: interrupting at dec 6600 irq 36
scsibus0 at esiop0: 16 targets, 8 luns per target
vga0 at pci0 dev 13 function 0: Texas Instruments TVP4020 Permedia 2 (rev. 0x01)
wsdisplay0 at vga0 (kbdmux ignored): console (80x25, vt100 emulation)
isa0 at sio0
lpt0 at isa0 port 0x3bc-0x3bf irq 7
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0 (mux ignored): console keyboard, using wsdisplay0
pms0 at pckbc0 (aux slot)
pckbc0: using irq 12 for aux slot
wsmouse0 at pms0 (mux ignored)
sb0 at isa0 port 0x220-0x237 irq 5 drq 1: dsp v3.01
audio0 at sb0: half duplex, mmap, independent
midi at sb0 not configured
opl at sb0 not configured
pcppi0 at isa0 port 0x61
midi0 at pcppi0: PC speaker
spkr0 at pcppi0
isabeep0 at pcppi0
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
mcclock0 at isa0 port 0x70-0x71: mc146818 or compatible
tsp1 at tsc0
pci1 at tsp1 bus 0
pci1: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
tlp0 at pci1 dev 3 function 0: DECchip 21143 Ethernet, pass 4.1
tlp0: interrupting at dec 6600 irq 45
tlp0: DEC , Ethernet address 08:00:2b:87:0e:6d
tlp0: 10baseT, 10base2, 10base5, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
isp0 at pci1 dev 6 function 0: QLogic 1020 Fast Wide SCSI HBA
isp0: interrupting at dec 6600 irq 47
scsibus1 at isp0: 16 targets, 8 luns per target
ppb0 at pci1 dev 8 function 0: Digital Equipment DECchip 21152 PCI-PCI Bridge (rev. 0x03)
pci2 at ppb0 bus 2
pci2: i/o space, memory space enabled, rd/line, wr/inv ok
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
Kernelized RAIDframe activated
stray isa irq 14
atapibus0 at atabus0: 2 targets
scsibus0: waiting 2 seconds for devices to settle...
cd0 at atapibus0 drive 1: <Compaq  CRD-8322B, 1999/02/11, 1.07> cdrom removable
scsibus1: waiting 2 seconds for devices to settle...
cd0: 32-bit data port
cd0: drive supports PIO mode 4, DMA mode 2
cd0(cypide0:0:1): using PIO mode 4, DMA mode 2 (using DMA data transfers)
sd0 at scsibus0 target 0 lun 0: <COMPAQ, BD009222C7, B016> disk fixed
sd0: 8678 MB, 5273 cyl, 20 head, 168 sec, 512 bytes/sect x 17773524 sectors
sd0: sync (25.00ns offset 31), 16-bit (80.000MB/s) transfers, tagged queueing
sd1 at scsibus0 target 1 lun 0: <DEC, RZ2ED-KS (C) DEC, 0306> disk fixed
sd1: 17365 MB, 7001 cyl, 20 head, 254 sec, 512 bytes/sect x 35565080 sectors
sd1: sync (25.00ns offset 15), 16-bit (80.000MB/s) transfers, tagged queueing
root on sd0a dumps on sd0b
root file system type: ffs


The probe error messages come out of scsipi_base.c, in code that
doesn't appear to have changed since 2.0, so presumably the 3.0 kernel
is doing something different that causes the error state to arise.  I
do notice that the 2.0 kernel reported a "esiop" device where the 3.0
install kernel reported "siop" (and appears not to have esiop listed
in the kernel config file), but the comments in the two drivers look
like they're for the same cards.  (I also notice that the "waiting 2
seconds" messages are ordered differently with respect to the cd0
detection; I don't know whether that's relevant, since cd0 is an IDE
device.)

>How-To-Repeat:
	Try updating my xp1000...
>Fix:
	?

	Would it be worth jumping through all the hoops of updating
the parts of my system needed to properly build a 3.0 kernel with an
esiop driver (and maybe not siop?), to see if that fares any better?

>Audit-Trail:
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
	netbsd-bugs@NetBSD.org
Subject: Re: kern/32717: alpha 3.0 install kernel doesn't see scsi disks
Date: Fri, 3 Feb 2006 19:39:26 +0100

 On Fri, Feb 03, 2006 at 08:35:00AM +0000, Ken Raeburn wrote:
 > [...]
 > 	?
 > 
 > 	Would it be worth jumping through all the hoops of updating
 > the parts of my system needed to properly build a 3.0 kernel with an
 > esiop driver (and maybe not siop?), to see if that fares any better?

 You could try booting a 3.0 GENERIC kernel on your 2.0 system (this should
 work), and see if it works better. It's possible that siop(4) has not been
 tested for a long time on newer NCR/symbios cards as most kernels also have
 esiop(4).

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Ken Raeburn <raeburn@MIT.EDU>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/32717: alpha 3.0 install kernel doesn't see scsi disks
Date: Sun, 5 Feb 2006 15:10:25 -0500

 On Feb 3, 2006, at 13:40, Manuel Bouyer wrote:
 >> 	Would it be worth jumping through all the hoops of updating
 >> the parts of my system needed to properly build a 3.0 kernel with an
 >> esiop driver (and maybe not siop?), to see if that fares any better?
 >
 >  You could try booting a 3.0 GENERIC kernel on your 2.0 system  
 > (this should
 >  work), and see if it works better. It's possible that siop(4) has  
 > not been
 >  tested for a long time on newer NCR/symbios cards as most kernels  
 > also have
 >  esiop(4).

 Ah, good idea, I should've thought of that. :-/

 I got netbsd-GENERIC for 3.0 from the ftp server and booted it ("boot  
 -file /netbsd-3.0-GENERIC dkb0").  The kernel used the esiop driver  
 for the SCSI controller, instead of siop, but still displayed the  
 same probe error messages (with "siop" replaced by "esiop"), and  
 failed to find sd0 or sd1.

 Ken

From: Ken Raeburn <raeburn@MIT.EDU>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/32717: alpha 3.0 install kernel doesn't see scsi disks
Date: Fri, 31 Mar 2006 21:15:32 -0500

 So, I got another machine (PWS500au, also with SCSI disks) installed  
 under 3.0 (no problems), and started building kernels.  Doing an  
 (approximately) binary search, I found a point where the kernel  
 source on the trunk stops working, showing the error I indicated in  
 my problem report ("request sense for a request sense ?").  With the  
 CVS sources from 9/17/2004 20:45Z, my XP1000 recognizes its disks.   
 With the CVS sources from 21:00Z, it reports an error.
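
 The approximate binary search over checkout dates described above can
 be sketched as follows.  This is only an illustrative Python model:
 is_bad is a hypothetical stand-in for "check out the tree as of this
 time, build a kernel, boot it on the affected machine, and look for
 the error", and the 15-minute resolution is an assumption chosen to
 match the 20:45Z/21:00Z window.

```python
from datetime import datetime, timedelta


def bisect_first_bad(good, bad, is_bad, resolution=timedelta(minutes=15)):
    """Narrow the interval (good, bad] until it is no wider than `resolution`.

    `good` must be a timestamp known to work and `bad` one known to fail.
    `is_bad(ts)` is a hypothetical callback that reports whether the tree
    as of `ts` shows the failure.
    """
    while bad - good > resolution:
        mid = good + (bad - good) / 2
        if is_bad(mid):
            bad = mid       # failure already present: look earlier
        else:
            good = mid      # still working: look later
    return good, bad
```

 Each round costs one kernel build and one boot, so halving the
 interval keeps the number of builds logarithmic in the length of the
 date range.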

 Only two files changed in this time interval: uvm/uvm_page.c and uvm/ 
 uvm_pglist.c, each changed in one line aside from CVS keywords.   
 Version 1.100 of uvm_page.c and version 1.32 of uvm_pglist.c have  
 this log message:

 date: 2004/09/17 20:46:03;  author: yamt;  state: Exp;  lines: +3 -3
 make free page queue filo rather than fifo.
 data in pages freed more recently are more likely on cpu cache.

 Updating to netbsd-3-0-RELEASE and reverting the change to  
 uvm_page.c, or to both files, gives me an INSTALL kernel that  
 recognizes the disks, and is able to come up to single-user mode once  
 I tell it to use sd0a for the root, and I can find and run ps, ls,  
 and reboot.  Reverting uvm_pglist.c only produces a kernel that shows  
 the failure I first reported.  On the netbsd-3-0 branch (as of about  
 20:10 US/Eastern) I get the same result -- the SCSI controller  
 reports errors using the current CVS version, but if I undo this  
 uvm_page.c change and make it a FIFO queue again, it's happy.
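
 To make the difference between the two queue disciplines concrete,
 here is a toy Python model.  It is not the uvm code: FreeList and
 allocation_order are hypothetical names, pages are plain integers,
 and the only thing modelled is whether a freed page goes to the head
 or the tail of the queue while allocation always takes from the head.

```python
from collections import deque


class FreeList:
    """Toy free-page queue; pages are just hypothetical page numbers."""

    def __init__(self, lifo):
        self.lifo = lifo
        self.pages = deque()

    def free(self, page):
        if self.lifo:
            self.pages.appendleft(page)  # head insertion: last freed, first reused
        else:
            self.pages.append(page)      # tail insertion: first freed, first reused

    def alloc(self):
        return self.pages.popleft()      # allocation always takes from the head


def allocation_order(lifo, pages):
    """Free `pages` in the given order, then allocate them all again."""
    fl = FreeList(lifo)
    for page in pages:
        fl.free(page)
    return [fl.alloc() for _ in pages]
```

 In this model, if pages are first freed in ascending order (an
 assumption about boot-time behaviour), tail insertion hands them back
 lowest first and head insertion hands them back highest first, so the
 two variants touch very different pages early on.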

 I have not yet tried building install media with the patch to  
 uvm_page.c.

 The uvm_page.c change itself seems logical.  Assuming it's actually  
 correct, I would guess that my problem means that some page is being  
 put onto the free list, and probably allocated again, while some  
 other part of the kernel (or a DMA device) isn't done with it yet,  
 and the FIFO version of the queue happens to give the extra time  
 needed.  Or maybe it's bad memory and during the boot process one  
 pattern of usage trips over it consistently and the other pattern (as  
 well as running NetBSD 2.0 and doing nightly builds of some code I  
 work on) does not in any noticeable way.  But I think I'm done for  
 tonight....

 Ken

From: YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, raeburn@MIT.EDU
Subject: Re: kern/32717: alpha 3.0 install kernel doesn't see scsi disks
Date: Thu, 06 Apr 2006 11:41:24 +0900

 hi,

 >  Only two files changed in this time interval: uvm/uvm_page.c and uvm/ 
 >  uvm_pglist.c, each changed in one line aside from CVS keywords.   
 >  Version 1.100 of uvm_page.c and version 1.32 of uvm_pglist.c have  
 >  this log message:
 >  
 >  date: 2004/09/17 20:46:03;  author: yamt;  state: Exp;  lines: +3 -3
 >  make free page queue filo rather than fifo.
 >  data in pages freed more recently are more likely on cpu cache.

 I don't see how this change by itself could cause the symptom.
 I think it exposed a bug somewhere else, as you said.

 >  The uvm_page.c change itself seems logical.  Assuming it's actually  
 >  correct, I would guess that my problem means that some page is being  
 >  put onto the free list, and probably allocated again, while some  
 >  other part of the kernel (or a DMA device) isn't done with it yet,  
 >  and the FIFO version of the queue happens to give the extra time  
 >  needed.  Or maybe it's bad memory and during the boot process one  
 >  pattern of usage trips over it consistently and the other pattern (as  
 >  well as running NetBSD 2.0 and doing nightly builds of some code I  
 >  work on) does not in any noticeable way.  But I think I'm done for  
 >  tonight....

 Because uvm_page_physload also uses uvm_pagefree, the boot process
 likely uses a very different set of pages after the change.

 YAMAMOTO Takashi

From: Ken Raeburn <raeburn@MIT.EDU>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/32717: alpha 3.0 install kernel doesn't see scsi disks
Date: Tue, 16 May 2006 23:33:43 -0400

 So, I've been stumbling around trying to figure out how to debug this  
 one.

 I tried tweaking uvm_pagefree to use insert-at-head initially, then  
 after N calls, switch to inserting at the tail.  I found that for  
 some values of N the system would boot okay, and for others it  
 wouldn't.  I also tried switching back to head insertions after some  
 other number of calls was reached.  At this point, it appears that if  
 I use tail insertions for calls 128712 through 160000, it works; for
 130000:160000, it reports SCSI errors, as I originally reported; for  
 128716:160000, and at least up through 128500:160000, it recognizes  
 the disks but not the disk label.
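
 The experiment above (head insertion by default, tail insertion only
 for a window of calls) can be modelled like this.  It is a sketch in
 Python, not the actual kernel hack: build_free_list is a hypothetical
 name, calls are counted from 1, and treating both ends of the window
 as inclusive is an assumption.

```python
from collections import deque


def build_free_list(pages, tail_start, tail_end):
    """Free `pages` in order and return the resulting queue, head first.

    Calls numbered tail_start..tail_end (inclusive, counted from 1) insert
    at the tail; all other calls insert at the head.
    """
    free = deque()
    for call_number, page in enumerate(pages, start=1):
        if tail_start <= call_number <= tail_end:
            free.append(page)       # inside the window: FIFO behaviour
        else:
            free.appendleft(page)   # outside the window: LIFO behaviour
    return list(free)
```

 Moving the start of the window by a few calls changes which pages end
 up near the head of the queue, and therefore which pages the next
 allocations receive.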

 I also hacked uvm_pagefree to scribble the pattern 7d,5d over the  
 page (and clear PG_ZERO) before putting it on the free list.  I also  
 enabled DEBUG and DIAGNOSTIC, but they don't seem to have found  
 anything interesting.
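
 The scribbling trick can be illustrated with a small Python model.
 Reading "7d,5d" as the alternating bytes 0x7d and 0x5d is an
 assumption, poison_page and still_poisoned are hypothetical names,
 and the 8192-byte page size is taken from the boot messages.

```python
PAGE_SIZE = 8192                 # "8192 byte page size" in the boot log
POISON = bytes([0x7D, 0x5D])     # assumed meaning of the "7d,5d" pattern


def poison_page(page):
    """Overwrite a whole page (a bytearray of even length) with the pattern."""
    page[:] = POISON * (len(page) // 2)


def still_poisoned(page, offset, length):
    """True if page[offset:offset+length] still holds only the pattern,
    i.e. nothing has written to that region since the page was freed."""
    return all(page[i] == POISON[i % 2] for i in range(offset, offset + length))
```

 A buffer that should have been filled by a device but still reads
 back as pure pattern suggests the data never reached that page.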

 When I start the tail insertions at 128712, the storage for the
 first disk label is at kernel address 0xfffffc003fffc040 and gets
 filled with a reasonable disk label.  When I start the tail
 insertions at 128716, the disk label is supposed to be at
 0xfffffc0040004040 and is filled with the 7d,5d pattern I used.  The
 low 32 bits of that address are right after the 1G mark.
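
 The arithmetic behind that observation can be checked with a few
 lines of Python.  The helper name describe is hypothetical, the page
 size comes from the boot messages, and the low 32 bits are treated as
 an offset only because that is how the addresses are compared here.

```python
ONE_GB = 1 << 30
PAGE_SIZE = 8192  # "8192 byte page size" in the boot log


def describe(addr):
    """Return (low 32 bits, page base, pages relative to the 1G mark)."""
    low = addr & 0xFFFFFFFF
    page_base = low & ~(PAGE_SIZE - 1)
    pages_from_1g = (page_base - ONE_GB) // PAGE_SIZE
    return low, page_base, pages_from_1g


working = describe(0xFFFFFC003FFFC040)   # label buffer that was filled in
failing = describe(0xFFFFFC0040004040)   # label buffer left as 7d,5d
```

 The working buffer sits in the page two pages below the 1G mark and
 the failing one in the page two pages above it, so the two cases
 differ by four pages and straddle the boundary.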

 This machine has 1280M of memory, 256M in bank A and 1024M in bank  
 B.  I suppose it's possible that some of the memory chips are bad in  
 a way that doesn't show up writing 7d,5d from the kernel but causes  
 writes from the SCSI controller to fail completely, consistently, and  
 quietly; I haven't figured out how to run the console memory tester  
 yet.  But could there be a problem in telling the PCI SCSI controller  
 how to access some of the memory?

 I'm also looking at possible hardware issues.  Pulling the 1G memory  
 makes everything work fine, so it appears that the 256M is not bad,  
 or at least not completely broken; running with just the 1G in bank A  
 also works fine.  Swapping the banks  (A=1024, B=256) leaves it  
 breaking in the same way as it does now.  Reseating the SCSI  
 controller card also makes no difference.

 With just the 1G bank installed, I can boot and run the install CD.   
 I assume once it finishes, I'll either have to install a custom  
 kernel using tail insertion (which would worry me, since I don't know  
 what the actual problem is or whether it might bite me in some other  
 way), or keep running with an empty memory bank when I seem to have  
 two banks' worth of working memory...  Any other suggestions for
 things I can try?

 Ken
