NetBSD Problem Report #29936
From woods@building.weird.com Sun Apr 10 19:35:45 2005
Return-Path: <woods@building.weird.com>
Received: from building.weird.com (building.weird.com [204.92.254.24])
by narn.netbsd.org (Postfix) with ESMTP id E2FD863B116
for <gnats-bugs@gnats.netbsd.org>; Sun, 10 Apr 2005 19:35:44 +0000 (UTC)
Message-Id: <m1DKiDg-0024fjC@building.weird.com>
Date: Sun, 10 Apr 2005 15:35:44 -0400 (EDT)
From: "Greg A. Woods" <woods@planix.com>
Reply-To: "Greg A. Woods" <woods@planix.com>
To: gnats-bugs@netbsd.org
Subject: isp(4) with Qlogic 2312 FC HBA hangs with: "unable to load DMA (35)"
X-Send-Pr-Version: 3.95
>Number: 29936
>Category: kern
>Synopsis: isp(4) with Qlogic 2312 FC HBA hangs with: "unable to load DMA (35)"
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sun Apr 10 19:36:00 +0000 2005
>Last-Modified: Sun Feb 26 06:31:14 +0000 2012
>Originator: Greg A. Woods
>Release: NetBSD 1.6.2_STABLE
>Organization:
Planix, Inc.; Toronto, Ontario; Canada
>Environment:
System: NetBSD 1.6.2_STABLE
Architecture: alpha
Machine: alpha
isp(4) from -current:
ic/isp.c:
$NetBSD: isp.c,v 1.106 2005/02/27 00:27:01 perry Exp $
$NetBSD: isp.c,v 1.106 2005/02/27 00:27:01 perry Exp $
ic/isp_inline.h:
$NetBSD: isp_inline.h,v 1.25 2005/02/27 00:27:01 perry Exp $
ic/isp_ioctl.h:
$NetBSD: isp_ioctl.h,v 1.6 2005/02/27 00:27:01 perry Exp $
ic/isp_netbsd.c:
$NetBSD: isp_netbsd.c,v 1.65 2005/02/27 00:27:01 perry Exp $
$NetBSD: isp_netbsd.c,v 1.65 2005/02/27 00:27:01 perry Exp $
ic/isp_netbsd.h:
$NetBSD: isp_netbsd.h,v 1.53 2005/02/27 00:27:01 perry Exp $
ic/isp_target.c:
$NetBSD: isp_target.c,v 1.27 2005/02/27 00:27:01 perry Exp $
$NetBSD: isp_target.c,v 1.27 2005/02/27 00:27:01 perry Exp $
ic/isp_target.h:
$NetBSD: isp_target.h,v 1.21 2003/12/04 13:57:30 keihan Exp $
ic/isp_tpublic.h:
$NetBSD: isp_tpublic.h,v 1.13 2005/02/27 00:27:01 perry Exp $
ic/ispmbox.h:
$NetBSD: ispmbox.h,v 1.48 2005/02/27 00:27:01 perry Exp $
ic/ispreg.h:
$NetBSD: ispreg.h,v 1.29 2003/12/04 13:57:30 keihan Exp $
ic/ispvar.h:
$NetBSD: ispvar.h,v 1.62 2005/02/27 00:27:01 perry Exp $
pci/isp_pci.c:
$NetBSD: isp_pci.c,v 1.92 2005/02/27 00:27:33 perry Exp $
$NetBSD: isp_pci.c,v 1.92 2005/02/27 00:27:33 perry Exp $
sbus/isp_sbus.c:
$NetBSD: isp_sbus.c,v 1.53 2002/05/18 00:48:11 mjacob Exp $
$NetBSD: isp_sbus.c,v 1.53 2002/05/18 00:48:11 mjacob Exp $
microcode/isp/asm_1040.h:
$NetBSD: asm_1040.h,v 1.3 2005/02/27 00:27:23 perry Exp $
microcode/isp/asm_1080.h:
$NetBSD: asm_1080.h,v 1.3 2005/02/27 00:27:23 perry Exp $
microcode/isp/asm_12160.h:
$NetBSD: asm_12160.h,v 1.5 2005/02/27 00:27:23 perry Exp $
microcode/isp/asm_2100.h:
$NetBSD: asm_2100.h,v 1.6 2005/02/27 00:27:23 perry Exp $
microcode/isp/asm_2200.h:
$NetBSD: asm_2200.h,v 1.6 2005/02/27 00:27:24 perry Exp $
microcode/isp/asm_2300.h:
$NetBSD: asm_2300.h,v 1.6 2005/02/27 00:27:24 perry Exp $
microcode/isp/asm_sbus.h:
$NetBSD: asm_sbus.h,v 1.18 2005/02/27 00:27:24 perry Exp $
microcode/isp/isp_1000.bin:
ident warning: no id keywords in microcode/isp/isp_1000.bin
microcode/isp/isp_1040.bin:
ident warning: no id keywords in microcode/isp/isp_1040.bin
microcode/isp/isp_1080.bin:
ident warning: no id keywords in microcode/isp/isp_1080.bin
microcode/isp/isp_12160.bin:
ident warning: no id keywords in microcode/isp/isp_12160.bin
microcode/isp/isp_2100.bin:
ident warning: no id keywords in microcode/isp/isp_2100.bin
microcode/isp/isp_2200.bin:
ident warning: no id keywords in microcode/isp/isp_2200.bin
microcode/isp/isp_2300.bin:
ident warning: no id keywords in microcode/isp/isp_2300.bin
>Description:
What a way to spoil a beautiful bright sunny spring weekend.
A filesystem hangs bringing the system into a state of very
reduced functionality and the console says:
isp1: unable to load DMA (35)
sd6(isp1:0:1:0): adapter resource shortage
This is from an alphaserver es40 with twin Qlogic ISP 2312 HBAs
connected to an Apple Xserve RAID array.
The isp(4) driver is from -current a few weeks ago, but the rest
of the kernel is essentially from the head of the netbsd-1-6
branch.
This also seems to affect the adaptec card to which the root
disks are attached as "root" cannot login on the console, though
existing processes continue to run (e.g. shells and sshds), as
does most of Cyrus IMAP thus allowing users to continue to fetch
mail (through the other still running isp(4) controller). Only
/home and /var/log, on sd6, are known to be frozen, though this
doesn't really explain why root can't login on the console
(unless the attempt to write the login record to /var/log/wtmp
is what's stopping that).
A manual halt doesn't seem to get anywhere:
RMC>halt in
Returning to COM port
halted CPU 0
CPU 1 is not halted
CPU 2 is not halted
CPU 3 is not halted
halt code = 1
operator initiated halt
PC = fffffc0000300750
P00>>>cont
continuing CPU 0
CP - RESTORE_TERM routine to be called
panic: user requested console halt
Begin traceback...
alpha trace requires known PC =eject=
End traceback...
syncing disks...
CPU 0: fatal kernel trap:
CPU 0 trap entry = 0x2 (memory management fault)
CPU 0 a0 = 0x1a4
CPU 0 a1 = 0x1
CPU 0 a2 = 0x0
CPU 0 pc = 0xfffffc000042ccac
CPU 0 ra = 0xfffffc00003a5a18
CPU 0 pv = 0xfffffc000042c9e0
CPU 0 curproc = 0x0
panic: trap
Begin traceback...
alpha trace requires known PC =eject=
End traceback...
cpu3: shutting down...
cpu2: shutting down...
cpu1: shutting down...
CPU 0: fatal kernel trap:
CPU 0 trap entry = 0x2 (memory management fault)
CPU 0 a0 = 0x1a4
CPU 0 a1 = 0x1
CPU 0 a2 = 0x0
CPU 0 pc = 0xfffffc00004a4764
CPU 0 ra = 0xfffffc00004a65b0
CPU 0 pv = 0xfffffc00004a4700
CPU 0 curproc = 0x0
panic: trap
Begin traceback...
alpha trace requires known PC =eject=
End traceback...
The machine is still hung at this point and another toggle (and
a half) of the virtual halt button is necessary before I can
force it to boot from the SRM firmware.
The system has run for months in test with no problem like this,
sometimes moving several gigabytes of data between the RAID
partitions. It then ran for almost five days in production as a
mail and web server before suddenly hanging like this during a
relatively quite period of activity late on Friday afternoon.
It ran another day and a half and now sometime this (Sunday)
morning has done it a second time; GRRR, and while I type this
something similar seems to be happenning though this time
there's no response from any running process, no ping response,
no console response. Time for a cold boot. OK, back up and
running but still re-sync'ing the root mirror....
Could this be a driver bug, an Xserve RAID bug, a hardware
problem, or what?
NetBSD 1.6.2_STABLE (TSUNAMI.MP) #21: Sat Apr 2 06:35:47 EST 2005
woods@whats:/build/woods/whats/NetBSD-1.6.x-alpha-alpha-21164a-obj/building/work/woods/m-NetBSD-1.6/sys/arch/alpha/compile/TSUNAMI.MP
AlphaServer ES40, 666MHz, s/n NI94900217
8192 byte page size, 4 processors.
total memory = 16384 MB
(7080 KB reserved for PROM, 16377 MB used by NetBSD)
avail memory = 14236 MB
using 16384 buffers containing 1637 MB of memory
mainbus0 (root)
cpu0 at mainbus0: ID 0 (primary), 21264A-14
cpu0: Architecture extensions: 307<PAT,MVI,CIX,FIX,BWX>
cpu1 at mainbus0: ID 1, 21264A-14
cpu1: Architecture extensions: 307<PAT,MVI,CIX,FIX,BWX>
cpu2 at mainbus0: ID 2, 21264A-14
cpu2: Architecture extensions: 307<PAT,MVI,CIX,FIX,BWX>
cpu3 at mainbus0: ID 3, 21264A-14
cpu3: Architecture extensions: 307<PAT,MVI,CIX,FIX,BWX>
tsc0 at mainbus0: 21272 Core Logic Chipset, Cchip rev 0
tsc0: 8 Dchips, 2 memory buses of 32 bytes
tsc0: arrays present: 4096MB (split), 4096MB (split), 4096MB (split), 4096MB (split), Dchip 0 rev 1
tsp0 at tsc0
pci0 at tsp0 bus 0
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
vga0 at pci0 dev 1 function 0: ATI Technologies 3D Rage II+ (rev. 0x9a)
pci_mem_find: void region
pci_mem_find: void region
pci_mem_find: void region
wsdisplay0 at vga0 (kbdmux ignored)
ahc0 at pci0 dev 2 function 0
ahc0: interrupting at dec 6600 irq 12
ahc0: aic7899: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
scsibus0 at ahc0: 16 targets, 8 luns per target
ahc1 at pci0 dev 2 function 1
ahc1: interrupting at dec 6600 irq 13
ahc1: aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs
scsibus1 at ahc1: 16 targets, 8 luns per target
isp0 at pci0 dev 3 function 0: QLogic Dual Port FC-AL and 2Gbps Fabric HBA
isp0: interrupting at dec 6600 irq 16
isp0: bad execution throttle of 0- using 16
scsibus2 at isp0: 256 targets, 8 luns per target
tlp0 at pci0 dev 4 function 0: DECchip 21143 Ethernet, pass 3.0
tlp0: interrupting at dec 6600 irq 20
tlp0: DEC DE500-BA, Ethernet address 08:00:2b:c4:b5:26
tlp0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
sio0 at pci0 dev 7 function 0: Acer Labs M1543 PCI-ISA Bridge (rev. 0xc3)
Acer Labs M5229 UDMA IDE Controller (IDE mass storage, interface 0xfa, revision 0xc1) at pci0 dev 15 function 0 not configured
Acer Labs M5237 USB 1.1 Host Controller (USB serial bus, interface 0x10, revision 0x03) at pci0 dev 19 function 0 not configured
isa0 at sio0
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com0: console
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
pcppi0 at isa0 port 0x61
midi0 at pcppi0: PC speaker
spkr0 at pcppi0
isabeep0 at pcppi0
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
mcclock0 at isa0 port 0x70-0x71: mc146818 or compatible
tsp1 at tsc0
pci1 at tsp1 bus 0
pci1: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
siop0 at pci1 dev 2 function 0: Symbios Logic 53c895 (ultra2-wide scsi)
siop0: using on-board RAM
siop0: interrupting at dec 6600 irq 28
scsibus3 at siop0: 16 targets, 8 luns per target
wm0 at pci1 dev 3 function 0: Intel i82546GB 1000BASE-X Ethernet, rev. 3
wm0: interrupting at dec 6600 irq 32
wm0: Ethernet address 00:04:23:a8:79:28
wm0: 1000baseSX, 1000baseSX-FDX, auto
wm1 at pci1 dev 3 function 1: Intel i82546GB 1000BASE-X Ethernet, rev. 3
wm1: interrupting at dec 6600 irq 33
wm1: Ethernet address 00:04:23:a8:79:28
wm1: 1000baseSX, 1000baseSX-FDX, auto
isp1 at pci1 dev 4 function 0: QLogic Dual Port FC-AL and 2Gbps Fabric HBA
isp1: interrupting at dec 6600 irq 36
isp1: bad execution throttle of 0- using 16
scsibus4 at isp1: 256 targets, 8 luns per target
tlp1 at pci1 dev 5 function 0: DECchip 21143 Ethernet, pass 3.0
tlp1: interrupting at dec 6600 irq 40
tlp1: DEC DE500-BA, Ethernet address 08:00:2b:c4:7a:70
tlp1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
bge0 at pci1 dev 6 function 0: Broadcom BCM5703X Gigabit Ethernet
bge0: interrupting at dec 6600 irq 44
bge0: ASIC BCM5703 A2, Ethernet address 00:08:02:91:89:ae
brgphy0 at bge0 phy 1: BCM5703 1000BASE-T media interface, rev. 2
brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
scsibus0: waiting 2 seconds for devices to settle...
cd0 at scsibus0 target 4 lun 0: <TOSHIBA, CD-ROM XM-5701TA, 0557> SCSI2 5/cdrom removable
cd0: sync (100.0ns offset 8), 8-bit (10.000MB/s) transfers
scsibus1: waiting 2 seconds for devices to settle...
sd0 at scsibus1 target 0 lun 0: <COMPAQ, BF03685A35, HPB7> SCSI3 0/direct fixed
sd0: 34732 MB, 31310 cyl, 4 head, 567 sec, 512 bytes/sect x 71132000 sectors
sd0: sync (12.5ns offset 63), 16-bit (160.000MB/s) transfers, tagged queueing
sd1 at scsibus1 target 1 lun 0: <COMPAQ, BF03685A35, HPB7> SCSI3 0/direct fixed
sd1: 34732 MB, 31310 cyl, 4 head, 567 sec, 512 bytes/sect x 71132000 sectors
sd1: sync (12.5ns offset 63), 16-bit (160.000MB/s) transfers, tagged queueing
sd2 at scsibus1 target 2 lun 0: <COMPAQ, BF03685A35, HPB7> SCSI3 0/direct fixed
sd2: 34732 MB, 31310 cyl, 4 head, 567 sec, 512 bytes/sect x 71132000 sectors
sd2: sync (12.5ns offset 63), 16-bit (160.000MB/s) transfers, tagged queueing
sd3 at scsibus1 target 4 lun 0: <COMPAQ, BF01864663, 3B07> SCSI2 0/direct fixed
sd3: 17365 MB, 7001 cyl, 20 head, 254 sec, 512 bytes/sect x 35565080 sectors
sd3: sync (25.0ns offset 63), 16-bit (80.000MB/s) transfers, tagged queueing
sd4 at scsibus1 target 5 lun 0: <COMPAQ, BF01864663, 3B07> SCSI2 0/direct fixed
sd4: 17365 MB, 7001 cyl, 20 head, 254 sec, 512 bytes/sect x 35565080 sectors
sd4: sync (25.0ns offset 63), 16-bit (80.000MB/s) transfers, tagged queueing
scsibus2: waiting 2 seconds for devices to settle...
sd5 at scsibus2 target 1 lun 0: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed
sd5: 1168 GB, 149605 cyl, 128 head, 128 sec, 512 bytes/sect x 2451128320 sectors
scsibus2 target 1 lun 1: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 2: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 3: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 4: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 5: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 6: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 7: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus3: waiting 2 seconds for devices to settle...
scsibus4: waiting 2 seconds for devices to settle...
sd6 at scsibus4 target 1 lun 0: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed
sd6: 701 GB, 89763 cyl, 128 head, 128 sec, 512 bytes/sect x 1470676992 sectors
sd7 at scsibus4 target 1 lun 1: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed
sd7: 701 GB, 89763 cyl, 128 head, 128 sec, 512 bytes/sect x 1470676992 sectors
scsibus4 target 1 lun 2: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus4 target 1 lun 3: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus4 target 1 lun 4: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus4 target 1 lun 5: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus4 target 1 lun 6: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus4 target 1 lun 7: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
Kernelized RAIDframe activated
RAIDframe: Searching for RAID components...
RAIDframe: Component on: sd0a: 71132000
Row: 0 Column: 1 Num Rows: 1 Num Columns: 2
Version: 2 Serial Number: 1412893 Mod Counter: 232
Clean: No Status: 0
sectPerSU: 128 SUsPerPU: 1 SUsPerRU: 1
RAID Level: 1 blocksize: 512 numBlocks: 71131904
Autoconfig: Yes
Contains root partition: Yes
Last configured as: raid0
RAIDframe: Component on: sd1a: 71132000
Row: 0 Column: 0 Num Rows: 1 Num Columns: 2
Version: 2 Serial Number: 1412893 Mod Counter: 232
Clean: No Status: 0
sectPerSU: 128 SUsPerPU: 1 SUsPerRU: 1
RAID Level: 1 blocksize: 512 numBlocks: 71131904
Autoconfig: Yes
Contains root partition: Yes
Last configured as: raid0
RAIDframe: the sd device sd2a has invalid RAID component label
RAIDframe: attempting to find a root-enabled RAID set...
RAIDframe: Found: sd1a at 0
RAIDframe: Found: sd0a at 1
RAIDframe autoconfigure
RAIDframe: Configuring raid0:
raid0: RAID Level 1
raid0: Components: /dev/sd1a /dev/sd0a
raid0: Total Sectors: 71131904 (34732 MB)
root on raid0a dumps on raid0b
Note that there was a very large mess to clean up in all the
filesystems after the first hang as they were all mounted with
softdep.
Softdep is definitely _not_ safe for production if there's any
risk whatsoever of any crash. All recently created files were
moved to lost+found directories and with something like a busy
Cyrus IMAP server that's as good as losing them since their
positions in mailbox indexes cannot be recovered.
Softdep is no longer in use and will be removed from my
production kernel for the forseable future.
Luckily the Xserve RAID has its full 512MB of cache RAM and
everything's on a good solid UPS with a generator behind it, so
the performance reduction will not be so perceivable.
>How-To-Repeat:
>Fix:
maybe I should at least turn that printf() into a panic()?
(but then it will certainly just hang at syncing disks...)
--
My first fortune-of-the-day after the second reboot today:
Screw up your courage! You've screwed up everything else.
>Release-Note:
>Audit-Trail:
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
netbsd-bugs@NetBSD.org
Subject: Re: kern/29936: isp(4) with Qlogic 2312 FC HBA hangs with: "unable to load DMA (35)"
Date: Sun, 10 Apr 2005 22:42:12 +0200
On Sun, Apr 10, 2005 at 07:36:00PM +0000, Greg A. Woods wrote:
> >Number: 29936
> >Category: kern
> >Synopsis: isp(4) with Qlogic 2312 FC HBA hangs with: "unable to load DMA (35)"
> >Confidential: no
> >Severity: critical
> >Priority: high
> >Responsible: kern-bug-people
> >State: open
> >Class: sw-bug
> >Submitter-Id: net
> >Arrival-Date: Sun Apr 10 19:36:00 +0000 2005
> >Originator: Greg A. Woods
> >Release: NetBSD 1.6.2_STABLE
> >Organization:
> Planix, Inc.; Toronto, Ontario; Canada
> >Environment:
> System: NetBSD 1.6.2_STABLE
> Architecture: alpha
> Machine: alpha
>
> [...]
> >Description:
>
> What a way to spoil a beautiful bright sunny spring weekend.
>
> A filesystem hangs bringing the system into a state of very
> reduced functionality and the console says:
>
> isp1: unable to load DMA (35)
This is EAGAIN. My guess is that pci_sgmap_pte64_load() is in ressource
shortage.
> sd6(isp1:0:1:0): adapter resource shortage
the scsipi subsystem will sleep for one second and try again, 5 times.
What is strange is that you say other isp devices don't have this problem.
If there is ressource shortage it should be for everyone using this sgamap.
If I understood it properly, the sgamap is per-tsp bus, which means that
the ressource shortage is only for devices on the pci1 bus.
I see you have lots of network adapters on pci1; it's possible that their
drivers allocate DMA ressources statically, causing this condition.
You should try to arrange to have all network devices on one PCI bus,
and all scsi ones on the second PCI bus.
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: "Greg A. Woods" <woods@planix.com>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: NetBSD GNATS submissions and followups <gnats-bugs@netbsd.org>,
<kern-bug-people@NetBSD.org>,
NetBSD GNATS Administrator <gnats-admin@NetBSD.org>
Subject: Re: kern/29936: isp(4) with Qlogic 2312 FC HBA hangs with: "unable to load DMA (35)"
Date: Sun, 10 Apr 2005 19:59:57 -0400 (EDT)
[ On Sunday, April 10, 2005 at 22:42:12 (+0200), Manuel Bouyer wrote: ]
> Subject: Re: kern/29936: isp(4) with Qlogic 2312 FC HBA hangs with: "unable to load DMA (35)"
>
> > isp1: unable to load DMA (35)
>
> This is EAGAIN. My guess is that pci_sgmap_pte64_load() is in ressource
> shortage.
Indeed, but why is it so "fatal" -- a "shortage" is not an "outage" and
I wouldn't have thought it to be a permanent condition....
> > sd6(isp1:0:1:0): adapter resource shortage
>
> the scsipi subsystem will sleep for one second and try again, 5 times.
Which of course won't help if the "shortage" never goes away (in time?).
> What is strange is that you say other isp devices don't have this problem.
No, so far it hasn't, though I wasn't going to let a sample of 2 decide
that certain. :-)
> If there is ressource shortage it should be for everyone using this sgamap.
> If I understood it properly, the sgamap is per-tsp bus, which means that
> the ressource shortage is only for devices on the pci1 bus.
> I see you have lots of network adapters on pci1; it's possible that their
> drivers allocate DMA ressources statically, causing this condition.
> You should try to arrange to have all network devices on one PCI bus,
> and all scsi ones on the second PCI bus.
Well that's a very good clue! Thanks!
Indeed the bge0 device on pci1 (along with isp1) is not being used,
partly because it alone can trigger some very similar kind of problem
with DMA resources. Like I say it's unused, however I suppose there
could be some situation which might somehow trigger it and cause it to
try to allocate DMA buffers. As far as I know nobody has ifconfig'ed it
before either hang, but it's possible someone or something did something
to activate it. (However the third crash -- the one where everything
hung completely, was, perhaps not coincidentally, right after I had done
a "pcictl pci0 list" command to get the product code for the Qlogic
card.)
I had thought I had applied Jason's patches from the "bge(4) (DEGXA-TX)
no-go on the AlphaServer ES40" thread on tech-kern (& port-alpha) to the
1.6.x code too, but it seems I had not, so the 1.6.x version definitely
still causes problems on big memory machins.
I guess this still all boils down to needing a proper fix for PR# 28362
as well as complete support for 64-bit DMA so that mapping doesn't have
to be done for 64-bit cards on 64-bit systems like this.
In the mean time I will remove the bge driver from the kernel entirely
and hope that it was indeed the underlying cause.
However that still leaves wm0 (and the unused wm1) on pci1 along with
isp1. I'm not very comfortable with moving all the isp and ahc devices
to one bus just to put the network devices alone on the other, but I
suppose if that's what it takes.... I guess I won't know for sure
though if the bge removal fixes it until at least a couple of weeks go
by without further problems along these lines.
(note I cannot bring up wm1 concurrently with wm0 with this kernel -- I
encounter a similar DMA resource problem.... I'm not even sure it
worked with a -current kernel. I didn't want a dual-port card, but they
were the same price as the single, and a dual is of more use in other
kinds of machines if I can ever get the bge to work again, and if we
ever get a copper GigE port on the/a switch to connect it with, but in
the mean time even without the DMA resource issues, the bge driver still
only goes about half the speed of the wm driver. :-)
--
Greg A. Woods
H:+1 416 218-0098 W:+1 416 489-5852 x122 VE3TCP RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com> Secrets of the Weird <woods@weird.com>
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.