NetBSD Problem Report #29936

From woods@building.weird.com  Sun Apr 10 19:35:45 2005
Return-Path: <woods@building.weird.com>
Received: from building.weird.com (building.weird.com [204.92.254.24])
	by narn.netbsd.org (Postfix) with ESMTP id E2FD863B116
	for <gnats-bugs@gnats.netbsd.org>; Sun, 10 Apr 2005 19:35:44 +0000 (UTC)
Message-Id: <m1DKiDg-0024fjC@building.weird.com>
Date: Sun, 10 Apr 2005 15:35:44 -0400 (EDT)
From: "Greg A. Woods" <woods@planix.com>
Reply-To: "Greg A. Woods" <woods@planix.com>
To: gnats-bugs@netbsd.org
Subject: isp(4) with Qlogic 2312 FC HBA hangs with: "unable to load DMA (35)"
X-Send-Pr-Version: 3.95

>Number:         29936
>Category:       kern
>Synopsis:       isp(4) with Qlogic 2312 FC HBA hangs with: "unable to load DMA (35)"
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Apr 10 19:36:00 +0000 2005
>Last-Modified:  Sun Feb 26 06:31:14 +0000 2012
>Originator:     Greg A. Woods
>Release:        NetBSD 1.6.2_STABLE
>Organization:
Planix, Inc.; Toronto, Ontario; Canada
>Environment:
System: NetBSD 1.6.2_STABLE
Architecture: alpha
Machine: alpha

isp(4) from -current:

ic/isp.c:
     $NetBSD: isp.c,v 1.106 2005/02/27 00:27:01 perry Exp $
     $NetBSD: isp.c,v 1.106 2005/02/27 00:27:01 perry Exp $

ic/isp_inline.h:
     $NetBSD: isp_inline.h,v 1.25 2005/02/27 00:27:01 perry Exp $

ic/isp_ioctl.h:
     $NetBSD: isp_ioctl.h,v 1.6 2005/02/27 00:27:01 perry Exp $

ic/isp_netbsd.c:
     $NetBSD: isp_netbsd.c,v 1.65 2005/02/27 00:27:01 perry Exp $
     $NetBSD: isp_netbsd.c,v 1.65 2005/02/27 00:27:01 perry Exp $

ic/isp_netbsd.h:
     $NetBSD: isp_netbsd.h,v 1.53 2005/02/27 00:27:01 perry Exp $

ic/isp_target.c:
     $NetBSD: isp_target.c,v 1.27 2005/02/27 00:27:01 perry Exp $
     $NetBSD: isp_target.c,v 1.27 2005/02/27 00:27:01 perry Exp $

ic/isp_target.h:
     $NetBSD: isp_target.h,v 1.21 2003/12/04 13:57:30 keihan Exp $

ic/isp_tpublic.h:
     $NetBSD: isp_tpublic.h,v 1.13 2005/02/27 00:27:01 perry Exp $

ic/ispmbox.h:
     $NetBSD: ispmbox.h,v 1.48 2005/02/27 00:27:01 perry Exp $

ic/ispreg.h:
     $NetBSD: ispreg.h,v 1.29 2003/12/04 13:57:30 keihan Exp $

ic/ispvar.h:
     $NetBSD: ispvar.h,v 1.62 2005/02/27 00:27:01 perry Exp $

pci/isp_pci.c:
     $NetBSD: isp_pci.c,v 1.92 2005/02/27 00:27:33 perry Exp $
     $NetBSD: isp_pci.c,v 1.92 2005/02/27 00:27:33 perry Exp $

sbus/isp_sbus.c:
     $NetBSD: isp_sbus.c,v 1.53 2002/05/18 00:48:11 mjacob Exp $
     $NetBSD: isp_sbus.c,v 1.53 2002/05/18 00:48:11 mjacob Exp $

microcode/isp/asm_1040.h:
     $NetBSD: asm_1040.h,v 1.3 2005/02/27 00:27:23 perry Exp $

microcode/isp/asm_1080.h:
     $NetBSD: asm_1080.h,v 1.3 2005/02/27 00:27:23 perry Exp $

microcode/isp/asm_12160.h:
     $NetBSD: asm_12160.h,v 1.5 2005/02/27 00:27:23 perry Exp $

microcode/isp/asm_2100.h:
     $NetBSD: asm_2100.h,v 1.6 2005/02/27 00:27:23 perry Exp $

microcode/isp/asm_2200.h:
     $NetBSD: asm_2200.h,v 1.6 2005/02/27 00:27:24 perry Exp $

microcode/isp/asm_2300.h:
     $NetBSD: asm_2300.h,v 1.6 2005/02/27 00:27:24 perry Exp $

microcode/isp/asm_sbus.h:
     $NetBSD: asm_sbus.h,v 1.18 2005/02/27 00:27:24 perry Exp $

microcode/isp/isp_1000.bin:
ident warning: no id keywords in microcode/isp/isp_1000.bin

microcode/isp/isp_1040.bin:
ident warning: no id keywords in microcode/isp/isp_1040.bin

microcode/isp/isp_1080.bin:
ident warning: no id keywords in microcode/isp/isp_1080.bin

microcode/isp/isp_12160.bin:
ident warning: no id keywords in microcode/isp/isp_12160.bin

microcode/isp/isp_2100.bin:
ident warning: no id keywords in microcode/isp/isp_2100.bin

microcode/isp/isp_2200.bin:
ident warning: no id keywords in microcode/isp/isp_2200.bin

microcode/isp/isp_2300.bin:
ident warning: no id keywords in microcode/isp/isp_2300.bin


>Description:

	What a way to spoil a beautiful bright sunny spring weekend.

	A filesystem hangs bringing the system into a state of very
	reduced functionality and the console says:

	    isp1: unable to load DMA (35)
	    sd6(isp1:0:1:0): adapter resource shortage

	This is from an alphaserver es40 with twin Qlogic ISP 2312 HBAs
	connected to an Apple Xserve RAID array.

	The isp(4) driver is from -current a few weeks ago, but the rest
	of the kernel is essentially from the head of the netbsd-1-6
	branch.

	This also seems to affect the adaptec card to which the root
	disks are attached as "root" cannot login on the console, though
	existing processes continue to run (e.g. shells and sshds), as
	does most of Cyrus IMAP thus allowing users to continue to fetch
	mail (through the other still running isp(4) controller).  Only
	/home and /var/log, on sd6, are known to be frozen, though this
	doesn't really explain why root can't login on the console
	(unless the attempt to write the login record to /var/log/wtmp
	is what's stopping that).

	A manual halt doesn't seem to get anywhere:

	    RMC>halt in

	    Returning to COM port

	    halted CPU 0
	    CPU 1 is not halted
	    CPU 2 is not halted
	    CPU 3 is not halted

	    halt code = 1
	    operator initiated halt
	    PC = fffffc0000300750
	    P00>>>cont

	    continuing CPU 0
	    CP - RESTORE_TERM routine to be called
	    panic: user requested console halt
	    Begin traceback...
	    alpha trace requires known PC =eject=
	    End traceback...
	    syncing disks... 
	    CPU 0: fatal kernel trap:

	    CPU 0    trap entry = 0x2 (memory management fault)
	    CPU 0    a0         = 0x1a4
	    CPU 0    a1         = 0x1
	    CPU 0    a2         = 0x0
	    CPU 0    pc         = 0xfffffc000042ccac
	    CPU 0    ra         = 0xfffffc00003a5a18
	    CPU 0    pv         = 0xfffffc000042c9e0
	    CPU 0    curproc    = 0x0

	    panic: trap
	    Begin traceback...
	    alpha trace requires known PC =eject=
	    End traceback...
	    cpu3: shutting down...
	    cpu2: shutting down...
	    cpu1: shutting down...

	    CPU 0: fatal kernel trap:

	    CPU 0    trap entry = 0x2 (memory management fault)
	    CPU 0    a0         = 0x1a4
	    CPU 0    a1         = 0x1
	    CPU 0    a2         = 0x0
	    CPU 0    pc         = 0xfffffc00004a4764
	    CPU 0    ra         = 0xfffffc00004a65b0
	    CPU 0    pv         = 0xfffffc00004a4700
	    CPU 0    curproc    = 0x0

	    panic: trap
	    Begin traceback...
	    alpha trace requires known PC =eject=
	    End traceback...

	The machine is still hung at this point and another toggle (and
	a half) of the virtual halt button is necessary before I can
	force it to boot from the SRM firmware.

	The system has run for months in test with no problem like this,
	sometimes moving several gigabytes of data between the RAID
	partitions.  It then ran for almost five days in production as a
	mail and web server before suddenly hanging like this during a
	relatively quiet period of activity late on Friday afternoon.
	It ran another day and a half and now sometime this (Sunday)
	morning has done it a second time; GRRR, and while I type this
	something similar seems to be happening though this time
	there's no response from any running process, no ping response,
	no console response.  Time for a cold boot.  OK, back up and
	running but still re-sync'ing the root mirror....

	Could this be a driver bug, an Xserve RAID bug, a hardware
	problem, or what?

NetBSD 1.6.2_STABLE (TSUNAMI.MP) #21: Sat Apr  2 06:35:47 EST 2005
    woods@whats:/build/woods/whats/NetBSD-1.6.x-alpha-alpha-21164a-obj/building/work/woods/m-NetBSD-1.6/sys/arch/alpha/compile/TSUNAMI.MP
AlphaServer ES40, 666MHz, s/n NI94900217
8192 byte page size, 4 processors.
total memory = 16384 MB
(7080 KB reserved for PROM, 16377 MB used by NetBSD)
avail memory = 14236 MB
using 16384 buffers containing 1637 MB of memory
mainbus0 (root)
cpu0 at mainbus0: ID 0 (primary), 21264A-14
cpu0: Architecture extensions: 307<PAT,MVI,CIX,FIX,BWX>
cpu1 at mainbus0: ID 1, 21264A-14
cpu1: Architecture extensions: 307<PAT,MVI,CIX,FIX,BWX>
cpu2 at mainbus0: ID 2, 21264A-14
cpu2: Architecture extensions: 307<PAT,MVI,CIX,FIX,BWX>
cpu3 at mainbus0: ID 3, 21264A-14
cpu3: Architecture extensions: 307<PAT,MVI,CIX,FIX,BWX>
tsc0 at mainbus0: 21272 Core Logic Chipset, Cchip rev 0
tsc0: 8 Dchips, 2 memory buses of 32 bytes
tsc0: arrays present: 4096MB (split), 4096MB (split), 4096MB (split), 4096MB (split), Dchip 0 rev 1
tsp0 at tsc0
pci0 at tsp0 bus 0
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
vga0 at pci0 dev 1 function 0: ATI Technologies 3D Rage II+ (rev. 0x9a)
pci_mem_find: void region
pci_mem_find: void region
pci_mem_find: void region
wsdisplay0 at vga0 (kbdmux ignored)
ahc0 at pci0 dev 2 function 0
ahc0: interrupting at dec 6600 irq 12
ahc0: aic7899: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
scsibus0 at ahc0: 16 targets, 8 luns per target
ahc1 at pci0 dev 2 function 1
ahc1: interrupting at dec 6600 irq 13
ahc1: aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs
scsibus1 at ahc1: 16 targets, 8 luns per target
isp0 at pci0 dev 3 function 0: QLogic Dual Port FC-AL and 2Gbps Fabric HBA
isp0: interrupting at dec 6600 irq 16
isp0: bad execution throttle of 0- using 16
scsibus2 at isp0: 256 targets, 8 luns per target
tlp0 at pci0 dev 4 function 0: DECchip 21143 Ethernet, pass 3.0
tlp0: interrupting at dec 6600 irq 20
tlp0: DEC DE500-BA, Ethernet address 08:00:2b:c4:b5:26
tlp0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
sio0 at pci0 dev 7 function 0: Acer Labs M1543 PCI-ISA Bridge (rev. 0xc3)
Acer Labs M5229 UDMA IDE Controller (IDE mass storage, interface 0xfa, revision 0xc1) at pci0 dev 15 function 0 not configured
Acer Labs M5237 USB 1.1 Host Controller (USB serial bus, interface 0x10, revision 0x03) at pci0 dev 19 function 0 not configured
isa0 at sio0
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com0: console
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
pcppi0 at isa0 port 0x61
midi0 at pcppi0: PC speaker
spkr0 at pcppi0
isabeep0 at pcppi0
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
mcclock0 at isa0 port 0x70-0x71: mc146818 or compatible
tsp1 at tsc0
pci1 at tsp1 bus 0
pci1: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
siop0 at pci1 dev 2 function 0: Symbios Logic 53c895 (ultra2-wide scsi)
siop0: using on-board RAM
siop0: interrupting at dec 6600 irq 28
scsibus3 at siop0: 16 targets, 8 luns per target
wm0 at pci1 dev 3 function 0: Intel i82546GB 1000BASE-X Ethernet, rev. 3
wm0: interrupting at dec 6600 irq 32
wm0: Ethernet address 00:04:23:a8:79:28
wm0: 1000baseSX, 1000baseSX-FDX, auto
wm1 at pci1 dev 3 function 1: Intel i82546GB 1000BASE-X Ethernet, rev. 3
wm1: interrupting at dec 6600 irq 33
wm1: Ethernet address 00:04:23:a8:79:28
wm1: 1000baseSX, 1000baseSX-FDX, auto
isp1 at pci1 dev 4 function 0: QLogic Dual Port FC-AL and 2Gbps Fabric HBA
isp1: interrupting at dec 6600 irq 36
isp1: bad execution throttle of 0- using 16
scsibus4 at isp1: 256 targets, 8 luns per target
tlp1 at pci1 dev 5 function 0: DECchip 21143 Ethernet, pass 3.0
tlp1: interrupting at dec 6600 irq 40
tlp1: DEC DE500-BA, Ethernet address 08:00:2b:c4:7a:70
tlp1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
bge0 at pci1 dev 6 function 0: Broadcom BCM5703X Gigabit Ethernet
bge0: interrupting at dec 6600 irq 44
bge0: ASIC BCM5703 A2, Ethernet address 00:08:02:91:89:ae
brgphy0 at bge0 phy 1: BCM5703 1000BASE-T media interface, rev. 2
brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
scsibus0: waiting 2 seconds for devices to settle...
cd0 at scsibus0 target 4 lun 0: <TOSHIBA, CD-ROM XM-5701TA, 0557> SCSI2 5/cdrom removable
cd0: sync (100.0ns offset 8), 8-bit (10.000MB/s) transfers
scsibus1: waiting 2 seconds for devices to settle...
sd0 at scsibus1 target 0 lun 0: <COMPAQ, BF03685A35, HPB7> SCSI3 0/direct fixed
sd0: 34732 MB, 31310 cyl, 4 head, 567 sec, 512 bytes/sect x 71132000 sectors
sd0: sync (12.5ns offset 63), 16-bit (160.000MB/s) transfers, tagged queueing
sd1 at scsibus1 target 1 lun 0: <COMPAQ, BF03685A35, HPB7> SCSI3 0/direct fixed
sd1: 34732 MB, 31310 cyl, 4 head, 567 sec, 512 bytes/sect x 71132000 sectors
sd1: sync (12.5ns offset 63), 16-bit (160.000MB/s) transfers, tagged queueing
sd2 at scsibus1 target 2 lun 0: <COMPAQ, BF03685A35, HPB7> SCSI3 0/direct fixed
sd2: 34732 MB, 31310 cyl, 4 head, 567 sec, 512 bytes/sect x 71132000 sectors
sd2: sync (12.5ns offset 63), 16-bit (160.000MB/s) transfers, tagged queueing
sd3 at scsibus1 target 4 lun 0: <COMPAQ, BF01864663, 3B07> SCSI2 0/direct fixed
sd3: 17365 MB, 7001 cyl, 20 head, 254 sec, 512 bytes/sect x 35565080 sectors
sd3: sync (25.0ns offset 63), 16-bit (80.000MB/s) transfers, tagged queueing
sd4 at scsibus1 target 5 lun 0: <COMPAQ, BF01864663, 3B07> SCSI2 0/direct fixed
sd4: 17365 MB, 7001 cyl, 20 head, 254 sec, 512 bytes/sect x 35565080 sectors
sd4: sync (25.0ns offset 63), 16-bit (80.000MB/s) transfers, tagged queueing
scsibus2: waiting 2 seconds for devices to settle...
sd5 at scsibus2 target 1 lun 0: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed
sd5: 1168 GB, 149605 cyl, 128 head, 128 sec, 512 bytes/sect x 2451128320 sectors
scsibus2 target 1 lun 1: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 2: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 3: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 4: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 5: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 6: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus2 target 1 lun 7: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus3: waiting 2 seconds for devices to settle...
scsibus4: waiting 2 seconds for devices to settle...
sd6 at scsibus4 target 1 lun 0: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed
sd6: 701 GB, 89763 cyl, 128 head, 128 sec, 512 bytes/sect x 1470676992 sectors
sd7 at scsibus4 target 1 lun 1: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed
sd7: 701 GB, 89763 cyl, 128 head, 128 sec, 512 bytes/sect x 1470676992 sectors
scsibus4 target 1 lun 2: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus4 target 1 lun 3: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus4 target 1 lun 4: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus4 target 1 lun 5: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus4 target 1 lun 6: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
scsibus4 target 1 lun 7: <APPLE, Xserve RAID, 1.26> SCSI5 0/direct fixed offline not configured
Kernelized RAIDframe activated
RAIDframe: Searching for RAID components...
RAIDframe: Component on: sd0a: 71132000
   Row: 0 Column: 1 Num Rows: 1 Num Columns: 2
   Version: 2 Serial Number: 1412893 Mod Counter: 232
   Clean: No Status: 0
   sectPerSU: 128 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 1  blocksize: 512 numBlocks: 71131904
   Autoconfig: Yes
   Contains root partition: Yes
   Last configured as: raid0
RAIDframe: Component on: sd1a: 71132000
   Row: 0 Column: 0 Num Rows: 1 Num Columns: 2
   Version: 2 Serial Number: 1412893 Mod Counter: 232
   Clean: No Status: 0
   sectPerSU: 128 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 1  blocksize: 512 numBlocks: 71131904
   Autoconfig: Yes
   Contains root partition: Yes
   Last configured as: raid0
RAIDframe: the sd device sd2a has invalid RAID component label
RAIDframe: attempting to find a root-enabled RAID set...
RAIDframe: Found: sd1a at 0
RAIDframe: Found: sd0a at 1
RAIDframe autoconfigure
RAIDframe: Configuring raid0:
raid0: RAID Level 1
raid0: Components: /dev/sd1a /dev/sd0a
raid0: Total Sectors: 71131904 (34732 MB)
root on raid0a dumps on raid0b

	Note that there was a very large mess to clean up in all the
	filesystems after the first hang as they were all mounted with
	softdep.

	Softdep is definitely _not_ safe for production if there's any
	risk whatsoever of any crash.  All recently created files were
	moved to lost+found directories and with something like a busy
	Cyrus IMAP server that's as good as losing them since their
	positions in mailbox indexes cannot be recovered.

	Softdep is no longer in use and will be removed from my
	production kernel for the foreseeable future.

	Luckily the Xserve RAID has its full 512MB of cache RAM and
	everything's on a good solid UPS with a generator behind it, so
	the performance reduction will not be so perceivable.

>How-To-Repeat:

>Fix:

	maybe I should at least turn that printf() into a panic()?

	(but then it will certainly just hang at syncing disks...)

-- 
My first fortune-of-the-day after the second reboot today:

Screw up your courage!  You've screwed up everything else.

>Release-Note:

>Audit-Trail:
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
	netbsd-bugs@NetBSD.org
Subject: Re: kern/29936: isp(4) with Qlogic 2312 FC HBA hangs with: "unable to load DMA (35)"
Date: Sun, 10 Apr 2005 22:42:12 +0200

 On Sun, Apr 10, 2005 at 07:36:00PM +0000, Greg A. Woods wrote:
 > >Number:         29936
 > >Category:       kern
 > >Synopsis:       isp(4) with Qlogic 2312 FC HBA hangs with: "unable to load DMA (35)"
 > >Confidential:   no
 > >Severity:       critical
 > >Priority:       high
 > >Responsible:    kern-bug-people
 > >State:          open
 > >Class:          sw-bug
 > >Submitter-Id:   net
 > >Arrival-Date:   Sun Apr 10 19:36:00 +0000 2005
 > >Originator:     Greg A. Woods
 > >Release:        NetBSD 1.6.2_STABLE
 > >Organization:
 > Planix, Inc.; Toronto, Ontario; Canada
 > >Environment:
 > System: NetBSD 1.6.2_STABLE
 > Architecture: alpha
 > Machine: alpha
 > 
 > [...]
 > >Description:
 > 
 > 	What a way to spoil a beautiful bright sunny spring weekend.
 > 
 > 	A filesystem hangs bringing the system into a state of very
 > 	reduced functionality and the console says:
 > 
 > 	    isp1: unable to load DMA (35)

 This is EAGAIN. My guess is that pci_sgmap_pte64_load() is in resource
 shortage.
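
A small sketch of how the number in the console message decodes. The
values are the ones in NetBSD's <sys/errno.h>; they are hard-coded here
because other systems number errno differently. The helper names
nb_errno_name() and nb_errno_is() are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Error numbers as defined in NetBSD's <sys/errno.h>.  Hard-coded on
 * purpose: other systems assign different numbers to the same names. */
struct nb_errno {
	int code;
	const char *name;
	const char *text;
};

static const struct nb_errno nb_errno_tab[] = {
	{ 12, "ENOMEM", "Cannot allocate memory" },
	{ 22, "EINVAL", "Invalid argument" },
	{ 27, "EFBIG",  "File too large" },
	{ 35, "EAGAIN", "Resource temporarily unavailable" },
};

/* Hypothetical helper: map a NetBSD error number to its symbolic name. */
static const char *
nb_errno_name(int code)
{
	size_t i;

	assert(code >= 0);	/* errno values are never negative */
	for (i = 0; i < sizeof(nb_errno_tab) / sizeof(nb_errno_tab[0]); i++)
		if (nb_errno_tab[i].code == code)
			return nb_errno_tab[i].name;
	return "unknown";
}

/* Hypothetical helper: does this error number carry the given name? */
static int
nb_errno_is(int code, const char *name)
{
	return strcmp(nb_errno_name(code), name) == 0;
}
```

With this table, nb_errno_name(35) yields "EAGAIN", which matches the
"(35)" printed by isp1.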

 > 	    sd6(isp1:0:1:0): adapter resource shortage

 the scsipi subsystem will sleep for one second and try again, 5 times.
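
A sketch of the retry behaviour described above, read here as one
initial attempt plus up to five retries. That reading and every name
in the block are assumptions; this is not the scsipi code. The delay
is passed in as a callback so the one-second sleep can be supplied (or
omitted) by the caller.

```c
#include <assert.h>
#include <stddef.h>

#define NB_EAGAIN	35	/* EAGAIN as numbered by NetBSD */
#define MAX_RETRIES	5	/* assumed: retries after the first attempt */

typedef int (*load_fn)(void *ctx);

struct retry_result {
	int error;	/* 0 on success, else the last error returned */
	int attempts;	/* how many times load() was called */
};

/* Call load() until it stops returning EAGAIN or the retries run out. */
static struct retry_result
load_with_retry(load_fn load, void *ctx, void (*delay)(void))
{
	struct retry_result r = { 0, 0 };

	assert(load != NULL);
	for (;;) {
		r.error = load(ctx);
		r.attempts++;
		if (r.error != NB_EAGAIN || r.attempts > MAX_RETRIES)
			return r;
		if (delay != NULL)
			delay();	/* e.g. sleep for one second */
	}
}

/* Hypothetical adapter that reports a shortage a fixed number of times. */
struct fake_adapter {
	int failures_left;
};

static int
fake_load(void *ctx)
{
	struct fake_adapter *a = ctx;

	if (a->failures_left > 0) {
		a->failures_left--;
		return NB_EAGAIN;
	}
	return 0;
}
```

In this model a shortage that clears within the retry window is
invisible to the caller, while one that outlasts it surfaces as a hard
error after the sixth attempt.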

 What is strange is that you say other isp devices don't have this problem.
 If there is resource shortage it should be for everyone using this sgmap.
 If I understood it properly, the sgmap is per-tsp bus, which means that
 the resource shortage is only for devices on the pci1 bus.
 I see you have lots of network adapters on pci1; it's possible that their
 drivers allocate DMA resources statically, causing this condition.
 You should try to arrange to have all network devices on one PCI bus,
 and all scsi ones on the second PCI bus.
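
A toy model of the hypothesis above, assuming one scatter/gather map
pool per PCI bus. Pool sizes, request sizes and all names are invented
for illustration; this is not the alpha sgmap code.

```c
#include <assert.h>
#include <stddef.h>

#define NB_EAGAIN	35	/* EAGAIN as numbered by NetBSD */

/* One pool of mapping entries, shared by every device on a bus. */
struct sgmap_pool {
	unsigned total;	/* entries the bus has in all */
	unsigned used;	/* entries currently handed out */
};

/* Take n entries from the pool, or fail with EAGAIN if they don't fit. */
static int
sgmap_alloc(struct sgmap_pool *p, unsigned n)
{
	assert(p != NULL && p->used <= p->total);
	if (n > p->total - p->used)
		return NB_EAGAIN;
	p->used += n;
	return 0;
}

/* Give n entries back to the pool. */
static void
sgmap_free(struct sgmap_pool *p, unsigned n)
{
	assert(p != NULL && n <= p->used);
	p->used -= n;
}
```

If three network drivers on the same bus each hold 2 of 8 entries
permanently, a later request for 4 entries fails with EAGAIN on that
bus, while the same request on a second, otherwise empty pool of the
same size succeeds.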

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 years of experience will always make the difference
 --

From: "Greg A. Woods" <woods@planix.com>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: NetBSD GNATS submissions and followups <gnats-bugs@netbsd.org>,
	<kern-bug-people@NetBSD.org>,
	NetBSD GNATS Administrator <gnats-admin@NetBSD.org>
Subject: Re: kern/29936: isp(4) with Qlogic 2312 FC HBA hangs with: "unable to load DMA (35)"
Date: Sun, 10 Apr 2005 19:59:57 -0400 (EDT)

 [ On Sunday, April 10, 2005 at 22:42:12 (+0200), Manuel Bouyer wrote: ]
 > Subject: Re: kern/29936: isp(4) with Qlogic 2312 FC HBA hangs with: "unable to load DMA (35)"
 >
 > > 	    isp1: unable to load DMA (35)
 > 
 > This is EAGAIN. My guess is that pci_sgmap_pte64_load() is in resource
 > shortage.

 Indeed, but why is it so "fatal" -- a "shortage" is not an "outage" and
 I wouldn't have thought it to be a permanent condition....

 > > 	    sd6(isp1:0:1:0): adapter resource shortage
 > 
 > the scsipi subsystem will sleep for one second and try again, 5 times.

 Which of course won't help if the "shortage" never goes away (in time?).


 > What is strange is that you say other isp devices don't have this problem.

 No, so far it hasn't, though I wasn't going to let a sample of 2 decide
 that for certain.  :-)


 > If there is resource shortage it should be for everyone using this sgmap.
 > If I understood it properly, the sgmap is per-tsp bus, which means that
 > the resource shortage is only for devices on the pci1 bus.
 > I see you have lots of network adapters on pci1; it's possible that their
 > drivers allocate DMA resources statically, causing this condition.
 > You should try to arrange to have all network devices on one PCI bus,
 > and all scsi ones on the second PCI bus.

 Well that's a very good clue!  Thanks!

 Indeed the bge0 device on pci1 (along with isp1) is not being used,
 partly because it alone can trigger some very similar kind of problem
 with DMA resources.  Like I say it's unused, however I suppose there
 could be some situation which might somehow trigger it and cause it to
 try to allocate DMA buffers.  As far as I know nobody has ifconfig'ed it
 before either hang, but it's possible someone or something did something
 to activate it.  (However the third crash -- the one where everything
 hung completely, was, perhaps not coincidentally, right after I had done
 a "pcictl pci0 list" command to get the product code for the Qlogic
 card.)

 I had thought I had applied Jason's patches from the "bge(4) (DEGXA-TX)
 no-go on the AlphaServer ES40" thread on tech-kern (& port-alpha) to the
 1.6.x code too, but it seems I had not, so the 1.6.x version definitely
 still causes problems on big memory machines.

 I guess this still all boils down to needing a proper fix for PR# 28362
 as well as complete support for 64-bit DMA so that mapping doesn't have
 to be done for 64-bit cards on 64-bit systems like this.

 In the mean time I will remove the bge driver from the kernel entirely
 and hope that it was indeed the underlying cause.

 However that still leaves wm0 (and the unused wm1) on pci1 along with
 isp1.  I'm not very comfortable with moving all the isp and ahc devices
 to one bus just to put the network devices alone on the other, but I
 suppose if that's what it takes....  I guess I won't know for sure
 though if the bge removal fixes it until at least a couple of weeks go
 by without further problems along these lines.

 (note I cannot bring up wm1 concurrently with wm0 with this kernel -- I
 encounter a similar DMA resource problem....  I'm not even sure it
 worked with a -current kernel.  I didn't want a dual-port card, but they
 were the same price as the single, and a dual is of more use in other
 kinds of machines if I can ever get the bge to work again, and if we
 ever get a copper GigE port on the/a switch to connect it with, but in
 the mean time even without the DMA resource issues, the bge driver still
 only goes about half the speed of the wm driver.  :-)

 -- 
 						Greg A. Woods

 H:+1 416 218-0098  W:+1 416 489-5852 x122  VE3TCP  RoboHack <woods@robohack.ca>
 Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>

>Unformatted: