NetBSD Problem Report #13298

Received: (qmail 5175 invoked from network); 24 Jun 2001 18:52:02 -0000
Message-Id: <20010624185436.7669ED0@proven.weird.com>
Date: Sun, 24 Jun 2001 14:54:36 -0400 (EDT)
From: woods@weird.com (Greg A. Woods)
Reply-To: woods@planix.com (Greg A. Woods)
To: gnats-bugs@gnats.netbsd.org
Subject: sparc esp driver leaves processes stuck forever after a time out
X-Send-Pr-Version: 3.95

>Number:         13298
>Category:       port-sparc
>Synopsis:       sparc esp driver leaves processes stuck forever after a time out
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    port-sparc-maintainer
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Jun 24 18:53:01 +0000 2001
>Closed-Date:    Mon Dec 21 23:03:25 +0000 2009
>Last-Modified:  Mon Dec 21 23:03:25 +0000 2009
>Originator:     Greg A. Woods
>Release:        2001/06/19
>Organization:
Planix, Inc.; Toronto, Ontario; Canada
>Environment:

NetBSD sometimes 1.5W NetBSD 1.5W (GENERIC) #0: Wed Jun 20 17:12:11 EDT 2001     woods@sometimes:/proven/work/woods/NetBSD-src/sys/arch/sparc/compile/GENERIC sparc
Architecture: sparc
Machine: sparc

>Description:

	I'd been running a 'make build' on sparc, and just as it was
	getting near the end, it got stuck, apparently in an I/O
	operation.

  UID   PID  PPID CPU PRI  NI   VSZ  RSS WCHAN    STAT TT     TIME COMMAND
    0  6488  6487  29  -5   0   372  192 biowait  D+   p0  0:00.26 make -m /usr/share/mk _THISDIR_ gnu/usr.bin/binut
 1000  6829  6535   0  -2   0   416  140 vnlock   D+   p1  0:00.02 /bin/ls -lis obj/ 

USER   PID %CPU %MEM VSZ RSS TT STAT STARTED    TIME COMMAND
root  6488  0.0  0.1 372 192 p0 D+   12:34PM 0:00.26 make -m /usr/share/mk _THISDIR_ gnu/usr.bin/binutils/size/
woods 6829  0.0  0.1 416 140 p1 D+    1:52PM 0:00.02 /bin/ls -lis obj/

	Note that the make process appears stuck at the point where it's
	trying to read a directory or some inode.  The object
	directories are symlinked to a second disk mounted as /build,
	and that's what the 'ls' got stuck trying to look at.

	At the time the first process got stuck the kernel said:

Jun 24 12:35:50 sometimes /netbsd: sd1(esp0:0:1:0): esp0: timed out [ecb 0xf09494e0 (flags 0x1, dleft 800, stat 0)], <state 1, nexus 0x0, phase(l 10, c 100, p 3), resid 2000, msg(q 0,o 0) >

	and, as you can see from the stuck "ls", any further I/O in that
	same area gets stuck.
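
	For anyone comparing several of these timeout messages, the line
	above can be split into its labelled fields.  This is only a sketch:
	the regular expression and the dictionary keys are illustrative names
	chosen here, and the values are kept as the strings the kernel
	printed because the report does not say which number base each field
	uses.

```python
import re

# Pattern for the "timed out" line quoted in this report.  The group names
# are illustrative; they mirror the labels printed in the message.
TIMEOUT_RE = re.compile(
    r"(?P<periph>\w+)\((?P<adapter>\w+):(?P<channel>\d+):"
    r"(?P<target>\d+):(?P<lun>\d+)\): \w+: timed out "
    r"\[ecb (?P<ecb>\w+) \(flags (?P<flags>\w+), dleft (?P<dleft>\w+), "
    r"stat (?P<stat>\w+)\)\], <state (?P<state>\w+), nexus (?P<nexus>\w+), "
    r"phase\(l (?P<phase_l>\w+), c (?P<phase_c>\w+), p (?P<phase_p>\w+)\), "
    r"resid (?P<resid>\w+), msg\(q (?P<msg_q>\w+),o (?P<msg_o>\w+)\)"
)


def parse_esp_timeout(line):
    """Return the labelled fields as strings, or None if the line differs."""
    match = TIMEOUT_RE.search(line)
    return match.groupdict() if match else None


# The kernel message exactly as quoted in the report (syslog prefix included).
SAMPLE = (
    "Jun 24 12:35:50 sometimes /netbsd: sd1(esp0:0:1:0): esp0: timed out "
    "[ecb 0xf09494e0 (flags 0x1, dleft 800, stat 0)], <state 1, nexus 0x0, "
    "phase(l 10, c 100, p 3), resid 2000, msg(q 0,o 0) >"
)

fields = parse_esp_timeout(SAMPLE)
print(fields["periph"], "target", fields["target"], "lun", fields["lun"])
print("dleft", fields["dleft"], "resid", fields["resid"])
```

	Keeping the values as strings avoids guessing at their meaning;
	the target and lun fields can be matched against the sd1 attach
	line in the dmesg output.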

	However not all access to 'sd1' is stuck.  Raw access seems fine
	(though I didn't explicitly try to get to the same data where
	the timeout was triggered), and filesystem access seems fine in
	any other directory I try peeking in.

	kill -9 has no effect on either process.
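
	Both processes show STAT "D", the uninterruptible sleep state, in
	which a signal is not acted on until the sleep ends; that matches
	kill -9 doing nothing here.  A minimal sketch of picking such
	processes and their wait channels out of ps output in the long
	format shown above follows.  The function name is made up for this
	example, and it assumes every column, WCHAN included, is non-empty
	on every row.

```python
def stuck_processes(ps_output):
    """Return (pid, wchan, command) for rows whose STAT starts with 'D'."""
    lines = [ln for ln in ps_output.splitlines() if ln.strip()]
    header = lines[0].split()
    found = []
    for ln in lines[1:]:
        # Split into as many fields as there are header columns; the last
        # field (COMMAND) keeps its embedded spaces.
        fields = ln.split(None, len(header) - 1)
        if len(fields) < len(header):
            continue  # row does not match the header layout; skip it
        row = dict(zip(header, fields))
        if row["STAT"].startswith("D"):
            found.append((int(row["PID"]), row["WCHAN"], row["COMMAND"].rstrip()))
    return found


# The first ps listing from the report, copied as-is (command truncated by ps).
PS_SAMPLE = """\
  UID   PID  PPID CPU PRI  NI   VSZ  RSS WCHAN    STAT TT     TIME COMMAND
    0  6488  6487  29  -5   0   372  192 biowait  D+   p0  0:00.26 make -m /usr/share/mk _THISDIR_ gnu/usr.bin/binut
 1000  6829  6535   0  -2   0   416  140 vnlock   D+   p1  0:00.02 /bin/ls -lis obj/
"""

for pid, wchan, command in stuck_processes(PS_SAMPLE):
    print(pid, wchan, command)
```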

	On reboot only a simple warning was given:

	    resyncing disks... boot: WARNING: some process(es) wouldn't die
	    1 1 done

	However the system appeared to hang at that point.  I waited for
	about seven minutes, then sent a BREAK:

	    telnet> send brk
	    Stopped at      cpu_Debugger+0x4:       jmpl            [%o7 + 0x8], %g0
	    db> trace
	    zsc_intr_hard(0x8, 0xf0837ed0, 0xf0264000, 0xfe000000, 0x2bea, 0x160df32) at zsc_intr_hard+0x68
	    zshard(0x0, 0xf01b7c20, 0x3b3635bc, 0xd94e7, 0x3d, 0x1) at zshard+0x40
	    sparc_interrupt44c(0x0, 0x0, 0xf021785c, 0x0, 0xffffffff, 0xf02929e4) at sparc_interrupt44c+0x104
	    mi_switch(0xf0290c40, 0x528f, 0xf0262078, 0xf0292b40, 0x0, 0x70) at mi_switch+0x1a4
	    ltsleep(0x0, 0x4, 0xf0232750, 0x0, 0x20000000, 0xf0224320) at ltsleep+0x214
	    uvm_scheduler(0x30a, 0x1b3d, 0xf0263c00, 0x4a40, 0xffffffff, 0xa8c0) at uvm_scheduler+0xac
	    main(0x0, 0xfffffff8, 0xf00021f0, 0xf0263ec3, 0x38b908, 0x28d0bc) at main+0x880
	    Lgandul(0x388110, 0x3951b0, 0x387eb4, 0x0, 0x397400, 0xffffffff) at Lgandul+0xe8

	    db> 

	I decided to just force a reboot and get on with things....

	Here's the dmesg output, FYI:

[ using 275272 bytes of netbsd ELF symbol table ]

Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001
    The NetBSD Foundation, Inc.  All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
    The Regents of the University of California.  All rights reserved.

NetBSD 1.5W (GENERIC) #0: Wed Jun 20 17:12:11 EDT 2001
    woods@sometimes:/proven/work/woods/NetBSD-src/sys/arch/sparc/compile/GENERIC
total memory = 287 MB
avail memory = 262 MB
using 896 buffers containing 14812 KB of memory
bootpath: /iommu@f,e0000000/sbus@f,e0001000/espdma@f,400000/esp@f,800000/sd@0,0
mainbus0 (root): SUNW,Axil-320
cpu0 at mainbus0: TMS390Z50 v0 or TMS390Z55 @ 75 MHz, on-chip FPU
cpu0: physical 20K instruction (64 b/l), 16K data (32 b/l), 1024K external (32 b/l): cache enabled
obio0 at mainbus0
clock0 at obio0 slot 0 offset 0x200000: mk48t08: hostid 72971942
timer0 at obio0 slot 0 offset 0x300000 delay constant 35
zs0 at obio0 slot 0 offset 0x100000 level 12 softpri 6
zstty0 at zs0 channel 0 (console i/o)
zstty1 at zs0 channel 1
zs1 at obio0 slot 0 offset 0x0 level 12 softpri 6
kbd0 at zs1 channel 0
ms0 at zs1 channel 1
fdc0 at obio0 slot 0 offset 0x700000 level 11 softpri 4: chip 82077
fd0 at fdc0 drive 0: 1.44MB 80 cyl, 2 head, 18 sec
auxreg0 at obio0 slot 0 offset 0x800000
power0 at obio0 slot 0 offset 0xa01000 level 2
iommu0 at mainbus0 ioaddr 0xe0000000: version 0x1/0x1, page-size 4096, range 64MB
sbus0 at iommu0: clock = 25 MHz
dma0 at sbus0 slot 15 offset 0x400000: dma rev 2
esp0 at dma0 slot 15 offset 0x800000 level 4: ESP200, 40MHz, SCSI ID 7
scsibus0 at esp0: 8 targets, 8 luns per target
ledma0 at sbus0 slot 15 offset 0x400010: dma rev 2
le0 at ledma0 slot 15 offset 0xc00000 level 6: address 00:00:3b:80:3c:56
le0: 8 receive buffers, 2 transmit buffers
bpp0 at sbus0 slot 15 offset 0x4800000 level 2 (ipl 3): dma rev 2
SUNW,DBRIe at sbus0 slot 15 offset 0x8010000 level 9 not configured
eccmemctl0 at mainbus0: version 0x1/0x1
scsibus0: waiting 2 seconds for devices to settle...
esp0: wide mode 0
sd0 at scsibus0 target 0 lun 0: <WDIGTL, ENTERPRISE, 1.61> SCSI2 0/direct fixed
sd0: 4157 MB, 5720 cyl, 8 head, 186 sec, 512 bytes/sect x 8515173 sectors
sd0: sync (100.0ns offset 15), 8-bit (10.000MB/s) transfers, tagged queueing
sd1 at scsibus0 target 1 lun 0: <SEAGATE, ST32430N, 0510> SCSI2 0/direct fixed
sd1: 2049 MB, 3992 cyl, 9 head, 116 sec, 512 bytes/sect x 4197405 sectors
sd1: sync (100.0ns offset 15), 8-bit (10.000MB/s) transfers, tagged queueing
st0 at scsibus0 target 4 lun 0: <EXABYTE, EXB-4200, 216> SCSI2 1/sequential removable
st0: drive empty
st0: async, 8-bit transfers
cd0 at scsibus0 target 6 lun 0: <PLEXTOR, CD-ROM PX-4XCH, 1.23> SCSI2 5/cdrom removable
cd0: sync (248.0ns offset 15), 8-bit (4.032MB/s) transfers
root on sd0a dumps on sd0b
root file system type: ffs


>How-To-Repeat:

	unknown

>Fix:

	unknown

>Release-Note:
>Audit-Trail:

From: woods@weird.com (Greg A. Woods)
To: gnats-bugs@gnats.netbsd.org (NetBSD GNATS submissions and followups)
Cc: netbsd-bugs@NetBSD.ORG (NetBSD Bugs and PR posting List)
Subject: Re: port-sparc/13298: sparc esp driver leaves processes stuck forever after a time out
Date: Sun, 24 Jun 2001 22:46:14 -0400 (EDT)

 [ On Sunday, June 24, 2001 at 14:54:36 (-0400), Greg A. Woods wrote: ]
 > Subject: port-sparc/13298: sparc esp driver leaves processes stuck forever after a time out
 > 
 > Jun 24 12:35:50 sometimes /netbsd: sd1(esp0:0:1:0): esp0: timed out [ecb 0xf09494e0 (flags 0x1, dleft 800, stat 0)], <state 1, nexus 0x0, phase(l 10, c 100, p 3), resid 2000, msg(q 0,o 0) >

 Well it happened again, but with a bit more info this time:

 sd1: waiting for pack to spin up...
 sd1(esp0:0:1:0): esp0: timed out [ecb 0xf0852c08 (flags 0x1, dleft 2000, stat 0)], <state 1, nexus 0x0, phase(l 10, c 100, p 3), resid 0, msg(q 0,o 0) >
 sd1(esp0:0:1:0): esp0: timed out [ecb 0xf0852ab8 (flags 0x1, dleft 2000, stat 0)], <state 1, nexus 0x0, phase(l 10, c 100, p 3), resid 0, msg(q 0,o 0) >

 This time the triggering process was stuck in 'getblk':

     0 24532 24531  24  -5   0   192  192 getblk   D    p0 0:00.03 make -m /usr/


 Hmmmm... I thought... if the disk takes itself offline momentarily then
 that smells a lot like the disk is doing something bad, like maybe
 encountering an error that 'esp's not reporting properly....

 Well it turns out that ARRE (automatic read reallocation) wasn't enabled
 on the disk (damn I wish this would happen automatically if the driver's
 not going to do reassignment!)
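
 For reference, ARRE is one bit in the SCSI-2 Read-Write Error Recovery
 mode page (page code 01h).  The sketch below decodes that page's flag
 byte and shows which bit gets set when ARRE is turned on.  The sample
 value 0x80 is hypothetical and was not read from the disk in this
 report; the function names are likewise made up for illustration.

```python
# Bit layout of byte 2 of the Read-Write Error Recovery mode page,
# most significant bit first, as defined by SCSI-2.
ERROR_RECOVERY_BITS = [
    (0x80, "AWRE"),  # automatic write reallocation enabled
    (0x40, "ARRE"),  # automatic read reallocation enabled
    (0x20, "TB"),    # transfer block
    (0x10, "RC"),    # read continuous
    (0x08, "EER"),   # enable early recovery
    (0x04, "PER"),   # post error
    (0x02, "DTE"),   # disable transfer on error
    (0x01, "DCR"),   # disable correction
]

ARRE_MASK = 0x40


def decode_recovery_flags(flag_byte):
    """Map each flag name to True/False for the given byte value."""
    return {name: bool(flag_byte & mask) for mask, name in ERROR_RECOVERY_BITS}


def with_arre_enabled(flag_byte):
    """Return the byte with the ARRE bit set and all other bits unchanged."""
    return flag_byte | ARRE_MASK


# Hypothetical starting value: only AWRE set, ARRE clear.
before = 0x80
after = with_arre_enabled(before)
print(hex(before), decode_recovery_flags(before)["ARRE"])
print(hex(after), decode_recovery_flags(after)["ARRE"])
```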

 Using the old FreeBSD "scsi" tool (which still works just fine for
 mode-page adjustments on NetBSD, including on sparc), I turned it on and
 did a 'dd if=/dev/rsd1c of=/dev/null' with nary a problem.  The
 performance wasn't very stunning though:

 	4197405+0 records in
 	4197405+0 records out
 	2149071360 bytes transferred in 4885.573 secs (439881 bytes/sec)
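
 The dd figures can be cross-checked against the dmesg line for sd1.
 This is a small arithmetic sketch, assuming dd ran with its default
 512-byte block size (so one record is one sector); the constant names
 are chosen here for illustration.

```python
SECTOR_SIZE = 512          # bytes/sect reported by dmesg for sd1
SECTORS = 4197405          # sector count from dmesg; also dd's record count
BYTES_MOVED = 2149071360   # "bytes transferred" reported by dd
ELAPSED_SECS = 4885.573    # elapsed time reported by dd
NEGOTIATED_RATE = 10_000_000  # 10.000MB/s sync rate from the sd1 dmesg line


def bytes_per_second(total_bytes, seconds):
    """Average throughput over the whole transfer."""
    return total_bytes / seconds


rate = bytes_per_second(BYTES_MOVED, ELAPSED_SECS)

# The whole disk was read: record count times sector size equals the byte count.
print(SECTORS * SECTOR_SIZE == BYTES_MOVED)
# Truncated to an integer this matches dd's own figure of 439881 bytes/sec.
print(int(rate))
# Fraction of the negotiated bus rate that was actually achieved.
print(f"{rate / NEGOTIATED_RATE:.1%}")
```

 So the read covered the entire disk, at under a twentieth of the
 negotiated synchronous transfer rate.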

 So, one more reboot to restart the 'make build' again....

 This time the disk synching failed but the reboot didn't hang....


 I wonder if this problem is the same or similar to the one Jim Bernard
 has been reporting on port-sparc under the heading "SCSI probs on spork 10."


 In any case it seems to me that the 'esp' driver hasn't kept up to the
 new scsipi times.....

 -- 
 							Greg A. Woods

 +1 416 218-0098      VE3TCP      <gwoods@acm.org>     <woods@robohack.ca>
 Planix, Inc. <woods@planix.com>;   Secrets of the Weird <woods@weird.com>

From: Julian Coleman <jdc@coris.demon.co.uk>
To: woods@weird.com, gnats-bugs@gnats.netbsd.org
Cc:  
Subject: Re: port-sparc/13298: sparc esp driver leaves processes stuck forever after a time out
Date: Mon, 25 Jun 2001 10:45:03 +0100

 > In any case it seems to me that the 'esp' driver hasn't kept up to the
 > new scsipi times.....

 No, I think that it has.  I've seen a similar problem before.  When my
 Exabyte was dying, it would fail to respond.  This caused a timeout and the
 dump/tar/mt process to get stuck.  This was back in the 1.4n timeframe.

 J

 -- 
                     My other computer also runs NetBSD
                           http://www.netbsd.org/

From: Manuel Bouyer <bouyer@antioche.lip6.fr>
To: NetBSD GNATS submissions and followups <gnats-bugs@gnats.netbsd.org>,
   NetBSD Bugs and PR posting List <netbsd-bugs@netbsd.org>
Cc:  
Subject: Re: port-sparc/13298: sparc esp driver leaves processes stuck forever after a time out
Date: Mon, 25 Jun 2001 20:41:19 +0200

 On Sun, Jun 24, 2001 at 10:46:14PM -0400, Greg A. Woods wrote:
 > [...]
 > 
 > In any case it seems to me that the 'esp' driver hasn't kept up to the
 > new scsipi times.....

 I got problems with the esp driver before the scsipi changes.

 --
 Manuel Bouyer <bouyer@antioche.eu.org>
 --
State-Changed-From-To: open->feedback
State-Changed-By: mrg@NetBSD.org
State-Changed-When: Sun, 20 Dec 2009 08:30:42 +0000
State-Changed-Why:
this looks like failing hardware to me.  did the problem ever reoccur or go away?


From: "Greg A. Woods" <woods@planix.com>
To: NetBSD GNATS <gnats-bugs@NetBSD.org>
Cc: port-sparc-maintainer@netbsd.org,
	NetBSD GNATS Administrator <gnats-admin@NetBSD.org>,
	mrg@NetBSD.org
Subject: Re: port-sparc/13298 (sparc esp driver leaves processes stuck forever after a time out)
Date: Mon, 21 Dec 2009 15:01:45 -0500

 At Sun, 20 Dec 2009 08:30:43 +0000 (UTC), mrg@NetBSD.org wrote:
 Subject: Re: port-sparc/13298 (sparc esp driver leaves processes stuck forever after a time out)
 >
 > Synopsis: sparc esp driver leaves processes stuck forever after a time out
 >
 > State-Changed-From-To: open->feedback
 > State-Changed-By: mrg@NetBSD.org
 > State-Changed-When: Sun, 20 Dec 2009 08:30:42 +0000
 > State-Changed-Why:
 > this looks like failing hardware to me.  did the problem ever reoccur or go away?
 >

 We should probably close this PR now.

 The disk could have been failing -- one of the disks on that machine did
 eventually die.

 My main complaint at the time though was that processes would get stuck
 and become unkillable, so in part it wasn't specifically about the esp
 driver or the possibly dying device.

 If I understand correctly, many of the underlying issues here have been
 worked on subsequently and improved too.

 -- 
 						Greg A. Woods

 +1 416 218-0098                VE3TCP          RoboHack <woods@robohack.ca>
 Planix, Inc. <woods@planix.com>      Secrets of the Weird <woods@weird.com>

State-Changed-From-To: feedback->closed
State-Changed-By: mrg@NetBSD.org
State-Changed-When: Mon, 21 Dec 2009 23:03:25 +0000
State-Changed-Why:
thanks.


>Unformatted:
