NetBSD Problem Report #39993
From bouyer@antioche.lip6.fr Fri Nov 21 11:33:09 2008
Return-Path: <bouyer@antioche.lip6.fr>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by narn.NetBSD.org (Postfix) with ESMTP id 138DC63BD2D
for <gnats-bugs@gnats.NetBSD.org>; Fri, 21 Nov 2008 11:33:09 +0000 (UTC)
Message-Id: <200811211133.mALBX4Yi023716@antioche.lip6.fr>
Date: Fri, 21 Nov 2008 12:33:04 +0100 (MET)
From: bouyer@antioche.eu.org
Reply-To: bouyer@antioche.eu.org
To: gnats-bugs@gnats.NetBSD.org
Subject: lockup on i386 SMP (raidframe related ?)
X-Send-Pr-Version: 3.95
>Number: 39993
>Category: kern
>Synopsis: lockup on i386 SMP (raidframe related ?)
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Nov 21 11:35:00 +0000 2008
>Closed-Date: Wed Jun 01 22:47:46 +0000 2016
>Last-Modified: Wed Jun 01 22:47:46 +0000 2016
>Originator: Manuel Bouyer
>Release: NetBSD 5.0_BETA
>Organization:
>Environment:
System: NetBSD antioche.lip6.fr 5.0_BETA NetBSD 5.0_BETA (ANTIOCHE5) #2: Thu Nov 20 23:55:28 CET 2008 bouyer@roll:/dsk/l1/misc/bouyer/tmp/i386/obj/dsk/l1/misc/bouyer/netbsd-5/src/sys/arch/i386/compile/ANTIOCHE5 i386
Architecture: i386
Machine: i386
>Description:
This system is a dual-CPU PIII system, with several SCSI disks on
multiple esiop controllers. Some of them are part of raid-1
raidframe volume (2 disks per volume). A SMP kernel will lookup
within minutes after boot, under I/O load. The system is unresponsive
to network or console (no ping, and no characters echoed on serial
console) but I could enter ddb using cnmagic sequence. Here's
what I found from ddb:
fatal breakpoint trap in supervisor mode
trap type 1 code 0 eip c03aabec cs 8 eflags 202 cr2 bb504000 ilevel 8
Stopped in pid 0.4 (system) at netbsd:breakpoint+0x4: popl %ebp
db{0}> tr
breakpoint(0,3f8,0,6,ca7953c0,cbd01900,cbb5beb0,c1540010,c1541000,7fa) at netbsd:breakpoint+0x4
comintr(cbd017f4,cbb5bec0,6,10,c04f0030,cbb50010,c04f0010,c04f4800,1,cbb5bf4c) at netbsd:comintr+0x566
Xintr_ioapic_edge4() at netbsd:Xintr_ioapic_edge4+0xa9
--- interrupt ---
fatal page fault in supervisor mode
trap type 6 code 0 eip c03ad04f cs 8 eflags 10206 cr2 3e ilevel 8
kernel: supervisor trap page fault, code=0
Faulted in DDB; continuing...
db{0}> mach cpu 1
using CPU 1
db{0}> tr
__cpu_simple_lock(c1702c40,c15a99f8,0,0,c15560d8,c15560d0,0,c1556000,c01c52c0, cc5a2d20) at netbsd:__cpu_simple_lock+0x1c
rf_RaidIOThread(c1556000,0,c01002a7,0,c01002a7,0,0,0,0,0) at netbsd:rf_RaidIOThread+0x7f
rf_RaidIOThread+0x7f is:
0xc01c533f is in rf_RaidIOThread (/dsk/l1/misc/bouyer/netbsd-5/src/sys/dev/raidf
rame/rf_engine.c:863).
858 /* See what I/Os, if any, have arrived */
859 while ((req = TAILQ_FIRST(&(raidPtr->iodone))) != NULL) {
860 TAILQ_REMOVE(&(raidPtr->iodone), req, iodone_entries);
861 simple_unlock(&(raidPtr->iodone_lock));
862 rf_DiskIOComplete(req->queue, req, req->error);
863 (req->CompleteFunc) (req->argument, req->error);
864 simple_lock(&(raidPtr->iodone_lock));
865 }
0xc01c5324 <rf_RaidIOThread+100>: call 0xc010cd90 <__cpu_simple_unlock>
0xc01c5329 <rf_RaidIOThread+105>: mov 0x70(%ebx),%eax
0xc01c532c <rf_RaidIOThread+108>: mov 0x58(%ebx),%edx
0xc01c532f <rf_RaidIOThread+111>: mov %ebx,0x4(%esp)
0xc01c5333 <rf_RaidIOThread+115>: mov %eax,0x8(%esp)
0xc01c5337 <rf_RaidIOThread+119>: mov %edx,(%esp)
0xc01c533a <rf_RaidIOThread+122>: call 0xc01c1780 <rf_DiskIOComplete>
0xc01c533f <rf_RaidIOThread+127>: mov 0x2c(%ebx),%edx
0xc01c5342 <rf_RaidIOThread+130>: mov 0x70(%ebx),%eax
0xc01c5345 <rf_RaidIOThread+133>: mov 0x28(%ebx),%ecx
0xc01c5348 <rf_RaidIOThread+136>: mov %edx,(%esp)
0xc01c534b <rf_RaidIOThread+139>: mov %eax,0x4(%esp)
0xc01c534f <rf_RaidIOThread+143>: call *%ecx
0xc01c5351 <rf_RaidIOThread+145>: mov %esi,(%esp)
0xc01c5354 <rf_RaidIOThread+148>: call 0xc010cd70 <__cpu_simple_lock>
I also have a core dump. The stack traces were indentical in all hangs I got.
disabling SMP at boot (boot -1) work around the problem.
LOCKDEBUG+DEBUG+DIAGNOSTIC does't give any additionnal info.
This hardware was running without issues under 3.1 with SMP.
>How-To-Repeat:
boot a SMP system with raidframe and generate I/O ?
>Fix:
unknown
>Release-Note:
>Audit-Trail:
From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/39993: lockup on i386 SMP (raidframe related ?)
Date: Fri, 21 Nov 2008 09:19:31 -0600
bouyer@antioche.eu.org writes:
> I also have a core dump. The stack traces were indentical in all hangs I got.
> disabling SMP at boot (boot -1) work around the problem.
> LOCKDEBUG+DEBUG+DIAGNOSTIC does't give any additionnal info.
> This hardware was running without issues under 3.1 with SMP.
>
> >How-To-Repeat:
> boot a SMP system with raidframe and generate I/O ?
Were it only that simple, I'd be happy... Unfortunately, I've got a
couple of different boxes w/ 5.0_BETA+SMP+RAIDframe+heavy IO and I
havn't seen this problem at all.. :(
What happens if you do:
dd if=/dev/rsd0e of=/dev/null bs=1m &
dd if=/dev/rsd1e of=/dev/null bs=1m &
where rsd0e and rsd1e are the (raw) components of your RAID set?
Later...
Greg Oster
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/39993: lockup on i386 SMP (raidframe related ?)
Date: Fri, 21 Nov 2008 20:29:18 +0100
On Fri, Nov 21, 2008 at 03:20:03PM +0000, Greg Oster wrote:
> bouyer@antioche.eu.org writes:
> > I also have a core dump. The stack traces were indentical in all hangs I got.
> > disabling SMP at boot (boot -1) work around the problem.
> > LOCKDEBUG+DEBUG+DIAGNOSTIC does't give any additionnal info.
> > This hardware was running without issues under 3.1 with SMP.
> >
> > >How-To-Repeat:
> > boot a SMP system with raidframe and generate I/O ?
>
> Were it only that simple, I'd be happy... Unfortunately, I've got a
> couple of different boxes w/ 5.0_BETA+SMP+RAIDframe+heavy IO and I
> havn't seen this problem at all.. :(
>
> What happens if you do:
>
> dd if=/dev/rsd0e of=/dev/null bs=1m &
> dd if=/dev/rsd1e of=/dev/null bs=1m &
>
> where rsd0e and rsd1e are the (raw) components of your RAID set?
works fine (I tried with the different components of the RAID).
I also tried to reproduce it on a athlonx2 with 2 SATA drives, no luck.
other factors that may be relevant:
- when this happens a background parity rewrite is running
- there may be hardware issues with drives (like command timeouts),
so there may be aborted transfers/I/O errors reported to the raidframe
layer.
What I could do it try let it rebuild parity on the UP kernel, and reboot
to the SMP kernel after.
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/39993: lockup on i386 SMP (raidframe related ?)
Date: Fri, 21 Nov 2008 20:30:54 +0100
On Fri, Nov 21, 2008 at 03:20:03PM +0000, Greg Oster wrote:
> The following reply was made to PR kern/39993; it has been noted by GNATS.
>
> From: Greg Oster <oster@cs.usask.ca>
> To: gnats-bugs@NetBSD.org
> Cc:
> Subject: Re: kern/39993: lockup on i386 SMP (raidframe related ?)
> Date: Fri, 21 Nov 2008 09:19:31 -0600
>
> bouyer@antioche.eu.org writes:
> > I also have a core dump. The stack traces were indentical in all hangs I got.
> > disabling SMP at boot (boot -1) work around the problem.
> > LOCKDEBUG+DEBUG+DIAGNOSTIC does't give any additionnal info.
> > This hardware was running without issues under 3.1 with SMP.
> >
> > >How-To-Repeat:
> > boot a SMP system with raidframe and generate I/O ?
>
> Were it only that simple, I'd be happy... Unfortunately, I've got a
> couple of different boxes w/ 5.0_BETA+SMP+RAIDframe+heavy IO and I
> havn't seen this problem at all.. :(
>
> What happens if you do:
>
> dd if=/dev/rsd0e of=/dev/null bs=1m &
> dd if=/dev/rsd1e of=/dev/null bs=1m &
>
> where rsd0e and rsd1e are the (raw) components of your RAID set?
I forgot to say: the box hung again while SMP mutiuser, with the exact
same data from ddb. I couldn't get more infos.
It would be interesting to see what the cpu0 was doing when I entered
ddb, but ddb couln't figure it past the interrupt entry :(
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/39993: lockup on i386 SMP (raidframe related ?)
Date: Sat, 22 Nov 2008 15:11:06 +0100
On Fri, Nov 21, 2008 at 08:29:18PM +0100, Manuel Bouyer wrote:
> > Were it only that simple, I'd be happy... Unfortunately, I've got a
> > couple of different boxes w/ 5.0_BETA+SMP+RAIDframe+heavy IO and I
> > havn't seen this problem at all.. :(
> >
> > What happens if you do:
> >
> > dd if=/dev/rsd0e of=/dev/null bs=1m &
> > dd if=/dev/rsd1e of=/dev/null bs=1m &
> >
> > where rsd0e and rsd1e are the (raw) components of your RAID set?
>
> works fine (I tried with the different components of the RAID).
>
> I also tried to reproduce it on a athlonx2 with 2 SATA drives, no luck.
>
> other factors that may be relevant:
> - when this happens a background parity rewrite is running
> - there may be hardware issues with drives (like command timeouts),
> so there may be aborted transfers/I/O errors reported to the raidframe
> layer.
>
> What I could do it try let it rebuild parity on the UP kernel, and reboot
> to the SMP kernel after.
What I did:
- reboot in UP mode, let parity rebuild complete.
- reboot in SMP mode, multiuser
and now the system is up for about 17H, with its usual load. So it looks like
the issue is related to parity rebuild.
From the traces I gathered from ddb and gdb, it looks like CPU 1 is
trying to aquire a simple_lock (could it be in rf_DiskIOComplete, the
RF_LOCK_QUEUE_MUTEX(queue, "DiskIOComplete"); ?) while CPU 0 is halted with
this lock held.
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Antti Kantee <pooka@cs.hut.fi>
To: jdc@netbsd.org, oster@netbsd.org, gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/39993
Date: Thu, 19 Feb 2009 22:39:03 +0200
--5mCyUwZo2JvN/JJP
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Just some guessing: it appears that bdev_strategy() can block and
we can call it with the queue spinlock held. Maybe this will make the
trouble go away?
(not blessed by Greg, so test at your own risk ;)
Index: rf_netbsdkintf.c
===================================================================
RCS file: /cvsroot/src/sys/dev/raidframe/rf_netbsdkintf.c,v
retrieving revision 1.255
diff -p -u -r1.255 rf_netbsdkintf.c
--- rf_netbsdkintf.c 7 Feb 2009 20:36:49 -0000 1.255
+++ rf_netbsdkintf.c 19 Feb 2009 20:37:01 -0000
@@ -2130,7 +2130,15 @@ rf_DispatchKernelIO(RF_DiskQueue_t *queu
(int) (req->numSector <<
queue->raidPtr->logBytesPerSector),
(int) queue->raidPtr->logBytesPerSector));
+
+ /*
+ * XXX: drop lock here since this can block at least
+ * with backing SCSI devices. Retake it
+ * to minimize fuss with calling interfaces.
+ */
+ RF_UNLOCK_QUEUE_MUTEX(queue, "unusedparam");
bdev_strategy(bp);
+ RF_LOCK_QUEUE_MUTEX(queue, "unusedparam");
break;
--5mCyUwZo2JvN/JJP
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="rf.diff"
Index: rf_netbsdkintf.c
===================================================================
RCS file: /cvsroot/src/sys/dev/raidframe/rf_netbsdkintf.c,v
retrieving revision 1.255
diff -p -u -r1.255 rf_netbsdkintf.c
--- rf_netbsdkintf.c 7 Feb 2009 20:36:49 -0000 1.255
+++ rf_netbsdkintf.c 19 Feb 2009 20:37:01 -0000
@@ -2130,7 +2130,15 @@ rf_DispatchKernelIO(RF_DiskQueue_t *queu
(int) (req->numSector <<
queue->raidPtr->logBytesPerSector),
(int) queue->raidPtr->logBytesPerSector));
+
+ /*
+ * XXX: drop lock here since this can block at least
+ * with backing SCSI devices. Retake it
+ * to minimize fuss with calling interfaces.
+ */
+ RF_UNLOCK_QUEUE_MUTEX(queue, "unusedparam");
bdev_strategy(bp);
+ RF_LOCK_QUEUE_MUTEX(queue, "unusedparam");
break;
--5mCyUwZo2JvN/JJP--
From: Julian Coleman <jdc@coris.org.uk>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/39993
Date: Thu, 19 Feb 2009 21:42:29 +0000
Hi,
> Just some guessing: it appears that bdev_strategy() can block and
> we can call it with the queue spinlock held. Maybe this will make the
> trouble go away?
> (not blessed by Greg, so test at your own risk ;)
Summary: I have a Sun E420r (4 CPU, esiop SCSI) that always locked up when
trying reconstruction (raidctl -R /dev/sd1a raid0) although parity rewrites
were fine. It never made it to the end of the reconstruction and usually
locked up within the first 10 seconds.
With this patch, it runs all the way through the reconstruction. It would
be good to get this in and also pulled up to netbsd-5.
Thanks,
J
--
My other computer also runs NetBSD / Sailing at Newbiggin
http://www.netbsd.org/ / http://www.newbigginsailingclub.org/
From: "Greg A. Woods" <woods@planix.com>
To: NetBSD GNATS <gnats-bugs@NetBSD.org>
Cc: NetBSD Kernel Bug People <kern-bug-people@netbsd.org>,
NetBSD GNATS Administrator <gnats-admin@NetBSD.org>
Subject: Re: kern/39993
Date: Thu, 19 Feb 2009 17:40:11 -0500
--pgp-sign-Multipart_Thu_Feb_19_17:40:11_2009-1
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable
At Thu, 19 Feb 2009 21:50:05 +0000 (UTC), Julian Coleman wrote:
Subject: Re: kern/39993
>=20
> With this patch, it runs all the way through the reconstruction. It wou=
ld
> be good to get this in and also pulled up to netbsd-5.
and of course the supported releases, netbsd-4 and netbsd-3, too, if relevant.
--=20
Greg A. Woods
H:+1 416 218-0098 W:+1 416 489-5852 VE3TCP RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com> Secrets of the Weird <woods@weird.com>
--pgp-sign-Multipart_Thu_Feb_19_17:40:11_2009-1
Content-Type: application/pgp-signature
Content-Transfer-Encoding: 7bit
-----BEGIN PGP SIGNATURE-----
Version: PGPfreeware 5.0i for non-commercial use
MessageID: kaM7W5gbuOZEHH7SLzRfwce/GQCDwaF3
iQA/AwUBSZ3fy2Z9cbd4v/R/EQKdpQCfTv4i9lcBQgN7U3Z1OAI0GDFWxwoAn0Kb
8ahycdtcNkYJecxKjOfy0IcT
=0qLo
-----END PGP SIGNATURE-----
--pgp-sign-Multipart_Thu_Feb_19_17:40:11_2009-1--
From: Greg Oster <oster@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/39993 CVS commit: src/sys/dev/raidframe
Date: Sat, 28 Feb 2009 22:02:17 +0000 (UTC)
Module Name: src
Committed By: oster
Date: Sat Feb 28 22:02:17 UTC 2009
Modified Files:
src/sys/dev/raidframe: rf_netbsdkintf.c
Log Message:
It seems that some SCSI drivers can block while the spinlock on a given
disk queue is being held. Work around this by dropping the lock before
bdev_strategy(), and re-grabbing the lock afterwards. This is a
temporary measure until I get to gutting this queue locking code.
There has been some success with this in addressing PR#39993.
This patch is from Antti Kantee. Thanks!
To generate a diff of this commit:
cvs rdiff -r1.255 -r1.256 src/sys/dev/raidframe/rf_netbsdkintf.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Soren Jacobsen <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/39993 CVS commit: [netbsd-5] src/sys/dev/raidframe
Date: Mon, 2 Mar 2009 20:58:27 +0000
Module Name: src
Committed By: snj
Date: Mon Mar 2 20:58:27 UTC 2009
Modified Files:
src/sys/dev/raidframe [netbsd-5]: rf_netbsdkintf.c
Log Message:
Pull up following revision(s) (requested by oster in ticket #543):
sys/dev/raidframe/rf_netbsdkintf.c: revision 1.256
It seems that some SCSI drivers can block while the spinlock on a given
disk queue is being held. Work around this by dropping the lock before
bdev_strategy(), and re-grabbing the lock afterwards. This is a
temporary measure until I get to gutting this queue locking code.
There has been some success with this in addressing PR#39993.
This patch is from Antti Kantee. Thanks!
To generate a diff of this commit:
cvs rdiff -u -r1.250.4.2 -r1.250.4.3 src/sys/dev/raidframe/rf_netbsdkintf.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->closed
State-Changed-By: pgoyette@NetBSD.org
State-Changed-When: Wed, 01 Jun 2016 22:47:46 +0000
State-Changed-Why:
Fix committed by oster on Feb 28 22:02:17 UTC 2009
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2014
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.