NetBSD Problem Report #46168
From Manuel.Bouyer@lip6.fr Sat Mar 10 16:47:17 2012
Return-Path: <Manuel.Bouyer@lip6.fr>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
by www.NetBSD.org (Postfix) with ESMTP id 4367663BFFC
for <gnats-bugs@gnats.NetBSD.org>; Sat, 10 Mar 2012 16:47:17 +0000 (UTC)
Message-Id: <20120310164640.0E15C5B@xen1.soc.lip6.fr>
Date: Sat, 10 Mar 2012 17:46:40 +0100 (MET)
From: Manuel.Bouyer@lip6.fr
Reply-To: Manuel.Bouyer@lip6.fr
To: gnats-bugs@gnats.NetBSD.org
Subject: /usr/tests/lib/libpthread/t_cond somtimes unkillable
X-Send-Pr-Version: 3.95
>Number: 46168
>Category: kern
>Synopsis: /usr/tests/lib/libpthread/t_cond somtimes unkillable
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: rmind
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Mar 10 16:50:00 +0000 2012
>Closed-Date: Sat Oct 20 10:17:41 +0000 2012
>Last-Modified: Sat Oct 20 10:17:41 +0000 2012
>Originator: Manuel Bouyer
>Release: NetBSD 6.99.3
>Organization:
>Environment:
System: NetBSD xen1.soc.lip6.fr 6.99.3 NetBSD 6.99.3 (GENERIC_DEBUG) #6: Sat Mar 10 16:29:53 CET 2012 bouyer@hop:/dsk/l1/misc/bouyer/tmp/amd64/obj/dsk/l1/misc/bouyer/quota2/src/sys/arch/amd64/compile/GENERIC_DEBUG amd64
kernel with GENERIC+options DEBUG and the kmem_guard_depth setting as in
kmem(9).
Architecture: x86_64
Machine: amd64 (4-core with hyperthreading enabled Xeon system)
userland from recent netbsd-6
>Description:
/usr/tests/lib/libpthread/t_cond:cond_timedwait_race sometimes hang,
with an unkillable t_cond process:
xen1:/home/bouyer>ps axsw |grep t_cond
0 4066 1673 2578 20 2 43 0 93804 1628 - U- ttyp1 0:00.17 (t_cond)
0 4066 1673 2578 8 2 85 0 93804 1628 lwpwait D- ttyp1 0:00.17 (t_cond)
xen1:/home/bouyer>ps axsuw | grep t_cond
root 4066 0.0 0.0 93804 1628 ttyp1 DEl 5:27PM 0:00.17 (t_cond)
root 4066 0.0 0.0 93804 1628 ttyp1 DEl 5:27PM 0:00.17 (t_cond)
from ddb:
db{0}> ps
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
611 1 3 3 80 fffffe81b54c4c00 tcsh ttyraw
4162 1 3 0 80 fffffe81af6c2800 vi ttyraw
4363 1 3 0 80 fffffe81b653dc00 sh wait
4066 20 8 1 0 fffffe81ae6c2000 t_cond
4066 8 3 0 10000000 fffffe81b7c4b800 t_cond lwpwait
db{0}> tr/a 0xfffffe81ae6c2000
trace: pid 4066 lid 20 at 0x0
db{0}> tr/a 0xfffffe81b7c4b800
trace: pid 4066 lid 8 at 0xfffffe810c5e8a10
sleepq_block() at netbsd:sleepq_block+0xa4
cv_wait() at netbsd:cv_wait+0x101
lwp_wait1() at netbsd:lwp_wait1+0x274
exit_lwps() at netbsd:exit_lwps+0x108
exit1() at netbsd:exit1+0x5c
sys_exit() at netbsd:sys_exit+0x3e
syscall() at netbsd:syscall+0xac
>How-To-Repeat:
I can reproduce it reliably by running
atf-run |&tee /tmp/run | atf-report |&tee /tmp/report
in /usr/tests/lib/libpthread as root. Running just
atf-run | atf-report
doesn not reproduce it.
/tmp is not a separate filesystem; / is a regular ffs filesystem.
>Fix:
>Release-Note:
>Audit-Trail:
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/46168: /usr/tests/lib/libpthread/t_cond somtimes unkillable
Date: Sat, 17 Mar 2012 13:12:49 +0000
(sent to gnats-admin instead of gnats-bugs)
------
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/46168: /usr/tests/lib/libpthread/t_cond somtimes unkillable
Date: Wed, 14 Mar 2012 17:35:32 +0100
On Sat, Mar 10, 2012 at 04:50:00PM +0000, Manuel.Bouyer@lip6.fr wrote:
> /usr/tests/lib/libpthread/t_cond:cond_timedwait_race sometimes hang,
> with an unkillable t_cond process:
Some more details from another instance:
db{0}> ps
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
29181 21 8 3 1000000 fffffe815c841000 t_cond
29181 10 3 2 11000000 fffffe815d1c1000 t_cond lwpwait
0x01000000 is LW_PENDSIG, 0x10000000 is LW_UNPARKED.
LW_WEXIT (0x00100000) is NOT set.
db{0}> tr/a fffffe815c841000
trace: pid 29181 lid 21 at 0x0
db{0}> tr/a fffffe815d1c1000
trace: pid 29181 lid 10 at 0xfffffe810ccfca10
sleepq_block() at netbsd:sleepq_block+0xa4
cv_wait() at netbsd:cv_wait+0x101
lwp_wait1() at netbsd:lwp_wait1+0x274
exit_lwps() at netbsd:exit_lwps+0x108
exit1() at netbsd:exit1+0x5c
sys_exit() at netbsd:sys_exit+0x3e
syscall() at netbsd:syscall+0xac
db{0}> x/Lx 0xfffffe815c8411a8
fffffe815c8411a8: fffffe81810a8aa0
db{0}> x/Lx fffffe815d1c11a8
fffffe815d1c11a8: fffffe81810a8aa0
struct proc is at 0xfffffe81810a8aa0
db{0}> x/Lx 0xfffffe81810a8ba8
fffffe81810a8ba8: fffffe815c841000
This is p_lwps, pointing to 29181.21
db{0}> x/Lx 0xfffffe815c8411b0
fffffe815c8411b0: fffffe815d1c1000
this is l_sibling from 29181.21, pointing to 29181.10
db{0}> x/Lx 0xfffffe815d1c11b0
fffffe815d1c11b0: 0
the l_sibling list ends here
db{0}> x/x 0xfffffe815c8411c0
fffffe815c8411c0: 0 #l_waiter
db{0}> x/x 0xfffffe815c8411c4
fffffe815c8411c4: 0 #l_waitingfor
db{0}> x/x 0xfffffe815c8411cc
fffffe815c8411cc: 1 #l_refcnt
db{0}> x/x 0xfffffe815c8411d0
fffffe815c8411d0: 15 #l_lid
db{0}> x/x 0xfffffe815d1c11c0
fffffe815d1c11c0: 0 #l_waiter
db{0}> x/x 0xfffffe815d1c11c4
fffffe815d1c11c4: 0 #l_waitingfor
db{0}> x/x 0xfffffe815d1c11cc
fffffe815d1c11cc: 1 #l_refcnt
db{0}> x/x 0xfffffe815d1c11d0
fffffe815d1c11d0: a #l_lid
db{0}> x/x 0xfffffe81810a8bb8
fffffe81810a8bb8: 2 #p_nlwps
db{0}> x/x 0xfffffe81810a8bbc
fffffe81810a8bbc: 0 #p_nzlwps
db{0}> x/x 0xfffffe81810a8bc0
fffffe81810a8bc0: 1 #p_nrlwps
db{0}> x/x 0xfffffe81810a8bc4
fffffe81810a8bc4: 1 #p_nlwpwait
db{0}> x/x 0xfffffe81810a8bc8
fffffe81810a8bc8: 2 #p_ndlwps
I think that the missing LW_WEXIT for 29181.21 is the cause of the problem.
lwp_wait1() can release and reaquire p_lock, even when returning 0.
I guess a new lwp can be created while p_lock is released, and so the list
of lwps always needs to be checked again.
The attached patch does this: it moves setting LW_EXIT inside the
while (p->p_nlwps > 1) {} loop; so that if a new lwp is added, it can be
properly flagged before waiting for it.
I've done 2 runs of atf-run in /usr/tests, no tests are failing after
this change (at last none that was not failing before), and I couldn't
make t_cond hang again.
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
Index: kern/kern_exit.c
===================================================================
RCS file: /cvsroot/src/sys/kern/kern_exit.c,v
retrieving revision 1.237
diff -u -p -u -r1.237 kern_exit.c
--- kern/kern_exit.c 19 Feb 2012 21:06:50 -0000 1.237
+++ kern/kern_exit.c 14 Mar 2012 16:22:47 -0000
@@ -598,44 +598,42 @@ exit_lwps(struct lwp *l)
p = l->l_proc;
KASSERT(mutex_owned(p->p_lock));
-retry:
- /*
- * Interrupt LWPs in interruptable sleep, unsuspend suspended
- * LWPs and then wait for everyone else to finish.
- */
- LIST_FOREACH(l2, &p->p_lwps, l_sibling) {
- if (l2 == l)
- continue;
- lwp_lock(l2);
- l2->l_flag |= LW_WEXIT;
- if ((l2->l_stat == LSSLEEP && (l2->l_flag & LW_SINTR)) ||
- l2->l_stat == LSSUSPENDED || l2->l_stat == LSSTOP) {
- /* setrunnable() will release the lock. */
- setrunnable(l2);
- DPRINTF(("exit_lwps: Made %d.%d runnable\n",
- p->p_pid, l2->l_lid));
- continue;
- }
- lwp_unlock(l2);
- }
while (p->p_nlwps > 1) {
+ /*
+ * Interrupt LWPs in interruptable sleep, unsuspend suspended
+ * LWPs and then wait for everyone else to finish.
+ */
+ LIST_FOREACH(l2, &p->p_lwps, l_sibling) {
+ if (l2 == l)
+ continue;
+ lwp_lock(l2);
+ l2->l_flag |= LW_WEXIT;
+ if ((l2->l_stat == LSSLEEP && (l2->l_flag & LW_SINTR)) ||
+ l2->l_stat == LSSUSPENDED || l2->l_stat == LSSTOP) {
+ /* setrunnable() will release the lock. */
+ setrunnable(l2);
+ DPRINTF(("exit_lwps: Made %d.%d runnable\n",
+ p->p_pid, l2->l_lid));
+ continue;
+ }
+ lwp_unlock(l2);
+ }
DPRINTF(("exit_lwps: waiting for %d LWPs (%d zombies)\n",
p->p_nlwps, p->p_nzlwps));
error = lwp_wait1(l, 0, &waited, LWPWAIT_EXITCONTROL);
if (p->p_nlwps == 1)
break;
- if (error == EDEADLK) {
- /*
- * LWPs can get suspended/slept behind us.
- * (eg. sa_setwoken)
- * kick them again and retry.
- */
- goto retry;
- }
- if (error)
+ if (error == 0)
+ DPRINTF(("exit_lwps: Got LWP %d from lwp_wait1()\n",
+ waited));
+ else if (error != EDEADLK)
panic("exit_lwps: lwp_wait1 failed with error %d",
error);
- DPRINTF(("exit_lwps: Got LWP %d from lwp_wait1()\n", waited));
+ /*
+ * LWPs can get suspended/slept behind us.
+ * (eg. sa_setwoken)
+ * kick them again and retry.
+ */
}
KERNEL_LOCK(nlocks, l);
Responsible-Changed-From-To: kern-bug-people->rmind
Responsible-Changed-By: rmind@NetBSD.org
Responsible-Changed-When: Wed, 19 Sep 2012 21:25:08 +0000
Responsible-Changed-Why:
State-Changed-From-To: open->feedback
State-Changed-By: rmind@NetBSD.org
State-Changed-When: Wed, 19 Sep 2012 21:25:08 +0000
State-Changed-Why:
Can you try the patch I sent for PR/46402 ?
From: "Mindaugas Rasiukevicius" <rmind@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/46168 CVS commit: src/sys
Date: Thu, 27 Sep 2012 20:43:16 +0000
Module Name: src
Committed By: rmind
Date: Thu Sep 27 20:43:16 UTC 2012
Modified Files:
src/sys/kern: kern_exit.c kern_lwp.c sys_lwp.c
src/sys/sys: lwp.h
Log Message:
exit_lwps, lwp_wait: fix a race condition by re-trying if p_lock was dropped
in a case of process exit. Necessary to re-flag all LWPs for exit, as their
state might have changed or new LWPs spawned.
Should fix PR/46168 and PR/46402.
To generate a diff of this commit:
cvs rdiff -u -r1.241 -r1.242 src/sys/kern/kern_exit.c
cvs rdiff -u -r1.172 -r1.173 src/sys/kern/kern_lwp.c
cvs rdiff -u -r1.54 -r1.55 src/sys/kern/sys_lwp.c
cvs rdiff -u -r1.163 -r1.164 src/sys/sys/lwp.h
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: feedback->closed
State-Changed-By: rmind@NetBSD.org
State-Changed-When: Sun, 30 Sep 2012 22:49:23 +0000
State-Changed-Why:
Feedback timeout. Please reopen (or open a new PR), if the problem reappears.
State-Changed-From-To: closed->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sun, 30 Sep 2012 22:55:16 +0000
State-Changed-Why:
One week is nothing like long enough for feedback timeout.
Please give the fix a try.
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: rmind@NetBSD.org, kern-bug-people@NetBSD.org, netbsd-bugs@NetBSD.org,
gnats-admin@NetBSD.org
Subject: Re: kern/46168 (/usr/tests/lib/libpthread/t_cond somtimes unkillable)
Date: Mon, 1 Oct 2012 17:49:01 +0200
On Wed, Sep 19, 2012 at 09:25:11PM +0000, rmind@NetBSD.org wrote:
> State-Changed-From-To: open->feedback
> State-Changed-By: rmind@NetBSD.org
> State-Changed-When: Wed, 19 Sep 2012 21:25:08 +0000
> State-Changed-Why:
> Can you try the patch I sent for PR/46402 ?
The problem is that I can't reproduce this any more with a recent netbsd-6
kernel. So it's hard to say if the patch really fixes this issue or not ...
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/46168 (/usr/tests/lib/libpthread/t_cond somtimes unkillable)
Date: Mon, 1 Oct 2012 17:56:17 +0000
On Mon, Oct 01, 2012 at 03:50:03PM +0000, Manuel Bouyer wrote:
> > Can you try the patch I sent for PR/46402 ?
>
> The problem is that I can't reproduce this any more with a recent netbsd-6
> kernel. So it's hard to say if the patch really fixes this issue or not ...
Then perhaps the best way forward is to close the PR and if the
problem comes back later, worry about it then...
--
David A. Holland
dholland@netbsd.org
From: "Jeff Rizzo" <riz@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/46168 CVS commit: [netbsd-6] src/sys
Date: Mon, 1 Oct 2012 23:07:08 +0000
Module Name: src
Committed By: riz
Date: Mon Oct 1 23:07:08 UTC 2012
Modified Files:
src/sys/kern [netbsd-6]: kern_exit.c kern_lwp.c sys_lwp.c
src/sys/sys [netbsd-6]: lwp.h
Log Message:
Pull up following revision(s) (requested by rmind in ticket #583):
sys/kern/sys_lwp.c: revision 1.55
sys/sys/lwp.h: revision 1.164
sys/kern/kern_exit.c: revision 1.242
sys/kern/kern_lwp.c: revision 1.173
exit_lwps, lwp_wait: fix a race condition by re-trying if p_lock was dropped
in a case of process exit. Necessary to re-flag all LWPs for exit, as their
state might have changed or new LWPs spawned.
Should fix PR/46168 and PR/46402.
To generate a diff of this commit:
cvs rdiff -u -r1.236.2.1 -r1.236.2.2 src/sys/kern/kern_exit.c
cvs rdiff -u -r1.166 -r1.166.2.1 src/sys/kern/kern_lwp.c
cvs rdiff -u -r1.52.14.1 -r1.52.14.2 src/sys/kern/sys_lwp.c
cvs rdiff -u -r1.159.2.1 -r1.159.2.2 src/sys/sys/lwp.h
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: feedback->closed
State-Changed-By: bouyer@NetBSD.org
State-Changed-When: Sat, 20 Oct 2012 10:17:41 +0000
State-Changed-Why:
I did provide feedback, which basically is "I can't reproduce the problem
so I can't tell if the patch helps". Assume this is fixed.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.