NetBSD Problem Report #46168

From Manuel.Bouyer@lip6.fr  Sat Mar 10 16:47:17 2012
Return-Path: <Manuel.Bouyer@lip6.fr>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	by www.NetBSD.org (Postfix) with ESMTP id 4367663BFFC
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 10 Mar 2012 16:47:17 +0000 (UTC)
Message-Id: <20120310164640.0E15C5B@xen1.soc.lip6.fr>
Date: Sat, 10 Mar 2012 17:46:40 +0100 (MET)
From: Manuel.Bouyer@lip6.fr
Reply-To: Manuel.Bouyer@lip6.fr
To: gnats-bugs@gnats.NetBSD.org
Subject: /usr/tests/lib/libpthread/t_cond somtimes unkillable
X-Send-Pr-Version: 3.95

>Number:         46168
>Category:       kern
>Synopsis:       /usr/tests/lib/libpthread/t_cond somtimes unkillable
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    rmind
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Mar 10 16:50:00 +0000 2012
>Closed-Date:    Sat Oct 20 10:17:41 +0000 2012
>Last-Modified:  Sat Oct 20 10:17:41 +0000 2012
>Originator:     Manuel Bouyer
>Release:        NetBSD 6.99.3
>Organization:
>Environment:
System: NetBSD xen1.soc.lip6.fr 6.99.3 NetBSD 6.99.3 (GENERIC_DEBUG) #6: Sat Mar 10 16:29:53 CET 2012 bouyer@hop:/dsk/l1/misc/bouyer/tmp/amd64/obj/dsk/l1/misc/bouyer/quota2/src/sys/arch/amd64/compile/GENERIC_DEBUG amd64
kernel with GENERIC+options DEBUG and the kmem_guard_depth setting as in
kmem(9).
Architecture: x86_64
Machine: amd64 (4-core with hyperthreading enabled Xeon system)
userland from recent netbsd-6
>Description:
	/usr/tests/lib/libpthread/t_cond:cond_timedwait_race sometimes hang,
	with an unkillable t_cond process:
xen1:/home/bouyer>ps axsw |grep t_cond
  0 4066 1673 2578  20    2  43  0 93804  1628 -        U-   ttyp1 0:00.17 (t_cond)
  0 4066 1673 2578   8    2  85  0 93804  1628 lwpwait  D-   ttyp1 0:00.17 (t_cond)
xen1:/home/bouyer>ps axsuw | grep t_cond
root    4066  0.0  0.0 93804  1628 ttyp1 DEl   5:27PM 0:00.17 (t_cond)
root    4066  0.0  0.0 93804  1628 ttyp1 DEl   5:27PM 0:00.17 (t_cond)

	from ddb:
db{0}> ps
PID    LID S CPU     FLAGS       STRUCT LWP *              NAME WAIT
611      1 3   3        80   fffffe81b54c4c00               tcsh ttyraw
4162     1 3   0        80   fffffe81af6c2800                 vi ttyraw
4363     1 3   0        80   fffffe81b653dc00                 sh wait
4066    20 8   1         0   fffffe81ae6c2000             t_cond
4066     8 3   0  10000000   fffffe81b7c4b800             t_cond lwpwait
db{0}> tr/a 0xfffffe81ae6c2000
trace: pid 4066 lid 20 at 0x0
db{0}> tr/a 0xfffffe81b7c4b800
trace: pid 4066 lid 8 at 0xfffffe810c5e8a10
sleepq_block() at netbsd:sleepq_block+0xa4
cv_wait() at netbsd:cv_wait+0x101
lwp_wait1() at netbsd:lwp_wait1+0x274
exit_lwps() at netbsd:exit_lwps+0x108
exit1() at netbsd:exit1+0x5c
sys_exit() at netbsd:sys_exit+0x3e
syscall() at netbsd:syscall+0xac

>How-To-Repeat:
	I can reproduce it reliably by running
	atf-run |&tee /tmp/run | atf-report |&tee /tmp/report
	in /usr/tests/lib/libpthread as root. Running just
	atf-run | atf-report
	doesn not reproduce it.
	/tmp is not a separate filesystem; / is a regular ffs filesystem.
>Fix:

>Release-Note:

>Audit-Trail:
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/46168: /usr/tests/lib/libpthread/t_cond somtimes unkillable
Date: Sat, 17 Mar 2012 13:12:49 +0000

 (sent to gnats-admin instead of gnats-bugs)

    ------

 From: Manuel Bouyer <bouyer@antioche.eu.org>
 To: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
 Subject: Re: kern/46168: /usr/tests/lib/libpthread/t_cond somtimes unkillable
 Date: Wed, 14 Mar 2012 17:35:32 +0100

 On Sat, Mar 10, 2012 at 04:50:00PM +0000, Manuel.Bouyer@lip6.fr wrote:
 > 	/usr/tests/lib/libpthread/t_cond:cond_timedwait_race sometimes hang,
 > 	with an unkillable t_cond process:


 Some more details from another instance:
 db{0}> ps
 PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
 29181   21 8   3   1000000   fffffe815c841000             t_cond
 29181   10 3   2  11000000   fffffe815d1c1000             t_cond lwpwait

 0x01000000 is LW_PENDSIG, 0x10000000 is LW_UNPARKED.
 LW_WEXIT (0x00100000) is NOT set.

 db{0}> tr/a fffffe815c841000
 trace: pid 29181 lid 21 at 0x0
 db{0}> tr/a fffffe815d1c1000
 trace: pid 29181 lid 10 at 0xfffffe810ccfca10
 sleepq_block() at netbsd:sleepq_block+0xa4
 cv_wait() at netbsd:cv_wait+0x101
 lwp_wait1() at netbsd:lwp_wait1+0x274
 exit_lwps() at netbsd:exit_lwps+0x108
 exit1() at netbsd:exit1+0x5c
 sys_exit() at netbsd:sys_exit+0x3e
 syscall() at netbsd:syscall+0xac

 db{0}> x/Lx 0xfffffe815c8411a8
 fffffe815c8411a8:       fffffe81810a8aa0
 db{0}> x/Lx fffffe815d1c11a8
 fffffe815d1c11a8:       fffffe81810a8aa0

 struct proc is at 0xfffffe81810a8aa0
 db{0}> x/Lx 0xfffffe81810a8ba8
 fffffe81810a8ba8:       fffffe815c841000

 This is p_lwps, pointing to 29181.21
 db{0}> x/Lx 0xfffffe815c8411b0
 fffffe815c8411b0:       fffffe815d1c1000
 this is l_sibling from 29181.21, pointing to 29181.10
 db{0}> x/Lx 0xfffffe815d1c11b0
 fffffe815d1c11b0:       0
 the l_sibling list ends here

 db{0}> x/x 0xfffffe815c8411c0  
 fffffe815c8411c0:       0 #l_waiter
 db{0}> x/x 0xfffffe815c8411c4
 fffffe815c8411c4:       0 #l_waitingfor
 db{0}> x/x 0xfffffe815c8411cc
 fffffe815c8411cc:       1 #l_refcnt
 db{0}> x/x 0xfffffe815c8411d0
 fffffe815c8411d0:       15 #l_lid

 db{0}> x/x 0xfffffe815d1c11c0  
 fffffe815d1c11c0:       0 #l_waiter
 db{0}> x/x 0xfffffe815d1c11c4
 fffffe815d1c11c4:       0 #l_waitingfor
 db{0}> x/x 0xfffffe815d1c11cc
 fffffe815d1c11cc:       1 #l_refcnt
 db{0}> x/x 0xfffffe815d1c11d0
 fffffe815d1c11d0:       a #l_lid

 db{0}> x/x 0xfffffe81810a8bb8
 fffffe81810a8bb8:       2 #p_nlwps
 db{0}> x/x 0xfffffe81810a8bbc
 fffffe81810a8bbc:       0 #p_nzlwps
 db{0}> x/x 0xfffffe81810a8bc0
 fffffe81810a8bc0:       1 #p_nrlwps
 db{0}> x/x 0xfffffe81810a8bc4
 fffffe81810a8bc4:       1 #p_nlwpwait
 db{0}> x/x 0xfffffe81810a8bc8
 fffffe81810a8bc8:       2 #p_ndlwps

 I think that the missing LW_WEXIT for 29181.21 is the cause of the problem.
 lwp_wait1() can release and reaquire p_lock, even when returning 0.
 I guess a new lwp can be created while p_lock is released, and so the list
 of lwps always needs to be checked again.
 The attached patch does this: it moves setting LW_EXIT inside the
 while (p->p_nlwps > 1) {} loop; so that if a new lwp is added, it can be 
 properly flagged before waiting for it.

 I've done 2 runs of atf-run in /usr/tests, no tests are failing after
 this change (at last none that was not failing before), and I couldn't
 make t_cond hang again.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

 Index: kern/kern_exit.c
 ===================================================================
 RCS file: /cvsroot/src/sys/kern/kern_exit.c,v
 retrieving revision 1.237
 diff -u -p -u -r1.237 kern_exit.c
 --- kern/kern_exit.c	19 Feb 2012 21:06:50 -0000	1.237
 +++ kern/kern_exit.c	14 Mar 2012 16:22:47 -0000
 @@ -598,44 +598,42 @@ exit_lwps(struct lwp *l)

  	p = l->l_proc;
  	KASSERT(mutex_owned(p->p_lock));
 -retry:
 -	/*
 -	 * Interrupt LWPs in interruptable sleep, unsuspend suspended
 -	 * LWPs and then wait for everyone else to finish.
 -	 */
 -	LIST_FOREACH(l2, &p->p_lwps, l_sibling) {
 -		if (l2 == l)
 -			continue;
 -		lwp_lock(l2);
 -		l2->l_flag |= LW_WEXIT;
 -		if ((l2->l_stat == LSSLEEP && (l2->l_flag & LW_SINTR)) ||
 -		    l2->l_stat == LSSUSPENDED || l2->l_stat == LSSTOP) {
 -		    	/* setrunnable() will release the lock. */
 -			setrunnable(l2);
 -			DPRINTF(("exit_lwps: Made %d.%d runnable\n",
 -			    p->p_pid, l2->l_lid));
 -			continue;
 -		}
 -		lwp_unlock(l2);
 -	}
  	while (p->p_nlwps > 1) {
 +		/*
 +		 * Interrupt LWPs in interruptable sleep, unsuspend suspended
 +		 * LWPs and then wait for everyone else to finish.
 +		 */
 +		LIST_FOREACH(l2, &p->p_lwps, l_sibling) {
 +			if (l2 == l)
 +				continue;
 +			lwp_lock(l2);
 +			l2->l_flag |= LW_WEXIT;
 +			if ((l2->l_stat == LSSLEEP && (l2->l_flag & LW_SINTR)) ||
 +			    l2->l_stat == LSSUSPENDED || l2->l_stat == LSSTOP) {
 +				/* setrunnable() will release the lock. */
 +				setrunnable(l2);
 +				DPRINTF(("exit_lwps: Made %d.%d runnable\n",
 +				    p->p_pid, l2->l_lid));
 +				continue;
 +			}
 +			lwp_unlock(l2);
 +		}
  		DPRINTF(("exit_lwps: waiting for %d LWPs (%d zombies)\n",
  		    p->p_nlwps, p->p_nzlwps));
  		error = lwp_wait1(l, 0, &waited, LWPWAIT_EXITCONTROL);
  		if (p->p_nlwps == 1)
  			break;
 -		if (error == EDEADLK) {
 -			/*
 -			 * LWPs can get suspended/slept behind us.
 -			 * (eg. sa_setwoken)
 -			 * kick them again and retry.
 -			 */
 -			goto retry;
 -		}
 -		if (error)
 +		if (error == 0)
 +			DPRINTF(("exit_lwps: Got LWP %d from lwp_wait1()\n",
 +			    waited));
 +		else if (error != EDEADLK)
  			panic("exit_lwps: lwp_wait1 failed with error %d",
  			    error);
 -		DPRINTF(("exit_lwps: Got LWP %d from lwp_wait1()\n", waited));
 +		/*
 +		 * LWPs can get suspended/slept behind us.
 +		 * (eg. sa_setwoken)
 +		 * kick them again and retry.
 +		 */
  	}

  	KERNEL_LOCK(nlocks, l);

Responsible-Changed-From-To: kern-bug-people->rmind
Responsible-Changed-By: rmind@NetBSD.org
Responsible-Changed-When: Wed, 19 Sep 2012 21:25:08 +0000
Responsible-Changed-Why:


State-Changed-From-To: open->feedback
State-Changed-By: rmind@NetBSD.org
State-Changed-When: Wed, 19 Sep 2012 21:25:08 +0000
State-Changed-Why:
Can you try the patch I sent for PR/46402 ?


From: "Mindaugas Rasiukevicius" <rmind@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46168 CVS commit: src/sys
Date: Thu, 27 Sep 2012 20:43:16 +0000

 Module Name:	src
 Committed By:	rmind
 Date:		Thu Sep 27 20:43:16 UTC 2012

 Modified Files:
 	src/sys/kern: kern_exit.c kern_lwp.c sys_lwp.c
 	src/sys/sys: lwp.h

 Log Message:
 exit_lwps, lwp_wait: fix a race condition by re-trying if p_lock was dropped
 in a case of process exit.  Necessary to re-flag all LWPs for exit, as their
 state might have changed or new LWPs spawned.

 Should fix PR/46168 and PR/46402.


 To generate a diff of this commit:
 cvs rdiff -u -r1.241 -r1.242 src/sys/kern/kern_exit.c
 cvs rdiff -u -r1.172 -r1.173 src/sys/kern/kern_lwp.c
 cvs rdiff -u -r1.54 -r1.55 src/sys/kern/sys_lwp.c
 cvs rdiff -u -r1.163 -r1.164 src/sys/sys/lwp.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: feedback->closed
State-Changed-By: rmind@NetBSD.org
State-Changed-When: Sun, 30 Sep 2012 22:49:23 +0000
State-Changed-Why:
Feedback timeout.  Please reopen (or open a new PR), if the problem reappears.


State-Changed-From-To: closed->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sun, 30 Sep 2012 22:55:16 +0000
State-Changed-Why:
One week is nothing like long enough for feedback timeout.

Please give the fix a try.


From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: rmind@NetBSD.org, kern-bug-people@NetBSD.org, netbsd-bugs@NetBSD.org,
        gnats-admin@NetBSD.org
Subject: Re: kern/46168 (/usr/tests/lib/libpthread/t_cond somtimes unkillable)
Date: Mon, 1 Oct 2012 17:49:01 +0200

 On Wed, Sep 19, 2012 at 09:25:11PM +0000, rmind@NetBSD.org wrote:
 > State-Changed-From-To: open->feedback
 > State-Changed-By: rmind@NetBSD.org
 > State-Changed-When: Wed, 19 Sep 2012 21:25:08 +0000
 > State-Changed-Why:
 > Can you try the patch I sent for PR/46402 ?

 The problem is that I can't reproduce this any more with a recent netbsd-6
 kernel. So it's hard to say if the patch really fixes this issue or not ...

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/46168 (/usr/tests/lib/libpthread/t_cond somtimes unkillable)
Date: Mon, 1 Oct 2012 17:56:17 +0000

 On Mon, Oct 01, 2012 at 03:50:03PM +0000, Manuel Bouyer wrote:
  >  > Can you try the patch I sent for PR/46402 ?
  >  
  >  The problem is that I can't reproduce this any more with a recent netbsd-6
  >  kernel. So it's hard to say if the patch really fixes this issue or not ...

 Then perhaps the best way forward is to close the PR and if the
 problem comes back later, worry about it then...

 -- 
 David A. Holland
 dholland@netbsd.org

From: "Jeff Rizzo" <riz@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46168 CVS commit: [netbsd-6] src/sys
Date: Mon, 1 Oct 2012 23:07:08 +0000

 Module Name:	src
 Committed By:	riz
 Date:		Mon Oct  1 23:07:08 UTC 2012

 Modified Files:
 	src/sys/kern [netbsd-6]: kern_exit.c kern_lwp.c sys_lwp.c
 	src/sys/sys [netbsd-6]: lwp.h

 Log Message:
 Pull up following revision(s) (requested by rmind in ticket #583):
 	sys/kern/sys_lwp.c: revision 1.55
 	sys/sys/lwp.h: revision 1.164
 	sys/kern/kern_exit.c: revision 1.242
 	sys/kern/kern_lwp.c: revision 1.173
 exit_lwps, lwp_wait: fix a race condition by re-trying if p_lock was dropped
 in a case of process exit.  Necessary to re-flag all LWPs for exit, as their
 state might have changed or new LWPs spawned.
 Should fix PR/46168 and PR/46402.


 To generate a diff of this commit:
 cvs rdiff -u -r1.236.2.1 -r1.236.2.2 src/sys/kern/kern_exit.c
 cvs rdiff -u -r1.166 -r1.166.2.1 src/sys/kern/kern_lwp.c
 cvs rdiff -u -r1.52.14.1 -r1.52.14.2 src/sys/kern/sys_lwp.c
 cvs rdiff -u -r1.159.2.1 -r1.159.2.2 src/sys/sys/lwp.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: feedback->closed
State-Changed-By: bouyer@NetBSD.org
State-Changed-When: Sat, 20 Oct 2012 10:17:41 +0000
State-Changed-Why:
I did provide feedback, which basically is "I can't reproduce the problem
so I can't tell if the patch helps". Assume this is fixed.


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.