NetBSD Problem Report #46402

From oster@mickey.localdomain  Wed May  2 21:29:51 2012
Return-Path: <oster@mickey.localdomain>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	by www.NetBSD.org (Postfix) with ESMTP id B056E63B915
	for <gnats-bugs@gnats.NetBSD.org>; Wed,  2 May 2012 21:29:51 +0000 (UTC)
Message-Id: <20120502213003.B6AB487CF4@mickey.localdomain>
Date: Wed,  2 May 2012 15:30:03 -0600 (CST)
From: oster@netbsd.org
Reply-To: oster@netbsd.org
To: gnats-bugs@gnats.NetBSD.org
Subject: LWPs created after exit_lwp() is called can hang the process....
X-Send-Pr-Version: 3.95

>Number:         46402
>Category:       kern
>Synopsis:       LWPs created after exit_lwp() is called can hang the process...
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    rmind
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed May 02 21:30:00 +0000 2012
>Closed-Date:    Sun Sep 30 22:50:13 +0000 2012
>Last-Modified:  Tue Nov 06 20:15:10 +0000 2012
>Originator:     Greg Oster
>Release:        NetBSD 6.0_BETA
>Organization:
>Environment:
System: NetBSD mickey 6.0_BETA NetBSD 6.0_BETA (QUAD) #0: Wed Apr 25 11:03:11 CST 2012 oster@quad:/u1/builds/build211/src/obj/amd64/u1/builds/build211/src/sys/arch/amd64/compile/QUAD amd64
Architecture: x86_64
Machine: amd64
>Description:


While running an extended series of aft-tests I noticed that the test were
occasionally hanging.  It turns out they were hanging at the same place,
and that the test is repeatable.  Running the following:

 cd /usr/tests/lib/libpthread
 foreach i (`jot 10000`)
 ./t_cond cond_timedwait_race 
 echo $i
 end

on a NetBSD-6.0_BETA amd64 Xen DOMU (XEN3_DOMU kernel) I observed
that, on occasion, the t_cond process would hang.  As viewed from ddb,
the lwps associated with the t_cond process look like:  

 PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
402     11 8   2         0   ffffa00004ca2b40             t_cond
402      9 3   2  10000000   ffffa00004ca5740             t_cond lwpwait

After some extensive debugging, it seems that:
 1) lid 11 above was created *after* exit_lwp() had been called for
 the first time for pid 402.
 2) lid 11 was created in LSSUSPENDED state.
 3) in lwp_wait1() the second lwp (lid 9 above) goes to sleep, 
 expecting that the first lwp will eventually wake it up.
 4) the first lwp is never woken up.
 5) the two lwps remain as above until the system is rebooted.
 6) the second lwp is jammed in the 

                 if (exiting) {
                        KASSERT(p->p_nlwps > 1);
			cv_wait(&p->p_lwpcv, p->p_lock);
			continue;
		 }

 portion of lwp_wait1().
 7) The t_cond process cannot be killed.

 The issue is also seen on a real i7-2600 box, running netbsd-6 as well.

 The issue is very much a race condition of some sort, as sometimes
 the above testing loop will not get past the first test case, and
 other times it can get through up to a few hundred (or even thousand)
 test cases without failing.

 Additional details are available upon request.

>How-To-Repeat:

 cd /usr/tests/lib/libpthread
 foreach i (`jot 10000`)
 ./t_cond cond_timedwait_race 
 echo $i
 end

 *wait for t_cond to hang and be unkillable */

>Fix:

 Unknown. 



>Release-Note:

>Audit-Trail:
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/46402: LWPs created after exit_lwp() is called can hang the
 process....
Date: Thu, 3 May 2012 16:57:09 +0200

 On Wed, May 02, 2012 at 09:30:00PM +0000, oster@NetBSD.org wrote:
 > While running an extended series of aft-tests I noticed that the test were
 > occasionally hanging.  It turns out they were hanging at the same place,
 > and that the test is repeatable.  Running the following:
 > 
 >  cd /usr/tests/lib/libpthread
 >  foreach i (`jot 10000`)
 >  ./t_cond cond_timedwait_race 
 >  echo $i
 >  end
 > 

 Sounds familiar: kern/46168

 Andrew Doran had some improvements to the patch I proposed here, but it
 seems he forgot to send me an updated patch (this was a month ago).
 I guess we should get the patch in kern/46168 commited then ...

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@NetBSD.org
Cc: bouyer@antioche.eu.org, kern-bug-people@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, oster@netbsd.org
Subject: Re: kern/46402: LWPs created after exit_lwp() is called can hang
 the process....
Date: Thu, 3 May 2012 10:22:21 -0600

 On Thu,  3 May 2012 15:00:07 +0000 (UTC)
 Manuel Bouyer <bouyer@antioche.eu.org> wrote:

 > The following reply was made to PR kern/46402; it has been noted by
 > GNATS.
 > 
 > From: Manuel Bouyer <bouyer@antioche.eu.org>
 > To: gnats-bugs@NetBSD.org
 > Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
 > netbsd-bugs@NetBSD.org Subject: Re: kern/46402: LWPs created after
 > exit_lwp() is called can hang the process....
 > Date: Thu, 3 May 2012 16:57:09 +0200
 > 
 >  On Wed, May 02, 2012 at 09:30:00PM +0000, oster@NetBSD.org wrote:
 >  > While running an extended series of aft-tests I noticed that the
 >  > test were occasionally hanging.  It turns out they were hanging at
 >  > the same place, and that the test is repeatable.  Running the
 >  > following:
 >  > 
 >  >  cd /usr/tests/lib/libpthread
 >  >  foreach i (`jot 10000`)
 >  >  ./t_cond cond_timedwait_race 
 >  >  echo $i
 >  >  end
 >  > 
 >  
 >  Sounds familiar: kern/46168

 yes!  That's exactly the same bug....  I ran with a patch very, very
 similar to the one you had in 46168, but I'd end up with a very hard
 kernel hang after some time... (couldn't break into ddb as normal,
 even).

 >  Andrew Doran had some improvements to the patch I proposed here, but
 > it seems he forgot to send me an updated patch (this was a month ago).
 >  I guess we should get the patch in kern/46168 commited then ...

 I don't think it's sufficient, unfortunately... there's still something
 else missing.. :(  

 Later...

 Greg Oster

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Greg Oster <oster@cs.usask.ca>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
        netbsd-bugs@NetBSD.org, oster@NetBSD.org
Subject: Re: kern/46402: LWPs created after exit_lwp() is called can hang the
 process....
Date: Thu, 3 May 2012 18:31:51 +0200

 On Thu, May 03, 2012 at 10:22:21AM -0600, Greg Oster wrote:
 > On Thu,  3 May 2012 15:00:07 +0000 (UTC)
 > Manuel Bouyer <bouyer@antioche.eu.org> wrote:
 > 
 > > The following reply was made to PR kern/46402; it has been noted by
 > > GNATS.
 > > 
 > > From: Manuel Bouyer <bouyer@antioche.eu.org>
 > > To: gnats-bugs@NetBSD.org
 > > Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
 > > netbsd-bugs@NetBSD.org Subject: Re: kern/46402: LWPs created after
 > > exit_lwp() is called can hang the process....
 > > Date: Thu, 3 May 2012 16:57:09 +0200
 > > 
 > >  On Wed, May 02, 2012 at 09:30:00PM +0000, oster@NetBSD.org wrote:
 > >  > While running an extended series of aft-tests I noticed that the
 > >  > test were occasionally hanging.  It turns out they were hanging at
 > >  > the same place, and that the test is repeatable.  Running the
 > >  > following:
 > >  > 
 > >  >  cd /usr/tests/lib/libpthread
 > >  >  foreach i (`jot 10000`)
 > >  >  ./t_cond cond_timedwait_race 
 > >  >  echo $i
 > >  >  end
 > >  > 
 > >  
 > >  Sounds familiar: kern/46168
 > 
 > yes!  That's exactly the same bug....  I ran with a patch very, very
 > similar to the one you had in 46168, but I'd end up with a very hard
 > kernel hang after some time... (couldn't break into ddb as normal,
 > even).
 > 
 > >  Andrew Doran had some improvements to the patch I proposed here, but
 > > it seems he forgot to send me an updated patch (this was a month ago).
 > >  I guess we should get the patch in kern/46168 commited then ...
 > 
 > I don't think it's sufficient, unfortunately... there's still something
 > else missing.. :(  

 It may be a different bug; see the thread about hard hang on port-amd64.
 I've never seen these hard hangs myself, and I've tested different hardware ...

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

Responsible-Changed-From-To: kern-bug-people->rmind
Responsible-Changed-By: rmind@NetBSD.org
Responsible-Changed-When: Mon, 17 Sep 2012 23:12:16 +0000
Responsible-Changed-Why:


State-Changed-From-To: open->analyzed
State-Changed-By: rmind@NetBSD.org
State-Changed-When: Mon, 17 Sep 2012 23:12:16 +0000
State-Changed-Why:
Take.


State-Changed-From-To: analyzed->feedback
State-Changed-By: rmind@NetBSD.org
State-Changed-When: Wed, 19 Sep 2012 21:23:06 +0000
State-Changed-Why:
Can you please try the following patch?

http://www.netbsd.org/~rmind/lwp_wait_fix.diff


From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@NetBSD.org
Cc: rmind@NetBSD.org
Subject: Re: kern/46402 (LWPs created after exit_lwp() is called can hang
 the process...)
Date: Wed, 19 Sep 2012 16:34:13 -0600

 On Wed, 19 Sep 2012 21:23:07 +0000 (UTC)
 rmind@NetBSD.org wrote:

 > Synopsis: LWPs created after exit_lwp() is called can hang the
 > process...
 > 
 > State-Changed-From-To: analyzed->feedback
 > State-Changed-By: rmind@NetBSD.org
 > State-Changed-When: Wed, 19 Sep 2012 21:23:06 +0000
 > State-Changed-Why:
 > Can you please try the following patch?
 > 
 > http://www.netbsd.org/~rmind/lwp_wait_fix.diff
 > 

 I'm heading out of town tomorrow for about 5 days, but I'll try to get
 to testing this ASAP... 

 Later...

 Greg Oster

From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/46402 (LWPs created after exit_lwp() is called can hang
 the process...)
Date: Thu, 20 Sep 2012 07:43:53 -0600

 On Wed, 19 Sep 2012 21:23:07 +0000 (UTC)
 rmind@NetBSD.org wrote:

 > Synopsis: LWPs created after exit_lwp() is called can hang the
 > process...
 > 
 > State-Changed-From-To: analyzed->feedback
 > State-Changed-By: rmind@NetBSD.org
 > State-Changed-When: Wed, 19 Sep 2012 21:23:06 +0000
 > State-Changed-Why:
 > Can you please try the following patch?
 > 
 > http://www.netbsd.org/~rmind/lwp_wait_fix.diff
 > 

 After 376883 iterations on the test loop thing hung:

 USER      PID %CPU %MEM    VSZ   RSS TTY     STAT STARTED    TIME
 COMMAND UID   PID  PPID   CPU PRI NI    VSZ   RSS WCHAN    STAT
 TTY        TIME COMMAND
 root    27779  0.0  0.1 798304  1396 ttyp0   DEl+  2:19AM 0:00.01
 (t_cond   0 27779   410 93985  63  0 798304  1396 -        DEl+ ttyp0
 0:00.01 (t_cond) root    27779  0.0  0.1 798304  1396 ttyp0   DEl+
 2:19AM 0:00.01 (t_cond   0 27779   410 93985  95  0 798304  1396
 lwpwait  DEl+ ttyp0 0:00.01 (t_cond)

 (sorry for the cut'n'paste mess... weekend plans have changed, and I
 have to run.. I can attempt to test more over the next days...)

 Later...

 Greg Oster

From: "Mindaugas Rasiukevicius" <rmind@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46402 CVS commit: src/sys
Date: Thu, 27 Sep 2012 20:43:16 +0000

 Module Name:	src
 Committed By:	rmind
 Date:		Thu Sep 27 20:43:16 UTC 2012

 Modified Files:
 	src/sys/kern: kern_exit.c kern_lwp.c sys_lwp.c
 	src/sys/sys: lwp.h

 Log Message:
 exit_lwps, lwp_wait: fix a race condition by re-trying if p_lock was dropped
 in a case of process exit.  Necessary to re-flag all LWPs for exit, as their
 state might have changed or new LWPs spawned.

 Should fix PR/46168 and PR/46402.


 To generate a diff of this commit:
 cvs rdiff -u -r1.241 -r1.242 src/sys/kern/kern_exit.c
 cvs rdiff -u -r1.172 -r1.173 src/sys/kern/kern_lwp.c
 cvs rdiff -u -r1.54 -r1.55 src/sys/kern/sys_lwp.c
 cvs rdiff -u -r1.163 -r1.164 src/sys/sys/lwp.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: feedback->closed
State-Changed-By: rmind@NetBSD.org
State-Changed-When: Sun, 30 Sep 2012 22:50:13 +0000
State-Changed-Why:
Assume fixed.  Please reopen (or open a new PR), if the problem reappears.


From: "Jeff Rizzo" <riz@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46402 CVS commit: [netbsd-6] src/sys
Date: Mon, 1 Oct 2012 23:07:08 +0000

 Module Name:	src
 Committed By:	riz
 Date:		Mon Oct  1 23:07:08 UTC 2012

 Modified Files:
 	src/sys/kern [netbsd-6]: kern_exit.c kern_lwp.c sys_lwp.c
 	src/sys/sys [netbsd-6]: lwp.h

 Log Message:
 Pull up following revision(s) (requested by rmind in ticket #583):
 	sys/kern/sys_lwp.c: revision 1.55
 	sys/sys/lwp.h: revision 1.164
 	sys/kern/kern_exit.c: revision 1.242
 	sys/kern/kern_lwp.c: revision 1.173
 exit_lwps, lwp_wait: fix a race condition by re-trying if p_lock was dropped
 in a case of process exit.  Necessary to re-flag all LWPs for exit, as their
 state might have changed or new LWPs spawned.
 Should fix PR/46168 and PR/46402.


 To generate a diff of this commit:
 cvs rdiff -u -r1.236.2.1 -r1.236.2.2 src/sys/kern/kern_exit.c
 cvs rdiff -u -r1.166 -r1.166.2.1 src/sys/kern/kern_lwp.c
 cvs rdiff -u -r1.52.14.1 -r1.52.14.2 src/sys/kern/sys_lwp.c
 cvs rdiff -u -r1.159.2.1 -r1.159.2.2 src/sys/sys/lwp.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/46402 (LWPs created after exit_lwp() is called can hang the
 process...)
Date: Tue, 6 Nov 2012 20:10:33 +0000

 two mails not sent to gnats

    ------

 From: Mindaugas Rasiukevicius <rmind@netbsd.org>
 To: Greg Oster <oster@cs.usask.ca>
 Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, oster@netbsd.org
 Subject: Re: kern/46402 (LWPs created after exit_lwp() is called can hang the
 	process...)
 Date: Sat, 22 Sep 2012 14:21:40 +0100

 Greg Oster <oster@cs.usask.ca> wrote:
 >  > Synopsis: LWPs created after exit_lwp() is called can hang the
 >  > process...
 >  > 
 >  > State-Changed-From-To: analyzed->feedback
 >  > State-Changed-By: rmind@NetBSD.org
 >  > State-Changed-When: Wed, 19 Sep 2012 21:23:06 +0000
 >  > State-Changed-Why:
 >  > Can you please try the following patch?
 >  > 
 >  > http://www.netbsd.org/~rmind/lwp_wait_fix.diff
 >  > 
 >  
 >  After 376883 iterations on the test loop thing hung:
 >  
 >  USER      PID %CPU %MEM    VSZ   RSS TTY     STAT STARTED    TIME
 >  COMMAND UID   PID  PPID   CPU PRI NI    VSZ   RSS WCHAN    STAT
 >  TTY        TIME COMMAND
 >  root    27779  0.0  0.1 798304  1396 ttyp0   DEl+  2:19AM 0:00.01
 >  (t_cond   0 27779   410 93985  63  0 798304  1396 -        DEl+ ttyp0
 >  0:00.01 (t_cond) root    27779  0.0  0.1 798304  1396 ttyp0   DEl+
 >  2:19AM 0:00.01 (t_cond   0 27779   410 93985  95  0 798304  1396
 >  lwpwait  DEl+ ttyp0 0:00.01 (t_cond)
 >  
 >  (sorry for the cut'n'paste mess... weekend plans have changed, and I
 >  have to run.. I can attempt to test more over the next days...)

 Thanks.  No hurry. :)  Perhaps you could get the output of ps from crash(8)
 or DDB?  Also, contents of struct proc of that process might be useful.

 -- 
 Mindaugas

 From: Greg Oster <oster@cs.usask.ca>
 To: Mindaugas Rasiukevicius <rmind@netbsd.org>
 Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
 Subject: Re: kern/46402 (LWPs created after exit_lwp() is called can hang the
 	process...)
 Date: Wed, 26 Sep 2012 16:18:50 -0600

 On Sat, 22 Sep 2012 14:21:40 +0100
 Mindaugas Rasiukevicius <rmind@netbsd.org> wrote:

 > Greg Oster <oster@cs.usask.ca> wrote:
 > >  > Synopsis: LWPs created after exit_lwp() is called can hang the
 > >  > process...
 > >  > 
 > >  > State-Changed-From-To: analyzed->feedback
 > >  > State-Changed-By: rmind@NetBSD.org
 > >  > State-Changed-When: Wed, 19 Sep 2012 21:23:06 +0000
 > >  > State-Changed-Why:
 > >  > Can you please try the following patch?
 > >  > 
 > >  > http://www.netbsd.org/~rmind/lwp_wait_fix.diff
 > >  > 
 > >  
 > >  After 376883 iterations on the test loop thing hung:
 > >  
 > >  USER      PID %CPU %MEM    VSZ   RSS TTY     STAT STARTED    TIME
 > >  COMMAND UID   PID  PPID   CPU PRI NI    VSZ   RSS WCHAN    STAT
 > >  TTY        TIME COMMAND
 > >  root    27779  0.0  0.1 798304  1396 ttyp0   DEl+  2:19AM 0:00.01
 > >  (t_cond   0 27779   410 93985  63  0 798304  1396 -        DEl+
 > > ttyp0 0:00.01 (t_cond) root    27779  0.0  0.1 798304  1396 ttyp0
 > > DEl+ 2:19AM 0:00.01 (t_cond   0 27779   410 93985  95  0 798304
 > > 1396 lwpwait  DEl+ ttyp0 0:00.01 (t_cond)
 > >  
 > >  (sorry for the cut'n'paste mess... weekend plans have changed, and
 > > I have to run.. I can attempt to test more over the next days...)
 > 
 > Thanks.  No hurry. :)  Perhaps you could get the output of ps from
 > crash(8) or DDB?  Also, contents of struct proc of that process might
 > be useful.

 I can't explain why, but in another 3 million runs the above was the
 *only* issue encountered.  When I test without the patch, it hangs
 instantly (in the last case, on the first test case).

 It's entirely possible that the t_cond hang was caused by something
 unrelated... and I've been unable to replicate the issue since
 rebooting the machine after the last hang.

 I'd say "Please commit this patch".

 Later...

 Greg Oster

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/46402 (LWPs created after exit_lwp() is called can hang the
 process...)
Date: Tue, 6 Nov 2012 20:12:28 +0000

 another one

    ------

 From: Mindaugas Rasiukevicius <rmind@netbsd.org>
 To: Greg Oster <oster@cs.usask.ca>
 Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
 Subject: Re: kern/46402 (LWPs created after exit_lwp() is called can hang the
 	process...)
 Date: Thu, 27 Sep 2012 21:48:02 +0100

 Greg Oster <oster@cs.usask.ca> wrote:
 > > >  (sorry for the cut'n'paste mess... weekend plans have changed, and
 > > > I have to run.. I can attempt to test more over the next days...)
 > > 
 > > Thanks.  No hurry. :)  Perhaps you could get the output of ps from
 > > crash(8) or DDB?  Also, contents of struct proc of that process might
 > > be useful.
 > 
 > I can't explain why, but in another 3 million runs the above was the
 > *only* issue encountered.  When I test without the patch, it hangs
 > instantly (in the last case, on the first test case).
 > 
 > It's entirely possible that the t_cond hang was caused by something
 > unrelated... and I've been unable to replicate the issue since
 > rebooting the machine after the last hang.
 > 
 > I'd say "Please commit this patch".

 Thanks for the testing!  I committed the patch, it fixes few quite clear
 problems.  It is possible that you have hit some other corner case.  When
 you will have time, could you try the latest kernel on some other machines
 where you triggered the problem before?

 -- 
 Mindaugas

>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.