NetBSD Problem Report #46402
From oster@mickey.localdomain Wed May 2 21:29:51 2012
Return-Path: <oster@mickey.localdomain>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
by www.NetBSD.org (Postfix) with ESMTP id B056E63B915
for <gnats-bugs@gnats.NetBSD.org>; Wed, 2 May 2012 21:29:51 +0000 (UTC)
Message-Id: <20120502213003.B6AB487CF4@mickey.localdomain>
Date: Wed, 2 May 2012 15:30:03 -0600 (CST)
From: oster@netbsd.org
Reply-To: oster@netbsd.org
To: gnats-bugs@gnats.NetBSD.org
Subject: LWPs created after exit_lwp() is called can hang the process....
X-Send-Pr-Version: 3.95
>Number: 46402
>Category: kern
>Synopsis: LWPs created after exit_lwp() is called can hang the process...
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: rmind
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed May 02 21:30:00 +0000 2012
>Closed-Date: Sun Sep 30 22:50:13 +0000 2012
>Last-Modified: Tue Nov 06 20:15:10 +0000 2012
>Originator: Greg Oster
>Release: NetBSD 6.0_BETA
>Organization:
>Environment:
System: NetBSD mickey 6.0_BETA NetBSD 6.0_BETA (QUAD) #0: Wed Apr 25 11:03:11 CST 2012 oster@quad:/u1/builds/build211/src/obj/amd64/u1/builds/build211/src/sys/arch/amd64/compile/QUAD amd64
Architecture: x86_64
Machine: amd64
>Description:
While running an extended series of aft-tests I noticed that the test were
occasionally hanging. It turns out they were hanging at the same place,
and that the test is repeatable. Running the following:
cd /usr/tests/lib/libpthread
foreach i (`jot 10000`)
./t_cond cond_timedwait_race
echo $i
end
on a NetBSD-6.0_BETA amd64 Xen DOMU (XEN3_DOMU kernel) I observed
that, on occasion, the t_cond process would hang. As viewed from ddb,
the lwps associated with the t_cond process look like:
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
402 11 8 2 0 ffffa00004ca2b40 t_cond
402 9 3 2 10000000 ffffa00004ca5740 t_cond lwpwait
After some extensive debugging, it seems that:
1) lid 11 above was created *after* exit_lwp() had been called for
the first time for pid 402.
2) lid 11 was created in LSSUSPENDED state.
3) in lwp_wait1() the second lwp (lid 9 above) goes to sleep,
expecting that the first lwp will eventually wake it up.
4) the first lwp is never woken up.
5) the two lwps remain as above until the system is rebooted.
6) the second lwp is jammed in the
if (exiting) {
KASSERT(p->p_nlwps > 1);
cv_wait(&p->p_lwpcv, p->p_lock);
continue;
}
portion of lwp_wait1().
7) The t_cond process cannot be killed.
The issue is also seen on a real i7-2600 box, running netbsd-6 as well.
The issue is very much a race condition of some sort, as sometimes
the above testing loop will not get past the first test case, and
other times it can get through up to a few hundred (or even thousand)
test cases without failing.
Additional details are available upon request.
>How-To-Repeat:
cd /usr/tests/lib/libpthread
foreach i (`jot 10000`)
./t_cond cond_timedwait_race
echo $i
end
*wait for t_cond to hang and be unkillable */
>Fix:
Unknown.
>Release-Note:
>Audit-Trail:
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/46402: LWPs created after exit_lwp() is called can hang the
process....
Date: Thu, 3 May 2012 16:57:09 +0200
On Wed, May 02, 2012 at 09:30:00PM +0000, oster@NetBSD.org wrote:
> While running an extended series of aft-tests I noticed that the test were
> occasionally hanging. It turns out they were hanging at the same place,
> and that the test is repeatable. Running the following:
>
> cd /usr/tests/lib/libpthread
> foreach i (`jot 10000`)
> ./t_cond cond_timedwait_race
> echo $i
> end
>
Sounds familiar: kern/46168
Andrew Doran had some improvements to the patch I proposed here, but it
seems he forgot to send me an updated patch (this was a month ago).
I guess we should get the patch in kern/46168 commited then ...
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@NetBSD.org
Cc: bouyer@antioche.eu.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, oster@netbsd.org
Subject: Re: kern/46402: LWPs created after exit_lwp() is called can hang
the process....
Date: Thu, 3 May 2012 10:22:21 -0600
On Thu, 3 May 2012 15:00:07 +0000 (UTC)
Manuel Bouyer <bouyer@antioche.eu.org> wrote:
> The following reply was made to PR kern/46402; it has been noted by
> GNATS.
>
> From: Manuel Bouyer <bouyer@antioche.eu.org>
> To: gnats-bugs@NetBSD.org
> Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
> netbsd-bugs@NetBSD.org Subject: Re: kern/46402: LWPs created after
> exit_lwp() is called can hang the process....
> Date: Thu, 3 May 2012 16:57:09 +0200
>
> On Wed, May 02, 2012 at 09:30:00PM +0000, oster@NetBSD.org wrote:
> > While running an extended series of aft-tests I noticed that the
> > test were occasionally hanging. It turns out they were hanging at
> > the same place, and that the test is repeatable. Running the
> > following:
> >
> > cd /usr/tests/lib/libpthread
> > foreach i (`jot 10000`)
> > ./t_cond cond_timedwait_race
> > echo $i
> > end
> >
>
> Sounds familiar: kern/46168
yes! That's exactly the same bug.... I ran with a patch very, very
similar to the one you had in 46168, but I'd end up with a very hard
kernel hang after some time... (couldn't break into ddb as normal,
even).
> Andrew Doran had some improvements to the patch I proposed here, but
> it seems he forgot to send me an updated patch (this was a month ago).
> I guess we should get the patch in kern/46168 commited then ...
I don't think it's sufficient, unfortunately... there's still something
else missing.. :(
Later...
Greg Oster
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Greg Oster <oster@cs.usask.ca>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
netbsd-bugs@NetBSD.org, oster@NetBSD.org
Subject: Re: kern/46402: LWPs created after exit_lwp() is called can hang the
process....
Date: Thu, 3 May 2012 18:31:51 +0200
On Thu, May 03, 2012 at 10:22:21AM -0600, Greg Oster wrote:
> On Thu, 3 May 2012 15:00:07 +0000 (UTC)
> Manuel Bouyer <bouyer@antioche.eu.org> wrote:
>
> > The following reply was made to PR kern/46402; it has been noted by
> > GNATS.
> >
> > From: Manuel Bouyer <bouyer@antioche.eu.org>
> > To: gnats-bugs@NetBSD.org
> > Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
> > netbsd-bugs@NetBSD.org Subject: Re: kern/46402: LWPs created after
> > exit_lwp() is called can hang the process....
> > Date: Thu, 3 May 2012 16:57:09 +0200
> >
> > On Wed, May 02, 2012 at 09:30:00PM +0000, oster@NetBSD.org wrote:
> > > While running an extended series of aft-tests I noticed that the
> > > test were occasionally hanging. It turns out they were hanging at
> > > the same place, and that the test is repeatable. Running the
> > > following:
> > >
> > > cd /usr/tests/lib/libpthread
> > > foreach i (`jot 10000`)
> > > ./t_cond cond_timedwait_race
> > > echo $i
> > > end
> > >
> >
> > Sounds familiar: kern/46168
>
> yes! That's exactly the same bug.... I ran with a patch very, very
> similar to the one you had in 46168, but I'd end up with a very hard
> kernel hang after some time... (couldn't break into ddb as normal,
> even).
>
> > Andrew Doran had some improvements to the patch I proposed here, but
> > it seems he forgot to send me an updated patch (this was a month ago).
> > I guess we should get the patch in kern/46168 commited then ...
>
> I don't think it's sufficient, unfortunately... there's still something
> else missing.. :(
It may be a different bug; see the thread about hard hang on port-amd64.
I've never seen these hard hangs myself, and I've tested different hardware ...
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
Responsible-Changed-From-To: kern-bug-people->rmind
Responsible-Changed-By: rmind@NetBSD.org
Responsible-Changed-When: Mon, 17 Sep 2012 23:12:16 +0000
Responsible-Changed-Why:
State-Changed-From-To: open->analyzed
State-Changed-By: rmind@NetBSD.org
State-Changed-When: Mon, 17 Sep 2012 23:12:16 +0000
State-Changed-Why:
Take.
State-Changed-From-To: analyzed->feedback
State-Changed-By: rmind@NetBSD.org
State-Changed-When: Wed, 19 Sep 2012 21:23:06 +0000
State-Changed-Why:
Can you please try the following patch?
http://www.netbsd.org/~rmind/lwp_wait_fix.diff
From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@NetBSD.org
Cc: rmind@NetBSD.org
Subject: Re: kern/46402 (LWPs created after exit_lwp() is called can hang
the process...)
Date: Wed, 19 Sep 2012 16:34:13 -0600
On Wed, 19 Sep 2012 21:23:07 +0000 (UTC)
rmind@NetBSD.org wrote:
> Synopsis: LWPs created after exit_lwp() is called can hang the
> process...
>
> State-Changed-From-To: analyzed->feedback
> State-Changed-By: rmind@NetBSD.org
> State-Changed-When: Wed, 19 Sep 2012 21:23:06 +0000
> State-Changed-Why:
> Can you please try the following patch?
>
> http://www.netbsd.org/~rmind/lwp_wait_fix.diff
>
I'm heading out of town tomorrow for about 5 days, but I'll try to get
to testing this ASAP...
Later...
Greg Oster
From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/46402 (LWPs created after exit_lwp() is called can hang
the process...)
Date: Thu, 20 Sep 2012 07:43:53 -0600
On Wed, 19 Sep 2012 21:23:07 +0000 (UTC)
rmind@NetBSD.org wrote:
> Synopsis: LWPs created after exit_lwp() is called can hang the
> process...
>
> State-Changed-From-To: analyzed->feedback
> State-Changed-By: rmind@NetBSD.org
> State-Changed-When: Wed, 19 Sep 2012 21:23:06 +0000
> State-Changed-Why:
> Can you please try the following patch?
>
> http://www.netbsd.org/~rmind/lwp_wait_fix.diff
>
After 376883 iterations on the test loop thing hung:
USER PID %CPU %MEM VSZ RSS TTY STAT STARTED TIME
COMMAND UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT
TTY TIME COMMAND
root 27779 0.0 0.1 798304 1396 ttyp0 DEl+ 2:19AM 0:00.01
(t_cond 0 27779 410 93985 63 0 798304 1396 - DEl+ ttyp0
0:00.01 (t_cond) root 27779 0.0 0.1 798304 1396 ttyp0 DEl+
2:19AM 0:00.01 (t_cond 0 27779 410 93985 95 0 798304 1396
lwpwait DEl+ ttyp0 0:00.01 (t_cond)
(sorry for the cut'n'paste mess... weekend plans have changed, and I
have to run.. I can attempt to test more over the next days...)
Later...
Greg Oster
From: "Mindaugas Rasiukevicius" <rmind@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/46402 CVS commit: src/sys
Date: Thu, 27 Sep 2012 20:43:16 +0000
Module Name: src
Committed By: rmind
Date: Thu Sep 27 20:43:16 UTC 2012
Modified Files:
src/sys/kern: kern_exit.c kern_lwp.c sys_lwp.c
src/sys/sys: lwp.h
Log Message:
exit_lwps, lwp_wait: fix a race condition by re-trying if p_lock was dropped
in a case of process exit. Necessary to re-flag all LWPs for exit, as their
state might have changed or new LWPs spawned.
Should fix PR/46168 and PR/46402.
To generate a diff of this commit:
cvs rdiff -u -r1.241 -r1.242 src/sys/kern/kern_exit.c
cvs rdiff -u -r1.172 -r1.173 src/sys/kern/kern_lwp.c
cvs rdiff -u -r1.54 -r1.55 src/sys/kern/sys_lwp.c
cvs rdiff -u -r1.163 -r1.164 src/sys/sys/lwp.h
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: feedback->closed
State-Changed-By: rmind@NetBSD.org
State-Changed-When: Sun, 30 Sep 2012 22:50:13 +0000
State-Changed-Why:
Assume fixed. Please reopen (or open a new PR), if the problem reappears.
From: "Jeff Rizzo" <riz@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/46402 CVS commit: [netbsd-6] src/sys
Date: Mon, 1 Oct 2012 23:07:08 +0000
Module Name: src
Committed By: riz
Date: Mon Oct 1 23:07:08 UTC 2012
Modified Files:
src/sys/kern [netbsd-6]: kern_exit.c kern_lwp.c sys_lwp.c
src/sys/sys [netbsd-6]: lwp.h
Log Message:
Pull up following revision(s) (requested by rmind in ticket #583):
sys/kern/sys_lwp.c: revision 1.55
sys/sys/lwp.h: revision 1.164
sys/kern/kern_exit.c: revision 1.242
sys/kern/kern_lwp.c: revision 1.173
exit_lwps, lwp_wait: fix a race condition by re-trying if p_lock was dropped
in a case of process exit. Necessary to re-flag all LWPs for exit, as their
state might have changed or new LWPs spawned.
Should fix PR/46168 and PR/46402.
To generate a diff of this commit:
cvs rdiff -u -r1.236.2.1 -r1.236.2.2 src/sys/kern/kern_exit.c
cvs rdiff -u -r1.166 -r1.166.2.1 src/sys/kern/kern_lwp.c
cvs rdiff -u -r1.52.14.1 -r1.52.14.2 src/sys/kern/sys_lwp.c
cvs rdiff -u -r1.159.2.1 -r1.159.2.2 src/sys/sys/lwp.h
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/46402 (LWPs created after exit_lwp() is called can hang the
process...)
Date: Tue, 6 Nov 2012 20:10:33 +0000
two mails not sent to gnats
------
From: Mindaugas Rasiukevicius <rmind@netbsd.org>
To: Greg Oster <oster@cs.usask.ca>
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, oster@netbsd.org
Subject: Re: kern/46402 (LWPs created after exit_lwp() is called can hang the
process...)
Date: Sat, 22 Sep 2012 14:21:40 +0100
Greg Oster <oster@cs.usask.ca> wrote:
> > Synopsis: LWPs created after exit_lwp() is called can hang the
> > process...
> >
> > State-Changed-From-To: analyzed->feedback
> > State-Changed-By: rmind@NetBSD.org
> > State-Changed-When: Wed, 19 Sep 2012 21:23:06 +0000
> > State-Changed-Why:
> > Can you please try the following patch?
> >
> > http://www.netbsd.org/~rmind/lwp_wait_fix.diff
> >
>
> After 376883 iterations on the test loop thing hung:
>
> USER PID %CPU %MEM VSZ RSS TTY STAT STARTED TIME
> COMMAND UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT
> TTY TIME COMMAND
> root 27779 0.0 0.1 798304 1396 ttyp0 DEl+ 2:19AM 0:00.01
> (t_cond 0 27779 410 93985 63 0 798304 1396 - DEl+ ttyp0
> 0:00.01 (t_cond) root 27779 0.0 0.1 798304 1396 ttyp0 DEl+
> 2:19AM 0:00.01 (t_cond 0 27779 410 93985 95 0 798304 1396
> lwpwait DEl+ ttyp0 0:00.01 (t_cond)
>
> (sorry for the cut'n'paste mess... weekend plans have changed, and I
> have to run.. I can attempt to test more over the next days...)
Thanks. No hurry. :) Perhaps you could get the output of ps from crash(8)
or DDB? Also, contents of struct proc of that process might be useful.
--
Mindaugas
From: Greg Oster <oster@cs.usask.ca>
To: Mindaugas Rasiukevicius <rmind@netbsd.org>
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/46402 (LWPs created after exit_lwp() is called can hang the
process...)
Date: Wed, 26 Sep 2012 16:18:50 -0600
On Sat, 22 Sep 2012 14:21:40 +0100
Mindaugas Rasiukevicius <rmind@netbsd.org> wrote:
> Greg Oster <oster@cs.usask.ca> wrote:
> > > Synopsis: LWPs created after exit_lwp() is called can hang the
> > > process...
> > >
> > > State-Changed-From-To: analyzed->feedback
> > > State-Changed-By: rmind@NetBSD.org
> > > State-Changed-When: Wed, 19 Sep 2012 21:23:06 +0000
> > > State-Changed-Why:
> > > Can you please try the following patch?
> > >
> > > http://www.netbsd.org/~rmind/lwp_wait_fix.diff
> > >
> >
> > After 376883 iterations on the test loop thing hung:
> >
> > USER PID %CPU %MEM VSZ RSS TTY STAT STARTED TIME
> > COMMAND UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT
> > TTY TIME COMMAND
> > root 27779 0.0 0.1 798304 1396 ttyp0 DEl+ 2:19AM 0:00.01
> > (t_cond 0 27779 410 93985 63 0 798304 1396 - DEl+
> > ttyp0 0:00.01 (t_cond) root 27779 0.0 0.1 798304 1396 ttyp0
> > DEl+ 2:19AM 0:00.01 (t_cond 0 27779 410 93985 95 0 798304
> > 1396 lwpwait DEl+ ttyp0 0:00.01 (t_cond)
> >
> > (sorry for the cut'n'paste mess... weekend plans have changed, and
> > I have to run.. I can attempt to test more over the next days...)
>
> Thanks. No hurry. :) Perhaps you could get the output of ps from
> crash(8) or DDB? Also, contents of struct proc of that process might
> be useful.
I can't explain why, but in another 3 million runs the above was the
*only* issue encountered. When I test without the patch, it hangs
instantly (in the last case, on the first test case).
It's entirely possible that the t_cond hang was caused by something
unrelated... and I've been unable to replicate the issue since
rebooting the machine after the last hang.
I'd say "Please commit this patch".
Later...
Greg Oster
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/46402 (LWPs created after exit_lwp() is called can hang the
process...)
Date: Tue, 6 Nov 2012 20:12:28 +0000
another one
------
From: Mindaugas Rasiukevicius <rmind@netbsd.org>
To: Greg Oster <oster@cs.usask.ca>
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/46402 (LWPs created after exit_lwp() is called can hang the
process...)
Date: Thu, 27 Sep 2012 21:48:02 +0100
Greg Oster <oster@cs.usask.ca> wrote:
> > > (sorry for the cut'n'paste mess... weekend plans have changed, and
> > > I have to run.. I can attempt to test more over the next days...)
> >
> > Thanks. No hurry. :) Perhaps you could get the output of ps from
> > crash(8) or DDB? Also, contents of struct proc of that process might
> > be useful.
>
> I can't explain why, but in another 3 million runs the above was the
> *only* issue encountered. When I test without the patch, it hangs
> instantly (in the last case, on the first test case).
>
> It's entirely possible that the t_cond hang was caused by something
> unrelated... and I've been unable to replicate the issue since
> rebooting the machine after the last hang.
>
> I'd say "Please commit this patch".
Thanks for the testing! I committed the patch, it fixes few quite clear
problems. It is possible that you have hit some other corner case. When
you will have time, could you try the latest kernel on some other machines
where you triggered the problem before?
--
Mindaugas
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.