NetBSD Problem Report #50300
From paul@pokey.whooppee.com Mon Oct 5 02:21:01 2015
Return-Path: <paul@pokey.whooppee.com>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
by mollari.NetBSD.org (Postfix) with ESMTPS id 1204FA5864
for <gnats-bugs@gnats.NetBSD.org>; Mon, 5 Oct 2015 02:21:01 +0000 (UTC)
Message-Id: <20151005021932.C92A616E72@pokey.whooppee.com>
Date: Mon, 5 Oct 2015 10:19:32 +0800 (PHT)
From: paul@whooppee.com
Reply-To: paul@whooppee.com
To: gnats-bugs@NetBSD.org
Subject: Issue with updating p_nstopchild count
X-Send-Pr-Version: 3.95
>Number: 50300
>Category: kern
>Synopsis: p_nstopchild count not being correctly updated
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: pgoyette
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Oct 05 02:25:00 +0000 2015
>Closed-Date: Sat Nov 07 22:18:07 +0000 2015
>Last-Modified: Sat Nov 07 22:18:07 +0000 2015
>Originator: paul@whooppee.com
>Release: NetBSD 7.99.21
>Organization:
+------------------+--------------------------+-------------------------+
| Paul Goyette | PGP Key fingerprint: | E-mail addresses: |
| (Retired) | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
+------------------+--------------------------+-------------------------+
>Environment:
System: NetBSD pokey.whooppee.com 7.99.21 NetBSD 7.99.21 (POKEY 2015-09-20 02:09:58) #0: Sun Sep 20 14:01:44 PHT 2015 paul@pokey.whooppee.com:/build/netbsd-local/obj/amd64/sys/arch/amd64/compile/POKEY amd64
Architecture: x86_64
Machine: amd64
>Description:
In sys/kern/kern_exit.c routine exit1() ...
At line 492 we mark the exiting process as SDEAD, and increment the parent's
p_nstopchild count. Then we make sure that the process is at the head of
the p_children list.
Now if the parent has decided not to wait for the child, we re-parent the
process to init.
At lines 964-965, proc_reparent() adjusts the p_nstopchild counts for the
old and new parents. But it only does it for processes that are SZOMB or
SSTOP. That doesn't help for this process, since it is still SDEAD.
>How-To-Repeat:
Unknown - found by code inspection
>Fix:
It seems to me that the check at line 962 should include SDEAD as well as
SZOMB
===================================================================
RCS file: /cvsroot/src/sys/kern/kern_exit.c,v
retrieving revision 1.245
diff -u -p -r1.245 kern_exit.c
--- kern_exit.c 2 Oct 2015 16:54:15 -0000 1.245
+++ kern_exit.c 5 Oct 2015 02:12:19 -0000
@@ -959,7 +959,7 @@ proc_reparent(struct proc *child, struct
if (child->p_pptr == parent)
return;
- if (child->p_stat == SZOMB ||
+ if (child->p_stat == SZOMB || child->p_stat == SDEAD ||
(child->p_stat == SSTOP && !child->p_waited)) {
child->p_pptr->p_nstopchild--;
parent->p_nstopchild++;
>Release-Note:
>Audit-Trail:
From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/50300: Issue with updating p_nstopchild count
Date: Tue, 06 Oct 2015 06:02:32 +0700
Date: Mon, 5 Oct 2015 02:25:00 +0000 (UTC)
From: paul@whooppee.com
Message-ID: <20151005022500.9D035A5B2E@mollari.NetBSD.org>
A better fix might be ...
| - if (child->p_stat == SZOMB ||
| + if (P_ZOMBIE(child) ||
| (child->p_stat == SSTOP && !child->p_waited)) {
| child->p_pptr->p_nstopchild--;
| parent->p_nstopchild++;
The case Paul found, where a DEAD process is reparented, is most likely
not the cause of the problem he's seeing.
But a process in SDYING state, which gets reparented, might very easily
be the problem - that's a much more likely scenario for his workload.
Such a process won't reparent itself (unlike the SDEAD case), but if its
parent dies, it will be reparented to init.
Because the code above ignores SDYING processes, just as it did SDEAD ones,
the effect will be that init will gain dying (and eventually, zombie)
children without having its p_nstopchild count incremented. Then, after it
has waited a few times, p_nstopchild will drop to 0, and any future
wait() calls will fail (that is, just hang, ignoring other processes that
should be located).
Given a sufficiently busy workload, like the one Paul uses to trigger the
problem, it is entirely possible that processes will sometimes exit while
having children that are temporarily stalled in SDYING state.
kre
From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/50300: Issue with updating p_nstopchild count
Date: Tue, 06 Oct 2015 06:06:40 +0700
No, ignore that suggestion, in SDYING state, the stopped child count
hasn't het been incremented, that doesn't happen until SDEAD, so
adjusting things in that case (however attractive it seemed as a potential
solution) cannot be right.
Use Paul's fix instead.
kre
Responsible-Changed-From-To: kern-bug-people->pgoyette
Responsible-Changed-By: pgoyette@NetBSD.org
Responsible-Changed-When: Sun, 11 Oct 2015 23:31:05 +0000
Responsible-Changed-Why:
I have the fix.
From: "Paul Goyette" <pgoyette@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/50300 CVS commit: src/sys/kern
Date: Tue, 13 Oct 2015 00:27:19 +0000
Module Name: src
Committed By: pgoyette
Date: Tue Oct 13 00:27:19 UTC 2015
Modified Files:
src/sys/kern: kern_exit.c
Log Message:
Currently, if a process is exiting and its parent has indicated no intent
of reaping the process (nor any other children), the process wil get
reparented to init. Since the state of the exiting process at this point
is SDEAD, proc_reparent() will not update either the old or new parent's
p_nstopchild counters.
This change causes both old and new parents to be properly updated.
Fixes PR kern/50300
Pullups will be requested for:
NetBSD-7, -6, -6-0, -6-1, -5, -5-0, -5-1, and -5-2
To generate a diff of this commit:
cvs rdiff -u -r1.245 -r1.246 src/sys/kern/kern_exit.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->pending-pullups
State-Changed-By: pgoyette@NetBSD.org
State-Changed-When: Tue, 13 Oct 2015 01:34:49 +0000
State-Changed-Why:
Committed to head, pending pullups
State-Changed-From-To: pending-pullups->closed
State-Changed-By: pgoyette@NetBSD.org
State-Changed-When: Sat, 07 Nov 2015 22:18:07 +0000
State-Changed-Why:
Fixed, pullups completed.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2014
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.