NetBSD Problem Report #50300

From paul@pokey.whooppee.com  Mon Oct  5 02:21:01 2015
Return-Path: <paul@pokey.whooppee.com>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 1204FA5864
	for <gnats-bugs@gnats.NetBSD.org>; Mon,  5 Oct 2015 02:21:01 +0000 (UTC)
Message-Id: <20151005021932.C92A616E72@pokey.whooppee.com>
Date: Mon,  5 Oct 2015 10:19:32 +0800 (PHT)
From: paul@whooppee.com
Reply-To: paul@whooppee.com
To: gnats-bugs@NetBSD.org
Subject: Issue with updating p_nstopchild count
X-Send-Pr-Version: 3.95

>Number:         50300
>Category:       kern
>Synopsis:       p_nstopchild count not being correctly updated
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    pgoyette
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Oct 05 02:25:00 +0000 2015
>Closed-Date:    Sat Nov 07 22:18:07 +0000 2015
>Last-Modified:  Sat Nov 07 22:18:07 +0000 2015
>Originator:     paul@whooppee.com
>Release:        NetBSD 7.99.21
>Organization:
+------------------+--------------------------+-------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:       |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com    |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org  |
+------------------+--------------------------+-------------------------+
>Environment:


System: NetBSD pokey.whooppee.com 7.99.21 NetBSD 7.99.21 (POKEY 2015-09-20 02:09:58) #0: Sun Sep 20 14:01:44 PHT 2015 paul@pokey.whooppee.com:/build/netbsd-local/obj/amd64/sys/arch/amd64/compile/POKEY amd64
Architecture: x86_64
Machine: amd64
>Description:
In sys/kern/kern_exit.c routine exit1() ...

At line 492 we mark the exiting process as SDEAD, and increment the parent's
p_nstopchild count.  Then we make sure that the process is at the head of
the p_children list.

Now if the parent has decided not to wait for the child, we re-parent the
process to init.

At lines 964-965, proc_reparent() adjusts the p_nstopchild counts for the
old and new parents.  But it only does it for processes that are SZOMB or
SSTOP.  That doesn't help for this process, since it is still SDEAD.

>How-To-Repeat:
Unknown - found by code inspection
>Fix:
It seems to me that the check at line 962 should include SDEAD as well as
SZOMB
===================================================================
RCS file: /cvsroot/src/sys/kern/kern_exit.c,v
retrieving revision 1.245
diff -u -p -r1.245 kern_exit.c
--- kern_exit.c 2 Oct 2015 16:54:15 -0000       1.245
+++ kern_exit.c 5 Oct 2015 02:12:19 -0000
@@ -959,7 +959,7 @@ proc_reparent(struct proc *child, struct
 	if (child->p_pptr == parent)
 		return;

-	if (child->p_stat == SZOMB ||
+	if (child->p_stat == SZOMB || child->p_stat == SDEAD ||
 	    (child->p_stat == SSTOP && !child->p_waited)) {
 		child->p_pptr->p_nstopchild--;
 		parent->p_nstopchild++;

>Release-Note:

>Audit-Trail:
From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/50300: Issue with updating p_nstopchild count
Date: Tue, 06 Oct 2015 06:02:32 +0700

     Date:        Mon,  5 Oct 2015 02:25:00 +0000 (UTC)
     From:        paul@whooppee.com
     Message-ID:  <20151005022500.9D035A5B2E@mollari.NetBSD.org>

 A better fix might be ...

   | -	if (child->p_stat == SZOMB ||
   | +	if (P_ZOMBIE(child) ||
   |  	    (child->p_stat == SSTOP && !child->p_waited)) {
   |  		child->p_pptr->p_nstopchild--;
   |  		parent->p_nstopchild++;

 The case Paul found, where a DEAD process is reparented, is most likely
 not the cause of the problem he's seeing.

 But a process in SDYING state, which gets reparented, might very easily
 be the problem - that's a much more likely scenario for his workload.

 Such a process won't reparent itself (unlike the SDEAD case), but if its
 parent dies, it will be reparented to init.

 Because the code above ignores SDYING processes, just as it did SDEAD ones,
 the effect will be that init will gain dying (and eventually, zombie)
 children without having its p_nstopchild count incremented.  Then, after it
 has waited a few times, p_nstopchild will drop to 0, and any future
 wait() calls will fail (that is, just hang, ignoring other processes that
 should be located).

 Given a sufficiently busy workload, like the one Paul uses to trigger the
 problem, it is entirely possible that processes will sometimes exit while
 having children that are temporarily stalled in SDYING state.

 kre

From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/50300: Issue with updating p_nstopchild count
Date: Tue, 06 Oct 2015 06:06:40 +0700

 No, ignore that suggestion, in SDYING state, the stopped child count
 hasn't het been incremented, that doesn't happen until SDEAD, so
 adjusting things in that case (however attractive it seemed as a potential
 solution) cannot be right.

 Use Paul's fix instead.

 kre

Responsible-Changed-From-To: kern-bug-people->pgoyette
Responsible-Changed-By: pgoyette@NetBSD.org
Responsible-Changed-When: Sun, 11 Oct 2015 23:31:05 +0000
Responsible-Changed-Why:
I have the fix.


From: "Paul Goyette" <pgoyette@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/50300 CVS commit: src/sys/kern
Date: Tue, 13 Oct 2015 00:27:19 +0000

 Module Name:	src
 Committed By:	pgoyette
 Date:		Tue Oct 13 00:27:19 UTC 2015

 Modified Files:
 	src/sys/kern: kern_exit.c

 Log Message:
 Currently, if a process is exiting and its parent has indicated no intent
 of reaping the process (nor any other children), the process wil get
 reparented to init.  Since the state of the exiting process at this point
 is SDEAD, proc_reparent() will not update either the old or new parent's
 p_nstopchild counters.

 This change causes both old and new parents to be properly updated.

 Fixes PR kern/50300

 Pullups will be requested for:

        NetBSD-7, -6, -6-0, -6-1, -5, -5-0, -5-1, and -5-2


 To generate a diff of this commit:
 cvs rdiff -u -r1.245 -r1.246 src/sys/kern/kern_exit.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->pending-pullups
State-Changed-By: pgoyette@NetBSD.org
State-Changed-When: Tue, 13 Oct 2015 01:34:49 +0000
State-Changed-Why:
Committed to head, pending pullups


State-Changed-From-To: pending-pullups->closed
State-Changed-By: pgoyette@NetBSD.org
State-Changed-When: Sat, 07 Nov 2015 22:18:07 +0000
State-Changed-Why:
Fixed, pullups completed.


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2014 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.