NetBSD Problem Report #52640

From kre@munnari.OZ.AU  Mon Oct 23 05:39:57 2017
Return-Path: <kre@munnari.OZ.AU>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id AED3F7A1AF
	for <gnats-bugs@www.NetBSD.org>; Mon, 23 Oct 2017 05:39:57 +0000 (UTC)
Message-Id: <201710230539.v9N5dVCh029045@andromeda.noi.kre.to>
Date: Mon, 23 Oct 2017 12:39:31 +0700 (ICT)
From: kre@munnari.OZ.AU
To: gnats-bugs@www.NetBSD.org
Subject: /bin/sh can "lose" background children when waiting on foreground ones
X-Send-Pr-Version: 3.95

>Number:         52640
>Category:       bin
>Synopsis:       /bin/sh can "lose" background children when waiting on foreground ones
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kre
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Oct 23 05:40:00 +0000 2017
>Closed-Date:    Fri Nov 17 18:33:00 +0000 2017
>Last-Modified:  Fri Nov 17 18:33:00 +0000 2017
>Originator:     Robert Elz
>Release:        NetBSD 8.99.1  (lots and lots of releases...)
>Organization:
>Environment:
System: NetBSD andromeda.noi.kre.to 8.99.1 NetBSD 8.99.1 (VBOX64-1.3-20170812) #39: Sat Aug 12 15:25:04 ICT 2017 kre@magnolia.noi.kre.to:/usr/obj/current/kernels/amd64/VBOX64 amd64
Architecture: x86_64
Machine: amd64
>Description:
	The following script

#! /bin/sh

(sleep 3; exit 3) & PID=$!
sleep 10

(wait $PID; echo "In child:  status" $?)

wait $PID; echo "In parent: status" $?

	should print:

In child:  status 127
In parent: status 3

	as all other shells I could find to test do (except bosh,
	which is just broken, and appears to return status 0 from
	the wait command in all cases, and zsh, which is just weird,
	in this and so many other ways)

	instead, on all currently available NetBSD sh's we see

In child:  status 127
In parent: status 127

	That's because the background job completes while sh is waiting
	for the later foreground job, and when that happens (at least
	in many cases) the background job is simply discarded (if it
	exited with a signal, that will be immediately reported, but
	it will still be discarded.)

	Fix that problem and we instead get

In child:  status 3
In parent: status 3

	!!!   The sub-shell has no children, it should not be
	      able to get status from one of its siblings.

	This only happens when the child has already exited before
	the sub-shell is forked, and only when the status of that
	child has not already been discarded (including incorrectly
	discarded as above.)

	This is because when a sub-shell is forked, the job table
	(which holds the results of completed tasks, and the status
	of active ones) is just marked invalid, not actually cleared
	(until a new job needs to be created), but the shell's "wait"
	command only bothers to look at the "invalid" flag in the
	case of a simple "wait" (ie: not "wait pid") which is actually
	backwards - the "wait" case does not really need it, though it
	avoids wasting (cpu) time, whereas the "wait pid" case does.

>How-To-Repeat:

	Write any script that runs a short background job, then a
	longer foreground one (which is probably why this hasn't
	been noticed - most commonly the timings are inverted),
	and observe what happens when the script eventually waits
	for the (already completed) background job.

>Fix:
	Coming soon....   Will request pullup to -8, the shells on
	the older systems are so out of date that they can just
	continue to suffer with this (and many other) problems that
	are usually never noticed in the wild.

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: bin-bug-people->kre
Responsible-Changed-By: kre@NetBSD.org
Responsible-Changed-When: Mon, 23 Oct 2017 06:03:33 +0000
Responsible-Changed-Why:
I am looking into this PR


From: "Robert Elz" <kre@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/52640 CVS commit: src/bin/sh
Date: Mon, 23 Oct 2017 10:52:07 +0000

 Module Name:	src
 Committed By:	kre
 Date:		Mon Oct 23 10:52:07 UTC 2017

 Modified Files:
 	src/bin/sh: jobs.c

 Log Message:
 PR bin/52640  PR bin/52641

 Don't delete jobs from the jobs table merely because they finished,
 if they are not the job we are waiting upon.   (bin/52640 part 1)

 In a sub-shell environment, don't allow wait to find jobs from the
 parent shell that had already exited (before the sub-shell was
 created) and return status for them as if they are our children.
 (bin/52640 part 2)

 Don't have the "jobs" command also be an implicit "wait" command
 in non-interactive shells.  (bin/52641)

 Use WCONTINUED (when it exists) so we can report on stopped jobs that
 "mysteriously" move back to running state without the user issuing
 a "bg" command (eg: kill -CONT <pid>)   Previously they would keep
 being reported as stopped until they exited.

 When a job is detected as having changed status just as we're
 issuing a "jobs" command (i.e.: the change occurred between the last
 prompt and the jobs command being entered) don't report it twice,
 once from the status change, and then again in the jobs command
 output.   Once is enough (keep the jobs output, suppress the other).

 Apply some sanity to the way jobs_invalid is processed - ignore it
 in getjob() instead of just ignoring it most of the time there, and
 instead always check it before calling getjob() in situations where
 we can handle only children of the current shell.  This allows the
 (totally broken) save/clear/restore of jobs_invalid in jobscmd() to
 be done away with (previously an error while in the clear state would
 have left jobs_invalid incorrectly cleared - shouldn't have mattered
 since jobs_invalid => subshell => error causes exit, but better to be safe).

 Add/improve the DEBUG more tracing.

 XXX pullup -8


 To generate a diff of this commit:
 cvs rdiff -u -r1.90 -r1.91 src/bin/sh/jobs.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->needs-pullups
State-Changed-By: kre@NetBSD.org
State-Changed-When: Mon, 23 Oct 2017 10:57:00 +0000
State-Changed-Why:
waiting on some testing before requesting puttup to -8


State-Changed-From-To: needs-pullups->pending-pullups
State-Changed-By: kre@NetBSD.org
State-Changed-When: Sun, 29 Oct 2017 16:02:35 +0000
State-Changed-Why:
Ticket 337 for netbsd-8


From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/52640 CVS commit: [netbsd-8] src/bin/sh
Date: Fri, 17 Nov 2017 14:56:52 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Fri Nov 17 14:56:52 UTC 2017

 Modified Files:
 	src/bin/sh [netbsd-8]: jobs.c

 Log Message:
 Pull up following revision(s) (requested by kre in ticket #337):
 	bin/sh/jobs.c: revision 1.91 (patch)

 PR bin/52640  PR bin/52641

 Don't delete jobs from the jobs table merely because they finished,
 if they are not the job we are waiting upon.   (bin/52640 part 1)

 In a sub-shell environment, don't allow wait to find jobs from the
 parent shell that had already exited (before the sub-shell was
 created) and return status for them as if they are our children.
 (bin/52640 part 2)

 Don't have the "jobs" command also be an implicit "wait" command
 in non-interactive shells.  (bin/52641)

 Use WCONTINUED (when it exists) so we can report on stopped jobs that
 "mysteriously" move back to running state without the user issuing
 a "bg" command (eg: kill -CONT <pid>)   Previously they would keep
 being reported as stopped until they exited.
 When a job is detected as having changed status just as we're
 issuing a "jobs" command (i.e.: the change occurred between the last
 prompt and the jobs command being entered) don't report it twice,
 once from the status change, and then again in the jobs command
 output.   Once is enough (keep the jobs output, suppress the other).

 Apply some sanity to the way jobs_invalid is processed - ignore it
 in getjob() instead of just ignoring it most of the time there, and
 instead always check it before calling getjob() in situations where
 we can handle only children of the current shell.  This allows the
 (totally broken) save/clear/restore of jobs_invalid in jobscmd() to
 be done away with (previously an error while in the clear state would
 have left jobs_invalid incorrectly cleared - shouldn't have mattered
 since jobs_invalid => subshell => error causes exit, but better to be safe).

 Add/improve the DEBUG more tracing.


 To generate a diff of this commit:
 cvs rdiff -u -r1.85.2.1 -r1.85.2.2 src/bin/sh/jobs.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: pending-pullups->closed
State-Changed-By: kre@NetBSD.org
State-Changed-When: Fri, 17 Nov 2017 18:33:00 +0000
State-Changed-Why:
Pullups completed


>Unformatted:
Home
PR Database Search
(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2014 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.