NetBSD Problem Report #38670

From christos@zoulas.com  Thu May 15 20:49:59 2008
Return-Path: <christos@zoulas.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id 0066363B293
	for <gnats-bugs@gnats.NetBSD.org>; Thu, 15 May 2008 20:49:58 +0000 (UTC)
Message-Id: <20080515204957.4666F5654E@rebar.astron.com>
Date: Thu, 15 May 2008 16:49:57 -0400 (EDT)
From: christos@zoulas.com
Reply-To: christos@zoulas.com
To: gnats-bugs@gnats.NetBSD.org
Subject: ^Z does not work anymore for this program.
X-Send-Pr-Version: 3.95

>Number:         38670
>Category:       kern
>Synopsis:       ^Z does not seem to suspend programs that vfork and wait.
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu May 15 20:50:00 +0000 2008
>Last-Modified:  Sun Feb 22 19:15:01 +0000 2009
>Originator:     Christos Zoulas
>Release:        NetBSD 4.99.62
>Organization:
	Slobs 'R' US
>Environment:
System: NetBSD mx4.twosigma.com 4.99.62 NetBSD 4.99.62 (CRUSOE.debug) #0: Fri May  2 16:07:49 EDT 2008  sdegler@crusoe.degler.net:/vol1/NetBSD/kernels/CRUSOE.debug amd64

Architecture: amd64
Machine: amd64
>Description:
	Hitting ^Z has no effect on the following program. Other tty signals
	work.
>How-To-Repeat:

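#include <sys/wait.h>

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    int i, load;

    if (argc != 2 || sscanf(argv[1], "%d", &load) != 1) {
        (void) fprintf(stderr, "Usage: %s <load-value>.\n", argv[0]);
        exit(0);
    }

    for (i = 0; i < load; i++)
        switch (vfork()) {
        case 0:
            /*
             * The child: keeps running this loop, and eventually the
             * sleep() below, without calling exec or _exit, so its
             * parent stays blocked in vfork().
             */
            break;
        case -1:
            (void) fprintf(stderr, "%s: Vfork failed (%s).\n", argv[0],
                strerror(errno));
            exit(errno);
        default:
            /* The parent: reap the child once vfork() has returned. */
            wait((int *) 0);
            break;
        }
    sleep(100000);
    exit(0);
}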

>Fix:
	?

>Audit-Trail:
From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/38670: ^Z does not work anymore for this program.
Date: Sun, 18 May 2008 15:29:45 +0100

 kern_sig.c:

    1607 			if ((p->p_sflag & PS_PPWAIT) != 0)
    1608 				sigminusset(&stopsigmask, &ss);
 ...
    1644 		 * by the debugger.  If the our parent process is waiting
    1645 		 * for us, don't hang as we could deadlock.
    1646 		 */
    1647 		if ((p->p_slflag & PSL_TRACED) != 0 &&
    1648 		    (p->p_sflag & PS_PPWAIT) == 0 && signo != SIGKILL) {

 That seems to have come in with 4.4BSD, not sure what it's all about.

 Andrew

From: christos@zoulas.com (Christos Zoulas)
To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org, 
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc: 
Subject: Re: kern/38670: ^Z does not work anymore for this program.
Date: Sun, 18 May 2008 13:46:57 -0400

 On May 18,  2:30pm, ad@netbsd.org (Andrew Doran) wrote:
 -- Subject: Re: kern/38670: ^Z does not work anymore for this program.

 |  kern_sig.c:
 |  
 |     1607 			if ((p->p_sflag & PS_PPWAIT) != 0)
 |     1608 				sigminusset(&stopsigmask, &ss);
 |  ...
 |     1644 		 * by the debugger.  If the our parent process is waiting
 |     1645 		 * for us, don't hang as we could deadlock.
 |     1646 		 */
 |     1647 		if ((p->p_slflag & PSL_TRACED) != 0 &&
 |     1648 		    (p->p_sflag & PS_PPWAIT) == 0 && signo != SIGKILL) {
 |  
 |  That seems to have come in with 4.4BSD, not sure what it's all about.

 The regression has been introduced recently though. This works fine with
 NetBSD rebar.astron.com 4.99.1 NetBSD 4.99.1 (ASTRON) #2: Fri Sep  8 15:10:53 EDT 2006  christos@rebar.astron.com:/usr/src/sys/arch/i386/compile/ASTRON i38

 christos

From: David Holland <dholland-bugs@netbsd.org>
To: Christos Zoulas <christos@zoulas.com>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/38670: ^Z does not work anymore for this program.
Date: Sun, 22 Feb 2009 10:01:06 +0000

 On Sun, May 18, 2008 at 01:46:57PM -0400, Christos Zoulas wrote:
  > |  That seems to have come in with 4.4BSD, not sure what it's all about.

 This comment was lost during some of the reorganization:

                    /*
                     * If a child holding parent blocked, stopping could
                     * cause deadlock: discard the signal.
                     */

 I'm not sure what this hypothetical deadlock would be, though.

  > The regression has been introduced recently though. This works fine with
  > NetBSD rebar.astron.com 4.99.1 NetBSD 4.99.1 (ASTRON) #2: Fri Sep  8 15:10:53 EDT 2006  christos@rebar.astron.com:/usr/src/sys/arch/i386/compile/ASTRON i38

 What happened is that the interruptible tsleep() in the parent process
 that waits for the child to exec got changed to an uninterruptible
 cv_wait(). Before that change, the parent processes in your example
 would stop, which is sufficient for the shell to report a stopped job,
 even though the child calling sleep() wouldn't.
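
 For contrast with the test program, the conventional vfork() pattern
 keeps the parent's in-kernel wait short, because the child leaves
 immediately. A minimal sketch; the helper name vfork_child_exit_status()
 is made up for illustration:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * vfork() a child that exits at once with `code`, then reap it.
 * Returns the child's exit status, or -1 on any failure.
 */
int vfork_child_exit_status(int code)
{
    int status;
    pid_t pid = vfork();

    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(code);    /* child: only _exit() or an exec call is safe here */

    /* The parent resumes only after the child has exited (or exec'ed). */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

 The program in the PR does the opposite: its child returns into the
 caller's loop and sleeps, so the parent's wait inside vfork() lasts as
 long as the child does.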

 Changing that call to cv_wait_sig() ought to restore the previous
 behavior; however, it's not clear that this is a particularly good
 idea, because if a signal arrives and results in ERESTART there'll
 be another child process created, and if it results in EINTR then the
 parent and the child will both be running on the same stack in the
 same address space, and demons will fly out of someone's nose.

 In 4.99.1 it might have worked to just stop and continue the parent,
 provided SIG_DFL for both SIGTSTP and SIGCONT, because stopped
 processes got stopped in their tracks wherever they happened to be in
 the kernel (and while holding whatever locks they happened to be
 working with, etc.) but that apparently got fixed last March; now it
 requires either EINTR or ERESTART.

 However, since having arbitrarily long uninterruptible waits isn't
 such a great idea, maybe we should try to come up with a way to make
 this work. Or maybe an adequate substitute is to change the WCHAN to
 "vfork" so one can at least tell what's happening and find/kill off
 the child process if things are stuck. But this would probably
 require breaking the CV abstraction.

 Also, I wonder what happens if someone does ptrace(PT_ATTACH, ...) on
 a vfork child. This should probably be forbidden; it currently isn't
 and I suspect it will make a mess.

 -- 
 David A. Holland
 dholland@netbsd.org

From: christos@zoulas.com (Christos Zoulas)
To: David Holland <dholland-bugs@netbsd.org>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org, 
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/38670: ^Z does not work anymore for this program.
Date: Sun, 22 Feb 2009 11:09:38 -0500

 On Feb 22, 10:01am, dholland-bugs@netbsd.org (David Holland) wrote:
 -- Subject: Re: kern/38670: ^Z does not work anymore for this program.

 | Changing that call to cv_wait_sig() ought to restore the previous
 | behavior; however, it's not clear that this is a particularly good
 | idea, because if a signal arrives and results in ERESTART there'll
 | be another child process created, and if it results in EINTR then the
 | parent and the child will both be running on the same stack in the
 | same address space, and demons will fly out of someone's nose.
 | 
 | However, since having arbitrarily long uninterruptible waits isn't
 | such a great idea, maybe we should try to come up with a way to make
 | this work. Or maybe an adequate substitute is to change the WCHAN to
 | "vfork" so one can at least tell what's happening and find/kill off
 | the child process if things are stuck. But this would probably
 | require breaking the CV abstraction.
 | 
 | Also, I wonder what happens if someone does ptrace(PT_ATTACH, ...) on
 | a vfork child. This should probably be forbidden; it currently isn't
 | and I suspect it will make a mess.

 Thanks for the explanation. I guess we should leave this open until we
 fix at least the ptrace case.

 christos

From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, christos@zoulas.com
Subject: Re: kern/38670: ^Z does not work anymore for this program.
Date: Sun, 22 Feb 2009 19:13:50 +0000

 [Just to re-iterate, the described behaviour came in with 4.4BSD. I may have
 fixed a bug that prevented it from working as intended in NetBSD, at least
 as far as I can tell. I'm not saying that the behaviour is correct. It has
 been a couple of years so my memory of it is sketchy.]

 I don't have time to look into this too deeply right now but at a glance it
 seems that stop signals are left in the child's pending set but simply
 ignored. If that is the case, the child should stop at its next interruptible
 sleep after clearing vfork().

 In the test for PL_PPWAIT in exec(), we could test for any pending signals
 with sigispending(), and if so, do a signotify() on curlwp. That would make
 it catch the signal immediately after exec() instead of at "some point in
 the future". However this does not affect the test case in the PR.

 If the problem is that the child simply does not stop, even while calling
 e.g. sleep(), then the signal must be getting ripped out of its pending set.

 >  Changing that call to cv_wait_sig() ought to restore the previous
 >  behavior; however, it's not clear that this is a particularly good
 >  idea, because if a signal arrives and results in ERESTART there'll
 >  be another child process created, and if it results in EINTR then the
 >  parent and the child will both be running on the same stack in the
 >  same address space, and demons will fly out of someone's nose.

 I'm not sure what the supposed deadlock is either. Maybe someone with access
 to the CSRG SCCS files could tell us who made the modification. We could
 arrange for a special-cased wait in the parent where it will obey STOP and a
 few other conditions and handle it in-kernel, if it would help. Ugly though.
 It may mean checking that we can arrange for the parent to die even if it
 has a child from vfork().

 Andrew

Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.