NetBSD Problem Report #38670
From christos@zoulas.com Thu May 15 20:49:59 2008
Return-Path: <christos@zoulas.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by narn.NetBSD.org (Postfix) with ESMTP id 0066363B293
for <gnats-bugs@gnats.NetBSD.org>; Thu, 15 May 2008 20:49:58 +0000 (UTC)
Message-Id: <20080515204957.4666F5654E@rebar.astron.com>
Date: Thu, 15 May 2008 16:49:57 -0400 (EDT)
From: christos@zoulas.com
Reply-To: christos@zoulas.com
To: gnats-bugs@gnats.NetBSD.org
Subject: ^Z does not work anymore for this program.
X-Send-Pr-Version: 3.95
>Number: 38670
>Category: kern
>Synopsis: ^Z does not seem to suspend programs that vfork'ed and wait.
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Thu May 15 20:50:00 +0000 2008
>Last-Modified: Sun Feb 22 19:15:01 +0000 2009
>Originator: Christos Zoulas
>Release: NetBSD 4.99.62
>Organization:
Slobs 'R' US
>Environment:
System: NetBSD mx4.twosigma.com 4.99.62 NetBSD 4.99.62 (CRUSOE.debug) #0: Fri May 2 16:07:49 EDT 2008 sdegler@crusoe.degler.net:/vol1/NetBSD/kernels/CRUSOE.debug amd64
Architecture: amd64
Machine: amd64
>Description:
Hitting ^Z has no effect on the following program. Other tty signals
work.
>How-To-Repeat:
#include <sys/wait.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    int i, load;

    if (argc != 2 || sscanf(argv[1], "%d", &load) != 1) {
        (void) fprintf(stderr, "Usage: %s <load-value>.\n", argv[0]);
        exit(0);
    }
    for (i = 0; i < load; i++)
        switch (vfork()) {
        case 0:
            /* The child: leaves the switch and keeps looping. */
            break;
        case -1:
            (void) fprintf(stderr, "%s: Vfork failed (%s).\n", argv[0],
                strerror(errno));
            exit(errno);
        default:
            /* The parent: block until the child exits. */
            wait((int *) 0);
            break;
        }
    sleep(100000);
    exit(0);
}
>Fix:
?
>Audit-Trail:
From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/38670: ^Z does not work anymore for this program.
Date: Sun, 18 May 2008 15:29:45 +0100
kern_sig.c:
1607 if ((p->p_sflag & PS_PPWAIT) != 0)
1608 sigminusset(&stopsigmask, &ss);
...
1644 * by the debugger. If the our parent process is waiting
1645 * for us, don't hang as we could deadlock.
1646 */
1647 if ((p->p_slflag & PSL_TRACED) != 0 &&
1648 (p->p_sflag & PS_PPWAIT) == 0 && signo != SIGKILL) {
That seems to have come in with 4.4BSD, not sure what it's all about.
Andrew
From: christos@zoulas.com (Christos Zoulas)
To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc:
Subject: Re: kern/38670: ^Z does not work anymore for this program.
Date: Sun, 18 May 2008 13:46:57 -0400
On May 18, 2:30pm, ad@netbsd.org (Andrew Doran) wrote:
-- Subject: Re: kern/38670: ^Z does not work anymore for this program.
| The following reply was made to PR kern/38670; it has been noted by GNATS.
|
| From: Andrew Doran <ad@netbsd.org>
| To: gnats-bugs@NetBSD.org
| Cc:
| Subject: Re: kern/38670: ^Z does not work anymore for this program.
| Date: Sun, 18 May 2008 15:29:45 +0100
|
| kern_sig.c:
|
| 1607 if ((p->p_sflag & PS_PPWAIT) != 0)
| 1608 sigminusset(&stopsigmask, &ss);
| ...
| 1644 * by the debugger. If the our parent process is waiting
| 1645 * for us, don't hang as we could deadlock.
| 1646 */
| 1647 if ((p->p_slflag & PSL_TRACED) != 0 &&
| 1648 (p->p_sflag & PS_PPWAIT) == 0 && signo != SIGKILL) {
|
| That seems to have come in with 4.4BSD, not sure what it's all about.
The regression has been introduced recently though. This works fine with
NetBSD rebar.astron.com 4.99.1 NetBSD 4.99.1 (ASTRON) #2: Fri Sep 8 15:10:53 EDT 2006 christos@rebar.astron.com:/usr/src/sys/arch/i386/compile/ASTRON i38
christos
From: David Holland <dholland-bugs@netbsd.org>
To: Christos Zoulas <christos@zoulas.com>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/38670: ^Z does not work anymore for this program.
Date: Sun, 22 Feb 2009 10:01:06 +0000
On Sun, May 18, 2008 at 01:46:57PM -0400, Christos Zoulas wrote:
> | That seems to have come in with 4.4BSD, not sure what it's all about.
This comment was lost during some of the reorganization:
/*
* If a child holding parent blocked, stopping could
* cause deadlock: discard the signal.
*/
I'm not sure what this hypothetical deadlock would be, though.
> The regression has been introduced recently though. This works fine with
> NetBSD rebar.astron.com 4.99.1 NetBSD 4.99.1 (ASTRON) #2: Fri Sep 8 15:10:53 EDT 2006 christos@rebar.astron.com:/usr/src/sys/arch/i386/compile/ASTRON i38
What happened is that the interruptible tsleep() in the parent process
that waits for the child to exec got changed to an uninterruptible
cv_wait(). Thus, in your example, the parent processes would stop,
which is sufficient for the shell to report a stopped job, and the
child calling sleep() wouldn't.
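For reference, a minimal userland sketch of how a job-control shell notices a stopped child, assuming a POSIX system. The helper name observe_stopped_child() is hypothetical and not part of the PR or the kernel; it only illustrates the waitpid(WUNTRACED) reporting path mentioned above.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the PR): fork a child that stops itself,
 * then observe it the way a job-control shell would.
 * Returns the stop signal reported by waitpid(), or -1 on failure.
 */
int
observe_stopped_child(void)
{
    int status, stopsig;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        /* Child: stop unconditionally; exit if ever continued. */
        raise(SIGSTOP);
        _exit(0);
    }

    /* WUNTRACED makes waitpid() return for stopped children as well. */
    if (waitpid(pid, &status, WUNTRACED) != pid)
        return -1;
    stopsig = WIFSTOPPED(status) ? WSTOPSIG(status) : -1;

    /* Clean up: kill the stopped child and reap it. */
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return stopsig;
}
```

A shell that sees WIFSTOPPED() true for the process it is waiting on prints the "stopped" job message, regardless of what that process's own children are doing.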
Changing that call to cv_wait_sig() ought to restore the previous
behavior; however, it's not clear that this is a particularly good
idea, because if a signal arrives and results in ERESTARTSYS there'll
be another child process created, and if it results in EINTR then the
parent and the child will both be running on the same stack in the
same address space, and demons will fly out of someone's nose.
In 4.99.1 it might have worked to just stop and continue the parent,
provided SIG_DFL for both SIGTSTP and SIGCONT, because stopped
processes got stopped in their tracks wherever they happened to be in
the kernel (and while holding whatever locks they happened to be
working with, etc.) but that apparently got fixed last March; now it
requires either EINTR or ERESTARTSYS.
However, since having arbitrarily long uninterruptible waits isn't
such a great idea, maybe we should try to come up with a way to make
this work. Or maybe an adequate substitute is to change the WCHAN to
"vfork" so one can at least tell what's happening and find/kill off
the child process if things are stuck. But this would probably
require breaking the CV abstraction.
Also, I wonder what happens if someone does ptrace(PT_ATTACH, ...) on
a vfork child. This should probably be forbidden; it currently isn't
and I suspect it will make a mess.
--
David A. Holland
dholland@netbsd.org
From: christos@zoulas.com (Christos Zoulas)
To: David Holland <dholland-bugs@netbsd.org>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/38670: ^Z does not work anymore for this program.
Date: Sun, 22 Feb 2009 11:09:38 -0500
On Feb 22, 10:01am, dholland-bugs@netbsd.org (David Holland) wrote:
-- Subject: Re: kern/38670: ^Z does not work anymore for this program.
| Changing that call to cv_wait_sig() ought to restore the previous
| behavior; however, it's not clear that this is a particularly good
| idea, because if a signal arrives and results in ERESTARTSYS there'll
| be another child process created, and if it results in EINTR then the
| parent and the child will both be running on the same stack in the
| same address space, and demons will fly out of someone's nose.
|
| However, since having arbitrarily long uninterruptible waits isn't
| such a great idea, maybe we should try to come up with a way to make
| this work. Or maybe an adequate substitute is to change the WCHAN to
| "vfork" so one can at least tell what's happening and find/kill off
| the child process if things are stuck. But this would probably
| require breaking the CV abstraction.
|
| Also, I wonder what happens if someone does ptrace(PT_ATTACH, ...) on
| a vfork child. This should probably be forbidden; it currently isn't
| and I suspect it will make a mess.
Thanks for the explanation. I guess we should leave this open until we
fix at least the ptrace case.
christos
From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org, christos@zoulas.com
Subject: Re: kern/38670: ^Z does not work anymore for this program.
Date: Sun, 22 Feb 2009 19:13:50 +0000
[Just to re-iterate, the described behaviour came in with 4.4BSD. I may have
fixed a bug that prevented it from working as intended in NetBSD, at least
as far as I can tell. I'm not saying that the behaviour is correct. It has
been a couple of years so my memory of it is sketchy.]
I don't have time to look into this too deeply right now but at a glance it
seems that stop signals are left in the child's pending set but simply
ignored. If that is the case, the child should stop at its next interruptible
sleep after clearing vfork().
In the test for PL_PPWAIT in exec(), we could test for any pending signals
with sigispending(), and if so, do a signotify() on curlwp. That would make
it catch the signal immediately after exec() instead of at "some point in
the future". However this does not affect the test case in the PR.
If the problem is that the child simply does not stop, even while calling
e.g. sleep(), then the signal must be getting ripped out of its pending set.
> Changing that call to cv_wait_sig() ought to restore the previous
> behavior; however, it's not clear that this is a particularly good
> idea, because if a signal arrives and results in ERESTARTSYS there'll
> be another child process created, and if it results in EINTR then the
> parent and the child will both be running on the same stack in the
> same address space, and demons will fly out of someone's nose.
I'm not sure what the supposed deadlock is either. Maybe someone with access
to the CSRG SCCS files could tell us who made the modification. We could
arrange for a special-cased wait in the parent where it will obey STOP and a
few other conditions and handle it in-kernel, if it would help. Ugly though.
It may mean checking that we can arrange for the parent to die even if it
has a child from vfork().
Andrew
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.