NetBSD Problem Report #49017

From christos@astron.com  Fri Jul 18 17:07:54 2014
Return-Path: <christos@astron.com>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id DB6A0A5672
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 18 Jul 2014 17:07:53 +0000 (UTC)
Message-Id: <20140718155148.475D614B68@quasar.astron.com>
Date: Fri, 18 Jul 2014 15:51:48 +0000 (UTC)
From: christos@netbsd.org
Reply-To: christos@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: vfork does not suspend all threads
X-Send-Pr-Version: 3.95

>Number:         49017
>Category:       kern
>Synopsis:       vfork does not suspend all threads
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Jul 18 17:10:00 +0000 2014
>Last-Modified:  Mon Apr 10 00:40:00 +0000 2017
>Originator:     Christos Zoulas
>Release:        NetBSD 6.99.47
>Organization:
	You've been vforked!
>Environment:
System: NetBSD quasar.astron.com 6.99.47 NetBSD 6.99.47 (QUASAR) #2: Fri Jul 18 08:23:06 EDT 2014 christos@quasar.astron.com:/usr/src/sys/arch/amd64/compile/QUASAR amd64
Architecture: x86_64
Machine: amd64
>Description:
vfork is supposed to suspend the parent while the child is preparing for
exec. The parent is resumed after the child execs or exits. This is done
so that the memory image shared between the parent and the child is not
changed by the parent while the child is preparing to exec. In a threaded
program, however, only the thread that called vfork is suspended; the
other threads in the parent keep running.
>How-To-Repeat:
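#include <sys/types.h>
#include <err.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREAD 2

/* Each worker prints the pid, its own index, and a counter once a second. */
static void *
worker(void *arg)
{
	size_t *i = arg;

	for (size_t j = 0;; j++) {
		printf("[%d] %zu %zu\n", (int)getpid(), *i, j);
		sleep(1);
	}
}

int
main(int argc, char *argv[])
{
	pthread_t t[NTHREAD];
	size_t id[NTHREAD];	/* per-thread index; passing &i races with the loop */

	for (size_t i = 0; i < NTHREAD; i++) {
		id[i] = i;
		if (pthread_create(&t[i], NULL, worker, &id[i]) != 0)
			errx(1, "pthread_create");
	}

	switch (vfork()) {
	case -1:
		err(1, "vfork");
	case 0:		/* child */
	default:	/* parent; the workers keep printing in the meantime */
		sleep(100000);
		break;
	}
	return 0;
}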

>Fix:

>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Fri, 18 Jul 2014 19:38:00 +0200

 Isn't calling vfork in a threaded program just a no-no?

 Martin

From: Nico Williams <Nico.Williams@twosigma.com>
To: <gnats-bugs@netbsd.org>
Cc: 
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Wed, 5 Apr 2017 18:42:25 +0000

 Please do NOT stop any threads in the vfork() parent other than the one that
 called vfork().  Also, allow me to make an argument for this.

 First, I suppose I should look at rationales for stopping all threads in a
 vfork() parent.  I can think of two (but if I'm missing some, let me know!): a)
 that the man page always said that the parent process is stopped, ergo it must
 now mean "all threads in the parent process", and b) that the set of safe
 functions to call in the vfork() child is made unacceptably smaller by not
 stopping all threads in the parent.

 Before refuting my strawman rationales for stopping all threads, I'll explain
 why stopping all threads is highly undesirable: it kills performance, the very
 reason for vfork()'s existence.

 There are several ways to use vfork() to spawn children in a high-performance
 way:

  - First, obviously, in posix_spawn().

    It would be terrible to have to stop all of a JVM's many threads just to
    spawn a child, and would negate some of vfork()'s massive performance
    advantage over fork().

    Why should unrelated threads in the parent suffer?  (This gets to the safety
    issues which I posit might motivate stopping all parent threads, and which I
    address below.)  Even if there were a strong safety argument for this, we
    should aim to make it go away as the performance rationale for using vfork()
    is extremely important in real life cases.

    (I should point out that, for example, Linux's vfork() does not stop all
    other threads in the parent.  I can provide a test program that demonstrates
    this.)

  - Second, one can implement a very fast popen()-like API that uses a threaded
    taskq where threads pre-vfork(), enabling a program to spawn processes
    faster than with posix_spawn(): without blocking for the child to spin up
    then execve()-or-_exit() -- the threads that pre-call vfork() block that
    way, but the threads that dispatch the requests to the pre-vfork()ed
    children do not block at all, they only call write(2) to write the job to a
    pipe to the child.

    See my gist about this where I describe this in detail and propose a new
    function with this signature:

         pid_t avfork(int (*start_routine)(void *), void *arg);

    and provide a partial implementation based on a pre-vforking threaded taskq:

    https://gist.github.com/nicowilliams/a8a07b0fc75df05f684c23c18d7db234

    There is such a very fast popen()-like implementation here:

        https://github.com/famzah/popen-noshell (warning GPLv3)

    that uses clone() on Linux to get something very much like the avfork() that
    I argue for.  Its author needs to be able to spawn thousands of processes
    very quickly sometimes (see
    https://github.com/famzah/popen-noshell/issues/11#issuecomment-287235234).

 Now, to knock down my strawman rationales for stopping all threads in the
 vfork() parent:

  - Regarding (a), pre-threads vfork() man page text saying "stops the parent
    process" should not be interpreted as meaning "all threads" now that we have
    a threaded world.  Clearly the original authors could not have meant that,
    nor for that matter would they have meant that only the thread that called
    vfork() in the parent must be stopped.  We must decide this matter de novo.

    Clearly the thread that called vfork() must be stopped until the child
    execve()s or _exit()s.  That much is utterly clear: because two schedulable
    threads/entities simply cannot share a stack concurrently.  So we only need
    to decide whether other threads in the parent must also be stopped, and the
    original man page text simply can't guide us as to that as it predates
    threads.

  - Regarding (b), it may already be the case that the set of functions that may
    safely be called in the vfork() child is somewhat smaller than the set of
    functions that may be called in a fork() child.  Since POSIX has deprecated
    vfork(), we don't know what that set is (though we can inspect earlier POSIX
    standards) and may now define it to our liking.

    In any case, the set of async-signal-safe functions defined by POSIX looks
    like it should be safe to call in a vfork() child on any reasonable OS since
    all of them should be system calls that do not affect the shared address
    space (or anything else that might still be shared between the child and the
    parent): http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_04.html

    As an aside, the child should probably also avoid changing FD or FL
    flags with fcntl() for file descriptors shared with the parent.  And
    it should also not use the horrible POSIX file locking, though that's mostly
    because nothing should use the horrible POSIX file locking!  This aside
    brought to you by intense feelings of disgust elicited by POSIX file locking.

    Note that there are a number of functions NOT INCLUDED in the standard list
    of async-signal-safe functions:

     - pthread_*()

     - brk(), sbrk(), mmap(), munmap(), mprotect()

     - the heap allocator (quite naturally, since it might need to call
       brk()/sbrk() and/or mmap()/munmap(), or pthread_*() functions, none of
       which are async-signal-safe)

    which means that the scariest functions one might call on the child-side of
    vfork() are by definition (e.g., the old POSIX vfork() specification)
    already not safe to call on the child-side of vfork().

    In any case, again, NetBSD is free to further narrow the set of functions
    that are safe to call in the child-side of vfork() should that be necessary.

 The biggest problem with vfork(), really, is that unsafe signal handlers in the
 parent might run in the child before the child can block them.  This could be bad even
 if all threads in the parent are stopped.

 Indeed, I would argue that the set of functions that are safe to call in an
 asynchronous signal handler (as opposed to the child-side of fork() or vfork())
 is smaller than that which POSIX says.  The only things I ever do in the signal
 handlers I write are:

  - write to sig_atomic_t variables

  - call write(2) to write a single byte into a pipe that is used in the
    application's event loop

    If the application does not have an event loop I do sometimes ensure that
    there's a thread blocking on read(2) on the other side of that pipe.

  - call write(2) to write to stderr

  - call _exit(2)

 If I had my way those would be the only things I'd allow in signal handlers in
 POSIX!  (And then we'd have to give a new name to the async-signal-safe
 function set that we reuse to define the functions that are safe to call in
 various other contexts such as the child-side of fork()!)

 Thanks for taking the time to read this -- it's probably too long, and I
 apologize about that.  If I'm wrong about something here, please let me know!

 Nico
 -- 

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Wed, 5 Apr 2017 21:17:17 +0200

 On Wed, Apr 05, 2017 at 06:45:01PM +0000, Nico Williams wrote:
 >  There are several ways to use vfork() to spawn children in a high-performance
 >  way:
 >  
 >   - First, obviously, in posix_spawn().

 At least in NetBSD posix_spawn() is completely unrelated to vfork().
 No one is suggesting to stop any thread in a process doing posix_spawn().

 Using vfork() in a program with multiple active threads is madness,
 posix_spawn() is the only sensible way.

 Martin

From: Kamil Rytarowski <n54@gmx.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Wed, 5 Apr 2017 21:35:03 +0200

 On 05.04.2017 21:20, Martin Husemann wrote:
 > The following reply was made to PR kern/49017; it has been noted by GNATS.
 >
 > From: Martin Husemann <martin@duskware.de>
 > To: gnats-bugs@NetBSD.org
 > Cc:
 > Subject: Re: kern/49017: vfork does not suspend all threads
 > Date: Wed, 5 Apr 2017 21:17:17 +0200
 >
 >  On Wed, Apr 05, 2017 at 06:45:01PM +0000, Nico Williams wrote:
 >  >  There are several ways to use vfork() to spawn children in a high-performance
 >  >  way:
 >  >
 >  >   - First, obviously, in posix_spawn().
 >
 >  At least in NetBSD posix_spawn() is completely unrelated to vfork().
 >  No one is suggesting to stop any thread in a process doing posix_spawn().
 >
 >  Using vfork() in a program with multiple active threads is madness,
 >  posix_spawn() is the only sensible way.
 >
 >  Martin
 >

 Well, vfork(2) is supposed to suspend the parent process.

 "The parent process is suspended while the child is using its resources."

  -- vfork(2)


 It's out of POSIX, so it's rather harsh to dictate a behavior change.

 Also, going for your proposal is imho violating the thread-process model in
 NetBSD. It's a Linux concept to emulate threads with clone(2), while they
 are still regular processes.

 In ptrace(2) we have two interfaces: PTRACE_FORK and PTRACE_VFORK. The
 only difference between them is whether the parent process (with all
 threads) has been suspended or not, no matter which original syscall or
 API was used (clone(2), __clone(2), posix_spawn(2), fork(2)...).

 I think you might be interested in designing something like _lwp_vfork().



From: christos@zoulas.com (Christos Zoulas)
To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org, 
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc: 
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Wed, 5 Apr 2017 16:26:04 -0400

 On Apr 5,  7:20pm, martin@duskware.de (Martin Husemann) wrote:
 -- Subject: Re: kern/49017: vfork does not suspend all threads

 |  Using vfork() in a program with multiple active threads is madness,
 |  posix_spawn() is the only sensible way.

 Why is that? We allow it, so it should do something reasonable/useful...
 Or we should not allow it...

 christos

From: christos@zoulas.com (Christos Zoulas)
To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org, 
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc: 
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Wed, 5 Apr 2017 16:38:32 -0400

 On Apr 5,  7:35pm, n54@gmx.com (Kamil Rytarowski) wrote:
 -- Subject: Re: kern/49017: vfork does not suspend all threads

 |  Well vfork(2) is supposed to suspend a parent process.
 |  "The parent process is suspended while the child is using its resources."
 |  
 |   -- vfork(2)
 |  
 |  It's out of POSIX so it's rather harsh to dictate behavior change.

 This wording predates threads and it is not specified in ToG:

 http://pubs.opengroup.org/onlinepubs/009695399/functions/vfork.html

 I.e. the suspension of the parent is historical behavior mandated by
 implementation convenience; it would be difficult to make anything work
 reliably if the parent were altering the stack frame the child is currently
 executing in.

 |  Also going for your proposal is imho violating thread-process model in
 |  NetBSD. It's Linux concept to emulate threads with clone(2), while they
 |  are still regular processes.

 Well, they are not regular processes; Linux just does not differentiate
 between threads and processes by re-using the proc structure to describe
 both. In both implementations they end up sharing vmspace, file descriptors,
 etc.

 |  In ptrace(2) we have two interfaces: PTRACE_FORK and PTRACE_VFORK. The
 |  difference between them is only in the point whether the parent process
 |  (with all threads) has been suspended or not. No matter what the
 |  original syscall or API was used (clone(2), __clone(2), posix_spawn(2),
 |  fork(2)...).

 That is orthogonal; in fork() the parent is not suspended and ptrace has
 no problem with that. In vfork() only the thread executing vfork() is,
 and again ptrace has no problem with that. Whether the other threads
 should be suspended is the question here. Nico claims that it is not
 harmful if they are not, and that it is actually beneficial. I have come
 to the realization that this is true if the child is careful not to
 alter the parent data (which has always been the case). The question is:

     Can the other threads harm the child or the parent while it is vfork()ing?

 I can't think of a way.

 |  I think you might be interested in designing something like _lwp_vfork().

 How does this solve the problem? What does _lwp_vfork() fork? Does it create
 a new thread? In what process context?

 christos

From: Joerg Sonnenberger <joerg@bec.de>
To: Christos Zoulas <christos@zoulas.com>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Wed, 5 Apr 2017 23:03:19 +0200

 On Wed, Apr 05, 2017 at 04:38:32PM -0400, Christos Zoulas wrote:
 >     Can the other threads harm the child or the parent while it is vfork()ing?
 > 
 > I can't think of a way.

 Race conditions in mutex code. Keep in mind that the park/unpark events
 are per-process.

 I fully support the proposal to block all threads in the parent. If you
 want to create child processes from multi-threaded programs and care at
 all about performance, use posix_spawn. How it is implemented is an
 implementation detail and independent of vfork.

 Joerg

From: Nico Williams <Nico.Williams@twosigma.com>
To: <gnats-bugs@netbsd.org>
Cc: 
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Wed, 5 Apr 2017 20:45:53 +0000

 On Wed, Apr 05, 2017 at 06:42:25PM +0000, Nico Williams wrote:
 > Please do NOT stop any threads in the vfork() parent other than the one that
 > called vfork().  Also, allow me to make an argument for this.
 > 
 > First, I suppose I should look at rationales for stopping all threads in a
 > vfork() parent.  [...]

 Let me respond to each response so far:

 Martin Husemann <martin@duskware.de> wrote:
 > At least in NetBSD posix_spawn() is completely unrelated to vfork().
 > No one is suggesting to stop any thread in a process doing posix_spawn().

 I'm glad that's the case (it seems posix_spawn() is a system call in NetBSD).

 The only relevance of this to vfork() is that you could obsolete it (but
 probably not remove it for a long time yet), but then why make any changes to
 vfork()?  Are there applications that are breaking that couldn't be fixed in
 some other way?

 (ISTR Thor telling me that the kernel-coded posix_spawn() reduced performance
 and that it was removed.  It seems I remembered half-correctly.  He did tell me
 that the kernel implementation is slower than the user-land implementation.)

 > Using vfork() in a program with multiple active threads is madness,
 > posix_spawn() is the only sensible way.

 Such a statement requires explanation.  I was careful to provide explanation
 for my possibly-extraordinary-seeming statements about this.

 Kamil Rytarowski wrote:
 > Well vfork(2) is supposed to suspend a parent process.
 > 
 > "The parent process is suspended while the child is using its resources."

 I addressed this very specifically.  I believe it's incorrect to say that
 because vfork() absolutely must stop the parent thread that called it, and
 because the man page said so without referring to threads because it long
 predated threads, that it must mean "stop all parent threads" now that we have
 threads.

 Please recall that an incorrect "vfork() is harmful" meme caused it to be
 removed (broken, actually) in 4.4BSD, and later to be re-added in all
 subsequent BSDs.  Cargo culting "stops the parent process" now is just that.

 > It's out of POSIX so it's rather harsh to dictate behavior change.

 How is it harsh?  It's out of POSIX -- that means you're _free_ to change it.

 Moreover, since some OSes don't stop all threads in the parent (e.g., Linux and
 NetBSD), one perfectly legitimate thing to do at the Open Group would be to
 seek to modify the specification to allow it, if POSIX still specified it at
 all.  Participants do this all the time.  And even POSIX never said that
 vfork() stops all threads in the parent process -- it merely cargo copied the
 original text from BSD and then added a scary warning that one must never use
 vfork(), as if fork() were somehow trivial to use safely (it is not).

 > Also going for your proposal is imho violating thread-process model in
 > NetBSD. It's Linux concept to emulate threads with clone(2), while they
 > are still regular processes.

 The only way to make sense of this statement is that you're taking text from
 the pre-threads vfork() man page about stopping the parent process and
 interpreting that in terms of the threaded process model.  But that's WRONG.
 The pre-threads vfork() man page is pre-threaded process model.  In a threaded
 world it is perfectly sensible to revisit that text rather than cargo cult it.

 > In ptrace(2) we have two interfaces: PTRACE_FORK and PTRACE_VFORK. The
 > difference between them is only in the point whether the parent process
 > (with all threads) has been suspended or not. [...]

 How does that force vfork() (when not being ptrace'd) to also stop all threads
 in the parent??  I don't follow.

 Thanks,

 Nico
 -- 

From: Nico Williams <Nico.Williams@twosigma.com>
To: <gnats-bugs@netbsd.org>
Cc: 
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Wed, 5 Apr 2017 20:57:03 +0000

 Kamil Rytarowski also wrote:
 > I think you might be interested in designing something like _lwp_vfork().

 Actually, I am interested in something like that:

     pid_t avfork(int (*)(void *), void *);

 It would have similar constraints for the child as vfork().  E.g., no allocator
 calls, etc.

 avfork()'s main benefit over vfork() is that it wouldn't have to stop even the
 thread that called it in the parent.

 avfork() can even be implemented in terms of threads and vfork()... if only
 vfork() doesn't stop all threads in the parent.  See:

 https://gist.github.com/nicowilliams/a8a07b0fc75df05f684c23c18d7db234

 for a partial implementation and lots more about vfork() in general.

 Nico
 -- 

From: Nico Williams <Nico.Williams@twosigma.com>
To: <gnats-bugs@netbsd.org>
Cc: 
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Wed, 5 Apr 2017 21:24:06 +0000

 Joerg Sonnenberger <joerg@bec.de> wrote:
 > Race conditions in mutex code. Keep in mind that the park/unpark events
 > are per-process.

 As I mentioned in my original post, pthread_mutex_*lock() and such are NOT in
 the async-signal-safe function set, therefore vfork() children cannot safely
 call them, and if they don't call them, how can they cause other threads in the
 parent any problems?

 (An OS could certainly make pthread functions work in the vfork() child,
 provided that neither child nor parent exit due to signals, say, while holding a
 lock.  However, it is not necessary to do this as pthread functions are already
 not async-signal-safe.)

 Nico
 -- 

From: Kamil Rytarowski <n54@gmx.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Thu, 6 Apr 2017 01:45:41 +0200

 On 05.04.2017 22:40, Christos Zoulas wrote:
 > The following reply was made to PR kern/49017; it has been noted by GNATS.
 >
 > From: christos@zoulas.com (Christos Zoulas)
 > To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
 > 	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
 > Cc:
 > Subject: Re: kern/49017: vfork does not suspend all threads
 > Date: Wed, 5 Apr 2017 16:38:32 -0400
 >
 >  On Apr 5,  7:35pm, n54@gmx.com (Kamil Rytarowski) wrote:
 >  -- Subject: Re: kern/49017: vfork does not suspend all threads
 >
 >  |  Well vfork(2) is supposed to suspend a parent process.
 >  |  "The parent process is suspended while the child is using its resources."
 >  |
 >  |   -- vfork(2)
 >  |
 >  |  It's out of POSIX so it's rather harsh to dictate behavior change.
 >
 >  This wording predates threads and it is not specified in ToG:
 >
 >  http://pubs.opengroup.org/onlinepubs/009695399/functions/vfork.html
 >
 >  I.e. the suspension of the parent is historical behavior mandated by
 >  implementation convenience; it would be difficult to make anything work
 >  reliably if the parent were altering the stack frame the child is currently
 >  executing in.
 >
 >  |  Also going for your proposal is imho violating thread-process model in
 >  |  NetBSD. It's Linux concept to emulate threads with clone(2), while they
 >  |  are still regular processes.
 >
 >  Well, they are not regular processes; Linux just does not differentiate
 >  between threads and processes by re-using the proc structure to describe
 >  both. In both implementations they end up sharing vmspace, file descriptors,
 >  etc.
 >
 >  |  In ptrace(2) we have two interfaces: PTRACE_FORK and PTRACE_VFORK. The
 >  |  difference between them is only in the point whether the parent process
 >  |  (with all threads) has been suspended or not. No matter what the
 >  |  original syscall or API was used (clone(2), __clone(2), posix_spawn(2),
 >  |  fork(2)...).
 >
 >  That is orthogonal; in fork() the parent is not suspended and ptrace has
 >  no problem with that. In vfork() only the thread executing vfork() is,
 >  and again ptrace has no problem with that. Whether the other threads
 >  should be suspended is the question here. Nico claims that it is not
 >  harmful if they are not, and that it is actually beneficial. I have come
 >  to the realization that this is true if the child is careful not to
 >  alter the parent data (which has always been the case). The question is:
 >
 >      Can the other threads harm the child or the parent while it is vfork()ing?
 >
 >  I can't think of a way.
 >
 >  |  I think you might be interested in designing something like _lwp_vfork().
 >
 >  How does this solve the problem? What does _lwp_vfork() fork? Does it create
 >  a new thread? In what process context?

 I don't have strong opinions.

 If we can first ensure that vfork(2) is always safe, then we can go for
 it... assuming that someone is willing to do the work.

 Regarding _lwp_vfork(): I got the impression that optimizing forking today
 is like optimizing thread creation. There is LWP_VFORK in the kernel.

 I know that threads and processes are different; however, there are
 still Unices out there without POSIX threads, like Minix.

 >
 >  christos
 >




From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Thu, 06 Apr 2017 10:08:40 +0700

 It appears to me as if the discussion on this has largely divided
 along traditional lines - those who (fundamentally) believe that
 vfork() is an evil hack, and should be removed (ideally) but in the
 meantime restricted in all ways possible to limit its use, so that
 hopefully it can go away some day not too far away - and those who
 believe vfork() is a valuable tool, which should be exploited in
 all ways possible, and enhanced where practical to make it even more
 useful.

 As long as that division persists, there will never be consensus on
 anything related to vfork().

 I certainly agree that the man page description of vfork() as
 "stopping the parent process" is irrelevant to the current issue; as
 others have said, that was written in an age when there were no threads,
 not even as a possibility on the horizon, and was just never changed again.
 Whatever relationship might have existed between vfork() and threads seems
 to have simply been ignored (because vfork() proponents, and thread
 proponents, seem to largely be disjoint sets, or were.)

 Before going further, I think it useful to actually understand vfork() a
 little better - first it is (kind of) badly named, as in most respects it
 is not a fork() type function at all, it is rather a kind of setjmp()/longjmp()
 with some peculiar semantics - and a new proc struct (and hence new pid,
 and anything that affects only the proc struct, like setuid() being magic).

 That is, in reality, what happens is that the "parent" process both stalls,
 and continues running (just like with setjmp()) until the terminating
 condition occurs (the longjmp() equiv - _exit() or exec()) - and (here
 I disagree with Nico) the "child" process after a vfork() can do just about
 anything that would be safe between a setjmp() and longjmp(), unless the
 operation does something which would require a proc struct alteration which
 is also reflected in user space (so brk() is bad).

 There's no need to restrict vfork() children to async signal safe operations
 (the process limiting itself that way certainly won't hurt it, but it is
 not required) - it can do anything that the parent can do that affects only
 its internal (userland) state, or which affects purely the proc struct
 state in the kernel (so it can close files, or change the "close on exec"
 state, but not other file status flags).

 Of course, the parent needs to expect all of this - it needs to co-operate
 with the child, just as is required with a setjmp()/longjmp(), and understand
 just what the "child" might have done with the memory image after it gets
 a chance to observe.

 What all this means to a threaded process, is that overall I'm of the opinion
 that only the parent thread should block (just as if the parent thread used
 a setjmp()) and whatever sync with other threads should be just the same for
 the child of the vfork() as it would have been for the thread had it not done
 a vfork(), with the sole exception that the child cannot use, or rely upon,
 anything that uses kernel process private state (and hence cannot access, or
 change, anything which would be protected by such a mechanism).  In-process
 spin locks would be safe; syscall-activated sleeping locks would not be.

 kre

 ps: unrelated here, but the one facility missing from vfork() that would make
 it more useful, would be a "complete the fork" sys call, which would turn the
 vfork() into a fork() (dup the addr space) and be a third "wakeup the parent"
 operation.

From: Martin Husemann <martin@duskware.de>
To: Christos Zoulas <christos@zoulas.com>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Thu, 6 Apr 2017 11:03:09 +0200

 On Wed, Apr 05, 2017 at 04:26:04PM -0400, Christos Zoulas wrote:
 > On Apr 5,  7:20pm, martin@duskware.de (Martin Husemann) wrote:
 > -- Subject: Re: kern/49017: vfork does not suspend all threads
 > 
 > |  Using vfork() in a program with multiple active threads is madness,
 > |  posix_spawn() is the only sensible way.
 > 
 > Why is that? We allow it, so it should do something reasonable/useful...
 > Or we should not allow it...

 We now have two processes, each with active threads, and a shared vmspace.
 This sounds completely out of spec for the unix process model to me
 and I'd call it madness.
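
 For reference, a minimal posix_spawn() sketch of the alternative named in
 the quoted text. The wrapper name spawn_and_wait() and the choice to run
 the command through /bin/sh are assumptions made for illustration only:

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/*
 * Run "/bin/sh -c <command>" with posix_spawn() and return its exit
 * status, or -1 if it could not be spawned or did not exit normally.
 */
int
spawn_and_wait(const char *command)
{
	char *const argv[] = { "sh", "-c", (char *)command, NULL };
	pid_t pid;
	int status;

	/* posix_spawn() returns an error number rather than -1/errno. */
	if (posix_spawn(&pid, "/bin/sh", NULL, NULL, argv, environ) != 0)
		return -1;
	if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

 The caller never runs any code of its own in the child, so the question
 of what is safe after vfork() in a threaded parent does not arise.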

 Martin

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Thu, 6 Apr 2017 11:04:48 +0200

 On Wed, Apr 05, 2017 at 10:05:01PM +0000, Nico Williams wrote:
 >  (ISTR Thor telling me that the kernel-coded posix_spawn() reduced performance
 >  and that it was removed.  It seems I remembered half-correctly.  He did tell me
 >  that the kernel implementation is slower than the user-land implementation.)

 Are you sure that was about NetBSD?
 Would be nice to see numbers for that, I would be surprised.

 Martin

From: christos@zoulas.com (Christos Zoulas)
To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org, 
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc: 
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Thu, 6 Apr 2017 07:16:36 -0400

 On Apr 6,  3:10am, kre@munnari.OZ.AU (Robert Elz) wrote:
 -- Subject: Re: kern/49017: vfork does not suspend all threads

 |  That is, in reality, what happens is that the "parent" process both stalls,
 |  and continues running (just like with setjmp()) until the terminating
 |  condition occurs (the longjmp() equiv - _exit() or exec()) - and (here
 |  I disagree with Nico) the "child" process after a vfork() can do just about
 |  anything that would be safe between a setjmp() and longjmp(), unless the
 |  operation does something which would require a proc struct alteration which
 |  is also reflected in user space (so brk() is bad).

 It is much more restricted than that. For example if the child
 continues and returns from its current stack frame to other higher
 up frames, it makes changes to them so when the parent resumes it
 finds an inconsistent state. So one of the restrictions is "you
 can't safely return from the current stack frame". It is also
 very similar (as you said) to setjmp and longjmp as far as the
 current function frame is concerned (and register liveness) but
 the compiler takes care of that.

 |  What all this means to a threaded process, is that overall I'm of the opinion
 |  that only the parent thread should block (just as if the parent thread used
 |  a setjmp()) and whatever sync with other threads should be just the same for
 |  the child of the vfork() as it would have been for the thread had it not done
 |  a vfork(), with the sole exception that the child cannot use, or rely upon,
 |  anything that uses kernel process private state (and hence cannot access, or
 |  change, anything which would be protected by such a mechanism).  In-process
 |  spin locks would be safe; syscall-activated sleeping locks would not be.

 We agree there.

 |  ps: unrelated here, but the one facility missing from vfork() that would make
 |  it more useful, would be a "complete the fork" sys call, which would turn the
 |  vfork() into a fork() (dup the addr space) and be a third "wakeup the parent"
 |  operation.

 The difficulty with that is that the child needs to know that it has been
 vforked instead of forked... So the child needs code to handle both cases
 in general which is annoying.

 christos

From: Joerg Sonnenberger <joerg@bec.de>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, christos@netbsd.org
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Thu, 6 Apr 2017 13:22:21 +0200

 On Thu, Apr 06, 2017 at 03:10:01AM +0000, Robert Elz wrote:
 >  There's no need to restrict vfork() children to async signal safe operations
 >  (the process limiting itself that way certainly won't hurt it, but it is
 >  not required) - it can do anything that the parent can do that affects only
 >  its internal (userland) state, or which affects purely the proc struct
 >  state in the kernel (so it can close files, or change the "close on exec"
 >  state, but not other file status flags).

 The problem is that many of the functions outside the "async signal
 safe" category are allowed to do exactly such things. Anything using
 mutexes will not work correctly after vfork. That is an implementation
 restriction related to how NetBSD's libpthread works, but it is certainly
 not the only possible pitfall.

 I don't classify vfork in general as a hack -- it is a building block
 with a number of serious restrictions. Not working well with threads is
 just another one of those restrictions.

 Joerg

From: Robert Elz <kre@munnari.OZ.AU>
To: christos@zoulas.com (Christos Zoulas)
Cc: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Thu, 06 Apr 2017 21:15:08 +0700

     Date:        Thu, 6 Apr 2017 07:16:36 -0400
     From:        christos@zoulas.com (Christos Zoulas)
     Message-ID:  <20170406111636.92DF217FDA8@rebar.astron.com>

   | It is much more restricted than that. For example if the child
   | continues and returns from its current stack frame to other higher
   | up frames, it makes changes to them so when the parent resumes it
   | finds an inconsistent state.

 Just the same as setjmp() ... no question but that there are restrictions
 after a vfork(), but they're not as strict (not nearly as strict) as async
 signal safe (which needs to be prepared to execute with the "parent" in
 any state at all - here we know the parent is blocked in vfork() so is
 not currently doing anything.)

   | |  would be a "complete the fork" sys call, which would turn the
   | |  vfork() into a fork()

   | The difficulty with that is that the child needs to know that it has been
   | vforked instead of forked... So the child needs code to handle both cases
   | in general which is annoying.

 Huh?   The code is

 	if (vfork() == 0) {
 		/* here I am the child */
 	}

 How does the child possibly not know it has been vforked?   What I
 suggested (as some kind of dream, not a serious proposal, or not without
 trying it anyway) would be to allow

 	if (vfork() == 0) {
 		/* child, with parent blocked and shared mem */

 		/* code here, probably tests to decide what next */

 		vfork_into_fork();
 		/* now parent is unblocked, and child has its own addr space */
 	}

 (and no, I don't think "vfork_into_fork()" is a sensible name...)


 Joerg Sonnenberger <joerg@bec.de> said:
   | The problem is that many of the functions outside the "async signal safe"
   | category are allowed to do exactly such things.

 Sure, but there are plenty that are, much more than what is safe in a
 signal handler.

 There's no question but that using vfork() requires some care (or very
 simple operations) but that's not a reason to restrict it more than what
 it already is.

 For the PR that started all this, my suggestion would be to simply fix the
 man page, and close it.

 kre

From: Nico Williams <Nico.Williams@twosigma.com>
To: <gnats-bugs@netbsd.org>
Cc: 
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Thu, 6 Apr 2017 15:01:56 +0000

 Robert Elz <kre@munnari.OZ.AU> wrote:
 > It appears to me as if the discussion on this has largely divided
 > along traditional lines - those who (fundamentally) believe that
 > vfork() is an evil hack, and should be removed (ideally) but in the

 Well, no one seems to have said that, though several seem wedded to
 pre-threaded-model text about stopping the parent process and then extending it
 to the threaded model in the obvious, but horrible way.

 > meantime restricted in all ways possible to limit its use, so that
 > hopefully it can go away some day not too far away - and those who
 > believe vfork() is a valuable tool, which should be exploited in
 > all ways possible, and enhanced where practical to make it even more
 > useful.

 I'll take it much further: it is fork() that is EVIL, and vfork() that is GOOD.

 Here's my rationale for such an extraordinary statement:

 https://gist.github.com/nicowilliams/a8a07b0fc75df05f684c23c18d7db234

 Briefly: fork()'s copying and/or COW are terrible and would never have been
 necessary had Dennis Ritchie et al. thought of vfork()'s semantics.  In the
 beginning processes were so small that fork()'s copying was not obviously a
 horrible feature.  Besides, fork() has a ton of safety issues (which I mostly
 did not address in that gist, but I could and will) that make vfork() look like
 not so bad.  Anyone who says that vfork() is impossibly hard to use correctly
 without pointing out that fork() is in some ways harder still to use
 correctly... just hasn't had the misfortune of having to fix fork-safety
 problems in real-life, production code (I have).

 Now, vfork() is... clumsy because of the stack sharing silliness, but it
 predates threads, so its authors probably did not realize that taking a
 callback function and argument to run in a new stack would have been a superior
 design -- I forgive them this oversight because vfork() is nonetheless awesome
 goodness, and all the more so if it doesn't stop all other threads in the
 parent process.

 I didn't always think this way.

 I used to think: "haha, look at CreateProcess() on Windows, what a disaster,
 how terribly hard it is to use, and all because they lacked the elegance of
 fork()+exec()".  But that was wrong.  fork() is inelegant.  Whereas vfork()+
 exec() is quite elegant.

 Nico
 -- 

From: Nico Williams <Nico.Williams@twosigma.com>
To: <gnats-bugs@netbsd.org>
Cc: 
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Thu, 6 Apr 2017 15:26:53 +0000

 Robert Elz <kre@munnari.OZ.AU> wrote:
 > There's no need to restrict vfork() children to async signal safe operations
 > (the process limiting itself that way certainly won't hurt it, but it is

 It is certainly necessary to restrict what a fork() child can do when the
 parent process has built up a lot of state...  If the fork() parent has built
 very little state, if the fork() call comes early in its life, then it is safe
 to do just about anything on the child-side.

 OTOH, if the fork() parent has built a lot of state, possibly with various
 possibly-unknown-to-it libraries, it may not be safest to call anything but
 async-signal-safe functions.  And that's why POSIX says that's what the child
 can do.

 (Some APIs are explicitly fork-unsafe (e.g., PKCS#11); using them on the parent
 side of fork() means that one cannot use them on the child side without
 reinitializing the library or execve()'ing.)

 Heck, "async-signal-safe" is a somewhat misleading concept because fork() and
 vfork() do not atomically block signals on the child side, so a number of
 async-signal-safe functions are actually very much unsafe to call in a signal
 handler without first checking that getpid() returns the expected PID.

 With few exceptions, the only things I ever do in a signal handler are: check
 if getpid() returns the expected PID, write to sig_atomic_t global variables,
 and/or write(2) a single byte to a pipe the other end of which is handled by an
 event loop.

 "Async-signal-safe" is what POSIX calls the set of functions that it thinks
 are safe to call in signal handlers and the child side of fork().  Even if it
 is too small a set, it is useful enough a concept that we can use it to talk
 about what kinds of things are safe to do on the child side of vfork().

 > not required) - it can do anything that the parent can do that affects only
 > its internal (userland) state, or which affects purely the proc struct
 > state in the kernel (so it can close files, or change the "close on exec"
 > state, but not other file status flags).

 Certainly one can make safe use of some functions outside the async-signal-safe
 set in the fork()/vfork() child sides.  It does help to have some idea of what
 might go wrong when one does that.  POSIX defines such a set in part so that
 one can write portable code without having to know much about particular OSes.

 Thor pointed out to me yesterday that using mutexes on the child side of
 vfork() should have the same sorts of semantics and dangers as using shared
 mutexes, so one should not categorically dismiss the use of mutexes on the
 child side of vfork().  I agree with Thor on this, though I would generally
 discourage the use of shared mutexes anyways.

 Nico
 -- 

From: Nico Williams <Nico.Williams@twosigma.com>
To: <gnats-bugs@netbsd.org>
Cc: 
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Thu, 6 Apr 2017 15:36:41 +0000

 Martin Husemann <martin@duskware.de> wrote:
 > On Wed, Apr 05, 2017 at 04:26:04PM -0400, Christos Zoulas wrote:
 > > On Apr 5,  7:20pm, martin@duskware.de (Martin Husemann) wrote:
 > > -- Subject: Re: kern/49017: vfork does not suspend all threads
 > > 
 > > |  Using vfork() in a program with multiple active threads is madness,
 > > |  posix_spawn() is the only sensible way.
 > > 
 > > Why is that? We allow it, so it should do something reasonable/useful...
 > > Or we should not allow it...
 > 
 > We now have two processes, each with active threads, and a shared vmspace.
 > This sounds completely out of spec for the unix process model to me
 > and I'd call it madness.

 "sounds like"

 Everything that BSD ever did prior to the advent of POSIX and other such
 standards... was "completely out of spec" for Unix.

 That is precisely how one innovates: by stepping outside the spec.

 Your rejection of this seems emotional rather than thought out.  It happens
 because we're humans; naturally I do this too sometimes.

 I urge you to read what I've written rather than merely react to the one-line
 summary of the proposal.

 Please leave behind the idea that vfork() is dangerous.  It absolutely is not.
 Decades of experience with it bear that out:

  - posix_spawn() on Linux, Solaris, Illumos, NetBSD (before posix_spawn()
    became a system call), and other BSDs -- all use vfork() internally

  - many apps use vfork(), including, famously, csh (now, I know, csh is evil,
    but that it successfully and safely uses vfork() cannot be denied)

  - you can search online nowadays for more vfork()-using code, and you can look
    at https://github.com/famzah/popen-noshell (warning: GPLv3), including the
    long discussion I had with the author in the issues

 I've yet to see a single bug report anywhere about vfork() not stopping all
 other parent threads causing an application to break.  Objectively speaking,
 without such a report, and without POSIX saying so (it does not! POSIX removed
 vfork()), NetBSD should not make that change!

 Nico
 -- 

From: Martin Husemann <martin@duskware.de>
To: Robert Elz <kre@munnari.OZ.AU>
Cc: Christos Zoulas <christos@zoulas.com>, gnats-bugs@NetBSD.org,
	kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Thu, 6 Apr 2017 17:59:36 +0200

 On Thu, Apr 06, 2017 at 09:15:08PM +0700, Robert Elz wrote:
 > For the PR that started all this, my suggestion would be to simply fix the
 > man page, and close it.

 That gets my vote too.

 Martin

From: Nico Williams <Nico.Williams@twosigma.com>
To: <gnats-bugs@netbsd.org>
Cc: 
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Thu, 6 Apr 2017 16:57:14 +0000

 Robert Elz <kre@munnari.OZ.AU> wrote:
 > [ description of vfork_into_fork() elided ]

 That's a neat idea, but I don't think it's needed.  I can't think of why I
 would ever need it or any time that I could have used it.

 What I really want is

     pid_t avfork(int (*)(void *), void *);

 which is like vfork() but allocates a new stack, calls the given callback in it
 just like pthread_create() would, and does not stop any threads in the parent,
 not even the one that called it.  The 'a' stands for "asynchronous".

 Note that avfork() would have much the same constraints for the child as
 vfork() does, except, naturally, that the avfork() child could return while the
 vfork() child cannot.
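
 No avfork() exists; the closest existing primitive is probably clone()
 with CLONE_VM and without CLONE_VFORK. The following is only a rough
 approximation under that assumption, using the Linux/glibc clone()
 wrapper. The names avfork_sketch(), child_body() and avfork_demo() and
 the 64KB stack size are hypothetical, and the stack is assumed to grow
 downwards:

```c
#define _GNU_SOURCE		/* for clone(); must precede all includes */
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define AVFORK_STACK_SIZE (64 * 1024)

/*
 * Hypothetical avfork(): run fn(arg) in a new process that shares the
 * caller's address space but runs on its own stack.  The caller is not
 * suspended.  *stackp receives the stack, which must only be freed
 * after the child has been reaped.
 */
static pid_t
avfork_sketch(int (*fn)(void *), void *arg, void **stackp)
{
	char *stack = malloc(AVFORK_STACK_SIZE);
	pid_t pid;

	if (stack == NULL)
		return -1;
	/* Pass the top of the stack; SIGCHLD makes waitpid() work. */
	pid = clone(fn, stack + AVFORK_STACK_SIZE, CLONE_VM | SIGCHLD, arg);
	if (pid == -1) {
		free(stack);
		stack = NULL;
	}
	*stackp = stack;
	return pid;
}

/* Child body: store into shared memory, then return an exit status. */
static int
child_body(void *arg)
{
	*(volatile int *)arg = 42;
	return 7;
}

/* Returns the value the child stored; *exit_status gets its status. */
int
avfork_demo(int *exit_status)
{
	volatile int shared = 0;
	void *stack;
	int status;
	pid_t pid = avfork_sketch(child_body, (void *)&shared, &stack);

	if (pid == -1)
		return -1;
	if (waitpid(pid, &status, 0) == -1)
		return -1;
	free(stack);
	*exit_status = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
	return shared;
}
```

 With this wrapper, a callback that returns simply ends the child, and its
 return value becomes the exit status. The child still shares the parent's
 memory (and, here, its thread-local state such as errno), so the usual
 vfork()-style restraint applies to what the callback does.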

 I have written portable multi-processed daemons that build on Unix and Windows.
 What I do on Windows is I spawn() the child processes, exec'ing the same
 executable as the parent and passing in information needed by the child via
 arguments or a pipe.  On Unix I get lazy and fork(), but what I do on Windows
 would work just as well on Unix.  You can see this here:

 https://github.com/heimdal/heimdal/blob/master/lib/roken/detach.c

 Nico
 -- 

From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Fri, 07 Apr 2017 10:07:46 +0700

     Date:        Thu,  6 Apr 2017 15:05:01 +0000 (UTC)
     From:        Nico Williams <Nico.Williams@twosigma.com>
     Message-ID:  <20170406150501.356437A2B8@mollari.NetBSD.org>

   |  I'll take it much further: it is fork() that is EVIL, and vfork()
   |  that is GOOD.
   |  
   |  Here's my rationale for such an extraordinary statement:
   |  https://gist.github.com/nicowilliams/a8a07b0fc75df05f684c23c18d7db234

 All that shows is that fork() is (or can be) expensive, which is hardly
 news, nothing at all about evilness, in fact, the closest I can see that
 it comes are these sentences ...

     Since back then programs and processes were small that inelegance was
     easy to overlook. But now processes tend to be huge, and that makes
     copying even just a parent's resident set, and page table fiddling for
     the rest, extremely expensive.

 That's kind of like saying that Ferraris are evil, because they cost
 too much if all you do is drive them grocery shopping once a week...

 Being expensive (even too expensive to routinely use) is not evil, just
 not the right (or perhaps even, just not the best) choice in many cases.

 If anything is "evil" from your text (IMO) it would be "But now processes
 tend to be huge" - that is the problem, not fork().

 But fork() permits

 	if (fork() > 0)
 		_exit(0);

 which most of the other methods (I know nothing of clone(), so that one
 possibly excepted) do not - expensive perhaps (in some cases) but useful
 nevertheless.

   |  Briefly: fork()'s copying and/or COW are terrible and would never have been
   |  necessary had Dennis Ritchie et al. thought of vfork()'s semantics.

 While I suspect that fork() is really Ken's, not Dennis's (irrelevant here)
 I kind of doubt that.  First because neither of them is/was in any way
 deficient in their thinking (simply ignoring a possibility like that is
 not something I would expect) and second, because fork(), expensive or not,
 is simply far more general than vfork().

   |  Besides, fork() has a ton of safety issues (which I mostly
   |  did not address in that gist, 

 Nor anywhere else I have seen - I'm sure it is possible to write code
 badly enough that fork() would cause problems, (and it is certainly
 possible to make a mess using buffered I/O) but almost all of that is
 trivially overcome.

   |  Now, vfork() is... clumsy because of the stack sharing silliness, but it
   |  predates threads, so its authors probably did not realize that taking a
   |  callback function and argument to run in a new stack would have been a
   |  superior design

 When vfork() was designed, the total (guaranteed) address space was just
 64KB (text, data, stack, all combined).   Duplicating stacks (adding an
 extra stack - and if you want to be able to return in the child, it actually
 means copying the existing stack, while adjusting any self-referencing pointers
 that occur there)  would have been laughed away as absurd.

 In another message Nico.Williams@twosigma.com (kind of) quotes me:
    |  Robert Elz <kre@munnari.OZ.AU> wrote:
    |  > [ description of vfork_into_fork() elided ]

 and then says...

    | That's a neat idea, but I don't think it's needed.  I can't think of why I
    | would ever need it or any time that I could have used it.

 Maybe you never would have, but I know of one immediate use - that is /bin/sh

 Our sh uses vfork() whenever it can (for the obvious reason) but sometimes,
 while evaluating the code to be executed in the sub-shell, it discovers that
 it simply cannot do that after a vfork() and really needs a whole new process.

 What happens now is that the child sets a magic "do me again using fork()"
 flag (in the parent's address space, which it shares of course) and then
 exits.  The parent observes the flag, fork()'s, and the child starts all over
 again.   If the child could have simply converted its vfork() state into a
 fork() state that wacky dance would not be needed.
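
 The flag dance described above could look roughly like the following. This
 is not the actual /bin/sh code; need_real_fork, run_child() and
 run_subshell() are hypothetical names, and the sketch relies on vfork()
 really sharing the address space (as it does on NetBSD and Linux):

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Written by the vfork() child, read by the parent (shared memory). */
static volatile int need_real_fork;

/*
 * Hypothetical stand-in for "evaluate the sub-shell".  Returns the exit
 * status to use.  If the work is not safe after vfork(), it raises the
 * flag in the shared address space instead of doing the work.
 */
static int
run_child(int needs_fork, int vforked)
{
	if (needs_fork && vforked) {
		need_real_fork = 1;	/* "do me again using fork()" */
		return 0;
	}
	return needs_fork ? 2 : 1;
}

/* Returns the child's exit status, or -1 on error. */
int
run_subshell(int needs_fork)
{
	int status;
	pid_t pid;

	need_real_fork = 0;
	pid = vfork();
	if (pid == -1)
		return -1;
	if (pid == 0)
		_exit(run_child(needs_fork, 1));
	if (waitpid(pid, &status, 0) == -1)
		return -1;

	if (need_real_fork) {
		/* The vfork() child gave up: start over with fork(). */
		pid = fork();
		if (pid == -1)
			return -1;
		if (pid == 0)
			_exit(run_child(needs_fork, 0));
		if (waitpid(pid, &status, 0) == -1)
			return -1;
	}
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

 In this sketch run_subshell(0) completes in the vfork() child, while
 run_subshell(1) pays for a wasted vfork() and then a full fork().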

 Now, of course, the shell could avoid this by examining the tree of commands
 to be executed before the fork()/vfork() (it does that for the very common
 cases that will certainly require fork() rather than vfork()) but that would
 mean duplicating the whole process, initially just to discover which kind of
 fork() is required, and then again to actually do the work - for every
 sub-shell invocation (more or less every command executed that isn't a
 function) and all this for a relatively rare circumstance.

    | What I really want is
    |        pid_t avfork(int (*)(void *), void *);
    | which is like vfork() but allocates a new stack, calls the given callback
    | in it just like pthread_create() would, and does not stop any threads in
    | the parent, not even the one that called it.

 I have no objection to that, go ahead, write the code for it, and submit
 it, it sounds useful enough to consider at least.

 But...

   | Note that avfork() would have much the same constraints for the child as
   | vfork() does, except, naturally, that the avfork() child could return while
   | the vfork() child cannot.

 Return to what?   You're having it execute a callback, are you saying that
 that function can return?   Return to where exactly?   And what does that
 mean?   What would be the difference between

 	child = avfork(func, &sp);
 and
 	if ((child = avfork(&sp)) == 0) func();
 ??

 If there's none, why the need for the callback?  If avfork() cannot
 actually return in the child, so the second is not possible, then neither
 can func() right?

 kre

From: Nico Williams <Nico.Williams@twosigma.com>
To: <gnats-bugs@netbsd.org>
Cc: 
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Fri, 7 Apr 2017 16:35:04 +0000

 Robert Elz <kre@munnari.OZ.AU> wrote:
 >   |  I'll take it much further: it is fork() that is EVIL, and vfork()
 >   |  that is GOOD.
 >   |  
 >   |  Here's my rationale for such an extraordinary statement:
 >   |  https://gist.github.com/nicowilliams/a8a07b0fc75df05f684c23c18d7db234
 > 
 > All that shows is that fork() is (or can be) expensive, which is hardly
 > news, nothing at all about evilness, in fact, the closest I can see that
 > it comes are these sentences ...

 It isn't just expensive.  I didn't go into detail about fork-safety
 issues, but those are real enough, and also unnecessary.

 Fork-safety issues are a necessary result of sharing state from a
 starting snapshot of it (which is easy enough to do if you're a small
 shell, but quite difficult if you're a large process with many libraries
 loaded, many unbeknownst to the original program).

 >     [...]
 > 
 > That's kind of like saying that Ferraris are evil, because they cost
 > too much if all you do is drive them grocery shopping once a week...

 To me it's "evil" (I know, hyperbole; a lifeless tool can't really be
 evil) because of the fork-safety issues.  Admittedly I did not go into
 much detail on those.  The inherently-slow design does not exactly help
 either.

 The Unix community has basically been saying "fork() good, vfork() bad"
 for decades.  In this we have been sorely mistaken.  At the very least,
 vfork() is not bad.

 > If anything is "evil" from your text (IMO) it would be "But now processes
 > tend to be huge" - that is the problem, not fork().

 Certainly "slow" is not good.  However, layering issues involving APIs
 with shortcomings are a fact of life because we are too willing to
 re-use code.

 Layering issues in Java and similar are something else (legendary?); I
 won't go into them.  You might object to JVMs in the first place, so
 let's not go there.

 Layering issues in C can still be quite complex though!  Here's one
 case:

  - main program
    -> getXbyY() name service switch
     -> LDAP plugin
      -> OpenSSL
       -> SASL
        -> SASL GSS plugin
         -> GSS-API
          -> Kerberos
           -> OpenSSL

 Here the main program source is simple-looking, but turns out to be
 complex at run-time.  "But use nscd!"  Yes, but nscd itself looks like
 this internally.

 Slight API deficiencies in various layers in this example mean that
 passing down configuration, or intent to _exit() a fork() parent, and so
 on, is basically impossible.  One could open-code all of it to avoid
 this.  One could say "screw TLS, GSS, Kerberos, I'll use IPsec, and
 open-code everything", but IPsec is actually the hardest of these
 security protocols to use correctly, and anyways, open-coding everything
 will take a lot of time and effort.

 > But fork() permits
 > 
 >    if (fork() > 0)
 >        _exit(0);

 Yes!  This is true, this is very helpful, and you'll see I make use of
 this myself.

 Nothing, mind you, really prevents vfork() from supporting the same,
 except that the parent must block :(

 Naturally, "if (avfork() > 0) _exit(0);" would be cheaper :)

 One can also daemonize ("detach from tty", whatever) by doing vfork()
 and then exec(self).  That's effectively how one has to do such things
 on Windows due to its lack of fork() (though perhaps now with their WSL
 thing to support Ubuntu on Windows they now have a fork()??).

 I've written code that does this, including open source code (e.g., in
 Heimdal).

 >   |  Briefly: fork()'s copying and/or COW are terrible and would never have been
 >   |  necessary had Dennis Ritchie et al. thought of vfork()'s semantics.
 > 
 > While I suspect that fork() is really Ken's, not Dennis's (irrelevant here)
 > I kind of doubt that.  First because neither of them is/was in any way
 > deficient in their thinking (simply ignoring a possibility like that is

 Oh, I certainly did not mean to imply that they were!

 They are/were luminaries who gave us the best OS of its time, with the
 best derivative lineage since.  For this I am ever thankful.

 That does not mean that they can't have made mistakes (e.g., the lack of
 a "create time" for files!), including ones they simply would not have
 recognized as mistakes then, but which perhaps later it turns out could
 have been designed differently to stand the test of time.

 > not something I would expect) and second, because fork(), expensive or not,
 > is simply far more general than vfork().

 We could have done without fork().  But we could not have done without a
 monstrosity like CreateProcess() unless we had either fork() or vfork().
 Better then to have fork() than not, but even better to have vfork() to
 begin with.  vfork() was a bit of brilliance that had to come from
 outside New Jersey.

 I speculate that the brilliance of fork() in the beginning lay in making
 it easy to develop programs like shells by placing the critical process
 spawning code in user-land as opposed to kernel-land.

 >   |  Besides, fork() has a ton of safety issues (which I mostly
 >   |  did not address in that gist, 
 > 
 > Nor anywhere else I have seen - I'm sure it is possible to write code
 > badly enough that fork() would cause problems, (and it is certainly
 > possible to make a mess using buffered I/O) but almost all of that is
 > trivially overcome.

 Is this PR the right place to do this?  (A bit late to ask that, I know.)  I
 promise to write up a gist about fork-safety some time soon.

 The gist of it is this: sharing state based on a one-time snapshot +
 shared file descriptors can be devilishly difficult, if not impossible
 to do.

 A classic example is PKCS#11 and cryptography APIs in general.  Recall
 the complex layering mentioned above: there may not be a way for the
 code that calls fork() to re-setup state that cannot be shared.

 PKCS#11 explicitly says that the child-side of fork() MUST call
 C_Initialize() and lose all its previous state.  This follows in part
 because the API might internally communicate with a device (e.g., a TPM,
 smartcard, other token) via a file descriptor, and it would be difficult
 to have two processes communicate with said device over the same file
 descriptor in non-atomic ways (the fd not being anything like a
 SOCK_DGRAM fd).

 Even if you arrange to re-open the device on the child side, your open
 sessions will need to be re-logged-in!

 Even if you arrange to establish new sessions by reference to old
 sessions, some cryptographic primitives fail catastrophically when
 reused incorrectly...

 So one can use pthread_atfork() (e.g., libpkcs11 in Illumos uses it to
 automatically re-initialize on the child-side of fork()) to avoid a lot
 of these issues, but again, suppose you want to do

     if (fork() > 0)
         _exit(0);

 But how do you indicate intent to continue with pre-fork() state in the
 child and not the parent?

 If the PKCS#11 / whatever state is buried N>2 layers deep then
 indicating intent to exit the parent can be impossible to do.

 Now, PKCS#11-using libraries could be made to use pthread_atfork() to
 reestablish state on the child side of fork(), but again, some things
 can't safely be reused, so intent to exit one or the other side of
 fork() is critical.

 We could have a variant of fork() that runs the pthread_atfork() child
 handlers in the parent and the parent handlers in the child... but that
 would have other weirdness.

 So if you want to exit the parent, then the only thing that actually
 works is this: fork() early, before complex state is setup.

 This brings me to a related issue: daemon() is bad.  It's bad because
 either the parent exits before the child is ready or complex state must
 be setup before daemon() that might not survive the fork().  Oops.  A
 decade ago in Solaris/Illumos we adopted an alternative design (which I
 use in Heimdal now) where two functions are used: one that fork()s and
 has the parent wait for the child to signal readiness, and the other
 (executed in the child) that signals readiness:

     daemon_prep(); /* Returns here in the child-side of fork(); waits in
                       read(2) on a pipe in the parent */
     <setup code>
     /*
      * Tell the parent waiting inside daemon_prep() that the child is ready.
      *
      * The parent will exit.  If we exit, the parent will notice and exit with
      * an error.
      */
     daemon_ready();

 This has no fork-safety issues because all the code with fork-safety
 issues happens on the child-side of an early fork().

 This is extremely convenient:

 # kdc && echo ready
 ready
 # kinit -k && echo yes
 yes
 # 

 when you get the shell prompt back that means the service is either
 running or failed to start.  There is no way you can get the prompt back
 and the service subsequently fails to start.

 Whereas using daemon() this can happen:

 # kdc && echo ready
 ready
 # kinit || echo no
 no
 # 

 We adopted this approach in Solaris/Illumos because we replaced the SysV
 init and inetd system with a new one (SMF) that understands
 inter-service dependencies and does not want to start a service until
 its dependencies are running.  And that means needing to know precisely
 that a service has started, and the way we do that is by having the
 service's main program behave as described above.

 (SMF also has a process grouping mechanism for representing
 multi-process services.  This is used to, among other things, detect
 crashes of some such processes in order to restart the service.)

 One need not like/adopt SMF in order to appreciate/adopt the
 daemon_prep()/daemon_ready() approach.

 >   |  Now, vfork() is... clumsy because of the stack sharing silliness, but it
 >   |  predates threads, so its authors probably did not realize that taking a
 >   |  callback function and argument to run in a new stack would have been a
 >   |  superior design
 > 
 > When vfork() was designed, the total (guaranteed) address space was just
 > 64KB (text, data, stack, all combined).   Duplicating stacks (adding an
 > extra stack - and if you want to be able to return in the child, it actually
 > means copying the existing stack, while adjusting any self-referencing pointers
 > that occur there)  would have been laughed away as absurd.

 The extra stack can be tiny, since one expects the child to
 exec-or-exit.  But sure, I understand.  OTOH, copying as in fork()
 isn't exactly light on resource usage either!

 > In another message Nico.Williams@twosigma.com (kind of) quotes me:
 >    |  Robert Elz <kre@munnari.OZ.AU> wrote:
 >    |  > [ description of vfork_into_fork() elided ]
 > 
 > and then says...
 > 
 >    | That's a neat idea, but I don't think it's needed.  I can't think of why I
 >    | would ever need it or any time that I could have used it.
 > 
 > Maybe you never would have, but I know of one immediate use - that is /bin/sh
 > 
 > Our sh [...]

 Aha, thanks.  I get that a fork-me-after-all system call would simplify
 that shell.  That seems like a valid use case indeed (even if there are
 other ways to handle this).

 >    | What I really want is
 >    |        pid_t avfork(int (*)(void *), void *);
 >    | which is like vfork() but allocates a new stack, calls the given callback
 >    | in it just like pthread_create() would, and does not stop any threads in
 >    | the parent, not even the one that called it.
 > 
 > I have no objection to that, go ahead, write the code for it, and submit
 > it, it sounds useful enough to consider at least.
 > 
 > But...
 > 
 >   | Note that avfork() would have much the same constraints for the child as
 >   | vfork() does, except, naturally, that the avfork() child could return while
 >   | the vfork() child cannot.
 > 
 > Return to what?   You're having it execute a callback, are you saying that
 > that function can return?   Return to where exactly?   And what does that
 > mean?   What would be the difference between

 When main() returns, the program exits.

 When the callback function in pthread_create() returns, the thread
 exits.

 Ditto with avfork(): when the callback returns, the child process exits.

 >    child = avfork(func, &sp);
 > and
 >    if ((child = avfork(&sp)) == 0) func();
 > ??

 func() has to run in a separate stack in order to avoid having to stop the
 parent thread that called avfork().  Sharing a stack is the reason that
 the vfork() parent must stop while the child goes on.

 avfork() looks almost exactly like pthread_create() (minus pthread_attr_t).

 > If there's none, why the need for the callback?  If avfork() cannot

 The callback is the function to call on a new stack in the child.  Same as with
 pthread_create(), only creating a child process that shares the parent's
 address space just like vfork().

 avfork() is like a combination of pthread_create() and vfork().

 > actually return in the child, so the second is not possible, then neither
 > can func() right?

 The func() is expected to execve() or _exit(), just like vfork()
 children.  But it can also return since it is a C function!  And just
 like main(), if it returns, the process (the child in this case) exits.
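
 To make the "returning means exiting" contract concrete, here is a
 portable emulation of the proposed interface on top of plain fork().
 It is only a sketch of the calling convention: unlike the avfork()
 described above, it copies the address space instead of sharing it, and
 it does not allocate a separate stack.  avfork_emulated() and the two
 helpers are hypothetical names.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Emulation of the proposed calling convention using fork().
 * The callback's return value becomes the child's exit status,
 * just as main()'s return value becomes the program's.
 */
pid_t
avfork_emulated(int (*fn)(void *), void *arg)
{
	pid_t pid = fork();

	if (pid == 0)
		_exit(fn(arg));
	return pid;	/* child's pid in the parent, or -1 on error */
}

/* Example callback: simply returns the code it was handed. */
int
return_code_cb(void *arg)
{
	return *(int *)arg;
}

/* Demo helper: run the callback in a child and collect its exit status. */
int
run_and_collect(int code)
{
	int status;
	pid_t pid = avfork_emulated(return_code_cb, &code);

	if (pid == -1)
		return -1;
	if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

 A callback that wants to start another program would call execve()
 instead of returning, in which case the _exit() is never reached.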

 Nico
 -- 

From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Sat, 08 Apr 2017 22:19:36 +0700

     Date:        Fri,  7 Apr 2017 20:40:01 +0000 (UTC)
     From:        Nico Williams <Nico.Williams@twosigma.com>
     Message-ID:  <20170407204001.1F0147A2B8@mollari.NetBSD.org>

   |  Fork-safety issues are a necessary result of sharing state from a
   |  starting snapshot of it

 But that kind of issue is the same whatever kind of fork() (with the
 possible exception of the current vfork()) is used, there's nothing in
 any of that which is peculiar to fork() over avfork() or perhaps even
 lwp_create().

   |  Better then to have fork() than not, but even better to have vfork() to
   |  begin with.

 I disagree with that, fork() is essential, vfork() is a nice optimisation
 that is sometimes useful.

   |  That does not mean that they can't have made mistakes (e.g., the lack of
   |  a "create time" for files!),

 You really don't want to get me started on that one ... "create time"
 (aka the current birthtime in UFS2) is the greatest crock of sh*t of
 all time.   I have never yet been able to find anyone who could explain
 a use case for that nonsense that actually corresponds to anything that
 has ever been implemented, or is even implementable.

 That is it is easy to explain what would be useful for a create time,
 but no-one has ever implemented it in a way that those uses work, and
 it is probably impossible (since much of what is actually desired depends
 on intangibles of what is going on inside the user's head.)   On the other
 hand, as the current UFS2 illustrates, implementing something called
 a birthtime (or create time) is easy, it just doesn't correspond to
 anything actually useful in practice (which is why it is the most
 underused filesystem feature of all - probably the least used feature of
 the whole system, including the exotic stuff.)

   |  Is this PR right place to do this?

 No...


   |  When main() returns, the program exits.
   |  When the callback function in pthread_create() returns, the thread
   |  exits.

 That's not what we mean when we say the child of a vfork() cannot return,
 what we mean is that it cannot unwind its stack, if one says the child
 can return (as it can after fork()) then the two processes can continue
 in parallel, each doing their own thing (as much as that makes sense to
 the logic).  "You can return, but that means you exit" is not particularly
 useful.

 kre

From: Nico Williams <Nico.Williams@twosigma.com>
To: <gnats-bugs@netbsd.org>
Cc: 
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Mon, 10 Apr 2017 00:38:43 +0000

 Robert Elz <kre@munnari.OZ.AU> wrote:
 >   |  Fork-safety issues are a necessary result of sharing state from a
 >   |  starting snapshot of it
 > 
 > But that kind of issue is the same whatever kind of fork() (with the
 > possible exception of the current vfork()) is used, there's nothing in
 > any of that which is peculiar to fork() over avfork() or perhaps even
 > lwp_create().

 vfork() has no fork-safety issues: the child must execve() or _exit(),
 and the restrictions on what functions it can call before then make it
 very difficult to do anything other than set up the execve().  The
 vfork() child certainly can't make any PKCS#11 function calls or what
 have you.

 avfork() would have the same execve()-or-_exit() requirement as
 vfork(), except _perhaps_ modified as "child-must-execve()-or-_exit()-
 OR-parent-must-_exit()".

 Any library code that calls getpid() to discover forks will be tripped
 though in the avfork() child, so I think we can't really make that
 relaxation.  But aside from that, if the parent _exit()s, then the
 avfork() child should be free from fork-safety concerns.
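
 The getpid() idiom in question looks roughly like this.  It is a sketch
 of the general pattern, with hypothetical names (lib_open(),
 lib_forked(), forked_as_seen_by_child()), not code taken from any
 particular library.  The demo uses fork(); the same check would fire in
 an avfork() child too, since that child also has its own pid.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Pid of the process that initialized the (hypothetical) library. */
static pid_t lib_owner_pid;

/* Remember which process set up the library's state. */
void
lib_open(void)
{
	lib_owner_pid = getpid();
}

/* Nonzero if we are no longer the process that called lib_open(). */
int
lib_forked(void)
{
	return getpid() != lib_owner_pid;
}

/* Demo helper: what does lib_forked() report in a child?  -1 on error. */
int
forked_as_seen_by_child(void)
{
	int status;
	pid_t pid = fork();

	if (pid == -1)
		return -1;
	if (pid == 0)
		_exit(lib_forked());
	if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```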

 >   |  Better then to have fork() than not, but even better to have vfork() to
 >   |  begin with.
 > 
 > I disagree with that, fork() is essential, vfork() is a nice optimisation
 > that is sometimes useful.

 posix_spawn()/_spawn()/CreateProcess() demonstrates that fork() isn't
 essential.  fork() was essential to speeding up development of shells by
 moving the spawning code into user-land -- that's my theory.
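
 For comparison, starting a program without any fork() in user code can
 look like the following sketch.  It uses posix_spawnp() and assumes an
 "sh" can be found on PATH; spawn_and_wait() and spawn_shell_exit() are
 hypothetical helper names.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Spawn argv[0] (looked up on PATH) and return its exit status, -1 on error. */
int
spawn_and_wait(char *const argv[])
{
	pid_t pid;
	int status;

	/* No file actions and no attributes: the child inherits our fds as-is. */
	if (posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ) != 0)
		return -1;
	if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}

/* Demo helper: run a shell that exits with status 3. */
int
spawn_shell_exit(void)
{
	char *argv[] = { "sh", "-c", "exit 3", NULL };

	return spawn_and_wait(argv);
}
```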

 >   |  That does not mean that they can't have made mistakes (e.g., the lack of
 >   |  a "create time" for files!),
 > 
 > You really don't want to get me started on that one ... "create time"
 > (aka the current birthtime in UFS2) is the greatest crock of sh*t of
 > all time.   I have never yet been able to find anyone who could explain
 > a use case for that nonsense that actually corresponds to anything that
 > has ever been implemented, or is even implementable.

 Really?  Maybe I'll ask you off-list.  I'm quite curious.

 >   |  Is this PR right place to do this?
 > 
 > No...

 Agreed.

 >   |  When main() returns, the program exits.
 >   |  When the callback function in pthread_create() returns, the thread
 >   |  exits.
 > 
 > That's not what we mean when we say the child of a vfork() cannot return,
 > what we mean is that it cannot unwind its stack, if one says the child

 Yes, but the reason it can't return is the shared stack.  avfork() would
 have no shared stack, therefore it wouldn't have that problem.

 > can return (as it can after fork()) then the two processes can continue
 > in parallel, each doing their own thing (as much as that makes sense to
 > the logic).  "You can return, but that means you exit" is not particularly
 > useful.

 It's what happens with main() and pthread_create().  It's quite
 sensible.

 Nico
 -- 

Copyright © 1994-2014 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.