NetBSD Problem Report #59255
From www@netbsd.org Sun Apr 6 14:16:11 2025
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
key-exchange X25519 server-signature RSA-PSS (2048 bits)
client-signature RSA-PSS (2048 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 070DA1A9239
for <gnats-bugs@gnats.NetBSD.org>; Sun, 6 Apr 2025 14:16:11 +0000 (UTC)
Message-Id: <20250406141609.C729D1A923C@mollari.NetBSD.org>
Date: Sun, 6 Apr 2025 14:16:09 +0000 (UTC)
From: campbell+netbsd@mumble.net
Reply-To: campbell+netbsd@mumble.net
To: gnats-bugs@NetBSD.org
Subject: tests/lib/librumpclient/t_exec: intermittent failures
X-Send-Pr-Version: www-1.0
>Number: 59255
>Category: misc
>Synopsis: tests/lib/librumpclient/t_exec: intermittent failures
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: misc-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sun Apr 06 14:20:00 +0000 2025
>Last-Modified: Mon Apr 07 16:00:02 +0000 2025
>Originator: Taylor R Campbell
>Release: current
>Organization:
The RumpBSD Execution
>Environment:
>Description:
Various test cases in tests/lib/librumpclient/t_exec have been intermittently failing for a while:
https://releng.netbsd.org/b5reports/i386/2025/2025.03.02.08.14.26/test.html#lib_librumpclient_t_exec_vfork
https://releng.netbsd.org/b5reports/i386/2025/2025.03.02.14.13.22/test.html#lib_librumpclient_t_exec_vfork
https://releng.netbsd.org/b5reports/i386/2025/2025.03.04.00.41.00/test.html#lib_librumpclient_t_exec_vfork
https://releng.netbsd.org/b5reports/i386/2025/2025.03.04.16.40.46/test.html#lib_librumpclient_t_exec_vfork
https://releng.netbsd.org/b5reports/i386/2025/2025.03.08.19.09.46/test.html#lib_librumpclient_t_exec_threxec
https://releng.netbsd.org/b5reports/i386/2025/2025.03.09.18.50.20/test.html#lib_librumpclient_t_exec_vfork
https://releng.netbsd.org/b5reports/i386/2025/2025.03.09.18.58.18/test.html#lib_librumpclient_t_exec_cloexec
https://releng.netbsd.org/b5reports/i386/2025/2025.03.09.22.06.28/test.html#lib_librumpclient_t_exec_cloexec
https://releng.netbsd.org/b5reports/i386/2025/2025.03.10.05.06.02/test.html#lib_librumpclient_t_exec_vfork
https://releng.netbsd.org/b5reports/i386/2025/2025.03.11.05.48.26/test.html#lib_librumpclient_t_exec_vfork
https://releng.netbsd.org/b5reports/i386/2025/2025.03.11.14.13.45/test.html#lib_librumpclient_t_exec_cloexec
https://releng.netbsd.org/b5reports/i386/2025/2025.03.12.07.57.05/test.html#lib_librumpclient_t_exec_vfork
https://releng.netbsd.org/b5reports/i386/2025/2025.03.14.06.40.51/test.html#lib_librumpclient_t_exec_vfork
https://releng.netbsd.org/b5reports/i386/2025/2025.03.14.18.50.03/test.html#lib_librumpclient_t_exec_threxec
https://releng.netbsd.org/b5reports/i386/2025/2025.03.18.07.58.09/test.html#lib_librumpclient_t_exec_vfork
https://releng.netbsd.org/b5reports/i386/2025/2025.03.19.18.15.27/test.html#lib_librumpclient_t_exec_cloexec
https://releng.netbsd.org/b5reports/i386/2025/2025.03.21.07.09.58/test.html#lib_librumpclient_t_exec_vfork
https://releng.netbsd.org/b5reports/i386/2025/2025.03.24.00.13.58/test.html#lib_librumpclient_t_exec_vfork
https://releng.netbsd.org/b5reports/i386/2025/2025.03.26.00.05.56/test.html#lib_librumpclient_t_exec_vfork
https://releng.netbsd.org/b5reports/i386/2025/2025.03.29.11.51.54/test.html#lib_librumpclient_t_exec_threxec
https://releng.netbsd.org/b5reports/i386/2025/2025.03.29.17.29.20/test.html#lib_librumpclient_t_exec_vfork
https://releng.netbsd.org/b5reports/i386/2025/2025.03.29.21.45.08/test.html#lib_librumpclient_t_exec_vfork
https://releng.netbsd.org/b5reports/i386/2025/2025.03.30.14.13.59/test.html#lib_librumpclient_t_exec_threxec
https://releng.netbsd.org/b5reports/i386/2025/2025.03.30.16.23.13/test.html#lib_librumpclient_t_exec_threxec
https://releng.netbsd.org/b5reports/i386/2025/2025.03.31.13.03.23/test.html#lib_librumpclient_t_exec_vfork
https://releng.netbsd.org/b5reports/i386/2025/2025.03.31.14.46.42/test.html#lib_librumpclient_t_exec_threxec
https://releng.netbsd.org/b5reports/i386/2025/2025.04.01.23.02.29/test.html#lib_librumpclient_t_exec_vfork
https://releng.netbsd.org/b5reports/i386/2025/2025.04.02.17.44.07/test.html#lib_librumpclient_t_exec_vfork
https://releng.netbsd.org/b5reports/i386/2025/2025.04.03.14.51.37/test.html#lib_librumpclient_t_exec_vfork
https://releng.netbsd.org/b5reports/i386/2025/2025.04.03.17.49.49/test.html#lib_librumpclient_t_exec_vfork
https://releng.netbsd.org/b5reports/i386/2025/2025.04.04.18.57.01/test.html#lib_librumpclient_t_exec_vfork
https://releng.netbsd.org/b5reports/i386/2025/2025.04.06.03.33.51/test.html#lib_librumpclient_t_exec_cloexec
>How-To-Repeat:
cd /usr/tests/lib/librumpclient
atf-run t_exec | atf-report
>Fix:
Yes, please!
>Audit-Trail:
From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: misc/59255: tests/lib/librumpclient/t_exec: intermittent failures
Date: Mon, 07 Apr 2025 02:00:06 +0700
Date: Sun, 6 Apr 2025 14:20:01 +0000 (UTC)
From: campbell+netbsd@mumble.net
Message-ID: <20250406142001.10B3B1A923E@mollari.NetBSD.org>
| Various test cases in tests/lib/librumpclient/t_exec have
| been intermittently failing for a while:
Probably forever. I took a look at this one in particular a
while ago, there looks to be a fairly obvious race condition in
the test - it vforks, then the child and parent each set argv[0]
and re-exec the test file again - the test case looks to see that
both processes have a shared socket open - or something like that.
But that test happens when the parent exits, simply assuming that by
that time the child will have had time to establish its setup (since,
being vfork(), it has to exec() or exit() before the parent gets
control back from the vfork()), and often, that works, but not always.
When it doesn't, when the test code (the script) looks to see the state
of the sockets it isn't what it is expecting.
I would have fixed it at the time (a few months ago now), but I couldn't
determine exactly what the test was supposed to be testing, and different
ways to overcome the problem might break whatever that was supposed to
be (rendering the test even more useless than it currently is).
If I were to do anything, I'd probably just delete the whole test as
being basically useless.
I didn't examine the other test cases (some of which also intermittently
fail) but I'd be not at all surprised to see much of the same from them
(except perhaps the very basic one which doesn't do almost anything, and
probably never fails).
kre
From: Taylor R Campbell <riastradh@NetBSD.org>
To: Robert Elz <kre@munnari.OZ.AU>
Cc: gnats-bugs@netbsd.org, misc-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: misc/59255: tests/lib/librumpclient/t_exec: intermittent failures
Date: Mon, 7 Apr 2025 12:51:52 +0000
This is a multi-part message in MIME format.
--=_rFLsIlDMAmZS54PhaM3eE2ii5a8qA3Zz
> Date: Mon, 07 Apr 2025 02:00:06 +0700
> From: Robert Elz <kre@munnari.OZ.AU>
>
> Date: Sun, 6 Apr 2025 14:20:01 +0000 (UTC)
> From: campbell+netbsd@mumble.net
> Message-ID: <20250406142001.10B3B1A923E@mollari.NetBSD.org>
>
>
> | Various test cases in tests/lib/librumpclient/t_exec have
> | been intermittently failing for a while:
>
> Probably forever. I took a look at this one in particular a
> while ago, there looks to be a fairly obvious race condition in
> the test - it vforks, then the child and parent each set argv[0]
> and re-exec the test file again - the test case looks to see that
> both processes have a shared socket open - or something like that.
>
> But that test happens when the parent exits, simply assuming that by
> that time the child will have had time to establish its setup (since,
> being vfork(), it has to exec() or exit() before the parent gets
> control back from the vfork()), and often, that works, but not always.
> When it doesn't, when the test code (the script) looks to see the state
> of the sockets it isn't what it is expecting.
The sequence of events is something like this:
1. vfork()
2. vfork returns in child
3. child sets childran=1
4. child sends HANDSHAKE_FORK to rump_server
5. server sets up file descriptors
6. child receives HANDSHAKE_FORK reply from rump_server
7. child calls rumpclient_exec
8. child calls execve (no communication with rump_server)
At this point, the child and parent run in parallel:
9(a). child calls rumpclient_init -> lwproc_execnotify to tell
rump_server its p_comm (and whatever else, like closing
O_CLOEXEC descriptors)
9(b). vfork returns in parent, parent exits, test runs rump.sockstat
The test fails if 9(b) runs before 9(a) so rump.sockstat still shows
the old p_comm rather than the new p_comm.
We can ensure these are sequenced, preserving the non-rumpy vfork(2)
semantics, by creating a pipe shared between parent and child. The
attached patch implements this.
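
The handshake can be sketched outside rump in plain POSIX C. This is
only an illustration under stated assumptions, not the patch itself:
wait_for_child_ready and the END_* names are hypothetical, the sketch
uses fork() and no exec so that it stays self-contained, and the child's
post-exec setup is stood in for by a comment. Like the patch, it uses a
socketpair rather than a pipe(2) so the child can send with MSG_NOSIGNAL.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical names for the two ends of the socketpair. */
enum { END_PARENT = 0, END_CHILD = 1 };

/*
 * Fork a child and return in the parent only once the child has
 * reported that its setup is done, or has gone away without reporting.
 * Returns 1 if the child notified, 0 if it exited silently, -1 on error.
 */
static int
wait_for_child_ready(int child_notifies)
{
	int fd[2];
	char c = 0;
	ssize_t n;
	pid_t pid;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, fd) == -1)
		return -1;

	switch (pid = fork()) {
	case -1:
		(void)close(fd[END_PARENT]);
		(void)close(fd[END_CHILD]);
		return -1;
	case 0:
		/* Child: drop the parent's end, then do the setup. */
		(void)close(fd[END_PARENT]);
		/* ... the post-exec setup would happen here ... */
		if (child_notifies)
			(void)send(fd[END_CHILD], &c, 1, MSG_NOSIGNAL);
		(void)close(fd[END_CHILD]);
		_exit(0);
	default:
		/*
		 * Parent: close our copy of the child's end first, so
		 * that read() sees EOF instead of hanging if the child
		 * exits without writing.
		 */
		(void)close(fd[END_CHILD]);
		n = read(fd[END_PARENT], &c, 1);
		(void)close(fd[END_PARENT]);
		(void)waitpid(pid, NULL, 0);
		return n == 1 ? 1 : (n == 0 ? 0 : -1);
	}
}
```

Calling wait_for_child_ready(1) returns 1 once the byte arrives, and
wait_for_child_ready(0) returns 0 because the parent reads EOF when the
child exits without writing. The second case is why the parent must
close the child's end before reading.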
(I have not been able to reproduce the failure at all in a VM on my
laptop, though, after thousands of trials, so I can't confirm it
eliminates the symptom.)
That said, I'm not entirely sure that p_comm access is _guaranteed_ to
be ready by the time a vforked execve(2) wakes the parent. There is a
similar question about psstrings with posix_spawn:
https://gnats.netbsd.org/59175
But it's not really that costly to add this additional logic to
rumpclient to dispense with the question altogether; it's more for
testing and experiments than performance.
--=_rFLsIlDMAmZS54PhaM3eE2ii5a8qA3Zz
Content-Type: text/plain; charset="ISO-8859-1"; name="pr59255-rumpvforkexecwait"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename="pr59255-rumpvforkexecwait.patch"
# HG changeset patch
# User Taylor R Campbell <riastradh@NetBSD.org>
# Date 1744029106 0
# Mon Apr 07 12:31:46 2025 +0000
# Branch trunk
# Node ID a51ca88c1069b743be56446e0cc8b6e102719bec
# Parent cce289282f347391206ac4392309e42775902465
# EXP-Topic riastradh-pr59255-rumpexectests
librumpclient: Make rumpclient_vfork wait for child to finish exec.
`Finish exec' here means making it into rumpclient_init past
lwproc_execnotify.
No change to rumpclient_fork, other than reserving thread-local
storage for one int object to pass a file descriptor from
rumpclient_vfork to rumpclient_exec.
PR misc/59255: tests/lib/librumpclient/t_exec: intermittent failures
diff -r cce289282f34 -r a51ca88c1069 distrib/sets/lists/base/shl.mi
--- a/distrib/sets/lists/base/shl.mi Mon Apr 07 01:54:02 2025 +0000
+++ b/distrib/sets/lists/base/shl.mi Mon Apr 07 12:31:46 2025 +0000
@@ -90,7 +90,7 @@
./lib/libradius.so.5.0 base-sys-shlib dynamicroot
./lib/librumpclient.so base-sys-shlib dynamicroot,rump
./lib/librumpclient.so.0 base-sys-shlib dynamicroot,rump
-./lib/librumpclient.so.0.0 base-sys-shlib dynamicroot,rump
+./lib/librumpclient.so.0.1 base-sys-shlib dynamicroot,rump
./lib/librumpres.so base-sys-shlib dynamicroot,rump
./lib/librumpres.so.0 base-sys-shlib dynamicroot,rump
./lib/librumpres.so.0.0 base-sys-shlib dynamicroot,rump
diff -r cce289282f34 -r a51ca88c1069 distrib/sets/lists/debug/shl.mi
--- a/distrib/sets/lists/debug/shl.mi Mon Apr 07 01:54:02 2025 +0000
+++ b/distrib/sets/lists/debug/shl.mi Mon Apr 07 12:31:46 2025 +0000
@@ -29,7 +29,7 @@
./usr/libdata/debug/lib/libprop.so.1.2.debug comp-sys-debug debug,dynami=
croot
./usr/libdata/debug/lib/libpthread.so.1.5.debug comp-sys-debug debug,dyn=
amicroot
./usr/libdata/debug/lib/libradius.so.5.0.debug comp-sys-debug debug,dyna=
microot
-./usr/libdata/debug/lib/librumpclient.so.0.0.debug comp-rump-debug debug,=
dynamicroot,rump
+./usr/libdata/debug/lib/librumpclient.so.0.1.debug comp-rump-debug debug,=
dynamicroot,rump
./usr/libdata/debug/lib/librumpres.so.0.0.debug comp-rump-debug debug,dy=
namicroot,rump
./usr/libdata/debug/lib/libterminfo.so.2.0.debug comp-sys-debug debug,dyn=
amicroot
./usr/libdata/debug/lib/libumem.so.0.0.debug comp-zfs-debug debug,dynami=
croot,zfs
diff -r cce289282f34 -r a51ca88c1069 lib/librumpclient/rumpclient.c
--- a/lib/librumpclient/rumpclient.c Mon Apr 07 01:54:02 2025 +0000
+++ b/lib/librumpclient/rumpclient.c Mon Apr 07 12:31:46 2025 +0000
@@ -83,6 +83,7 @@
=20
#define HOSTOPS
int (*host_socket)(int, int, int);
+int (*host_socketpair)(int, int, int, int *);
int (*host_close)(int);
int (*host_connect)(int, const struct sockaddr *, socklen_t);
int (*host_fcntl)(int, int, ...);
@@ -121,9 +122,13 @@ static struct spclient clispc =3D {
static int holyfd =3D -1;
static sigset_t fullset;
=20
+static __thread int waitforexecnotifyparentfd =3D -1;
+
static int doconnect(void);
static int handshake_req(struct spclient *, int, void *, int, bool);
=20
+static void waitforexec_notify(int);
+
/*
* Default: don't retry. Most clients can't handle it
* (consider e.g. fds suddenly going missing).
@@ -864,6 +869,7 @@ rumpclient_init(void)
int error;
int rv =3D -1;
int hstype;
+ int notifyparentfd =3D -1;
pid_t mypid;
=20
/*
@@ -911,6 +917,7 @@ rumpclient_init(void)
FINDSYM(socket)
#endif
=20
+ FINDSYM(socketpair)
FINDSYM(close)
FINDSYM(connect)
FINDSYM(fcntl)
@@ -960,7 +967,8 @@ rumpclient_init(void)
goto out;
=20
if ((p =3D getenv("RUMPCLIENT__EXECFD")) !=3D NULL) {
- sscanf(p, "%d,%d", &clispc.spc_fd, &holyfd);
+ sscanf(p, "%d,%d,%d", &clispc.spc_fd, &holyfd,
+ &notifyparentfd);
unsetenv("RUMPCLIENT__EXECFD");
hstype =3D HANDSHAKE_EXEC;
} else {
@@ -980,6 +988,12 @@ rumpclient_init(void)
}
rv =3D 0;
=20
+ /*
+ * Notify the parent that exec has completed, in case it is
+ * waiting on vfork.
+ */
+ waitforexec_notify(notifyparentfd);
+
out:
if (rv =3D=3D -1)
init_done =3D 0;
@@ -1023,6 +1037,129 @@ rumpclient_prefork(void)
return rpf;
}
=20
+#define WAITFOREXECFD_PARENT 0
+#define WAITFOREXECFD_CHILD 1
+
+/*
+ * rumpclient_waitforexec_prepare(waitforexecfd)
+ *
+ * Called from the parent before forking when the parent wants to
+ * wait for exec to complete in the child, for vfork(2) semantics.
+ * Initialize waitforexecfd[0] and waitforexecfd[1] with file
+ * descriptors for use with rumpclient_waitforexec_parent/child
+ * (or rumpclient_waitforexec_cancel).
+ */
+int
+rumpclient_waitforexec_prepare(int waitforexecfd[2])
+{
+
+ return host_socketpair(PF_LOCAL, SOCK_STREAM, 0, waitforexecfd);
+}
+
+/*
+ * rumpclient_waitforexec_cancel(waitforexecfd)
+ *
+ * Called from the parent after rumpclient_waitforexec_prepare if
+ * fork/vfork fails.
+ */
+void
+rumpclient_waitforexec_cancel(int waitforexecfd[2])
+{
+
+ (void)host_close(waitforexecfd[WAITFOREXECFD_PARENT]);
+ (void)host_close(waitforexecfd[WAITFOREXECFD_CHILD]);
+}
+
+/*
+ * rumpclient_waitforexec_child(waitforexecfd)
+ *
+ * Called by the child between rumpclient_vfork and
+ * rumpclient_exec with the fds created by
+ * rumpclient_waitforexec_prepare.
+ */
+void
+rumpclient_waitforexec_child(int waitforexecfd[2])
+{
+
+ /*
+ * Close the parent's fd -- we don't need it any more.
+ */
+ (void)host_close(waitforexecfd[WAITFOREXECFD_PARENT]);
+
+ /*
+ * Record the fd we will use to notify the parent after exec,
+ * passed through the RUMPCLIENT__EXECFD environment variable.
+ *
+ * We are running as the single thread of a child process, but
+ * we still share address space with the parent, so it's really
+ * multithreaded. waitforexecnotifyparentfd is a thread-local
+ * variable to obviate any need for serialization here.
+ */
+ waitforexecnotifyparentfd =3D waitforexecfd[WAITFOREXECFD_CHILD];
+}
+
+/*
+ * rumpclient_waitforexec_parent(waitforexecfd)
+ *
+ * Called by the parent after rumpclient_vfork to wait for the
+ * child to call rumpclient_exec, or exit.
+ */
+void
+rumpclient_waitforexec_parent(int waitforexecfd[2])
+{
+ char c;
+
+ /*
+ * Close the child's end for writing to us so we won't hang
+ * forever if the child exits without writing anything. Next,
+ * read from our end to wait until the child has written or
+ * exited.
+ */
+ (void)host_close(waitforexecfd[WAITFOREXECFD_CHILD]);
+ (void)host_read(waitforexecfd[WAITFOREXECFD_PARENT], &c, 1);
+
+ /*
+ * Close our end now that we've read from it, and out of
+ * paranoia, clear it out of waitforexecnotifyparentfd.
+ *
+ * We must wait until _after_ the read to clear out
+ * waitforexecnotifyparentfd because the child shares address
+ * space until it execs.
+ */
+ (void)host_close(waitforexecfd[WAITFOREXECFD_PARENT]);
+ waitforexecnotifyparentfd =3D -1;
+}
+
+/*
+ * waitforexec_notify(notifyparentfd)
+ *
+ * Called by the child after exec. Wakes the parent waiting in
+ * rumpclient_waitforexec_parent, if any.
+ */
+static void
+waitforexec_notify(int notifyparentfd)
+{
+ char b =3D 0;
+ struct iovec iov =3D { .iov_base =3D &b, .iov_len =3D 1 };
+ struct msghdr msg =3D { .msg_iov =3D &iov, .msg_iovlen =3D 1 };
+
+ /*
+ * If there's no notifyparentfd, because the child was created
+ * with fork rather than vfork, nothing to do.
+ */
+ if (notifyparentfd =3D=3D -1)
+ return;
+
+ /*
+ * Parent is waiting for exec to complete. Notify them that
+ * exec has completed -- but if the parent has died, don't
+ * raise SIGPIPE; just move on. After this we have no more
+ * need of the connection, so close the fd.
+ */
+ (void)host_sendmsg(notifyparentfd, &msg, MSG_NOSIGNAL);
+ (void)host_close(notifyparentfd);
+}
+
int
rumpclient_fork_init(struct rumpclient_fork *rpf)
{
@@ -1148,7 +1285,7 @@ pid_t
rumpclient_fork(void)
{
=20
- return rumpclient__dofork(fork);
+ return rumpclient__dofork(fork, /*waitforexec*/0);
}
=20
/*
@@ -1166,8 +1303,8 @@ rumpclient_exec(const char *path, char *
size_t nelem;
int rv, sverrno;
=20
- snprintf(buf, sizeof(buf), "RUMPCLIENT__EXECFD=3D%d,%d",
- clispc.spc_fd, holyfd);
+ snprintf(buf, sizeof(buf), "RUMPCLIENT__EXECFD=3D%d,%d,%d",
+ clispc.spc_fd, holyfd, waitforexecnotifyparentfd);
envstr =3D malloc(strlen(buf)+1);
if (envstr =3D=3D NULL) {
return ENOMEM;
diff -r cce289282f34 -r a51ca88c1069 lib/librumpclient/rumpclient.h
--- a/lib/librumpclient/rumpclient.h Mon Apr 07 01:54:02 2025 +0000
+++ b/lib/librumpclient/rumpclient.h Mon Apr 07 12:31:46 2025 +0000
@@ -48,7 +48,7 @@ typedef RUMP_REGISTER_T register_t;
=20
struct rumpclient_fork;
=20
-#define rumpclient_vfork() rumpclient__dofork(vfork)
+#define rumpclient_vfork() rumpclient__dofork(vfork, /*waitforexec*/1)
=20
#ifdef __BEGIN_DECLS
__BEGIN_DECLS
@@ -63,6 +63,10 @@ struct rumpclient_fork *rumpclient_prefo
int rumpclient_fork_init(struct rumpclient_fork *);
void rumpclient_fork_cancel(struct rumpclient_fork *);
void rumpclient_fork_vparent(struct rumpclient_fork *);
+int rumpclient_waitforexec_prepare(int[2]);
+void rumpclient_waitforexec_cancel(int[2]);
+void rumpclient_waitforexec_child(int[2]);
+void rumpclient_waitforexec_parent(int[2]);
=20
pid_t rumpclient_fork(void);
int rumpclient_exec(const char *, char *const [], char *const[]);
@@ -86,21 +90,31 @@ int rumpclient__closenotify(int *, enum=20
* run in the caller's stackframe.
*/
static __attribute__((__always_inline__)) __returns_twice inline pid_t
-rumpclient__dofork(pid_t (*forkfn)(void))
+rumpclient__dofork(pid_t (*forkfn)(void), int waitforexec)
{
struct rumpclient_fork *rf;
pid_t pid;
int childran =3D 0;
+ int waitforexecfd[2];
=20
if (!(rf =3D rumpclient_prefork()))
return -1;
- =20
+ if (waitforexec) {
+ if (rumpclient_waitforexec_prepare(waitforexecfd) =3D=3D -1) {
+ rumpclient_fork_cancel(rf);
+ return -1;
+ }
+ }
switch ((pid =3D forkfn())) {
case -1:
+ if (waitforexec)
+ rumpclient_waitforexec_cancel(waitforexecfd);
rumpclient_fork_cancel(rf);
break;
case 0:
childran =3D 1;
+ if (waitforexec)
+ rumpclient_waitforexec_child(waitforexecfd);
if (rumpclient_fork_init(rf) =3D=3D -1)
pid =3D -1;
break;
@@ -108,6 +122,8 @@ rumpclient__dofork(pid_t (*forkfn)(void)
/* XXX: multithreaded vforker? do they exist? */
if (childran)
rumpclient_fork_vparent(rf);
+ if (waitforexec)
+ rumpclient_waitforexec_parent(waitforexecfd);
break;
}
=20
diff -r cce289282f34 -r a51ca88c1069 lib/librumpclient/shlib_version
--- a/lib/librumpclient/shlib_version Mon Apr 07 01:54:02 2025 +0000
+++ b/lib/librumpclient/shlib_version Mon Apr 07 12:31:46 2025 +0000
@@ -1,4 +1,4 @@
# $NetBSD: shlib_version,v 1.1 2010/11/04 21:01:29 pooka Exp $
#
major=3D0
-minor=3D0
+minor=3D1
--=_rFLsIlDMAmZS54PhaM3eE2ii5a8qA3Zz--
From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: misc/59255: tests/lib/librumpclient/t_exec: intermittent failures
Date: Mon, 07 Apr 2025 22:56:03 +0700
Date: Mon, 7 Apr 2025 12:55:01 +0000 (UTC)
From: "Taylor R Campbell via gnats" <gnats-admin@NetBSD.org>
Message-ID: <20250407125501.54C371A923C@mollari.NetBSD.org>
| The sequence of events is something like this:
| 9(b). vfork returns in parent, parent exits, test runs rump.sockstat
As I recall it (9(b)) is a little more complicated, but that's just
incidental details, in essence exactly, and
| The test fails if 9(b) runs before 9(a) so rump.sockstat still shows
| the old p_comm rather than the new p_comm.
Yes, that was my conclusion. On most systems this is probably rare, the
child is already using the CPU, and would normally just keep on running,
while the parent has been sleeping and needs to get itself scheduled.
That's likely why you can't make it fail in local tests. b5 is something
of an unusual environment - I haven't attempted to look, but it could be
that the probability of failure is higher when b5 is simultaneously
doing several other parallel builds/test runs when the test is run, and
much less likely to fail when it is (for b5) relatively idle (or even
perhaps vice versa).
| We can ensure these are sequenced, preserving the non-rumpy vfork(2)
| semantics, by creating a pipe shared between parent and child. The
| attached patch implements this.
That's one way - what's needed is some way for the child to inform the
parent that it has completed its task, and is ready for the script to
test the results. A pipe can achieve that, so could sending a (caught)
signal from the child to the parent (which would not require any kind of
detour via rump). There are other more heavyweight possibilities.
But before doing any of that I think we really need to understand the
purpose of the test. If it is to test that sockstat can get owner info
from sockets, that can be done with a much simpler test. If it is to
test that vfork works, that can also be done with a much simpler test
(the one that is there now would be satisfied by fork() instead, I believe,
whereas we need vfork() to have vfork() properties and not just be
fork()). If it is to test that exec passes args that can be parsed, that
can also be done with a much simpler test.
I just cannot fathom what the test is actually testing. Without that
what ought be done to it remains mysterious, and is why I just gave up
on looking at it.
| That said, I'm not entirely sure that p_comm access is _guaranteed_ to
| be ready by the time a vforked execve(2) wakes the parent.
Aside from the race condition above, I didn't look further, so that may
indeed also be an issue.
| But it's not really that costly to add this additional logic to
| rumpclient to dispense with the question altogether; it's more for
| testing and experiments than performance.
Yes, and while keeping the test runtime on b5 down to something reasonable
(ie: not adding anything not really necessary for a test) is a good thing,
it's hard to see any changes here making any material difference.
kre
Copyright © 1994-2026
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.