NetBSD Problem Report #58091

From www@netbsd.org  Sat Mar 30 13:31:46 2024
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 95CA11A923B
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 30 Mar 2024 13:31:46 +0000 (UTC)
Message-Id: <20240330133145.957871A923C@mollari.NetBSD.org>
Date: Sat, 30 Mar 2024 13:31:45 +0000 (UTC)
From: michael.dusan@gmail.com
Reply-To: michael.dusan@gmail.com
To: gnats-bugs@NetBSD.org
Subject: after fork/execve or posix_spawn, parent kill(child, SIGTERM) has race condition making it unreliable
X-Send-Pr-Version: www-1.0

>Number:         58091
>Category:       kern
>Synopsis:       after fork/execve or posix_spawn, parent kill(child, SIGTERM) has race condition making it unreliable
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Mar 30 13:35:00 +0000 2024
>Originator:     Michael Dusan
>Release:        
>Organization:
Zig Software Foundation
>Environment:
NetBSD netbsd100-amd64 10.0_RC6 NetBSD 10.0_RC6 (GENERIC) #0: Tue Mar 12 10:19:02 UTC 2024  mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64

NetBSD netbsd93-amd64 9.3 NetBSD 9.3 (GENERIC) #0: Thu Aug  4 15:30:37 UTC 2022  mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
>Description:
Fork/exec a child and, as the parent's first action, send SIGTERM to the child. Roughly 3 times out of a million, the child never receives the signal.

The variant using posix_spawn manifests more frequently on NetBSD 9.3, and much more frequently on NetBSD 10.0_RC6.

Unable to reproduce this bug on Arch Linux, macOS 14.0, FreeBSD 14.4, OpenBSD 7.4, or DragonFly 6.4.

Under ktrace, the bug (as triggered by the Zig code that motivated this report) showed up much more frequently. I observed that it manifests when the parent's `kill()` call appears in the ktrace output immediately before the child's `execve()` call.

It seems that the signal is lost somewhere in kernel execve preparation.

>How-To-Repeat:
0. caution: running this reproducer may hose the system. In another incarnation it would end my ssh session (and other sessions to the same NetBSD system), requiring a reboot
1. see the bug.c code below
2. cc -o bug bug.c
3. in shell `repeat 1000 ./bug`
4. over time, the output "whups" indicates child did not end due to signal
5. it sometimes helps to load the system, e.g. by concurrently running step 3 in another shell
6. I usually observe 2 or 3 "whups" per invocation
7. testing env 1: qemu VM netbsd 10.0_RC6 as "8 core" guest
8. testing env 2: qemu VM netbsd 9.3 amd64 as "8 core" guest
9. VM host: archlinux, AMD Ryzen 9 7900X 12-Core Processor

///////////////////////////////////////////////////////////////////////////////
// bug.c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

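void doit() {
    pid_t pid = fork();
    if (pid == -1) {
        // never pass -1 to kill(): it would signal every process we may signal
        fprintf(stderr, "fork: errno=%d\n", errno);
        return;
    }
    if (pid == 0) {
        char *argv[] = { "sleep", "10", NULL };
        execve("/bin/sleep", argv, NULL);
        _exit(127); // execve failed; do not fall back into the caller's loop
    } else {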
        // we are parent
        if (kill(pid, SIGTERM) == -1) {
            fprintf(stderr, "kill: errno=%d\n", errno);
            return;
        }
        int status;
        if (waitpid(pid, &status, 0) == -1) {
            fprintf(stderr, "waitpid: errno=%d\n", errno);
            return;
        }
        if (!WIFSIGNALED(status)) {
            fprintf(stderr, "whups!\n");
        }
    }
}

int main() {
    for (int i = 0; i < 1000; i++) {
        doit();
    }
}


///////////////////////////////////////////////////////////////////////////////
// bug_posix.c
// this variant uses `posix_spawn()` instead of fork/execve
// here it's set to do 1 million iterations
// netbsd 10.0_RC3 emits "whups" over a hundred times on average
// netbsd 9.3 emits "whups" maybe 20 times on average

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <spawn.h>
#include <sys/wait.h>

void doit() {
    char *argv[] = { "sleep", "1", NULL };
    pid_t pid;
    // posix_spawn returns an error number on failure rather than -1
    int err = posix_spawn(&pid, "/bin/sleep", NULL, NULL, argv, NULL);
    if (err != 0) {
        fprintf(stderr, "posix_spawn: error=%d\n", err);
        return;
    }

    if (kill(pid, SIGTERM) == -1) {
        fprintf(stderr, "kill: errno=%d\n", errno);
        return;
    }

    int status;
    if (waitpid(pid, &status, 0) == -1) {
        fprintf(stderr, "waitpid: errno=%d\n", errno);
        return;
    }
    if (!WIFSIGNALED(status)) {
        fprintf(stderr, "whups!\n");
    }
}

int main() {
    for (int i = 0; i < 1000000; i++) {
        doit();
    }
}
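
Steps 4 to 6 above count "whups" lines by eye. A tallying variant of the fork/execve reproducer might look like the following sketch. The `struct tally` type and the `run_trials` function are hypothetical names introduced here, and taking the program path as a parameter is an assumption made for illustration. The sketch also checks for a failed `fork()` before calling `kill()`, and exits the child if `execve()` fails.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical tally type, introduced for illustration only. */
struct tally {
    int signaled;  /* child terminated by a signal (the expected outcome) */
    int exited;    /* child exited on its own: signal lost, or exec failed */
    int errors;    /* fork, kill, or waitpid failed */
};

/*
 * Run n fork/execve/kill trials against path (for example "/bin/sleep")
 * and count how each child ended.
 */
static struct tally run_trials(const char *path, int n)
{
    struct tally t = { 0, 0, 0 };
    char *argv[] = { "sleep", "10", NULL };
    char *envp[] = { NULL };

    for (int i = 0; i < n; i++) {
        int status;
        pid_t pid = fork();

        if (pid == -1) {
            /* never pass -1 to kill() */
            t.errors++;
            continue;
        }
        if (pid == 0) {
            execve(path, argv, envp);
            _exit(127);  /* exec failed; do not fall back into the loop */
        }
        if (kill(pid, SIGTERM) == -1) {
            t.errors++;
            waitpid(pid, &status, 0);  /* still reap the child */
            continue;
        }
        if (waitpid(pid, &status, 0) == -1) {
            t.errors++;
            continue;
        }
        if (WIFSIGNALED(status))
            t.signaled++;
        else
            t.exited++;
    }
    return t;
}
```

Calling `run_trials("/bin/sleep", 1000)` and printing the three counters gives one summary line per run in place of scattered "whups" output. A nonzero `exited` count corresponds to the failures described in this report, provided `execve()` itself succeeded.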
>Fix:
