NetBSD Problem Report #56535
From www@netbsd.org Fri Dec 3 23:15:28 2021
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 6502B1A9239
for <gnats-bugs@gnats.NetBSD.org>; Fri, 3 Dec 2021 23:15:28 +0000 (UTC)
Message-Id: <20211203231526.75D4C1A923A@mollari.NetBSD.org>
Date: Fri, 3 Dec 2021 23:15:26 +0000 (UTC)
From: mpratt@google.com
Reply-To: mpratt@google.com
To: gnats-bugs@NetBSD.org
Subject: Memory corruption in multi-threaded Go parent process following fork() on AMD CPUs
X-Send-Pr-Version: www-1.0
>Number: 56535
>Notify-List: bsiegert@gmail.com
>Category: kern
>Synopsis: Memory corruption in multi-threaded Go parent process following fork() on AMD CPUs
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Dec 03 23:20:00 +0000 2021
>Last-Modified: Sat Aug 19 08:25:01 +0000 2023
>Originator: Michael Pratt
>Release: 9.0, 9.99.92
>Organization:
Google / Go Programming Language
>Environment:
NetBSD buildlet-netbsd-amd64-9-0-n2d-rne414d8e.c.symbolic-datum-552.internal 9.0 NetBSD 9.0 (GENERIC) #0: Fri Feb 14 00:06:28 UTC 2020 mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
>Description:
The Go project CI infrastructure recently switched to using VMs with a mix of Intel- and AMD-based instances (previously Intel-only). With this change we saw a big uptick in flakiness.
Upon further investigation, we discovered that Go build/test is extremely flaky on AMD CPUs in particular; so flaky, in fact, that we can almost never successfully complete a test run on AMD CPUs (amd64 or 386).
This problem is being investigated in https://golang.org/issue/49209 and https://golang.org/issue/34988. (FWIW, OpenBSD/386 seems to have a similar problem.) As far as I know, Intel CPUs are never affected.
In https://github.com/golang/go/issues/49209#issuecomment-984360815, Maya tested the following CPU models:
AMD 10h: OK (Turion II Neo N40L)
AMD 15h: OK
AMD 17h: NOT OK (Zen 1950X, Zen2 3600)
AMD 19h: NOT OK (Zen3 5950X)
We see a wide variety of different crashes, but the common theme is that they all appear to be due to memory corruption. In the cases where we can tell the precise value, I believe we are always seeing reads of 0 that should not be 0, but there are plenty of crashes where we can't pinpoint the problematic value.
Investigation this week has narrowed down the problem to some relationship with fork(). The corruption always occurs (in the parent process) shortly after a fork() system call. It may occur on the thread that called fork(), or on another thread.
The simplest reproducer I have currently is at https://github.com/golang/go/issues/34988#issuecomment-985874750 (available as a repo at https://github.com/prattmic/go-bsd-corruption-issue49209).
The behavior of this reproducer is:
* One goroutine calling fork() and wait4() in a loop (child immediately exits).
* One goroutine making a dummy getpid() system call in a loop.
* N.B. since these are infinite loops, each goroutine will generally run on a dedicated OS thread, but could in theory be moved between threads by the Go scheduler (adding runtime.LockOSThread() calls would force dedicated threads; the crashes still reproduce that way).
* The syscall.Syscall functions do not just make a raw system call, they also call runtime.entersyscall and runtime.exitsyscall (https://cs.opensource.google/go/go/+/master:src/runtime/proc.go;l=3840;drc=9b0de0854d5a5655890ef0b2b9052da2541182a3) which do some synchronization with the Go scheduler. Perhaps notably, this fiddles around with TLS variables (the "g" returned by getg(), which is ultimately in FS_BASE).
* If either goroutine uses syscall.RawSyscall, which literally just makes direct system calls, then the crashes disappear.
* If Go is running single-threaded (set GOMAXPROCS=1 and apply the patch in https://github.com/golang/go/issues/34988#issuecomment-985729313), then the crashes disappear. (The Go scheduler will time-share the two goroutines in this case).
I have unfortunately not yet managed to successfully create a reproducer in C, but want to get a bug started here because it seems to me that there must be some interaction beyond Go at work here: This reproducer makes a direct fork() system call and then memory in the parent process is corrupted. That should simply not be possible, barring some restriction on using fork() on NetBSD that I am unaware of. Even better, the child exits immediately and the _only_ memory it should touch is the instruction fetches of the executable page of the few instructions it contains.
I have a bunch of pet theories (missing/misbehaving TLB shootdowns when pages are zapped at fork() for later CoW? Bug in CoW that causes the parent to get a new zero page instead of a copy?), but I am not familiar with NetBSD's mm code so I've not had much luck looking into this beyond some basic dtrace'ing.
My latest analysis is in https://github.com/golang/go/issues/34988#issuecomment-985874750.
>How-To-Repeat:
You must use a machine with an AMD CPU (Zen or newer).
I have been working primarily with GCE N2D instance types (https://cloud.google.com/compute/docs/general-purpose-machines#n2d_machines), but this issue has been reproduced on AWS m5a instances, and bare-metal AMD machines.
This should reproduce with the latest Go release, Go 1.17.4. I haven't directly tested older versions, but our builders have experienced crashes back to the Go 1.10 toolchain (which is used for bootstrapping).
$ git clone https://github.com/prattmic/go-bsd-corruption-issue49209
$ GOOS=netbsd go build
$ ./fork-bug
If you do not see a crash within 60s, edit loop.go to replace the GETPID system call with a call to runtime.Gosched(). This should cause a crash much more quickly.
The most likely crash you will see is something like
entersyscall inconsistent 0xc00003a778 [0xc00003a000,0xc00003a800]
fatal error: entersyscall
... followed by a stack trace ...
But a variety of different fatal crashes are possible. This program never crashes when run on an Intel CPU.
>Fix:
>Release-Note:
>Audit-Trail:
From: Michael Pratt <mpratt@google.com>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/56535
Date: Tue, 14 Dec 2021 17:58:07 -0500
We now have a reproducer for this problem in C, which I have posted at
https://github.com/golang/go/issues/34988#issuecomment-994115345, and
below for completeness:
Notes:
* Anecdotally, running several instances at the same time seems to
speed up the repro by more than a linear factor (system under more
load?). That said, a single instance will eventually fail.
* Sometimes I get several failures immediately, sometimes it takes
>60s to get a failure.
* The most immediately interesting part here is that I call the fork
syscall directly rather than using fork() from libc. I think this is
simply because libc fork() is significantly slower to return than
my_fork(), and we seem to have a small race window.
* I've looked through _malloc_prefork/postfork and (I believe) all
of the registered atfork callbacks, and none of them seem important,
as neither thread is interacting with pthread or malloc.
The summarized behavior we see is:
1. `page->b = 102`
2. `fork()`
3. `page->b = 2`
4. Read `page->b`, observe 102 instead of 2.
5. When logging the corruption, we load `page->b` again, which
typically observes 2 again.
All of this happens while another thread spins writing to `page->c`
(an unrelated word in the same page).
forkstress.c:
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
uint64_t my_fork(void);

void __attribute((noinline)) spin(uint64_t loops) {
  for (volatile uint64_t i = 0; i < loops; i++) {
  }
}

struct thing {
  uint64_t b;
  uint32_t c; // Making this (plus 'sink' below) uint64_t may make repro take longer?
};

volatile struct thing* page;
volatile uint32_t sink;
int ready;

void* thread(void* arg) {
  __atomic_store_n(&ready, 1, __ATOMIC_SEQ_CST);
  while (1) {
    // Spin not strictly required, but it speeds up repro in my case.
    spin(40*1000);
    // Atomic not required, this works too:
    // page->c = sink;
    __atomic_store_n(&page->c, sink, __ATOMIC_SEQ_CST);
    sink++;
  }
}

int main(void) {
  page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
              MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
  if (page == MAP_FAILED) {
    perror("mmap");
    return 1;
  }
  pthread_t thread_id;
  int ret = pthread_create(&thread_id, NULL, &thread, NULL);
  if (ret != 0) {
    perror("pthread_create");
    return 1;
  }
  // Wait for child thread to start.
  //
  // This is not required to repro, but eliminates racing fork+thread create as
  // a possibility.
  while (!__atomic_load_n(&ready, __ATOMIC_SEQ_CST)) {
  }
  int64_t i = 0;
  while (1) {
    i++;
    if (i % 10000 == 0) {
      printf("Loop %ld...\n", (long)i);
    }
    page->b = 102;
    // Does not work with libc fork(). libc fork() is significantly slower,
    // which may be the problem.
    uint64_t pid = my_fork();
    if (pid == 0) {
      /* Child */
      _exit(0);
    }
    /* Parent */
    /* spin(40*1000); may speed up repro. */
    page->b = 2;
    uint64_t pb = page->b;
    if (pb != 2) {
      printf("Corruption! pb, page->b = %lu, %lu\n", pb, page->b);
      _exit(1);
    }
    int status;
    ret = waitpid(pid, &status, 0);
    if (ret < 0) {
      perror("waitpid");
      return 1;
    }
    if (WEXITSTATUS(status) != 0) {
      printf("Bad child status %#x\n", status);
      return 1;
    }
  }
}
fork.S:
/* These are NetBSD syscall numbers! */
#define SYS_EXIT 1
#define SYS_FORK 2
.globl my_fork
my_fork:
        movq    $SYS_FORK, %rax
        syscall
        cmpq    $0, %rax
        jne     parent
        movq    $0, %rdi
        movq    $SYS_EXIT, %rax
        syscall
        hlt
parent:
        ret
Build and run:
$ cc -pthread forkstress.c fork.S
$ ./a.out & ./a.out & ./a.out & ./a.out
Loop 10000...
Corruption! pb, page->b = 102, 2
Loop 10000...
Loop 10000...
Loop 10000...
Corruption! pb, page->b = 102, 2
Loop 20000...
Loop 20000...
Corruption! pb, page->b = 102, 2
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/56535: Memory corruption in multi-threaded Go parent
process following fork() on AMD CPUs
Date: Wed, 15 Dec 2021 11:56:35 +0000
You already mention
> AMD 17h: NOT OK (Zen 1950X, Zen2 3600)
I reproduced the failure on
cpu0: "AMD Ryzen 7 1700 Eight-Core Processor "
cpu0: AMD Family 17h (686-class), 2994.38 MHz
cpu0: family 0x17 model 0x1 stepping 0x1 (id 0x800f11)
From: "Chuck Silvers" <chs@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/56535 CVS commit: src/sys/uvm
Date: Sun, 13 Aug 2023 23:06:07 +0000
Module Name: src
Committed By: chs
Date: Sun Aug 13 23:06:07 UTC 2023
Modified Files:
src/sys/uvm: uvm_fault.c
Log Message:
uvm: prevent TLB invalidation races during COW resolution
When a thread takes a page fault which results in COW resolution,
other threads in the same process can be concurrently accessing that
same mapping on other CPUs. When the faulting thread updates the pmap
entry at the end of COW processing, the resulting TLB invalidations to
other CPUs are not done atomically, so another thread can write to the
new writable page and then a third thread might still read from the
old read-only page, resulting in inconsistent views of the page by the
latter two threads. Fix this by removing the pmap entry entirely for
the original page before we install the new pmap entry for the new
page, so that the new page can only be modified after the old page is
no longer accessible.
This fixes PR 56535 as well as the netbsd versions of problems
described in various bug trackers:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225584
https://reviews.freebsd.org/D14347
https://github.com/golang/go/issues/34988
To generate a diff of this commit:
cvs rdiff -u -r1.233 -r1.234 src/sys/uvm/uvm_fault.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
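The ordering change described in the commit log can be sketched roughly as follows. This is illustrative pseudocode using the generic pmap(9) entry points, not the actual uvm_fault.c diff; the real code operates on uvm_fault internals.

    /*
     * COW resolution, sketched. Previously the new writable mapping
     * could become visible while other CPUs still held stale TLB
     * entries for the old read-only page. The fix removes the old
     * mapping (and completes the associated TLB shootdown) before
     * the new mapping is published:
     */
    pmap_remove(pmap, va, va + PAGE_SIZE);  /* unmap old page */
    pmap_update(pmap);                      /* complete TLB shootdowns */
    pmap_enter(pmap, va, VM_PAGE_TO_PHYS(new_pg),
        VM_PROT_READ | VM_PROT_WRITE, PMAP_CANFAIL);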
From: Taylor R Campbell <riastradh@NetBSD.org>
To: chs@NetBSD.org
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/56535: Memory corruption in multi-threaded Go parent process following fork() on AMD CPUs
Date: Mon, 14 Aug 2023 05:47:12 +0000
> Module Name: src
> Committed By: chs
> Date: Sun Aug 13 23:06:07 UTC 2023
>
> Modified Files:
> src/sys/uvm: uvm_fault.c
>
> Log Message:
> uvm: prevent TLB invalidation races during COW resolution
> [...]
> This fixes PR 56535 as well as the netbsd versions of problems
> described in various bug trackers:
I don't think this is accurate. I have no theory for how this could
actually fix PR 56535, and I don't think anyone else has presented
such a theory either. This change avoids a race between memory access
and TLB invalidation IPI, as in the following code:
struct { long x; char pad[512]; int y; } *f = mmap(...);

/* thread A */
for (;;)
        f->y = 1;       /* trigger cow */

/* thread B */
for (long x = 0;; x++) {
        lock();
        f->x = x;       /* write to new page after cow tlb flush */
        unlock();
}

/* thread C */
for (;;) {
        lock();
        long x0 = f->x; /* read from old page before tlb flush */
        /* XXX tlb flush may happen asynchronously here */
        long x1 = f->x; /* read from new page after tlb flush */
        unlock();
        assert(x0 == x1);
}

/* main thread */
for (;;) {
        pid_t pid = fork();
        if (pid == 0) _exit(0);
        waitpid(pid, NULL, 0);
}
When thread A triggers copy-on-write, the kernel allocates a new
backing page for f, copies the old page content to the new page, and
then updates the page tables to point at the new page with write
permission.
However, the new page may be exposed writably to thread B while thread
C still has TLB entries pointing at the old page. So the following
sequence of events may happen:
        thread B                        thread C
        --------                        --------
*interrupt* tlb flush
lock();
f->x = x; /* write to new page */
unlock();
                                        lock();
                                        x0 = f->x; /* read from old page */
                                        *interrupt* tlb flush
                                        x1 = f->x; /* read from new page */
                                        unlock();
                                        assert(x0 == x1);
                                        *crash*
This change addresses the problem by making the old page nonwritable
before copying its content to the new page and publishing the new
page; that way, in this scenario, threads B and C will both fault and
wait for the cow handling to complete.
(Actually, I'm a little confused: this change only changes _one
virtual mapping_ of the old page, but there can be other virtual
mappings of the same page. Why isn't it necessary to use
pmap_page_remove here?)
But by instrumenting the reproducer here with lfence;rdtsc around
page->b = 2;
uint64_t pb = page->b;
we have found latencies as small as 10ns, without migration across
cores which might have desynchronized TSCs, and yet pb is shown to be
102, not 2. It is physically impossible for the TLB invalidation IPI
handler to be triggered between the two reads of page->b within a span
of 10ns.
My best hypothesis is still that this is an AMD CPU bug, and the fix
happens to paper over the obscure microarchitectural circumstances
that lead to the bug.
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/56535 CVS commit: [netbsd-10] src/sys/uvm
Date: Tue, 15 Aug 2023 09:44:09 +0000
Module Name: src
Committed By: martin
Date: Tue Aug 15 09:44:09 UTC 2023
Modified Files:
src/sys/uvm [netbsd-10]: uvm_fault.c
Log Message:
Pull up following revision(s) (requested by chs in ticket #327):
sys/uvm/uvm_fault.c: revision 1.234
uvm: prevent TLB invalidation races during COW resolution
When a thread takes a page fault which results in COW resolution,
other threads in the same process can be concurrently accessing that
same mapping on other CPUs. When the faulting thread updates the pmap
entry at the end of COW processing, the resulting TLB invalidations to
other CPUs are not done atomically, so another thread can write to the
new writable page and then a third thread might still read from the
old read-only page, resulting in inconsistent views of the page by the
latter two threads. Fix this by removing the pmap entry entirely for
the original page before we install the new pmap entry for the new
page, so that the new page can only be modified after the old page is
no longer accessible.
This fixes PR 56535 as well as the netbsd versions of problems
described in various bug trackers:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225584
https://reviews.freebsd.org/D14347
https://github.com/golang/go/issues/34988
To generate a diff of this commit:
cvs rdiff -u -r1.231 -r1.231.2.1 src/sys/uvm/uvm_fault.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/56535 CVS commit: [netbsd-9] src/sys/uvm
Date: Tue, 15 Aug 2023 09:46:23 +0000
Module Name: src
Committed By: martin
Date: Tue Aug 15 09:46:23 UTC 2023
Modified Files:
src/sys/uvm [netbsd-9]: uvm_fault.c
Log Message:
Pull up following revision(s) (requested by chs in ticket #1714):
sys/uvm/uvm_fault.c: revision 1.234
uvm: prevent TLB invalidation races during COW resolution
When a thread takes a page fault which results in COW resolution,
other threads in the same process can be concurrently accessing that
same mapping on other CPUs. When the faulting thread updates the pmap
entry at the end of COW processing, the resulting TLB invalidations to
other CPUs are not done atomically, so another thread can write to the
new writable page and then a third thread might still read from the
old read-only page, resulting in inconsistent views of the page by the
latter two threads. Fix this by removing the pmap entry entirely for
the original page before we install the new pmap entry for the new
page, so that the new page can only be modified after the old page is
no longer accessible.
This fixes PR 56535 as well as the netbsd versions of problems
described in various bug trackers:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225584
https://reviews.freebsd.org/D14347
https://github.com/golang/go/issues/34988
To generate a diff of this commit:
cvs rdiff -u -r1.206.2.2 -r1.206.2.3 src/sys/uvm/uvm_fault.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Patrick Welche <prlw1@welche.eu>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/56535: Memory corruption in multi-threaded Go parent
process following fork() on AMD CPUs
Date: Sat, 19 Aug 2023 09:23:09 +0100
Experimentally, this fixes the reproducer in the PR:
cpu0: "AMD Ryzen 7 5700X 8-Core Processor "
cpu0: AMD Family 19h (686-class), 3393.62 MHz
cpu0: family 0x19 model 0x21 stepping 0x2 (id 0xa20f12)
Kernel with
$NetBSD: uvm_fault.c,v 1.233 2023/07/17 12:55:37 riastradh Exp $
$ ./a.out & ./a.out & ./a.out & ./a.out
Corruption! pb, page->b = 102, 2
Corruption! pb, page->b = 102, 102
Corruption! pb, page->b = 102, 2
Loop 10000...
Corruption! pb, page->b = 102, 2
Update to
$NetBSD: uvm_fault.c,v 1.234 2023/08/13 23:06:07 chs Exp $
and:
$ ./a.out & ./a.out & ./a.out & ./a.out
Loop 10000...
Loop 10000...
Loop 10000...
Loop 10000...
Loop 20000...
Loop 20000...
Loop 20000...
Loop 20000...
...
Loop 1000000...
>Unformatted: