NetBSD Problem Report #56535

From www@netbsd.org  Fri Dec  3 23:15:28 2021
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 6502B1A9239
	for <gnats-bugs@gnats.NetBSD.org>; Fri,  3 Dec 2021 23:15:28 +0000 (UTC)
Message-Id: <20211203231526.75D4C1A923A@mollari.NetBSD.org>
Date: Fri,  3 Dec 2021 23:15:26 +0000 (UTC)
From: mpratt@google.com
Reply-To: mpratt@google.com
To: gnats-bugs@NetBSD.org
Subject: Memory corruption in multi-threaded Go parent process following fork() on AMD CPUs
X-Send-Pr-Version: www-1.0

>Number:         56535
>Notify-List:    bsiegert@gmail.com
>Category:       kern
>Synopsis:       Memory corruption in multi-threaded Go parent process following fork() on AMD CPUs
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Dec 03 23:20:00 +0000 2021
>Last-Modified:  Sat Aug 19 08:25:01 +0000 2023
>Originator:     Michael Pratt
>Release:        9.0, 9.99.92
>Organization:
Google / Go Programming Language
>Environment:
NetBSD buildlet-netbsd-amd64-9-0-n2d-rne414d8e.c.symbolic-datum-552.internal 9.0 NetBSD 9.0 (GENERIC) #0: Fri Feb 14 00:06:28 UTC 2020  mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64

>Description:
The Go project CI infrastructure recently switched to using VMs with a mix of Intel and AMD-based instances (from Intel-only previously). With this change we saw a big uptick in flakiness.

Upon further investigation, we discovered that Go build/test runs are extremely flaky on AMD CPUs in particular. It is so bad, in fact, that we can almost never successfully complete a test run on AMD CPUs (amd64 or 386).

This problem is being investigated in https://golang.org/issue/49209 and https://golang.org/issue/34988. (FWIW, OpenBSD/386 seems to have a similar problem). As far as I know, Intel CPUs are never affected.

In https://github.com/golang/go/issues/49209#issuecomment-984360815, Maya tested the following CPU models:

AMD 10h: OK (Turion II Neo N40L)
AMD 15h: OK
AMD 17h: NOT OK (Zen 1950X, Zen2 3600)
AMD 19h: NOT OK (Zen3 5950X)

We see a wide variety of different crashes, but the common theme is that they all appear to be due to memory corruption. In the cases where we can tell the precise value, I believe we are always seeing reads of 0 that should not be 0, but there are plenty of crashes where we can't precisely identify the problematic value.

Investigation this week has narrowed the problem down to some relationship with fork(). The corruption always occurs (in the parent process) shortly after a fork() system call. It may occur on the thread that called fork(), or on another thread.

The simplest reproducer I have currently is at https://github.com/golang/go/issues/34988#issuecomment-985874750 (available as a repo at https://github.com/prattmic/go-bsd-corruption-issue49209).

The behavior of this reproducer is:

* One goroutine calling fork() and wait4() in a loop (child immediately exits).
* One goroutine making a dummy getpid() system call in a loop.
* N.B.: since these are infinite loops, each goroutine will generally run on a dedicated OS thread, though the Go scheduler could in theory migrate them between threads (adding runtime.LockOSThread() calls would force dedicated threads; the crashes still reproduce that way).
* The syscall.Syscall functions do not just make a raw system call; they also call runtime.entersyscall and runtime.exitsyscall (https://cs.opensource.google/go/go/+/master:src/runtime/proc.go;l=3840;drc=9b0de0854d5a5655890ef0b2b9052da2541182a3), which do some synchronization with the Go scheduler. Perhaps notably, this fiddles with TLS variables (the "g" returned by getg(), which ultimately lives in FS_BASE).
* If either goroutine uses syscall.RawSyscall, which literally just makes direct system calls, then the crashes disappear.
* If Go is running single-threaded (set GOMAXPROCS=1 and apply the patch in https://github.com/golang/go/issues/34988#issuecomment-985729313), then the crashes disappear. (The Go scheduler will time-share the two goroutines in this case).

I have unfortunately not yet managed to create a reproducer in C, but I want to get a bug filed here because it seems to me that something beyond Go must be at work: this reproducer makes a direct fork() system call, after which memory in the parent process is corrupted. That should simply not be possible, barring some restriction on using fork() on NetBSD that I am unaware of. Better still, the child exits immediately, and the _only_ memory it should touch is the instruction fetches of the executable page containing the few instructions it runs.

I have a bunch of pet theories (missing or misbehaving TLB shootdowns when pages are write-protected at fork() for later CoW? A bug in CoW handling that gives the parent a fresh zero page instead of a copy?), but I am not familiar with NetBSD's mm code, so I've not had much luck looking into this beyond some basic dtrace'ing.

My latest analysis is in https://github.com/golang/go/issues/34988#issuecomment-985874750.
>How-To-Repeat:
You must use a machine with an AMD CPU (Zen or newer).

I have been working primarily with GCE N2D instance types (https://cloud.google.com/compute/docs/general-purpose-machines#n2d_machines), but the issue has also been reproduced on AWS m5a instances and on bare-metal AMD machines.

This should reproduce with the latest Go release, Go 1.17.4. I haven't directly tested older versions, but our builders have experienced crashes back to the Go 1.10 toolchain (which is used for bootstrapping).

$ git clone https://github.com/prattmic/go-bsd-corruption-issue49209
$ GOOS=netbsd go build
$ ./fork-bug

If you do not see a crash within 60s, edit loop.go to replace the GETPID system call with a call to runtime.Gosched(). This should cause a crash much more quickly.

The most likely crash you will see is something like:

entersyscall inconsistent 0xc00003a778 [0xc00003a000,0xc00003a800]
fatal error: entersyscall
... followed by a stack trace ...

But a variety of different fatal crashes are possible. This program never crashes when run on an Intel CPU.
>Fix:

>Release-Note:

>Audit-Trail:
From: Michael Pratt <mpratt@google.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/56535
Date: Tue, 14 Dec 2021 17:58:07 -0500

 We now have a reproducer for this problem in C, which I have posted at
 https://github.com/golang/go/issues/34988#issuecomment-994115345, and
 below for completeness:

 Notes:
 * Anecdotally, running several instances at the same time seems to
 speed up reproduction more than linearly (perhaps because the system
 is under more load?). That said, a single instance will eventually fail.
 * Sometimes I get several failures immediately, sometimes it takes
 >60s to get a failure.
 * The most immediately interesting part here is that I call the fork
 syscall directly rather than using fork() from libc. I think this is
 simply because libc fork() is significantly slower to return than
 my_fork(), and we seem to have a small race window.
   * I've looked through _malloc_prefork/postfork and (I believe) all
 of the registered atfork callbacks, and none of them seem important,
 as neither thread is interacting with pthread or malloc.

 The summarized behavior we see is:
 1. `page->b = 102`
 2. `fork()`
 3. `page->b = 2`
 4. Read `page->b`, observe 102 instead of 2.
 5. When logging the corruption, we load `page->b` again, which
 typically observes 2 again.

 All while another thread is spinning writing to `page->c` (unrelated
 word in the same page).

 forkstress.c:

 #include <pthread.h>
 #include <stdbool.h>
 #include <stdint.h>
 #include <stdlib.h>
 #include <stdio.h>
 #include <string.h>
 #include <sys/mman.h>
 #include <sys/wait.h>
 #include <unistd.h>

 uint64_t my_fork(void);

 void __attribute__((noinline)) spin(uint64_t loops) {
   for (volatile uint64_t i = 0; i < loops; i++) {
   }
 }

 struct thing {
   uint64_t b;
   uint32_t c;  // Making this (plus 'sink' below) uint64_t may make repro take longer?
 };

 volatile struct thing* page;

 volatile uint32_t sink;

 int ready;

 void* thread(void* arg) {
   __atomic_store_n(&ready, 1, __ATOMIC_SEQ_CST);

   while (1) {
     // Spin not strictly required, but it speeds up repro in my case.
     spin(40*1000);

     // Atomic not required, this works too:
     // page->c = sink;
     __atomic_store_n(&page->c, sink, __ATOMIC_SEQ_CST);
     sink++;
   }
 }

 int main(void) {
   page = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
   if (page == MAP_FAILED) {
     perror("mmap");
     return 1;
   }

   pthread_t thread_id;
   int ret = pthread_create(&thread_id, NULL, &thread, NULL);
   if (ret != 0) {
     perror("pthread_create");
     return 1;
   }

   // Wait for child thread to start.
   //
   // This is not required to repro, but eliminates racing fork+thread create as
   // a possibility.
   while (!__atomic_load_n(&ready, __ATOMIC_SEQ_CST)) {
   }

   int64_t i = 0;
   while (1) {
     i++;
     if (i % 10000 == 0) {
       printf("Loop %ld...\n", i);
     }

     page->b = 102;
     // Does not work with libc fork(). libc fork() is significantly slower,
     // which may be the problem.
     uint64_t pid = my_fork();
     if (pid == 0) {
       /* Child */
       _exit(0);
     }

     /* Parent */
     /* spin(40*1000); may speed up repro. */
     page->b = 2;
     uint64_t pb = page->b;
     if (pb != 2) {
       printf("Corruption! pb, page->b = %lu, %lu\n", pb, page->b);
       _exit(1);
     }

     int status;
     ret = waitpid(pid, &status, 0);
     if (ret < 0) {
       perror("waitpid");
       return 1;
     }
     if (WEXITSTATUS(status) != 0) {
       printf("Bad child status %#x\n", status);
       return 1;
     }
   }
 }

 fork.S:

 /* These are NetBSD syscall numbers! */
 #define SYS_EXIT        1
 #define SYS_FORK        2

 .globl my_fork
 my_fork:
         movq $SYS_FORK, %rax
         syscall

         cmpq $0, %rax
         jne parent

         movq $0, %rdi
         movq $SYS_EXIT, %rax
         syscall
         hlt

 parent:
         ret

 Build and run:

 $ cc -pthread forkstress.c fork.S
 $ ./a.out & ./a.out & ./a.out  & ./a.out
 Loop 10000...
 Corruption! pb, page->b = 102, 2
 Loop 10000...
 Loop 10000...
 Loop 10000...
 Corruption! pb, page->b = 102, 2
 Loop 20000...
 Loop 20000...
 Corruption! pb, page->b = 102, 2

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/56535: Memory corruption in multi-threaded Go parent
 process following fork() on AMD CPUs
Date: Wed, 15 Dec 2021 11:56:35 +0000

 You already mention

 > AMD 17h: NOT OK (Zen 1950X, Zen2 3600)

 I reproduced the failure on

 cpu0: "AMD Ryzen 7 1700 Eight-Core Processor          "
 cpu0: AMD Family 17h (686-class), 2994.38 MHz
 cpu0: family 0x17 model 0x1 stepping 0x1 (id 0x800f11)

From: "Chuck Silvers" <chs@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/56535 CVS commit: src/sys/uvm
Date: Sun, 13 Aug 2023 23:06:07 +0000

 Module Name:	src
 Committed By:	chs
 Date:		Sun Aug 13 23:06:07 UTC 2023

 Modified Files:
 	src/sys/uvm: uvm_fault.c

 Log Message:
 uvm: prevent TLB invalidation races during COW resolution

 When a thread takes a page fault which results in COW resolution,
 other threads in the same process can be concurrently accessing that
 same mapping on other CPUs.  When the faulting thread updates the pmap
 entry at the end of COW processing, the resulting TLB invalidations to
 other CPUs are not done atomically, so another thread can write to the
 new writable page and then a third thread might still read from the
 old read-only page, resulting in inconsistent views of the page by the
 latter two threads.  Fix this by removing the pmap entry entirely for
 the original page before we install the new pmap entry for the new
 page, so that the new page can only be modified after the old page is
 no longer accessible.

 This fixes PR 56535 as well as the netbsd versions of problems
 described in various bug trackers:

 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225584
 https://reviews.freebsd.org/D14347
 https://github.com/golang/go/issues/34988


 To generate a diff of this commit:
 cvs rdiff -u -r1.233 -r1.234 src/sys/uvm/uvm_fault.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Taylor R Campbell <riastradh@NetBSD.org>
To: chs@NetBSD.org
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/56535: Memory corruption in multi-threaded Go parent process following fork() on AMD CPUs
Date: Mon, 14 Aug 2023 05:47:12 +0000

 > Module Name:	src
 > Committed By:	chs
 > Date:		Sun Aug 13 23:06:07 UTC 2023
 >
 > Modified Files:
 > 	src/sys/uvm: uvm_fault.c
 >
 > Log Message:
 > uvm: prevent TLB invalidation races during COW resolution
 > [...]
 > This fixes PR 56535 as well as the netbsd versions of problems
 > described in various bug trackers:

 I don't think this is accurate.  I have no theory for how this could
 actually fix PR 56535, and I don't think anyone else has presented
 such a theory either.  This change avoids a race between memory access
 and TLB invalidation IPI, as in the following code:

         struct { long x; char pad[512]; int y; } *f = mmap(...);

         /* thread A */
         for (;;)
                 f->y = 1;  /* trigger cow */

         /* thread B */
         for (long x = 0;; x++) {
                 lock();
                 f->x = x; /* write to new page after cow tlb flush */
                 unlock();
         }

         /* thread C */
         for (;;) {
                 lock();
                 long x0 = f->x; /* read from old page before tlb flush */
                 /* XXX tlb flush may happen asynchronously here */
                 long x1 = f->x; /* read from new page after tlb flush */
                 unlock();
                 assert(x0 == x1);
         }

         /* main thread */
         for (;;) {
                 pid_t pid = fork();
                 if (pid == 0) _exit(0);
                 waitpid(pid, NULL, 0);
         }

 When thread A triggers copy-on-write, the kernel allocates a new
 backing page for f, copies the old page content to the new page, and
 then updates the page tables to point at the new page with write
 permission.

 However, the new page may be exposed writably to thread B while thread
 C still has TLB entries pointing at the old page.  So the following
 sequence of events may happen:

         thread B                thread C
         --------                --------
         *interrupt* tlb flush
         lock();
         f->x = x; /* write to new page */
         unlock();
                                 lock();
                                 x0 = f->x; /* read from old page */
                                 *interrupt* tlb flush
                                 x1 = f->x; /* read from new page */
                                 unlock();
                                 assert(x0 == x1);
                                 *crash*

 This change addresses the problem by making the old page nonwritable
 before copying its content to the new page and publishing the new
 page; that way, in this scenario, threads B and C will both fault and
 wait for the cow handling to complete.

 (Actually, I'm a little confused: this change only changes _one
 virtual mapping_ of the old page, but there can be other virtual
 mappings of the same page.  Why isn't it necessary to use
 pmap_page_remove here?)

 But by instrumenting the reproducer here with lfence;rdtsc around

         page->b = 2;
         uint64_t pb = page->b;

 we have found latencies as small as 10ns, without migration across
 cores which might have desynchronized TSCs, and yet pb is shown to be
 102, not 2.  It is physically impossible for the TLB invalidation IPI
 handler to be triggered between the two reads of page->b within a span
 of 10ns.

 My best hypothesis is still that this is an AMD CPU bug, and the fix
 happens to paper over the obscure microarchitectural circumstances
 that lead to the bug.

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/56535 CVS commit: [netbsd-10] src/sys/uvm
Date: Tue, 15 Aug 2023 09:44:09 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Tue Aug 15 09:44:09 UTC 2023

 Modified Files:
 	src/sys/uvm [netbsd-10]: uvm_fault.c

 Log Message:
 Pull up following revision(s) (requested by chs in ticket #327):

 	sys/uvm/uvm_fault.c: revision 1.234

 uvm: prevent TLB invalidation races during COW resolution

 When a thread takes a page fault which results in COW resolution,
 other threads in the same process can be concurrently accessing that
 same mapping on other CPUs.  When the faulting thread updates the pmap
 entry at the end of COW processing, the resulting TLB invalidations to
 other CPUs are not done atomically, so another thread can write to the
 new writable page and then a third thread might still read from the
 old read-only page, resulting in inconsistent views of the page by the
 latter two threads.  Fix this by removing the pmap entry entirely for
 the original page before we install the new pmap entry for the new
 page, so that the new page can only be modified after the old page is
 no longer accessible.

 This fixes PR 56535 as well as the netbsd versions of problems
 described in various bug trackers:
 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225584
 https://reviews.freebsd.org/D14347
 https://github.com/golang/go/issues/34988


 To generate a diff of this commit:
 cvs rdiff -u -r1.231 -r1.231.2.1 src/sys/uvm/uvm_fault.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/56535 CVS commit: [netbsd-9] src/sys/uvm
Date: Tue, 15 Aug 2023 09:46:23 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Tue Aug 15 09:46:23 UTC 2023

 Modified Files:
 	src/sys/uvm [netbsd-9]: uvm_fault.c

 Log Message:
 Pull up following revision(s) (requested by chs in ticket #1714):

 	sys/uvm/uvm_fault.c: revision 1.234

 uvm: prevent TLB invalidation races during COW resolution

 When a thread takes a page fault which results in COW resolution,
 other threads in the same process can be concurrently accessing that
 same mapping on other CPUs.  When the faulting thread updates the pmap
 entry at the end of COW processing, the resulting TLB invalidations to
 other CPUs are not done atomically, so another thread can write to the
 new writable page and then a third thread might still read from the
 old read-only page, resulting in inconsistent views of the page by the
 latter two threads.  Fix this by removing the pmap entry entirely for
 the original page before we install the new pmap entry for the new
 page, so that the new page can only be modified after the old page is
 no longer accessible.

 This fixes PR 56535 as well as the netbsd versions of problems
 described in various bug trackers:
 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225584
 https://reviews.freebsd.org/D14347
 https://github.com/golang/go/issues/34988


 To generate a diff of this commit:
 cvs rdiff -u -r1.206.2.2 -r1.206.2.3 src/sys/uvm/uvm_fault.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Patrick Welche <prlw1@welche.eu>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/56535: Memory corruption in multi-threaded Go parent
 process following fork() on AMD CPUs
Date: Sat, 19 Aug 2023 09:23:09 +0100

 Experimentally, this fixes the reproducer in the PR:

 cpu0: "AMD Ryzen 7 5700X 8-Core Processor             "
 cpu0: AMD Family 19h (686-class), 3393.62 MHz
 cpu0: family 0x19 model 0x21 stepping 0x2 (id 0xa20f12)

 Kernel with

    $NetBSD: uvm_fault.c,v 1.233 2023/07/17 12:55:37 riastradh Exp $

 $ ./a.out & ./a.out & ./a.out  & ./a.out
 Corruption! pb, page->b = 102, 2
 Corruption! pb, page->b = 102, 102
 Corruption! pb, page->b = 102, 2
 Loop 10000...
 Corruption! pb, page->b = 102, 2

 Update to

      $NetBSD: uvm_fault.c,v 1.234 2023/08/13 23:06:07 chs Exp $

 and:

 $ ./a.out & ./a.out & ./a.out & ./a.out
 Loop 10000...
 Loop 10000...
 Loop 10000...
 Loop 10000...
 Loop 20000...
 Loop 20000...
 Loop 20000...
 Loop 20000...
 ...
 Loop 1000000...

>Unformatted:
