NetBSD Problem Report #57661

From www@netbsd.org  Sun Oct 15 12:12:43 2023
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 866001A9238
	for <gnats-bugs@gnats.NetBSD.org>; Sun, 15 Oct 2023 12:12:43 +0000 (UTC)
Message-Id: <20231015121242.36BCF1A923A@mollari.NetBSD.org>
Date: Sun, 15 Oct 2023 12:12:42 +0000 (UTC)
From: logix@foobar.franken.de
Reply-To: logix@foobar.franken.de
To: gnats-bugs@NetBSD.org
Subject: Crash when booting on Xeon Silver 4416+ in KVM/Qemu
X-Send-Pr-Version: www-1.0

>Number:         57661
>Category:       port-amd64
>Synopsis:       Crash when booting on Xeon Silver 4416+ in KVM/Qemu
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    port-amd64-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Oct 15 12:15:00 +0000 2023
>Last-Modified:  Tue Oct 17 14:20:01 +0000 2023
>Originator:     Harold Gutch
>Release:        NetBSD current
>Organization:
>Environment:
NetBSD  10.99.10 NetBSD 10.99.10 (GENERIC) #0: Thu Oct 12 23:51:05 UTC 2023  mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
>Description:
Booting a current kernel on a Xeon Silver 4416+ CPU inside KVM/Qemu yields:

[   1.5179371] uvm_fault(0xffffffff81911220, 0xffffac805182a000, 2) -> e
[   1.5179371] fatal page fault in supervisor mode
[   1.5179371] trap type 6 code 0x2 rip 0xffffffff80fdcfdc cs 0x8 rflags 0x10206 cr2 0xffffac805182a000 ilevel 0 rsp 0xffffffff81d56d38
[   1.5280253] curlwp 0xffffffff8188b9c0 pid 0.0 lowest kstack 0xffffffff81d512c0
kernel: page fault trap, code=0
Stopped in pid 0.0 (system) at  netbsd:memset+0x2c:     repe stosq      %es:(%rdi)
memset() at netbsd:memset+0x2c
lwp_create() at netbsd:lwp_create+0x325
fork1() at netbsd:fork1+0x43b
main() at netbsd:main+0x47a


The stack trace is a bit of a red herring.  I traced the memset() down to line 343 of src/sys/arch/x86/x86/fpu.c, so the actual call chain is:
lwp_create() -> uvm_lwp_fork() -> cpu_lwp_fork() -> fpu_lwp_fork() -> memset()

Adding
  printf("DEBUG: sizeof(pcb2->pcb_savefpu)==%zu\n", sizeof(pcb2->pcb_savefpu));
  printf("DEBUG: x86_fpu_save_size==%d\n", x86_fpu_save_size);

before the memset() call prints

  [   1.8432366] DEBUG: sizeof(pcb2->pcb_savefpu)==576
  [   1.8432366] DEBUG: x86_fpu_save_size==11008

Changing the VM's CPU to Sandy Bridge prints

  [  1.8897648] DEBUG: sizeof(pcb2->pcb_savefpu)==576
  [  1.8897648] DEBUG: x86_fpu_save_size==832

... which also *seems* odd, but the machine works in that case.  The comment at line 80 of src/sys/arch/amd64/include/pcb.h appears to suggest that pcb_savefpu extends to the end of the page, so I guess the 832 vs. 576 byte discrepancy falls under "yes... but that's OK".  With x86_fpu_save_size==11008, however, the memset() writes far beyond the end of the page.
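
For reference, the three sizes seen above are consistent with the standard (non-compacted) XSAVE layout, where the save area ends at the highest offset+size of any enabled state component.  The following is only a userland sketch of that arithmetic, not kernel code.  The component offsets and sizes in the tables are assumptions (values that Intel CPUs typically report via CPUID leaf 0xD), the 4096-byte page size is assumed, and all names are made up for this example.  If the assumptions hold, the jump to 11008 would come from the 8192-byte AMX tile data component rather than from AVX-512 alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LEGACY_AREA	512u	/* FXSAVE legacy region */
#define XSAVE_HEADER	64u	/* XSAVE header following it */
#define ASSUMED_PAGE	4096u	/* ASSUMPTION: 4 KiB pages */

/* One XSAVE state component in the standard (non-compacted) format. */
struct xsave_component {
	const char *name;
	uint32_t offset;	/* byte offset from start of save area */
	uint32_t size;		/* size in bytes */
};

/*
 * ASSUMPTION: typical CPUID leaf 0xD values on Intel CPUs.  Real code
 * must query CPUID rather than hardcode these.
 */
static const struct xsave_component avx_only[] = {
	{ "AVX YMM_Hi128",	 576,  256 },
};

static const struct xsave_component avx512_amx[] = {
	{ "AVX YMM_Hi128",	 576,  256 },
	{ "AVX-512 opmask",	1088,   64 },
	{ "AVX-512 ZMM_Hi256",	1152,  512 },
	{ "AVX-512 Hi16_ZMM",	1664, 1024 },
	{ "PKRU",		2688,    8 },
	{ "AMX TILECFG",	2752,   64 },
	{ "AMX TILEDATA",	2816, 8192 },
};

/* Size of the save area: the furthest end of any component. */
static uint32_t
xsave_area_size(const struct xsave_component *c, size_t n)
{
	uint32_t end = LEGACY_AREA + XSAVE_HEADER;

	for (size_t i = 0; i < n; i++) {
		/* Extended components live after the header. */
		assert(c[i].offset >= LEGACY_AREA + XSAVE_HEADER);
		if (c[i].offset + c[i].size > end)
			end = c[i].offset + c[i].size;
	}
	return end;
}

/* True if the area cannot fit in one page, whatever its offset. */
static int
exceeds_page(uint32_t size)
{
	return size > ASSUMED_PAGE;
}
```

With these tables, xsave_area_size() gives 576 with no extended components, 832 for AVX only, and 11008 for the full list, matching the values printed above.  Only the last one is larger than a whole page.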
>How-To-Repeat:
Boot NetBSD in Qemu/KVM with "-cpu host" on a Linux host that has a Xeon Silver 4416+ CPU.

Possibly also: boot NetBSD natively on a machine with such a CPU (untested so far; I don't currently have such a machine available for testing).
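
To check whether a given host is likely to be affected, the save area size can be read from userland.  A minimal sketch, assuming GCC or Clang on x86 (it relies on __get_cpuid_count() from <cpuid.h>); the function name query_xsave_sizes() is made up for this example.  CPUID leaf 0xD, sub-leaf 0 reports in EBX the size needed for the features currently enabled in XCR0, and in ECX the size needed for all features the CPU supports.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/*
 * Query CPUID leaf 0xD, sub-leaf 0.
 * Returns 1 and fills in both sizes if the leaf is available,
 * 0 otherwise (including on non-x86 builds).
 */
static int
query_xsave_sizes(uint32_t *enabled, uint32_t *max_supported)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(0xd, 0, &eax, &ebx, &ecx, &edx))
		return 0;
	*enabled = ebx;		/* size for features enabled in XCR0 */
	*max_supported = ecx;	/* size for all supported features */
	return 1;
#else
	(void)enabled;
	(void)max_supported;
	return 0;
#endif
}

/* Print the sizes to the given stream; returns 0 on success. */
static int
report_xsave_sizes(FILE *out)
{
	uint32_t enabled = 0, max_supported = 0;

	if (!query_xsave_sizes(&enabled, &max_supported)) {
		fprintf(out, "CPUID leaf 0xD not available\n");
		return -1;
	}
	fprintf(out, "XSAVE size (enabled features): %u\n",
	    (unsigned)enabled);
	fprintf(out, "XSAVE size (all supported):    %u\n",
	    (unsigned)max_supported);
	return 0;
}
```

Calling report_xsave_sizes(stdout) from a small main() on the host (or inside a Linux guest started with the same "-cpu" setting) shows what the guest would see.  A value far above 4096 would suggest the same overflow.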
>Fix:
Workaround: in Qemu, select a different CPU type without AVX-512, e.g., Sandy Bridge.

>Audit-Trail:
From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57661 CVS commit: src/sys/arch/x86/x86
Date: Sun, 15 Oct 2023 13:13:22 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Sun Oct 15 13:13:22 UTC 2023

 Modified Files:
 	src/sys/arch/x86/x86: identcpu.c

 Log Message:
 x86: Panic if cpuid's fpu save size is larger than we support.

 Ideally this wouldn't panic, but the alternative right now is to
 crash in a memset later -- or silently corrupt kernel memory -- so
 this doesn't make the situation worse than it was before.

 PR kern/57661

 XXX pullup-10
 XXX pullup-9
 XXX pullup-8


 To generate a diff of this commit:
 cvs rdiff -u -r1.123 -r1.124 src/sys/arch/x86/x86/identcpu.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57661 CVS commit: src/sys/arch/x86/x86
Date: Tue, 17 Oct 2023 11:12:33 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Tue Oct 17 11:12:33 UTC 2023

 Modified Files:
 	src/sys/arch/x86/x86: identcpu.c

 Log Message:
 x86: Panic early if fpu save size is too large, take 2.

 This shouldn't break any existing systems (for real this time), but
 it should make the failure mode more obvious on systems that are
 already broken.

 PR kern/57661

 XXX pullup-10
 XXX pullup-9
 XXX pullup-8


 To generate a diff of this commit:
 cvs rdiff -u -r1.126 -r1.127 src/sys/arch/x86/x86/identcpu.c

From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57661 CVS commit: src/sys/arch/x86/x86
Date: Tue, 17 Oct 2023 14:17:42 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Tue Oct 17 14:17:42 UTC 2023

 Modified Files:
 	src/sys/arch/x86/x86: identcpu.c

 Log Message:
 Revert "x86: Panic early if fpu save size is too large, take 2."

 Apparently this is too early to print anything useful, so it just
 causes a reboot loop.

 PR kern/57661


 To generate a diff of this commit:
 cvs rdiff -u -r1.127 -r1.128 src/sys/arch/x86/x86/identcpu.c

Copyright © 1994-2023 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.