NetBSD Problem Report #57661
From www@netbsd.org Sun Oct 15 12:12:43 2023
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 866001A9238
for <gnats-bugs@gnats.NetBSD.org>; Sun, 15 Oct 2023 12:12:43 +0000 (UTC)
Message-Id: <20231015121242.36BCF1A923A@mollari.NetBSD.org>
Date: Sun, 15 Oct 2023 12:12:42 +0000 (UTC)
From: logix@foobar.franken.de
Reply-To: logix@foobar.franken.de
To: gnats-bugs@NetBSD.org
Subject: Crash when booting on Xeon Silver 4416+ in KVM/Qemu
X-Send-Pr-Version: www-1.0
>Number: 57661
>Category: port-amd64
>Synopsis: Crash when booting on Xeon Silver 4416+ in KVM/Qemu
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: port-amd64-maintainer
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sun Oct 15 12:15:00 +0000 2023
>Last-Modified: Tue Oct 17 14:20:01 +0000 2023
>Originator: Harold Gutch
>Release: NetBSD current
>Organization:
>Environment:
NetBSD 10.99.10 NetBSD 10.99.10 (GENERIC) #0: Thu Oct 12 23:51:05 UTC 2023 mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
>Description:
Booting a current kernel on a Xeon Silver 4416+ CPU inside KVM/Qemu yields:
[ 1.5179371] uvm_fault(0xffffffff81911220, 0xffffac805182a000, 2) -> e
[ 1.5179371] fatal page fault in supervisor mode
[ 1.5179371] trap type 6 code 0x2 rip 0xffffffff80fdcfdc cs 0x8 rflags 0x10206 cr2 0xffffac805182a000 ilevel 0 rsp 0xffffffff81d56d38
[ 1.5280253] curlwp 0xffffffff8188b9c0 pid 0.0 lowest kstack 0xffffffff81d512c0
kernel: page fault trap, code=0
Stopped in pid 0.0 (system) at netbsd:memset+0x2c: repe stosq %es:(%rdi)
memset() at netbsd:memset+0x2c
lwp_create() at netbsd:lwp_create+0x325
fork1() at netbsd:fork1+0x43b
main() at netbsd:main+0x47a
The stack trace is a bit of a red herring. I traced the memset down to line 343 of src/sys/arch/x86/x86/fpu.c, so the actual call chain is:
lwp_create() -> uvm_lwp_fork() -> cpu_lwp_fork() -> fpu_lwp_fork() -> memset()
Adding
printf("DEBUG: sizeof(pcb2->pcb_savefpu)==%zu\n", sizeof(pcb2->pcb_savefpu));
printf("DEBUG: x86_fpu_save_size==%d\n", x86_fpu_save_size);
before the memset() call prints
[ 1.8432366] DEBUG: sizeof(pcb2->pcb_savefpu)==576
[ 1.8432366] DEBUG: x86_fpu_save_size==11008
Changing the VM's CPU to Sandy Bridge prints
[ 1.8897648] DEBUG: sizeof(pcb2->pcb_savefpu)==576
[ 1.8897648] DEBUG: x86_fpu_save_size==832
... which also *seems* odd, but the machine boots and works in that configuration. The comment at line 80 of src/sys/arch/amd64/include/pcb.h appears to suggest that pcb_savefpu extends to the end of the page, so I guess the 832 vs. 576 byte discrepancy falls under "yes... but that's OK". With x86_fpu_save_size==11008, however, the memset() writes far beyond the end of the page.
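The overrun described above can be put into numbers. The following is a minimal sketch, not code from the NetBSD tree: PAGE_BYTES, fpu_save_overrun() and its offset argument are hypothetical names introduced here, and the position of pcb_savefpu within its page is an assumption, since the report does not state it.

```c
#include <stddef.h>

/*
 * Illustrative only: PAGE_BYTES and the offset passed by the caller
 * are assumptions for this sketch, not values taken from the NetBSD
 * sources.
 */
#define PAGE_BYTES 4096u

/*
 * Return how many bytes a clear of save_size bytes, starting at
 * fpu_offset within a page, would write past the end of that page.
 * Returns 0 if the whole area fits.
 */
static size_t
fpu_save_overrun(size_t fpu_offset, size_t save_size)
{
	size_t room;

	/* An offset at or beyond the page end leaves no room at all. */
	if (fpu_offset >= PAGE_BYTES)
		return save_size;
	room = PAGE_BYTES - fpu_offset;
	return save_size > room ? save_size - room : 0;
}
```

For example, under an assumed offset of 128 bytes, fpu_save_overrun(128, 832) returns 0, while fpu_save_overrun(128, 11008) returns 7040, which is more than one full page past the boundary.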
>How-To-Repeat:
Boot NetBSD as a guest under Linux KVM/Qemu with "-cpu host" on a host with a Xeon Silver 4416+ CPU.
Possibly also: boot NetBSD natively on a machine with such a CPU (untested so far; I don't currently have such a machine available for testing).
>Fix:
In Qemu, select a different CPU type without AVX-512, e.g., Sandy Bridge.
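To check in advance whether a given CPU model is likely to be affected, the save area size can be read from CPUID leaf 0xD, sub-leaf 0. The following is a rough sketch that assumes a GCC or Clang toolchain providing <cpuid.h>; xsave_area_sizes() is a hypothetical helper name, and which of the two reported values the kernel stores in x86_fpu_save_size is not stated in this report.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/*
 * Query CPUID leaf 0xD, sub-leaf 0, using the GCC/Clang <cpuid.h>
 * helper.  On success, *enabled gets EBX (bytes needed for the state
 * components currently enabled in XCR0) and *max gets ECX (bytes
 * needed if every supported component were enabled).  Returns false,
 * leaving the outputs untouched, if the leaf is not available or the
 * build target is not x86.
 */
static bool
xsave_area_sizes(unsigned int *enabled, unsigned int *max)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(0x0d, 0, &eax, &ebx, &ecx, &edx))
		return false;
	*enabled = ebx;
	*max = ecx;
	return true;
#else
	(void)enabled;
	(void)max;
	return false;
#endif
}
```

Printing both values from a small program run inside the guest should show whether the selected "-cpu" model reports a save area larger than the kernel can handle.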
>Audit-Trail:
From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/57661 CVS commit: src/sys/arch/x86/x86
Date: Sun, 15 Oct 2023 13:13:22 +0000
Module Name: src
Committed By: riastradh
Date: Sun Oct 15 13:13:22 UTC 2023
Modified Files:
src/sys/arch/x86/x86: identcpu.c
Log Message:
x86: Panic if cpuid's fpu save size is larger than we support.
Ideally this wouldn't panic, but the alternative right now is to
crash in a memset later -- or silently corrupt kernel memory -- so
this doesn't make the situation worse than it was before.
PR kern/57661
XXX pullup-10
XXX pullup-9
XXX pullup-8
To generate a diff of this commit:
cvs rdiff -u -r1.123 -r1.124 src/sys/arch/x86/x86/identcpu.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
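The kind of check this log message describes might look roughly like the sketch below. This is not the committed diff: fpu_save_size_ok() and its supported_max parameter are hypothetical, and it is written as userland C, so it reports the problem and returns false where a kernel would panic.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Return true if the save size reported by the CPU fits in the space
 * reserved for it; otherwise report the mismatch and return false.
 * A kernel would panic at the point where this sketch returns false.
 */
static bool
fpu_save_size_ok(size_t reported, size_t supported_max)
{
	if (reported <= supported_max)
		return true;
	fprintf(stderr, "fpu save size %zu exceeds supported maximum %zu\n",
	    reported, supported_max);
	return false;
}
```

Failing at identification time keeps the error message next to its cause, instead of leaving it to a later memset() to fault or to corrupt memory silently.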
From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/57661 CVS commit: src/sys/arch/x86/x86
Date: Tue, 17 Oct 2023 11:12:33 +0000
Module Name: src
Committed By: riastradh
Date: Tue Oct 17 11:12:33 UTC 2023
Modified Files:
src/sys/arch/x86/x86: identcpu.c
Log Message:
x86: Panic early if fpu save size is too large, take 2.
This shouldn't break any existing systems (for real this time), but
it should make the failure mode more obvious on systems that are
already broken.
PR kern/57661
XXX pullup-10
XXX pullup-9
XXX pullup-8
To generate a diff of this commit:
cvs rdiff -u -r1.126 -r1.127 src/sys/arch/x86/x86/identcpu.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/57661 CVS commit: src/sys/arch/x86/x86
Date: Tue, 17 Oct 2023 14:17:42 +0000
Module Name: src
Committed By: riastradh
Date: Tue Oct 17 14:17:42 UTC 2023
Modified Files:
src/sys/arch/x86/x86: identcpu.c
Log Message:
Revert "x86: Panic early if fpu save size is too large, take 2."
Apparently this is too early to print anything useful, so it just
causes a reboot loop.
PR kern/57661
To generate a diff of this commit:
cvs rdiff -u -r1.127 -r1.128 src/sys/arch/x86/x86/identcpu.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
Copyright © 1994-2023
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.