NetBSD Problem Report #57661

From www@netbsd.org  Sun Oct 15 12:12:43 2023
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 866001A9238
	for <gnats-bugs@gnats.NetBSD.org>; Sun, 15 Oct 2023 12:12:43 +0000 (UTC)
Message-Id: <20231015121242.36BCF1A923A@mollari.NetBSD.org>
Date: Sun, 15 Oct 2023 12:12:42 +0000 (UTC)
From: logix@foobar.franken.de
Reply-To: logix@foobar.franken.de
To: gnats-bugs@NetBSD.org
Subject: Crash when booting on Xeon Silver 4416+ in KVM/Qemu
X-Send-Pr-Version: www-1.0

>Number:         57661
>Notify-List:    riastradh@NetBSD.org, manu@NetBSD.org, andrew.cagney@gmail.com
>Category:       port-amd64
>Synopsis:       Crash when booting on Xeon Silver 4416+ in KVM/Qemu
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    port-amd64-maintainer
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Oct 15 12:15:00 +0000 2023
>Closed-Date:    Mon Feb 02 03:44:26 +0000 2026
>Last-Modified:  Mon Feb 02 22:55:01 +0000 2026
>Originator:     Harold Gutch
>Release:        NetBSD current
>Organization:
>Environment:
NetBSD  10.99.10 NetBSD 10.99.10 (GENERIC) #0: Thu Oct 12 23:51:05 UTC 2023  mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
>Description:
Booting a current kernel on a Xeon Silver 4416+ CPU inside KVM/Qemu yields:

[   1.5179371] uvm_fault(0xffffffff81911220, 0xffffac805182a000, 2) -> e
[   1.5179371] fatal page fault in supervisor mode
[   1.5179371] trap type 6 code 0x2 rip 0xffffffff80fdcfdc cs 0x8 rflags 0x10206 cr2 0xffffac805182a000 ilevel 0 rsp 0xffffffff81d56d38
[   1.5280253] curlwp 0xffffffff8188b9c0 pid 0.0 lowest kstack 0xffffffff81d512c0
kernel: page fault trap, code=0
Stopped in pid 0.0 (system) at  netbsd:memset+0x2c:     repe stosq      %es:(%rdi)
memset() at netbsd:memset+0x2c
lwp_create() at netbsd:lwp_create+0x325
fork1() at netbsd:fork1+0x43b
main()at netbsd:main+0x47a


The stack trace is a bit of a red herring, I traced down the memset to line 343 of src/sys/arch/x86/x86/fpu.c, so we actually have:
lwp_create() -> uvm_lwp_fork() -> cpu_lwp_fork() -> fpu_lwp_fork() -> memset()

Adding
  printf("DEBUG: sizeof(pcb2->pcb_savefpu)==%ld\n", sizeof(pcb2->pcb_savefpu));
  printf("DEBUG: x86_fpu_save_size==%d\n", x86_fpu_save_size);

before the memset() call prints

  [   1.8432366] DEBUG: sizeof(pcb2->pcb_savefpu)==576
  [   1.8432366] DEBUG: x86_fpu_save_size==11008

Changing the VM's CPU to Sandy Bridge prints

  [  1.8897648] DEBUG: sizeof(pcb2->pcb_savefpu)==576
  [  1.8897648] DEBUG: x86_fpu_save_size==832

... which also *seems* odd, but the machine works then.  But the comment in line 80 of src/sys/arch/amd64/include/pcb.h appears to suggest that pcb_savefpu goes until the end of the page, so I guess the 832 vs 576 bytes discrepancy falls under "yes... but that's OK".  But with x86_fpu_save_size==11008 we are writing far beyond the end of the page.
>How-To-Repeat:
Boot NetBSD on Linux in Qemu with "-cpu host" on a host with a Xeon Silver 4416+ CPU.

Possibly alternatively:  Boot NetBSD natively on a machine with such a CPU (untested as of now, I don't have such a machine in testing state available right now)
>Fix:
In Qemu, select a different CPU type without AVX-512, e.g., Sandy Bridge.

>Release-Note:

>Audit-Trail:
From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57661 CVS commit: src/sys/arch/x86/x86
Date: Sun, 15 Oct 2023 13:13:22 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Sun Oct 15 13:13:22 UTC 2023

 Modified Files:
 	src/sys/arch/x86/x86: identcpu.c

 Log Message:
 x86: Panic if cpuid's fpu save size is larger than we support.

 Ideally this wouldn't panic, but the alternative right now is to
 crash in a memset later -- or silently corrupt kernel memory -- so
 this doesn't make the situation worse than it was before.

 PR kern/57661

 XXX pullup-10
 XXX pullup-9
 XXX pullup-8


 To generate a diff of this commit:
 cvs rdiff -u -r1.123 -r1.124 src/sys/arch/x86/x86/identcpu.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57661 CVS commit: src/sys/arch/x86/x86
Date: Tue, 17 Oct 2023 11:12:33 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Tue Oct 17 11:12:33 UTC 2023

 Modified Files:
 	src/sys/arch/x86/x86: identcpu.c

 Log Message:
 x86: Panic early if fpu save size is too large, take 2.

 This shouldn't break any existing systems (for real this time), but
 it should make the failure mode more obvious on systems that are
 already broken.

 PR kern/57661

 XXX pullup-10
 XXX pullup-9
 XXX pullup-8


 To generate a diff of this commit:
 cvs rdiff -u -r1.126 -r1.127 src/sys/arch/x86/x86/identcpu.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57661 CVS commit: src/sys/arch/x86/x86
Date: Tue, 17 Oct 2023 14:17:42 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Tue Oct 17 14:17:42 UTC 2023

 Modified Files:
 	src/sys/arch/x86/x86: identcpu.c

 Log Message:
 Revert "x86: Panic early if fpu save size is too large, take 2."

 Apparently this is too early to print anything useful, so it just
 causes a reboot loop.

 PR kern/57661


 To generate a diff of this commit:
 cvs rdiff -u -r1.127 -r1.128 src/sys/arch/x86/x86/identcpu.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Taylor R Campbell <riastradh@NetBSD.org>
To: netbsd-bugs@NetBSD.org, gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-amd64/57661: Crash when booting on Xeon Silver 4416+ in KVM/Qemu
Date: Fri, 11 Apr 2025 14:25:46 +0000

 This is a multi-part message in MIME format.
 --=_tVq9BOFzAnpXS9N5Qy8XBZQoi+hFOgDk

 Followup from thread on tech-kern:

 https://mail-index.netbsd.org/tech-kern/2025/04/11/msg030349.html

 --=_tVq9BOFzAnpXS9N5Qy8XBZQoi+hFOgDk
 Content-Type: message/rfc822
 Content-Disposition: inline

 Date: Fri, 11 Apr 2025 14:24:47 +0000
 From: Taylor R Campbell <riastradh@NetBSD.org>
 To: Emmanuel Dreyfus <manu@netbsd.org>
 CC: tech-kern@netbsd.org
 Subject: Re: Page fault early in boot
 Message-Id: <20250411142453.75A4284D3F@mail.netbsd.org>
 In-reply-to: <Z_jRY4rZhMk4WALO@homeworld.netbsd.org> (manu@netbsd.org)
 MIME-Version: 1.0
 Content-Type: text/plain; charset=us-ascii
 Content-Transfer-Encoding: quoted-printable

 > Date: Fri, 11 Apr 2025 08:22:59 +0000
 > From: Emmanuel Dreyfus <manu@netbsd.org>
 >=20
 > I tried to boot NetBSD-current (i386 and amd6), BIOS and UEFI modes) on a=
 =20
 > Dell server, it always panic on a "fatal page fault in supervisor mode"
 > early during the boot. The backtrace says we went there through
 > main/fork1/lwp_create/memset
 >=20
 > A screenshot is available there:
 > https://dl.espci.fr/ticket/2a34048a0a9749c68cfb0dc380cc2259
 >=20
 > Any idea how to debug that? I expected NetBSD to lack some hardware
 > support on this recent machine, not to be unable to reach single user=20
 > mode.

 This looks like the same issue as:

 PR kern/57661: Crash when booting on Xeon Silver 4416+ in KVM/Qemu
 https://gnats.netbsd.org/57661

 Some additional diagnostics from the PR:

 > The stack trace is a bit of a red herring, I traced down the memset to li=
 ne 343 of src/sys/arch/x86/x86/fpu.c, so we actually have:
 > lwp_create() -> uvm_lwp_fork() -> cpu_lwp_fork() -> fpu_lwp_fork() -> mem=
 set()
 >=20
 > Adding
 >   printf("DEBUG: sizeof(pcb2->pcb_savefpu)=3D=3D%ld\n", sizeof(pcb2->pcb_=
 savefpu));
 >   printf("DEBUG: x86_fpu_save_size=3D=3D%d\n", x86_fpu_save_size);
 >=20
 > before the memset() call prints
 >=20
 >   [   1.8432366] DEBUG: sizeof(pcb2->pcb_savefpu)=3D=3D576
 >   [   1.8432366] DEBUG: x86_fpu_save_size=3D=3D11008
 >=20
 > Changing the VM's CPU to Sandy Bridge prints
 >=20
 >   [  1.8897648] DEBUG: sizeof(pcb2->pcb_savefpu)=3D=3D576
 >   [  1.8897648] DEBUG: x86_fpu_save_size=3D=3D832

 So, the x86_fpu_save_size is larger than we can handle and we don't
 report the problem gracefully either -- my initial attempt to put a
 more obvious panic in x86/identcpu.c when we initially determine
 x86_fpu_save_size hit the problem that it happens too early and sends
 the machine into a reboot loop.

 Unfortunately, simply expanding the space for pcb_savefpu is not
 trivial:

      80 /*
      81  * IMPORTANT NOTE: this structure, including the variable-sized FPU=
  state at
      82  * the end, must fit within one page.
      83  */
      84 struct pcb {
 ...
     103 	union savefpu	pcb_savefpu __aligned(64); /* floating point state */
     104 	/* **** DO NOT ADD ANYTHING HERE **** */
     105 };

 https://nxr.netbsd.org/xref/src/sys/arch/amd64/include/pcb.h?r=3D1.32#80

 The top comment came from this commit:

 > Module Name:    src
 > Committed By:   maxv
 > Date:           Tue Mar 17 17:18:49 UTC 2020
 >=20
 > Modified Files:
 >         src/sys/arch/amd64/include: cpu.h param.h pcb.h types.h
 >         src/sys/arch/x86/x86: vm_machdep.c
 >=20
 > Log Message:
 > Add a redzone between the pcb and the stack. Sent to port-amd64@.

 https://mail-index.netbsd.org/source-changes/2020/03/17/msg115178.html

 So, the immediate issue is that the memset in fpu_lwp_fork is hitting
 the guard page.  Perhaps resolving that is simply a matter of teaching
 cpu_uarea_alloc to advance by round_page(offsetof(struct pcb,
 pcb_savefpu) + x86_fpu_save_size) rather than by PAGE_SIZE when
 deciding where to set the guard page:

     364 void *
     365 cpu_uarea_alloc(bool system)
     366 {
     367 	vaddr_t base, va;
     368 	paddr_t pa;
     369=20
     370 	base =3D uvm_km_alloc(kernel_map, USPACE + PAGE_SIZE, 0,
     371 	    UVM_KMF_WIRED|UVM_KMF_WAITVA);
     372=20
     373 	/* Page[1] =3D RedZone */
     374 	va =3D base + PAGE_SIZE;
     375 	if (!pmap_extract(pmap_kernel(), va, &pa)) {
     376 		panic("%s: impossible, Page[1] unmapped", __func__);
     377 	}
     378 	pmap_kremove(va, PAGE_SIZE);
     379 	uvm_pagefree(PHYS_TO_VM_PAGE(pa));

 https://nxr.netbsd.org/xref/src/sys/arch/x86/x86/vm_machdep.c?r=3D1.46#352

 But the bottom comment, /* **** DO NOT ADD ANYTHING HERE **** */, is
 much older:

 > Module Name:    src
 > Committed By:   dsl
 > Date:           Thu Feb 20 18:19:10 UTC 2014
 >=20
 > Modified Files:
 >         src/sys/arch/amd64/amd64: genassym.cf machdep.c
 >         src/sys/arch/amd64/include: pcb.h proc.h
 >         src/sys/arch/i386/i386: locore.S machdep.c
 >         src/sys/arch/i386/include: pcb.h proc.h
 >         src/sys/arch/x86/x86: vm_machdep.c
 >         src/sys/sys: param.h
 >=20
 > Log Message:
 > Move the amd64 and i386 pcb to the bottom of the uarea, and move the
 >   kernel stack to the top.
 > Change the pcb layouts so that fpu save area is at the end and is
 >   64byte aligned ready for xsave (saving the ymm registers).
 > Welcome to 6.99.32

 https://mail-index.netbsd.org/source-changes/2014/02/20/msg051841.html

 It's not obvious to me whether the comment affects anything at hand.
 Perhaps it's just admonishing future developers not to add things
 _other than fpu state_ because the fpu state is actually variable size
 and may extend past the size of a union savefpu object.

 --=_tVq9BOFzAnpXS9N5Qy8XBZQoi+hFOgDk--

From: Taylor R Campbell <riastradh@NetBSD.org>
To: hgutch@NetBSD.org, manu@NetBSD.org, andrew.cagney@gmail.com
Cc: netbsd-bugs@NetBSD.org, gnats-bugs@NetBSD.org
Subject: Re: port-amd64/57661: Crash when booting on Xeon Silver 4416+ in KVM/Qemu
Date: Fri, 11 Apr 2025 18:12:49 +0000

 This is a multi-part message in MIME format.
 --=_jGdl80rZzbJ3AV6nW+2j7DYaEAx46dnv

 The attached patch bumps UPAGES and USPACE to make room for the extra
 FPU save size.  Can you please try it out on the affected machines?
 And, if it boots, can you run the attached program xfeat2.c and share
 the output?

 The patch isn't great as is -- it unconditionally increases the memory
 usage of every thread by two pages, even on machines where they'll
 never be used.  But if this works, we can look into sizing it (mostly)
 dynamically based on the cpuid results.

 --=_jGdl80rZzbJ3AV6nW+2j7DYaEAx46dnv
 Content-Type: text/plain; charset="ISO-8859-1"; name="pr57661-x86fpusaveoverflow-growuspacepoc"
 Content-Transfer-Encoding: quoted-printable
 Content-Disposition: attachment; filename="pr57661-x86fpusaveoverflow-growuspacepoc.patch"

 diff -r 1cb0546d18b6 sys/arch/amd64/amd64/machdep.c
 --- a/sys/arch/amd64/amd64/machdep.c	Thu Apr 10 18:53:29 2025 +0000
 +++ b/sys/arch/amd64/amd64/machdep.c	Fri Apr 11 17:01:24 2025 +0000
 @@ -1784,6 +1784,26 @@ init_x86_64(paddr_t first_avail)
  	consinit();	/* XXX SHOULD NOT BE DONE HERE */
 =20
  	/*
 +	 * Validate x86_fpu_save_size (determined by cpu_probe above)
 +	 * and set uspace based on it.  We do this now, rather than
 +	 * earlier, so that if we have to panic, the console has been
 +	 * initialized.  We don't allow arbitrary x86_fpu_save_size
 +	 * because we statically preallocate the largest possible size
 +	 * early at start in locore.S, before we know the save size.
 +	 *
 +	 * cpu_uarea_alloc in x86/vm_machdep.c relies on this
 +	 * validation.
 +	 */
 +	__CTASSERT(offsetof(struct pcb, pcb_savefpu) < PAGE_SIZE);
 +	if (x86_fpu_save_size > PAGE_SIZE - offsetof(struct pcb, pcb_savefpu) +
 +	    UPAGES_FPU*PAGE_SIZE) {
 +		panic("x86_fpu_save_size too large: %u > %zu",
 +		    x86_fpu_save_size,
 +		    (PAGE_SIZE - offsetof(struct pcb, pcb_savefpu) +
 +			UPAGES_FPU*PAGE_SIZE));
 +	}
 +
 +	/*
  	 * Initialize PAGE_SIZE-dependent variables.
  	 */
  	uvm_md_init();
 diff -r 1cb0546d18b6 sys/arch/amd64/include/param.h
 --- a/sys/arch/amd64/include/param.h	Thu Apr 10 18:53:29 2025 +0000
 +++ b/sys/arch/amd64/include/param.h	Fri Apr 11 17:01:24 2025 +0000
 @@ -69,11 +69,29 @@
  #define	SINCR		1		/* increment of stack/NBPG */
 =20
  #if defined(KASAN) || defined(KMSAN)
 -#define	UPAGES		8
 +#define UPAGES_KxSAN	2
 +#else
 +#define	UPAGES_KxSAN	0
 +#endif
 +#if defined(SVS)
 +#define	UPAGES_SVS	1
 +#else
 +#define	UPAGES_SVS	0
 +#endif
 +#define	UPAGES_PCB	1	/* one page for the PCB */
 +#define	UPAGES_FPU	2	/* two extra pages for fpusave */
 +#define	UPAGES_RED	1	/* one page for red zone between pcb/stack */
 +#define	UPAGES_STACK	3	/* three pages (12 KiB) of stack space */
 +#define	UPAGES		\
 +	(UPAGES_PCB + UPAGES_FPU + UPAGES_RED + UPAGES_STACK + UPAGES_SVS +   \
 +	    UPAGES_KxSAN)
 +
 +#if defined(KASAN) || defined(KMSAN)
 +__CTASSERT(UPAGES =3D=3D 10);
  #elif defined(SVS)
 -#define	UPAGES		6		/* 1 page used internally by SVS */
 +__CTASSERT(UPAGES =3D=3D 8);
  #else
 -#define	UPAGES		5		/* pages of u-area (1 for redzone) */
 +__CTASSERT(UPAGES =3D=3D 7);
  #endif
  #define	USPACE		(UPAGES * NBPG)	/* total size of u-area */
 =20
 diff -r 1cb0546d18b6 sys/arch/x86/x86/vm_machdep.c
 --- a/sys/arch/x86/x86/vm_machdep.c	Thu Apr 10 18:53:29 2025 +0000
 +++ b/sys/arch/x86/x86/vm_machdep.c	Fri Apr 11 17:01:24 2025 +0000
 @@ -351,9 +351,11 @@ vunmapbuf(struct buf *bp, vsize_t len)
  #ifdef __HAVE_CPU_UAREA_ROUTINES
  /*
   * Layout of the uarea:
 - *    Page[0]        =3D PCB
 - *    Page[1]        =3D RedZone
 - *    Page[2]        =3D Stack
 + *    Page[0]        =3D PCB and start of FPU save area
 + *    Page[1]        =3D FPU save extension page 1
 + *    Page[2]        =3D FPU save extension page 2
 + *    Page[3]        =3D RedZone
 + *    Page[4]        =3D Stack
   *    Page[...]      =3D Stack
   *    Page[UPAGES-1] =3D Stack
   *    Page[UPAGES]   =3D RedZone
 @@ -370,8 +372,8 @@ cpu_uarea_alloc(bool system)
  	base =3D uvm_km_alloc(kernel_map, USPACE + PAGE_SIZE, 0,
  	    UVM_KMF_WIRED|UVM_KMF_WAITVA);
 =20
 -	/* Page[1] =3D RedZone */
 -	va =3D base + PAGE_SIZE;
 +	/* Page[UPAGES_PCB + UPAGES_FPU] =3D RedZone */
 +	va =3D base + (UPAGES_PCB + UPAGES_FPU)*PAGE_SIZE;
  	if (!pmap_extract(pmap_kernel(), va, &pa)) {
  		panic("%s: impossible, Page[1] unmapped", __func__);
  	}

 --=_jGdl80rZzbJ3AV6nW+2j7DYaEAx46dnv
 Content-Type: text/plain; charset="ISO-8859-1"; name="xfeat2"
 Content-Transfer-Encoding: quoted-printable
 Content-Disposition: attachment; filename="xfeat2.c"

 #include <stdint.h>
 #include <stdio.h>
 #include <util.h>

 #define	XCR0_FMT	"\177\020"					      \
 	"b\000"		"x87 FPU/MMX\0"					      \
 	"b\001"		"SSE\0"						      \
 	"b\002"		"AVX\0"						      \
 	"b\003"		"BNDREG\0"					      \
 	"b\004"		"BNDCSR\0"					      \
 	"b\005"		"Opmask\0"					      \
 	"b\006"		"ZMM_Hi256\0"					      \
 	"b\007"		"Hi16_ZMM\0"					      \
 	"b\011"		"PKRU\0"					      \
 	/* end of XCR0_FMT */

 int
 main(void)
 {
 	uint32_t eax, ebx, ecx, edx, xcr0lo, xcr0hi;
 	uint64_t xcr0;
 	char buf[128];

 	asm("cpuid"
 	    : "=3Da"(eax), "=3Db"(ebx), "=3Dc"(ecx), "=3Dd"(edx)
 	    : "a"(0x0d), "c"(0x00));
 	snprintb(buf, sizeof(buf), XCR0_FMT, eax);
 	printf("XFeatureSupportedMask[0:31]	=3D %s\n", buf);
 	printf("XFeatureEnabledSizeMask		=3D 0x%x\n", ebx);
 	printf("XFeatureSupportedSizeMask	=3D 0x%x\n", ecx);
 	snprintb(buf, sizeof(buf), XCR0_FMT, (uint64_t)edx << 32);
 	printf("XFeatureSupportedMask[32:63]	=3D %s\n", buf);
 	fflush(stdout);

 	asm("xgetbv" : "=3Da"(xcr0lo), "=3Dd"(xcr0hi) : "c"(0));
 	xcr0 =3D (uint64_t)xcr0hi << 32 | xcr0lo;
 	snprintb(buf, sizeof(buf), XCR0_FMT, xcr0);
 	printf("xcr0				=3D %s\n", buf);
 	fflush(stdout);

 	fflush(stdout);
 	return ferror(stdout);
 }

 --=_jGdl80rZzbJ3AV6nW+2j7DYaEAx46dnv--

From: Harold Gutch <logix@foobar.franken.de>
To: gnats-bugs@netbsd.org
Cc: port-amd64-maintainer@netbsd.org, gnats-admin@netbsd.org,
        netbsd-bugs@netbsd.org, riastradh@NetBSD.org, manu@NetBSD.org,
        andrew.cagney@gmail.com
Subject: Re: port-amd64/57661: Crash when booting on Xeon Silver 4416+ in KVM/Qemu
Date: Fri, 11 Apr 2025 20:32:46 +0200

 Success for me (in a VM) with your patch.

   [     1.000004] cpu0 at mainbus0 apid 0
   [     1.000004] cpu0: Use lfence to serialize rdtsc
   [     1.000004] cpu0: Intel(R) Xeon(R) Silver 4416+, id 0x806f8
   [     1.000004] cpu0: node 0, package 0, core 0, smt 0
   [...]

 and the system reaches multiuser.

   : {7} ./xfeat2
   XFeatureSupportedMask[0:31]	= 0x602e7<x87 FPU/MMX,SSE,AVX,Opmask,ZMM_Hi256,Hi16_ZMM,PKRU>
   XFeatureEnabledSizeMask		= 0xa80
   XFeatureSupportedSizeMask	= 0x2b00
   XFeatureSupportedMask[32:63]	= 0
   xcr0				= 0xe7<x87 FPU/MMX,SSE,AVX,Opmask,ZMM_Hi256,Hi16_ZMM>


 thanks!
   Harold

From: Taylor R Campbell <riastradh@NetBSD.org>
To: Harold Gutch <logix@foobar.franken.de>
Cc: gnats-bugs@netbsd.org, port-amd64-maintainer@netbsd.org,
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
	manu@NetBSD.org, andrew.cagney@gmail.com
Subject: Re: port-amd64/57661: Crash when booting on Xeon Silver 4416+ in KVM/Qemu
Date: Fri, 11 Apr 2025 21:09:44 +0000

 This is a multi-part message in MIME format.
 --=_9iRVuQui7SP0XI0my0i42KB+RbbJU+ZS

 Thanks, can you try with this patch?

 It incurs less cost on machines with smaller FPU save sizes -- a few
 branches and an extra pointer indirection in the save/restore path,
 but no pages of memory wasted on every thread.  But it's a lot more
 involved, so it's more likely I've screwed something up.

 (Another alternative would be to dynamically change USPACE, but that
 is also pretty involved and likely to get things wrong -- plus it
 still entails requires allocating the pages for kernel threads which
 will never use them.)

 --=_9iRVuQui7SP0XI0my0i42KB+RbbJU+ZS
 Content-Type: text/plain; charset="ISO-8859-1"; name="pr57661-x86fpusaveoverflow-separatefpusave"
 Content-Transfer-Encoding: quoted-printable
 Content-Disposition: attachment; filename="pr57661-x86fpusaveoverflow-separatefpusave.patch"

 diff -r 1cb0546d18b6 sys/arch/amd64/amd64/genassym.cf
 --- a/sys/arch/amd64/amd64/genassym.cf	Thu Apr 10 18:53:29 2025 +0000
 +++ b/sys/arch/amd64/amd64/genassym.cf	Fri Apr 11 21:07:26 2025 +0000
 @@ -186,7 +186,6 @@ define	PCB_FLAGS		offsetof(struct pcb, p
  define	PCB_COMPAT32		PCB_COMPAT32
  define	PCB_FS			offsetof(struct pcb, pcb_fs)
  define	PCB_GS			offsetof(struct pcb, pcb_gs)
 -define	PCB_SAVEFPU		offsetof(struct pcb, pcb_savefpu)
 =20
  define	TF_RDI			offsetof(struct trapframe, tf_rdi)
  define	TF_RSI			offsetof(struct trapframe, tf_rsi)
 diff -r 1cb0546d18b6 sys/arch/amd64/include/param.h
 --- a/sys/arch/amd64/include/param.h	Thu Apr 10 18:53:29 2025 +0000
 +++ b/sys/arch/amd64/include/param.h	Fri Apr 11 21:07:26 2025 +0000
 @@ -69,11 +69,27 @@
  #define	SINCR		1		/* increment of stack/NBPG */
 =20
  #if defined(KASAN) || defined(KMSAN)
 -#define	UPAGES		8
 +#define UPAGES_KxSAN	2
 +#else
 +#define	UPAGES_KxSAN	0
 +#endif
 +#if defined(SVS)
 +#define	UPAGES_SVS	1
 +#else
 +#define	UPAGES_SVS	0
 +#endif
 +#define	UPAGES_PCB	1	/* one page for the PCB */
 +#define	UPAGES_RED	1	/* one page for red zone between pcb/stack */
 +#define	UPAGES_STACK	3	/* three pages (12 KiB) of stack space */
 +#define	UPAGES		\
 +	(UPAGES_PCB + UPAGES_RED + UPAGES_STACK + UPAGES_SVS + UPAGES_KxSAN)
 +
 +#if defined(KASAN) || defined(KMSAN)
 +__CTASSERT(UPAGES =3D=3D 8);
  #elif defined(SVS)
 -#define	UPAGES		6		/* 1 page used internally by SVS */
 +__CTASSERT(UPAGES =3D=3D 6);
  #else
 -#define	UPAGES		5		/* pages of u-area (1 for redzone) */
 +__CTASSERT(UPAGES =3D=3D 5);
  #endif
  #define	USPACE		(UPAGES * NBPG)	/* total size of u-area */
 =20
 diff -r 1cb0546d18b6 sys/arch/amd64/include/pcb.h
 --- a/sys/arch/amd64/include/pcb.h	Thu Apr 10 18:53:29 2025 +0000
 +++ b/sys/arch/amd64/include/pcb.h	Fri Apr 11 21:07:26 2025 +0000
 @@ -97,13 +97,15 @@ struct pcb {
  	struct dbreg *pcb_dbregs;
  	uint16_t pcb_fpu_dflt_cw;
  	int pcb_iopl;
 +	union savefpu *pcb_savefpu;
 =20
 -	uint32_t pcb_unused[8];		/* unused */
 +	uint32_t pcb_unused[6];		/* unused */
 =20
 -	union savefpu	pcb_savefpu __aligned(64); /* floating point state */
 +	union savefpu	pcb_savefpusmall __aligned(64); /* small fpu state */
  	/* **** DO NOT ADD ANYTHING HERE **** */
  };
  #ifndef __lint__
 +__CTASSERT(offsetof(struct pcb, pcb_savefpusmall) =3D=3D 128);
  __CTASSERT(sizeof(struct pcb) - sizeof (union savefpu) =3D=3D  128);
  #endif
 =20
 diff -r 1cb0546d18b6 sys/arch/i386/i386/genassym.cf
 --- a/sys/arch/i386/i386/genassym.cf	Thu Apr 10 18:53:29 2025 +0000
 +++ b/sys/arch/i386/i386/genassym.cf	Fri Apr 11 21:07:26 2025 +0000
 @@ -192,7 +192,6 @@ define	PCB_ESP0		offsetof(struct pcb, pc
  define	PCB_FSD			offsetof(struct pcb, pcb_fsd)
  define	PCB_GSD			offsetof(struct pcb, pcb_gsd)
  define	PCB_IOMAP		offsetof(struct pcb, pcb_iomap)
 -define	PCB_SAVEFPU		offsetof(struct pcb, pcb_savefpu)
 =20
  define	TF_CS			offsetof(struct trapframe, tf_cs)
  define	TF_EIP			offsetof(struct trapframe, tf_eip)
 diff -r 1cb0546d18b6 sys/arch/i386/include/pcb.h
 --- a/sys/arch/i386/include/pcb.h	Thu Apr 10 18:53:29 2025 +0000
 +++ b/sys/arch/i386/include/pcb.h	Fri Apr 11 21:07:26 2025 +0000
 @@ -95,16 +95,18 @@ struct pcb {
 =20
  #define	PCB_DBREGS	0x01
  	int	pcb_flags;
 -
 -	int	not_used[15];
 +	union savefpu	*pcb_savefpu;
 =20
 -	/* floating point state */
 -	union savefpu	pcb_savefpu __aligned(64);
 +	int	not_used[14];
 +
 +	/* small fpu state */
 +	union savefpu	pcb_savefpusmall __aligned(64);
  	/* **** DO NOT ADD ANYTHING HERE **** */
 =20
  };
  #ifndef __lint__
  /* This doesn't really matter, but there is a lot of implied padding */
 +__CTASSERT(offsetof(struct pcb, pcb_savefpusmall) =3D=3D 128);
  __CTASSERT(sizeof(struct pcb) - sizeof (union savefpu) =3D=3D 128);
  #endif
 =20
 diff -r 1cb0546d18b6 sys/arch/x86/include/cpu.h
 --- a/sys/arch/x86/include/cpu.h	Thu Apr 10 18:53:29 2025 +0000
 +++ b/sys/arch/x86/include/cpu.h	Fri Apr 11 21:07:26 2025 +0000
 @@ -481,6 +481,7 @@ extern uint64_t x86_xsave_features;
  extern size_t x86_xsave_offsets[];
  extern size_t x86_xsave_sizes[];
  extern uint32_t x86_fpu_mxcsr_mask;
 +bool x86_fpu_save_separate_p(void);
 =20
  extern void (*x86_cpu_idle)(void);
  #define	cpu_idle() (*x86_cpu_idle)()
 diff -r 1cb0546d18b6 sys/arch/x86/include/cpu_extended_state.h
 --- a/sys/arch/x86/include/cpu_extended_state.h	Thu Apr 10 18:53:29 2025 +0=
 000
 +++ b/sys/arch/x86/include/cpu_extended_state.h	Fri Apr 11 21:07:26 2025 +0=
 000
 @@ -226,6 +226,8 @@ struct xstate {
   * It is defined this way to separate the definitions and to
   * minimise the number of union/struct selectors.
   * NB: Some userspace stuff (eg firefox) uses it to parse ucontext.
 + * NB: This is not actually the largest possible save space;
 + *     x86_fpu_save_size may be larger.
   */
  union savefpu {
  	struct save87 sv_87;
 diff -r 1cb0546d18b6 sys/arch/x86/x86/fpu.c
 --- a/sys/arch/x86/x86/fpu.c	Thu Apr 10 18:53:29 2025 +0000
 +++ b/sys/arch/x86/x86/fpu.c	Fri Apr 11 21:07:26 2025 +0000
 @@ -136,11 +136,35 @@ void fpu_switch(struct lwp *, struct lwp
 =20
  uint32_t x86_fpu_mxcsr_mask __read_mostly =3D 0;
 =20
 +static const union savefpu safe_fpu_storage __aligned(64) =3D {
 +	.sv_xmm =3D {
 +		.fx_mxcsr =3D __SAFE_MXCSR__,
 +	},
 +};
 +static const union savefpu zero_fpu_storage __aligned(64);
 +
 +static const void *safe_fpu __read_mostly =3D &safe_fpu_storage;
 +static const void *zero_fpu __read_mostly =3D &zero_fpu_storage;
 +
 +/*
 + * x86_fpu_save_separate_p()
 + *
 + *	True if we allocate the FPU save space separately, outside the
 + *	struct pcb itself, because it doesn't fit in a single page.
 + */
 +bool
 +x86_fpu_save_separate_p(void)
 +{
 +
 +	return x86_fpu_save_size >
 +	    PAGE_SIZE - offsetof(struct pcb, pcb_savefpusmall);
 +}
 +
  static inline union savefpu *
  fpu_lwp_area(struct lwp *l)
  {
  	struct pcb *pcb =3D lwp_getpcb(l);
 -	union savefpu *area =3D &pcb->pcb_savefpu;
 +	union savefpu *area =3D pcb->pcb_savefpu;
 =20
  	KASSERT((l->l_flag & LW_SYSTEM) =3D=3D 0);
  	if (l =3D=3D curlwp) {
 @@ -155,7 +179,7 @@ static inline void
  fpu_save_lwp(struct lwp *l)
  {
  	struct pcb *pcb =3D lwp_getpcb(l);
 -	union savefpu *area =3D &pcb->pcb_savefpu;
 +	union savefpu *area =3D pcb->pcb_savefpu;
  	int s;
 =20
  	s =3D splvm();
 @@ -192,11 +216,55 @@ fpuinit(struct cpu_info *ci)
  void
  fpuinit_mxcsr_mask(void)
  {
 +	/*
 +	 * If the CPU's x86 fpu save size is larger than union savefpu,
 +	 * we have to allocate larger buffers for the safe and zero FPU
 +	 * states used here and by fpu_kern_enter/leave.
 +	 *
 +	 * Note: This is NOT the same as x86_fpu_save_separate_p(),
 +	 * which may have a little more space than union savefpu.
 +	 */
 +	const bool allocfpusave =3D x86_fpu_save_size > sizeof(union savefpu);
 +	vaddr_t va;
 +
 +#ifdef __i386__
 +	if (x86_fpu_save_separate_p()) {
 +		/*
 +		 * XXX Need to teach cpu_uarea_alloc/free to allocate a
 +		 * separate fpu save space -- currently only done on
 +		 * amd64, not on i386.  Also need to make fpu_lwp_fork
 +		 * not clobber pcb_savefpu.
 +		 */
 +		panic("NetBSD/i386 does not support fpu save size %u",
 +		    x86_fpu_save_size);
 +	}
 +#endif
 +
  #ifndef XENPV
 -	union savefpu fpusave __aligned(64);
 +	union savefpu fpusave_stack __aligned(64);
 +	union savefpu *fpusave;
  	u_long psl;
 =20
 -	memset(&fpusave, 0, sizeof(fpusave));
 +	/*
 +	 * Allocate a temporary save space from the stack if it fits,
 +	 * or from the heap otherwise, so we can query its mxcsr mask.
 +	 */
 +	if (allocfpusave) {
 +		/*
 +		 * Need 64-byte alignment for XSAVE instructions.
 +		 * kmem_* doesn't guarantee that and we don't have a
 +		 * handy posix_memalign in the kernel unless we hack it
 +		 * ourselves with vmem(9), so just ask for page
 +		 * alignment with uvm_km(9).
 +		 */
 +		__CTASSERT(PAGE_SIZE >=3D 64);
 +		va =3D uvm_km_alloc(kernel_map, x86_fpu_save_size, PAGE_SIZE,
 +		    UVM_KMF_WIRED|UVM_KMF_ZERO|UVM_KMF_WAITVA);
 +		fpusave =3D (void *)va;
 +	} else {
 +		fpusave =3D &fpusave_stack;
 +		memset(fpusave, 0, sizeof(*fpusave));
 +	}
 =20
  	/* Disable interrupts, and enable FPU */
  	psl =3D x86_read_psl();
 @@ -204,16 +272,25 @@ fpuinit_mxcsr_mask(void)
  	clts();
 =20
  	/* Fill in the FPU area */
 -	fxsave(&fpusave);
 +	fxsave(fpusave);
 =20
  	/* Restore previous state */
  	stts();
  	x86_write_psl(psl);
 =20
 -	if (fpusave.sv_xmm.fx_mxcsr_mask =3D=3D 0) {
 +	if (fpusave->sv_xmm.fx_mxcsr_mask =3D=3D 0) {
  		x86_fpu_mxcsr_mask =3D __INITIAL_MXCSR_MASK__;
  	} else {
 -		x86_fpu_mxcsr_mask =3D fpusave.sv_xmm.fx_mxcsr_mask;
 +		x86_fpu_mxcsr_mask =3D fpusave->sv_xmm.fx_mxcsr_mask;
 +	}
 +
 +	/*
 +	 * Free the temporary save space.
 +	 */
 +	if (allocfpusave) {
 +		uvm_km_free(kernel_map, va, x86_fpu_save_size, UVM_KMF_WIRED);
 +		fpusave =3D NULL;
 +		va =3D 0;
  	}
  #else
  	/*
 @@ -223,6 +300,26 @@ fpuinit_mxcsr_mask(void)
  	 */
  	x86_fpu_mxcsr_mask =3D __INITIAL_MXCSR_MASK__;
  #endif
 +
 +	/*
 +	 * If necessary, allocate FPU save spaces for safe or zero FPU
 +	 * state, for fpu_kern_enter/leave.
 +	 */
 +	if (allocfpusave) {
 +		__CTASSERT(PAGE_SIZE >=3D 64);
 +
 +		va =3D uvm_km_alloc(kernel_map, x86_fpu_save_size, PAGE_SIZE,
 +		    UVM_KMF_WIRED|UVM_KMF_ZERO|UVM_KMF_WAITVA);
 +		memcpy((void *)va, &safe_fpu_storage,
 +		    sizeof(safe_fpu_storage));
 +		safe_fpu =3D (void *)va;
 +
 +		va =3D uvm_km_alloc(kernel_map, x86_fpu_save_size, PAGE_SIZE,
 +		    UVM_KMF_WIRED|UVM_KMF_ZERO|UVM_KMF_WAITVA);
 +		memcpy((void *)va, &zero_fpu_storage,
 +		    sizeof(zero_fpu_storage));
 +		zero_fpu =3D (void *)va;
 +	}
  }
 =20
  static inline void
 @@ -305,7 +402,7 @@ void
  fpu_handle_deferred(void)
  {
  	struct pcb *pcb =3D lwp_getpcb(curlwp);
 -	fpu_area_restore(&pcb->pcb_savefpu, x86_xsave_features,
 +	fpu_area_restore(pcb->pcb_savefpu, x86_xsave_features,
  	    !(curlwp->l_proc->p_flag & PK_32));
  }
 =20
 @@ -321,7 +418,7 @@ fpu_switch(struct lwp *oldlwp, struct lw
  	if (oldlwp->l_md.md_flags & MDL_FPU_IN_CPU) {
  		KASSERT(!(oldlwp->l_flag & LW_SYSTEM));
  		pcb =3D lwp_getpcb(oldlwp);
 -		fpu_area_save(&pcb->pcb_savefpu, x86_xsave_features,
 +		fpu_area_save(pcb->pcb_savefpu, x86_xsave_features,
  		    !(oldlwp->l_proc->p_flag & PK_32));
  		oldlwp->l_md.md_flags &=3D ~MDL_FPU_IN_CPU;
  	}
 @@ -338,14 +435,28 @@ fpu_lwp_fork(struct lwp *l1, struct lwp=20
  	if (__predict_false(l2->l_flag & LW_SYSTEM)) {
  		return;
  	}
 +
 +#ifdef __i386__
 +	/*
 +	 * XXX i386 currently never allocates separate FPU save space
 +	 * (so it's limited to 4096 - 128 bytes of space available in
 +	 * the pcb page).  Changing this requires removing the panic at
 +	 * boot in fpuinit_mxcsr_mask and adopting something like the
 +	 * uarea allocation in cpu_uarea_alloc/free like amd64.
 +	 */
 +	KASSERT(x86_fpu_save_size <=3D
 +	    PAGE_SIZE - offsetof(struct pcb, pcb_savefpusmall));
 +	pcb2->pcb_savefpu =3D &pcb2->pcb_savefpusmall;
 +#endif
 +
  	/* For init(8). */
  	if (__predict_false(l1->l_flag & LW_SYSTEM)) {
 -		memset(&pcb2->pcb_savefpu, 0, x86_fpu_save_size);
 +		memset(pcb2->pcb_savefpu, 0, x86_fpu_save_size);
  		return;
  	}
 =20
  	fpu_save =3D fpu_lwp_area(l1);
 -	memcpy(&pcb2->pcb_savefpu, fpu_save, x86_fpu_save_size);
 +	memcpy(pcb2->pcb_savefpu, fpu_save, x86_fpu_save_size);
  	l2->l_md.md_flags &=3D ~MDL_FPU_IN_CPU;
  }
 =20
 @@ -378,11 +489,6 @@ fpu_lwp_abandon(struct lwp *l)
  void
  fpu_kern_enter(void)
  {
 -	static const union savefpu safe_fpu __aligned(64) =3D {
 -		.sv_xmm =3D {
 -			.fx_mxcsr =3D __SAFE_MXCSR__,
 -		},
 -	};
  	struct lwp *l =3D curlwp;
  	struct cpu_info *ci;
  	int s;
 @@ -421,7 +527,7 @@ fpu_kern_enter(void)
  	/*
  	 * Zero the FPU registers and install safe control words.
  	 */
 -	fpu_area_restore(&safe_fpu, x86_xsave_features, /*is_64bit*/false);
 +	fpu_area_restore(safe_fpu, x86_xsave_features, /*is_64bit*/false);
  }
 =20
  /*
 @@ -432,7 +538,6 @@ fpu_kern_enter(void)
  void
  fpu_kern_leave(void)
  {
 -	static const union savefpu zero_fpu __aligned(64);
  	struct cpu_info *ci =3D curcpu();
  	int s;
 =20
 @@ -451,7 +556,7 @@ fpu_kern_leave(void)
  	 * through Spectre-class attacks to userland, even if there are
  	 * no bugs in fpu state management.
  	 */
 -	fpu_area_restore(&zero_fpu, x86_xsave_features, /*is_64bit*/false);
 +	fpu_area_restore(zero_fpu, x86_xsave_features, /*is_64bit*/false);
 =20
  	/*
  	 * Set CR0_TS again so that the kernel can't accidentally use
 diff -r 1cb0546d18b6 sys/arch/x86/x86/vm_machdep.c
 --- a/sys/arch/x86/x86/vm_machdep.c	Thu Apr 10 18:53:29 2025 +0000
 +++ b/sys/arch/x86/x86/vm_machdep.c	Fri Apr 11 21:07:26 2025 +0000
 @@ -366,10 +366,42 @@ cpu_uarea_alloc(bool system)
  {
  	vaddr_t base, va;
  	paddr_t pa;
 +	struct pcb *pcb;
 =20
  	base =3D uvm_km_alloc(kernel_map, USPACE + PAGE_SIZE, 0,
  	    UVM_KMF_WIRED|UVM_KMF_WAITVA);
 =20
 +	/*
 +	 * Prepare the FPU save area:
 +	 *
 +	 * 1. If this is a system thread, no save area.
 +	 *    XXX Allocate/free one in kthread_fpu_enter/exit_md.
 +	 *
 +	 * 2. If this is a user thread, and the fpu save size is large
 +	 *    enough, allocate an extra block of memory for it.
 +	 *
 +	 * 3. Otherwise, this is a user thread and the fpu save size
 +	 *    fits inside the pcb page, so use that.
 +	 *
 +	 * XXX Note that this is currently amd64-only -- if you extend
 +	 * this FPU save space allocation to i386, make sure to update
 +	 * fpu_lwp_fork so it doesn't clobber the pcb_savefpu pointer.
 +	 * We panic on i386 boot if the CPU needs more space in
 +	 * fpuinit_mxcsr_mask.
 +	 */
 +	pcb =3D (void *)base;
 +	if (system) {					/* (1) */
 +		pcb->pcb_savefpu =3D NULL;
 +	} else if (x86_fpu_save_separate_p()) {		/* (2) */
 +		__CTASSERT(PAGE_SIZE >=3D 64);
 +		/* No need to zero -- caller will initialize. */
 +		va =3D uvm_km_alloc(kernel_map, x86_fpu_save_size, PAGE_SIZE,
 +		    UVM_KMF_WIRED|UVM_KMF_WAITVA);
 +		pcb->pcb_savefpu =3D (void *)va;
 +	} else {					/* (3) */
 +		pcb->pcb_savefpu =3D &pcb->pcb_savefpusmall;
 +	}
 +
  	/* Page[1] =3D RedZone */
  	va =3D base + PAGE_SIZE;
  	if (!pmap_extract(pmap_kernel(), va, &pa)) {
 @@ -394,8 +426,20 @@ cpu_uarea_alloc(bool system)
  bool
  cpu_uarea_free(void *addr)
  {
 +	const struct pcb *const pcb =3D addr;
  	vaddr_t base =3D (vaddr_t)addr;
 =20
 +	/*
 +	 * If we allocated a separate FPU save area, free it.
 +	 */
 +	if (pcb->pcb_savefpu !=3D NULL &&
 +	    pcb->pcb_savefpu !=3D &pcb->pcb_savefpusmall) {
 +		KASSERTMSG(x86_fpu_save_separate_p(), "pcb=3D%p pcb_savefpu=3D%p",
 +		    pcb, pcb->pcb_savefpu);
 +		uvm_km_free(kernel_map, (vaddr_t)pcb->pcb_savefpu,
 +		    x86_fpu_save_size, UVM_KMF_WIRED);
 +	}
 +
  	KASSERT(!pmap_extract(pmap_kernel(), base + PAGE_SIZE, NULL));
  	KASSERT(!pmap_extract(pmap_kernel(), base + USPACE, NULL));
  	uvm_km_free(kernel_map, base, USPACE + PAGE_SIZE, UVM_KMF_WIRED);

 --=_9iRVuQui7SP0XI0my0i42KB+RbbJU+ZS--

From: Harold Gutch <logix@foobar.franken.de>
To: Taylor R Campbell <riastradh@NetBSD.org>
Cc: gnats-bugs@netbsd.org, port-amd64-maintainer@netbsd.org,
        gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, manu@NetBSD.org,
        andrew.cagney@gmail.com
Subject: Re: port-amd64/57661: Crash when booting on Xeon Silver 4416+ in KVM/Qemu
Date: Sat, 12 Apr 2025 10:48:53 +0200

 On Fri, Apr 11, 2025 at 09:09:44PM +0000, Taylor R Campbell wrote:
 > Thanks, can you try with this patch?

 Looks good with this one too.  I haven't done extensive testing, but I
 reach multiuser and nothing seems out of the ordinary.


   Harold

From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57661 CVS commit: src/sys/arch/amd64/include
Date: Thu, 24 Apr 2025 01:49:30 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Thu Apr 24 01:49:30 UTC 2025

 Modified Files:
 	src/sys/arch/amd64/include: param.h

 Log Message:
 amd64/param.h: Make UPAGES definition clearer.

 Break it down into subaccounts that are summed at the end to make it
 clear how much each part is using, and how many pages are actually
 reserved for the stack.

 No functional change intended.

 PR port-amd64/57661: Crash when booting on Xeon Silver 4416+ in
 KVM/Qemu


 To generate a diff of this commit:
 cvs rdiff -u -r1.38 -r1.39 src/sys/arch/amd64/include/param.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57661 CVS commit: src/sys/arch
Date: Thu, 24 Apr 2025 01:50:40 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Thu Apr 24 01:50:39 UTC 2025

 Modified Files:
 	src/sys/arch/amd64/amd64: genassym.cf
 	src/sys/arch/amd64/include: pcb.h
 	src/sys/arch/i386/i386: genassym.cf
 	src/sys/arch/i386/include: pcb.h
 	src/sys/arch/x86/include: cpu.h cpu_extended_state.h
 	src/sys/arch/x86/x86: fpu.c vm_machdep.c

 Log Message:
 amd64: Allocate FPU save state outside pcb if it's too large.

 We have seen x86_fpu_save_size values (CPUID[EAX=0x0d, ECX=0].ECX) as
 large as 11008 bytes, notably with Intel AMX TILEDATA's 8192-byte
 state.

 We only do this for user threads, and only on machines where it's
 necessary, to avoid incurring much overhead.  There is still a tiny
 bit of overhead when saving and restoring the FPU state by using a
 pointer indirection instead of arithmetic indirection for access to
 struct pcb::pcb_savefpu, but this is probably a drop in the bucket
 compared to the memory traffic incurred by the FPU state save/restore
 anyway.

 For now, these paths are mostly disabled on i386.  We could enable
 them but it will require either rewriting cpu_uarea_alloc/free for
 i386, or adopting a guard page like amd64 does, which might be costly
 and so should be undertaken only with some thought and care.  And
 since Intel AMX instructions only work in 64-bit mode, it's not
 likely to be useful on i386.

 PR port-amd64/57661: Crash when booting on Xeon Silver 4416+ in
 KVM/Qemu

 These changes, as a side effect, may fix:

 PR kern/57258: kthread_fpu_enter/exit problem

 by making sure to allocate an FPU save space that is large enough to
 guarantee fpu_kern_enter/leave work safely, instead of just using a
 union savefpu object on the stack (which, at 576 bytes, may be too
 small on some machines, particularly with AVX512 requiring ~2.5K).
 (But we'll have to do some extra work with kthread_fpu_enter/exit_md
 -- if we try doing them again on x86 -- to actually allocate the
 separate pcb on these machines!)


 To generate a diff of this commit:
 cvs rdiff -u -r1.98 -r1.99 src/sys/arch/amd64/amd64/genassym.cf
 cvs rdiff -u -r1.32 -r1.33 src/sys/arch/amd64/include/pcb.h
 cvs rdiff -u -r1.136 -r1.137 src/sys/arch/i386/i386/genassym.cf
 cvs rdiff -u -r1.59 -r1.60 src/sys/arch/i386/include/pcb.h
 cvs rdiff -u -r1.139 -r1.140 src/sys/arch/x86/include/cpu.h
 cvs rdiff -u -r1.18 -r1.19 src/sys/arch/x86/include/cpu_extended_state.h
 cvs rdiff -u -r1.89 -r1.90 src/sys/arch/x86/x86/fpu.c
 cvs rdiff -u -r1.46 -r1.47 src/sys/arch/x86/x86/vm_machdep.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57661 CVS commit: src/sys/arch/x86/include
Date: Thu, 24 Apr 2025 01:51:08 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Thu Apr 24 01:51:08 UTC 2025

 Modified Files:
 	src/sys/arch/x86/include: specialreg.h

 Log Message:
 x86: Add some more XCR0 bits and references.

 PR port-amd64/57661: Crash when booting on Xeon Silver 4416+ in
 KVM/Qemu


 To generate a diff of this commit:
 cvs rdiff -u -r1.216 -r1.217 src/sys/arch/x86/include/specialreg.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57661 CVS commit: src/sys/arch/x86/x86
Date: Thu, 24 Apr 2025 01:51:21 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Thu Apr 24 01:51:21 UTC 2025

 Modified Files:
 	src/sys/arch/x86/x86: identcpu.c

 Log Message:
 x86: Reserve space for only the extended CPU state we will use.

 CPUID[EAX=0x0d, ECX=0].ECX, i.e., the value of descs[2] after
 x86_cpuid2(0x0d, 0, descs), gives the size in bytes of the extended CPU
 state for all features supported by the hardware in CPUID[EAX=0x0d,
 ECX=0].EAX which can be enabled in XCR0.  However, on i386, it is
 senseless to leave TILECFG and TILEDATA enabled, because they are state
 for Intel AMX instructions which work only in 64-bit mode.

 So, instead of querying the hardware's supported features and maximum
 _supported_ extended CPU state size:

 1. Query the hardware's supported features.
 2. Enable only those supported by software as well (XCR0_FPU).
 3. Query the hardware's maximum _enabled_ extended CPU state size.

 We will also disable TILECFG and TILEDATA on amd64 for now too
 because:

 (a) This is not a regression, at least for TILEDATA (and I'm not sure
     any machines ship with TILECFG but not TILEDATA), because the
     size overflowed the PCB page and therefore never worked on amd64
     (PR port-amd64/57661: Crash when booting on Xeon Silver 4416+ in
     KVM/Qemu).

 (b) We need a little extra work to properly support reading and
     writing a process's TILECFG and TILEDATA in ptrace(2), and that
     work hasn't been done yet.

 While here, write out x86_cpuid2(0x0d, <ecx>, ...) explicitly, rather
 than x86_cpuid(0x0d, ...), to make it clear that ECX must be set --
 otherwise we may get garbage.  (It is, perhaps, an accident that
 x86_cpuid(<eax>, ...) always sets ECX=0, but other CPUID access
 paths, like gcc's <cpuid.h> __cpuid(<eax>, ...), do not, so let's
 make it clear for the reader.)

 XXX When we enable TILECFG and TILEDATA in amd64, we should arrange
 to disable them in compat32 processes -- no sense in allocating extra
 space for state they can't use anyway, since the Intel AMX
 instructions work only in 64-bit mode.  However, selectively
 disabling them in some contexts might require hardware support for
 XFD, Extended Feature Disable, which is another kettle of fish to
 deal with.

 PR port-amd64/57661: Crash when booting on Xeon Silver 4416+ in
 KVM/Qemu


 To generate a diff of this commit:
 cvs rdiff -u -r1.134 -r1.135 src/sys/arch/x86/x86/identcpu.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57661 CVS commit: src/sys/arch/x86/include
Date: Thu, 24 Apr 2025 01:51:43 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Thu Apr 24 01:51:43 UTC 2025

 Modified Files:
 	src/sys/arch/x86/include: specialreg.h

 Log Message:
 amd64: Enable TILECFG and TILEDATA registers.

 This allows processes to use the registers, and NetBSD will save and
 restore them in context switches.  But it does not expose them to
 ptrace(2) or debuggers like all the other extended CPU state
 (xmm/ymm/zmm) -- that will require more work.

 PR kern/57661: Crash when booting on Xeon Silver 4416+ in KVM/Qemu
 PR port-amd64/59299: Support Intel AMX CPU state (TILECFG/TILEDATA)


 To generate a diff of this commit:
 cvs rdiff -u -r1.217 -r1.218 src/sys/arch/x86/include/specialreg.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57661 CVS commit: src/sys/arch/amd64/include
Date: Sun, 27 Apr 2025 01:32:09 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Sun Apr 27 01:32:09 UTC 2025

 Modified Files:
 	src/sys/arch/amd64/include: param.h

 Log Message:
 amd64/param.h: Fix KASAN/KMSAN build (and ALL build by extension).

 Make UPAGES match what it was before my recent changes, as verified
 by the __CTASSERT below, justifying the existence of the __CTASSERT.

 As I recall, SVS is incompatible with KASAN/KMSAN, so it doesn't
 contribute to the sum; presumably KASAN/KMSAN requires three pages,
 though I'm not sure where this is documented.  If it turns out this
 accounting is wrong, we should fix it and cross-reference any relevant
 constraints affecting the accounting.  (But for now I'm just making
 sure to restore the status quo of these definitions, which was my
 intent all along with adding the __CTASSERT.)

 Issue noted by hannken@.

 PR port-amd64/57661: Crash when booting on Xeon Silver 4416+ in
 KVM/Qemu


 To generate a diff of this commit:
 cvs rdiff -u -r1.41 -r1.42 src/sys/arch/amd64/include/param.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Emmanuel Dreyfus <manu@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: port-amd64-maintainer@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, logix@foobar.franken.de,
	riastradh@NetBSD.org, manu@NetBSD.org, andrew.cagney@gmail.com
Subject: Re: PR/57661 CVS commit: src/sys/arch/amd64/include
Date: Mon, 28 Apr 2025 11:56:42 +0000

 On Sun, Apr 27, 2025 at 01:35:01AM +0000, Taylor R Campbell wrote:
 > Th following reply was made to PR port-amd64/57661; it has been noted by GNATS.

 I tested -current after this batch of changes, the bug is fixed.

 cpu0: highest basic info 00000020
 cpu0: highest extended info 80000008
 cpu0: "Intel(R) Xeon(R) Silver 4410Y"
 cpu0: Intel 4th gen Xeon Scalable (Sapphire Rapids) (686-class), 2000.00 MHz
 cpu0: CPU base freq 2000000000 Hz
 cpu0: CPU max freq 3900000000 Hz
 cpu0: TSC freq CPUID 2000000000 Hz
 cpu0: family 0x6 model 0x8f stepping 0x8 (id 0x806f8)
 cpu0: WARNING: Currently this info can't print correctly (level = 1, pgsize = 6)
 cpu0: WARNING: Currently this info can't print correctly (level = 1, pgsize = 15)
 cpu0: WARNING: Currently this info can't print correctly (level = 1, pgsize = 6)
 cpu0: WARNING: Currently this info can't print correctly (level = 2, pgsize = 7)
 cpu0: WARNING: Currently this info can't print correctly (level = 2, pgsize = 9)
 cpu0: features 0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE>
 cpu0: features 0xbfebfbff<MCA,CMOV,PAT,PSE36,CLFSH,DS,ACPI,MMX,FXSR,SSE,SSE2>
 cpu0: features 0xbfebfbff<SS,HTT,TM,PBE>
 cpu0: features1 0x7ffefbff<SSE3,PCLMULQDQ,DTES64,MONITOR,DS-CPL,VMX,SMX,EST>
 cpu0: features1 0x7ffefbff<TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,DCA,SSE41>
 cpu0: features1 0x7ffefbff<SSE42,X2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE>
 cpu0: features1 0x7ffefbff<OSXSAVE,AVX,F16C,RDRAND>
 cpu0: features2 0x2c100800<SYSCALL/SYSRET,XD,P1GB,RDTSCP,EM64T>
 cpu0: features3 0x121<LAHF,LZCNT,PREFETCHW>
 cpu0: features5 0xf3bfbfff<FSGSBASE,TSCADJUST,SGX,BMI1,HLE,AVX2,FDPEXONLY,SMEP>
 cpu0: features5 0xf3bfbfff<BMI2,ERMS,INVPCID,RTM,QM,FPUCSDS,PQE,AVX512F>
 cpu0: features5 0xf3bfbfff<AVX512DQ,RDSEED,ADX,SMAP,AVX512_IFMA,CLFLUSHOPT>
 cpu0: features5 0xf3bfbfff<CLWB,PT,AVX512CD,SHA,AVX512BW,AVX512VL>
 cpu0: features6 0xfb417fee<AVX512_VBMI,UMIP,PKU,WAITPKG,AVX512_VBMI2,CET_SS>
 cpu0: features6 0xfb417fee<GFNI,VAES,VPCLMULQDQ,AVX512_VNNI,AVX512_BITALG>
 cpu0: features6 0xfb417fee<TME_EN,AVX512_VPOPCNTDQ,LA57,MAWAU=0,RDPID>
 cpu0: features6 0xfb417fee<BUS_LOCK_DETECT,CLDEMOTE,MOVDIRI,MOVDIR64B,ENQCMD>
 cpu0: features6 0xfb417fee<SGXLC,PKS>
 cpu0: features7 0xffdd4432<SGX_KEYS,FSRM,UINTR,MD_CLEAR,SERIALIZE,TSXLDTRK>
 cpu0: features7 0xffdd4432<PCONFIG,ARCH_LBR,CET_IBT,AMX_BF16,AVX512_FP16>
 cpu0: features7 0xffdd4432<AMX_TILE,AMX_INT8,IBRS,STIBP,L1D_FLUSH,ARCH_CAP>
 cpu0: features7 0xffdd4432<CORE_CAP,SSBD>
 cpu0: xsave features 0x602e7<x87,SSE,AVX,Opmask,ZMM_Hi256,Hi16_ZMM,PKRU>
 cpu0: xsave instructions 0x1f<XSAVEOPT,XSAVEC,XGETBV,XSAVES,XFD>
 cpu0: xsave area size: current 11008, maximum 11008, xgetbv enabled
 cpu0: enabled xsave 0x600e7<x87,SSE,AVX,Opmask,ZMM_Hi256,Hi16_ZMM>
 cpu0: I-cache: 32KB 64B/line 8-way, D-cache: 48KB 64B/line 12-way
 cpu0: L2 cache: 2MB 64B/line 16-way
 cpu0: L3 cache: 30MB 64B/line 15-way
 cpu0: 64B prefetching
 cpu0: ITLB: 256 4KB entries 8-way, 32 4MB entries 8-way
 cpu0: Load only TLB: 8 1GB entries fully associative
 cpu0: Store only TLB: 16 1GB entries fully associative
 cpu0: L2 STLB: 1024 1GB entries 8-way
 cpu0: Initial APIC ID 0
 cpu0: Cluster/Package ID 0
 cpu0: Core ID 0
 cpu0: SMT ID 0
 cpu0: MONITOR/MWAIT extensions 0x3<EMX,IBE>
 cpu0: monitor-line size 64
 cpu0: C1 substates 2
 cpu0: C3 substates 1
 cpu0: DSPM-eax 0x80077<DTS,IDA,ARAT,PLN,ECMD,PTM,HFI>
 cpu0: DSPM-ecx 0x9<HWF,EPB,NTDC=0>
 cpu0: SEF highest subleaf 00000002
 cpu0: SEF-subleaf1-eax 0x1c30<AVXVNNI,AVX512_BF16,FZLRMS,FSRSB,FSRCS>
 cpu0: SEF-subleaf1-edx 0x40000<CET_SSS>
 cpu0: SEF-subleaf2-edx 0x17<PSFD,IPRED_CTRL,RRSBA_CTRL,BHI_CTRL>
 cpu0: Power Management features: 0x100<ITSC>
 cpu0: AMD Extended features 0x200<WBNOINVD>
 cpu0: Perfmon: Ver. 5
 cpu0: Perfmon: General: bitwidth 48, 8 counters
 cpu0: Perfmon: General: avail 0xff<CORECYCL,INST,REFCYCL,LLCREF,LLCMISS,BRINST>
 cpu0: Perfmon: General: avail 0xff<BRMISPR,TOPDOWNSLOT>
 cpu0: Perfmon: Fixed: bitwidth 48, 4 counters
 cpu0: Perfmon: Fixed: avail 0xf<INST,CLK_CORETHREAD,CLK_REF_TSC,TOPDOWNSLOT>
 cpu0: microcode version 0x2b0005d2, platform ID 0

 -- 
 Emmanuel Dreyfus
 manu@netbsd.org

From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57661 CVS commit: src/sys/arch
Date: Mon, 28 Apr 2025 13:01:28 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Mon Apr 28 13:01:28 UTC 2025

 Modified Files:
 	src/sys/arch/amd64/include: pcb.h
 	src/sys/arch/x86/include: specialreg.h
 	src/sys/arch/x86/x86: fpu.c

 Log Message:
 xen: Stop-gap FPU PCB fix; disable Intel AMX for now.

 Since the custom cpu_uarea_alloc/free are disabled under XENPV,
 nothing would initialize struct pcb::pcb_savefpu to point either to
 struct pcb::pcb_savefpusmall, or to a separately allocated large area
 on machines with Intel AMX TILECFG/TILEDATA requiring it.  So the
 memset in fpu_lwp_fork would crash on null pointer dereference:

 [   1.0000030] uvm_fault(0xffffffff8094a300, 0x0, 2) -> e
 [   1.0000030] fatal page fault in supervisor mode
 [   1.0000030] trap type 6 code 0x2 rip 0xffffffff8062795c cs 0xe030 rflags 0x10202 cr2 0 ilevel 0 rsp 0xffffffff80adad38
 [   1.0000030] curlwp 0xffffffff8078f880 pid 0.0 lowest kstack 0xffffffff80ad62c0
 kernel: page fault trap, code=0
 Stopped in pid 0.0 (system) at  netbsd:memset+0x2c:     repe stosq      %es:(%rdi)
 memset() at netbsd:memset+0x2c
 lwp_create() at netbsd:lwp_create+0x2f1
 fork1() at netbsd:fork1+0x42c
 main() at netbsd:main+0x44f

 In order to support Intel AMX TILECFG/TILEDATA, or any other CPU
 extensions that increase the XSAVE area beyond what fits in a single
 page after struct pcb, we would need to enable the the custom
 cpu_uarea_alloc/free.  Currently that would imply allocating stack
 guard pages (`redzone') under XENPV; if there's some reason the stack
 guard pages don't work, we could also push #ifdef XENPV conditionals
 into cpu_uarea_alloc/free to cover the guard pages -- to be
 considered.

 PR kern/59371: Xen domU uvm_fault since FPU state allocation patch

 PR port-amd64/57661: Crash when booting on Xeon Silver 4416+ in
 KVM/Qemu


 To generate a diff of this commit:
 cvs rdiff -u -r1.34 -r1.35 src/sys/arch/amd64/include/pcb.h
 cvs rdiff -u -r1.218 -r1.219 src/sys/arch/x86/include/specialreg.h
 cvs rdiff -u -r1.90 -r1.91 src/sys/arch/x86/x86/fpu.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->needs-pullups
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Wed, 14 May 2025 01:13:07 +0000
State-Changed-Why:
fixed in HEAD, needs pullup-9 and pullup-10

There are several changes here, with different levels of risk.

x86/identcpu rev. 1.135 (x86: Reserve space for only the extended CPU
state we will use.) is probably the lowest risk to pull up, and will
likely fix this symptom on its own.  But it's not enough to fix the
problem I hypothesize underlies PR kern/57258: kthread_fpu_enter/exit
problem <https://gnats.NetBSD.org/57258>, which is that on machines
with AVX-512 -- or any >576-byte xsave area size -- the logic in
fpu_kern_enter/leave uses a static 576-byte union savefpu buffer to
initialize a >576-byte xsave area.

That problem is fixed by x86/fpu.c rev. 1.90, among a collection of
other higher-risk changes (amd64: Allocate FPU save state outside pcb
if it's too large.).  So, we might have to split that change up (and
test it) to make a lower-risk fix for the fpu_kern_enter/leave issue,
say for pullup-9 -- just the part to allocate safe_fpu and zero_fpu in
fpu.c separately if too large, not the parts to support
TILECFG/TILEDATA by separately allocating an fpu save area for each pcb
on affected machines.


From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57661 CVS commit: src/sys/arch/x86/x86
Date: Wed, 14 May 2025 23:39:01 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Wed May 14 23:39:01 UTC 2025

 Modified Files:
 	src/sys/arch/x86/x86: fpu.c

 Log Message:
 x86/fpu.c: Revert unnecessary part of rev. 1.90.

 Most of the changes including x86/fpu.c rev. 1.90 served to support a
 separately allocated XSAVE area in each thread's pcb.  On machines
 with large XSAVE areas, the fpu.c changes also:

 1. allocated a separate area for fpuinit_mxcsr_mask, and

 2. allocated a separate area for the safe/zero states of
    fpu_kern_enter/leave.

 But (1) is unnecessary because we uncondiitonally use FXSAVE for
 that, for which union savefpu is enough -- actually, struct fxsave
 (exactly 512 bytes) is enough.  To be done in a separate change.

 PR port-amd64/57661: Crash when booting on Xeon Silver 4416+ in
 KVM/Qemu


 To generate a diff of this commit:
 cvs rdiff -u -r1.91 -r1.92 src/sys/arch/x86/x86/fpu.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57661 CVS commit: src/sys/arch/x86/x86
Date: Wed, 14 May 2025 23:39:54 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Wed May 14 23:39:54 UTC 2025

 Modified Files:
 	src/sys/arch/x86/x86: fpu.c

 Log Message:
 x86/fpu.c: Reduce needless stack usage of fpuinit_mxcsr_mask.

 This is using the FXSAVE instruction explicitly, which is defined to
 operate on a 512-byte area, unconditionally, so let's just use the
 struct fxsave type we've defined for that instead of the larger union
 savefpu.

 XXX The fxsave/fxrstor functions should really be typed, but that
 requires moving includes or definitions around between cpufunc.h and
 cpu_extende_state.h which might have userland API implications that
 I'd rather not think about right now.

 PR port-amd64/57661: Crash when booting on Xeon Silver 4416+ in
 KVM/Qemu


 To generate a diff of this commit:
 cvs rdiff -u -r1.92 -r1.93 src/sys/arch/x86/x86/fpu.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Emmanuel Dreyfus <manu@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: port-amd64-maintainer@netbsd.org, netbsd-bugs@netbsd.org,
	gnats-admin@netbsd.org, riastradh@NetBSD.org,
	logix@foobar.franken.de, manu@NetBSD.org, andrew.cagney@gmail.com
Subject: Re: port-amd64/57661 (Crash when booting on Xeon Silver 4416+ in
 KVM/Qemu)
Date: Thu, 15 May 2025 06:07:30 +0000

 On Wed, May 14, 2025 at 01:13:08AM +0000, riastradh@NetBSD.org wrote:
 > There are several changes here, with different levels of risk.

 FWIW, I have been able to boot with this change set:
 https://mail-index.netbsd.org/source-changes/2025/04/24/msg156552.html
 https://mail-index.netbsd.org/source-changes/2025/04/24/msg156553.html
 https://mail-index.netbsd.org/source-changes/2025/04/24/msg156554.html
 https://mail-index.netbsd.org/source-changes/2025/04/24/msg156555.html
 https://mail-index.netbsd.org/source-changes/2025/04/24/msg156556.html
 https://mail-index.netbsd.org/source-changes/2025/04/28/msg156672.html


 -- 
 Emmanuel Dreyfus
 manu@netbsd.org

State-Changed-From-To: needs-pullups->pending-pullups
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Thu, 15 May 2025 12:45:55 +0000
State-Changed-Why:
Fixed in HEAD, and I have requested the following pullups for minimal
parts of the fix:

pullup-9 #1954 https://releng.netbsd.org/cgi-bin/req-9.cgi?show=1954
=> Just the change to x86/identcpu.c that limits XCR0 to the NetBSD-
   supported features before we query CPUID for the state size.

pullup-10 #1118 https://releng.netbsd.org/cgi-bin/req-10.cgi?show=1118
pullup-10 #1119 https://releng.netbsd.org/cgi-bin/req-10.cgi?show=1119
=> x86/identcpu.c change as for netbsd-9, and x86/fpu.c change to
   allocate adequate space for safe/zero FPU states if the save size
   exceeds 576 bytes (union savefpu)

I think this should be enough to fix the crashes on both branches,
without changing anything about how pcbs are allocated or adding support
for Intel AMX TILECFG/TILEDATA, but it would be good to test too on an
affected machine.


From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57661 CVS commit: [netbsd-10] src/sys/arch/x86/x86
Date: Thu, 15 May 2025 18:06:57 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Thu May 15 18:06:57 UTC 2025

 Modified Files:
 	src/sys/arch/x86/x86 [netbsd-10]: identcpu.c

 Log Message:
 Pull up following revision(s) (requested by riastradh in ticket #1118):

 	sys/arch/x86/x86/identcpu.c: revision 1.135

 x86: Reserve space for only the extended CPU state we will use.
 CPUID[EAX=0x0d, ECX=0].ECX, i.e., the value of descs[2] after
 x86_cpuid2(0x0d, 0, descs), gives the size in bytes of the extended CPU
 state for all features supported by the hardware in CPUID[EAX=0x0d,
 ECX=0].EAX which can be enabled in XCR0.  However, on i386, it is
 senseless to leave TILECFG and TILEDATA enabled, because they are state
 for Intel AMX instructions which work only in 64-bit mode.

 So, instead of querying the hardware's supported features and maximum
 _supported_ extended CPU state size:
 1. Query the hardware's supported features.
 2. Enable only those supported by software as well (XCR0_FPU).
 3. Query the hardware's maximum _enabled_ extended CPU state size.

 We will also disable TILECFG and TILEDATA on amd64 for now too
 because:
 (a) This is not a regression, at least for TILEDATA (and I'm not sure
     any machines ship with TILECFG but not TILEDATA), because the
     size overflowed the PCB page and therefore never worked on amd64
     (PR port-amd64/57661: Crash when booting on Xeon Silver 4416+ in
     KVM/Qemu).
 (b) We need a little extra work to properly support reading and
     writing a process's TILECFG and TILEDATA in ptrace(2), and that
     work hasn't been done yet.

 While here, write out x86_cpuid2(0x0d, <ecx>, ...) explicitly, rather
 than x86_cpuid(0x0d, ...), to make it clear that ECX must be set --
 otherwise we may get garbage.  (It is, perhaps, an accident that
 x86_cpuid(<eax>, ...) always sets ECX=0, but other CPUID access
 paths, like gcc's <cpuid.h> __cpuid(<eax>, ...), do not, so let's
 make it clear for the reader.)

 XXX When we enable TILECFG and TILEDATA in amd64, we should arrange
 to disable them in compat32 processes -- no sense in allocating extra
 space for state they can't use anyway, since the Intel AMX
 instructions work only in 64-bit mode.  However, selectively
 disabling them in some contexts might require hardware support for
 XFD, Extended Feature Disable, which is another kettle of fish to
 deal with.

 PR port-amd64/57661: Crash when booting on Xeon Silver 4416+ in
 KVM/Qemu


 To generate a diff of this commit:
 cvs rdiff -u -r1.123.4.3 -r1.123.4.4 src/sys/arch/x86/x86/identcpu.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57661 CVS commit: [netbsd-9] src/sys/arch/x86/x86
Date: Thu, 15 May 2025 18:09:45 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Thu May 15 18:09:45 UTC 2025

 Modified Files:
 	src/sys/arch/x86/x86 [netbsd-9]: identcpu.c

 Log Message:
 Pull up following revision(s) (requested by riastradh in ticket #1954):

 	sys/arch/x86/x86/identcpu.c: revision 1.135 (via patch)

 x86: Reserve space for only the extended CPU state we will use.
 CPUID[EAX=0x0d, ECX=0].ECX, i.e., the value of descs[2] after
 x86_cpuid2(0x0d, 0, descs), gives the size in bytes of the extended CPU
 state for all features supported by the hardware in CPUID[EAX=0x0d,
 ECX=0].EAX which can be enabled in XCR0.  However, on i386, it is
 senseless to leave TILECFG and TILEDATA enabled, because they are state
 for Intel AMX instructions which work only in 64-bit mode.

 So, instead of querying the hardware's supported features and maximum
 _supported_ extended CPU state size:
 1. Query the hardware's supported features.
 2. Enable only those supported by software as well (XCR0_FPU).
 3. Query the hardware's maximum _enabled_ extended CPU state size.

 We will also disable TILECFG and TILEDATA on amd64 for now too
 because:
 (a) This is not a regression, at least for TILEDATA (and I'm not sure
     any machines ship with TILECFG but not TILEDATA), because the
     size overflowed the PCB page and therefore never worked on amd64
     (PR port-amd64/57661: Crash when booting on Xeon Silver 4416+ in
     KVM/Qemu).
 (b) We need a little extra work to properly support reading and
     writing a process's TILECFG and TILEDATA in ptrace(2), and that
     work hasn't been done yet.

 While here, write out x86_cpuid2(0x0d, <ecx>, ...) explicitly, rather
 than x86_cpuid(0x0d, ...), to make it clear that ECX must be set --
 otherwise we may get garbage.  (It is, perhaps, an accident that
 x86_cpuid(<eax>, ...) always sets ECX=0, but other CPUID access
 paths, like gcc's <cpuid.h> __cpuid(<eax>, ...), do not, so let's
 make it clear for the reader.)

 XXX When we enable TILECFG and TILEDATA in amd64, we should arrange
 to disable them in compat32 processes -- no sense in allocating extra
 space for state they can't use anyway, since the Intel AMX
 instructions work only in 64-bit mode.  However, selectively
 disabling them in some contexts might require hardware support for
 XFD, Extended Feature Disable, which is another kettle of fish to
 deal with.

 PR port-amd64/57661: Crash when booting on Xeon Silver 4416+ in
 KVM/Qemu


 To generate a diff of this commit:
 cvs rdiff -u -r1.93.2.7 -r1.93.2.8 src/sys/arch/x86/x86/identcpu.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: pending-pullups->closed
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Mon, 02 Feb 2026 03:44:26 +0000
State-Changed-Why:
pullups applied to all affected branches


From: "Thomas Klausner" <wiz@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57661 CVS commit: pkgsrc/lang/ghc910
Date: Mon, 2 Feb 2026 21:28:58 +0000

 Module Name:	pkgsrc
 Committed By:	wiz
 Date:		Mon Feb  2 21:28:58 UTC 2026

 Modified Files:
 	pkgsrc/lang/ghc910: Makefile distinfo
 	pkgsrc/lang/ghc910/patches: patch-libraries_text_cbits_measure__off.c

 Log Message:
 ghc910: remove AVX workaround for NetBSD 11+

 Problem has been fixed as part of PR 57661.

 Bump PKGREVISION.


 To generate a diff of this commit:
 cvs rdiff -u -r1.7 -r1.8 pkgsrc/lang/ghc910/Makefile
 cvs rdiff -u -r1.13 -r1.14 pkgsrc/lang/ghc910/distinfo
 cvs rdiff -u -r1.1 -r1.2 \
     pkgsrc/lang/ghc910/patches/patch-libraries_text_cbits_measure__off.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Thomas Klausner" <wiz@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57661 CVS commit: pkgsrc/lang/ghc98
Date: Mon, 2 Feb 2026 22:52:49 +0000

 Module Name:	pkgsrc
 Committed By:	wiz
 Date:		Mon Feb  2 22:52:49 UTC 2026

 Modified Files:
 	pkgsrc/lang/ghc98: Makefile distinfo
 	pkgsrc/lang/ghc98/patches: patch-libraries_text_cbits_measure__off.c

 Log Message:
 ghc98: remove AVX workaround for NetBSD 11+

 Problem has been fixed as part of PR 57661.

 Bump PKGREVISION.


 To generate a diff of this commit:
 cvs rdiff -u -r1.12 -r1.13 pkgsrc/lang/ghc98/Makefile
 cvs rdiff -u -r1.18 -r1.19 pkgsrc/lang/ghc98/distinfo
 cvs rdiff -u -r1.1 -r1.2 \
     pkgsrc/lang/ghc98/patches/patch-libraries_text_cbits_measure__off.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.47 2022/09/11 19:34:41 kim Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2026 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.