NetBSD Problem Report #57737

From h.fath@spg.tu-darmstadt.de  Fri Dec  1 11:15:17 2023
Return-Path: <h.fath@spg.tu-darmstadt.de>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id C7F641A923A
	for <gnats-bugs@gnats.NetBSD.org>; Fri,  1 Dec 2023 11:15:17 +0000 (UTC)
Message-Id: <202312011115.3B1BF61l025054@Gstoder.nt.e-technik.tu-darmstadt.de>
Date: Fri, 1 Dec 2023 12:15:06 +0100 (CET)
From: Hauke Fath <hf@spg.tu-darmstadt.de>
Reply-To: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@NetBSD.org
Cc: Hauke Fath <hf@spg.tu-darmstadt.de>
Subject: netbsd-10 panics on current Epyc CPU
X-Send-Pr-Version: 3.95

>Number:         57737
>Category:       kern
>Synopsis:       netbsd-10 panics on current Epyc CPU
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Dec 01 11:20:00 +0000 2023
>Last-Modified:  Wed Dec 13 08:25:01 +0000 2023
>Originator:     Hauke Fath
>Release:        NetBSD 10.0_RC1
>Organization:
Technische Universitaet Darmstadt
>Environment:


System: NetBSD 10.0_RC1 (GENERIC) #2: Thu Nov 30 13:28:34 CET 2023
Architecture: x86_64
Machine: amd64
>Description:

	netbsd-10 panics early on current multi-core Ryzen cpus.

	See the boot log for an Epyc 9554P cpu on a Gigabyte R263-Z70
	board at

	<ftp://oak.causeuse.org/pub/NetBSD/netbsd-10-GA_R263-Z70_epyc9554p.bootlog.gz>

	and the related discussion on current-users, where Martin
	suggested

	"That sounds like an fpu xsave size issue Taylor looked at
	recently (but it is not fixed)."


>How-To-Repeat:

	Boot netbsd-10 on a recent AMD Epyc system.


>Fix:
	Yes, please.  :)



>Audit-Trail:
From: matthew green <mrg@eterna23.net>
To: gnats-bugs@netbsd.org, hf@spg.tu-darmstadt.de
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org
Subject: re: kern/57737: netbsd-10 panics on current Epyc CPU
Date: Wed, 13 Dec 2023 19:20:51 +1100

 > 	netbsd-10 panics early on current multi-core Ryzen cpus.
 >
 > 	See the boot log for an Epyc 9554P cpu on a Gigabyte R263-Z70
 > 	board at
 >
 > 	<ftp://oak.causeuse.org/pub/NetBSD/netbsd-10-GA_R263-Z70_epyc9554p.bootlog.gz>
 >
 > 	and the related discussion on current-users, where Martin
 > 	suggested
 >
 > 	"That sounds like an fpu xsave size issue Taylor looked at
 > 	recently (but it is not fixed)."

 there are multiple issues with this system, ouch.

 no CPUs attach in this dmesg.  cpu0 remains half-attached.  this
 is some problem with the MADT parser i guess (i don't know this
 very well.)

 [   1.0000040] bogus MADT X2APIC entry (id = 0x0)
 [   1.0000040] bogus MADT X2APIC entry (id = 0x2)
 ...
 [   1.0000040] bogus MADT X2APIC entry (id = 0x7e)
 ...
 [   1.0000040] bogus MADT X2APIC entry (id = 0x5e)
 [   1.0000040] bogus MADT X2APIC entry (id = 0x1)
 ...
 [   1.0000040] bogus MADT X2APIC entry (id = 0x7f)
 ...
 [   1.0000040] bogus MADT X2APIC entry (id = 0x5f)

 ie, 128 cpu threads fail to attach (which matches the specs for
 epyc 9554p - 64c/128t.)  some devices still attach things to
 cpu0 for affinity, even though it's in UP mode:

 [   1.0525126] nvme0: for io queue 1 interrupting at msix0 vec 1 affinity to cpu0
 ... plus nvme1/2/3.

 some of the dmesg items seem to have 'nul' chars in them:

 [   1.0000040] ACPI: XSDT 0x00000000A4E13728 000^@0DC (v01 GBT   BTUACPI 03042021 AMI  01000013)

 [   1.0525126] AMD 19h/1xh RCEC (Root Complex^@ Event Collectosystem) at pci0 dev 0 function 3 not configured

 and then the final crash as reported in this PR:

 [   1.0525126] fatal privileged instruction fault in supervisor mode
 [   1.0525126] trap type 0 code 0 rip 0xffffffff8023c24e cs 0x8 rflags 0x10256 cr2 0 ilevel 0x6 rsp 0xffffffff81d3bab8
 [   1.0525126] curlwp 0xffffffff8188ac00 pid 0.0 lowest kstack 0xffffffff81d362c0
 kernel: privileged instruction fault trap, code=0
 Stopped in pid 0.0 (system) at  netbsd:xrstor+0xa:      fxsavel
 xrstor() at netbsd:xrstor+0xa
 aes_selftest() at netbsd:aes_selftest+0x26
 aes_modcmd() at netbsd:aes_modcmd+0xe9
 module_do_builtin() at netbsd:module_do_builtin+0x142
 module_do_builtin() at netbsd:module_do_builtin+0xfa
 module_init_class() at netbsd:module_init_class+0x142


 .mrg.
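
The count of rejected MADT entries quoted above (128, matching 64c/128t) can be re-checked against the full boot log with a short script. This is a minimal sketch, assuming the log has been saved as plain text. The function names `bogus_x2apic_ids` and `summarize` are hypothetical helpers, not part of NetBSD, and the tolerance for a leftover quoted-printable `=3D` is an assumption about how the log may have been mangled in transit.

```python
import re

# Matches the kernel message quoted in this PR, for example:
#   [   1.0000040] bogus MADT X2APIC entry (id = 0x7e)
# The optional "3D" tolerates quoted-printable residue ("=3D") in mailed logs.
BOGUS_X2APIC = re.compile(
    r"bogus MADT X2APIC entry \(id\s*=(?:3D)?\s*0x([0-9a-fA-F]+)\)"
)


def bogus_x2apic_ids(dmesg_text):
    """Return the sorted, de-duplicated X2APIC ids reported as bogus."""
    return sorted({int(m.group(1), 16) for m in BOGUS_X2APIC.finditer(dmesg_text)})


def summarize(dmesg_text, expected_threads):
    """Compare the number of rejected ids with the expected thread count."""
    ids = bogus_x2apic_ids(dmesg_text)
    return {
        "rejected": len(ids),
        "expected": expected_threads,
        "all_rejected": len(ids) == expected_threads,
    }


if __name__ == "__main__":
    # Only the lines visible in this PR; the real log has many more.
    excerpt = "\n".join(
        f"[   1.0000040] bogus MADT X2APIC entry (id = {i})"
        for i in ("0x0", "0x2", "0x7e", "0x5e", "0x1", "0x7f", "0x5f")
    )
    print(bogus_x2apic_ids(excerpt))
    print(summarize(excerpt, expected_threads=128))
```

Run against the complete boot log, `all_rejected` being true would confirm that every hardware thread was refused at the MADT stage, as opposed to only a subset.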

>Unformatted:
