NetBSD Problem Report #58118

From brad@anduin.eldar.org  Fri Apr  5 17:29:03 2024
Return-Path: <brad@anduin.eldar.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 84AEC1A9239
	for <gnats-bugs@gnats.NetBSD.org>; Fri,  5 Apr 2024 17:29:03 +0000 (UTC)
Message-Id: <202404051728.435HSwZY012750@anduin.eldar.org>
Date: Fri, 5 Apr 2024 13:28:58 -0400 (EDT)
From: brad@anduin.eldar.org
Reply-To: brad@anduin.eldar.org
To: gnats-bugs@NetBSD.org
Subject: NetBSD/i386 Xen PV guest panic "fpudna from userland"
X-Send-Pr-Version: 3.95

>Number:         58118
>Category:       kern
>Synopsis:       NetBSD/i386 Xen PV guest panic "fpudna from userland"
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Apr 05 17:30:01 +0000 2024
>Last-Modified:  Fri Apr 05 22:50:01 +0000 2024
>Originator:     brad@anduin.eldar.org
>Release:        NetBSD 10.0
>Organization:
	Eldar.org
>Environment:
System: NetBSD meriadoc.nat.eldar.org 10.0 NetBSD 10.0 (XEN3PAE_DOMU) #0: Sat Mar 30 22:44:50 EDT 2024  brad@samwise.nat.eldar.org:/lhome/NetBSD_10_branch_20240328/i386/OBJ/sys/arch/i386/compile/XEN3PAE_DOMU i386
Architecture: x86_64
Machine: i386
>Description:

An i386 Xen PV+PVSHIM guest, newly updated to the 10.0 release
(self-built release, using the XEN3_DOMU i386 PAE kernel), is
panicking with the following:

[ 89%] Building CXX object Tests/CMakeLib/CMakeFiles/CMakeLibTests.dir/testRange.cxx.o
[ 89%] Building CXX object Tests/CMakeLib/CMakeFiles/CMakeLibTests.dir/testOptional.cxx.o
[ 89%] Building CXX object Tests/CMakeLib/CMakeFiles/CMakeLibTests.dir/testString.cxx.o
[ 89%] Building CXX object Tests/CMakeLib/CMakeFiles/CMakeLibTests.dir/testStringAlgorithms.cxx.o
[ 89%] Building CXX object Tests/CMakeLib/CMakeFiles/CMakeLibTests.dir/testSystemTools.cxx.o
[ 19585.9365064] panic: fpudna from userland, ip 0xbbe74f, trapframe 0xdefc3fa8
[ 19585.9365064] cpu0: Begin traceback...
[ 19585.9365064] vpanic(c0554d08,defc3f8c,defc3f9c,c01322bb,c0554d08,c0554d2c,bbe74f,defc3fa8,1a61000,bf7fcdcc) at netbsd:vpanic+0x18e
[ 19585.9365064] panic(c0554d08,c0554d2c,bbe74f,defc3fa8,1a61000,bf7fcdcc,c0102f9e,defc3fa8,bb4900b3,bf7f00ab) at netbsd:panic+0x18
[ 19585.9365064] fpudna(defc3fa8,bb4900b3,bf7f00ab,bf7f001f,bf7f001f,b9e08b80,b8574d40,bf7fcdcc,1a61000,0) at netbsd:fpudna+0x3b
[ 19585.9365064] cpu0: End traceback...

[ 19585.9365064] dumping to dev 142,9 offset 8
[ 19585.9365064] dump uvm_fault(0xc07423e0, 0xfe4ef000, 2) -> 0xe
[ 19585.9365064] fatal page fault in supervisor mode
[ 19585.9365064] trap type 6 code 0x2 eip 0xc012be9d cs 0x9 eflags 0x10202 cr2 0xfe4effff ilevel 0x8 esp 0xc0614b00
[ 19585.9365064] curlwp 0xc749a680 pid 12729 lid 12729 lowest kstack 0xdefc22c0
[ 19585.9365064] Skipping crash dump on recursive panic
[ 19585.9365064] panic: trap
[ 19585.9365064] cpu0: Begin traceback...
[ 19585.9365064] vpanic(c0554a1f,defc3dbc,defc3e78,c0130c71,c0554a1f,defc3e84,defc3e84,31b9,defc22c0,10202) at netbsd:vpanic+0x18e
[ 19585.9365064] panic(c0554a1f,defc3e84,defc3e84,31b9,defc22c0,10202,fe4effff,8,c0614b00,fe4effff) at netbsd:panic+0x18
[ 19585.9365064] trap() at netbsd:trap+0xcc1
[ 19585.9365064] --- trap (number 6) ---
[ 19585.9365064] dodumpsys(b9e08b80,104,0,c012e92d,8,0,5,0,0,1) at netbsd:dodumpsys+0x44d
[ 19585.9365064] dumpsys(104,0,c749a680,c0554cb6,5,defc3f70,c040235c,104,0,0) at netbsd:dumpsys+0x14
[ 19585.9365064] kern_reboot(104,0,0,0,c0747d00,c056b196,b8574d40,defc3f80,c0402418,c0554d08) at netbsd:kern_reboot+0x78
[ 19585.9365064] vpanic(c0554d08,defc3f8c,defc3f9c,c01322bb,c0554d08,c0554d2c,bbe74f,defc3fa8,1a61000,bf7fcdcc) at netbsd:vpanic+0x19c
[ 19585.9365064] panic(c0554d08,c0554d2c,bbe74f,defc3fa8,1a61000,bf7fcdcc,c0102f9e,defc3fa8,bb4900b3,bf7f00ab) at netbsd:panic+0x18
[ 19585.9365064] fpudna(defc3fa8,bb4900b3,bf7f00ab,bf7f001f,bf7f001f,b9e08b80,b8574d40,bf7fcdcc,1a61000,0) at netbsd:fpudna+0x3b
[ 19585.9365064] cpu0: End traceback...
[ 19585.9365064] rebooting...
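
The second panic above (trap type 6, code 0x2, during the crash dump) is a page fault taken while dumping, separate from the original fpudna panic. As a minimal sketch, the error code can be decoded with the architectural x86 page-fault bit layout; the helper name `decode_pf_error` is hypothetical and only the low three bits are handled:

```python
# Decode the low three bits of an x86 page-fault error code.
# Each entry: (mask, meaning when set, meaning when clear).
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "supervisor mode"),
]

def decode_pf_error(code):
    """Return one description per decoded bit, lowest bit first."""
    return [when_set if code & mask else when_clear
            for mask, when_set, when_clear in PF_BITS]

# Error code taken from the "trap type 6 code 0x2" line in the log.
print(decode_pf_error(0x2))
```

For code 0x2 this yields a write to a not-present page in supervisor mode, which agrees with the "fatal page fault in supervisor mode" line and the faulting address in cr2.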

The system is a Xen NetBSD/i386 build guest and was compiling some
pkgsrc 2024Q1 packages, in particular, working on cmake on the way to
doing emacs 29 (I believe).  This was the second time that this panic
was noted.  The first time was early in the morning, perhaps during
the daily cron runs.

The guest has 1 vcpu and is running in PV+PVSHIM mode with 4GB of
memory.  When the guest was a 9.x system, it ran fine.  The system is
running a self built 10.0 release from 2024-03-28.

>How-To-Repeat:

Not completely sure...  the system had built quite a number of
packages before the panic, so it might be uptime related (i.e. a
memory leak).  Cmake does require quite a bit of resources, so it
could be resource related.  But it is probably something else.

If someone finds it useful, I can set the system to enter DDB on panic
and poke around (with a recipe of instructions, optimally).

I fully expect that it will panic again once I restart the builds.

(BTW - the build was restarted and it made it past the point in cmake,
so I think that it can be said that cmake can't reproduce this on its
own)

[A variable that can't be tried as a workaround with i386 guests would
be to run them in pure PVH mode with the GENERIC kernel.  The guest
will boot but in my experience will hang if there is significant disk
activity (untarring a set or two will usually trip it for me).  I
opened a PR about that topic some time ago.  I may try this again
anyway, if I get another panic soon.]

The DOM0 that this guest is running on is Xen 4.15 using NetBSD
9.3_STABLE/amd64.

>Fix:

Don't know, but I hope a fix will come along as I have several Xen
i386 guests that are important to me and it would be great if they
could run 10.x.

>Audit-Trail:
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/58118: NetBSD/i386 Xen PV guest panic "fpudna from userland"
Date: Fri, 5 Apr 2024 22:43:46 +0200

 On Fri, Apr 05, 2024 at 05:30:01PM +0000, brad@anduin.eldar.org wrote:
 > >Number:         58118
 > >Category:       kern
 > >Synopsis:       NetBSD/i386 Xen PV guest panic "fpudna from userland"

 FWIW I've seen this from time to time running anita tests:
 http://largo.lip6.fr/~bouyer/NetBSD-tests/xen/

 I'm seeing FPU issues for PVH and HVM guests too, and for both i386 and amd64.

 This happens since the lazy FPU context switching has been removed.

 I tried to track it down but failed. The issue may be in Xen itself.

 I just upgraded the test box to Xen 4.18, maybe this will help ...

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Brad Spencer <brad@anduin.eldar.org>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
        netbsd-bugs@netbsd.org
Subject: Re: kern/58118: NetBSD/i386 Xen PV guest panic "fpudna from userland"
Date: Fri, 05 Apr 2024 18:47:38 -0400

 Manuel Bouyer <bouyer@antioche.eu.org> writes:

 > On Fri, Apr 05, 2024 at 05:30:01PM +0000, brad@anduin.eldar.org wrote:
 >> >Number:         58118
 >> >Category:       kern
 >> >Synopsis:       NetBSD/i386 Xen PV guest panic "fpudna from userland"
 >
 > FWIW I've seen this from time to time running anita tests:
 > http://largo.lip6.fr/~bouyer/NetBSD-tests/xen/
 >
 > I'm seeing FPU issues for PVH and HVM guests too, and for both i386 and amd64.
 >
 > This happens since the lazy FPU context switching has been removed.
 >
 > I tried to track it down but failed. The issue may be in Xen itself.
 >
 > I just upgraded the test box to Xen 4.18, maybe this will help ...


 I also noticed this on the DOM0:

 (XEN) d354v0 Triple fault - invoking HVM shutdown action 1
 (XEN) *** Dumping Dom354 vcpu#0 state: ***
 (XEN) ----[ Xen-4.15.3nb0  x86_64  debug=n  Not tainted ]----
 (XEN) CPU:    6
 (XEN) RIP:    0008:[<ffffffff80235384>]
 (XEN) RFLAGS: 0000000000010046   CONTEXT: hvm guest (d354v0)
 (XEN) rax: 0000000000000000   rbx: 0000000000000000   rcx: 0000000000000000
 (XEN) rdx: ffffd36a5495a780   rsi: 0000000000000002   rdi: ffff9a0000002000
 (XEN) rbp: ffff9a0240900e50   rsp: ffff9a0240900e50   r8:  0000000020000020
 (XEN) r9:  0070e9311f5676db   r10: 000000001dcd6500   r11: 0000000000000000
 (XEN) r12: 0000000000000000   r13: 0000000000000000   r14: ffff9a0240900fb0
 (XEN) r15: ffffffff81887ae0   cr0: 000000008005003b   cr4: 00000000003606b0
 (XEN) cr3: 0000000113b07000   cr2: 00007f7fffa88fe8
 (XEN) fsb: 00007402ee324850   gsb: ffffffff8183bd40   gss: 0000000000000000
 (XEN) ds: 0023   es: 0023   fs: 0000   gs: 0000   ss: 0010   cs: 0008


 I can't say with 100% confidence that domain 354 was the NetBSD/i386
 guest that panicked, but it is possible.  I have made note of the
 domain id number and will see if it survives the cron job run.  The
 build it was doing completed without a panic...  but I could probably
 start it all over again and see if I can force the issue to happen.



 -- 
 Brad Spencer - brad@anduin.eldar.org - KC8VKS - http://anduin.eldar.org

Copyright © 1994-2024 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.