NetBSD Problem Report #58712
From bernd@fluor.bersie.home Wed Oct 2 12:42:00 2024
Return-Path: <bernd@fluor.bersie.home>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
key-exchange X25519 server-signature RSA-PSS (2048 bits)
client-signature RSA-PSS (2048 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 228511A923B
for <gnats-bugs@gnats.NetBSD.org>; Wed, 2 Oct 2024 12:42:00 +0000 (UTC)
Message-Id: <20241002112444.2E11D981551@fluor.bersie.home>
Date: Wed, 2 Oct 2024 13:24:44 +0200 (CEST)
From: bernd.sieker@posteo.net
Reply-To: bernd.sieker@posteo.net
To: gnats-bugs@NetBSD.org
Subject: Kernel panic when stopping nvmm-backed qemu virtual machine
X-Send-Pr-Version: 3.95
>Number: 58712
>Category: port-amd64
>Synopsis: Kernel panic on the host machine when stopping qemu-nvmm virtual machine
>Confidential: no
>Severity: critical
>Priority: medium
>Responsible: port-amd64-maintainer
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Oct 02 12:45:00 +0000 2024
>Originator: bernd.sieker@posteo.net
>Release: NetBSD 10.0
>Organization:
>Environment:
System: NetBSD fluor.bersie.home 10.0 NetBSD 10.0 (FLUOR) #32: Fri May 3 17:09:48 CEST 2024 bernd@fluor.bersie.home:/usr/src/sys/arch/amd64/compile/FLUOR amd64
Architecture: x86_64
Machine: amd64
>Description:
I still occasionally get kernel panics on the host when stopping a qemu VM that uses nvmm on NetBSD 10.0.
This may be related to my previous PR 56969, but the panic message is different:
[ 4809693.261745] panic: kernel diagnostic assertion "anon != NULL && anon->an_ref != 0" failed: file "../../../../uvm/uvm_amap.c", line 777
[ 4809693.261745] cpu2: Begin traceback...
[ 4809693.261745] vpanic() at netbsd:vpanic+0x183
[ 4809693.271747] kern_assert() at netbsd:kern_assert+0x4b
[ 4809693.281746] amap_wipeout() at netbsd:amap_wipeout+0x88
[ 4809693.291746] uvm_unmap_detach() at netbsd:uvm_unmap_detach+0x56
[ 4809693.301747] uvmspace_free() at netbsd:uvmspace_free+0xf6
[ 4809693.311746] exit1() at netbsd:exit1+0x1b8
[ 4809693.311746] sys_exit() at netbsd:sys_exit+0x39
[ 4809693.321747] syscall() at netbsd:syscall+0x196
[ 4809693.331747] --- syscall (number 1) ---
[ 4809693.331747] netbsd:syscall+0x196:
[ 4809693.331747] cpu2: End traceback...
An earlier, similar panic, with a different assertion in the same function (amap_wipeout):
[ 2128260.002154] panic: kernel diagnostic assertion "anon->an_lock == amap->am_lock" failed: file "../../../../uvm/uvm_amap.c", line 779
[ 2128260.002154] cpu16: Begin traceback...
[ 2128260.002154] vpanic() at netbsd:vpanic+0x183
[ 2128260.002154] kern_assert() at netbsd:kern_assert+0x4b
[ 2128260.012153] amap_wipeout() at netbsd:amap_wipeout+0x101
[ 2128260.012153] uvm_unmap_detach() at netbsd:uvm_unmap_detach+0x56
[ 2128260.022154] uvmspace_free() at netbsd:uvmspace_free+0xf6
[ 2128260.022154] exit1() at netbsd:exit1+0x1b8
[ 2128260.032154] sys_exit() at netbsd:sys_exit+0x39
[ 2128260.032154] syscall() at netbsd:syscall+0x196
[ 2128260.032154] --- syscall (number 1) ---
[ 2128260.032154] netbsd:syscall+0x196:
[ 2128260.032154] cpu16: End traceback...
Kernel version and configuration were identical for both panics:
[ 1.000000] NetBSD 10.0 (FLUOR) #32: Fri May 3 17:09:48 CEST 2024
[ 1.000000] bernd@fluor.bersie.home:/usr/src/sys/arch/amd64/compile/FLUOR
[ 1.000000] total memory = 111 GB
[ 1.000000] avail memory = 108 GB
...
[ 1.000004] cpu0 at mainbus0 apid 0
[ 1.000004] cpu0: Use lfence to serialize rdtsc
[ 1.000004] cpu0: Intel(R) Xeon(R) CPU E5-2470 v2 @ 2.40GHz, id 0x306e4
[ 1.000004] cpu0: node 0, package 0, core 0, smt 0
...
[ 1.000004] cpu39 at mainbus0 apid 57
[ 1.000004] cpu39: Intel(R) Xeon(R) CPU E5-2470 v2 @ 2.40GHz, id 0x306e4
[ 1.000004] cpu39: node 2, package 1, core 12, smt 1
...
The hardware is a Dell T420 with two 10-core Xeon E5-2470 v2 CPUs, hyperthreading enabled
(40 logical CPUs).
The root filesystem is on a RAIDframe RAID 1 of two small SATA SSDs; the main storage holding
the VM disk images is a ZFS raidz2 of eight 2 TB SAS disks:
[ 1.029100] mpii0 at pci1 dev 0 function 0: Symbios Logic SAS2308 (rev. 0x05)
[ 1.029100] mpii0: interrupting at msix0 vec 0
[ 1.029100] mpii0: SAS9207-8i, firmware 20.0.7.0, MPI 2.0
...
[ 6.684580] sd0 at scsibus0 target 0 lun 0: <IBM-ESXS, ST32000444SS, BC2D> disk fixed
[ 6.694580] sd0: 1863 GB, 249000 cyl, 8 head, 1961 sec, 512 bytes/sect x 3907029168 sectors
[ 6.694580] sd0: tagged queueing
...
[ 6.814579] sd7 at scsibus0 target 7 lun 0: <IBM-ESXS, ST32000444SS, BC2D> disk fixed
[ 6.814579] sd7: 1863 GB, 249000 cyl, 8 head, 1961 sec, 512 bytes/sect x 3907029168 sectors
[ 6.824579] sd7: tagged queueing
...
[ 34.164590] kern.module.path=/stand/amd64/10.0/modules
[ 34.664591] ZFS filesystem version: 5
[ 34.744603] nvmm0: attached, using backend x86-vmx
This also happened on previous systems with Xeon X5675 CPUs and a raidz1 ZFS pool on 3 SATA disks.
As long as the VMs are running, everything seems fine. I usually run 3 or 4 virtual machines, a mix
of NetBSD 9.4, NetBSD 10.0, Linux and Windows, some of them heavily loaded, for weeks without problems.
It may or may not be related to PR #58335; from that report I cannot tell whether the host or the guest crashes.
>How-To-Repeat:
Run qemu/nvmm virtual machines for a while, then shut one of them down. Sometimes the above kernel panic
occurs on the host. I have not been able to narrow down the conditions; it seems totally random to me.
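The steps above can be sketched as a small Python driver that repeatedly starts and stops one guest.
This is only a sketch under assumptions: the image name, memory size, raw image format and run time
are placeholders that do not come from this report, and it is not known to trigger the panic any more
reliably than shutting a VM down by hand.

```python
import subprocess
import time


def build_qemu_cmd(disk_image, memory_mb=1024, qemu="qemu-system-x86_64"):
    """Return a qemu command line that uses the NVMM accelerator.

    disk_image and memory_mb are placeholders; the raw image format is assumed.
    """
    return [
        qemu,
        "-accel", "nvmm",
        "-m", str(memory_mb),
        "-drive", f"file={disk_image},format=raw",
        "-display", "none",
    ]


def cycle_vm(cmd, uptime_s=600.0, cycles=10,
             spawn=subprocess.Popen, sleep=time.sleep):
    """Start the VM, let it run for uptime_s seconds, stop it, and repeat.

    Returns the list of qemu exit codes. spawn and sleep are injectable
    so the loop can be exercised without actually starting qemu.
    """
    codes = []
    for _ in range(cycles):
        proc = spawn(cmd)
        sleep(uptime_s)
        # Ask qemu to quit; the panics above were logged while the
        # exiting process was freeing its address space.
        proc.terminate()
        codes.append(proc.wait())
    return codes
```

For example, cycle_vm(build_qemu_cmd("guest.img"), uptime_s=3600, cycles=20) would run a guest
from a hypothetical image guest.img for an hour, stop it, and repeat that 20 times. Since the
panics were seen after the VMs had been running for weeks, much longer run times may be needed.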
>Fix:
Unknown.