NetBSD Problem Report #55667

From kardel@Kardel.name  Thu Sep 17 13:55:13 2020
Return-Path: <kardel@Kardel.name>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 0F4861A9239
	for <gnats-bugs@gnats.NetBSD.org>; Thu, 17 Sep 2020 13:55:13 +0000 (UTC)
Message-Id: <20200917093506.9207544B33@Andromeda.Kardel.name>
Date: Thu, 17 Sep 2020 11:35:06 +0200 (CEST)
From: kardel@netbsd.org
Reply-To: kardel@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: regression: XEN3_DOM0 fails to boot on
X-Send-Pr-Version: 3.95

>Number:         55667
>Category:       kern
>Synopsis:       regression: XEN3_DOM0 fails to boot on
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    jdolecek
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Sep 17 14:00:00 +0000 2020
>Last-Modified:  Mon Feb 22 10:00:02 +0000 2021
>Originator:     kardel@netbsd.org
>Release:        NetBSD 9.99.55<->9.99.72
>Organization:

>Environment:


System: NetBSD Toblerone 9.99.55 NetBSD 9.99.55 (XEN3_DOM0) #1: Thu Apr 9 11:39:49 CEST 2020 kardel@Toblerone:/usr/src/sys/arch/amd64/compile/obj/XEN3_DOM0 amd64
Architecture: x86_64
Machine: amd64
>Description:
	XEN3_DOM0 used to boot flawlessly on this machine. Probably the
	MSI/MSIX interrupt rework introduced the regression.
	The boot stalls after attaching ums0 to wsmouse0. Basic
	timekeeping seems to work, as the IPMI driver logs its version
	16 seconds after boot while the system appears hung.
	This is a regression as 9.99.55 used to work fine.

System info:
machdep.hypervisor = generic
machdep.idle-mechanism = xen
machdep.dmi.system-vendor = Supermicro
machdep.dmi.system-product = AS -2113S-WN24RT
machdep.dmi.system-version = 0123456789
machdep.dmi.bios-vendor = American Megatrends Inc.
machdep.dmi.bios-version = 2.0b
machdep.dmi.bios-date = 20191115
machdep.dmi.board-vendor = Supermicro
machdep.dmi.board-product = H11SSW-NT
machdep.dmi.board-version = 2.00
machdep.dmi.board-asset-tag = To be filled by O.E.M.
machdep.dmi.chassis-vendor = Supermicro
machdep.dmi.chassis-type = Supermicro
machdep.dmi.chassis-version = 0123456789
machdep.dmi.chassis-asset-tag = To be filled by O.E.M.
machdep.dmi.processor-vendor = Advanced Micro Devices, Inc.
machdep.dmi.processor-version = AMD EPYC 7302P 16-Core Processor               
machdep.dmi.processor-frequency = 3000 MHz
machdep.xen.version = 4.11.3nb1
machdep.xen.balloon.current = 4194304
machdep.xen.balloon.target = 4194304
machdep.xen.balloon.min = 2048
machdep.xen.balloon.max = 17179869180

	Working with jdolecek@ we have seen multiple scenarios (via patches
	to the nvme driver), ranging from the solid stall (right now) to
	boots with abysmally slow NVMe access.

	Using xen 4.13 didn't change the boot stall.

	Other large Intel based systems run XEN3_DOM0 9.99.72 just fine.
	9.99.55 also runs on this system just fine.

>How-To-Repeat:
	Try to boot 9.99.72 XEN3_DOM0 on Supermicro AS -2113S-WN24RT. Watch it get
	stuck.
>Fix:
	?

>Release-Note:

>Audit-Trail:
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/55667: regression: XEN3_DOM0 fails to boot on
Date: Fri, 19 Feb 2021 11:29:07 +0100

 Now I gave 9.99.80 XEN3_DOM0 a shot. Things have changed from
 getting stuck to a panic:

 [   1.0000030] ixg1: clearing prefetchable bit
 [   1.0000030] xpq_flush_queue: 2 entries (0 successful) on cpu0 (0)
 [   1.0000030] panic: HYPERVISOR_mmu_update failed, ret: -22

 [   1.0000030] cpu0: Begin traceback...
 [   1.0000030] vpanic() at netbsd:vpanic+0x14a
 [   1.0000030] snprintf() at netbsd:snprintf
 [   1.0000030] xpq_queue_machphys_update() at netbsd:xpq_queue_machphys_update
 [   1.0000030] pmap_zero_page() at netbsd:pmap_zero_page+0xe3
 [   1.0000030] uvm_pagealloc_strat() at netbsd:uvm_pagealloc_strat+0x218
 [   1.0000030] pmap_get_physpage() at netbsd:pmap_get_physpage+0x1c9
 [   1.0000030] pmap_growkernel() at netbsd:pmap_growkernel+0x1b0
 [   1.0000030] uvm_map_prepare() at netbsd:uvm_map_prepare+0x350
 [   1.0000030] uvm_map() at netbsd:uvm_map+0x6e
 [   1.0000030] uvm_km_alloc() at netbsd:uvm_km_alloc+0xfc
 [   1.0000030] x86_mem_add_mapping() at netbsd:x86_mem_add_mapping+0x98
 [   1.0000030] bus_space_map() at netbsd:bus_space_map+0x59
 [   1.0000030] ixgbe_attach() at netbsd:ixgbe_attach+0x311
 [   1.0000030] config_attach_loc() at netbsd:config_attach_loc+0x176
 [   1.0000030] pci_probe_device() at netbsd:pci_probe_device+0x582
 [   1.0000030] pci_enumerate_bus() at netbsd:pci_enumerate_bus+0x1b5
 [   1.0000030] pcirescan() at netbsd:pcirescan+0x4e
 [   1.0000030] pciattach() at netbsd:pciattach+0x186
 [   1.0000030] config_attach_loc() at netbsd:config_attach_loc+0x176
 [   1.0000030] ppbattach() at netbsd:ppbattach+0x1c5
 [   1.0000030] config_attach_loc() at netbsd:config_attach_loc+0x176
 [   1.0000030] pci_probe_device() at netbsd:pci_probe_device+0x582
 [   1.0000030] pci_enumerate_bus() at netbsd:pci_enumerate_bus+0x1b5
 [   1.0000030] pcirescan() at netbsd:pcirescan+0x4e
 [   1.0000030] pciattach() at netbsd:pciattach+0x186
 [   1.0000030] config_attach_loc() at netbsd:config_attach_loc+0x176
 [   1.0000030] mp_pci_scan() at netbsd:mp_pci_scan+0x9e
 [   1.0000030] hypervisor_attach() at netbsd:hypervisor_attach+0x3f0
 [   1.0000030] config_attach_loc() at netbsd:config_attach_loc+0x176
 [   1.0000030] xen_mainbus_attach() at netbsd:xen_mainbus_attach+0x53
 [   1.0000030] mainbus_attach() at netbsd:mainbus_attach+0x4a
 [   1.0000030] config_attach_loc() at netbsd:config_attach_loc+0x176
 [   1.0000030] cpu_configure() at netbsd:cpu_configure+0x25
 [   1.0000030] main() at netbsd:main+0x32c
 [   1.0000030] cpu0: End traceback...
 [   1.0000030] fatal breakpoint trap in supervisor mode
 [   1.0000030] trap type 1 code 0 rip 0xffffffff8023e93d cs 0xe030 rflags 0x202 cr2 0 ilevel 0x8 rsp 0xffffffff8316be50
 [   1.0000030] curlwp 0xffffffff80e48340 pid 0.0 lowest kstack 0xffffffff831682c0
 Stopped in pid 0.0 (system) at  netbsd:breakpoint+0x5:  leave

 The panic stack traces are like the one above, but in different paths
 of autoconf. I also saw panics in ACPI device configuration.
 The panic is always panic: HYPERVISOR_mmu_update failed, ret: -22.
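
 Since the panics differ only in their autoconf path, it can help to reduce
 each captured traceback to its list of function names and compare them. The
 following is a minimal sketch, not part of the original report: the helper
 names extract_frames and common_inner_frames are hypothetical, and the
 regular expression assumes the ddb frame format seen in the log above.

```python
import re

# Assumed frame format, taken from the log in this report, e.g.
# "[   1.0000030] pmap_zero_page() at netbsd:pmap_zero_page+0xe3".
# \s+ also spans a line break, so mail-wrapped frames still match.
FRAME_RE = re.compile(r"\]\s+(\w+)\(\)\s+at\s+netbsd:\w+(?:\+0x[0-9a-f]+)?")


def extract_frames(log_text):
    """Return the function names of a ddb traceback, innermost frame first."""
    return FRAME_RE.findall(log_text)


def common_inner_frames(trace_a, trace_b):
    """Return the leading (innermost) frames shared by two tracebacks."""
    shared = []
    for a, b in zip(trace_a, trace_b):
        if a != b:
            break
        shared.append(a)
    return shared


# Excerpt of the traceback quoted in this report.
sample = """\
 [   1.0000030] cpu0: Begin traceback...
 [   1.0000030] vpanic() at netbsd:vpanic+0x14a
 [   1.0000030] snprintf() at netbsd:snprintf
 [   1.0000030] xpq_queue_machphys_update() at
 netbsd:xpq_queue_machphys_update
 [   1.0000030] pmap_zero_page() at netbsd:pmap_zero_page+0xe3
 [   1.0000030] cpu0: End traceback...
"""

frames = extract_frames(sample)
for name in frames:
    print(name)
```

 Feeding two tracebacks from different boots through extract_frames and
 then common_inner_frames shows where the paths diverge; a shared inner
 part ending in the pmap code would suggest the same failing mapping
 operation regardless of which driver triggered it.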

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
        kardel@netbsd.org
Subject: Re: kern/55667: regression: XEN3_DOM0 fails to boot on
Date: Fri, 19 Feb 2021 20:14:23 +0100

 On Fri, Feb 19, 2021 at 10:30:02AM +0000, Frank Kardel wrote:
 >  [...]
 >  The panic stack traces are like the one above, but in different paths
 >  of autoconf. I also saw panics in ACPI device configuration.
 >  The panic is always panic: HYPERVISOR_mmu_update failed, ret: -22.

 This means we try to map something we're not allowed to map.
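
 For readers unfamiliar with the return value: a minimal sketch of decoding
 it, assuming the hypercall's negative return follows the usual errno
 numbering, under which 22 is EINVAL. The helper name decode_hypercall_ret
 is hypothetical and not part of any NetBSD or Xen tool.

```python
import errno
import re

# Matches the panic message quoted in this report.
PANIC_RE = re.compile(r"HYPERVISOR_mmu_update failed, ret: (-?\d+)")


def decode_hypercall_ret(line):
    """Return (ret, errno_name) for a panic line, or None if it does not match.

    Assumption: the value is a negated errno number.
    """
    match = PANIC_RE.search(line)
    if match is None:
        return None
    ret = int(match.group(1))
    name = errno.errorcode.get(abs(ret), "unknown")
    return ret, name


result = decode_hypercall_ret(
    "[   1.0000030] panic: HYPERVISOR_mmu_update failed, ret: -22"
)
print(result)
```

 An "invalid argument" result is consistent with the hypervisor rejecting
 the requested mapping rather than running out of a resource.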

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 years of experience will always make the difference
 --

Responsible-Changed-From-To: kern-bug-people->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Mon, 22 Feb 2021 07:54:18 +0000
Responsible-Changed-Why:
Probably mine. I get the same behaviour on my test machine, so I want
to investigate and fix it. I only get a stuck boot, though, no panic.


From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/55667 (regression: XEN3_DOM0 fails to boot on)
Date: Mon, 22 Feb 2021 09:18:28 +0100

 Good you have a machine you can test with.

 It started out with a hang as described and then progressed to the panic.


 On 02/22/21 08:54, jdolecek@NetBSD.org wrote:
 > Synopsis: regression: XEN3_DOM0 fails to boot on
 >
 > Responsible-Changed-From-To: kern-bug-people->jdolecek
 > Responsible-Changed-By: jdolecek@NetBSD.org
 > Responsible-Changed-When: Mon, 22 Feb 2021 07:54:18 +0000
 > Responsible-Changed-Why:
 > Probably mine. I get the same behaviour on my test machine, so I want
 > to investigate and fix it. I only get a stuck boot, though, no panic.
 >
 >
 >

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/55667: regression: XEN3_DOM0 fails to boot on
Date: Mon, 22 Feb 2021 09:58:23 +0000

 On Fri, Feb 19, 2021 at 10:30:02AM +0000, Frank Kardel wrote:
 > The following reply was made to PR kern/55667; it has been noted by GNATS.
 > 
 > From: Frank Kardel <kardel@netbsd.org>
 > To: gnats-bugs@netbsd.org
 > Cc: 
 > Subject: Re: kern/55667: regression: XEN3_DOM0 fails to boot on
 > Date: Fri, 19 Feb 2021 11:29:07 +0100
 > 
 >  Now I gave 9.99.80 XEN3_DOM0 a shot. Things have changed from
 >  getting stuck to a panic:
 >  
 >  [   1.0000030] ixg1: clearing prefetchable bit
 >  [   1.0000030] xpq_flush_queue: 2 entries (0 successful) on cpu0 (0)
 >  [   1.0000030] panic: HYPERVISOR_mmu_update failed, ret: -22

 Just like PR port-xen/55978

>Unformatted:
