NetBSD Problem Report #55207
From www@netbsd.org Sat Apr 25 17:54:17 2020
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id A7BB81A9218
for <gnats-bugs@gnats.NetBSD.org>; Sat, 25 Apr 2020 17:54:17 +0000 (UTC)
Message-Id: <20200425175416.DC40B1A9243@mollari.NetBSD.org>
Date: Sat, 25 Apr 2020 17:54:16 +0000 (UTC)
From: pbraun@nethence.com
Reply-To: pbraun@nethence.com
To: gnats-bugs@NetBSD.org
Subject: netbsd domU does not migrate properly from one xen host to another
X-Send-Pr-Version: www-1.0
>Number: 55207
>Category: port-xen
>Synopsis: netbsd domU does not migrate properly from one xen host to another
>Confidential: no
>Severity: critical
>Priority: low
>Responsible: port-xen-maintainer
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Apr 25 17:55:00 +0000 2020
>Last-Modified: Sat Aug 08 11:45:01 +0000 2020
>Originator: Pierre-Philipp Braun
>Release: netbsd-7,netbsd-8,netbsd-9,HEAD
>Organization:
Innopolis University
>Environment:
NetBSD netbsdffs 9.99.59 NetBSD 9.99.59 (XEN3_DOMU) #0: Sat Apr 25 19:38:36 MSK 2020 root@netbsdffs:/usr/objdir/sys/arch/amd64/compile/XEN3_DOMU amd64
>Description:
The XEN farm runs 4.11, but this would also happen with the latest 4.13 release.
The dom0s here are Slackware Linux 14.2 with kernel 5.5.5.
When migrating from one xen host to another using `xl migrate`, it looks good at first:
migration target: Ready to receive domain.
Saving to migration stream new xl format (info 0x3/0x0/993)
Loading new save file <incoming migration stream> (new xl fmt info 0x3/0x0/993)
Savefile contains xl domain config in JSON format
Parsing config from <saved>
xc: info: Saving domain 11, type x86 PV
xc: info: Found x86 PV domain from Xen 4.11
xc: info: Restoring domain
and then comes
libxl: error: libxl_dom_suspend.c:262:domain_suspend_common_pvcontrol_suspending: Domain 11:guest didn't acknowledge suspend, cancelling request
xc: error: Bad mfn for suspend record: Internal error
xc: error: mfn 0x2, max 0x2030000: Internal error
xc: error: m2p[0x2] = 0xffffffffffffffff, max_pfn 0x7ffff: Internal error
xc: error: Save failed (34 = Numerical result out of range): Internal error
libxl: error: libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 11:saving domain: domain did not respond to suspend request: Numerical result out of range
migration sender: libxl_domain_suspend failed (rc=-8)
xc: error: Failed to read Record Header from stream (0 = Success): Internal error
xc: error: Restore failed (0 = Success): Internal error
libxl: error: libxl_stream_read.c:850:libxl__xc_domain_restore_done: restoring domain: Success
libxl: error: libxl_create.c:1266:domcreate_rebuild_done: Domain 2:cannot (re-)build domain: -3
libxl: error: libxl_domain.c:1034:libxl__destroy_domid: Domain 2:Non-existant domain
libxl: error: libxl_domain.c:993:domain_destroy_callback: Domain 2:Unable to destroy guest
libxl: error: libxl_domain.c:920:domain_destroy_cb: Domain 2:Destruction of domain failed
migration target: Domain creation failed (code -3).
libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: migration transport process [10506] exited with error status 1
Migration failed, failed to suspend at sender.
...which crashes the guest.
The guest console does not show anything after:
[ 124.2702597] xenbus_shutdown_handler: xenbus_rm 13
[ 124.4904546] Flushing disk caches: done
but during earlier migration tests with a NetBSD 9.0 dom0 I could see more:
[ 106.6704667] xenbus_shutdown_handler: xenbus_rm 13
[ 106.7003299] Flushing disk caches: 7 done
[ 106.7103358] fatal page fault in supervisor mode
[ 106.7103358] trap type 6 code 0 rip 0xffffffff8020313f cs 0xe030 rflags 0x10256 cr2 0
0xffffd180790ffcc8
[ 106.7103358] curlwp 0xffffd18003a378e0 pid 0.7 lowest kstack 0xffffd180790fc2c0
[ 106.7103358] panic: trap
[ 106.7103358] cpu0: Begin traceback...
[ 106.7103358] vpanic() at netbsd:vpanic+0x143
[ 106.7103358] snprintf() at netbsd:snprintf
[ 106.7103358] startlwp() at netbsd:startlwp
[ 106.7103358] alltraps() at netbsd:alltraps+0xae
[ 106.7103358] sleepq_block() at netbsd:sleepq_block+0x19a
[ 106.7103358] cv_wait() at netbsd:cv_wait+0x9e
>How-To-Repeat:
Get a XEN farm up and running with at least two nodes. Shared storage and a shared network are not even necessary to reproduce the bug. Tested with GNU/Linux dom0s here, but the problem might be the same with NetBSD dom0s.
>Fix:
It's there since 2015 at least - apparently it's not an easy fix.
https://mail-index.netbsd.org/port-xen/2015/01/18/msg008440.html
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: port-xen-maintainer->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Sat, 25 Apr 2020 20:25:50 +0000
Responsible-Changed-Why:
I noticed related behaviour, xbd(4) isn't able to detach properly
after 'xl block-detach' for device used as root filesystem, and I want
to look at it eventually.
From: Jaromír Doleček <jaromir.dolecek@gmail.com>
To: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Cc:
Subject: Re: port-xen/55207: netbsd domU does not migrate properly from one
xen host to another
Date: Mon, 11 May 2020 22:06:40 +0200
FYI - I investigated this a bit with -current. A way to trigger what
seems to be a similar (if not the same) problem is to just run xl save:
asus-beast# xl save avx2 /mnt/zz-save
Saving to /mnt/zz-save new xl format (info 0x3/0x0/1291)
xc: info: Saving domain 4, type x86 PV
libxl: error: libxl_dom_suspend.c:262:domain_suspend_common_pvcontrol_suspending: Domain 4:guest didn't acknowledge suspend, cancelling request
xc: error: Domain has not been suspended: shutdown 0, reason 255: Internal error
xc: error: Save failed (0 = Undefined error: 0): Internal error
libxl: error: libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 4:saving domain: domain did not respond to suspend request: Undefined error: 0
Failed to save domain, resuming domain
xc: error: Dom 4 not suspended: (shutdown 0, reason 255): Internal error
libxl: error: libxl_dom_suspend.c:472:libxl__domain_resume: Domain 4:xc_domain_resume failed: Invalid argument
with LOCKDEBUG kernel this triggers in the DomU:
login: [ 342.0400600] xenbus_shutdown_handler: xenbus_rm 13
[ 342.2800364] Flushing disk caches: 1 done
[ 342.2900473] Mutex error: mutex_vector_enter,514: spin lock held
[ 342.2900473] lock address : 0xffff9e80020121d0 type : spin
[ 342.2900473] initialized : 0xffffffff802133f9
[ 342.2900473] shared holds : 0 exclusive: 1
[ 342.2900473] shares wanted: 0 exclusive: 0
[ 342.2900473] relevant cpu : 0 last held: 0
[ 342.2900473] relevant lwp : 0xffff9e8002577b80 last held: 0xffff9e8002577b80
[ 342.2900473] last locked* : 0xffffffff80212370 unlocked : 0xffffffff80213b96
[ 342.2900473] owner field : 0x0000000000010600 wait/spin: 0/1
[ 342.2900473] panic: LOCKDEBUG: Mutex error: mutex_vector_enter,514: spin lock held
[ 342.2900473] cpu0: Begin traceback...
[ 342.2900473] vpanic() at netbsd:vpanic+0x146
[ 342.2900473] snprintf() at netbsd:snprintf
[ 342.2900473] lockdebug_more() at netbsd:lockdebug_more
[ 342.2900473] mutex_enter() at netbsd:mutex_enter+0x342
[ 342.2900473] event_remove_handler() at netbsd:event_remove_handler+0x26
[ 342.2900473] xbd_xenbus_suspend() at netbsd:xbd_xenbus_suspend+0x91
[ 342.2900473] device_pmf_driver_suspend() at netbsd:device_pmf_driver_suspend+0x46
[ 342.2900473] pmf_device_suspend_locked() at netbsd:pmf_device_suspend_locked+0xeb
[ 342.2900473] pmf_device_suspend() at netbsd:pmf_device_suspend+0x45
[ 342.2900473] pmf_system_suspend() at netbsd:pmf_system_suspend+0xba
[ 342.2900473] sysctl_xen_suspend() at netbsd:sysctl_xen_suspend+0xf1
[ 342.2900473] sysctl_dispatch() at netbsd:sysctl_dispatch+0xa3
[ 342.2900473] sys___sysctl() at netbsd:sys___sysctl+0xc5
[ 342.2900473] syscall() at netbsd:syscall+0x9c
[ 342.2900473] --- syscall (number 202) ---
Trying to acquire the same spinlock a second time without LOCKDEBUG would
simply deadlock, which seems to match the original report.
Continuing investigation.
From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/55207 CVS commit: src/sys/arch/xen/xen
Date: Tue, 12 May 2020 09:54:02 +0000
Module Name: src
Committed By: jdolecek
Date: Tue May 12 09:54:02 UTC 2020
Modified Files:
src/sys/arch/xen/xen: xbd_xenbus.c
Log Message:
move xen_intr_disestablish() call in xbd_xenbus_suspend() so it's executed
without holding the xbd mutex, to avoid LOCKDEBUG assertion on suspend
while here only disestablish the intr if it was established
part of PR port-xen/55207
To generate a diff of this commit:
cvs rdiff -u -r1.125 -r1.126 src/sys/arch/xen/xen/xbd_xenbus.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
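
The change described in the log message above follows a common pattern: take what is needed while holding the lock, drop the lock, then call the teardown routine. A minimal userland sketch of that pattern follows. All names (`fake_softc`, `fake_disestablish`, `fake_suspend`) are hypothetical and POSIX mutexes stand in for the kernel ones; this is not the code from xbd_xenbus.c.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical driver state; "ih" stands in for an interrupt handle. */
struct fake_softc {
	pthread_mutex_t lock;
	void *ih;		/* NULL if no handler is established */
};

static int teardown_calls;	/* counts teardown invocations */

/* Stand-in for a teardown routine that may take other locks itself. */
static void
fake_disestablish(void *ih)
{
	assert(ih != NULL);
	teardown_calls++;
}

void
fake_suspend(struct fake_softc *sc)
{
	void *ih;

	/* Detach the handle from the softc while holding the lock ... */
	pthread_mutex_lock(&sc->lock);
	ih = sc->ih;
	sc->ih = NULL;
	pthread_mutex_unlock(&sc->lock);

	/* ... and tear it down only after the lock is released, and only
	 * if a handler was established in the first place. */
	if (ih != NULL)
		fake_disestablish(ih);
}
```

Calling `fake_suspend()` twice on the same structure runs the teardown once, which mirrors the "only disestablish the intr if it was established" part of the log message.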
From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/55207 CVS commit: src/sys/kern
Date: Tue, 12 May 2020 10:02:56 +0000
Module Name: src
Committed By: jdolecek
Date: Tue May 12 10:02:56 UTC 2020
Modified Files:
src/sys/kern: kern_pmf.c
Log Message:
need to take IFNET_LOCK() around if_stop (on suspend) and if_init (on resume)
calls, those need to read and/or manipulate if_flags and hence need
the lock for IFEF_MPSAFE drivers; the drivers can't do IFNET_LOCK() themselves,
because the ioctl path call these hooks with the lock held
fixes KASSERT() in xennet(4) while investigating PR port-xen/55207
To generate a diff of this commit:
cvs rdiff -u -r1.42 -r1.43 src/sys/kern/kern_pmf.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/55207 CVS commit: src/sys/arch/xen
Date: Wed, 13 May 2020 13:19:38 +0000
Module Name: src
Committed By: jdolecek
Date: Wed May 13 13:19:38 UTC 2020
Modified Files:
src/sys/arch/xen/xen: evtchn.c
src/sys/arch/xen/xenbus: xenbus_comms.c xenbus_comms.h xenbus_probe.c
Log Message:
don't reinitialize mutexes/cv on resume
part of PR port-xen/55207
To generate a diff of this commit:
cvs rdiff -u -r1.94 -r1.95 src/sys/arch/xen/xen/evtchn.c
cvs rdiff -u -r1.23 -r1.24 src/sys/arch/xen/xenbus/xenbus_comms.c
cvs rdiff -u -r1.7 -r1.8 src/sys/arch/xen/xenbus/xenbus_comms.h
cvs rdiff -u -r1.52 -r1.53 src/sys/arch/xen/xenbus/xenbus_probe.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
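
The idea in the log message above, initializing synchronization objects once at attach time and leaving them alone on resume, can be sketched as follows. This is an illustration under assumptions: the names (`fake_bus`, `fake_bus_attach`, `fake_bus_resume`) are hypothetical and POSIX mutexes are used in place of the kernel primitives.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical bus state with one lock and two counters for illustration. */
struct fake_bus {
	pthread_mutex_t lock;
	int lock_inits;		/* how often the lock was initialized */
	int connects;		/* how often the connection was (re)made */
};

/* Work that has to be redone after every resume. */
static void
fake_bus_connect(struct fake_bus *b)
{
	pthread_mutex_lock(&b->lock);
	b->connects++;
	pthread_mutex_unlock(&b->lock);
}

/* Attach: the only place where the lock is initialized. */
void
fake_bus_attach(struct fake_bus *b)
{
	int rc = pthread_mutex_init(&b->lock, NULL);

	assert(rc == 0);
	(void)rc;
	b->lock_inits = 1;
	b->connects = 0;
	fake_bus_connect(b);
}

/* Resume: reconnect, but reuse the lock that attach created. */
void
fake_bus_resume(struct fake_bus *b)
{
	assert(b->lock_inits == 1);
	fake_bus_connect(b);
}
```

Splitting the one-time setup from the repeatable part keeps any thread that still holds or waits on the lock across suspend from seeing it reset underneath it.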
From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/55207 CVS commit: src/sys/arch/xen/xen
Date: Wed, 13 May 2020 16:13:14 +0000
Module Name: src
Committed By: jdolecek
Date: Wed May 13 16:13:14 UTC 2020
Modified Files:
src/sys/arch/xen/xen: xengnt.c
Log Message:
need to set the version on resume same as during initialization
part of PR port-xen/55207
To generate a diff of this commit:
cvs rdiff -u -r1.37 -r1.38 src/sys/arch/xen/xen/xengnt.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/55207 CVS commit: src/sys/arch/xen/xen
Date: Wed, 13 May 2020 16:17:46 +0000
Module Name: src
Committed By: jdolecek
Date: Wed May 13 16:17:46 UTC 2020
Modified Files:
src/sys/arch/xen/xen: xbd_xenbus.c
Log Message:
move the xen_intr_disestablish() to resume - having it in suspend
seems to cause panic in later phases of suspend
don't try to revoke grants in resume, they are all gone
add some diagnostic code in suspend to make sure the request lists are ready
for resume
part of PR port-xen/55207
To generate a diff of this commit:
cvs rdiff -u -r1.126 -r1.127 src/sys/arch/xen/xen/xbd_xenbus.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/55207 CVS commit: src/sys/arch/xen/xen
Date: Thu, 14 May 2020 09:47:25 +0000
Module Name: src
Committed By: jdolecek
Date: Thu May 14 09:47:25 UTC 2020
Modified Files:
src/sys/arch/xen/xen: if_xennet_xenbus.c
Log Message:
rearrange so that suspend & resume doesn't cause panics, and interface
is more likely to work - particularly, don't try to xengnt_revoke_access()
after resume, move xen_intr_disestablish() call to resume, also
unmask the event channel on resume
XXX right now xennet device detaches immediately after resume, which is not
desirable and needs to be fixed
part of PR port-xen/55207
To generate a diff of this commit:
cvs rdiff -u -r1.124 -r1.125 src/sys/arch/xen/xen/if_xennet_xenbus.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/55207 CVS commit: src/sys/arch/xen/xen
Date: Thu, 14 May 2020 13:25:40 +0000
Module Name: src
Committed By: jdolecek
Date: Thu May 14 13:25:40 UTC 2020
Modified Files:
src/sys/arch/xen/xen: if_xennet_xenbus.c
Log Message:
fix resume for xennet, now the network continues working after resume
we can't read feature-rx-copy in resume, at that time the new backend
device is not filled yet; convert it just to feature flag read on interface
attach, can assume any backend would support rx-copy anyway
fix compile with XENNET_DEBUG
part of PR port-xen/55207
To generate a diff of this commit:
cvs rdiff -u -r1.125 -r1.126 src/sys/arch/xen/xen/if_xennet_xenbus.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
Responsible-Changed-From-To: jdolecek->port-xen-maintainer
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Mon, 29 Jun 2020 11:03:05 +0000
Responsible-Changed-Why:
There is still some port-specific work to avoid the crash on suspend,
which seems to be due to interrupts not being disabled properly. I'm
not going to tackle this part for now, volunteers welcome.
From: Pierre-Philipp Braun <pbraun@nethence.com>
To: gnats-bugs@netbsd.org, port-xen-maintainer@netbsd.org,
jdolecek@netbsd.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Cc:
Subject: Re: port-xen/55207 (netbsd domU does not migrate properly from one
xen host to another)
Date: Sat, 8 Aug 2020 11:06:54 +0300
Thank you Jaromír for your improvements on hot-migrating netbsd XEN/PV domUs. Even though it is not finalized, I tried again with v9 vs daily build HEAD/202008060510Z and here are the results. I don't know if that matters but for the record, those new tests have been done on top of NFSv3 vdisk ext4 sparse files, with a fake /dev/ on tmpfs, a fictitious disk label and FFS2 as root file-system created by makefs from netbsd's cross-compilation toolchain on linux.
with v9
root@slack2hb:~/guests/slime9# xl save slime9 slime9.save
Saving to slime9.save new xl format (info 0x3/0x0/1497)
xc: info: Saving domain 18, type x86 PV
xc: error: save callback suspend() failed: 0: Internal error
xc: error: Save failed (0 = Success): Internal error
libxl: error: libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 18:saving domain: domain responded to suspend request: Success
Failed to save domain, resuming domain
libxl: error: libxl_dom.c:40:libxl__domain_type: unable to get domain type for domid=18
[ 63.5605882] xenbus_shutdown_handler: xenbus_rm 13
[ 63.5803786] Flushing disk caches: 8 done
[ 63.6004200] fatal page fault in supervisor mode
[ 63.6004200] trap type 6 code 0 rip 0xffffffff8020313f cs 0xe030 rflags 0x10256 cr2 0x10 ilevel 0x6 rsp 0xffffd2004c92fd58
[ 63.6004200] curlwp 0xffffd20001d1d480 pid 0.3 lowest kstack 0xffffd2004c92c2c0
[ 63.6004200] panic: trap
[ 63.6004200] cpu0: Begin traceback...
[ 63.6004200] vpanic() at netbsd:vpanic+0x143
[ 63.6004200] snprintf() at netbsd:snprintf
[ 63.6004200] startlwp() at netbsd:startlwp
[ 63.6004200] alltraps() at netbsd:alltraps+0xae
[ 63.6004200] softint_thread() at netbsd:softint_thread+0x117
[ 63.6004200] cpu0: End traceback...
[ 63.6004200] dumping to dev 142,1 (offset=0, size=0): not possible
[ 63.6004200] rebooting...
with current
root@slack2hb:~/guests/slime# xl save slime slime.save
Saving to slime.save new xl format (info 0x3/0x0/1479)
xc: info: Saving domain 20, type x86 PV
xc: Frames: 262144/262144 100%
xc: error: Bad mfn for suspend record: Internal error
xc: error: mfn 0x7f7fff4cb7e0, max 0x2030000: Internal error
xc: error: Save failed (34 = Numerical result out of range): Internal error
libxl: error: libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 20:saving domain: domain responded to suspend request: Numerical result out of range
Failed to save domain, resuming domain
[ 19.2002331] Flushing disk caches: done
[ 19.7401533] uvm_fault(0xffffffff808a8200, 0xffffffff80baf000, 2) -> e
[ 19.7401533] fatal page fault in supervisor mode
[ 19.7401533] trap type 6 code 0x2 rip 0xffffffff806034a9 cs 0xe030 rflags 0x10202 cr2 0xffffffff80baf001 ilevel 0 rsp 0xffffb7805d508e70
[ 19.7401533] curlwp 0xffffb78002ad9a00 pid 827.827 lowest kstack 0xffffb7805d5042c0
[ 19.7401533] panic: trap
[ 19.7401533] cpu0: Begin traceback...
[ 19.7401533] vpanic() at netbsd:vpanic+0x146
[ 19.7401533] snprintf() at netbsd:snprintf
[ 19.7401533] startlwp() at netbsd:startlwp
[ 19.7401533] cpu0: End traceback...
[ 19.7401533] dumping to dev 142,1 (offset=0, size=0): not possible
[ 19.7401533] rebooting...
I am not sure how to interpret this but I hope it's useful.
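
As an aid for reading the `xc: error: ... (34 = Numerical result out of range)` lines above: the number appears to be an errno value and the text the matching strerror(3) string from the dom0's C library. The sketch below reproduces that "number = text" shape. `format_errno()` is a hypothetical helper, not part of libxc, and the exact wording of the text depends on the C library in use.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Format an errno value as "<number> = <text>", the shape seen in the
 * xc log lines above. Returns the snprintf() result.
 */
int
format_errno(int err, char *buf, size_t len)
{
	assert(buf != NULL);
	return snprintf(buf, len, "%d = %s", err, strerror(err));
}
```

Passing `ERANGE` yields a string starting with `34 = ` on systems where ERANGE is 34, which is the case on both Linux and NetBSD. The library dependence would also explain why errno 0 shows up as "Success" in the Linux dom0 logs and as "Undefined error: 0" in the log from the -current test.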
>Unformatted: