NetBSD Problem Report #55207

From www@netbsd.org  Sat Apr 25 17:54:17 2020
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id A7BB81A9218
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 25 Apr 2020 17:54:17 +0000 (UTC)
Message-Id: <20200425175416.DC40B1A9243@mollari.NetBSD.org>
Date: Sat, 25 Apr 2020 17:54:16 +0000 (UTC)
From: pbraun@nethence.com
Reply-To: pbraun@nethence.com
To: gnats-bugs@NetBSD.org
Subject: netbsd domU does not migrate properly from one xen host to another
X-Send-Pr-Version: www-1.0

>Number:         55207
>Category:       port-xen
>Synopsis:       netbsd domU does not migrate properly from one xen host to another
>Confidential:   no
>Severity:       critical
>Priority:       low
>Responsible:    port-xen-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Apr 25 17:55:00 +0000 2020
>Last-Modified:  Sat Aug 08 11:45:01 +0000 2020
>Originator:     Pierre-Philipp Braun
>Release:        netbsd-7,netbsd-8,netbsd-9,HEAD
>Organization:
Innopolis University
>Environment:
NetBSD netbsdffs 9.99.59 NetBSD 9.99.59 (XEN3_DOMU) #0: Sat Apr 25 19:38:36 MSK 2020  root@netbsdffs:/usr/objdir/sys/arch/amd64/compile/XEN3_DOMU amd64
>Description:
The Xen farm runs 4.11, but this would also happen with the latest 4.13 release.
The dom0s here are Slackware Linux 14.2 with kernel 5.5.5.

Migrating from one Xen host to another with `xl migrate` looks good at first:

migration target: Ready to receive domain.
Saving to migration stream new xl format (info 0x3/0x0/993)
Loading new save file <incoming migration stream> (new xl fmt info 0x3/0x0/993)
 Savefile contains xl domain config in JSON format
Parsing config from <saved>
xc: info: Saving domain 11, type x86 PV
xc: info: Found x86 PV domain from Xen 4.11
xc: info: Restoring domain

and then the errors start:

libxl: error: libxl_dom_suspend.c:262:domain_suspend_common_pvcontrol_suspending: Domain 11:guest didn't acknowledge suspend, cancelling request
xc: error: Bad mfn for suspend record: Internal error
xc: error: mfn 0x2, max 0x2030000: Internal error
xc: error:   m2p[0x2] = 0xffffffffffffffff, max_pfn 0x7ffff: Internal error
xc: error: Save failed (34 = Numerical result out of range): Internal error
libxl: error: libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 11:saving domain: domain did not respond to suspend request: Numerical result out of range
migration sender: libxl_domain_suspend failed (rc=-8)
xc: error: Failed to read Record Header from stream (0 = Success): Internal error
xc: error: Restore failed (0 = Success): Internal error
libxl: error: libxl_stream_read.c:850:libxl__xc_domain_restore_done: restoring domain: Success
libxl: error: libxl_create.c:1266:domcreate_rebuild_done: Domain 2:cannot (re-)build domain: -3
libxl: error: libxl_domain.c:1034:libxl__destroy_domid: Domain 2:Non-existant domain
libxl: error: libxl_domain.c:993:domain_destroy_callback: Domain 2:Unable to destroy guest
libxl: error: libxl_domain.c:920:domain_destroy_cb: Domain 2:Destruction of domain failed
migration target: Domain creation failed (code -3).
libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: migration transport process [10506] exited with error status 1
Migration failed, failed to suspend at sender.

...which crashes the guest.

The guest console does not show anything after:

[ 124.2702597] xenbus_shutdown_handler: xenbus_rm 13
[ 124.4904546] Flushing disk caches: done

but during previous migration tests with a NetBSD 9.0 dom0 I could see more:

[ 106.6704667] xenbus_shutdown_handler: xenbus_rm 13
[ 106.7003299] Flushing disk caches: 7 done
[ 106.7103358] fatal page fault in supervisor mode
[ 106.7103358] trap type 6 code 0 rip 0xffffffff8020313f cs 0xe030 rflags 0x10256 cr2 0
0xffffd180790ffcc8
[ 106.7103358] curlwp 0xffffd18003a378e0 pid 0.7 lowest kstack 0xffffd180790fc2c0
[ 106.7103358] panic: trap
[ 106.7103358] cpu0: Begin traceback...
[ 106.7103358] vpanic() at netbsd:vpanic+0x143
[ 106.7103358] snprintf() at netbsd:snprintf
[ 106.7103358] startlwp() at netbsd:startlwp
[ 106.7103358] alltraps() at netbsd:alltraps+0xae
[ 106.7103358] sleepq_block() at netbsd:sleepq_block+0x19a
[ 106.7103358] cv_wait() at netbsd:cv_wait+0x9e

>How-To-Repeat:
Get a Xen farm up and running with at least two nodes.  Shared storage and a shared network are not even necessary to reproduce the bug.  Tested with GNU/Linux dom0s here, but the problem might be the same with NetBSD dom0s.
>Fix:
The problem has been there since at least 2015; apparently it is not an easy fix.
https://mail-index.netbsd.org/port-xen/2015/01/18/msg008440.html

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: port-xen-maintainer->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Sat, 25 Apr 2020 20:25:50 +0000
Responsible-Changed-Why:
I noticed related behaviour: xbd(4) isn't able to detach properly
after 'xl block-detach' for a device used as the root filesystem, and I
want to look at it eventually.


From: Jaromír Doleček <jaromir.dolecek@gmail.com>
To: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Cc: 
Subject: Re: port-xen/55207: netbsd domU does not migrate properly from one
 xen host to another
Date: Mon, 11 May 2020 22:06:40 +0200

 FYI: I investigated this a bit with -current. A way to trigger what
 seems to be a similar (if not the same) problem is to just run xl save:

 asus-beast# xl save avx2 /mnt/zz-save
 Saving to /mnt/zz-save new xl format (info 0x3/0x0/1291)
 xc: info: Saving domain 4, type x86 PV
 libxl: error: libxl_dom_suspend.c:262:domain_suspend_common_pvcontrol_suspending: Domain 4:guest didn't acknowledge suspend, cancelling request
 xc: error: Domain has not been suspended: shutdown 0, reason 255: Internal error
 xc: error: Save failed (0 = Undefined error: 0): Internal error
 libxl: error: libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 4:saving domain: domain did not respond to suspend request: Undefined error: 0
 Failed to save domain, resuming domain
 xc: error: Dom 4 not suspended: (shutdown 0, reason 255): Internal error
 libxl: error: libxl_dom_suspend.c:472:libxl__domain_resume: Domain 4:xc_domain_resume failed: Invalid argument

 with LOCKDEBUG kernel this triggers in the DomU:

 login: [ 342.0400600] xenbus_shutdown_handler: xenbus_rm 13
 [ 342.2800364] Flushing disk caches: 1 done
 [ 342.2900473] Mutex error: mutex_vector_enter,514: spin lock held

 [ 342.2900473] lock address : 0xffff9e80020121d0 type     :               spin
 [ 342.2900473] initialized  : 0xffffffff802133f9
 [ 342.2900473] shared holds :                  0 exclusive:                  1
 [ 342.2900473] shares wanted:                  0 exclusive:                  0
 [ 342.2900473] relevant cpu :                  0 last held:                  0
 [ 342.2900473] relevant lwp : 0xffff9e8002577b80 last held: 0xffff9e8002577b80
 [ 342.2900473] last locked* : 0xffffffff80212370 unlocked : 0xffffffff80213b96
 [ 342.2900473] owner field  : 0x0000000000010600 wait/spin:                0/1

 [ 342.2900473] panic: LOCKDEBUG: Mutex error: mutex_vector_enter,514: spin lock held
 [ 342.2900473] cpu0: Begin traceback...
 [ 342.2900473] vpanic() at netbsd:vpanic+0x146
 [ 342.2900473] snprintf() at netbsd:snprintf
 [ 342.2900473] lockdebug_more() at netbsd:lockdebug_more
 [ 342.2900473] mutex_enter() at netbsd:mutex_enter+0x342
 [ 342.2900473] event_remove_handler() at netbsd:event_remove_handler+0x26
 [ 342.2900473] xbd_xenbus_suspend() at netbsd:xbd_xenbus_suspend+0x91
 [ 342.2900473] device_pmf_driver_suspend() at netbsd:device_pmf_driver_suspend+0x46
 [ 342.2900473] pmf_device_suspend_locked() at netbsd:pmf_device_suspend_locked+0xeb
 [ 342.2900473] pmf_device_suspend() at netbsd:pmf_device_suspend+0x45
 [ 342.2900473] pmf_system_suspend() at netbsd:pmf_system_suspend+0xba
 [ 342.2900473] sysctl_xen_suspend() at netbsd:sysctl_xen_suspend+0xf1
 [ 342.2900473] sysctl_dispatch() at netbsd:sysctl_dispatch+0xa3
 [ 342.2900473] sys___sysctl() at netbsd:sys___sysctl+0xc5
 [ 342.2900473] syscall() at netbsd:syscall+0x9c
 [ 342.2900473] --- syscall (number 202) ---

 Without LOCKDEBUG, trying to acquire the same spin lock a second time
 would simply deadlock, which seems to match the original report.

 Continuing investigation.
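
The hazard described above can be illustrated with a small user-space sketch. It assumes, as the message suggests, that a suspend routine calls a helper that takes a lock the caller already holds. Everything here (`toy_lock`, `toy_remove_handler`, `toy_suspend_buggy`) is hypothetical and is not the NetBSD kmutex or xbd(4) code. The toy lock reports the self-acquisition instead of spinning forever, which is roughly the role a debugging check plays.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy non-recursive lock; NOT the NetBSD kmutex implementation. */
struct toy_lock {
	bool held;
	const void *owner;	/* identity of the holder, NULL when free */
};

enum toy_result { TOY_OK, TOY_SELF_DEADLOCK, TOY_BUSY };

/*
 * Try to take the lock on behalf of `self`.  A real spin lock would
 * spin in both failure cases; here they are reported to the caller.
 */
static enum toy_result
toy_lock_enter(struct toy_lock *l, const void *self)
{
	if (l->held)
		return (l->owner == self) ? TOY_SELF_DEADLOCK : TOY_BUSY;
	l->held = true;
	l->owner = self;
	return TOY_OK;
}

static void
toy_lock_exit(struct toy_lock *l, const void *self)
{
	assert(l->held && l->owner == self);
	l->held = false;
	l->owner = NULL;
}

/* Hypothetical helper that takes the lock itself. */
static enum toy_result
toy_remove_handler(struct toy_lock *l, const void *self)
{
	enum toy_result r = toy_lock_enter(l, self);

	if (r == TOY_OK)
		toy_lock_exit(l, self);
	return r;
}

/* Suspend path that calls the helper while still holding the lock. */
static enum toy_result
toy_suspend_buggy(struct toy_lock *l, const void *self)
{
	enum toy_result r;

	if (toy_lock_enter(l, self) != TOY_OK)
		return TOY_BUSY;
	r = toy_remove_handler(l, self);	/* re-enters the held lock */
	toy_lock_exit(l, self);
	return r;
}
```

Calling `toy_suspend_buggy()` returns `TOY_SELF_DEADLOCK`, while calling `toy_remove_handler()` on its own with the lock free returns `TOY_OK`.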

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55207 CVS commit: src/sys/arch/xen/xen
Date: Tue, 12 May 2020 09:54:02 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Tue May 12 09:54:02 UTC 2020

 Modified Files:
 	src/sys/arch/xen/xen: xbd_xenbus.c

 Log Message:
 move xen_intr_disestablish() call in xbd_xenbus_suspend() so it's executed
 without holding the xbd mutex, to avoid LOCKDEBUG assertion on suspend

 while here only disestablish the intr if it was established

 part of PR port-xen/55207


 To generate a diff of this commit:
 cvs rdiff -u -r1.125 -r1.126 src/sys/arch/xen/xen/xbd_xenbus.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
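
The two ideas in this log message (call the teardown routine only after dropping the driver lock, and only if a handler was established) can be sketched generically as follows. This is an illustration under assumptions, not the actual diff: `toy_softc`, `toy_intr_disestablish` and `toy_suspend` are hypothetical names, and a boolean stands in for the xbd mutex.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical driver state; a bool stands in for the driver mutex. */
struct toy_softc {
	bool locked;
	void *intr_handle;		/* NULL when nothing is established */
	int disestablish_calls;
	bool called_with_lock_held;	/* set if the rule is violated */
};

/* Stand-in for a teardown routine that must not run under the lock. */
static void
toy_intr_disestablish(struct toy_softc *sc, void *ih)
{
	(void)ih;
	sc->disestablish_calls++;
	if (sc->locked)
		sc->called_with_lock_held = true;
}

static void
toy_suspend(struct toy_softc *sc)
{
	void *ih;

	sc->locked = true;
	/* State that needs the lock is updated here. */
	ih = sc->intr_handle;		/* remember what to tear down */
	sc->intr_handle = NULL;
	sc->locked = false;

	/* Tear down outside the locked region, and only if needed. */
	if (ih != NULL)
		toy_intr_disestablish(sc, ih);
}
```

A second `toy_suspend()` call on the same structure is a no-op for the handler, because the handle was cleared while the lock was held.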

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55207 CVS commit: src/sys/kern
Date: Tue, 12 May 2020 10:02:56 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Tue May 12 10:02:56 UTC 2020

 Modified Files:
 	src/sys/kern: kern_pmf.c

 Log Message:
 need to take IFNET_LOCK() around if_stop (on suspend) and if_init (on resume)
 calls, those need to read and/or manipulate if_flags and hence need
 the lock for IFEF_MPSAFE drivers; the drivers can't do IFNET_LOCK() themselves,
 because the ioctl path call these hooks with the lock held

 fixes KASSERT() in xennet(4) while investigating PR port-xen/55207


 To generate a diff of this commit:
 cvs rdiff -u -r1.42 -r1.43 src/sys/kern/kern_pmf.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
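
The locking rule in this log message is that the hooks expect the lock to be held by whoever calls them, because one caller (the ioctl path) already holds it. A minimal sketch of that convention follows. The `toy_` names are hypothetical and a boolean stands in for the interface lock; this is not the kern_pmf.c code.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_IFF_RUNNING	0x1

struct toy_ifnet {
	bool lock_held;		/* stands in for the interface lock */
	int flags;
};

/* Driver hooks: they require the lock and never take it themselves. */
static void
toy_if_stop(struct toy_ifnet *ifp)
{
	assert(ifp->lock_held);
	ifp->flags &= ~TOY_IFF_RUNNING;
}

static void
toy_if_init(struct toy_ifnet *ifp)
{
	assert(ifp->lock_held);
	ifp->flags |= TOY_IFF_RUNNING;
}

/* ioctl-like path: already holds the lock when it calls the hook. */
static void
toy_ioctl_down(struct toy_ifnet *ifp)
{
	ifp->lock_held = true;
	toy_if_stop(ifp);
	ifp->lock_held = false;
}

/* Power-management paths: take the lock around the hook calls. */
static void
toy_pmf_suspend(struct toy_ifnet *ifp)
{
	ifp->lock_held = true;
	toy_if_stop(ifp);
	ifp->lock_held = false;
}

static void
toy_pmf_resume(struct toy_ifnet *ifp)
{
	ifp->lock_held = true;
	toy_if_init(ifp);
	ifp->lock_held = false;
}
```

If the hooks took the lock themselves, `toy_ioctl_down()` would acquire it twice, which is why the sketch puts the acquisition in the callers.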

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55207 CVS commit: src/sys/arch/xen
Date: Wed, 13 May 2020 13:19:38 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Wed May 13 13:19:38 UTC 2020

 Modified Files:
 	src/sys/arch/xen/xen: evtchn.c
 	src/sys/arch/xen/xenbus: xenbus_comms.c xenbus_comms.h xenbus_probe.c

 Log Message:
 don't reinitialize mutexes/cv on resume

 part of PR port-xen/55207


 To generate a diff of this commit:
 cvs rdiff -u -r1.94 -r1.95 src/sys/arch/xen/xen/evtchn.c
 cvs rdiff -u -r1.23 -r1.24 src/sys/arch/xen/xenbus/xenbus_comms.c
 cvs rdiff -u -r1.7 -r1.8 src/sys/arch/xen/xenbus/xenbus_comms.h
 cvs rdiff -u -r1.52 -r1.53 src/sys/arch/xen/xenbus/xenbus_probe.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
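
As a general illustration of why synchronization objects are initialized once and left alone on resume, consider the sketch below. It is not taken from the diff: `toy_sync` and `toy_bus` are hypothetical, and the counter exists only to show that initialization happens at attach and nowhere else, so any holder state survives a suspend/resume cycle.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a mutex or condition variable. */
struct toy_sync {
	int init_count;		/* how many times it was initialized */
	bool held;
};

struct toy_bus {
	struct toy_sync lock;
	bool connected;
};

static void
toy_sync_init(struct toy_sync *s)
{
	s->init_count++;
	s->held = false;	/* initialization discards holder state */
}

/* One-time setup: the only place the lock is initialized. */
static void
toy_bus_attach(struct toy_bus *b)
{
	toy_sync_init(&b->lock);
	b->connected = true;
}

static void
toy_bus_suspend(struct toy_bus *b)
{
	b->connected = false;
}

/* Resume re-establishes the connection but leaves the lock alone. */
static void
toy_bus_resume(struct toy_bus *b)
{
	b->connected = true;
}
```

If `toy_bus_resume()` called `toy_sync_init()` again, a lock held across the suspend would silently appear free afterwards.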

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55207 CVS commit: src/sys/arch/xen/xen
Date: Wed, 13 May 2020 16:13:14 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Wed May 13 16:13:14 UTC 2020

 Modified Files:
 	src/sys/arch/xen/xen: xengnt.c

 Log Message:
 need to set the version on resume same as during initialization

 part of PR port-xen/55207


 To generate a diff of this commit:
 cvs rdiff -u -r1.37 -r1.38 src/sys/arch/xen/xen/xengnt.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55207 CVS commit: src/sys/arch/xen/xen
Date: Wed, 13 May 2020 16:17:46 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Wed May 13 16:17:46 UTC 2020

 Modified Files:
 	src/sys/arch/xen/xen: xbd_xenbus.c

 Log Message:
 move the xen_intr_disestablish() to resume - having it in suspend
 seems to cause panic in later phases of suspend

 don't try to revoke grants in resume, they are all gone

 add some diagnostic code in suspend to make sure the request lists are ready
 for resume

 part of PR port-xen/55207


 To generate a diff of this commit:
 cvs rdiff -u -r1.126 -r1.127 src/sys/arch/xen/xen/xbd_xenbus.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55207 CVS commit: src/sys/arch/xen/xen
Date: Thu, 14 May 2020 09:47:25 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Thu May 14 09:47:25 UTC 2020

 Modified Files:
 	src/sys/arch/xen/xen: if_xennet_xenbus.c

 Log Message:
 rearrange so that suspend & resume doesn't cause panics, and interface
 is more likely to work - particularly, don't try to xengnt_revoke_access()
 after resume, move xen_intr_disestablish() call to resume, also
 unmask the event channel on resume

 XXX right now xennet device detaches immediately after resume, which is not
 desirable and needs to be fixed

 part of PR port-xen/55207


 To generate a diff of this commit:
 cvs rdiff -u -r1.124 -r1.125 src/sys/arch/xen/xen/if_xennet_xenbus.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55207 CVS commit: src/sys/arch/xen/xen
Date: Thu, 14 May 2020 13:25:40 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Thu May 14 13:25:40 UTC 2020

 Modified Files:
 	src/sys/arch/xen/xen: if_xennet_xenbus.c

 Log Message:
 fix resume for xennet, now the network continues working after resume

 we can't read feature-rx-copy in resume, at that time the new backend
 device is not filled yet; convert it just to feature flag read on interface
 attach, can assume any backend would support rx-copy anyway

 fix compile with XENNET_DEBUG

 part of PR port-xen/55207


 To generate a diff of this commit:
 cvs rdiff -u -r1.125 -r1.126 src/sys/arch/xen/xen/if_xennet_xenbus.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

Responsible-Changed-From-To: jdolecek->port-xen-maintainer
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Mon, 29 Jun 2020 11:03:05 +0000
Responsible-Changed-Why:
There is still some port-specific work to avoid the crash on suspend,
which seems to be due to interrupts not being disabled properly. I'm
not going to tackle this part for now, volunteers welcome.


From: Pierre-Philipp Braun <pbraun@nethence.com>
To: gnats-bugs@netbsd.org, port-xen-maintainer@netbsd.org,
 jdolecek@netbsd.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Cc: 
Subject: Re: port-xen/55207 (netbsd domU does not migrate properly from one
 xen host to another)
Date: Sat, 8 Aug 2020 11:06:54 +0300

 Thank you, Jaromír, for your improvements to hot migration of NetBSD Xen PV domUs.  Even though the work is not finished, I tried again with v9 and with the daily HEAD build 202008060510Z; the results are below.  I don't know whether it matters, but for the record, these new tests were run on top of NFSv3 vdisks (ext4 sparse files), with a fake /dev/ on tmpfs, a fictitious disk label, and FFS2 as the root file system, created by makefs from NetBSD's cross-compilation toolchain on Linux.

 with v9

 root@slack2hb:~/guests/slime9# xl save slime9 slime9.save
 Saving to slime9.save new xl format (info 0x3/0x0/1497)
 xc: info: Saving domain 18, type x86 PV
 xc: error: save callback suspend() failed: 0: Internal error
 xc: error: Save failed (0 = Success): Internal error
 libxl: error: libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 18:saving domain: domain responded to suspend request: Success
 Failed to save domain, resuming domain
 libxl: error: libxl_dom.c:40:libxl__domain_type: unable to get domain type for domid=18

 [  63.5605882] xenbus_shutdown_handler: xenbus_rm 13
 [  63.5803786] Flushing disk caches: 8 done
 [  63.6004200] fatal page fault in supervisor mode
 [  63.6004200] trap type 6 code 0 rip 0xffffffff8020313f cs 0xe030 rflags 0x10256 cr2 0x10 ilevel 0x6 rsp 0xffffd2004c92fd58
 [  63.6004200] curlwp 0xffffd20001d1d480 pid 0.3 lowest kstack 0xffffd2004c92c2c0
 [  63.6004200] panic: trap
 [  63.6004200] cpu0: Begin traceback...
 [  63.6004200] vpanic() at netbsd:vpanic+0x143
 [  63.6004200] snprintf() at netbsd:snprintf
 [  63.6004200] startlwp() at netbsd:startlwp
 [  63.6004200] alltraps() at netbsd:alltraps+0xae
 [  63.6004200] softint_thread() at netbsd:softint_thread+0x117
 [  63.6004200] cpu0: End traceback...

 [  63.6004200] dumping to dev 142,1 (offset=0, size=0): not possible
 [  63.6004200] rebooting...

 with current

 root@slack2hb:~/guests/slime# xl save slime slime.save
 Saving to slime.save new xl format (info 0x3/0x0/1479)
 xc: info: Saving domain 20, type x86 PV
 xc: Frames: 262144/262144  100%
 xc: error: Bad mfn for suspend record: Internal error
 xc: error: mfn 0x7f7fff4cb7e0, max 0x2030000: Internal error
 xc: error: Save failed (34 = Numerical result out of range): Internal error
 libxl: error: libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 20:saving domain: domain responded to suspend request: Numerical result out of range
 Failed to save domain, resuming domain

 [  19.2002331] Flushing disk caches: done
 [  19.7401533] uvm_fault(0xffffffff808a8200, 0xffffffff80baf000, 2) -> e
 [  19.7401533] fatal page fault in supervisor mode
 [  19.7401533] trap type 6 code 0x2 rip 0xffffffff806034a9 cs 0xe030 rflags 0x10202 cr2 0xffffffff80baf001 ilevel 0 rsp 0xffffb7805d508e70
 [  19.7401533] curlwp 0xffffb78002ad9a00 pid 827.827 lowest kstack 0xffffb7805d5042c0
 [  19.7401533] panic: trap
 [  19.7401533] cpu0: Begin traceback...
 [  19.7401533] vpanic() at netbsd:vpanic+0x146
 [  19.7401533] snprintf() at netbsd:snprintf
 [  19.7401533] startlwp() at netbsd:startlwp
 [  19.7401533] cpu0: End traceback...

 [  19.7401533] dumping to dev 142,1 (offset=0, size=0): not possible
 [  19.7401533] rebooting...

 I am not sure how to interpret this but I hope it's useful.

>Unformatted:
