NetBSD Problem Report #57560
From brad@anduin.eldar.org Fri Aug 4 01:51:01 2023
Return-Path: <brad@anduin.eldar.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id C40371A9238
for <gnats-bugs@gnats.NetBSD.org>; Fri, 4 Aug 2023 01:51:01 +0000 (UTC)
Message-Id: <202308040150.3741ouTN017772@anduin.eldar.org>
Date: Thu, 3 Aug 2023 21:50:56 -0400 (EDT)
From: brad@anduin.eldar.org
Reply-To: brad@anduin.eldar.org
To: gnats-bugs@NetBSD.org
Subject: xennetback_xenbus.c rev 1.109 broke guests, causes DOM0 reboot
X-Send-Pr-Version: 3.95
>Number: 57560
>Category: port-xen
>Synopsis: xennetback_xenbus.c rev 1.109 broke guests, causes DOM0 reboot
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: port-xen-maintainer
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Aug 04 01:55:00 +0000 2023
>Last-Modified: Wed Aug 09 18:25:01 +0000 2023
>Originator: brad@anduin.eldar.org
>Release: NetBSD 10.0_BETA
>Organization:
eldar.org
>Environment:
System: NetBSD 10.0_BETA or -current as of 2023-08-03, XEN3_DOM0 kernel
Architecture: x86_64
Machine: amd64
>Description:
Given a DOM0 with Xen 4.15.3 built from pkgsrc 2022Q3 on a 32GB box
with Intel Xeon E-2224 CPU:
There have been a small number of non whitespace updates to the Xen
network backend in -current:
http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/arch/xen/xen/xennetback_xenbus.c?rev=1.109&content-type=text/x-cvsweb-markup&only_with_tag=MAIN
http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/arch/xen/xen/xennetback_xenbus.c?rev=1.110&content-type=text/x-cvsweb-markup&only_with_tag=MAIN
http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/arch/xen/xen/xennetback_xenbus.c?rev=1.111&content-type=text/x-cvsweb-markup&only_with_tag=MAIN
These all appear in NetBSD 10.0_BETA as:
http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/arch/xen/xen/xennetback_xenbus.c?rev=1.108.4.1&content-type=text/x-cvsweb-markup&only_with_tag=netbsd-10
Starting at 1.109, when ANY Xen guest starts to use its network
interfaces, the DOM0 mentioned will lock up for a bit and reboot.
There is no entering DDB and no panic message. On a NetBSD guest this
is when /etc/rc.d/network runs, and on a Linux ArchLinux guest it is
when systemd.networkd runs (or equiv) executes. The DOMU guest kernel
probe for the network device just fine.
I have no ability to have a serial console on the DOM0, but it is
possible that the hypervisior is panicing and rebooting with some sort
of message output, as opposed to just doing something spontaneous.
Reverting back to 1.108 of xennetback_xenbus.c allows the guest to
start up and run as before. I did not try 1.110 and 1.111 independent
of 1.109.
If this problem happens to others it may mess with the ability to
update the kernel on their DOM0 system even if a newer Xen works with
>= 1.109.
>How-To-Repeat:
Try to use a recent -current, the one I used was 10.99.6, or NetBSD
10.0_BETA as a DOM0 on Xen 4.15.3 from pkgsrc 2022Q3.
>Fix:
I fixed the problem by reverting back to a previous version of the
code, but it seems likely that some of the fixes are desirable from
reading the commit messages.
>Audit-Trail:
From: Taylor R Campbell <riastradh@NetBSD.org>
To: brad@anduin.eldar.org
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/57560: xennetback_xenbus.c rev 1.109 broke guests, causes DOM0 reboot
Date: Fri, 4 Aug 2023 07:21:10 +0000
This is a multi-part message in MIME format.
--=_usKIEqQi97a3RY2xFPoqMtrNfJcy98qu
Can you please try the attached patch?
--=_usKIEqQi97a3RY2xFPoqMtrNfJcy98qu
Content-Type: text/plain; charset="ISO-8859-1"; name="pr57560"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename="pr57560.patch"
From 569f569998b5b8a741d185a17d6a83c525db0476 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh@NetBSD.org>
Date: Fri, 4 Aug 2023 07:17:41 +0000
Subject: [PATCH] xennet(4): Use correct criterion to test for unconsumed
requests.
Be less clever about minimizing xen_rmb and updates to
xneti->xni_txring.req_cons.
PR kern/57560
---
sys/arch/xen/xen/xennetback_xenbus.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)
diff --git a/sys/arch/xen/xen/xennetback_xenbus.c b/sys/arch/xen/xen/xennet=
back_xenbus.c
index bf7733d0da98..2402f3a20923 100644
--- a/sys/arch/xen/xen/xennetback_xenbus.c
+++ b/sys/arch/xen/xen/xennetback_xenbus.c
@@ -803,23 +803,21 @@ xennetback_evthandler(void *arg)
netif_tx_request_t txreq;
struct mbuf *m, *m0 =3D NULL, *mlast =3D NULL;
int receive_pending;
- RING_IDX req_cons, req_prod;
+ RING_IDX req_cons;
int queued =3D 0, m0_len =3D 0;
struct xnetback_xstate *xst;
const bool discard =3D ((ifp->if_flags & (IFF_UP | IFF_RUNNING)) !=3D
(IFF_UP | IFF_RUNNING));
=20
XENPRINTF(("xennetback_evthandler "));
- req_cons =3D xneti->xni_txring.req_cons;
again:
- req_prod =3D xneti->xni_txring.sring->req_prod;
- xen_rmb();
- while (req_cons !=3D req_prod) {
+ while (RING_HAS_UNCONSUMED_REQUESTS(&xneti->xni_txring)) {
+ xen_rmb();
+ req_cons =3D xneti->xni_txring.req_cons++;
RING_COPY_REQUEST(&xneti->xni_txring, req_cons,
&txreq);
XENPRINTF(("%s pkt size %d\n", xneti->xni_if.if_xname,
txreq.size));
- req_cons++;
if (__predict_false(discard)) {
/* interface not up, drop all requests */
if_statinc(ifp, if_iqdrops);
@@ -959,7 +957,6 @@ mbuf_fail:
RING_FINAL_CHECK_FOR_REQUESTS(&xneti->xni_txring, receive_pending);
if (receive_pending)
goto again;
- xneti->xni_txring.req_cons =3D req_cons;
=20
if (m0) {
/* Queue empty, and still unfinished multi-fragment request */
--=_usKIEqQi97a3RY2xFPoqMtrNfJcy98qu--
From: Taylor R Campbell <riastradh@NetBSD.org>
To: brad@anduin.eldar.org
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/57560: xennetback_xenbus.c rev 1.109 broke guests, causes DOM0 reboot
Date: Fri, 4 Aug 2023 07:35:58 +0000
Also, can you check to see if there's anything from the previous boot
in dmesg when the dom0 comes back?
> Starting at 1.109, when ANY Xen guest starts to use its network
> interfaces, the DOM0 mentioned will lock up for a bit and reboot.
How long is `for a bit'? Around 15sec or something else?
From: Brad Spencer <brad@anduin.eldar.org>
To: Taylor R Campbell <riastradh@NetBSD.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/57560: xennetback_xenbus.c rev 1.109 broke guests, causes
DOM0 reboot
Date: Fri, 04 Aug 2023 06:48:27 -0400
Taylor R Campbell <riastradh@NetBSD.org> writes:
> Also, can you check to see if there's anything from the previous boot
> in dmesg when the dom0 comes back?
While I did not do that as a routine matter during the testing, I did
check one time and there was nothing. The dmesg started with the NetBSD
kernel boot messages as far as I could tell. Should the lockup / panic
happen again I will look.
>> Starting at 1.109, when ANY Xen guest starts to use its network
>> interfaces, the DOM0 mentioned will lock up for a bit and reboot.
>
> How long is `for a bit'? Around 15sec or something else?
Less than 15 second, something in the range of 5 to 8 seconds I would
guess. I wasn't timing the hang very closely. The DOM0 was fine up
until the point that the DOMU started networking and at that point, ya,
something like 5 to 8 seconds and then the DOM0 screen would blank, sync
would drop on the VGAish monitor and a reboot would proceed. No DDB, no
DOM0 panic traceback messages, etc... Sort of like one hit the hard
reset button (at least without ACPI involved).
--
Brad Spencer - brad@anduin.eldar.org - KC8VKS - http://anduin.eldar.org
From: Brad Spencer <brad@anduin.eldar.org>
To: gnats-bugs@netbsd.org
Cc: port-xen-maintainer@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: kern/57560: xennetback_xenbus.c rev 1.109 broke guests, causes
DOM0 reboot
Date: Fri, 04 Aug 2023 13:10:49 -0400
Taylor R Campbell <riastradh@NetBSD.org> writes:
> The following reply was made to PR port-xen/57560; it has been noted by GNATS.
>
> From: Taylor R Campbell <riastradh@NetBSD.org>
> To: brad@anduin.eldar.org
> Cc: gnats-bugs@NetBSD.org
> Subject: Re: kern/57560: xennetback_xenbus.c rev 1.109 broke guests, causes DOM0 reboot
> Date: Fri, 4 Aug 2023 07:21:10 +0000
>
> This is a multi-part message in MIME format.
> --=_usKIEqQi97a3RY2xFPoqMtrNfJcy98qu
>
> Can you please try the attached patch?
Thanks for the quick response. My target system for this all to
function correctly is really NetBSD 10.0, but the results are likely to
apply to -current just as well. The Xen xennet backend is identical
between the two right now (in as much as I can tell right now).
It went as follows:
The source tree was patched with the provided patch, a kernel built, and
the DOM0 booted with it. The test DOMU that I had been using did start
and I was able to log into it. I proceeded to start the production DOMU
guests and all seemed to be well for a while. However, I would say
after 5 to 10 minutes, the DOM0 panic'ed / rebooted as before when the
patch was not installed. No DDB, no traceback... etc.. same thing,
just a little later (the load on the DOMUs would have been ramping up a
lot at this point, as well). When the DOM0 had stablized, I booted it
with the same source tree, except with xennetback_xenbus.c reverted back
to 1.108. I started all of the guests and the DOM0 is still running.
This is all a pretty disruptive operation to perform, so I could not do
anything more at the moment, but I would say that the patch improved the
situation, but did not fully address the problem. I will be watching
the DOMU guests closely and the DOM0 to see if anything undesirable
develops as there might be something else wrong, but for right now, the
environment is stable.
From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/57560 CVS commit: src/sys/arch/xen/xen
Date: Fri, 4 Aug 2023 18:40:36 +0000
Module Name: src
Committed By: riastradh
Date: Fri Aug 4 18:40:36 UTC 2023
Modified Files:
src/sys/arch/xen/xen: xennetback_xenbus.c
Log Message:
Revert "xennetback(4): Omit needless membars in xennetback_connect."
PR kern/57560
To generate a diff of this commit:
cvs rdiff -u -r1.112 -r1.113 src/sys/arch/xen/xen/xennetback_xenbus.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/57560 CVS commit: src/sys/arch/xen/xen
Date: Fri, 4 Aug 2023 18:40:49 +0000
Module Name: src
Committed By: riastradh
Date: Fri Aug 4 18:40:49 UTC 2023
Modified Files:
src/sys/arch/xen/xen: xennetback_xenbus.c
Log Message:
Revert "xennetback(4): Fix membars in xennetback_rx_copy_process."
PR kern/57560
To generate a diff of this commit:
cvs rdiff -u -r1.113 -r1.114 src/sys/arch/xen/xen/xennetback_xenbus.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/57560 CVS commit: src/sys/arch/xen/xen
Date: Fri, 4 Aug 2023 18:41:01 +0000
Module Name: src
Committed By: riastradh
Date: Fri Aug 4 18:41:01 UTC 2023
Modified Files:
src/sys/arch/xen/xen: xennetback_xenbus.c
Log Message:
Revert "xennetback(4): Fix xennetback_evthandler loop."
PR kern/57560
To generate a diff of this commit:
cvs rdiff -u -r1.114 -r1.115 src/sys/arch/xen/xen/xennetback_xenbus.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/57560 CVS commit: [netbsd-10] src/sys/arch/xen/xen
Date: Fri, 4 Aug 2023 19:53:43 +0000
Module Name: src
Committed By: martin
Date: Fri Aug 4 19:53:43 UTC 2023
Modified Files:
src/sys/arch/xen/xen [netbsd-10]: xennetback_xenbus.c
Log Message:
Pull up following revision(s) (requested by riastradh in ticket #312):
sys/arch/xen/xen/xennetback_xenbus.c: revision 1.113
sys/arch/xen/xen/xennetback_xenbus.c: revision 1.114
sys/arch/xen/xen/xennetback_xenbus.c: revision 1.115
Revert "xennetback(4): Omit needless membars in xennetback_connect."
PR kern/57560
Revert "xennetback(4): Fix membars in xennetback_rx_copy_process."
PR kern/57560
Revert "xennetback(4): Fix xennetback_evthandler loop."
PR kern/57560
To generate a diff of this commit:
cvs rdiff -u -r1.108.4.1 -r1.108.4.2 src/sys/arch/xen/xen/xennetback_xenbus.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Taylor R Campbell <riastradh@NetBSD.org>
To: Brad Spencer <brad@anduin.eldar.org>
Cc: gnats-bugs@NetBSD.org, martin@NetBSD.org
Subject: Re: kern/57560: xennetback_xenbus.c rev 1.109 broke guests, causes DOM0 reboot
Date: Fri, 4 Aug 2023 20:19:33 +0000
I reverted the commits on HEAD and pulled the revert up to netbsd-10,
so the immediate problem should be addressed.
However, the code is also slightly problematic because it's still
missing a barrier in one place, and still using a much-too-expensive
barrier (even on x86!) in the loop in another place, and still has
several nonsensical barriers in other places that can't hurt
correctness but do hurt understandability and maintainability.
So I went through to try to redo the changes more carefully. If you
have the opportunity to test or bisect on another patch, I've drafted
a series of changes, attached:
- The .diff file is a single diff of the whole changes. If it works,
great -- I can commit. But if not:
- The .patch file is a series of individual changes for bisection to
see exactly which one breaks it.
From: Taylor R Campbell <riastradh@NetBSD.org>
To: Brad Spencer <brad@anduin.eldar.org>
Cc: gnats-bugs@NetBSD.org, martin@NetBSD.org
Subject: Re: kern/57560: xennetback_xenbus.c rev 1.109 broke guests, causes DOM0 reboot
Date: Fri, 4 Aug 2023 20:26:25 +0000
This is a multi-part message in MIME format.
--=_8xiQiGqtxVZ2LhiPkIgdNm2J38qsNMwx
> Date: Fri, 4 Aug 2023 20:19:33 +0000
> From: Taylor R Campbell <riastradh@NetBSD.org>
>
> So I went through to try to redo the changes more carefully. If you
> have the opportunity to test or bisect on another patch, I've drafted
> a series of changes, attached: [...]
...attached for real this time.
--=_8xiQiGqtxVZ2LhiPkIgdNm2J38qsNMwx
Content-Type: text/plain; charset="ISO-8859-1"; name="xvifmembars"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename="xvifmembars.diff"
diff --git a/sys/arch/xen/xen/xennetback_xenbus.c b/sys/arch/xen/xen/xennet=
back_xenbus.c
index dac32a00d8db..559230017ff5 100644
--- a/sys/arch/xen/xen/xennetback_xenbus.c
+++ b/sys/arch/xen/xen/xennetback_xenbus.c
@@ -497,9 +497,7 @@ xennetback_connect(struct xnetback_instance *xneti)
goto err2;
}
xneti->xni_evtchn =3D evop.u.bind_interdomain.local_port;
- xen_wmb();
xneti->xni_status =3D CONNECTED;
- xen_wmb();
=20
xneti->xni_ih =3D xen_intr_establish_xname(-1, &xen_pic,
xneti->xni_evtchn, IST_LEVEL, IPL_NET, xennetback_evthandler,
@@ -805,28 +803,27 @@ xennetback_evthandler(void *arg)
netif_tx_request_t txreq;
struct mbuf *m, *m0 =3D NULL, *mlast =3D NULL;
int receive_pending;
- RING_IDX req_cons;
int queued =3D 0, m0_len =3D 0;
struct xnetback_xstate *xst;
const bool discard =3D ((ifp->if_flags & (IFF_UP | IFF_RUNNING)) !=3D
(IFF_UP | IFF_RUNNING));
=20
XENPRINTF(("xennetback_evthandler "));
- req_cons =3D xneti->xni_txring.req_cons;
- while (1) {
- xen_rmb(); /* be sure to read the request before updating */
- xneti->xni_txring.req_cons =3D req_cons;
- xen_wmb();
- RING_FINAL_CHECK_FOR_REQUESTS(&xneti->xni_txring,
- receive_pending);
- if (receive_pending =3D=3D 0)
- break;
- RING_COPY_REQUEST(&xneti->xni_txring, req_cons,
- &txreq);
+again:
+ while (RING_HAS_UNCONSUMED_REQUESTS(&xneti->xni_txring)) {
+ /*
+ * Ensure we have read the producer's queue index in
+ * RING_FINAL_CHECK_FOR_REQUESTS before we read the
+ * content of the producer's next request in
+ * RING_COPY_REQUEST.
+ */
xen_rmb();
+ RING_COPY_REQUEST(&xneti->xni_txring,
+ xneti->xni_txring.req_cons,
+ &txreq);
XENPRINTF(("%s pkt size %d\n", xneti->xni_if.if_xname,
txreq.size));
- req_cons++;
+ xneti->xni_txring.req_cons++;
if (__predict_false(discard)) {
/* interface not up, drop all requests */
if_statinc(ifp, if_iqdrops);
@@ -875,7 +872,7 @@ mbuf_fail:
*/
int cnt;
m0_len =3D xennetback_tx_m0len_fragment(xneti,
- txreq.size, req_cons, &cnt);
+ txreq.size, xneti->xni_txring.req_cons, &cnt);
m->m_len =3D m0_len;
KASSERT(cnt <=3D XEN_NETIF_NR_SLOTS_MIN);
=20
@@ -941,7 +938,8 @@ mbuf_fail:
=20
XENPRINTF(("%s pkt offset %d size %d id %d req_cons %d\n",
xneti->xni_if.if_xname, txreq.offset,
- txreq.size, txreq.id, req_cons & (RING_SIZE(&xneti->xni_txring) - 1)=
));
+ txreq.size, txreq.id,
+ xneti->xni_txring.req_cons & (RING_SIZE(&xneti->xni_txring) - 1)));
=20
xst =3D &xneti->xni_xstate[queued];
xst->xs_m =3D (m0 =3D=3D NULL || m =3D=3D m0) ? m : NULL;
@@ -962,6 +960,9 @@ mbuf_fail:
queued =3D 0;
}
}
+ RING_FINAL_CHECK_FOR_REQUESTS(&xneti->xni_txring, receive_pending);
+ if (receive_pending)
+ goto again;
if (m0) {
/* Queue empty, and still unfinished multi-fragment request */
printf("%s: dropped unfinished multi-fragment\n",
@@ -1021,14 +1022,12 @@ xennetback_rx_copy_process(struct ifnet *ifp, struc=
t xnetback_instance *xneti,
}
=20
/* update pointer */
- xen_rmb();
xneti->xni_rxring.req_cons +=3D queued;
xneti->xni_rxring.rsp_prod_pvt +=3D queued;
RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&xneti->xni_rxring, notify);
=20
/* send event */
if (notify) {
- xen_rmb();
XENPRINTF(("%s receive event\n",
xneti->xni_if.if_xname));
hypervisor_notify_via_evtchn(xneti->xni_evtchn);
--=_8xiQiGqtxVZ2LhiPkIgdNm2J38qsNMwx
Content-Type: text/plain; charset="ISO-8859-1"; name="xvifmembars"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename="xvifmembars.patch"
From 54f5ee6f683141090d9c533a9c0a5ce2ffb1286e Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh@NetBSD.org>
Date: Fri, 4 Aug 2023 19:40:06 +0000
Subject: [PATCH 1/8] xvif(4): Comment on memory barriers in
xennetback_evthandler.
Note which ones appear unnecessary and which ones appear too strong,
but don't change them.
No functional change intended.
---
sys/arch/xen/xen/xennetback_xenbus.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/sys/arch/xen/xen/xennetback_xenbus.c b/sys/arch/xen/xen/xennet=
back_xenbus.c
index dac32a00d8db..beaa73ed5bb1 100644
--- a/sys/arch/xen/xen/xennetback_xenbus.c
+++ b/sys/arch/xen/xen/xennetback_xenbus.c
@@ -814,15 +814,28 @@ xennetback_evthandler(void *arg)
XENPRINTF(("xennetback_evthandler "));
req_cons =3D xneti->xni_txring.req_cons;
while (1) {
+ /*
+ * XXX The xen_rmb here and comment make no sense:
+ * xneti->xni_txring.req_cons is a private variable.
+ */
xen_rmb(); /* be sure to read the request before updating */
xneti->xni_txring.req_cons =3D req_cons;
+ /* XXX Unclear what this xen_wmb is for. */
xen_wmb();
+ /*
+ * XXX RING_FINAL_CHECK_FOR_REQUESTS issues the most
+ * expensive memory barrier, xen_mb. This should be
+ * used only at the end of the loop after we updating
+ * the producer with the last index of the requests we
+ * consumed in the queue.
+ */
RING_FINAL_CHECK_FOR_REQUESTS(&xneti->xni_txring,
receive_pending);
if (receive_pending =3D=3D 0)
break;
RING_COPY_REQUEST(&xneti->xni_txring, req_cons,
&txreq);
+ /* XXX Unclear what this xen_rmb is for. */
xen_rmb();
XENPRINTF(("%s pkt size %d\n", xneti->xni_if.if_xname,
txreq.size));
From ac1bce0e69957aefe966fac39c8afb13432a6774 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh@NetBSD.org>
Date: Fri, 4 Aug 2023 19:41:16 +0000
Subject: [PATCH 2/8] xvif(4): Add missing xen_rmb in xennetback_evthandler.
---
sys/arch/xen/xen/xennetback_xenbus.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/sys/arch/xen/xen/xennetback_xenbus.c b/sys/arch/xen/xen/xennet=
back_xenbus.c
index beaa73ed5bb1..3b0104ba094e 100644
--- a/sys/arch/xen/xen/xennetback_xenbus.c
+++ b/sys/arch/xen/xen/xennetback_xenbus.c
@@ -833,6 +833,13 @@ xennetback_evthandler(void *arg)
receive_pending);
if (receive_pending =3D=3D 0)
break;
+ /*
+ * Ensure we have read the producer's queue index in
+ * RING_FINAL_CHECK_FOR_REQUESTS before we read the
+ * content of the producer's next request in
+ * RING_COPY_REQUEST.
+ */
+ xen_rmb();
RING_COPY_REQUEST(&xneti->xni_txring, req_cons,
&txreq);
/* XXX Unclear what this xen_rmb is for. */
From 122583462eeaf947293b6fb8443d5ec2bd8db3cd Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh@NetBSD.org>
Date: Fri, 4 Aug 2023 19:46:39 +0000
Subject: [PATCH 3/8] xvif(4): Omit local variable aliasing
xneti->xni_txring.req_cons.
No functional change intended.
---
sys/arch/xen/xen/xennetback_xenbus.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/sys/arch/xen/xen/xennetback_xenbus.c b/sys/arch/xen/xen/xennet=
back_xenbus.c
index 3b0104ba094e..ff53e5d1ad46 100644
--- a/sys/arch/xen/xen/xennetback_xenbus.c
+++ b/sys/arch/xen/xen/xennetback_xenbus.c
@@ -805,21 +805,18 @@ xennetback_evthandler(void *arg)
netif_tx_request_t txreq;
struct mbuf *m, *m0 =3D NULL, *mlast =3D NULL;
int receive_pending;
- RING_IDX req_cons;
int queued =3D 0, m0_len =3D 0;
struct xnetback_xstate *xst;
const bool discard =3D ((ifp->if_flags & (IFF_UP | IFF_RUNNING)) !=3D
(IFF_UP | IFF_RUNNING));
=20
XENPRINTF(("xennetback_evthandler "));
- req_cons =3D xneti->xni_txring.req_cons;
while (1) {
/*
* XXX The xen_rmb here and comment make no sense:
* xneti->xni_txring.req_cons is a private variable.
*/
xen_rmb(); /* be sure to read the request before updating */
- xneti->xni_txring.req_cons =3D req_cons;
/* XXX Unclear what this xen_wmb is for. */
xen_wmb();
/*
@@ -840,13 +837,14 @@ xennetback_evthandler(void *arg)
* RING_COPY_REQUEST.
*/
xen_rmb();
- RING_COPY_REQUEST(&xneti->xni_txring, req_cons,
+ RING_COPY_REQUEST(&xneti->xni_txring,
+ xneti->xni_txring.req_cons,
&txreq);
/* XXX Unclear what this xen_rmb is for. */
xen_rmb();
XENPRINTF(("%s pkt size %d\n", xneti->xni_if.if_xname,
txreq.size));
- req_cons++;
+ xneti->xni_txring.req_cons++;
if (__predict_false(discard)) {
/* interface not up, drop all requests */
if_statinc(ifp, if_iqdrops);
@@ -895,7 +893,7 @@ mbuf_fail:
*/
int cnt;
m0_len =3D xennetback_tx_m0len_fragment(xneti,
- txreq.size, req_cons, &cnt);
+ txreq.size, xneti->xni_txring.req_cons, &cnt);
m->m_len =3D m0_len;
KASSERT(cnt <=3D XEN_NETIF_NR_SLOTS_MIN);
=20
@@ -961,7 +959,8 @@ mbuf_fail:
=20
XENPRINTF(("%s pkt offset %d size %d id %d req_cons %d\n",
xneti->xni_if.if_xname, txreq.offset,
- txreq.size, txreq.id, req_cons & (RING_SIZE(&xneti->xni_txring) - 1)=
));
+ txreq.size, txreq.id,
+ xneti->xni_txring.req_cons & (RING_SIZE(&xneti->xni_txring) - 1)));
=20
xst =3D &xneti->xni_xstate[queued];
xst->xs_m =3D (m0 =3D=3D NULL || m =3D=3D m0) ? m : NULL;
From b009b3e3f0b9346afbd6b5ce2efd2f23b6d36c15 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh@NetBSD.org>
Date: Fri, 4 Aug 2023 19:55:02 +0000
Subject: [PATCH 4/8] xvif(4): Move expensive xen_mb out of
xennetback_evthandler loop.
Use the cheaper RING_HAS_UNCONFIRMED_REQUESTS for most of the loop.
This should improve throughput without any impact on correctness.
---
sys/arch/xen/xen/xennetback_xenbus.c | 15 +++++----------
1 file changed, 5 insertions(+), 10 deletions(-)
diff --git a/sys/arch/xen/xen/xennetback_xenbus.c b/sys/arch/xen/xen/xennet=
back_xenbus.c
index ff53e5d1ad46..a2f375ac4e2e 100644
--- a/sys/arch/xen/xen/xennetback_xenbus.c
+++ b/sys/arch/xen/xen/xennetback_xenbus.c
@@ -811,6 +811,7 @@ xennetback_evthandler(void *arg)
(IFF_UP | IFF_RUNNING));
=20
XENPRINTF(("xennetback_evthandler "));
+again:
while (1) {
/*
* XXX The xen_rmb here and comment make no sense:
@@ -819,16 +820,7 @@ xennetback_evthandler(void *arg)
xen_rmb(); /* be sure to read the request before updating */
/* XXX Unclear what this xen_wmb is for. */
xen_wmb();
- /*
- * XXX RING_FINAL_CHECK_FOR_REQUESTS issues the most
- * expensive memory barrier, xen_mb. This should be
- * used only at the end of the loop after we updating
- * the producer with the last index of the requests we
- * consumed in the queue.
- */
- RING_FINAL_CHECK_FOR_REQUESTS(&xneti->xni_txring,
- receive_pending);
- if (receive_pending =3D=3D 0)
+ if (!RING_HAS_UNCONSUMED_REQUESTS(&xneti->xni_txring))
break;
/*
* Ensure we have read the producer's queue index in
@@ -981,6 +973,9 @@ mbuf_fail:
queued =3D 0;
}
}
+ RING_FINAL_CHECK_FOR_REQUESTS(&xneti->xni_txring, receive_pending);
+ if (receive_pending)
+ goto again;
if (m0) {
/* Queue empty, and still unfinished multi-fragment request */
printf("%s: dropped unfinished multi-fragment\n",
From ce882c7c51fc0c88e225e9b3814c10cfaf996e78 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh@NetBSD.org>
Date: Fri, 4 Aug 2023 19:56:50 +0000
Subject: [PATCH 5/8] xvif(4): Omit needless membars in xennetback_evthandle=
r.
This should improve throughput without any impact on correctness.
---
sys/arch/xen/xen/xennetback_xenbus.c | 9 ---------
1 file changed, 9 deletions(-)
diff --git a/sys/arch/xen/xen/xennetback_xenbus.c b/sys/arch/xen/xen/xennet=
back_xenbus.c
index a2f375ac4e2e..20887b2c2dc0 100644
--- a/sys/arch/xen/xen/xennetback_xenbus.c
+++ b/sys/arch/xen/xen/xennetback_xenbus.c
@@ -813,13 +813,6 @@ xennetback_evthandler(void *arg)
XENPRINTF(("xennetback_evthandler "));
again:
while (1) {
- /*
- * XXX The xen_rmb here and comment make no sense:
- * xneti->xni_txring.req_cons is a private variable.
- */
- xen_rmb(); /* be sure to read the request before updating */
- /* XXX Unclear what this xen_wmb is for. */
- xen_wmb();
if (!RING_HAS_UNCONSUMED_REQUESTS(&xneti->xni_txring))
break;
/*
@@ -832,8 +825,6 @@ again:
RING_COPY_REQUEST(&xneti->xni_txring,
xneti->xni_txring.req_cons,
&txreq);
- /* XXX Unclear what this xen_rmb is for. */
- xen_rmb();
XENPRINTF(("%s pkt size %d\n", xneti->xni_if.if_xname,
txreq.size));
xneti->xni_txring.req_cons++;
From 7b450b9e18a4c874dffb9fb36f938771b201b8b6 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh@NetBSD.org>
Date: Fri, 4 Aug 2023 19:58:50 +0000
Subject: [PATCH 6/8] xvif(4): Simplify while loop in xennetback_evthandler.
No functional change intended.
---
sys/arch/xen/xen/xennetback_xenbus.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/sys/arch/xen/xen/xennetback_xenbus.c b/sys/arch/xen/xen/xennet=
back_xenbus.c
index 20887b2c2dc0..d07153ea4e62 100644
--- a/sys/arch/xen/xen/xennetback_xenbus.c
+++ b/sys/arch/xen/xen/xennetback_xenbus.c
@@ -812,9 +812,7 @@ xennetback_evthandler(void *arg)
=20
XENPRINTF(("xennetback_evthandler "));
again:
- while (1) {
- if (!RING_HAS_UNCONSUMED_REQUESTS(&xneti->xni_txring))
- break;
+ while (RING_HAS_UNCONSUMED_REQUESTS(&xneti->xni_txring)) {
/*
* Ensure we have read the producer's queue index in
* RING_FINAL_CHECK_FOR_REQUESTS before we read the
From 351d5093721ae68925ba8abb7bc305680ecc5d28 Mon Sep 17 00:00:00 2001
From: riastradh <riastradh@NetBSD.org>
Date: Sat, 25 Feb 2023 00:34:25 +0000
Subject: [PATCH 7/8] xvif(4): Omit needless membars in
xennetback_rx_copy_process.
- No need for barrier around touching req_cons and rsp_prod_pvt,
which are private.
- RING_PUSH_RESPONSES_AND_CHECK_NOTIFY updates the shared req_prod and
then issues xen_mb, which is all that we need between the update of
shared req_prod and hypervisor_notify_via_evtchn.
(Between updating the shared req_prod and issuing
hypervisor_notify_via_evtchn, only xen_wmb is needed. But after
writing to the shared req_prod, RING_PUSH_REQUESTS_AND_CHECK_NOTIFY
must also read from the shared rsp_event, which requires the
store-before-load ordering that only xen_mb provides.)
---
sys/arch/xen/xen/xennetback_xenbus.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/sys/arch/xen/xen/xennetback_xenbus.c b/sys/arch/xen/xen/xennet=
back_xenbus.c
index d07153ea4e62..f873b0d1cdc7 100644
--- a/sys/arch/xen/xen/xennetback_xenbus.c
+++ b/sys/arch/xen/xen/xennetback_xenbus.c
@@ -1024,14 +1024,12 @@ xennetback_rx_copy_process(struct ifnet *ifp, struc=
t xnetback_instance *xneti,
}
=20
/* update pointer */
- xen_rmb();
xneti->xni_rxring.req_cons +=3D queued;
xneti->xni_rxring.rsp_prod_pvt +=3D queued;
RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&xneti->xni_rxring, notify);
=20
/* send event */
if (notify) {
- xen_rmb();
XENPRINTF(("%s receive event\n",
xneti->xni_if.if_xname));
hypervisor_notify_via_evtchn(xneti->xni_evtchn);
From 0e95d0a87aa7109b82a565d88e87ad22f66b60a0 Mon Sep 17 00:00:00 2001
From: riastradh <riastradh@NetBSD.org>
Date: Sat, 25 Feb 2023 00:34:36 +0000
Subject: [PATCH 8/8] xvif(4): Omit needless membars in xennetback_connect.
xneti is a private data structure to which we have exclusive access
here; ordering the stores doesn't make sense.
---
sys/arch/xen/xen/xennetback_xenbus.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/sys/arch/xen/xen/xennetback_xenbus.c b/sys/arch/xen/xen/xennet=
back_xenbus.c
index f873b0d1cdc7..559230017ff5 100644
--- a/sys/arch/xen/xen/xennetback_xenbus.c
+++ b/sys/arch/xen/xen/xennetback_xenbus.c
@@ -497,9 +497,7 @@ xennetback_connect(struct xnetback_instance *xneti)
goto err2;
}
xneti->xni_evtchn =3D evop.u.bind_interdomain.local_port;
- xen_wmb();
xneti->xni_status =3D CONNECTED;
- xen_wmb();
=20
xneti->xni_ih =3D xen_intr_establish_xname(-1, &xen_pic,
xneti->xni_evtchn, IST_LEVEL, IPL_NET, xennetback_evthandler,
--=_8xiQiGqtxVZ2LhiPkIgdNm2J38qsNMwx--
From: Brad Spencer <brad@anduin.eldar.org>
To: gnats-bugs@netbsd.org
Cc: port-xen-maintainer@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: kern/57560: xennetback_xenbus.c rev 1.109 broke guests, causes
DOM0 reboot
Date: Fri, 04 Aug 2023 18:48:15 -0400
Taylor R Campbell <riastradh@NetBSD.org> writes:
> The following reply was made to PR port-xen/57560; it has been noted by GNATS.
>
> From: Taylor R Campbell <riastradh@NetBSD.org>
> To: Brad Spencer <brad@anduin.eldar.org>
> Cc: gnats-bugs@NetBSD.org, martin@NetBSD.org
> Subject: Re: kern/57560: xennetback_xenbus.c rev 1.109 broke guests, causes DOM0 reboot
> Date: Fri, 4 Aug 2023 20:19:33 +0000
>
> I reverted the commits on HEAD and pulled the revert up to netbsd-10,
> so the immediate problem should be addressed.
>
> However, the code is also slightly problematic because it's still
> missing a barrier in one place, and still using a much-too-expensive
> barrier (even on x86!) in the loop in another place, and still has
> several nonsensical barriers in other places that can't hurt
> correctness but do hurt understandability and maintainability.
>
> So I went through to try to redo the changes more carefully. If you
> have the opportunity to test or bisect on another patch, I've drafted
> a series of changes, attached:
>
> - The .diff file is a single diff of the whole changes. If it works,
> great -- I can commit. But if not:
>
> - The .patch file is a series of individual changes for bisection to
> see exactly which one breaks it.
>
I can help test this, but I am really out of time in the next 24 - 48
hours at this point. I expect that I may be able to do something in the
next week at some point, just don't know when.
Thanks very much for looking into this.
--
Brad Spencer - brad@anduin.eldar.org - KC8VKS - http://anduin.eldar.org
From: Martin Husemann <martin@duskware.de>
To: Taylor R Campbell <riastradh@NetBSD.org>
Cc: Brad Spencer <brad@anduin.eldar.org>, gnats-bugs@NetBSD.org,
martin@NetBSD.org
Subject: Re: kern/57560: xennetback_xenbus.c rev 1.109 broke guests, causes
DOM0 reboot
Date: Sat, 5 Aug 2023 15:26:30 +0200
I verified I could reproduce the issue in my setup with a -current kernel
from yesterday without the sys/arch/xen/xen/xennetback_xenbus.c backouts.
As described here, as soon as the domU starts dhcpcd the dom0 locked
up hard - no breaking into ddb possible, no console output.
Then I updated to -current as-is right now (with the xennetback_xenbus.c
changes backed out) and everything worked fine.
This was only to verify I'm on the same base as Brad.
Then I applied the xvifmembars.diff change, booted the new dom0 kernel
and verified domUs can use the network just fine.
So: with the "diff" variant everything seems to be fine.
Martin
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/57560: xennetback_xenbus.c rev 1.109 broke guests, causes
DOM0 reboot
Date: Sat, 5 Aug 2023 15:51:43 +0200
Taylor suggested I try the broken kernel again, but with options HEARTBEAT
enabled - and that got me:
[ 178.0039847] panic: cpu0: softints stuck for 16 seconds
[ 178.0039847] cpu0: Begin traceback...
[ 178.0039847] vpanic() at netbsd:vpanic+0x163
[ 178.0039847] panic() at netbsd:panic+0x3c
[ 178.0039847] heartbeat() at netbsd:heartbeat+0x29f
[ 178.0039847] hardclock() at netbsd:hardclock+0x7d
[ 178.0039847] xen_timer_handler() at netbsd:xen_timer_handler+0x99
[ 178.0039847] evtchn_do_event() at netbsd:evtchn_do_event+0x118
[ 178.0039847] do_hypervisor_callback() at netbsd:do_hypervisor_callback+0x167
[ 178.0039847] Xhandle_hypervisor_callback() at netbsd:Xhandle_hypervisor_callback+0x1a
[ 178.0039847] --- interrupt ---
[ 178.0039847] _kernel_lock() at netbsd:_kernel_lock+0xec
[ 178.0039847] xen_intr_biglock_wrapper() at netbsd:xen_intr_biglock_wrapper+0x14
[ 178.0039847] evtchn_do_event() at netbsd:evtchn_do_event+0x118
[ 178.0039847] do_hypervisor_callback() at netbsd:do_hypervisor_callback+0x167
[ 178.0039847] Xhandle_hypervisor_callback() at netbsd:Xhandle_hypervisor_callback+0x1a
[ 178.0039847] --- interrupt ---
[ 178.0039847] hypercall_page() at netbsd:hypercall_page+0x3aa
[ 178.0039847] idle_loop() at netbsd:idle_loop+0x146
[ 178.0039847] cpu0: End traceback...
[ 178.0039847] fatal breakpoint trap in supervisor mode
[ 178.0039847] trap type 1 code 0 rip 0xffffffff8024196d cs 0xe030 rflags 0x202 cr2 0x7f1eece9f58a ilevel 0x7 rsp 0xffff98004103e888
[ 178.0039847] curlwp 0xffff980000ff7040 pid 0.2 lowest kstack 0xffff98004103a2c0
Stopped in pid 0.2 (system) at netbsd:breakpoint+0x5: leave
breakpoint() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x163
panic() at netbsd:panic+0x3c
heartbeat() at netbsd:heartbeat+0x29f
hardclock() at netbsd:hardclock+0x7d
xen_timer_handler() at netbsd:xen_timer_handler+0x99
evtchn_do_event() at netbsd:evtchn_do_event+0x118
do_hypervisor_callback() at netbsd:do_hypervisor_callback+0x167
Xhandle_hypervisor_callback() at netbsd:Xhandle_hypervisor_callback+0x1a
--- interrupt ---
_kernel_lock() at netbsd:_kernel_lock+0xec
xen_intr_biglock_wrapper() at netbsd:xen_intr_biglock_wrapper+0x14
evtchn_do_event() at netbsd:evtchn_do_event+0x118
do_hypervisor_callback() at netbsd:do_hypervisor_callback+0x167
Xhandle_hypervisor_callback() at netbsd:Xhandle_hypervisor_callback+0x1a
--- interrupt ---
hypercall_page() at netbsd:hypercall_page+0x3aa
idle_loop() at netbsd:idle_loop+0x146
ds 0
es 766a
fs e838
gs e888
rdi 7
rsi 1
rbp ffff98004103e888
rbx 44c5
rdx 1
rcx ffffffffffffff
rax 800000004200000
r8 700000004200000
r9 700000000000000
r10 0
r11 fffffffe
r12 ffffffff80f797b8 ostype+0xbc2b0
r13 ffff98004103e8d0
r14 104
r15 b2
rip ffffffff8024196d breakpoint+0x5
cs e030
rflags 202
rsp ffff98004103e888
ss e02b
netbsd:breakpoint+0x5: leave
Martin
From: Brad Spencer <brad@anduin.eldar.org>
To: Martin Husemann <martin@duskware.de>
Cc: riastradh@NetBSD.org, gnats-bugs@NetBSD.org, martin@NetBSD.org
Subject: Re: kern/57560: xennetback_xenbus.c rev 1.109 broke guests, causes
DOM0 reboot
Date: Sat, 05 Aug 2023 09:59:18 -0400
Martin Husemann <martin@duskware.de> writes:
> I verified I could reproduce the issue in my setup with a -current kernel
> from yesterday without the sys/arch/xen/xen/xennetback_xenbus.c backouts.
> As described here, as soon as the domU starts dhcpcd the dom0 locked
> up hard - no breaking into ddb possible, no console output.
For me, the system actually will reboot after a short bit after it hangs
up, but the lockup is very tight in the mean time.
> Then I updated to -current as-is right now (with the xennetback_xenbus.c
> changes backed out) and everything worked fine.
>
> This was only to verify I'm on the same base as Brad.
Thank you very much for the verify in this matter.
> Then I applied the xvifmembars.diff change, booted the new dom0 kernel
> and verified domUs can use the network just fine.
That is also good to hear. It will probably be a bit before I can test
for myself, but that is very encouraging.
> So: with the "diff" variant everything seems to be fine.
>
> Martin
--
Brad Spencer - brad@anduin.eldar.org - KC8VKS - http://anduin.eldar.org
From: Taylor R Campbell <riastradh@NetBSD.org>
To: Brad Spencer <brad@anduin.eldar.org>
Cc: gnats-bugs@NetBSD.org, martin@NetBSD.org
Subject: Re: kern/57560: xennetback_xenbus.c rev 1.109 broke guests, causes DOM0 reboot
Date: Wed, 9 Aug 2023 08:43:00 +0000
FYI, I committed the patch series to HEAD. Will give it a little bit
of time before pulling up to 10 (or 9 or 8). Let me know if anything
goes wrong in HEAD!
From: Brad Spencer <brad@anduin.eldar.org>
To: gnats-bugs@netbsd.org
Cc: port-xen-maintainer@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: kern/57560: xennetback_xenbus.c rev 1.109 broke guests, causes
DOM0 reboot
Date: Wed, 09 Aug 2023 14:23:38 -0400
Taylor R Campbell <riastradh@NetBSD.org> writes:
> The following reply was made to PR port-xen/57560; it has been noted by GNATS.
>
> From: Taylor R Campbell <riastradh@NetBSD.org>
> To: Brad Spencer <brad@anduin.eldar.org>
> Cc: gnats-bugs@NetBSD.org, martin@NetBSD.org
> Subject: Re: kern/57560: xennetback_xenbus.c rev 1.109 broke guests, causes DOM0 reboot
> Date: Wed, 9 Aug 2023 08:43:00 +0000
>
> FYI, I committed the patch series to HEAD. Will give it a little bit
> of time before pulling up to 10 (or 9 or 8). Let me know if anything
> goes wrong in HEAD!
>
I totally failed to mention this (sorry), but I have been running with
the patches in 10.0_BETA for a few days and so far everything appears to
be fine. I very much suspect that HEAD is ok too.
I have no ability to test anything before 10.0, so if it is pulled up to
9.x someone else will have to verify it. For the short time I ran 9.x
on the DOM0, it worked, but that didn't last long because I needed other
features that didn't exist in 9.x.
--
Brad Spencer - brad@anduin.eldar.org - KC8VKS - http://anduin.eldar.org
(Contact us)
$NetBSD: query-full-pr,v 1.47 2022/09/11 19:34:41 kim Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2023
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.