NetBSD Problem Report #52211
From hannken@eis.cs.tu-bs.de Wed May 3 08:35:35 2017
Return-Path: <hannken@eis.cs.tu-bs.de>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
by mollari.NetBSD.org (Postfix) with ESMTPS id 4EECD7A1F7
for <gnats-bugs@gnats.NetBSD.org>; Wed, 3 May 2017 08:35:35 +0000 (UTC)
Message-Id: <20170503083530.292297538B@builder.isf.cs.tu-bs.de>
Date: Wed, 3 May 2017 10:35:30 +0200 (MEST)
From: hannken@eis.cs.tu-bs.de
Reply-To: hannken@eis.cs.tu-bs.de
To: gnats-bugs@NetBSD.org
Subject: vioif stops on dmamap load error
X-Send-Pr-Version: 3.95
>Number: 52211
>Category: kern
>Synopsis: vioif stops on dmamap load error
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: jdolecek
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed May 03 08:40:00 +0000 2017
>Closed-Date: Wed Jul 05 07:52:37 +0000 2017
>Last-Modified: Wed Jul 05 07:52:37 +0000 2017
>Originator: Juergen Hannken-Illjes
>Release: NetBSD 7.1
>Organization:
>Environment:
System: NetBSD vpnserv.isf.cs.tu-bs.de 7.1 NetBSD 7.1 (gateway.i386) #0: Mon Mar 13 16:40:12 MET 2017 build@builder.isf.cs.tu-bs.de:/build/nbsd7/obj/obj.i386/sys/arch/i386/compile/gateway.i386 i386
Architecture: i386
Machine: i386
>Description:
From time to time the machine prints
vioif0: tx dmamap load failed, error code 27
and most times the interface seems to stop as the machine is no longer
accessible from the network.
$NetBSD: if_vioif.c,v 1.7.2.3 2016/12/23 05:57:40 snj Exp $
>How-To-Repeat:
>Fix:
>Release-Note:
>Audit-Trail:
From: Masanobu SAITOH <msaitoh@execsw.org>
To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc: msaitoh@execsw.org
Subject: Re: kern/52211: vioif stops on dmamap load error
Date: Thu, 11 May 2017 17:06:21 +0900
On 2017/05/03 17:40, hannken@eis.cs.tu-bs.de wrote:
>> Number: 52211
>> Category: kern
>> Synopsis: vioif stops on dmamap load error
>> Confidential: no
>> Severity: serious
>> Priority: medium
>> Responsible: kern-bug-people
>> State: open
>> Class: sw-bug
>> Submitter-Id: net
>> Arrival-Date: Wed May 03 08:40:00 +0000 2017
>> Originator: Juergen Hannken-Illjes
>> Release: NetBSD 7.1
>> Organization:
>
>> Environment:
>
>
> System: NetBSD vpnserv.isf.cs.tu-bs.de 7.1 NetBSD 7.1 (gateway.i386) #0: Mon Mar 13 16:40:12 MET 2017 build@builder.isf.cs.tu-bs.de:/build/nbsd7/obj/obj.i386/sys/arch/i386/compile/gateway.i386 i386
> Architecture: i386
> Machine: i386
>> Description:
>
> From time to time the machine prints
>
> vioif0: tx dmamap load failed, error code 27
27 is EFBIG.
In if_vioif.c::vioif_start():
> r = bus_dmamap_load_mbuf(virtio_dmat(vsc),
> sc->sc_tx_dmamaps[slot],
> m, BUS_DMA_WRITE|BUS_DMA_NOWAIT);
> if (r != 0) {
> virtio_enqueue_abort(vsc, vq, slot);
> aprint_error_dev(sc->sc_dev,
> "tx dmamap load failed, error code %d\n", r);
> break;
> }
ixg(4), rtwn(4) and vge(4) have code that calls m_defrag() when
bus_dmamap_load_mbuf() returns EFBIG. On ixg(4) it really occurs,
and the recovery works fine, if an interface's TSO flag is set.
Could you show me the output of ifconfig vioif0?
vioif has no TSO function. It also doesn't have JUMBO_MTU,
so I'm afraid there is another bug somewhere else.
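The recovery pattern described above (compact the chain when the load reports EFBIG, then load once more) can be sketched in userland C. This is only a model under stated assumptions: `struct frag`, `model_load()`, `model_defrag()` and the limit of 16 segments are hypothetical stand-ins, not the kernel's mbuf, bus_dmamap_load_mbuf() or m_defrag() interfaces, and payload bytes are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define MODEL_TX_MAXNSEGS 16	/* illustrative segment limit, an assumption */

/* Hypothetical stand-in for an mbuf; payload bytes are omitted. */
struct frag {
	struct frag *next;
	size_t len;
};

void free_chain(struct frag *f)
{
	while (f != NULL) {
		struct frag *next = f->next;
		free(f);
		f = next;
	}
}

/* Build a chain of n one-byte fragments; NULL on allocation failure. */
struct frag *make_chain(int n)
{
	struct frag *head = NULL;
	for (int i = 0; i < n; i++) {
		struct frag *f = malloc(sizeof(*f));
		if (f == NULL) {
			free_chain(head);
			return NULL;
		}
		f->len = 1;
		f->next = head;
		head = f;
	}
	return head;
}

int chain_nsegs(const struct frag *f)
{
	int n = 0;
	for (; f != NULL; f = f->next)
		n++;
	return n;
}

/* Stand-in for the DMA map load: rejects chains with too many segments. */
int model_load(const struct frag *f)
{
	return chain_nsegs(f) > MODEL_TX_MAXNSEGS ? EFBIG : 0;
}

/*
 * Stand-in for defragmentation: merge the chain into a single fragment.
 * On failure returns NULL and leaves the original chain untouched.
 */
struct frag *model_defrag(struct frag *f)
{
	struct frag *n = malloc(sizeof(*n));
	if (n == NULL)
		return NULL;
	n->len = 0;
	n->next = NULL;
	for (const struct frag *p = f; p != NULL; p = p->next)
		n->len += p->len;
	free_chain(f);
	return n;
}

/* Load the chain; on EFBIG, compact it once and try again. */
int load_with_defrag(struct frag **fp)
{
	assert(fp != NULL && *fp != NULL);
	int r = model_load(*fp);
	if (r == EFBIG) {
		struct frag *n = model_defrag(*fp);
		if (n == NULL)
			return ENOBUFS;
		*fp = n;
		r = model_load(*fp);
	}
	return r;
}
```

In this model a 40-fragment chain fails the first load, is merged into one fragment, and then loads; a short chain is never touched.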
> and most times the interface seems to stop as the machine is no longer
> accessible from the network.
>
> $NetBSD: if_vioif.c,v 1.7.2.3 2016/12/23 05:57:40 snj Exp $
>> How-To-Repeat:
>
>> Fix:
>
>
>> Unformatted:
--
-----------------------------------------------
SAITOH Masanobu (msaitoh@execsw.org
msaitoh@netbsd.org)
From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/52211: vioif stops on dmamap load error
Date: Thu, 11 May 2017 10:18:09 +0200
> On 11. May 2017, at 10:10, Masanobu SAITOH <msaitoh@execsw.org> wrote:
>
> Could you show me the output of ifconfig vioif0?
vioif0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
address: XX:XX:XX:XX:XX:XX
inet XXX.XXX.XXX.XXX netmask 0xffffff00 broadcast XXX.XXX.XXX.255
inet6 fe80::XXXX:XXXX:XXXX:XXXX%vioif0 prefixlen 64 scopeid 0x1
inet6 XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX prefixlen 64
--
J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)
From: Masanobu SAITOH <msaitoh@execsw.org>
To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, hannken@eis.cs.tu-bs.de
Cc: msaitoh@execsw.org
Subject: Re: kern/52211: vioif stops on dmamap load error
Date: Thu, 11 May 2017 17:33:19 +0900
On 2017/05/11 17:20, J. Hannken-Illjes wrote:
> The following reply was made to PR kern/52211; it has been noted by GNATS.
>
> From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
> To: gnats-bugs@NetBSD.org
> Cc:
> Subject: Re: kern/52211: vioif stops on dmamap load error
> Date: Thu, 11 May 2017 10:18:09 +0200
>
> > On 11. May 2017, at 10:10, Masanobu SAITOH <msaitoh@execsw.org> wrote:
> >
> > Could you show me the output of ifconfig vioif0?
>
> vioif0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
> address: XX:XX:XX:XX:XX:XX
> inet XXX.XXX.XXX.XXX netmask 0xffffff00 broadcast XXX.XXX.XXX.255
> inet6 fe80::XXXX:XXXX:XXXX:XXXX%vioif0 prefixlen 64 scopeid 0x1
> inet6 XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX prefixlen 64
Yes, it has neither TSO nor JUMBO_MTU.
If a chain of many mbufs is a normal case for vioif(4), it would
be worth trying m_defrag(), but I suspect it's not a normal case
and that it's caused by a bug in vioif or the upper layer.
> --
> J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)
>
>
--
-----------------------------------------------
SAITOH Masanobu (msaitoh@execsw.org
msaitoh@netbsd.org)
From: Masanobu SAITOH <msaitoh@execsw.org>
To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, hannken@eis.cs.tu-bs.de
Cc: msaitoh@execsw.org
Subject: Re: kern/52211: vioif stops on dmamap load error
Date: Thu, 11 May 2017 18:11:00 +0900
On 2017/05/11 17:33, Masanobu SAITOH wrote:
> On 2017/05/11 17:20, J. Hannken-Illjes wrote:
>> The following reply was made to PR kern/52211; it has been noted by GNATS.
>>
>> From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
>> To: gnats-bugs@NetBSD.org
>> Cc:
>> Subject: Re: kern/52211: vioif stops on dmamap load error
>> Date: Thu, 11 May 2017 10:18:09 +0200
>>
>> > On 11. May 2017, at 10:10, Masanobu SAITOH <msaitoh@execsw.org> wrote:
>> >
>> > Could you show me the output of ifconfig vioif0?
>> vioif0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
>> address: XX:XX:XX:XX:XX:XX
>> inet XXX.XXX.XXX.XXX netmask 0xffffff00 broadcast XXX.XXX.XXX.255
>> inet6 fe80::XXXX:XXXX:XXXX:XXXX%vioif0 prefixlen 64 scopeid 0x1
>> inet6 XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX prefixlen 64
>
> Yes, it has neither TSO nor JUMBO_MTU.
>
> If a chain of many mbufs is a normal case for vioif(4), it would
> be worth trying m_defrag(), but I suspect it's not a normal case
> and that it's caused by a bug in vioif or the upper layer.
>
>> --
>> J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)
>>
Could you test with the following diff?
Index: if_vioif.c
===================================================================
RCS file: /cvsroot/src/sys/dev/pci/if_vioif.c,v
retrieving revision 1.34
diff -u -p -r1.34 if_vioif.c
--- if_vioif.c 28 Mar 2017 04:10:33 -0000 1.34
+++ if_vioif.c 11 May 2017 09:10:36 -0000
@@ -833,7 +833,21 @@ retry:
r = bus_dmamap_load_mbuf(virtio_dmat(vsc),
sc->sc_tx_dmamaps[slot],
m, BUS_DMA_WRITE|BUS_DMA_NOWAIT);
- if (r != 0) {
+ switch (r) {
+ case 0:
+ break;
+ case EFBIG:
+ printf("%s: loadup_mbuf() returned EFBIG (%d segs)\n",
+ device_xname(sc->sc_dev),
+ sc->sc_tx_dmamaps[slot]->dm_nsegs);
+ if ((m_defrag(m, M_NOWAIT) == 0) &&
+ (bus_dmamap_load_mbuf(virtio_dmat(vsc),
+ sc->sc_tx_dmamaps[slot],
+ m, BUS_DMA_WRITE|BUS_DMA_NOWAIT) == 0))
+ break;
+
+ /* FALLTHROUGH */
+ default:
virtio_enqueue_abort(vsc, vq, slot);
aprint_error_dev(sc->sc_dev,
"tx dmamap load failed, error code %d\n", r);
--
-----------------------------------------------
SAITOH Masanobu (msaitoh@execsw.org
msaitoh@netbsd.org)
From: Masanobu SAITOH <msaitoh@execsw.org>
To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, hannken@eis.cs.tu-bs.de
Cc: msaitoh@execsw.org
Subject: Re: kern/52211: vioif stops on dmamap load error
Date: Thu, 11 May 2017 18:17:51 +0900
On 2017/05/11 18:11, Masanobu SAITOH wrote:
> On 2017/05/11 17:33, Masanobu SAITOH wrote:
>> On 2017/05/11 17:20, J. Hannken-Illjes wrote:
>>> The following reply was made to PR kern/52211; it has been noted by GNATS.
>>>
>>> From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
>>> To: gnats-bugs@NetBSD.org
>>> Cc:
>>> Subject: Re: kern/52211: vioif stops on dmamap load error
>>> Date: Thu, 11 May 2017 10:18:09 +0200
>>>
>>> > On 11. May 2017, at 10:10, Masanobu SAITOH <msaitoh@execsw.org> wrote:
>>> >
>>> > Could you show me the output of ifconfig vioif0?
>>> vioif0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
>>> address: XX:XX:XX:XX:XX:XX
>>> inet XXX.XXX.XXX.XXX netmask 0xffffff00 broadcast XXX.XXX.XXX.255
>>> inet6 fe80::XXXX:XXXX:XXXX:XXXX%vioif0 prefixlen 64 scopeid 0x1
>>> inet6 XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX prefixlen 64
>>
>> Yes, it has neither TSO nor JUMBO_MTU.
>>
>> If a chain of many mbufs is a normal case for vioif(4), it would
>> be worth trying m_defrag(), but I suspect it's not a normal case
>> and that it's caused by a bug in vioif or the upper layer.
>>
>>> --
>>> J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)
>>>
>
> Could you test with the following diff?
>
>
> Index: if_vioif.c
> ===================================================================
> RCS file: /cvsroot/src/sys/dev/pci/if_vioif.c,v
> retrieving revision 1.34
> diff -u -p -r1.34 if_vioif.c
> --- if_vioif.c 28 Mar 2017 04:10:33 -0000 1.34
> +++ if_vioif.c 11 May 2017 09:10:36 -0000
> @@ -833,7 +833,21 @@ retry:
> r = bus_dmamap_load_mbuf(virtio_dmat(vsc),
> sc->sc_tx_dmamaps[slot],
> m, BUS_DMA_WRITE|BUS_DMA_NOWAIT);
> - if (r != 0) {
> + switch (r) {
> + case 0:
> + break;
> + case EFBIG:
> + printf("%s: loadup_mbuf() returned EFBIG (%d segs)\n",
> + device_xname(sc->sc_dev),
> + sc->sc_tx_dmamaps[slot]->dm_nsegs);
> + if ((m_defrag(m, M_NOWAIT) == 0) &&
> + (bus_dmamap_load_mbuf(virtio_dmat(vsc),
> + sc->sc_tx_dmamaps[slot],
> + m, BUS_DMA_WRITE|BUS_DMA_NOWAIT) == 0))
> + break;
> +
> + /* FALLTHROUGH */
> + default:
> virtio_enqueue_abort(vsc, vq, slot);
> aprint_error_dev(sc->sc_dev,
> "tx dmamap load failed, error code %d\n", r);
>
>
Oops. The above patch is broken. Don't use it.
Please wait a little.
--
-----------------------------------------------
SAITOH Masanobu (msaitoh@execsw.org
msaitoh@netbsd.org)
From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/52211: vioif stops on dmamap load error
Date: Thu, 11 May 2017 12:02:15 +0200
> On 11. May 2017, at 11:26, Masanobu SAITOH <msaitoh@execsw.org> wrote:
>
> New one:
>
> Index: if_vioif.c
> ===================================================================
> RCS file: /cvsroot/src/sys/dev/pci/if_vioif.c,v
> retrieving revision 1.34
> diff -u -p -r1.34 if_vioif.c
> --- if_vioif.c 28 Mar 2017 04:10:33 -0000 1.34
> +++ if_vioif.c 11 May 2017 09:20:00 -0000
> @@ -812,6 +812,7 @@ vioif_start(struct ifnet *ifp)
> for (;;) {
> int slot, r;
> + struct mbuf *newm;
> IFQ_DEQUEUE(&ifp->if_snd, m);
> @@ -833,7 +834,23 @@ retry:
> r = bus_dmamap_load_mbuf(virtio_dmat(vsc),
> sc->sc_tx_dmamaps[slot],
> m, BUS_DMA_WRITE|BUS_DMA_NOWAIT);
> - if (r != 0) {
> + switch (r) {
> + case 0:
> + break;
> + case EFBIG:
> + printf("%s: loadup_mbuf() returned EFBIG (%d segs)\n",
> + device_xname(sc->sc_dev),
> + sc->sc_tx_dmamaps[slot]->dm_nsegs);
> + newm = m_defrag(m, M_NOWAIT);
> + if ((newm != NULL) &&
> + (bus_dmamap_load_mbuf(virtio_dmat(vsc),
> + sc->sc_tx_dmamaps[slot],
> + newm, BUS_DMA_WRITE|BUS_DMA_NOWAIT) == 0)) {
> + m = newm;
> + break;
> + }
> + /* FALLTHROUGH */
> + default:
> virtio_enqueue_abort(vsc, vq, slot);
> aprint_error_dev(sc->sc_dev,
> "tx dmamap load failed, error code %d\n", r);
The machine is running this patch now -- it may take weeks to trigger.
Note that on error we no longer break out of the outer loop: the "break"
after default now exits the switch and not the for loop. I hope that
does no harm.
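The control-flow point above is plain C behaviour and can be shown in isolation. In the sketch below, `visit_with_switch()` and `visit_with_if()` are hypothetical helpers and `results[]` stands for per-packet load results; nothing here is taken from the driver.

```c
#include <assert.h>
#include <stddef.h>

/*
 * results[] holds hypothetical per-packet load results (0 on success).
 * Both functions return how many entries the loop visited.
 */
int visit_with_switch(const int *results, int n)
{
	int visited = 0;

	assert(results != NULL || n == 0);
	for (int i = 0; i < n; i++) {
		visited++;
		switch (results[i]) {
		case 0:
			break;		/* leaves the switch */
		default:
			break;		/* also leaves only the switch */
		}
		/* control arrives here after either break; the loop goes on */
	}
	return visited;
}

int visit_with_if(const int *results, int n)
{
	int visited = 0;

	assert(results != NULL || n == 0);
	for (int i = 0; i < n; i++) {
		visited++;
		if (results[i] != 0)
			break;		/* leaves the for loop */
	}
	return visited;
}
```

With an error in the second of three entries, the switch version visits all three, while the if version stops after two.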
> BTW, what hypervisor are you using?
KVM from CentOS 6.8.
--
J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)
From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/52211: vioif stops on dmamap load error
Date: Thu, 11 May 2017 17:27:03 +0700
Date: Thu, 11 May 2017 08:20:01 +0000 (UTC)
From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
Message-ID: <20170511082001.546997A2AB@mollari.NetBSD.org>
| > Could you show me the output of ifconfig vioif0?
|
| vioif0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST>
I see:
vioif0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
ec_capabilities=1<VLAN_MTU>
ec_enabled=0
I also have occasional hangs, where the network just stops. I was waiting
for the next one to see if anything is sent to the console (never bothered
looking before...) I have seen this up to 7.99.70, I think (not .71, because
I haven't run that one here yet...)
kre
From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/52211: vioif stops on dmamap load error
Date: Thu, 11 May 2017 17:37:56 +0700
Date: Thu, 11 May 2017 10:05:01 +0000 (UTC)
From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
Message-ID: <20170511100501.BA11D7A283@mollari.NetBSD.org>
| The machine is running this patch now -- it may take weeks to trigger.
I will try that soon too, with a similar expectation for how
frequently it happens.
| > BTW, what hypervisor are you using?
| KVM from CentOS 6.8.
I am running VirtualBox 4.3.38 (I tried 5 once, in its early days, and
didn't like it at all... so I went back.) Host is Ubuntu 14.04, kernel
is 3.13.0-117 (386 version, just because the website told me that was
the version that almost all users should install...)
kre
From: Masanobu SAITOH <msaitoh@execsw.org>
To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, hannken@eis.cs.tu-bs.de
Cc: msaitoh@execsw.org, Robert Elz <kre@munnari.OZ.AU>
Subject: Re: kern/52211: vioif stops on dmamap load error
Date: Fri, 12 May 2017 18:44:12 +0900
Hi.
On 2017/05/11 19:05, J. Hannken-Illjes wrote:
> The following reply was made to PR kern/52211; it has been noted by GNATS.
>
> From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
> To: gnats-bugs@NetBSD.org
> Cc:
> Subject: Re: kern/52211: vioif stops on dmamap load error
> Date: Thu, 11 May 2017 12:02:15 +0200
>
> > On 11. May 2017, at 11:26, Masanobu SAITOH <msaitoh@execsw.org> wrote:
> >
> > New one:
> >
> > Index: if_vioif.c
> > ===================================================================
> > RCS file: /cvsroot/src/sys/dev/pci/if_vioif.c,v
> > retrieving revision 1.34
> > diff -u -p -r1.34 if_vioif.c
> > --- if_vioif.c 28 Mar 2017 04:10:33 -0000 1.34
> > +++ if_vioif.c 11 May 2017 09:20:00 -0000
> > @@ -812,6 +812,7 @@ vioif_start(struct ifnet *ifp)
> > for (;;) {
> > int slot, r;
> > + struct mbuf *newm;
> > IFQ_DEQUEUE(&ifp->if_snd, m);
> > @@ -833,7 +834,23 @@ retry:
> > r = bus_dmamap_load_mbuf(virtio_dmat(vsc),
> > sc->sc_tx_dmamaps[slot],
> > m, BUS_DMA_WRITE|BUS_DMA_NOWAIT);
> > - if (r != 0) {
> > + switch (r) {
> > + case 0:
> > + break;
> > + case EFBIG:
> > + printf("%s: loadup_mbuf() returned EFBIG (%d segs)\n",
> > + device_xname(sc->sc_dev),
> > + sc->sc_tx_dmamaps[slot]->dm_nsegs);
> > + newm = m_defrag(m, M_NOWAIT);
> > + if ((newm != NULL) &&
> > + (bus_dmamap_load_mbuf(virtio_dmat(vsc),
> > + sc->sc_tx_dmamaps[slot],
> > + newm, BUS_DMA_WRITE|BUS_DMA_NOWAIT) == 0)) {
> > + m = newm;
> > + break;
> > + }
> > + /* FALLTHROUGH */
> > + default:
> > virtio_enqueue_abort(vsc, vq, slot);
> > aprint_error_dev(sc->sc_dev,
> > "tx dmamap load failed, error code %d\n", r);
>
> The machine is running this patch now -- it may take weeks to trigger.
>
> We don't break the outer loop on error as the "break" after default
> now breaks the case and not the for loop but I hope it doesn't harm.
Oops.
Please use the following new diff:
Index: if_vioif.c
===================================================================
RCS file: /cvsroot/src/sys/dev/pci/if_vioif.c,v
retrieving revision 1.34
diff -u -p -r1.34 if_vioif.c
--- if_vioif.c 28 Mar 2017 04:10:33 -0000 1.34
+++ if_vioif.c 12 May 2017 09:29:19 -0000
@@ -812,6 +812,7 @@ vioif_start(struct ifnet *ifp)
for (;;) {
int slot, r;
+ bool remap = true;
IFQ_DEQUEUE(&ifp->if_snd, m);
@@ -830,12 +831,30 @@ retry:
}
if (r != 0)
panic("enqueue_prep for a tx buffer");
+retry_load:
r = bus_dmamap_load_mbuf(virtio_dmat(vsc),
sc->sc_tx_dmamaps[slot],
m, BUS_DMA_WRITE|BUS_DMA_NOWAIT);
- if (r != 0) {
+ if ((r == EFBIG) && remap) {
+ struct mbuf *newm;
+
+ device_printf(sc->sc_dev,
+ "loadup_mbuf() returned EFBIG (%d segs)\n",
+ sc->sc_tx_dmamaps[slot]->dm_nsegs);
+ remap = false;
+ newm = m_defrag(m, M_NOWAIT);
+ if (newm == NULL) {
+ virtio_enqueue_abort(vsc, vq, slot);
+ device_printf(sc->sc_dev,
+ "m_defrag() failed\n");
+ break;
+ } else {
+ m = newm;
+ goto retry_load;
+ }
+ } else if (r != 0) {
virtio_enqueue_abort(vsc, vq, slot);
- aprint_error_dev(sc->sc_dev,
+ device_printf(sc->sc_dev,
"tx dmamap load failed, error code %d\n", r);
break;
}
> > BTW, what hypervisor are you using?
>
> KVM from CentOS 6.8.
>
> --
> J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)
>
>
--
-----------------------------------------------
SAITOH Masanobu (msaitoh@execsw.org
msaitoh@netbsd.org)
From: Jaromír Doleček <jaromir.dolecek@gmail.com>
To: gnats-bugs@netbsd.org
Cc: netbsd-bugs@netbsd.org
Subject: Re: kern/52211: vioif stops on dmamap load error
Date: Wed, 17 May 2017 00:27:56 +0200
--f403045f284af3cb0f054fabaef0
Content-Type: text/plain; charset="UTF-8"
The driver fails to lock against interrupts without NET_MPSAFE, as it
doesn't do splnet()/splx() as it IMO should. Maybe the biglock
actually is enough, just looks unsafe. IMO it should just always use
the mutexes.
However, the wedge seems to be caused by something else.
If the dmamap load fails while there is no TX active, the code still sets
IFF_OACTIVE. The flag is only reset when a TX request finishes. Since
nothing ever resets it, and the start function ignores any
further requests while the flag is set, the interface stops sending anything
until it is reset. So a minimal fix would be to set IFF_OACTIVE
only when virtio_enqueue_prep() fails, and leave it alone for the
other errors.
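The wedge described above can be reproduced with a small userland state machine. This is a sketch under assumptions: `struct model_if`, `MODEL_OACTIVE`, `model_start()` and `model_tx_done()` are hypothetical stand-ins for the interface, IFF_OACTIVE, the start routine and the TX completion path, and the `flag_on_load_error` switch exists only to compare the two behaviours.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_OACTIVE 0x1	/* stand-in for IFF_OACTIVE */

struct model_if {
	int flags;
	int in_flight;		/* TX requests currently owned by the device */
	int sent;		/* packets handed to the device so far */
};

/*
 * Try to send one packet. load_result is the hypothetical result of the
 * DMA map load (0 on success). If flag_on_load_error is true, a failed
 * load also marks the interface busy, as in the behaviour described above.
 */
bool model_start(struct model_if *ifp, int load_result, bool flag_on_load_error)
{
	assert(ifp != NULL);
	if (ifp->flags & MODEL_OACTIVE)
		return false;	/* requests are ignored while the flag is set */
	if (load_result != 0) {
		if (flag_on_load_error)
			ifp->flags |= MODEL_OACTIVE;
		return false;	/* packet dropped */
	}
	ifp->in_flight++;
	ifp->sent++;
	return true;
}

/* Completion path: the only place that clears the busy flag. */
void model_tx_done(struct model_if *ifp)
{
	assert(ifp != NULL);
	if (ifp->in_flight == 0)
		return;		/* nothing pending, so no completion happens */
	ifp->in_flight--;
	ifp->flags &= ~MODEL_OACTIVE;
}
```

When the flag is set by a load failure with nothing in flight, no completion can clear it and every later `model_start()` call is refused; leaving the flag alone on load errors lets the next packet through.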
The m_defrag() call looks like a good idea anyway - after all, it's
better to try to send the data than to just drop it on the floor. I
think it's perfectly possible to have very fragmented mbufs.
Also, the code that tries to dequeue finished requests when we run out
of resources looks unnecessary; IMO it would be simpler to
remove it, as it only obfuscates the logic.
What do you think about the attached patch, based on Masanobu SAITOH's diff?
Besides the things above, I've also changed it so that the interface queue
limit is set up according to the virtio queue size, i.e. the upper layer
would never queue more than the interface can handle.
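The idea of bounding the software send queue by the device ring size can be illustrated with a minimal counter-based queue. This is a simplified sketch, not the patch itself: `struct model_ifq` and its helpers are hypothetical, and how the real change derives the limit is not reproduced here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical send queue that only tracks counts, not packets. */
struct model_ifq {
	int len;	/* packets currently queued */
	int maxlen;	/* queue limit */
	int drops;	/* packets refused because the queue was full */
};

/* Bound the software queue by the size of the device ring. */
void model_ifq_set_maxlen(struct model_ifq *q, int ring_size)
{
	assert(q != NULL && ring_size > 0);
	q->maxlen = ring_size;
}

/* Queue one packet, or refuse it when the limit is reached. */
int model_ifq_enqueue(struct model_ifq *q)
{
	assert(q != NULL);
	if (q->len >= q->maxlen) {
		q->drops++;
		return ENOBUFS;
	}
	q->len++;
	return 0;
}
```

With a ring of four entries, six enqueue attempts leave four packets queued and two refused, so the upper layer sees the back-pressure at the queue rather than inside the start routine.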
Jaromir
2017-05-03 10:40 GMT+02:00 <hannken@eis.cs.tu-bs.de>:
>>Number: 52211
>>Category: kern
>>Synopsis: vioif stops on dmamap load error
>>Confidential: no
>>Severity: serious
>>Priority: medium
>>Responsible: kern-bug-people
>>State: open
>>Class: sw-bug
>>Submitter-Id: net
>>Arrival-Date: Wed May 03 08:40:00 +0000 2017
>>Originator: Juergen Hannken-Illjes
>>Release: NetBSD 7.1
>>Organization:
>
>>Environment:
>
>
> System: NetBSD vpnserv.isf.cs.tu-bs.de 7.1 NetBSD 7.1 (gateway.i386) #0: Mon Mar 13 16:40:12 MET 2017 build@builder.isf.cs.tu-bs.de:/build/nbsd7/obj/obj.i386/sys/arch/i386/compile/gateway.i386 i386
> Architecture: i386
> Machine: i386
>>Description:
>
> From time to time the machine prints
>
> vioif0: tx dmamap load failed, error code 27
>
> and most times the interface seems to stop as the machine is no longer
> accessible from the network.
>
> $NetBSD: if_vioif.c,v 1.7.2.3 2016/12/23 05:57:40 snj Exp $
>>How-To-Repeat:
>
>>Fix:
>
>
>>Unformatted:
>
>
--f403045f284af3cb0f054fabaef0
Content-Type: text/plain; charset="US-ASCII"; name="if_vioif_wedge_fix.diff"
Content-Disposition: attachment; filename="if_vioif_wedge_fix.diff"
Content-Transfer-Encoding: base64
X-Attachment-Id: f_j2s4jecc0
SW5kZXg6IGlmX3Zpb2lmLmMKPT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09
PT09PT09PT09PT09PT09PT09PT09PT09PT09PQpSQ1MgZmlsZTogL2N2c3Jvb3Qvc3JjL3N5cy9k
ZXYvcGNpL2lmX3Zpb2lmLmMsdgpyZXRyaWV2aW5nIHJldmlzaW9uIDEuMzQKZGlmZiAtdSAtcCAt
cjEuMzQgaWZfdmlvaWYuYwotLS0gaWZfdmlvaWYuYwkyOCBNYXIgMjAxNyAwNDoxMDozMyAtMDAw
MAkxLjM0CisrKyBpZl92aW9pZi5jCTE2IE1heSAyMDE3IDIyOjIyOjMyIC0wMDAwCkBAIC0yMjYs
OCArMjI2LDggQEAgc3RydWN0IHZpb2lmX3NvZnRjIHsKIAl9CQkJc2NfY3RybF9pbnVzZTsKIAlr
Y29uZHZhcl90CQlzY19jdHJsX3dhaXQ7CiAJa211dGV4X3QJCXNjX2N0cmxfd2FpdF9sb2NrOwot
CWttdXRleF90CQkqc2NfdHhfbG9jazsKLQlrbXV0ZXhfdAkJKnNjX3J4X2xvY2s7CisJa211dGV4
X3QJCXNjX3R4X2xvY2s7CisJa211dGV4X3QJCXNjX3J4X2xvY2s7CiAJYm9vbAkJCXNjX3N0b3Bw
aW5nOwogCiAJYm9vbAkJCXNjX2hhc19jdHJsOwpAQCAtMjM1LDEyICsyMzUsMTIgQEAgc3RydWN0
IHZpb2lmX3NvZnRjIHsKICNkZWZpbmUgVklSVElPX05FVF9UWF9NQVhOU0VHUwkJKDE2KSAvKiBY
WFggKi8KICNkZWZpbmUgVklSVElPX05FVF9DVFJMX01BQ19NQVhFTlRSSUVTCSg2NCkgLyogWFhY
ICovCiAKLSNkZWZpbmUgVklPSUZfVFhfTE9DSyhfc2MpCWlmICgoX3NjKS0+c2NfdHhfbG9jaykg
bXV0ZXhfZW50ZXIoKF9zYyktPnNjX3R4X2xvY2spCi0jZGVmaW5lIFZJT0lGX1RYX1VOTE9DSyhf
c2MpCWlmICgoX3NjKS0+c2NfdHhfbG9jaykgbXV0ZXhfZXhpdCgoX3NjKS0+c2NfdHhfbG9jaykK
LSNkZWZpbmUgVklPSUZfVFhfTE9DS0VEKF9zYykJKCEoX3NjKS0+c2NfdHhfbG9jayB8fCBtdXRl
eF9vd25lZCgoX3NjKS0+c2NfdHhfbG9jaykpCi0jZGVmaW5lIFZJT0lGX1JYX0xPQ0soX3NjKQlp
ZiAoKF9zYyktPnNjX3J4X2xvY2spIG11dGV4X2VudGVyKChfc2MpLT5zY19yeF9sb2NrKQotI2Rl
ZmluZSBWSU9JRl9SWF9VTkxPQ0soX3NjKQlpZiAoKF9zYyktPnNjX3J4X2xvY2spIG11dGV4X2V4
aXQoKF9zYyktPnNjX3J4X2xvY2spCi0jZGVmaW5lIFZJT0lGX1JYX0xPQ0tFRChfc2MpCSghKF9z
YyktPnNjX3J4X2xvY2sgfHwgbXV0ZXhfb3duZWQoKF9zYyktPnNjX3J4X2xvY2spKQorI2RlZmlu
ZSBWSU9JRl9UWF9MT0NLKF9zYykJbXV0ZXhfZW50ZXIoJihfc2MpLT5zY190eF9sb2NrKQorI2Rl
ZmluZSBWSU9JRl9UWF9VTkxPQ0soX3NjKQltdXRleF9leGl0KCYoX3NjKS0+c2NfdHhfbG9jaykK
KyNkZWZpbmUgVklPSUZfVFhfTE9DS0VEKF9zYykJbXV0ZXhfb3duZWQoJihfc2MpLT5zY190eF9s
b2NrKQorI2RlZmluZSBWSU9JRl9SWF9MT0NLKF9zYykJbXV0ZXhfZW50ZXIoJihfc2MpLT5zY19y
eF9sb2NrKQorI2RlZmluZSBWSU9JRl9SWF9VTkxPQ0soX3NjKQltdXRleF9leGl0KCYoX3NjKS0+
c2NfcnhfbG9jaykKKyNkZWZpbmUgVklPSUZfUlhfTE9DS0VEKF9zYykJbXV0ZXhfb3duZWQoJihf
c2MpLT5zY19yeF9sb2NrKQogCiAvKiBjZmF0dGFjaCBpbnRlcmZhY2UgZnVuY3Rpb25zICovCiBz
dGF0aWMgaW50CXZpb2lmX21hdGNoKGRldmljZV90LCBjZmRhdGFfdCwgdm9pZCAqKTsKQEAgLTQz
OSw3ICs0MzksNyBAQCB2aW9pZl9hbGxvY19tZW1zKHN0cnVjdCB2aW9pZl9zb2Z0YyAqc2MpCiAJ
CUNfTDEodHhoZHJfZG1hbWFwc1tpXSwgdHhfaGRyc1tpXSwKIAkJICAgIHNpemVvZihzdHJ1Y3Qg
dmlydGlvX25ldF9oZHIpLCAxLAogCQkgICAgV1JJVEUsICJ0eCBoZWFkZXIiKTsKLQkJQyh0eF9k
bWFtYXBzW2ldLCBOVUxMLCBFVEhFUl9NQVhfTEVOLCAxNiAvKiBYWFggKi8sIDAsCisJCUModHhf
ZG1hbWFwc1tpXSwgTlVMTCwgRVRIRVJfTUFYX0xFTiwgVklSVElPX05FVF9UWF9NQVhOU0VHUywg
MCwKIAkJICAidHggcGF5bG9hZCIpOwogCX0KIApAQCAtNTkyLDEzICs1OTIsOCBAQCB2aW9pZl9h
dHRhY2goZGV2aWNlX3QgcGFyZW50LCBkZXZpY2VfdCBzCiAKIAlhcHJpbnRfbm9ybWFsX2Rldihz
ZWxmLCAiRXRoZXJuZXQgYWRkcmVzcyAlc1xuIiwgZXRoZXJfc3ByaW50ZihzYy0+c2NfbWFjKSk7
CiAKLSNpZmRlZiBWSU9JRl9NUFNBRkUKLQlzYy0+c2NfdHhfbG9jayA9IG11dGV4X29ial9hbGxv
YyhNVVRFWF9ERUZBVUxULCBJUExfTkVUKTsKLQlzYy0+c2NfcnhfbG9jayA9IG11dGV4X29ial9h
bGxvYyhNVVRFWF9ERUZBVUxULCBJUExfTkVUKTsKLSNlbHNlCi0Jc2MtPnNjX3R4X2xvY2sgPSBO
VUxMOwotCXNjLT5zY19yeF9sb2NrID0gTlVMTDsKLSNlbmRpZgorCW11dGV4X2luaXQoJnNjLT5z
Y190eF9sb2NrLCBNVVRFWF9ERUZBVUxULCBJUExfTkVUKTsKKwltdXRleF9pbml0KCZzYy0+c2Nf
cnhfbG9jaywgTVVURVhfREVGQVVMVCwgSVBMX05FVCk7CiAJc2MtPnNjX3N0b3BwaW5nID0gZmFs
c2U7CiAKIAkvKgpAQCAtNjgwLDYgKzY3NSw4IEBAIHNraXA6CiAJaWZwLT5pZl9zdG9wID0gdmlv
aWZfc3RvcDsKIAlpZnAtPmlmX2NhcGFiaWxpdGllcyA9IDA7CiAJaWZwLT5pZl93YXRjaGRvZyA9
IHZpb2lmX3dhdGNoZG9nOworCUlGUV9TRVRfTUFYTEVOKCZpZnAtPmlmX3NuZCwgTUFYKHNjLT5z
Y192cVtWUV9UWF0udnFfbnVtLCBJRlFfTUFYTEVOKSk7CisJSUZRX1NFVF9SRUFEWSgmaWZwLT5p
Zl9zbmQpOwogCiAJc2MtPnNjX2V0aGVyY29tLmVjX2NhcGFiaWxpdGllcyB8PSBFVEhFUkNBUF9W
TEFOX01UVTsKIApAQCAtNjkwLDEwICs2ODcsOCBAQCBza2lwOgogCXJldHVybjsKIAogZXJyOgot
CWlmIChzYy0+c2NfdHhfbG9jaykKLQkJbXV0ZXhfb2JqX2ZyZWUoc2MtPnNjX3R4X2xvY2spOwot
CWlmIChzYy0+c2NfcnhfbG9jaykKLQkJbXV0ZXhfb2JqX2ZyZWUoc2MtPnNjX3J4X2xvY2spOwor
CW11dGV4X2Rlc3Ryb3koJnNjLT5zY190eF9sb2NrKTsKKwltdXRleF9kZXN0cm95KCZzYy0+c2Nf
cnhfbG9jayk7CiAKIAlpZiAoc2MtPnNjX2hhc19jdHJsKSB7CiAJCWN2X2Rlc3Ryb3koJnNjLT5z
Y19jdHJsX3dhaXQpOwpAQCAtNzk5LDcgKzc5NCw3IEBAIHZpb2lmX3N0YXJ0KHN0cnVjdCBpZm5l
dCAqaWZwKQogCXN0cnVjdCB2aXJ0aW9fc29mdGMgKnZzYyA9IHNjLT5zY192aXJ0aW87CiAJc3Ry
dWN0IHZpcnRxdWV1ZSAqdnEgPSAmc2MtPnNjX3ZxW1ZRX1RYXTsKIAlzdHJ1Y3QgbWJ1ZiAqbTsK
LQlpbnQgcXVldWVkID0gMCwgcmV0cnkgPSAwOworCWludCBxdWV1ZWQgPSAwOwogCiAJVklPSUZf
VFhfTE9DSyhzYyk7CiAKQEAgLTgxNCw0MiArODA5LDU3IEBAIHZpb2lmX3N0YXJ0KHN0cnVjdCBp
Zm5ldCAqaWZwKQogCQlpbnQgc2xvdCwgcjsKIAogCQlJRlFfREVRVUVVRSgmaWZwLT5pZl9zbmQs
IG0pOwotCiAJCWlmIChtID09IE5VTEwpCiAJCQlicmVhazsKIAotcmV0cnk6CiAJCXIgPSB2aXJ0
aW9fZW5xdWV1ZV9wcmVwKHZzYywgdnEsICZzbG90KTsKIAkJaWYgKHIgPT0gRUFHQUlOKSB7CiAJ
CQlpZnAtPmlmX2ZsYWdzIHw9IElGRl9PQUNUSVZFOwotCQkJdmlvaWZfdHhfdnFfZG9uZV9sb2Nr
ZWQodnEpOwotCQkJaWYgKHJldHJ5KysgPT0gMCkKLQkJCQlnb3RvIHJldHJ5OwotCQkJZWxzZQot
CQkJCWJyZWFrOworCQkJYnJlYWs7CiAJCX0KIAkJaWYgKHIgIT0gMCkKIAkJCXBhbmljKCJlbnF1
ZXVlX3ByZXAgZm9yIGEgdHggYnVmZmVyIik7CisKIAkJciA9IGJ1c19kbWFtYXBfbG9hZF9tYnVm
KHZpcnRpb19kbWF0KHZzYyksCiAJCQkJCSBzYy0+c2NfdHhfZG1hbWFwc1tzbG90XSwKIAkJCQkJ
IG0sIEJVU19ETUFfV1JJVEV8QlVTX0RNQV9OT1dBSVQpOwogCQlpZiAociAhPSAwKSB7Ci0JCQl2
aXJ0aW9fZW5xdWV1ZV9hYm9ydCh2c2MsIHZxLCBzbG90KTsKLQkJCWFwcmludF9lcnJvcl9kZXYo
c2MtPnNjX2RldiwKLQkJCSAgICAidHggZG1hbWFwIGxvYWQgZmFpbGVkLCBlcnJvciBjb2RlICVk
XG4iLCByKTsKLQkJCWJyZWFrOworCQkJLyogbWF5YmUganVzdCB0b28gZnJhZ21lbnRlZCAqLwor
CQkJc3RydWN0IG1idWYgKm5ld207CisKKwkJCW5ld20gPSBtX2RlZnJhZyhtLCBNX05PV0FJVCk7
CisJCQlpZiAobmV3bSA9PSBOVUxMKSB7CisJCQkJYXByaW50X2Vycm9yX2RldihzYy0+c2NfZGV2
LAorCQkJCSAgICAibV9kZWZyYWcoKSBmYWlsZWRcbiIpOworCQkJCW1fZnJlZW0obSk7CisJCQkJ
Z290byBza2lwOworCQkJfQorCisJCQlyID0gYnVzX2RtYW1hcF9sb2FkX21idWYodmlydGlvX2Rt
YXQodnNjKSwKKwkJCQkJIHNjLT5zY190eF9kbWFtYXBzW3Nsb3RdLAorCQkJCQkgbmV3bSwgQlVT
X0RNQV9XUklURXxCVVNfRE1BX05PV0FJVCk7CisJCQlpZiAociAhPSAwKSB7CisJCQkJYXByaW50
X2Vycm9yX2RldihzYy0+c2NfZGV2LAorCSAgIAkJCSAgICAidHggZG1hbWFwIGxvYWQgZmFpbGVk
LCBlcnJvciBjb2RlICVkXG4iLAorCQkJCSAgICByKTsKKwkJCQltX2ZyZWVtKG5ld20pOworc2tp
cDoKKwkJCQl2aXJ0aW9fZW5xdWV1ZV9hYm9ydCh2c2MsIHZxLCBzbG90KTsKKwkJCQljb250aW51
ZTsKKwkJCX0KIAkJfQorCisJCS8qIFRoaXMgc2hvdWxkIGFjdHVhbGx5IG5ldmVyIGZhaWwgKi8K
IAkJciA9IHZpcnRpb19lbnF1ZXVlX3Jlc2VydmUodnNjLCB2cSwgc2xvdCwKIAkJCQkJc2MtPnNj
X3R4X2RtYW1hcHNbc2xvdF0tPmRtX25zZWdzICsgMSk7CiAJCWlmIChyICE9IDApIHsKKwkJCWFw
cmludF9lcnJvcl9kZXYoc2MtPnNjX2RldiwKKwkgICAJCSAgICAidmlydGlvX2VucXVldWVfcmVz
ZXJ2ZSBmYWlsZWQsIGVycm9yIGNvZGUgJWRcbiIsCisJCQkgICAgcik7CiAJCQlidXNfZG1hbWFw
X3VubG9hZCh2aXJ0aW9fZG1hdCh2c2MpLAogCQkJCQkgIHNjLT5zY190eF9kbWFtYXBzW3Nsb3Rd
KTsKLQkJCWlmcC0+aWZfZmxhZ3MgfD0gSUZGX09BQ1RJVkU7Ci0JCQl2aW9pZl90eF92cV9kb25l
X2xvY2tlZCh2cSk7Ci0JCQlpZiAocmV0cnkrKyA9PSAwKQotCQkJCWdvdG8gcmV0cnk7Ci0JCQll
bHNlCi0JCQkJYnJlYWs7CisJCQkvKiBzbG90IGFscmVhZHkgZnJlZWQgYnkgdmlydGlvX2VucXVl
dWVfcmVzZXJ2ZSAqLworCQkJY29udGludWU7CiAJCX0KIAogCQlzYy0+c2NfdHhfbWJ1ZnNbc2xv
dF0gPSBtOwpAQCAtODY0LDE0ICs4NzQsMTMgQEAgcmV0cnk6CiAJCXZpcnRpb19lbnF1ZXVlKHZz
YywgdnEsIHNsb3QsIHNjLT5zY190eGhkcl9kbWFtYXBzW3Nsb3RdLCB0cnVlKTsKIAkJdmlydGlv
X2VucXVldWUodnNjLCB2cSwgc2xvdCwgc2MtPnNjX3R4X2RtYW1hcHNbc2xvdF0sIHRydWUpOwog
CQl2aXJ0aW9fZW5xdWV1ZV9jb21taXQodnNjLCB2cSwgc2xvdCwgZmFsc2UpOworCiAJCXF1ZXVl
ZCsrOwogCQlicGZfbXRhcChpZnAsIG0pOwogCX0KIAotCWlmIChtICE9IE5VTEwpIHsKLQkJaWZw
LT5pZl9mbGFncyB8PSBJRkZfT0FDVElWRTsKKwlpZiAobSAhPSBOVUxMKQogCQltX2ZyZWVtKG0p
OwotCX0KIAogCWlmIChxdWV1ZWQgPiAwKSB7CiAJCXZpcnRpb19lbnF1ZXVlX2NvbW1pdCh2c2Ms
IHZxLCAtMSwgdHJ1ZSk7Cg==
--f403045f284af3cb0f054fabaef0--
From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/52211 CVS commit: src/sys/dev/pci
Date: Wed, 17 May 2017 20:04:50 +0000
Module Name: src
Committed By: jdolecek
Date: Wed May 17 20:04:50 UTC 2017
Modified Files:
src/sys/dev/pci: if_vioif.c
Log Message:
simplify vioif_start() - remove the delivery attempts on failure and retries,
leave that for the dedicated thread
if dma map load fails, retry after m_defrag(), but continue processing
other queue items regardless
set interface queue length according to the length of virtio queue, so that
higher layer won't queue more than interface can manage to keep in flight
use the mutexes always, not just with NET_MPSAFE, so they continue
being exercised and hence working; they also enforce proper IPL level
inspired by discussion around PR kern/52211, thanks to Masanobu SAITOH
for the m_defrag() idea and code
To generate a diff of this commit:
cvs rdiff -u -r1.35 -r1.36 src/sys/dev/pci/if_vioif.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
Responsible-Changed-From-To: kern-bug-people->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Wed, 17 May 2017 20:06:39 +0000
Responsible-Changed-Why:
I'll take care of this.
State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Wed, 17 May 2017 20:06:39 +0000
State-Changed-Why:
Fix committed to -current. Can you please check it out? I understand it's
not very simple to trigger ...
From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/52211 (vioif stops on dmamap load error)
Date: Thu, 18 May 2017 08:52:41 +0200
> On 17. May 2017, at 22:06, jdolecek@NetBSD.org wrote:
>
> Fix committed to -current. Can you please check it out? I understand it's
> not very simple to trigger ...
Unfortunately not -- the machine in question runs 7.1.
--
J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)
State-Changed-From-To: feedback->closed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Wed, 05 Jul 2017 07:52:37 +0000
State-Changed-Why:
Fix is present in netbsd-8 branch, so will be part of 8.0.
>Unformatted:
Copyright © 1994-2014
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.