NetBSD Problem Report #52263
From hf@spg.tu-darmstadt.de Tue May 30 16:00:17 2017
Return-Path: <hf@spg.tu-darmstadt.de>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
by mollari.NetBSD.org (Postfix) with ESMTPS id EF11C7A173
for <gnats-bugs@gnats.NetBSD.org>; Tue, 30 May 2017 16:00:16 +0000 (UTC)
Message-Id: <201705301551.v4UFpHLH004124@Zinnenwand.nt.e-technik.tu-darmstadt.de>
Date: Tue, 30 May 2017 17:51:17 +0200 (CEST)
From: Hauke Fath <hf@spg.tu-darmstadt.de>
Reply-To: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@NetBSD.org
Cc: Hauke Fath <hf@spg.tu-darmstadt.de>
Subject: Frequent ixg(4) panic
X-Send-Pr-Version: 3.95
>Number: 52263
>Category: kern
>Synopsis: Frequent ixg(4) panic in ixgbe_rxeof()
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue May 30 16:05:00 +0000 2017
>Closed-Date: Fri Oct 11 09:21:27 +0000 2019
>Last-Modified: Fri Oct 11 09:21:27 +0000 2019
>Originator: Hauke Fath
>Release: NetBSD 7.99.73
>Organization:
Technische Universitaet Darmstadt
>Environment:
System: NetBSD Zinnenwand 7.99.73 NetBSD 7.99.73 (FIFI-$Revision$) #0: Mon May 29 17:00:08 CEST 2017 hf@Hochstuhl:/var/obj/netbsd-builds/developer/amd64/sys/arch/amd64/compile/FIFI amd64
Architecture: x86_64
Machine: amd64
>Description:
A pf & carp router running -current (7.99.73 here, but it also happens in
yesterday's 7.99.75) panics frequently with
NetBSD 7.99.73 (FIFI-$Revision$) #2: Fri May 26 15:51:24 CEST 2017
hf@Hochstuhl:/var/obj/netbsd-builds/developer/amd64/sys/arch/amd64/compile/FIFI
[...]
fatal protection fault in supervisor mode
trap type 4 code 0 rip 0xffffffff8029646d cs 0x8 rflags 0x10202 cr2 0xffff80008f799000 ilevel 0x8 rsp 0xfffffe810e8aeeb0
curlwp 0xfffffe810e89d4c0 pid 0.18 lowest kstack 0xfffffe810e8ab2c0
panic: trap
cpu1: Begin traceback...
vpanic() at netbsd:vpanic+0x140
snprintf() at netbsd:snprintf
trap() at netbsd:trap+0xbab
--- trap (number 4) ---
ixgbe_rxeof() at netbsd:ixgbe_rxeof+0x523
ixgbe_handle_que() at netbsd:ixgbe_handle_que+0x98
softint_dispatch() at netbsd:softint_dispatch+0xd4
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfffffe810e8aeff0
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
0:
cpu1: End traceback...
rebooting...
According to objdump(1) probing, the relevant instruction is
at sys/dev/pci/ixgbe/ix_txrx.c:1933
1922            /*
1923             * Optimize. This might be a small packet,
1924             * maybe just a TCP ACK. Do a fast copy that
1925             * is cache aligned into a new mbuf, and
1926             * leave the old mbuf+cluster for re-use.
1927             */
1928            if (eop && len <= IXGBE_RX_COPY_LEN) {
1929                    sendmp = m_gethdr(M_NOWAIT, MT_DATA);
1930                    if (sendmp != NULL) {
1931                            sendmp->m_data +=
1932                                IXGBE_RX_COPY_ALIGN;
1933                            ixgbe_bcopy(mp->m_data,
1934                                sendmp->m_data, len);
1935                            sendmp->m_len = len;
1936                            rxr->rx_copies.ev_count++;
1937                            rbuf->flags |= IXGBE_RX_COPY;
1938                    }
1939            }
I tried to KASSERT() for NULL pointers, but it wasn't that
easy.
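
For illustration only, here is a minimal userland sketch of the kind of checks one might place around the small-packet copy path quoted above. It is not the driver code: assert() stands in for KASSERT(), and struct fake_mbuf, rx_copy_small(), RX_COPY_LEN and RX_COPY_ALIGN are hypothetical names and values chosen for this example. It checks the source and destination pointers and also that the length fits both buffers, since a non-NULL but stale pointer or a bad length would pass a plain NULL check.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; the real values live in the ixgbe headers. */
#define RX_COPY_LEN    160
#define RX_COPY_ALIGN  8
#define SMALL_BUF_SIZE (RX_COPY_ALIGN + RX_COPY_LEN)

/* Simplified model of a small mbuf: a data pointer into inline storage. */
struct fake_mbuf {
	unsigned char *data;                  /* like m_data */
	size_t len;                           /* like m_len */
	unsigned char store[SMALL_BUF_SIZE];  /* inline storage */
};

/*
 * Copy a small, complete frame from src into dst.
 * Returns 1 if the frame was copied, 0 if the caller should pass the
 * original buffer up the stack instead.
 */
static int
rx_copy_small(const unsigned char *src, size_t src_cap, size_t len,
    int eop, struct fake_mbuf *dst)
{
	if (!eop || len > RX_COPY_LEN)
		return 0;

	/* The checks a KASSERT() would express in the kernel. */
	assert(src != NULL);
	assert(dst != NULL);
	assert(len <= src_cap);  /* reported length must fit the source */

	dst->data = dst->store + RX_COPY_ALIGN;
	assert(len <= (size_t)(dst->store + sizeof(dst->store) - dst->data));

	memcpy(dst->data, src, len);
	dst->len = len;
	return 1;
}
```

Called with a 64-byte end-of-packet frame, rx_copy_small() copies and returns 1; with a 1500-byte frame, or with eop == 0, it returns 0 and leaves dst untouched.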
Sometimes I also see
fatal protection fault in supervisor mode
trap type 4 code 0 rip 0xffffffff8061e443 cs 0x8 rflags 0x10202 cr2 0x6b1e00 ilevel 0x4 rsp 0xfffffe810e913ef0
curlwp 0xfffffe810e904540 pid 0.30 lowest kstack 0xfffffe810e9102c0
panic: trap
cpu3: Begin traceback...
vpanic() at netbsd:vpanic+0x140
snprintf() at netbsd:snprintf
trap() at netbsd:trap+0xbab
--- trap (number 4) ---
ether_input() at netbsd:ether_input+0x83
if_percpuq_softint() at netbsd:if_percpuq_softint+0x5b
softint_dispatch() at netbsd:softint_dispatch+0xd4
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfffffe810e913ff0
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
f557b81a7cde3fa1:
cpu3: End traceback...
rebooting...
>How-To-Repeat:
Run serious amounts of traffic over an ixg(4) equipped pf/carp
router machine - 9 vlans here.
Happens once every few hours here, so I can provide details,
and/or try things easily.
>Fix:
I'd love to.
>Release-Note:
>Audit-Trail:
From: Masanobu SAITOH <msaitoh@execsw.org>
To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc: msaitoh@execsw.org
Subject: Re: kern/52263: Frequent ixg(4) panic
Date: Thu, 10 Aug 2017 13:54:32 +0900
Hi.
On 2017/05/31 1:05, Hauke Fath wrote:
>> How-To-Repeat:
>
> Run serious amounts of traffic over an ixg(4) equipped pf/carp
> router machine - 9 vlans here.
Does this problem still occur?
I suspect this is not ixg(4)'s bug but pf's bug.
Have you ever tested without pf?
The following change avoids using the optimization, but
it won't solve your machine's problem.
------------------
Index: ix_txrx.c
===================================================================
RCS file: /cvsroot/src/sys/dev/pci/ixgbe/ix_txrx.c,v
retrieving revision 1.27
diff -u -p -r1.27 ix_txrx.c
--- ix_txrx.c 13 Jun 2017 09:37:22 -0000 1.27
+++ ix_txrx.c 10 Aug 2017 04:40:59 -0000
@@ -1915,6 +1915,7 @@ ixgbe_rxeof(struct ix_queue *que)
                 * is cache aligned into a new mbuf, and
                 * leave the old mbuf+cluster for re-use.
                 */
+#if 0
                if (eop && len <= IXGBE_RX_COPY_LEN) {
                        sendmp = m_gethdr(M_NOWAIT, MT_DATA);
                        if (sendmp != NULL) {
@@ -1927,6 +1928,7 @@ ixgbe_rxeof(struct ix_queue *que)
                                rbuf->flags |= IXGBE_RX_COPY;
                        }
                }
+#endif
                if (sendmp == NULL) {
                        rbuf->buf = rbuf->fmp = NULL;
                        sendmp = mp;
------------------
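
As a rough userland model of what the patch changes in control flow (not the driver itself): with the copy path compiled out, sendmp stays NULL, so the original buffer is always handed up the stack and its ring slot is detached for later refill. The names struct pkt, struct rx_slot and rx_deliver(), the RX_COPY_LEN value, and the copy_enabled flag (standing in for the #if 0) are all assumptions made for this sketch; the eop condition is omitted for brevity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define RX_COPY_LEN 160  /* hypothetical threshold */

struct pkt {
	unsigned char bytes[2048];
	size_t len;
};

/* One receive ring slot: the buffer it owns and whether it was copied. */
struct rx_slot {
	struct pkt *buf;
	bool copied;
};

/*
 * Decide which buffer goes up the stack for a frame of 'len' bytes.
 * 'small' is spare storage for the copy path; copy_enabled == false
 * models the patched build in which the optimization is disabled.
 */
static struct pkt *
rx_deliver(struct rx_slot *slot, size_t len, bool copy_enabled,
    struct pkt *small)
{
	struct pkt *mp = slot->buf;
	struct pkt *sendmp = NULL;

	if (copy_enabled && small != NULL && len <= RX_COPY_LEN) {
		memcpy(small->bytes, mp->bytes, len);
		small->len = len;
		slot->copied = true;  /* original buffer stays in the ring */
		sendmp = small;
	}

	if (sendmp == NULL) {
		slot->buf = NULL;     /* slot must be refilled later */
		mp->len = len;
		sendmp = mp;
	}
	return sendmp;
}
```

With copy_enabled set, a 64-byte frame comes back in the spare buffer and the slot keeps its original buffer; with it cleared, the same frame comes back in the original buffer and the slot is left empty.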
--
-----------------------------------------------
SAITOH Masanobu (msaitoh@execsw.org
msaitoh@netbsd.org)
From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: Masanobu SAITOH <msaitoh@execsw.org>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/52263: Frequent ixg(4) panic
Date: Thu, 10 Aug 2017 08:40:22 +0200
On Thu, 10 Aug 2017 13:54:32 +0900, Masanobu SAITOH wrote:
>>
>> 	Run serious amounts of traffic over an ixg(4) equipped pf/carp
>> 	router machine - 9 vlans here.
>
> Does this problem still occur?
It did, right until I reinstalled the router pair with FreeBSD.
> I suspect this is not ixg(4)'s bug but pf's bug.
That is well possible - somewhere in the Bermuda triangle of pf and
carp.
> Have you ever tested without pf?
No, the machines are institute network routers, and I unfortunately
don't have any spare machines with 10 GbE interfaces.
Cheerio,
hauke
-- 
Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
Ernst-Ludwig-Straße 15
64625 Bensheim
Germany
From: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/52263: Frequent ixg(4) panic
Date: Thu, 10 Aug 2017 10:32:50 +0200
On Thu, 10 Aug 2017 04:55:00 +0000 (UTC), Masanobu SAITOH wrote:
> I suspect this is not ixg(4)'s bug but pf's bug.
> Have you ever tested without pf?
Comes to mind - in my desperation, I set up a pair of Xen DomUs as
failover routers, figuring I'd split ixg(4) and pf(4) network
processing. The Dom0 was fine; the DomUs crashed less often, but they
did.
Gave me half the peak network throughput of the native installation,
btw.
Cheerio,
hauke
-- 
The ASCII Ribbon Campaign Hauke Fath
() No HTML/RTF in email Institut für Nachrichtentechnik
/\ No Word docs in email TU Darmstadt
Respect for open standards Ruf +49-6151-16-21344
State-Changed-From-To: open->feedback
State-Changed-By: msaitoh@NetBSD.org
State-Changed-When: Fri, 11 Oct 2019 08:55:36 +0000
State-Changed-Why:
We have fixed a lot of ixg(4)'s bugs since May 2017. If the problem was
caused by ixg(4), it might be fixed. Did you see this problem recently?
Is it OK to close?
From: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org,
msaitoh@netbsd.org
Subject: Re: kern/52263 (Frequent ixg(4) panic in ixgbe_rxeof())
Date: Fri, 11 Oct 2019 11:04:14 +0200
On Fri, 11 Oct 2019 08:55:36 +0000 (UTC), msaitoh@netbsd.org wrote:
> We have fixed a lot of ixg(4)'s bugs since May 2017. If the problem was
> caused by ixg(4), it might be fixed.
Thanks for looking at the issue.
> Did you see this problem recently?
We do not run NetBSD on our routers any more, so I have no way to check.
> Is it OK to close?
If you are confident the problem is fixed, that is good enough for me. :)
Cheerio,
hauke
-- 
The ASCII Ribbon Campaign Hauke Fath
() No HTML/RTF in email Institut für Nachrichtentechnik
/\ No Word docs in email TU Darmstadt
Respect for open standards Ruf +49-6151-16-21344
State-Changed-From-To: feedback->closed
State-Changed-By: msaitoh@NetBSD.org
State-Changed-When: Fri, 11 Oct 2019 09:21:27 +0000
State-Changed-Why:
The submitter OK'd closing this PR.
Thanks!
>Unformatted: