NetBSD Problem Report #52263

From hf@spg.tu-darmstadt.de  Tue May 30 16:00:17 2017
Return-Path: <hf@spg.tu-darmstadt.de>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id EF11C7A173
	for <gnats-bugs@gnats.NetBSD.org>; Tue, 30 May 2017 16:00:16 +0000 (UTC)
Message-Id: <201705301551.v4UFpHLH004124@Zinnenwand.nt.e-technik.tu-darmstadt.de>
Date: Tue, 30 May 2017 17:51:17 +0200 (CEST)
From: Hauke Fath <hf@spg.tu-darmstadt.de>
Reply-To: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@NetBSD.org
Cc: Hauke Fath <hf@spg.tu-darmstadt.de>
Subject: Frequent ixg(4) panic 
X-Send-Pr-Version: 3.95

>Number:         52263
>Category:       kern
>Synopsis:       Frequent ixg(4) panic in ixgbe_rxeof()
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue May 30 16:05:00 +0000 2017
>Closed-Date:    Fri Oct 11 09:21:27 +0000 2019
>Last-Modified:  Fri Oct 11 09:21:27 +0000 2019
>Originator:     Hauke Fath
>Release:        NetBSD 7.99.73
>Organization:
Technische Universitaet Darmstadt
>Environment:


System: NetBSD Zinnenwand 7.99.73 NetBSD 7.99.73 (FIFI-$Revision$) #0: Mon May 29 17:00:08 CEST 2017 hf@Hochstuhl:/var/obj/netbsd-builds/developer/amd64/sys/arch/amd64/compile/FIFI amd64
Architecture: x86_64
Machine: amd64
>Description:

	A pf & carp router under current (7.99.73 here, but happens in
	yesterday's .75, too) panics frequently with

NetBSD 7.99.73 (FIFI-$Revision$) #2: Fri May 26 15:51:24 CEST 2017
        hf@Hochstuhl:/var/obj/netbsd-builds/developer/amd64/sys/arch/amd64/compile/FIFI

[...]

fatal protection fault in supervisor mode
trap type 4 code 0 rip 0xffffffff8029646d cs 0x8 rflags 0x10202 cr2 0xffff80008f799000 ilevel 0x8 rsp 0xfffffe810e8aeeb0
curlwp 0xfffffe810e89d4c0 pid 0.18 lowest kstack 0xfffffe810e8ab2c0
panic: trap
cpu1: Begin traceback...
vpanic() at netbsd:vpanic+0x140
snprintf() at netbsd:snprintf
trap() at netbsd:trap+0xbab
--- trap (number 4) ---
ixgbe_rxeof() at netbsd:ixgbe_rxeof+0x523
ixgbe_handle_que() at netbsd:ixgbe_handle_que+0x98
softint_dispatch() at netbsd:softint_dispatch+0xd4
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfffffe810e8aeff0
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
0:
cpu1: End traceback...
rebooting...
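
	As a worked example of reading the trace above: DDB prints the
	faulting rip and, separately, the symbol plus offset
	(ixgbe_rxeof+0x523). Assuming the rip in the trap line belongs
	to that frame, subtracting the offset gives the start address of
	the function in this particular kernel image, which is the
	address to look for in a disassembly. The sketch below only does
	that subtraction; func_start() is a hypothetical helper name,
	not part of any NetBSD tool.

```c
#include <stdint.h>

/*
 * Start address of a function, given the faulting instruction pointer
 * and the "+0x..." offset that DDB printed next to the symbol name.
 * Hypothetical helper for illustration only.
 */
static uint64_t
func_start(uint64_t rip, uint64_t offset)
{
	return rip - offset;
}

/* Values copied from the first trace in this report. */
static const uint64_t trace_rip = 0xffffffff8029646dULL; /* "rip" in the trap line */
static const uint64_t trace_off = 0x523;                 /* "ixgbe_rxeof+0x523" */
```

	With these values, func_start(trace_rip, trace_off) evaluates to
	0xffffffff80295f4a.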

	According to objdump(1) probing, the relevant instruction is
	at sys/dev/pci/ixgbe/ix_txrx.c:1933

   1922                         /*
   1923                          * Optimize.  This might be a small packet,
   1924                          * maybe just a TCP ACK.  Do a fast copy that
   1925                          * is cache aligned into a new mbuf, and
   1926                          * leave the old mbuf+cluster for re-use.
   1927                          */
   1928                         if (eop && len <= IXGBE_RX_COPY_LEN) {
   1929                                 sendmp = m_gethdr(M_NOWAIT, MT_DATA);
   1930                                 if (sendmp != NULL) {
   1931                                         sendmp->m_data +=
   1932                                             IXGBE_RX_COPY_ALIGN;
   1933                                         ixgbe_bcopy(mp->m_data,
   1934                                             sendmp->m_data, len);
   1935                                         sendmp->m_len = len;
   1936                                         rxr->rx_copies.ev_count++;
   1937                                         rbuf->flags |= IXGBE_RX_COPY;
   1938                                 }
   1939                         }

	I tried to KASSERT() for zero pointers, but it wasn't that
	easy.
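
	For illustration, here is a user-space model of the small-packet
	copy quoted above, with the kind of checks one might want to
	assert. It is a sketch, not driver code: struct pkt,
	rx_copy_small() and the three constants are hypothetical
	stand-ins (the real IXGBE_RX_COPY_LEN and IXGBE_RX_COPY_ALIGN
	are derived from mbuf sizes). It also shows that a NULL check
	alone does not catch a non-NULL but out-of-range source; that is
	a general observation, not a diagnosis of this panic.

```c
#include <stddef.h>
#include <string.h>

#define BUF_SIZE      256 /* hypothetical buffer size */
#define RX_COPY_LEN   160 /* hypothetical stand-in for IXGBE_RX_COPY_LEN */
#define RX_COPY_ALIGN 32  /* hypothetical stand-in for IXGBE_RX_COPY_ALIGN */

/* Minimal stand-in for an mbuf: storage plus an offset/length pair. */
struct pkt {
	unsigned char store[BUF_SIZE];
	size_t        off; /* start of payload inside store, like m_data */
	size_t        len; /* valid payload bytes, like m_len */
};

/*
 * Copy a small packet from src into dst at an aligned offset.
 * Returns 1 if the copy was done, 0 if any check failed.
 */
static int
rx_copy_small(const struct pkt *src, struct pkt *dst, size_t len)
{
	/* The "zero pointer" checks. */
	if (src == NULL || dst == NULL)
		return 0;
	/* Only small packets take this path. */
	if (len > RX_COPY_LEN)
		return 0;
	/* The source range must lie entirely inside its storage. */
	if (src->off > BUF_SIZE || len > BUF_SIZE - src->off)
		return 0;
	/* The destination range must fit as well. */
	if (RX_COPY_ALIGN + len > BUF_SIZE)
		return 0;

	dst->off = RX_COPY_ALIGN;
	memcpy(dst->store + dst->off, src->store + src->off, len);
	dst->len = len;
	return 1;
}
```

	A source whose offset or length points past its storage is
	rejected by the range check even though no pointer is NULL.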

	Sometimes I also see 

fatal protection fault in supervisor mode
trap type 4 code 0 rip 0xffffffff8061e443 cs 0x8 rflags 0x10202 cr2 0x6b1e00 ilevel 0x4 rsp 0xfffffe810e913ef0
curlwp 0xfffffe810e904540 pid 0.30 lowest kstack 0xfffffe810e9102c0
panic: trap
cpu3: Begin traceback...
vpanic() at netbsd:vpanic+0x140
snprintf() at netbsd:snprintf
trap() at netbsd:trap+0xbab
--- trap (number 4) ---
ether_input() at netbsd:ether_input+0x83
if_percpuq_softint() at netbsd:if_percpuq_softint+0x5b
softint_dispatch() at netbsd:softint_dispatch+0xd4
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfffffe810e913ff0
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
f557b81a7cde3fa1:
cpu3: End traceback...
rebooting...


>How-To-Repeat:

	Run serious amounts of traffic over an ixg(4) equipped pf/carp
	router machine - 9 vlans here.

	Happens once every few hours here, so I can provide details,
	and/or try things easily.


>Fix:
	I'd love to.



>Release-Note:

>Audit-Trail:
From: Masanobu SAITOH <msaitoh@execsw.org>
To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc: msaitoh@execsw.org
Subject: Re: kern/52263: Frequent ixg(4) panic
Date: Thu, 10 Aug 2017 13:54:32 +0900

 Hi.

 On 2017/05/31 1:05, Hauke Fath wrote:
 >> Number:         52263
 >> Category:       kern
 >> Synopsis:       Frequent ixg(4) panic in ixgbe_rxeof()
 >> Confidential:   no
 >> Severity:       critical
 >> Priority:       high
 >> Responsible:    kern-bug-people
 >> State:          open
 >> Class:          sw-bug
 >> Submitter-Id:   net
 >> Arrival-Date:   Tue May 30 16:05:00 +0000 2017
 >> Originator:     Hauke Fath
 >> Release:        NetBSD 7.99.73
 >> Organization:
 > Technische Universitaet Darmstadt
 >> Environment:
 > 	
 > 	
 > System: NetBSD Zinnenwand 7.99.73 NetBSD 7.99.73 (FIFI-$Revision$) #0: Mon May 29 17:00:08 CEST 2017 hf@Hochstuhl:/var/obj/netbsd-builds/developer/amd64/sys/arch/amd64/compile/FIFI amd64
 > Architecture: x86_64
 > Machine: amd64
 >> Description:
 > 
 > 	A pf & carp router under current (7.99.73 here, but happens in
 > 	yesterday's .75, too) panics frequently with
 > 
 > NetBSD 7.99.73 (FIFI-$Revision$) #2: Fri May 26 15:51:24 CEST 2017
 >          hf@Hochstuhl:/var/obj/netbsd-builds/developer/amd64/sys/arch/amd64/compile/FIFI
 > 
 > [...]
 > 
 > fatal protection fault in supervisor mode
 > trap type 4 code 0 rip 0xffffffff8029646d cs 0x8 rflags 0x10202 cr2 0xffff80008f799000 ilevel 0x8 rsp 0xfffffe810e8aeeb0
 > curlwp 0xfffffe810e89d4c0 pid 0.18 lowest kstack 0xfffffe810e8ab2c0
 > panic: trap
 > cpu1: Begin traceback...
 > vpanic() at netbsd:vpanic+0x140
 > snprintf() at netbsd:snprintf
 > trap() at netbsd:trap+0xbab
 > --- trap (number 4) ---
 > ixgbe_rxeof() at netbsd:ixgbe_rxeof+0x523
 > ixgbe_handle_que() at netbsd:ixgbe_handle_que+0x98
 > softint_dispatch() at netbsd:softint_dispatch+0xd4
 > DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfffffe810e8aeff0
 > Xsoftintr() at netbsd:Xsoftintr+0x4f
 > --- interrupt ---
 > 0:
 > cpu1: End traceback...
 > rebooting...
 > 
 > 	According to objdump(1) probing, the relevant instruction is
 > 	at sys/dev/pci/ixgbe/ix_txrx.c:1933
 > 
 >     1922                         /*
 >     1923                          * Optimize.  This might be a small packet,
 >     1924                          * maybe just a TCP ACK.  Do a fast copy that
 >     1925                          * is cache aligned into a new mbuf, and
 >     1926                          * leave the old mbuf+cluster for re-use.
 >     1927                          */
 >     1928                         if (eop && len <= IXGBE_RX_COPY_LEN) {
 >     1929                                 sendmp = m_gethdr(M_NOWAIT, MT_DATA);
 >     1930                                 if (sendmp != NULL) {
 >     1931                                         sendmp->m_data +=
 >     1932                                             IXGBE_RX_COPY_ALIGN;
 >     1933                                         ixgbe_bcopy(mp->m_data,
 >     1934                                             sendmp->m_data, len);
 >     1935                                         sendmp->m_len = len;
 >     1936                                         rxr->rx_copies.ev_count++;
 >     1937                                         rbuf->flags |= IXGBE_RX_COPY;
 >     1938                                 }
 >     1939                         }
 > 
 > 	I tried to KASSERT() for zero pointers, but it wasn't that
 > 	easy.
 > 
 > 	Sometimes I also see
 > 
 > fatal protection fault in supervisor mode
 > trap type 4 code 0 rip 0xffffffff8061e443 cs 0x8 rflags 0x10202 cr2 0x6b1e00 ilevel 0x4 rsp 0xfffffe810e913ef0
 > curlwp 0xfffffe810e904540 pid 0.30 lowest kstack 0xfffffe810e9102c0
 > panic: trap
 > cpu3: Begin traceback...
 > vpanic() at netbsd:vpanic+0x140
 > snprintf() at netbsd:snprintf
 > trap() at netbsd:trap+0xbab
 > --- trap (number 4) ---
 > ether_input() at netbsd:ether_input+0x83
 > if_percpuq_softint() at netbsd:if_percpuq_softint+0x5b
 > softint_dispatch() at netbsd:softint_dispatch+0xd4
 > DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfffffe810e913ff0
 > Xsoftintr() at netbsd:Xsoftintr+0x4f
 > --- interrupt ---
 > f557b81a7cde3fa1:
 > cpu3: End traceback...
 > rebooting...
 > 
 > 
 >> How-To-Repeat:
 > 
 > 	Run serious amounts of traffic over an ixg(4) equipped pf/carp
 > 	router machine - 9 vlans here.

   Does this problem still occur?

 I suspect this is not ixg(4)'s bug but pf's bug.
 Have you ever tested without pf?

   The following change avoids using the optimization, but
 it won't solve your machine's problem.

 ------------------
 Index: ix_txrx.c
 ===================================================================
 RCS file: /cvsroot/src/sys/dev/pci/ixgbe/ix_txrx.c,v
 retrieving revision 1.27
 diff -u -p -r1.27 ix_txrx.c
 --- ix_txrx.c	13 Jun 2017 09:37:22 -0000	1.27
 +++ ix_txrx.c	10 Aug 2017 04:40:59 -0000
 @@ -1915,6 +1915,7 @@ ixgbe_rxeof(struct ix_queue *que)
   			 * is cache aligned into a new mbuf, and
   			 * leave the old mbuf+cluster for re-use.
   			 */
 +#if 0
   			if (eop && len <= IXGBE_RX_COPY_LEN) {
   				sendmp = m_gethdr(M_NOWAIT, MT_DATA);
   				if (sendmp != NULL) {
 @@ -1927,6 +1928,7 @@ ixgbe_rxeof(struct ix_queue *que)
   					rbuf->flags |= IXGBE_RX_COPY;
   				}
   			}
 +#endif
   			if (sendmp == NULL) {
   				rbuf->buf = rbuf->fmp = NULL;
   				sendmp = mp;
 ------------------
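
 The effect of the patch above on the receive path can be modelled
 roughly as follows. This is a sketch under assumptions, not the
 driver's code: rx_choose_path(), enum rx_path and the RX_COPY_LEN
 value are hypothetical, and alloc_ok stands for m_gethdr() returning
 a non-NULL mbuf. With the copy block compiled out, sendmp is assumed
 to stay NULL, so every packet falls through to the hand-off branch
 that follows in the diff context.

```c
#include <stdbool.h>
#include <stddef.h>

#define RX_COPY_LEN 160 /* hypothetical stand-in for IXGBE_RX_COPY_LEN */

enum rx_path {
	RX_PATH_COPY,    /* small packet copied into a fresh mbuf */
	RX_PATH_HANDOFF  /* original mbuf+cluster passed up the stack */
};

/*
 * Decide which path a received buffer takes.
 * copy_enabled == false models the "#if 0" in the patch.
 */
static enum rx_path
rx_choose_path(bool copy_enabled, bool eop, size_t len, bool alloc_ok)
{
	bool copied = false; /* models "sendmp != NULL after the copy block" */

	if (copy_enabled && eop && len <= RX_COPY_LEN)
		copied = alloc_ok;

	/* Mirrors "if (sendmp == NULL) { ... sendmp = mp; }". */
	return copied ? RX_PATH_COPY : RX_PATH_HANDOFF;
}
```

 In this model the hand-off path is also what a small packet takes
 when the allocation fails, so the patch removes a fast path rather
 than adding a new one.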


 > 	Happens once every few hours here, so I can provide details,
 > 	and/or try things easily.
 > 	
 > 	
 >> Fix:
 > 	I'd love to.
 > 
 > 	
 > 
 >> Unformatted:
 >   	
 >   	
 > 


 -- 
 -----------------------------------------------
                  SAITOH Masanobu (msaitoh@execsw.org
                                   msaitoh@netbsd.org)

From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: Masanobu SAITOH <msaitoh@execsw.org>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/52263: Frequent ixg(4) panic
Date: Thu, 10 Aug 2017 08:40:22 +0200

 On Thu, 10 Aug 2017 13:54:32 +0900, Masanobu SAITOH wrote:
 >>
 >> 	Run serious amounts of traffic over an ixg(4) equipped pf/carp
 >> 	router machine - 9 vlans here.
 >
 >  Does this problem still occur?

 It did, right until I reinstalled the router pair with FreeBSD.

 > I suspect this is not ixg(4)'s bug but pf's bug.

 That is well possible - somewhere in the Bermuda triangle of pf and
 carp.

 > Have you ever tested without pf?

 No, the machines are institute network routers, and I unfortunately
 don't have any spare machines with 10 GBE interfaces.

 Cheerio,
 hauke

 --
 Hauke Fath                        <hauke@Espresso.Rhein-Neckar.DE>
 Ernst-Ludwig-Straße 15
 64625 Bensheim
 Germany

From: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/52263: Frequent ixg(4) panic
Date: Thu, 10 Aug 2017 10:32:50 +0200

 On Thu, 10 Aug 2017 04:55:00 +0000 (UTC), Masanobu SAITOH wrote:
 >  I suspect this is not ixg(4)'s bug but pf's bug.
 >  Have you ever tested without pf?

 Comes to mind - in my desperation, I set up a pair of Xen DomUs as
 failover routers, figuring I'd split ixg(4) and pf(4) network
 processing. The Dom0 was fine; the DomUs crashed less often, but they
 did.

 Gave me half the peak network throughput of the native installation,
 btw.

 Cheerio,
 hauke

 --
      The ASCII Ribbon Campaign                    Hauke Fath
 ()     No HTML/RTF in email            Institut für Nachrichtentechnik
 /\     No Word docs in email                     TU Darmstadt
      Respect for open standards              Ruf +49-6151-16-21344

State-Changed-From-To: open->feedback
State-Changed-By: msaitoh@NetBSD.org
State-Changed-When: Fri, 11 Oct 2019 08:55:36 +0000
State-Changed-Why:
 We have fixed a lot of ixg(4)'s bugs since May 2017. If the problem was
caused by ixg(4), it might be fixed. Did you see this problem recently?
Is it OK to close?


From: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org,
        msaitoh@netbsd.org
Subject: Re: kern/52263 (Frequent ixg(4) panic in ixgbe_rxeof())
Date: Fri, 11 Oct 2019 11:04:14 +0200

 On Fri, 11 Oct 2019 08:55:36 +0000 (UTC), msaitoh@netbsd.org wrote:
 >  We have fixed a lot of ixg(4)'s bugs since May 2017. If the problem was
 > caused by ixg(4), it might be fixed.

 Thanks for looking at the issue.

 > Did you see this problem recently?

 We do not run NetBSD on our routers any more, so I have no way to check.

 > Is it OK to close?

 If you are confident the problem is fixed, that is good enough for me.
 :)

 Cheerio,
 hauke

 --
      The ASCII Ribbon Campaign                    Hauke Fath
 ()     No HTML/RTF in email            Institut für Nachrichtentechnik
 /\     No Word docs in email                     TU Darmstadt
      Respect for open standards              Ruf +49-6151-16-21344

State-Changed-From-To: feedback->closed
State-Changed-By: msaitoh@NetBSD.org
State-Changed-When: Fri, 11 Oct 2019 09:21:27 +0000
State-Changed-Why:
The submitter agreed to close.
Thanks!


>Unformatted:
