NetBSD Problem Report #53294

From martin@duskware.de  Wed May 16 14:14:14 2018
Return-Path: <martin@duskware.de>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id CCDD07A157
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 16 May 2018 14:14:13 +0000 (UTC)
Message-Id: <20180516141405.2EB3B5CC8BC@emmas.aprisoft.de>
Date: Wed, 16 May 2018 16:14:05 +0200 (CEST)
From: martin@NetBSD.org
Reply-To: martin@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: ixg(4) stops receiving pkts
X-Send-Pr-Version: 3.95

>Number:         53294
>Category:       kern
>Synopsis:       ixg(4) stops receiving pkts
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    msaitoh
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed May 16 14:15:00 +0000 2018
>Closed-Date:    Wed Aug 01 05:27:00 +0000 2018
>Last-Modified:  Wed Aug 01 05:27:00 +0000 2018
>Originator:     Martin Husemann
>Release:        NetBSD 8.99.17
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD night-owl.duskware.de 8.99.17 NetBSD 8.99.17 (NIGHT-OWL) #597: Sun May 13 23:03:45 CEST 2018 martin@night-owl.duskware.de:/usr/src/sys/arch/amd64/compile/NIGHT-OWL amd64
Architecture: x86_64
Machine: amd64
>Description:

The ixg(4) driver stops receiving packets after heavy network
traffic. This has been observed on some TNF machines and has also
been reported by a developer.

This is just a placeholder PR, so I can add it to the showstopper list.

>How-To-Repeat:
n/a

>Fix:
n/a

>Release-Note:

>Audit-Trail:
From: "SAITOH Masanobu" <msaitoh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/53294 CVS commit: src/sys/dev/pci/ixgbe
Date: Wed, 30 May 2018 08:35:27 +0000

 Module Name:	src
 Committed By:	msaitoh
 Date:		Wed May 30 08:35:27 UTC 2018

 Modified Files:
 	src/sys/dev/pci/ixgbe: ixgbe.c ixv.c

 Log Message:
  Clear que->disabled_count in {ixgbe,ixv}_init_locked(). Without this,
 the interrupt mask state and EIMS may mismatch, and if_init doesn't recover
 from the TX/RX stall problem.

  This change itself doesn't fix PR#53294.


 To generate a diff of this commit:
 cvs rdiff -u -r1.156 -r1.157 src/sys/dev/pci/ixgbe/ixgbe.c
 cvs rdiff -u -r1.101 -r1.102 src/sys/dev/pci/ixgbe/ixv.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

Responsible-Changed-From-To: kern-bug-people->msaitoh
Responsible-Changed-By: msaitoh@NetBSD.org
Responsible-Changed-When: Mon, 04 Jun 2018 03:39:16 +0000
Responsible-Changed-Why:
mine.


State-Changed-From-To: open->feedback
State-Changed-By: msaitoh@NetBSD.org
State-Changed-When: Mon, 04 Jun 2018 03:39:16 +0000
State-Changed-Why:
This problem should be fixed in ixgbe.c rev. 1.160. Could you test this change on netbsd-8?


From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/53294 CVS commit: [netbsd-8] src/sys/dev/pci/ixgbe
Date: Sat, 9 Jun 2018 14:59:43 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Sat Jun  9 14:59:43 UTC 2018

 Modified Files:
 	src/sys/dev/pci/ixgbe [netbsd-8]: ix_txrx.c ixgbe.c ixgbe.h
 	    ixgbe_netbsd.c ixgbe_netbsd.h ixgbe_osdep.h ixv.c

 Log Message:
 Pull up following revision(s) (requested by msaitoh in ticket #864):

 	sys/dev/pci/ixgbe/ix_txrx.c			1.40-1.47 (patch)
 	sys/dev/pci/ixgbe/ixgbe.c			1.148,1.149,1.151,
 							1.152,1.154,
 							1.155,1.157-1.160 (patch)
 	sys/dev/pci/ixgbe/ixgbe.h			1.43,1.44,1.46,1.49 (patch)
 	sys/dev/pci/ixgbe/ixgbe_netbsd.c		1.7 (patch)
 	sys/dev/pci/ixgbe/ixgbe_netbsd.h		1.8 (patch)
 	sys/dev/pci/ixgbe/ixgbe_osdep.h			1.22 (patch)
 	sys/dev/pci/ixgbe/ixv.c				1.100-1.104 (patch)
 	sys/dev/pci/ixgbe/ixv.c				1.94,1.95,1.99 (patch)

  Remove unused structure entries. No functional change.
  -
  Remove unused IXGBE_FC_HI and IXGBE_FC_LO. The watermark of the flow control
 is automatically calculated from the size of the packet buffer.
  -
  Use ixgbe_eitr_write() when writing the EITR for the link interrupt, as is
 done for the queues' EITR, to write the register safely. This change is less
 important than the one for the queues' EITR because the link's EITR is
 written in if_init().
  -
  Don't free and reallocate bus_dmamem when it's not required. Currently,
 the watchdog timer is completely broken and never fires (it's from FreeBSD
 (pre iflib)). If that problem were fixed and the watchdog fired, ixgbe_init()
 would always call ixgbe_jcl_reinit(), which causes a panic. The reason is
 that ixgbe_local_timer1() (which includes the watchdog function) is a softint
 and ixgbe_jcl_reinit() calls bus_dmamem*() functions. bus_dmamem*() can't be
 called from interrupt context.

  One way to prevent the panic is to use a workqueue for the timer, but that
 is not a small change. (I'll do it in the future.)

  Another way is not to reallocate dmamem if it's not required. If neither the
 MTU (rx_mbuf_sz in reality) nor the number of RX descriptors has changed,
 it's not required to call bus_dmamem_{unmap,free}(). Even if we use a
 workqueue, this change saves time in ixgbe_init().

  I have code to fix the broken watchdog timer, but it sometimes causes a
 watchdog timeout, so I haven't committed it yet.
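
The "don't reallocate unless something changed" idea described above can be
sketched as a small user-space model. This is only an illustration under
stated assumptions: struct rx_pool_model, rx_pool_needs_realloc() and
rx_pool_reinit() are hypothetical names, and no real bus_dmamem*() call is
made. It is not the code that was committed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical record of the parameters the RX cluster pool was last
 * built with. None of these names come from the driver.
 */
struct rx_pool_model {
	bool   allocated;     /* has the pool been built at least once? */
	size_t buf_size;      /* models rx_mbuf_sz */
	int    nbuf;          /* models the number of RX descriptors */
	int    realloc_count; /* how often the slow path was taken */
};

/* True only if the pool is missing or one of its parameters changed. */
static bool
rx_pool_needs_realloc(const struct rx_pool_model *p, size_t buf_size, int nbuf)
{
	return !p->allocated || p->buf_size != buf_size || p->nbuf != nbuf;
}

/* Re-initialize the pool, skipping the free/allocate cycle when possible. */
static void
rx_pool_reinit(struct rx_pool_model *p, size_t buf_size, int nbuf)
{
	assert(buf_size > 0 && nbuf > 0);

	if (!rx_pool_needs_realloc(p, buf_size, nbuf))
		return; /* fast path: keep the existing memory */

	/* A real driver would unmap/free and then allocate DMA memory here. */
	p->allocated = true;
	p->buf_size = buf_size;
	p->nbuf = nbuf;
	p->realloc_count++;
}
```

In this model, calling rx_pool_reinit() twice with the same buffer size and
descriptor count takes the slow path only once, so a re-init triggered from a
context that must not allocate never reaches the allocation step unless a
parameter really changed.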
  -
 Count some registers correctly:
 - QPRDC register is only for 82599 and newer.
 - Count IXGBE_QPRDC, PX{ON,OFF}{T,R}XC[NT].
  The TQSMR register is not for receiving but for transmitting, so move the
 initialization from ixgbe_initialize_receive_units() to
 ixgbe_initialize_transmit_units(). No functional change.
  -
  Whitespace fix. No functional change.
  -
  Add rxd_nxck (Receive Descriptor next to check) read-only sysctl.
  Don't check IFF_RUNNING in ixgbe_rxeof(). Breaking out and leaving a
 descriptor with the DD bit set is worse than just processing the entry. It's
 also racy to check IFF_RUNNING in rxeof(). If you'd like to strictly obey
 IFF_RUNNING, it would be better to do it in the upper layer.
  Same as DragonFly (a part of 79251f5ebe4cf9dd2f3e6aed590e09d756d39922).
  Add "bool txr_no_space" for TX descriptor shortage. Use it like IFF_OACTIVE.
  Clear que->disabled_count in {ixgbe,ixv}_init_locked(). Without this,
 the interrupt mask state and EIMS may mismatch, and if_init doesn't recover
 from the TX/RX stall problem.
  This change itself doesn't fix PR#53294.
  -
  Add hw.ixgN.debug sysctl. "sysctl -w hw.ixgN.debug=1" dumps some registers
 to console.
  -
 Constify several variables in ixgbe/ so that they land in .rodata (1038
 bytes).
  -
  Don't call ixgbe_rearm_queues() in ixgbe_local_timer1().
    ixgbe_enable_queue() and ixgbe_disable_queue() try to enable/disable the
   queue interrupt safely. They keep an internal counter. When a queue's MSI-X
   interrupt is received, ixgbe_msix_que() is called (IPL_NET). This function
   disables the queue's interrupt with ixgbe_disable_queue() and issues a
   softint. ixgbe_handle_que() is called by the softint (IPL_SOFTNET),
   processes TX and RX, and calls ixgbe_enable_queue() at the end.

    ixgbe_local_timer1() is a callout and always runs on CPU 0
   (IPL_SOFTCLOCK). When ixgbe_rearm_queues() is called, an MSI-X interrupt is
   issued for a specific queue. Its CPU may not be CPU 0. If this interrupt's
   ixgbe_msix_que() is called and softint_schedule() is called before the last
   softint's softint_execute() has been called, the softint_schedule() fails
   because of SOFTINT_PENDING. This breaks
   ixgbe_{enable,disable}_queue()'s internal counter.
    ixgbe_local_timer1() is written not to call ixgbe_rearm_queues() if
   the interrupt is disabled, but it is called anyway because of an unknown
   bug or a race.

   One solution to avoid this problem is not to use the internal counter,
   but that is a little difficult. Another solution is to stop using
   ixgbe_rearm_queues() at all. Essentially, ixgbe_rearm_queues() is not
   required (it was added in ixgbe.c rev. 1.43 (2016/12/01)).
   ixgbe_rearm_queues() helps with the lost interrupt problem, but I've never
   seen it apart from the ixgbe_rearm_queues() problem.
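
The counter breakage described above can be reproduced with a small
user-space model. This is a sketch under stated assumptions, not the driver's
code: struct queue_model and all model_*() functions are hypothetical, a
boolean stands in for SOFTINT_PENDING, and the nesting rule (mask on the
first disable, unmask when the count returns to zero) is assumed from the
description of the internal counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of one queue; not the driver's real code. */
struct queue_model {
	int  disabled_count;  /* nesting counter of disable calls */
	bool masked;          /* true while the queue interrupt is masked */
	bool softint_pending; /* stands in for SOFTINT_PENDING */
};

/* Mask the interrupt on the first disable; later calls only nest. */
static void
model_disable_queue(struct queue_model *q)
{
	if (q->disabled_count++ == 0)
		q->masked = true;
}

/* Unmask only when the outermost disable is undone. */
static void
model_enable_queue(struct queue_model *q)
{
	assert(q->disabled_count > 0);
	if (--q->disabled_count == 0)
		q->masked = false;
}

/* Hard interrupt handler: disable the queue, then request a softint. */
static void
model_msix_que(struct queue_model *q)
{
	model_disable_queue(q);
	if (!q->softint_pending)
		q->softint_pending = true;
	/* Otherwise the request is dropped, but the counter was still bumped. */
}

/* Softint handler: TX/RX processing is omitted; re-enable exactly once. */
static void
model_handle_que(struct queue_model *q)
{
	q->softint_pending = false;
	model_enable_queue(q);
}

/* Models clearing the counter when the interface is re-initialized. */
static void
model_init_locked(struct queue_model *q)
{
	q->disabled_count = 0;
	q->masked = false;
}

/* Returns the counter value left over after one dropped softint request. */
static int
model_lost_softint(struct queue_model *q)
{
	model_msix_que(q);   /* regular interrupt */
	model_msix_que(q);   /* extra interrupt before the softint has run */
	model_handle_que(q); /* the single softint run */
	return q->disabled_count;
}
```

Starting from a zeroed queue_model, model_lost_softint() leaves the counter
at 1 with the interrupt still masked: two disables were paired with only one
enable, so in this model the queue never receives another interrupt.
model_init_locked() shows why clearing the counter on re-initialization lets
if_init recover from that state, although it does not prevent the mismatch
from happening in the first place.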

 XXX pullup-8.


 To generate a diff of this commit:
 cvs rdiff -u -r1.24.2.10 -r1.24.2.11 src/sys/dev/pci/ixgbe/ix_txrx.c
 cvs rdiff -u -r1.88.2.19 -r1.88.2.20 src/sys/dev/pci/ixgbe/ixgbe.c
 cvs rdiff -u -r1.24.6.11 -r1.24.6.12 src/sys/dev/pci/ixgbe/ixgbe.h
 cvs rdiff -u -r1.6 -r1.6.2.1 src/sys/dev/pci/ixgbe/ixgbe_netbsd.c
 cvs rdiff -u -r1.7 -r1.7.6.1 src/sys/dev/pci/ixgbe/ixgbe_netbsd.h
 cvs rdiff -u -r1.17.6.3 -r1.17.6.4 src/sys/dev/pci/ixgbe/ixgbe_osdep.h
 cvs rdiff -u -r1.56.2.16 -r1.56.2.17 src/sys/dev/pci/ixgbe/ixv.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: feedback->closed
State-Changed-By: msaitoh@NetBSD.org
State-Changed-When: Wed, 01 Aug 2018 05:27:00 +0000
State-Changed-Why:
Fixed and pulled up to netbsd-8.
Thanks.


>Unformatted:
