NetBSD Problem Report #50150

From hf@spg.tu-darmstadt.de  Mon Aug 17 13:40:03 2015
Return-Path: <hf@spg.tu-darmstadt.de>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 31AAFA57FD
	for <gnats-bugs@gnats.NetBSD.org>; Mon, 17 Aug 2015 13:40:03 +0000 (UTC)
Message-Id: <201508171337.t7HDbl5v002717@Vertatscha.nt.e-technik.tu-darmstadt.de>
Date: Mon, 17 Aug 2015 15:37:47 +0200 (CEST)
From: Hauke Fath <hf@spg.tu-darmstadt.de>
Reply-To: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@NetBSD.org
Cc: Hauke Fath <hf@spg.tu-darmstadt.de>
Subject: wm(4) failure
X-Send-Pr-Version: 3.95

>Number:         50150
>Category:       kern
>Synopsis:       wm(4) failure
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Aug 17 13:45:00 +0000 2015
>Last-Modified:  Fri Jun 17 08:10:00 +0000 2016
>Originator:     Hauke Fath
>Release:        NetBSD 7.0_RC2
>Organization:
Technische Universitaet Darmstadt
>Environment:


System: NetBSD Vertatscha 7.0_RC2 NetBSD 7.0_RC2 (FIFI-$Revision: 1.85 $) #0: Fri Aug 7 16:16:59 CEST 2015 hf@Hochstuhl:/var/obj/netbsd-builds/7/amd64/sys/arch/amd64/compile/FIFI amd64
Architecture: x86_64
Machine: amd64
>Description:

	On a router running netbsd-7, one wm(4) interface

wm5 at pci8 dev 0 function 0: Intel i82574L (rev. 0x00)
wm5: interrupting at ioapic0 pin 19
wm5: Ethernet address 00:25:90:af:ed:cf
makphy1 at wm5 phy 1: Marvell 88E1149 Gigabit PHY, rev. 1
makphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto

	has failed three times so far over a course of two months.
	The wm5 interface is one of two onboard NICs. It has five
	vlans configured on it. The other onboard NIC, wm4, has one
	vlan, but sees considerably more traffic, and performs
	flawlessly. 

	In addition, there is a four-port NIC pci-e card of type
	82571EB, whose interfaces wm0 ... wm3 also perform flawlessly.

	During the failure, the kernel logs a lot of

wm5: device timeout (txfree 4089 txsfree 57 txnext 7)

	style messages. At this point, the interface is down, reporting
	"no carrier":

wm5: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        capabilities=7ff80<TSO4,IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx>
        capabilities=7ff80<TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Rx>
        capabilities=7ff80<TCP6CSUM_Tx,UDP6CSUM_Rx,UDP6CSUM_Tx,TSO6>
        enabled=0
        ec_capabilities=7<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU>
        ec_enabled=3<VLAN_MTU,VLAN_HWTAGGING>
        address: 00:25:90:af:ed:cf
        media: Ethernet autoselect (none)
        status: no carrier
        inet6 fe80::225:90ff:feaf:edcf%wm5 prefixlen 64 detached scopeid 0x6

	Since the router had to come back up as quick as possible, we
	did not take the time to check whether an 'ifconfig wm5 down ;
	ifconfig wm5 up' could have brought the NIC back in line.


>How-To-Repeat:

	Configure a handful of vlans on an Intel i82574L wm(4) NIC,
	run traffic, and wait, I guess?
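
Since reproducing the failure amounts to waiting, a small log filter can flag it when it happens. The following is only a minimal sketch, assuming the kernel message keeps exactly the format quoted in the description; the names `TIMEOUT_RE` and `parse_wm_timeout` are hypothetical and not part of any NetBSD tool.

```python
import re

# Matches kernel messages of the form quoted in this report, e.g.
#   wm5: device timeout (txfree 4089 txsfree 57 txnext 7)
TIMEOUT_RE = re.compile(
    r"(?P<ifname>wm\d+): device timeout "
    r"\(txfree (?P<txfree>\d+) txsfree (?P<txsfree>\d+) txnext (?P<txnext>\d+)\)"
)


def parse_wm_timeout(line):
    """Return the fields of a wm(4) timeout message, or None if no match."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "ifname": fields["ifname"],
        "txfree": int(fields["txfree"]),
        "txsfree": int(fields["txsfree"]),
        "txnext": int(fields["txnext"]),
    }
```

Feeding each line of the system log through `parse_wm_timeout` and alerting on the first non-None result would record when the interface stalled and what the transmit counters were at that moment.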

>Fix:
	None known.

>Audit-Trail:
From: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/50150: wm(4) failure
Date: Tue, 10 May 2016 15:18:50 +0200

 On Mon, 17 Aug 2015 13:45:00 +0000 (UTC), Hauke Fath wrote:
 > 	Since the router had to come back up as quick as possible, we
 > 	did not take the time to check whether an 'ifconfig wm5 down ;
 > 	ifconfig wm5 up' could have brought the NIC back in line.

 ... happened again during nightly backup traffic, and an 'ifconfig wm5
 down ; ifconfig wm5 up' did not fix it.

 Any recent relevant changes to wm(4) that I could give a spin?

 Cheerio,
 hauke

 -- 
      The ASCII Ribbon Campaign                    Hauke Fath
 ()     No HTML/RTF in email            Institut für Nachrichtentechnik
 /\     No Word docs in email                     TU Darmstadt
      Respect for open standards              Ruf +49-6151-16-21344

From: Masanobu SAITOH <msaitoh@execsw.org>
To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
 Hauke Fath <hf@spg.tu-darmstadt.de>
Cc: msaitoh@execsw.org
Subject: Re: kern/50150: wm(4) failure
Date: Wed, 11 May 2016 12:45:46 +0900

 Hi.

 On 2016/05/10 22:20, Hauke Fath wrote:
 >  On Mon, 17 Aug 2015 13:45:00 +0000 (UTC), Hauke Fath wrote:
 >  > 	Since the router had to come back up as quick as possible, we
 >  > 	did not take the time to check whether an 'ifconfig wm5 down ;
 >  > 	ifconfig wm5 up' could have brought the NIC back in line.
 >
 >  ... happened again during nightly backup traffic, and an 'ifconfig wm5
 >  down ; ifconfig wm5 up' did not fix it.
 >
 >  Any recent relevant changes to wm(4) that I could give a spin?

   Some people have reported the same problem, and I think it has not
 been solved yet :-(






 -- 
 -----------------------------------------------
                  SAITOH Masanobu (msaitoh@execsw.org
                                   msaitoh@netbsd.org)

From: Kengo NAKAHARA <k-nakahara@iij.ad.jp>
To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
        netbsd-bugs@netbsd.org, hf@spg.tu-darmstadt.de
Cc: 
Subject: Re: kern/50150: wm(4) failure
Date: Wed, 18 May 2016 16:20:35 +0900

 Hi,

 On 2016/05/11 12:50, Masanobu SAITOH wrote:
 >    Some people reported the same problem and the problem have not solved
 >  yet I think :-(

 I think if_wm.c:r1.397 might fix this problem. If you have time, could
 you try if_wm.c:r1.397 or if_wm.c:r1.398?


 Thanks,

 -- 
 //////////////////////////////////////////////////////////////////////
 Internet Initiative Japan Inc.

 Device Engineering Section,
 IoT Platform Development Department,
 Network Division,
 Technology Unit

 Kengo NAKAHARA <k-nakahara@iij.ad.jp>

From: Hauke Fath <hf@spg.tu-darmstadt.de>
To: Kengo NAKAHARA <k-nakahara@iij.ad.jp>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/50150: wm(4) failure
Date: Wed, 18 May 2016 13:46:42 +0200

 On Wed, 18 May 2016 16:20:35 +0900, Kengo NAKAHARA wrote:
 > I think if_wm.c:r1.397 might fix this problem. If you have time, Could
 > you try if_wm.c:r1.397 or if_wm.c:r1.398 ?

 Hi Kengo,

 that's a huge diff against netbsd-7 - almost 40 k in file size alone.
 Rev. 1.391 does not make it a drop-in replacement for the netbsd-7
 version, either, and I don't have any particular knowledge about the
 wm(4) code.

 Plus, this is about a production machine which cannot run -current.

 In short: Do you think you could narrow down the relevant changes to a
 netbsd-7 compatible patch?

 Cheerio,
 hauke

 -- 
      The ASCII Ribbon Campaign                    Hauke Fath
 ()     No HTML/RTF in email            Institut für Nachrichtentechnik
 /\     No Word docs in email                     TU Darmstadt
      Respect for open standards              Ruf +49-6151-16-21344

From: Kengo NAKAHARA <k-nakahara@iij.ad.jp>
To: hf@spg.tu-darmstadt.de
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/50150: wm(4) failure
Date: Wed, 18 May 2016 21:00:14 +0900

 Hi,

 On 2016/05/18 20:46, Hauke Fath wrote:
 > On Wed, 18 May 2016 16:20:35 +0900, Kengo NAKAHARA wrote:
 >> I think if_wm.c:r1.397 might fix this problem. If you have time, Could
 >> you try if_wm.c:r1.397 or if_wm.c:r1.398 ?
 > 
 > Hi Kengo,
 > 
 > that's a huge diff against netbsd-7 - almost 40 k in file size alone. 
 > Rev. 1.391 does not make it a drop-in replacement for the netbsd-7 
 > version, either, and I don't have any particular knowledge about the 
 > wm(4) code.
 > 
 > Plus, this is about a production machine which cannot run -current.
 > 
 > In short: Do you think you could narrow down the relevant changes to a 
 > netbsd-7 compatible patch?

 Sorry, I assumed you were using -current.
 However, netbsd-7 seems to have a similar problem. As you indicated,
 there is a huge diff between netbsd-7 and -current. So, I will implement
 a new patch for netbsd-7 from scratch. Please give me a few
 days... or a few weeks.


 Thanks,

 -- 
 //////////////////////////////////////////////////////////////////////
 Internet Initiative Japan Inc.

 Device Engineering Section,
 IoT Platform Development Department,
 Network Division,
 Technology Unit

 Kengo NAKAHARA <k-nakahara@iij.ad.jp>

From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/50150: wm(4) failure
Date: Thu, 19 May 2016 09:48:49 +0200

 Just found this recipe:

 https://sourceforge.net/projects/e1000/files/e1000e%20stable/eeprom_fix_82574_or_82583/

 Unfortunately it needs a running Linux distribution.

 On my server it changed the eeprom value from 0x58 to 0x5a.

 --
 J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)
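
The reported change from 0x58 to 0x5a can be checked with a little bit arithmetic. The sketch below only compares the two values quoted above; it makes no claim about which EEPROM word is involved or what the bit means, and the helper name `changed_bits` is hypothetical.

```python
def changed_bits(old, new):
    """Return the positions (0 = least significant) of bits that differ."""
    diff = old ^ new  # XOR leaves a 1 wherever the two values disagree
    return [pos for pos in range(diff.bit_length()) if (diff >> pos) & 1]


before = 0x58  # value reported before running the recipe
after = 0x5A   # value reported afterwards

bits = changed_bits(before, after)
mask = before ^ after
```

Here `bits` is `[1]` and `mask` is `0x02`: the two values differ in a single bit, bit 1, which was clear before and is set afterwards (`0x58 | 0x02 == 0x5a`).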

From: Masanobu SAITOH <msaitoh@execsw.org>
To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
 Hauke Fath <hf@spg.tu-darmstadt.de>
Cc: msaitoh@execsw.org
Subject: Re: kern/50150: wm(4) failure
Date: Thu, 19 May 2016 17:27:47 +0900

 On 2016/05/19 16:50, J. Hannken-Illjes wrote:
 >  Just found this recipe:
 >
 >  https://sourceforge.net/projects/e1000/files/e1000e%20stable/eeprom_fix_82574_or_82583/
 >
 >  Unfortunately it needs a running Linux distribution.
 >
 >  On my server it changed the eeprom value from 0x58 to 0x5a.
 >
 >  --
 >  J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)
 >

 It's 82574 Errata 25 "Dropped Rx Packets"

 See:
 	http://www.intel.co.jp/content/www/jp/ja/embedded/products/networking/82574-gbe-controller-spec-update.html
 	http://nxr.netbsd.org/xref/src/sys/dev/pci/if_wm.c#3493


 -- 
 -----------------------------------------------
                  SAITOH Masanobu (msaitoh@execsw.org
                                   msaitoh@netbsd.org)

From: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/50150: wm(4) failure
Date: Thu, 19 May 2016 15:30:47 +0200

 On Thu, 19 May 2016 09:41:06 +0900, Kengo NAKAHARA wrote:
 > I ensured above patch can apply to if_wm.c:r1.289.2.9 and can build.

 Thanks!

 > Could you try this patch?

 A patched kernel is up and running on the machine.

 The patch didn't break anything. If it indeed deals with a known issue,
 it probably is a pull-up candidate as it is.

 OTOH, I've seen the router's wm3 go south about four times in the last
 six months, so a definitive statement will need some time.

 Cheerio,
 Hauke

 -- 
      The ASCII Ribbon Campaign                    Hauke Fath
 ()     No HTML/RTF in email            Institut für Nachrichtentechnik
 /\     No Word docs in email                     TU Darmstadt
      Respect for open standards              Ruf +49-6151-16-21344

From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/50150: wm(4) failure
Date: Fri, 17 Jun 2016 10:06:01 +0200

 Running this patch on top of if_wm.c 1.289.2.9 I got it again:

 wm0 at pci4 dev 0 function 0: Intel i82574L (rev. 0x00)
 wm0: interrupting at ioapic0 pin 16
 wm0: PCI-Express bus
 wm0: 2048 words (8 address bits) SPI EEPROM, version 2.1.2, Image Unique ID 0000ffff
 wm0: Ethernet address 00:25:90:74:28:0c
 makphy0 at wm0 phy 1: Marvell 88E1149 Gigabit PHY, rev. 1

 ...

 wm0: device timeout (txfree 4095 txsfree 63 txnext 265)

 --
 J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)

>Unformatted:
