NetBSD Problem Report #50150
From hf@spg.tu-darmstadt.de Mon Aug 17 13:40:03 2015
Return-Path: <hf@spg.tu-darmstadt.de>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
by mollari.NetBSD.org (Postfix) with ESMTPS id 31AAFA57FD
for <gnats-bugs@gnats.NetBSD.org>; Mon, 17 Aug 2015 13:40:03 +0000 (UTC)
Message-Id: <201508171337.t7HDbl5v002717@Vertatscha.nt.e-technik.tu-darmstadt.de>
Date: Mon, 17 Aug 2015 15:37:47 +0200 (CEST)
From: Hauke Fath <hf@spg.tu-darmstadt.de>
Reply-To: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@NetBSD.org
Cc: Hauke Fath <hf@spg.tu-darmstadt.de>
Subject: wm(4) failure
X-Send-Pr-Version: 3.95
>Number: 50150
>Category: kern
>Synopsis: wm(4) failure
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Aug 17 13:45:00 +0000 2015
>Last-Modified: Fri Jun 17 08:10:00 +0000 2016
>Originator: Hauke Fath
>Release: NetBSD 7.0_RC2
>Organization:
Technische Universitaet Darmstadt
>Environment:
System: NetBSD Vertatscha 7.0_RC2 NetBSD 7.0_RC2 (FIFI-$Revision: 1.85 $) #0: Fri Aug 7 16:16:59 CEST 2015 hf@Hochstuhl:/var/obj/netbsd-builds/7/amd64/sys/arch/amd64/compile/FIFI amd64
Architecture: x86_64
Machine: amd64
>Description:
On a router running netbsd-7, one wm(4) interface
wm5 at pci8 dev 0 function 0: Intel i82574L (rev. 0x00)
wm5: interrupting at ioapic0 pin 19
wm5: Ethernet address 00:25:90:af:ed:cf
makphy1 at wm5 phy 1: Marvell 88E1149 Gigabit PHY, rev. 1
makphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
has failed three times so far over a course of two months.
The wm5 interface is one of two onboard NICs. It has five
vlans configured on it. The other onboard NIC, wm4, has one
vlan, but sees considerably more traffic, and performs
flawlessly.
In addition, there is a four-port NIC pci-e card of type
82571EB, whose interfaces wm0 ... wm3 also perform flawlessly.
During the failure, the kernel logs a lot of
wm5: device timeout (txfree 4089 txsfree 57 txnext 7)
style messages. At this point, the interface is down, reporting
"no carrier":
wm5: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
capabilities=7ff80<TSO4,IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx>
capabilities=7ff80<TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Rx>
capabilities=7ff80<TCP6CSUM_Tx,UDP6CSUM_Rx,UDP6CSUM_Tx,TSO6>
enabled=0
ec_capabilities=7<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU>
ec_enabled=3<VLAN_MTU,VLAN_HWTAGGING>
address: 00:25:90:af:ed:cf
media: Ethernet autoselect (none)
status: no carrier
inet6 fe80::225:90ff:feaf:edcf%wm5 prefixlen 64 detached scopeid 0x6
Since the router had to come back up as quickly as possible, we
did not take the time to check whether an 'ifconfig wm5 down ;
ifconfig wm5 up' could have brought the NIC back in line.
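
For anyone tracking how often this happens, the "device timeout" messages quoted above can be picked out of saved kernel log text with a short script. The following is only a sketch: the regular expression is derived from the two sample messages in this report, and the function name parse_timeouts is a hypothetical helper, not part of any NetBSD tool.

```python
import re

# Pattern modelled on the wm(4) watchdog message quoted in this report, e.g.
#   wm5: device timeout (txfree 4089 txsfree 57 txnext 7)
# Assumption: the message format is exactly as shown in the samples.
TIMEOUT_RE = re.compile(
    r"(?P<ifname>wm\d+): device timeout "
    r"\(txfree (?P<txfree>\d+) txsfree (?P<txsfree>\d+) txnext (?P<txnext>\d+)\)"
)


def parse_timeouts(lines):
    """Yield one dict per 'device timeout' message found in lines."""
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            yield {
                "ifname": m["ifname"],
                "txfree": int(m["txfree"]),
                "txsfree": int(m["txsfree"]),
                "txnext": int(m["txnext"]),
            }


if __name__ == "__main__":
    # Sample lines copied from this report; only the timeouts should match.
    sample = [
        "wm5: interrupting at ioapic0 pin 19",
        "wm5: device timeout (txfree 4089 txsfree 57 txnext 7)",
        "wm0: device timeout (txfree 4095 txsfree 63 txnext 265)",
    ]
    for event in parse_timeouts(sample):
        print(event)
```

Feeding it saved dmesg output would give a per-interface record of the counters at each timeout, which may help when comparing incidents.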
>How-To-Repeat:
Configure a handful of vlans on an Intel i82574L wm(4) NIC,
run traffic, and wait, I guess?
>Fix:
None known.
>Audit-Trail:
From: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/50150: wm(4) failure
Date: Tue, 10 May 2016 15:18:50 +0200
On Mon, 17 Aug 2015 13:45:00 +0000 (UTC), Hauke Fath wrote:
> Since the router had to come back up as quickly as possible, we
> did not take the time to check whether an 'ifconfig wm5 down ;
> ifconfig wm5 up' could have brought the NIC back in line.
... happened again during nightly backup traffic, and an 'ifconfig wm5
down ; ifconfig wm5 up' did not fix it.
Any recent relevant changes to wm(4) that I could give a spin?
Cheerio,
hauke
--
The ASCII Ribbon Campaign Hauke Fath
() No HTML/RTF in email Institut für Nachrichtentechnik
/\ No Word docs in email TU Darmstadt
Respect for open standards Ruf +49-6151-16-21344
From: Masanobu SAITOH <msaitoh@execsw.org>
To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
Hauke Fath <hf@spg.tu-darmstadt.de>
Cc: msaitoh@execsw.org
Subject: Re: kern/50150: wm(4) failure
Date: Wed, 11 May 2016 12:45:46 +0900
Hi.
On 2016/05/10 22:20, Hauke Fath wrote:
> The following reply was made to PR kern/50150; it has been noted by GNATS.
>
> From: Hauke Fath <hf@spg.tu-darmstadt.de>
> To: gnats-bugs@NetBSD.org
> Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
> Subject: Re: kern/50150: wm(4) failure
> Date: Tue, 10 May 2016 15:18:50 +0200
>
> On Mon, 17 Aug 2015 13:45:00 +0000 (UTC), Hauke Fath wrote:
> > Since the router had to come back up as quickly as possible, we
> > did not take the time to check whether an 'ifconfig wm5 down ;
> > ifconfig wm5 up' could have brought the NIC back in line.
>
> ... happened again during nightly backup traffic, and an 'ifconfig wm5
> down ; ifconfig wm5 up' did not fix it.
>
> Any recent relevant changes to wm(4) that I could give a spin?
Some people have reported the same problem, and I think it has not been
solved yet :-(
> Cheerio,
> hauke
>
> --
> The ASCII Ribbon Campaign Hauke Fath
> () No HTML/RTF in email Institut für Nachrichtentechnik
> /\ No Word docs in email TU Darmstadt
> Respect for open standards Ruf +49-6151-16-21344
>
>
--
-----------------------------------------------
SAITOH Masanobu (msaitoh@execsw.org
msaitoh@netbsd.org)
From: Kengo NAKAHARA <k-nakahara@iij.ad.jp>
To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org, hf@spg.tu-darmstadt.de
Cc:
Subject: Re: kern/50150: wm(4) failure
Date: Wed, 18 May 2016 16:20:35 +0900
Hi,
On 2016/05/11 12:50, Masanobu SAITOH wrote:
> The following reply was made to PR kern/50150; it has been noted by GNATS.
>
> From: Masanobu SAITOH <msaitoh@execsw.org>
> To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
> gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
> Hauke Fath <hf@spg.tu-darmstadt.de>
> Cc: msaitoh@execsw.org
> Subject: Re: kern/50150: wm(4) failure
> Date: Wed, 11 May 2016 12:45:46 +0900
>
> Hi.
>
> On 2016/05/10 22:20, Hauke Fath wrote:
> > The following reply was made to PR kern/50150; it has been noted by GNATS.
> >
> > From: Hauke Fath <hf@spg.tu-darmstadt.de>
> > To: gnats-bugs@NetBSD.org
> > Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
> > Subject: Re: kern/50150: wm(4) failure
> > Date: Tue, 10 May 2016 15:18:50 +0200
> >
> > On Mon, 17 Aug 2015 13:45:00 +0000 (UTC), Hauke Fath wrote:
> > > Since the router had to come back up as quickly as possible, we
> > > did not take the time to check whether an 'ifconfig wm5 down ;
> > > ifconfig wm5 up' could have brought the NIC back in line.
> >
> > ... happened again during nightly backup traffic, and an 'ifconfig wm5
> > down ; ifconfig wm5 up' did not fix it.
> >
> > Any recent relevant changes to wm(4) that I could give a spin?
>
> Some people have reported the same problem, and I think it has not been
> solved yet :-(
I think if_wm.c:r1.397 might fix this problem. If you have time, could
you try if_wm.c:r1.397 or if_wm.c:r1.398?
Thanks,
--
//////////////////////////////////////////////////////////////////////
Internet Initiative Japan Inc.
Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit
Kengo NAKAHARA <k-nakahara@iij.ad.jp>
From: Hauke Fath <hf@spg.tu-darmstadt.de>
To: Kengo NAKAHARA <k-nakahara@iij.ad.jp>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/50150: wm(4) failure
Date: Wed, 18 May 2016 13:46:42 +0200
On Wed, 18 May 2016 16:20:35 +0900, Kengo NAKAHARA wrote:
> I think if_wm.c:r1.397 might fix this problem. If you have time, could
> you try if_wm.c:r1.397 or if_wm.c:r1.398?
Hi Kengo,
that's a huge diff against netbsd-7 - almost 40 k in file size alone.
Rev. 1.391 does not make it a drop-in replacement for the netbsd-7
version, either, and I don't have any particular knowledge about the
wm(4) code.
Plus, this is about a production machine which cannot run -current.
In short: Do you think you could narrow down the relevant changes to a
netbsd-7 compatible patch?
Cheerio,
hauke
--
The ASCII Ribbon Campaign Hauke Fath
() No HTML/RTF in email Institut für Nachrichtentechnik
/\ No Word docs in email TU Darmstadt
Respect for open standards Ruf +49-6151-16-21344
From: Kengo NAKAHARA <k-nakahara@iij.ad.jp>
To: hf@spg.tu-darmstadt.de
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/50150: wm(4) failure
Date: Wed, 18 May 2016 21:00:14 +0900
Hi,
On 2016/05/18 20:46, Hauke Fath wrote:
> On Wed, 18 May 2016 16:20:35 +0900, Kengo NAKAHARA wrote:
>> I think if_wm.c:r1.397 might fix this problem. If you have time, could
>> you try if_wm.c:r1.397 or if_wm.c:r1.398?
>
> Hi Kengo,
>
> that's a huge diff against netbsd-7 - almost 40 k in file size alone.
> Rev. 1.391 does not make it a drop-in replacement for the netbsd-7
> version, either, and I don't have any particular knowledge about the
> wm(4) code.
>
> Plus, this is about a production machine which cannot run -current.
>
> In short: Do you think you could narrow down the relevant changes to a
> netbsd-7 compatible patch?
Sorry, I assumed you were using -current.
However, netbsd-7 seems to have a similar problem. As you indicated,
there is a huge diff between netbsd-7 and -current. So, I will implement
a new patch for netbsd-7 from scratch. Please give me a few
days... or a few weeks.
Thanks,
--
//////////////////////////////////////////////////////////////////////
Internet Initiative Japan Inc.
Device Engineering Section,
IoT Platform Development Department,
Network Division,
Technology Unit
Kengo NAKAHARA <k-nakahara@iij.ad.jp>
From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/50150: wm(4) failure
Date: Thu, 19 May 2016 09:48:49 +0200
Just found this recipe:
https://sourceforge.net/projects/e1000/files/e1000e%20stable/eeprom_fix_82574_or_82583/
Unfortunately it needs a running Linux distribution.
On my server it changed the eeprom value from 0x58 to 0x5a.
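
To make the change from 0x58 to 0x5a concrete, the two values can be compared bit by bit. This is only a worked-arithmetic sketch: the helper name changed_bits is hypothetical, and what the flipped bit means to the hardware is not stated in this report, so the code makes no claim about it.

```python
def changed_bits(before: int, after: int, width: int = 8):
    """Return (bit_index, old_bit, new_bit) for every bit that differs."""
    diff = before ^ after  # XOR leaves a 1 wherever the two values differ
    return [
        (i, (before >> i) & 1, (after >> i) & 1)
        for i in range(width)
        if (diff >> i) & 1
    ]


if __name__ == "__main__":
    # Values taken from the message above.
    before, after = 0x58, 0x5A
    print(f"mask of changed bits: {before ^ after:#04x}")  # 0x02
    for bit, old, new in changed_bits(before, after):
        print(f"bit {bit}: {old} -> {new}")  # bit 1: 0 -> 1
```

In other words, the recipe set exactly one bit (bit 1, mask 0x02) in that byte and left the rest untouched.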
--
J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)
From: Masanobu SAITOH <msaitoh@execsw.org>
To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
Hauke Fath <hf@spg.tu-darmstadt.de>
Cc: msaitoh@execsw.org
Subject: Re: kern/50150: wm(4) failure
Date: Thu, 19 May 2016 17:27:47 +0900
On 2016/05/19 16:50, J. Hannken-Illjes wrote:
> The following reply was made to PR kern/50150; it has been noted by GNATS.
>
> From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
> To: gnats-bugs@NetBSD.org
> Cc:
> Subject: Re: kern/50150: wm(4) failure
> Date: Thu, 19 May 2016 09:48:49 +0200
>
> Just found this recipe:
>
> https://sourceforge.net/projects/e1000/files/e1000e%20stable/eeprom_fix_82574_or_82583/
>
> Unfortunately it needs a running Linux distribution.
>
> On my server it changed the eeprom value from 0x58 to 0x5a.
>
> --
> J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)
>
It's 82574 Errata 25 "Dropped Rx Packets"
See:
http://www.intel.co.jp/content/www/jp/ja/embedded/products/networking/82574-gbe-controller-spec-update.html
http://nxr.netbsd.org/xref/src/sys/dev/pci/if_wm.c#3493
--
-----------------------------------------------
SAITOH Masanobu (msaitoh@execsw.org
msaitoh@netbsd.org)
From: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/50150: wm(4) failure
Date: Thu, 19 May 2016 15:30:47 +0200
On Thu, 19 May 2016 09:41:06 +0900, Kengo NAKAHARA wrote:
> I ensured above patch can apply to if_wm.c:r1.289.2.9 and can build.
Thanks!
> Could you try this patch?
A patched kernel is up and running on the machine.
The patch didn't break anything. If it indeed deals with a known issue,
it probably is a pull-up candidate as it is.
OTOH, I've seen the router's wm3 go south about four times in the last
six months, so a definitive statement will need some time.
Cheerio,
Hauke
--
The ASCII Ribbon Campaign Hauke Fath
() No HTML/RTF in email Institut für Nachrichtentechnik
/\ No Word docs in email TU Darmstadt
Respect for open standards Ruf +49-6151-16-21344
From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/50150: wm(4) failure
Date: Fri, 17 Jun 2016 10:06:01 +0200
Running this patch on top of if_wm.c 1.289.2.9 I got it again:
wm0 at pci4 dev 0 function 0: Intel i82574L (rev. 0x00)
wm0: interrupting at ioapic0 pin 16
wm0: PCI-Express bus
wm0: 2048 words (8 address bits) SPI EEPROM, version 2.1.2, Image Unique ID 0000ffff
wm0: Ethernet address 00:25:90:74:28:0c
makphy0 at wm0 phy 1: Marvell 88E1149 Gigabit PHY, rev. 1
...
wm0: device timeout (txfree 4095 txsfree 63 txnext 265)
--
J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)
>Unformatted:
Copyright © 1994-2014
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.