NetBSD Problem Report #53216

From bouyer@antioche.eu.org  Thu Apr 26 08:22:37 2018
Return-Path: <bouyer@antioche.eu.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 628B87A177
	for <gnats-bugs@gnats.NetBSD.org>; Thu, 26 Apr 2018 08:22:37 +0000 (UTC)
Message-Id: <20180426082230.2AAB627FB@rochebonne.antioche.eu.org>
Date: Thu, 26 Apr 2018 10:22:30 +0200 (CEST)
From: bouyer@antioche.eu.org
Reply-To: bouyer@antioche.eu.org
To: gnats-bugs@NetBSD.org
Subject: sunxi awge is unreliable at gigabit speed
X-Send-Pr-Version: 3.95

>Number:         53216
>Category:       port-arm
>Synopsis:       sunxi awge is unreliable at gigabit speed
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    port-arm-maintainer
>State:          feedback
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Apr 26 08:25:00 +0000 2018
>Closed-Date:    
>Last-Modified:  Mon Oct 18 19:05:01 +0000 2021
>Originator:     Manuel Bouyer
>Release:        NetBSD 8.99.14
>Organization:
>Environment:
System: NetBSD lime2 8.99.14 NetBSD 8.99.14 (SUNXI_CAN) #21: Wed Apr 25 14:57:43 CEST 2018 bouyer@bip.soc.lip6.fr:/dsk/l1/misc/bouyer/tmp/evbarm-earmhf/obj/dsk/l1/misc/bouyer/HEAD/clean/src/sys/arch/evbarm/compile/SUNXI_CAN evbarm
Architecture: earmv7hf
Machine: evbarm

>Description:
	On an Olimex Lime2 board with:
awge0 at fdt1 (/soc@1c00000/ethernet@1c50000)fdt: [ethernet@1c50000] decoded addr #0: 1c50000 -> 1c50000
: GMAC
awge0: interrupting on GIC irq 117
awge0: Ethernet address: 02:c7:04:82:c2:37
rgephy0 at awge0 phy 0: RTL8169S/8110S/8211 1000BASE-T media interface, rev. 5
rgephy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
rgephy1 at awge0 phy 1: RTL8169S/8110S/8211 1000BASE-T media interface, rev. 5
rgephy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
awge0: WARNING: power management not supported

	(note that the PHY attaches twice), the network is sometimes
	unreliable. When connected to a 100Mb/s Cisco switch everything
	works fine. When connected to a 1Gb/s D-Link switch, the green LED
	on the board's ethernet connector (link state) flashes about once
	per second and there is packet loss:
100 packets transmitted, 96 packets received, 4.0% packet loss
round-trip min/avg/max/stddev = 0.286974/1.172346/10.286504/2.295635 ms
	(scp is also much slower than it should be).
	The link state on the switch side and in ifconfig output doesn't
	show this down/up problem.
	While the green LED is off on the board's ethernet connector, the
	yellow LED (link activity) still seems to be flashing as usual.
	Also, the ping packet loss doesn't reflect the LED off/on ratio.

>How-To-Repeat:
	Connect a Lime2 to a 1Gb/s switch.
>Fix:
	unknown

>Release-Note:

>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-arm/53216: sunxi awge is unreliable at gigabit speed
Date: Thu, 26 Apr 2018 11:21:01 +0200

 FWIW, my cubietruck does this when doing ping -c 10000 -f $host on a gigE
 link:

 10000 packets transmitted, 10000 packets received, 0.0% packet loss
 round-trip min/avg/max/stddev = 0.548294/1.014064/5.901567/0.262285 ms
   984.0 packets/sec sent,  984.0 packets/sec received


 Martin

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: port-arm-maintainer@netbsd.org, gnats-admin@netbsd.org,
        netbsd-bugs@netbsd.org
Subject: Re: port-arm/53216: sunxi awge is unreliable at gigabit speed
Date: Thu, 26 Apr 2018 11:36:14 +0200

 On Thu, Apr 26, 2018 at 09:25:01AM +0000, Martin Husemann wrote:
 > The following reply was made to PR port-arm/53216; it has been noted by GNATS.
 > 
 > From: Martin Husemann <martin@duskware.de>
 > To: gnats-bugs@NetBSD.org
 > Cc: 
 > Subject: Re: port-arm/53216: sunxi awge is unreliable at gigabit speed
 > Date: Thu, 26 Apr 2018 11:21:01 +0200
 > 
 >  FWIW, my cubietruck does this when doing ping -c 10000 -f $host on a gigE
 >  link:
 >  
 >  10000 packets transmitted, 10000 packets received, 0.0% packet loss
 >  round-trip min/avg/max/stddev = 0.548294/1.014064/5.901567/0.262285 ms
 >    984.0 packets/sec sent,  984.0 packets/sec received

 How does the ethernet attach ?
 Does it also see 2 PHYs ?

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Martin Husemann <martin@duskware.de>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org, port-arm-maintainer@netbsd.org,
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: port-arm/53216: sunxi awge is unreliable at gigabit speed
Date: Thu, 26 Apr 2018 11:41:21 +0200

 On Thu, Apr 26, 2018 at 11:36:14AM +0200, Manuel Bouyer wrote:
 > How does the ethernet attach ?
 > Does it also see 2 PHYs ?

 Yes, there are two phys in the soc, only one is connected on the cubietruck
 (AFAIK).

 awge0 at fdt1: GMAC
 awge0: interrupting on GIC irq 117
 awge0: Ethernet address: 02:0e:03:41:63:14
 rgephy0 at awge0 phy 0: RTL8169S/8110S/8211 1000BASE-T media interface, rev. 5
 rgephy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
 rgephy1 at awge0 phy 1: RTL8169S/8110S/8211 1000BASE-T media interface, rev. 5
 rgephy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto

 and:

 awge0: flags=0x8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
         ec_capabilities=1<VLAN_MTU>
         ec_enabled=0
         address: 02:0e:03:41:63:14
         media: Ethernet autoselect (1000baseT full-duplex)
         status: active
 [..]

 Which reminds me I need to add checksum offload to the driver.

 Martin

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Martin Husemann <martin@duskware.de>
Cc: gnats-bugs@NetBSD.org, port-arm-maintainer@netbsd.org,
        gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: port-arm/53216: sunxi awge is unreliable at gigabit speed
Date: Thu, 26 Apr 2018 11:46:38 +0200

 On Thu, Apr 26, 2018 at 11:41:21AM +0200, Martin Husemann wrote:
 > On Thu, Apr 26, 2018 at 11:36:14AM +0200, Manuel Bouyer wrote:
 > > How does the ethernet attach ?
 > > Does it also see 2 PHYs ?
 > 
 > Yes, there are two phys in the soc, only one is connected on the cubietruck
 > (AFAIK).
 > 
 > awge0 at fdt1: GMAC
 > awge0: interrupting on GIC irq 117
 > awge0: Ethernet address: 02:0e:03:41:63:14
 > rgephy0 at awge0 phy 0: RTL8169S/8110S/8211 1000BASE-T media interface, rev. 5
 > rgephy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
 > rgephy1 at awge0 phy 1: RTL8169S/8110S/8211 1000BASE-T media interface, rev. 5
 > rgephy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto

 Actually I don't think there are 2 PHYs: according to the cubietruck and
 lime2 schematics, and by board inspection, there is only one PHY. I think
 the same PHY is responding on 2 different addresses.

 Also, it looks like the lime2 revision I have and the cubietruck have
 the same 8211 variant.

 I'll have to investigate some more on my side then.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Martin Husemann <martin@duskware.de>
Cc: gnats-bugs@NetBSD.org, port-arm-maintainer@netbsd.org,
        gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: port-arm/53216: sunxi awge is unreliable at gigabit speed
Date: Thu, 26 Apr 2018 19:46:32 +0200

 On Thu, Apr 26, 2018 at 11:46:38AM +0200, Manuel Bouyer wrote:
 > > Yes, there are two phys in the soc, only one is connected on the cubietruck
 > > (AFAIK).
 > > 
 > > awge0 at fdt1: GMAC
 > > awge0: interrupting on GIC irq 117
 > > awge0: Ethernet address: 02:0e:03:41:63:14
 > > rgephy0 at awge0 phy 0: RTL8169S/8110S/8211 1000BASE-T media interface, rev. 5
 > > rgephy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
 > > rgephy1 at awge0 phy 1: RTL8169S/8110S/8211 1000BASE-T media interface, rev. 5
 > > rgephy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
 > 
 > Actually I don't think there are 2 PHYs: according to the cubietruck and
 > lime2 schematics, and by board inspection, there is only one PHY. I think
 > the same PHY is responding on 2 different addresses.
 > 
 > Also, it looks like the lime2 revision I have and the cubietruck have
 > the same 8211 variant.
 > 
 > I'll have to investigate some more on my side then.

 Could be some uninitialized register problem.
 I tried to apply the no-rx-delay property for my board in sunxi_platform.c;
 this made the problem worse (more packet loss with ping, I couldn't
 even connect via ssh, link LED on the board off but still on at the switch).
 Rebooting with a known-good kernel didn't get me a working ethernet;
 I had to power cycle to get back to the previous working state.

 Now to find what appropriate "no-rx-delay" values would work for this
 board ...

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-arm/53216: sunxi awge is unreliable at gigabit speed
Date: Fri, 27 Apr 2018 09:07:57 +0200

 On Thu, Apr 26, 2018 at 05:50:01PM +0000, Manuel Bouyer wrote:
 >  Could be some uninitialized register problem.

 May depend on the u-boot version too, I have:

 U-Boot 2017.11 (Dec 05 2017 - 14:37:38 +0100) Allwinner Technology

 arm-none-eabi-gcc (GCC) 7.2.0
 GNU ld (GNU Binutils) 2.29


 (IIRC I built that myself from pkgsrc)

 Martin

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-arm/53216: sunxi awge is unreliable at gigabit speed
Date: Fri, 27 Apr 2018 09:11:53 +0200

 On Fri, Apr 27, 2018 at 07:10:01AM +0000, Martin Husemann wrote:
 >  May depend on the u-boot version too, I have:
 >  
 >  U-Boot 2017.11 (Dec 05 2017 - 14:37:38 +0100) Allwinner Technology

 And also I am netbooting the machine (via gigE).

 Martin

State-Changed-From-To: open->feedback
State-Changed-By: skrll@NetBSD.org
State-Changed-When: Mon, 18 Oct 2021 07:41:25 +0000
State-Changed-Why:
Is this still a problem? I see these changes that might help your problem

revision 1.67
date: 2019-10-15 18:19:05 +0100;  author: tnn;  state: Exp;  lines: +4 -5;  commitid: eOp1fNcFFCQafZGB;
correct pointer arithmetics
----------------------------
revision 1.66
date: 2019-10-15 17:30:49 +0100;  author: tnn;  state: Exp;  lines: +28 -14;  commitid: t0ylgdxkVARbYYGB;
awge: fix issue that caused rx packets to be corrupt with DIAGNOSTIC kernel

It seems the hardware can only reliably do rx DMA to addresses that are
dcache size aligned. This is hinted at by some GMAC data sheets but hard to
find an authoritative source.

on non-DIAGNOSTIC kernels we always implicitly get MCLBYTES-aligned mbuf
data pointers, but with the reintroduction of POOL_REDZONE for DIAGNOSTIC
we can get 8-byte alignment due to redzone padding. So align rx pointers to
64 bytes which should be good for both arm32 and aarch64.

While here change some bus_dmamap_load() to bus_dmamap_load_mbuf() and add
one missing bus_dmamap_sync(). Also fixes the code to not assume that
MCLBYTES == AWGE_MAX_PACKET. User may override MCLSHIFT in kernel config.
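
For illustration only, the 64-byte rounding described in the log message
above could look roughly like the sketch below. This is not the code from
revision 1.66; RX_ALIGN, align_up() and rx_align_pad() are hypothetical
names, and only the 64-byte figure comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed alignment: the "64 bytes" figure quoted in the commit message. */
#define RX_ALIGN 64u

/* Round addr up to the next multiple of align (align must be a power of two). */
static uintptr_t
align_up(uintptr_t addr, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}

/* Bytes to skip at the start of a buffer so the rx data pointer is aligned. */
static size_t
rx_align_pad(uintptr_t addr)
{
	return (size_t)(align_up(addr, RX_ALIGN) - addr);
}
```

With a data pointer that is only 8-byte aligned (as the message says can
happen with redzone padding), the pad is the distance to the next 64-byte
boundary; the usable buffer length shrinks by the same amount.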


From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@netbsd.org
Cc: port-arm-maintainer@netbsd.org, netbsd-bugs@netbsd.org,
        gnats-admin@netbsd.org, skrll@NetBSD.org
Subject: Re: port-arm/53216 (sunxi awge is unreliable at gigabit speed)
Date: Mon, 18 Oct 2021 21:04:05 +0200

 On Mon, Oct 18, 2021 at 07:41:26AM +0000, skrll@NetBSD.org wrote:
 > Synopsis: sunxi awge is unreliable at gigabit speed
 > 
 > State-Changed-From-To: open->feedback
 > State-Changed-By: skrll@NetBSD.org
 > State-Changed-When: Mon, 18 Oct 2021 07:41:25 +0000
 > State-Changed-Why:
 > Is this still a problem? I see these changes that might help your problem

 On netbsd-9 I'm still seeing the strange LED behavior but I don't notice
 the packet loss any more (maybe I'm just lucky?).
 Anyway the change below doesn't explain why it would work fine with a Cisco
 switch but fail with a D-Link ...

 > 
 > revision 1.67
 > date: 2019-10-15 18:19:05 +0100;  author: tnn;  state: Exp;  lines: +4 -5;  commitid: eOp1fNcFFCQafZGB;
 > correct pointer arithmetics
 > ----------------------------
 > revision 1.66
 > date: 2019-10-15 17:30:49 +0100;  author: tnn;  state: Exp;  lines: +28 -14;  commitid: t0ylgdxkVARbYYGB;
 > awge: fix issue that caused rx packets to be corrupt with DIAGNOSTIC kernel
 > 
 > It seems the hardware can only reliably do rx DMA to addresses that are
 > dcache size aligned. This is hinted at by some GMAC data sheets but hard to
 > find an authoritative source.
 > 
 > on non-DIAGNOSTIC kernels we always implicitly get MCLBYTES-aligned mbuf
 > data pointers, but with the reintroduction of POOL_REDZONE for DIAGNOSTIC
 > we can get 8-byte alignment due to redzone padding. So align rx pointers to
 > 64 bytes which should be good for both arm32 and aarch64.
 > 
 > While here change some bus_dmamap_load() to bus_dmamap_load_mbuf() and add
 > one missing bus_dmamap_sync(). Also fixes the code to not assume that
 > MCLBYTES == AWGE_MAX_PACKET. User may override MCLSHIFT in kernel config.
 > 
 > 

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

>Unformatted:
