NetBSD Problem Report #46260

From he@betelgeuse.urc.uninett.no  Mon Mar 26 22:23:11 2012
Return-Path: <he@betelgeuse.urc.uninett.no>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	by www.NetBSD.org (Postfix) with ESMTP id 074E363BBEC
	for <gnats-bugs@gnats.NetBSD.org>; Mon, 26 Mar 2012 22:23:11 +0000 (UTC)
Message-Id: <20120326210509.CD7DA155E4@betelgeuse.urc.uninett.no>
Date: Mon, 26 Mar 2012 21:05:09 +0000 (UTC)
From: he@NetBSD.org
Reply-To: he@NetBSD.org
To: gnats-bugs@gnats.NetBSD.org
Subject: gem0 driver fails to recover after RX overflow
X-Send-Pr-Version: 3.95

>Number:         46260
>Category:       port-sparc64
>Synopsis:       gem0 driver fails to recover after RX overflow
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    port-sparc64-maintainer
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Mar 26 22:25:00 +0000 2012
>Closed-Date:    Mon Dec 23 23:49:49 +0000 2013
>Last-Modified:  Mon Dec 23 23:49:49 +0000 2013
>Originator:     Havard Eidnes
>Release:        NetBSD 6.0_BETA
>Organization:
	None
>Environment:
System: NetBSD betelgeuse.urc.uninett.no 6.0_BETA NetBSD 6.0_BETA (GENERIC) #1: Mon Mar 26 20:41:19 UTC 2012 he@betelgeuse.urc.uninett.no:/usr/obj/sys/arch/sparc64/compile/GENERIC sparc64
Architecture: sparc64
Machine: sparc64
>Description:
	I've recently been upgrading a SunFire V120 from 4.0 via 5.1
	to 6.0_BETA.  The host sometimes gets significant traffic over
	gem0.  With the code in 4.0, it has been rock solid.

	However, with both 5.1 and 6.0_BETA, the gem(4) Ethernet interface
	tends to lock up.  Adding some debugging printf()s reveals that
	the error which occurs right before the interface seizes up is
	an RX overflow.  The modified code is:

...
        if (status & GEM_INTR_RX_MAC) {
                int rxstat = bus_space_read_4(t, h, GEM_MAC_RX_STATUS);
                /*
                 * At least with GEM_SUN_GEM and some GEM_SUN_ERI
                 * revisions GEM_MAC_RX_OVERFLOW happen often due to a
                 * silicon bug so handle them silently. Moreover, it's
                 * likely that the receiver has hung so we reset it.
                 */
                if (rxstat & GEM_MAC_RX_OVERFLOW) {
                        ifp->if_ierrors++;
                        aprint_error_dev(sc->sc_dev,
                            "receive error: RX overflow");
                        gem_reset_rxdma(sc);
...

	And this printf() is triggered.

>How-To-Repeat:
	Push lots of traffic through gem0 with either 5.1 or 6.0_BETA.
	Watch it seize up.

>Fix:
	Doing an "ifconfig gem0 down; ifconfig gem0 up" resets the
	interface so that it works again for a while.

>Release-Note:

>Audit-Trail:
From: Julian Coleman <jdc@coris.org.uk>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-sparc64/46260: gem0 driver fails to recover after RX overflow
Date: Fri, 30 Mar 2012 16:11:45 +0100

 Hi,

 > 	                the modified code is:
 > 
 > ...
 >         if (status & GEM_INTR_RX_MAC) {
 >                 int rxstat = bus_space_read_4(t, h, GEM_MAC_RX_STATUS);
 >                 /*
 >                  * At least with GEM_SUN_GEM and some GEM_SUN_ERI
 >                  * revisions GEM_MAC_RX_OVERFLOW happen often due to a
 >                  * silicon bug so handle them silently. Moreover, it's
 >                  * likely that the receiver has hung so we reset it.
 >                  */
 >                 if (rxstat & GEM_MAC_RX_OVERFLOW) {
 >                         ifp->if_ierrors++;
 >                         aprint_error_dev(sc->sc_dev,
 >                             "receive error: RX overflow");
 >                         gem_reset_rxdma(sc);
 > ...
 > 
 > 	And this printf() is triggered.

 Revision 1.68 (which was pulled up to netbsd-4, but not netbsd-4-0) changed
 the:

   gem_init(ifp);

 to:

   gem_reset_rxdma(sc);

 when (rxstat & GEM_MAC_RX_OVERFLOW) occurred.  Can you try changing this
 back?

 Thanks,

 J

 -- 
   My other computer also runs NetBSD    /        Sailing at Newbiggin
         http://www.netbsd.org/        /   http://www.newbigginsailingclub.org/

From: Havard Eidnes <he@NetBSD.org>
To: gnats-bugs@NetBSD.org, jdc@coris.org.uk
Cc: port-sparc64-maintainer@netbsd.org
Subject: Re: port-sparc64/46260: gem0 driver fails to recover after RX
 overflow
Date: Tue, 10 Apr 2012 15:30:48 +0200 (CEST)

 >  Revision 1.68 (which was pulled up to netbsd-4, but not netbsd-4-0) changed
 >  the:
 >  
 >    gem_init(ifp);
 >  
 >  to:
 >  
 >    gem_reset_rxdma(sc);
 >  
 >  when (rxstat & GEM_MAC_RX_OVERFLOW) occurred.  Can you try changing this
 >  back?

 Hm, did that, and ... apparently I can't do that from interrupt
 context, as it hit a diagnostic assertion (this is with netbsd-6
 sources this time, BTW):

 gem0: receive error: RX overflow
 panic: kernel diagnostic assertion "!cpu_intr_p()" failed: file "/usr/src/sys/kern/kern_timeout.c", line 471
 cpu0: Begin traceback...
 cpu0: End traceback...
 Frame pointer is at 0xe0016fd1
 Call traceback:
  netbsd:cpu_reboot+0x268(1cb4210, 5db1c00, ff0f0000000001, 0, 1, 75) fp = e0017091
  netbsd:vpanic+0x20c(104, 0, ffffffffff000000, 1, 146e0c0, 0) fp = e0017141
  netbsd:kern_assert+0x34(17928f0, e0017b38, e0017970, fefefefefefefeff, 10cb9a0, ff000000000000) fp = e00171f1
  netbsd:callout_halt+0x194(17928f0, 1792928, 179b7f0, 17ce890, 1d7, 73) fp = e00172b1
  netbsd:gem_stop+0xc(6d5e440, 0, 1000000, 0, e0017d50, 0) fp = e0017361
  netbsd:gem_init+0x28(6d5e008, 0, e0017d50, 800, 1d, e) fp = e0017411
  netbsd:gem_intr+0x17c(6d5e008, 17adce0, 7fff0000, 24, 4000000000000000, 1d) fp = e00174c1
  netbsd:intr_biglock_wrapper+0x10(6d5e000, 0, 885671dc, 88567170, 0, 0) fp = e0017571
  netbsd:sparc_interrupt+0x224(5db7390, 0, e0017ed0, 1, 122ca40, 2014000) fp = e0017621
  netbsd:sched_nextlwp+0x124(1792800, 0, 0, 6, 0, 2014000) fp = 877c3401

 Then how should we go about doing this?

 Regards,

 - Havard
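
The traceback above shows gem_intr() calling gem_init(), which reaches
callout_halt() and trips the "!cpu_intr_p()" assertion.  One common way
around this kind of restriction is to have the interrupt handler only
record that a reset is wanted and to run the heavy reset later, from a
context where the assertion holds.  The following is a minimal
user-space sketch of that pattern, not driver code: in_interrupt,
reset_pending, full_reset(), rx_overflow_intr() and deferred_work() are
all hypothetical stand-ins invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these are NetBSD kernel interfaces. */
static bool in_interrupt;   /* plays the role of cpu_intr_p() */
static bool reset_pending;  /* set by the handler, consumed later */
static int  full_resets;    /* number of heavy resets that actually ran */

/* Heavy reset: like gem_init() in the traceback, it must not run
 * while we are inside the interrupt handler. */
static void
full_reset(void)
{
	assert(!in_interrupt);
	full_resets++;
}

/* Interrupt handler: only notes that a reset is wanted. */
static void
rx_overflow_intr(void)
{
	in_interrupt = true;
	reset_pending = true;   /* cheap bookkeeping only */
	in_interrupt = false;
}

/* Runs later, outside the handler (for example from a timer). */
static void
deferred_work(void)
{
	if (reset_pending) {
		reset_pending = false;
		full_reset();
	}
}
```

Calling rx_overflow_intr() leaves full_resets at 0 with reset_pending
set; a subsequent deferred_work() performs exactly one reset.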

From: Havard Eidnes <he@NetBSD.org>
To: gnats-bugs@NetBSD.org, jdc@coris.org.uk
Cc: port-sparc64-maintainer@netbsd.org
Subject: Re: port-sparc64/46260: gem0 driver fails to recover after RX
 overflow
Date: Wed, 11 Apr 2012 11:15:53 +0200 (CEST)

 ----Next_Part(Wed_Apr_11_11_15_53_2012_993)--
 Content-Type: Text/Plain; charset=iso-8859-1
 Content-Transfer-Encoding: quoted-printable

 Hi,

 I've taken a look at the OpenBSD driver, and copied their method
 of detection & reset.  I'm currently testing this, but so far it
 has not yet triggered.  Diff attached below.

 - Håvard

 ----Next_Part(Wed_Apr_11_11_15_53_2012_993)--
 Content-Type: Text/Plain; charset=us-ascii
 Content-Transfer-Encoding: 7bit
 Content-Disposition: inline; filename=diff

 Index: gem.c
 ===================================================================
 RCS file: /cvsroot/src/sys/dev/ic/gem.c,v
 retrieving revision 1.98
 diff -u -r1.98 gem.c
 --- gem.c	2 Feb 2012 19:43:03 -0000	1.98
 +++ gem.c	11 Apr 2012 09:12:46 -0000
 @@ -89,6 +89,7 @@
  int		gem_ioctl(struct ifnet *, u_long, void *);
  void		gem_tick(void *);
  void		gem_watchdog(struct ifnet *);
 +void		gem_rx_watchdog(void *);
  void		gem_pcs_start(struct gem_softc *sc);
  void		gem_pcs_stop(struct gem_softc *sc, int);
  int		gem_init(struct ifnet *);
 @@ -177,6 +178,7 @@
  		ifmedia_delete_instance(&sc->sc_mii.mii_media, IFM_INST_ANY);

  		callout_destroy(&sc->sc_tick_ch);
 +		callout_destroy(&sc->sc_rx_watchdog);

  		/*FALLTHROUGH*/
  	case GEM_ATT_MII:
 @@ -613,6 +615,8 @@
  #endif

  	callout_init(&sc->sc_tick_ch, 0);
 +	callout_init(&sc->sc_rx_watchdog, 0);
 +	callout_setfunc(&sc->sc_rx_watchdog, gem_rx_watchdog, sc);

  	sc->sc_att_stage = GEM_ATT_FINISHED;

 @@ -1824,6 +1828,8 @@
  		if (gem_add_rxbuf(sc, i) != 0) {
  			GEM_COUNTER_INCR(sc, sc_ev_rxnobuf);
  			ifp->if_ierrors++;
 +			aprint_error_dev(sc->sc_dev,
 +			    "receive error: RX no buffer space\n");
  			GEM_INIT_RXDESC(sc, i);
  			bus_dmamap_sync(sc->sc_dmatag, rxs->rxs_dmamap, 0,
  			    rxs->rxs_dmamap->dm_mapsize, BUS_DMASYNC_PREREAD);
 @@ -1965,12 +1971,34 @@
  	DPRINTF(sc, ("gem_rint: done sc->rxptr %d, complete %d\n",
  		sc->sc_rxptr, bus_space_read_4(t, h, GEM_RX_COMPLETION)));

 +	if ((i = bus_space_read_4(t, h, GEM_MAC_RX_LEN_ERR_CNT)) != 0) {
 +		aprint_error_dev(sc->sc_dev,
 +			    "receive error: RX length error\n");
 +		ifp->if_ierrors += i;
 +	}
 +	if ((i = bus_space_read_4(t, h, GEM_MAC_RX_ALIGN_ERR)) != 0) {
 +		aprint_error_dev(sc->sc_dev,
 +			    "receive error: RX alignment error\n");
 +		ifp->if_ierrors += i;
 +	}
 +	if ((i = bus_space_read_4(t, h, GEM_MAC_RX_CRC_ERR_CNT)) != 0) {
 +		aprint_error_dev(sc->sc_dev,
 +			    "receive error: RX CRC error\n");
 +		ifp->if_ierrors += i;
 +	}
 +	if ((i = bus_space_read_4(t, h, GEM_MAC_RX_CODE_VIOL)) != 0) {
 +		aprint_error_dev(sc->sc_dev,
 +			    "receive error: RX code violation\n");
 +		ifp->if_ierrors += i;
 +	}
 +#if 0
  	/* Read error counters ... */
  	ifp->if_ierrors +=
  	    bus_space_read_4(t, h, GEM_MAC_RX_LEN_ERR_CNT) +
  	    bus_space_read_4(t, h, GEM_MAC_RX_ALIGN_ERR) +
  	    bus_space_read_4(t, h, GEM_MAC_RX_CRC_ERR_CNT) +
  	    bus_space_read_4(t, h, GEM_MAC_RX_CODE_VIOL);
 +#endif

  	/* ... then clear the hardware counters. */
  	bus_space_write_4(t, h, GEM_MAC_RX_LEN_ERR_CNT, 0);
 @@ -2209,7 +2237,21 @@
  		 */
  		if (rxstat & GEM_MAC_RX_OVERFLOW) {
  			ifp->if_ierrors++;
 +			aprint_error_dev(sc->sc_dev,
 +			    "receive error: RX overflow\n");
  			gem_reset_rxdma(sc);
 +			/*
 +			 * Apparently a silicon bug causes ERI to hang from 
 +			 * time to time.  So if we detect an RX FIFO overflow,
 +			 * we fire off a timer, and check whether we're still
 +			 * making progress by looking at the RX FIFO write
 +			 * and read pointers.
 +			 */
 +			sc->sc_rx_fifo_wr_ptr =
 +				bus_space_read_4(t, h, GEM_RX_FIFO_WR_PTR);
 +			sc->sc_rx_fifo_rd_ptr =
 +				bus_space_read_4(t, h, GEM_RX_FIFO_RD_PTR);
 +			callout_schedule(&sc->sc_rx_watchdog, 400);
  		} else if (rxstat & ~(GEM_MAC_RX_DONE | GEM_MAC_RX_FRAME_CNT))
  			printf("%s: MAC rx fault, status 0x%02x\n",
  			    device_xname(sc->sc_dev), rxstat);
 @@ -2236,6 +2278,40 @@
  	return (r);
  }

 +void
 +gem_rx_watchdog(void *arg)
 +{
 +	struct gem_softc *sc = arg;
 +	struct ifnet *ifp = &sc->sc_ethercom.ec_if;
 +	bus_space_tag_t t = sc->sc_bustag;
 +	bus_space_handle_t h = sc->sc_h1;
 +	u_int32_t rx_fifo_wr_ptr;
 +	u_int32_t rx_fifo_rd_ptr;
 +	u_int32_t state;
 +
 +	if ((ifp->if_flags & IFF_RUNNING) == 0)
 +		return;
 +
 +	rx_fifo_wr_ptr = bus_space_read_4(t, h, GEM_RX_FIFO_WR_PTR);
 +	rx_fifo_rd_ptr = bus_space_read_4(t, h, GEM_RX_FIFO_RD_PTR);
 +	state = bus_space_read_4(t, h, GEM_MAC_MAC_STATE);
 +	if ((state & GEM_MAC_RX_OVERFLOW) == GEM_MAC_RX_OVERFLOW &&
 +	    ((rx_fifo_wr_ptr == rx_fifo_rd_ptr) ||
 +	     ((sc->sc_rx_fifo_wr_ptr == rx_fifo_wr_ptr) &&
 +	      (sc->sc_rx_fifo_rd_ptr == rx_fifo_rd_ptr))))
 +	{
 +		/*
 +		 * The RX state machine is still in overflow state and
 +		 * the RX FIFO write and read pointers seem to be
 +		 * stuck.  Whack the chip over the head to get things
 +		 * going again.
 +		 */
 +		aprint_error_dev(sc->sc_dev,
 +		    "receiver stuck in overflow, resetting\n");
 +		gem_init(ifp);
 +	}
 +	
 +}

  void
  gem_watchdog(struct ifnet *ifp)
 Index: gemvar.h
 ===================================================================
 RCS file: /cvsroot/src/sys/dev/ic/gemvar.h,v
 retrieving revision 1.23
 diff -u -r1.23 gemvar.h
 --- gemvar.h	2 Feb 2012 19:43:03 -0000	1.23
 +++ gemvar.h	11 Apr 2012 09:12:46 -0000
 @@ -130,6 +130,7 @@
  	struct ethercom sc_ethercom;	/* ethernet common data */
  	struct mii_data	sc_mii;		/* MII media control */
  	struct callout	sc_tick_ch;	/* tick callout */
 +	struct callout	sc_rx_watchdog;	/* RX watchdog callout */

  	/* The following bus handles are to be provided by the bus front-end */
  	bus_space_tag_t	sc_bustag;	/* bus tag */
 @@ -223,6 +224,10 @@
  	struct evcnt sc_ev_rxhist[9];
  #endif

 +	/* For use by the RX watchdog */
 +	u_int32_t 	sc_rx_fifo_wr_ptr;
 +	u_int32_t	sc_rx_fifo_rd_ptr;
 +
  	enum gem_attach_stage	sc_att_stage;
  };


 ----Next_Part(Wed_Apr_11_11_15_53_2012_993)----

From: Havard Eidnes <he@NetBSD.org>
To: gnats-bugs@NetBSD.org, jdc@coris.org.uk
Cc: port-sparc64-maintainer@netbsd.org
Subject: Re: port-sparc64/46260: gem0 driver fails to recover after RX
 overflow
Date: Wed, 11 Apr 2012 16:25:57 +0200 (CEST)

 ----Next_Part(Wed_Apr_11_16_25_57_2012_393)--
 Content-Type: Text/Plain; charset=iso-8859-1
 Content-Transfer-Encoding: quoted-printable

 > I've taken a look at the OpenBSD driver, and copied their method
 > of detection & reset.  I'm currently testing this, but so far it
 > has not yet triggered.  Diff attached below.

 Scratch that diff, here is one which works, but which ends up
 resetting the interface Quite Often, despite the state register
 indicating it's not in overflow mode -- it prints

   gem0: rx_watchdog: not in overflow state: 0x10400

 Only once in my testing did I see

   gem0: rx_watchdog: rd pointer != saved

 occur, but it *did* occur.

 Regards,

 - Håvard

 ----Next_Part(Wed_Apr_11_16_25_57_2012_393)--
 Content-Type: Text/Plain; charset=us-ascii
 Content-Transfer-Encoding: 7bit
 Content-Disposition: inline; filename=diff

 Index: gem.c
 ===================================================================
 RCS file: /cvsroot/src/sys/dev/ic/gem.c,v
 retrieving revision 1.98
 diff -u -r1.98 gem.c
 --- gem.c	2 Feb 2012 19:43:03 -0000	1.98
 +++ gem.c	11 Apr 2012 14:19:52 -0000
 @@ -89,6 +89,7 @@
  int		gem_ioctl(struct ifnet *, u_long, void *);
  void		gem_tick(void *);
  void		gem_watchdog(struct ifnet *);
 +void		gem_rx_watchdog(void *);
  void		gem_pcs_start(struct gem_softc *sc);
  void		gem_pcs_stop(struct gem_softc *sc, int);
  int		gem_init(struct ifnet *);
 @@ -177,6 +178,7 @@
  		ifmedia_delete_instance(&sc->sc_mii.mii_media, IFM_INST_ANY);

  		callout_destroy(&sc->sc_tick_ch);
 +		callout_destroy(&sc->sc_rx_watchdog);

  		/*FALLTHROUGH*/
  	case GEM_ATT_MII:
 @@ -613,6 +615,8 @@
  #endif

  	callout_init(&sc->sc_tick_ch, 0);
 +	callout_init(&sc->sc_rx_watchdog, 0);
 +	callout_setfunc(&sc->sc_rx_watchdog, gem_rx_watchdog, sc);

  	sc->sc_att_stage = GEM_ATT_FINISHED;

 @@ -764,6 +768,8 @@
  	/* Wait till it finishes */
  	if (!gem_bitwait(sc, h, GEM_RX_CONFIG, 1, 0))
  		aprint_error_dev(sc->sc_dev, "cannot disable read dma\n");
 +	/* Wait 5ms extra. */
 +	delay(5000);

  	/* Finally, reset the ERX */
  	bus_space_write_4(t, h2, GEM_RESET, GEM_RESET_RX);
 @@ -1824,6 +1830,8 @@
  		if (gem_add_rxbuf(sc, i) != 0) {
  			GEM_COUNTER_INCR(sc, sc_ev_rxnobuf);
  			ifp->if_ierrors++;
 +			aprint_error_dev(sc->sc_dev,
 +			    "receive error: RX no buffer space\n");
  			GEM_INIT_RXDESC(sc, i);
  			bus_dmamap_sync(sc->sc_dmatag, rxs->rxs_dmamap, 0,
  			    rxs->rxs_dmamap->dm_mapsize, BUS_DMASYNC_PREREAD);
 @@ -1965,12 +1973,34 @@
  	DPRINTF(sc, ("gem_rint: done sc->rxptr %d, complete %d\n",
  		sc->sc_rxptr, bus_space_read_4(t, h, GEM_RX_COMPLETION)));

 +	if ((i = bus_space_read_4(t, h, GEM_MAC_RX_LEN_ERR_CNT)) != 0) {
 +		aprint_error_dev(sc->sc_dev,
 +			    "receive error: RX length error\n");
 +		ifp->if_ierrors += i;
 +	}
 +	if ((i = bus_space_read_4(t, h, GEM_MAC_RX_ALIGN_ERR)) != 0) {
 +		aprint_error_dev(sc->sc_dev,
 +			    "receive error: RX alignment error\n");
 +		ifp->if_ierrors += i;
 +	}
 +	if ((i = bus_space_read_4(t, h, GEM_MAC_RX_CRC_ERR_CNT)) != 0) {
 +		aprint_error_dev(sc->sc_dev,
 +			    "receive error: RX CRC error\n");
 +		ifp->if_ierrors += i;
 +	}
 +	if ((i = bus_space_read_4(t, h, GEM_MAC_RX_CODE_VIOL)) != 0) {
 +		aprint_error_dev(sc->sc_dev,
 +			    "receive error: RX code violation\n");
 +		ifp->if_ierrors += i;
 +	}
 +#if 0
  	/* Read error counters ... */
  	ifp->if_ierrors +=
  	    bus_space_read_4(t, h, GEM_MAC_RX_LEN_ERR_CNT) +
  	    bus_space_read_4(t, h, GEM_MAC_RX_ALIGN_ERR) +
  	    bus_space_read_4(t, h, GEM_MAC_RX_CRC_ERR_CNT) +
  	    bus_space_read_4(t, h, GEM_MAC_RX_CODE_VIOL);
 +#endif

  	/* ... then clear the hardware counters. */
  	bus_space_write_4(t, h, GEM_MAC_RX_LEN_ERR_CNT, 0);
 @@ -2209,7 +2239,21 @@
  		 */
  		if (rxstat & GEM_MAC_RX_OVERFLOW) {
  			ifp->if_ierrors++;
 +			aprint_error_dev(sc->sc_dev,
 +			    "receive error: RX overflow\n");
  			gem_reset_rxdma(sc);
 +			/*
 +			 * Apparently a silicon bug causes ERI to hang from 
 +			 * time to time.  So if we detect an RX FIFO overflow,
 +			 * we fire off a timer, and check whether we're still
 +			 * making progress by looking at the RX FIFO write
 +			 * and read pointers.
 +			 */
 +			sc->sc_rx_fifo_wr_ptr =
 +				bus_space_read_4(t, h, GEM_RX_FIFO_WR_PTR);
 +			sc->sc_rx_fifo_rd_ptr =
 +				bus_space_read_4(t, h, GEM_RX_FIFO_RD_PTR);
 +			callout_schedule(&sc->sc_rx_watchdog, 400);
  		} else if (rxstat & ~(GEM_MAC_RX_DONE | GEM_MAC_RX_FRAME_CNT))
  			printf("%s: MAC rx fault, status 0x%02x\n",
  			    device_xname(sc->sc_dev), rxstat);
 @@ -2236,6 +2280,61 @@
  	return (r);
  }

 +void
 +gem_rx_watchdog(void *arg)
 +{
 +	struct gem_softc *sc = arg;
 +	struct ifnet *ifp = &sc->sc_ethercom.ec_if;
 +	bus_space_tag_t t = sc->sc_bustag;
 +	bus_space_handle_t h = sc->sc_h1;
 +	u_int32_t rx_fifo_wr_ptr;
 +	u_int32_t rx_fifo_rd_ptr;
 +	u_int32_t state;
 +
 +	if ((ifp->if_flags & IFF_RUNNING) == 0) {
 +		aprint_error_dev(sc->sc_dev, "receiver not running\n");
 +		return;
 +	}
 +
 +	rx_fifo_wr_ptr = bus_space_read_4(t, h, GEM_RX_FIFO_WR_PTR);
 +	rx_fifo_rd_ptr = bus_space_read_4(t, h, GEM_RX_FIFO_RD_PTR);
 +	state = bus_space_read_4(t, h, GEM_MAC_MAC_STATE);
 +	if ((state & GEM_MAC_STATE_OVERFLOW) == GEM_MAC_STATE_OVERFLOW &&
 +	    ((rx_fifo_wr_ptr == rx_fifo_rd_ptr) ||
 +	     ((sc->sc_rx_fifo_wr_ptr == rx_fifo_wr_ptr) &&
 +	      (sc->sc_rx_fifo_rd_ptr == rx_fifo_rd_ptr))))
 +	{
 +		/*
 +		 * The RX state machine is still in overflow state and
 +		 * the RX FIFO write and read pointers seem to be
 +		 * stuck.  Whack the chip over the head to get things
 +		 * going again.
 +		 */
 +		aprint_error_dev(sc->sc_dev,
 +		    "receiver stuck in overflow, resetting\n");
 +		gem_init(ifp);
 +	} else {
 +		if ((state & GEM_MAC_STATE_OVERFLOW) != GEM_MAC_STATE_OVERFLOW) {
 +			aprint_error_dev(sc->sc_dev,
 +				"rx_watchdog: not in overflow state: 0x%x\n",
 +				state);
 +		}
 +		if (rx_fifo_wr_ptr != rx_fifo_rd_ptr) {
 +			aprint_error_dev(sc->sc_dev,
 +				"rx_watchdog: wr & rd ptr different\n");
 +		}
 +		if (sc->sc_rx_fifo_wr_ptr != rx_fifo_wr_ptr) {
 +			aprint_error_dev(sc->sc_dev,
 +				"rx_watchdog: wr pointer != saved\n");
 +		}
 +		if (sc->sc_rx_fifo_rd_ptr != rx_fifo_rd_ptr) {
 +			aprint_error_dev(sc->sc_dev,
 +				"rx_watchdog: rd pointer != saved\n");
 +		}
 +		aprint_error_dev(sc->sc_dev, "resetting anyway\n");
 +		gem_init(ifp);
 +	}
 +}

  void
  gem_watchdog(struct ifnet *ifp)
 Index: gemreg.h
 ===================================================================
 RCS file: /cvsroot/src/sys/dev/ic/gemreg.h,v
 retrieving revision 1.14
 diff -u -r1.14 gemreg.h
 --- gemreg.h	15 Sep 2008 19:43:24 -0000	1.14
 +++ gemreg.h	11 Apr 2012 14:19:52 -0000
 @@ -516,6 +516,8 @@
  #define	GEM_MAC_CC_PASS_PAUSE	0x00000004	/* pass pause up */
  #define	GEM_MAC_CC_BITS		"\177\020b\0TXPAUSE\0b\1RXPAUSE\0b\2NOPAUSE\0\0"

 +/* GEM_MAC_MAC_STATE register bits */
 +#define GEM_MAC_STATE_OVERFLOW	0x03800000

  /* 
   * Bits in GEM_MAC_SLOT_TIME register
 Index: gemvar.h
 ===================================================================
 RCS file: /cvsroot/src/sys/dev/ic/gemvar.h,v
 retrieving revision 1.23
 diff -u -r1.23 gemvar.h
 --- gemvar.h	2 Feb 2012 19:43:03 -0000	1.23
 +++ gemvar.h	11 Apr 2012 14:19:52 -0000
 @@ -130,6 +130,7 @@
  	struct ethercom sc_ethercom;	/* ethernet common data */
  	struct mii_data	sc_mii;		/* MII media control */
  	struct callout	sc_tick_ch;	/* tick callout */
 +	struct callout	sc_rx_watchdog;	/* RX watchdog callout */

  	/* The following bus handles are to be provided by the bus front-end */
  	bus_space_tag_t	sc_bustag;	/* bus tag */
 @@ -223,6 +224,10 @@
  	struct evcnt sc_ev_rxhist[9];
  #endif

 +	/* For use by the RX watchdog */
 +	u_int32_t 	sc_rx_fifo_wr_ptr;
 +	u_int32_t	sc_rx_fifo_rd_ptr;
 +
  	enum gem_attach_stage	sc_att_stage;
  };


 ----Next_Part(Wed_Apr_11_16_25_57_2012_393)----

From: Julian Coleman <jdc@coris.org.uk>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-sparc64/46260: gem0 driver fails to recover after RX overflow
Date: Wed, 11 Apr 2012 19:56:00 +0100

 Hi,

 > Scratch that diff, here is one which works, but which ends up
 > resetting the interface Quite Often, despite the state register
 > indicating it's not in overflow mode -- it prints
 > 
 >   gem0: rx_watchdog: not in overflow state: 0x10400
 > 
 > Only once in my testing did I see
 > 
 >   gem0: rx_watchdog: rd pointer != saved
 > 
 > occur, but it *did* occur.

 I think that this approach (watchdog) is the way to go.  A few comments:

 > +			aprint_error_dev(sc->sc_dev,
 > +			    "receive error: RX no buffer space\n");

 I wonder if we should print out these extra diagnostics (we don't do it
 in other drivers).  I think that we would be better off by updating extra
 counters, and using them as per the suggestions in "Using event counters
 with network interfaces, is there a reason they're all ifdefed out of
 mainline use?" thread, starting at:

   http://mail-index.NetBSD.org/tech-kern/2011/12/10/msg012122.html

 I think that the reason that you're seeing the extra resets is the difference
 between checking GEM_MAC_RX_OVERFLOW in gem_intr():

 >  		if (rxstat & GEM_MAC_RX_OVERFLOW) {
 >  			ifp->if_ierrors++;
 > +			aprint_error_dev(sc->sc_dev,
 > +			    "receive error: RX overflow\n");
 >  			gem_reset_rxdma(sc);

 but checking GEM_MAC_STATE_OVERFLOW in gem_rx_watchdog():

 > +	if ((state & GEM_MAC_STATE_OVERFLOW) == GEM_MAC_STATE_OVERFLOW &&

 .  However, I'm not sure if we should be checking GEM_MAC_STATE_OVERFLOW and
 GEM_MAC_RX_OVERFLOW in gem_intr().  Maybe we can check GEM_MAC_STATE_OVERFLOW
 if GEM_MAC_RX_OVERFLOW is set and fire the callout only then.  Alternatively,
 we might only need to check GEM_MAC_STATE_OVERFLOW (and not bother with
 GEM_MAC_RX_OVERFLOW at all).

 Thanks,

 J

 -- 
   My other computer also runs NetBSD    /        Sailing at Newbiggin
         http://www.netbsd.org/        /   http://www.newbigginsailingclub.org/
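
The "stuck" test in gem_rx_watchdog() from the diff above can be
isolated as a pure function, which makes it easier to see which inputs
select the first branch.  This is a sketch only: the mask value is the
one added to gemreg.h in the diff, while the function name
rx_stuck_in_overflow() and its parameter names are invented here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value taken from the gemreg.h hunk in the diff above. */
#define GEM_MAC_STATE_OVERFLOW	0x03800000u

/*
 * Hypothetical helper mirroring the condition in gem_rx_watchdog():
 * all overflow bits set in the MAC state register, and either the
 * FIFO write and read pointers are equal or neither pointer has
 * moved since the values saved at interrupt time.
 */
static bool
rx_stuck_in_overflow(uint32_t state,
    uint32_t saved_wr, uint32_t saved_rd,
    uint32_t wr, uint32_t rd)
{
	bool overflow =
	    (state & GEM_MAC_STATE_OVERFLOW) == GEM_MAC_STATE_OVERFLOW;
	bool ptrs_equal = (wr == rd);
	bool no_progress = (saved_wr == wr) && (saved_rd == rd);

	return overflow && (ptrs_equal || no_progress);
}
```

With the state value 0x10400 reported above, the function returns false
whatever the pointers are, so the watchdog takes its else branch.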

From: Havard Eidnes <he@NetBSD.org>
To: gnats-bugs@NetBSD.org, jdc@coris.org.uk
Cc: port-sparc64-maintainer@netbsd.org
Subject: Re: port-sparc64/46260: gem0 driver fails to recover after RX
 overflow
Date: Thu, 12 Apr 2012 09:47:19 +0200 (CEST)

 >  > +			aprint_error_dev(sc->sc_dev,
 >  > +			    "receive error: RX no buffer space\n");
 >
 >  I wonder if we should print out these extra diagnostics (we don't do it
 >  in other drivers).  I think that we would be better off by updating extra
 >  counters, and using them as per the suggestions in "Using event counters
 >  with network interfaces, is there a reason they're all ifdefed out of
 >  mainline use?" thread, starting at:
 >
 >    http://mail-index.NetBSD.org/tech-kern/2011/12/10/msg012122.html

 That's true, I added the printf()s only as a debugging aid to more
 easily see if there was anything else going on which might trigger
 the problem.  I agree that doing printf()s on input errors is not
 appropriate for production code.

 >  I think that the reason that you're seeing the extra resets is the difference
 >  between checking GEM_MAC_RX_OVERFLOW in gem_intr():
 >
 >  >  		if (rxstat & GEM_MAC_RX_OVERFLOW) {
 >  >  			ifp->if_ierrors++;
 >  > +			aprint_error_dev(sc->sc_dev,
 >  > +			    "receive error: RX overflow\n");
 >  >  			gem_reset_rxdma(sc);
 >
 >  but checking GEM_MAC_STATE_OVERFLOW in gem_rx_watchdog():
 >
 >  > +	if ((state & GEM_MAC_STATE_OVERFLOW) == GEM_MAC_STATE_OVERFLOW &&

 Well, actually, no...  If GEM_MAC_RX_OVERFLOW is flagged, it appears
 that in my case a gem_reset_rxdma() is *not* sufficient to kick the
 receiver back to life.  Also, the GEM_MAC_STATE_OVERFLOW test in the
 gem_rx_watchdog() function in my case never kicks in, so that's why
 I added the extra code in the else clause, doing gem_init() there as
 well, doing a full unconditional reset in the watchdog function.
 Each and every time this has happened in my case, it's always been
 the code in the else clause in gem_rx_watchdog() which has kicked
 in.

 >  GEM_MAC_RX_OVERFLOW in gem_intr().  Maybe we can check GEM_MAC_STATE_OVERFLOW
 >  if GEM_MAC_RX_OVERFLOW is set and fire the callout only then.  Alternatively,
 >  we might only need to check GEM_MAC_STATE_OVERFLOW (and not bother with
 >  GEM_MAC_RX_OVERFLOW at all).

 The change I added is adapted from the OpenBSD driver, from this
 diff:

 http://www.openbsd.org/cgi-bin/cvsweb/src/sys/dev/ic/gem.c.diff?r1=1.88;r2=1.89;f=h

 Do we have any documentation anywhere which documents the bit fields
 in the GEM_MAC_MAC_STATE register?  In my case I always read back
 0x10400.

 However, what worries me is the ease with which this problem can now
 be triggered.  It doesn't take particularly heavy network traffic to
 make it happen.  And, furthermore, this appears to be a regression
 compared to the release I was running earlier, 4.0.1.


 Regards,

 - Håvard

From: Julian Coleman <jdc@coris.org.uk>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-sparc64/46260: gem0 driver fails to recover after RX overflow
Date: Thu, 12 Apr 2012 13:57:48 +0100

 Hi,

 >  That's true, I added the printf()s only as a debugging aid to more
 >  easily see if there was anything else going on which might trigger
 >  the problem.  I agree that doing printf()s on input errors is not
 >  appropriate for production code.

 OK.  Sorry.

 >  Well, actually, no...  If GEM_MAC_RX_OVERFLOW is flagged, it appears
 >  that in my case a gem_reset_rxdma() is *not* sufficient to kick the
 >  receiver back to life.  Also, the GEM_MAC_STATE_OVERFLOW test in the
 >  gem_rx_watchdog() function in my case never kicks in, so that's why
 >  I added the extra code in the else clause, doing gem_init() there as
 >  well, doing a full unconditional reset in the watchdog function.
 >  Each and every time this has happened in my case, it's always been
 >  the code in the else clause in gem_rx_watchdog() which has kicked
 >  in.

 Right - our original code reset the whole chip when we saw GEM_MAC_RX_OVERFLOW
 and we didn't check for GEM_MAC_STATE_OVERFLOW.  So, I'd expect the behaviour
 to be different if we now check both.  And, from your previous message, it
 seems that we mainly end up resetting when GEM_MAC_STATE_OVERFLOW isn't set
 (with a spurious reset when the read pointer changed).

 >  Do we have any documentation anywhere which documents the bit fields
 >  in the GEM_MAC_MAC_STATE register?  In my case I always read back
 >  0x10400.

 It seems that the GEM document is available again.  See:

   ge.pdf  GEM (First Generation PCI Gigabit Ethernet) User's Manual

 from:

   http://sosc-dr.sun.com/processors/documentation.html

 but there doesn't seem to be any information on the bits in the MAC state
 machine register though.

 >  However, what worries me is the ease with which this problem can now
 >  be triggered.  It doesn't take particularly heavy network traffic to
 >  make it happen.  And, furthermore, this appears to be a regression
 >  compared to the release I was running earlier, 4.0.1.

 Yes.  This is worrying.  See the last paragraph of 2.6.1 "RxFIFO overflow"
 and also 2.3.2 "Frame Reception".  An increase in overflows implies that the
 RX FIFO is not emptying fast enough, which implies that we are not reading
 and emptying packets from the ring buffer quickly enough when an interrupt
 occurs.  Are you able to check earlier kernels (e.g. 5.0) to get a rough
 indication of when the increased resets problem started?  I'm now unsure if
 this aspect is a gem(4) problem, or something else.

 Thanks,

 J

 -- 
   My other computer also runs NetBSD    /        Sailing at Newbiggin
         http://www.netbsd.org/        /   http://www.newbigginsailingclub.org/

From: Havard Eidnes <he@NetBSD.org>
To: jdc@coris.org.uk
Cc: gnats-bugs@NetBSD.org, martin@NetBSD.org
Subject: Re: port-sparc64/46260: gem0 driver fails to recover after RX
 overflow
Date: Wed, 25 Apr 2012 10:06:59 +0200 (CEST)

 Hi,

 I recently upgraded one of my Mac Mini G4 machines to 6.0_BETA,
 and even though I don't have console on the box, I'm pretty sure
 this bug has struck there as well.  The reason I say this is that
 I lost connectivity to the box for a few days, but after I
 rebooted it with power-off / power-on, I got messages from
 mailer-daemon about delayed or failed delivery of e-mail messages
 in the time it was "off-air".

 I'm going to patch the source tree with the workaround posted
 here earlier for the gem driver to see if this stabilizes its
 connectivity.  Another option could be to use the now-recognized
 bwi0 interface...

 So, I'm pretty sure this is a gem driver bug, and not something
 only seen on sparc64.  And... we really should fix or work around
 this problem before we release NetBSD 6.0 proper.

 Regards,

 - Håvard

From: Julian Coleman <jdc@coris.org.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-sparc64/46260: gem0 driver fails to recover after RX overflow
Date: Fri, 8 Jun 2012 10:48:56 +0100

 Hi,

 I've had a chance to look at this some more.  I've been testing on a V120.  A
 summary would be that I've found this bug very hard to reproduce under normal
 conditions.  However, adding extra debugging output to the driver makes it a
 lot easier.  For example, adding a printf in gem_rint() makes it likely that
 I'll hit the RX overflow several times when copying over a new kernel to test.
 Note that the console is 9600 baud serial.  I printed out the values of
 sc->rxptr at the end of the interrupt function, and also the value of sc->rxptr
 and the completion register when we overflow (I'd already verified that the
 value of sc->rxptr is equal to the completion register at the end of the
 interrupt function).  I see output like:

   gem0: gem_rint end sc->sc_rxptr = 6
   gem0: receive error: RX overflow sc->rxptr 6, complete 6
   gem0: gem_rint end sc->sc_rxptr = 7

 when the receiver doesn't lock up, and:

   gem0: gem_rint end sc->sc_rxptr = 100
   gem0: receive error: RX overflow sc->rxptr 100, complete 100
   gem0: receiver stuck in overflow, resetting
   gem0: gem_rint end sc->sc_rxptr = 1

 when it does.  It is possible that the chip has filled the whole ring when
 it reports overflow, but I think that is fairly unlikely.  However, I'm still
 not sure why it locks up sometimes, and especially more often with 5 or 6.  I've
 also seen occasional:

   gem0: rx_watchdog: not in overflow state: 0x810400

 I think what sometimes happens here is that we get an RX_OVERFLOW that doesn't
 lock up the receiver and also there are a low number of packets received at
 this point.  So, we can end up resetting when we don't need to.  However, I
 can't see the difference between the overflows that lock up and those that
 don't.  So, it seems best to reset here anyway.

 >  Yes.  This is worrying.  See the last paragraph of 2.6.1 "RxFIFO overflow"
 >  and also 2.3.2 "Frame Reception".  An increase in overflows implies that the
 >  RX FIFO is not emptying fast enough, which implies that we are not reading
 >  and emptying packets from the ring buffer quickly enough when an interrupt
 >  occurs.  Are you able to check earlier kernels (e.g. 5.0) to get a rough
 >  indication of when the increased resets problem started?  I'm now unsure if
 >  this aspect is a gem(4) problem, or something else.

 As I mentioned above, I don't think that we are filling up the ring buffer.  I
 had another look at the differences between the driver in netbsd-4-0 and in
 netbsd-4.  Apart from the difference between the settings of
 GEM_MAC_CONTROL_MASK and GEM_INTMASK (we don't set GEM_INTR_PCS), I can't
 see anything that would cause this.  I've checked the current code with the
 previous setting of GEM_MAC_CONTROL_MASK and with GEM_INTR_PCS interrupts
 enabled, and I didn't see any difference (I also didn't see any GEM_INTR_PCS
 interrupts).

 To try and make the hardware move packets from the RX FIFO more quickly, I
 altered the threshold in the GEM_RX_CONFIG register down to GEM_THRSH_64,
 but this doesn't seem to make much difference.
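
Lowering the threshold amounts to a read-modify-write of one bit field in the
register value.  A minimal sketch, assuming a hypothetical field layout:
`SKETCH_THRS_SHIFT`, `SKETCH_THRS_MASK`, the `SKETCH_THRSH_*` encodings and
`rx_config_set_threshold()` are made up for illustration and are not copied
from gemreg.h.

```c
#include <stdint.h>

/* ASSUMPTION: hypothetical field position and encodings, not from gemreg.h. */
#define SKETCH_THRS_SHIFT	10
#define SKETCH_THRS_MASK	(0x7u << SKETCH_THRS_SHIFT)
#define SKETCH_THRSH_64		0u
#define SKETCH_THRSH_1024	4u

/* Return cfg with the threshold field replaced and all other bits kept. */
static uint32_t
rx_config_set_threshold(uint32_t cfg, uint32_t thresh)
{
	cfg &= ~SKETCH_THRS_MASK;
	cfg |= (thresh << SKETCH_THRS_SHIFT) & SKETCH_THRS_MASK;
	return cfg;
}
```

In a driver, the value would presumably be read with bus_space_read_4() and
the result written back with bus_space_write_4() before the receiver is
enabled.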

 Looking at the history, most of the current changes came in after 4.0 was
 released, and were pulled up to the netbsd-4 branch.  Is it possible to
 try a netbsd-4 kernel, so that we can try and work out if the problem is
 with these changes, or with something that happened later, please?

 Thanks,

 J

 -- 
   My other computer also runs NetBSD    /        Sailing at Newbiggin
         http://www.netbsd.org/        /   http://www.newbigginsailingclub.org/

From: "Julian Coleman" <jdc@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46260 CVS commit: src/sys/dev/ic
Date: Mon, 2 Jul 2012 11:23:41 +0000

 Module Name:	src
 Committed By:	jdc
 Date:		Mon Jul  2 11:23:41 UTC 2012

 Modified Files:
 	src/sys/dev/ic: gem.c gemreg.h gemvar.h

 Log Message:
 Apply lockup fixes from Havard Eidnes/OpenBSD in PR port-sparc64/46260:
   - add an additional watchdog for RX overflow
   - re-initialise the chip on device timeout
 Also alter the interrupt blanking rate to 8 packets, as per OpenSolaris.


 To generate a diff of this commit:
 cvs rdiff -u -r1.98 -r1.99 src/sys/dev/ic/gem.c
 cvs rdiff -u -r1.14 -r1.15 src/sys/dev/ic/gemreg.h
 cvs rdiff -u -r1.23 -r1.24 src/sys/dev/ic/gemvar.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Jeff Rizzo" <riz@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46260 CVS commit: [netbsd-6] src/sys/dev/ic
Date: Thu, 5 Jul 2012 17:59:13 +0000

 Module Name:	src
 Committed By:	riz
 Date:		Thu Jul  5 17:59:13 UTC 2012

 Modified Files:
 	src/sys/dev/ic [netbsd-6]: gem.c gemreg.h gemvar.h

 Log Message:
 Pull up following revision(s) (requested by jdc in ticket #401):
 	sys/dev/ic/gem.c: revision 1.99
 	sys/dev/ic/gemvar.h: revision 1.24
 	sys/dev/ic/gemreg.h: revision 1.15
 Apply lockup fixes from Havard Eidnes/OpenBSD in PR port-sparc64/46260:
 - add an additional watchdog for RX overflow
 - re-initialise the chip on device timeout
 Also alter the interrupt blanking rate to 8 packets, as per OpenSolaris.


 To generate a diff of this commit:
 cvs rdiff -u -r1.98 -r1.98.2.1 src/sys/dev/ic/gem.c
 cvs rdiff -u -r1.14 -r1.14.34.1 src/sys/dev/ic/gemreg.h
 cvs rdiff -u -r1.23 -r1.23.2.1 src/sys/dev/ic/gemvar.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Manuel Bouyer" <bouyer@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46260 CVS commit: [netbsd-5] src/sys/dev/ic
Date: Sat, 15 Sep 2012 09:32:36 +0000

 Module Name:	src
 Committed By:	bouyer
 Date:		Sat Sep 15 09:32:36 UTC 2012

 Modified Files:
 	src/sys/dev/ic [netbsd-5]: gem.c gemreg.h gemvar.h

 Log Message:
 Pull up following revision(s) (requested by jdc in ticket #1789):
 	sys/dev/ic/gem.c: revision 1.99 via patch
 	sys/dev/ic/gemvar.h: revision 1.24 via patch
 	sys/dev/ic/gemreg.h: revision 1.15 via patch
 Apply lockup fixes from Havard Eidnes/OpenBSD in PR port-sparc64/46260:
 - add an additional watchdog for RX overflow
 - re-initialise the chip on device timeout
 Also alter the interrupt blanking rate to 8 packets, as per OpenSolaris.


 To generate a diff of this commit:
 cvs rdiff -u -r1.78.4.2 -r1.78.4.3 src/sys/dev/ic/gem.c
 cvs rdiff -u -r1.14 -r1.14.4.1 src/sys/dev/ic/gemreg.h
 cvs rdiff -u -r1.18 -r1.18.20.1 src/sys/dev/ic/gemvar.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sat, 15 Sep 2012 17:00:26 +0000
State-Changed-Why:
Is this fixed now?


State-Changed-From-To: feedback->closed
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Mon, 23 Dec 2013 23:49:49 +0000
State-Changed-Why:
I'm assuming this is in fact fixed.


>Unformatted:

Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.