NetBSD Problem Report #38637

From bouyer@chassiron.antioche.eu.org  Mon May 12 13:51:39 2008
Return-Path: <bouyer@chassiron.antioche.eu.org>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id ED68663B8BC
	for <gnats-bugs@gnats.NetBSD.org>; Mon, 12 May 2008 13:51:38 +0000 (UTC)
Message-Id: <200805121351.m4CDpWo8001949@chassiron.antioche.eu.org>
Date: Mon, 12 May 2008 15:51:32 +0200 (MEST)
From: bouyer@antioche.eu.org
Reply-To: bouyer@antioche.eu.org
To: gnats-bugs@gnats.NetBSD.org
Subject: pppoe fails to reconnect sometimes
X-Send-Pr-Version: 3.95

>Number:         38637
>Category:       kern
>Synopsis:       pppoe fails to reconnect sometimes
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon May 12 13:55:00 +0000 2008
>Closed-Date:    Sun Sep 20 12:47:12 +0000 2009
>Last-Modified:  Sun Sep 20 12:47:12 +0000 2009
>Originator:     Manuel Bouyer
>Release:        NetBSD 4.99.62
>Organization:
>Environment:
System: NetBSD chassiron.antioche.eu.org 4.99.62 NetBSD 4.99.62 (CHASSIRON) #1: Sun May 4 21:57:20 CEST 2008 bouyer@rock:/dsk/l1/misc/bouyer/tmp/sparc/obj/dsk/l1/misc/bouyer/current/src/sys/arch/sparc/compile/CHASSIRON sparc
Architecture: sparc
Machine: sparc
>Description:
	Since I upgraded to 4.99.62, pppoe sometimes fails to reconnect.
	The kernel says:
May 12 14:22:48 chassiron /netbsd: pppoe0: LCP keepalive timed out, going to restart the connection
	and then, the only activity on the ethernet link is:
15:17:53.128532 PPPoE PADI [Service-Name] [Host-Uniq 0xF0B16000]
15:17:53.186377 PPPoE PADO [AC-Name "BSMSO108"] [Host-Uniq 0xF0B16000] [Service-Name] [AC-Cookie 0xF017FC9DEF801E29EFF08EEFC116CACE]
	once per minute. Here it has been in this state for about 55 minutes.
	A reboot cause the pppoe device to reconnect immediatly. A down/up
	or destroy/create of the pppoe0 device doesn't help, a reboot is
	required. This seems to only happens when the link is actively used
	(e.g. pop3 and imap clients active, a web browser open with
	self-refershing pages loaded, some ssh sessions, ...). The link has
	been up for several days while I was not home ...
	A normal connect (after a fresh reboot) looks like
15:35:03.563740 PPPoE PADI [Service-Name] [Host-Uniq 0xF0B13000]
15:35:03.678298 PPPoE PADO [AC-Name "BSMSO108"] [Host-Uniq 0xF0B13000] [Service-Name] [AC-Cookie 0xF017FC9DEF801E29EFF08EEFC116CACE]
15:35:03.678667 PPPoE PADR [Service-Name] [AC-Cookie 0xF017FC9DEF801E29EFF08EEFC116CACE] [Host-Uniq 0xF0B13000]
15:35:03.838963 PPPoE PADS [ses 0x104a] [Service-Name] [Host-Uniq 0xF0B13000] [AC-Name "BSMSO108"] [AC-Cookie 0xF017FC9DEF801E29EFF08EEFC116CACE]
15:35:03.850141 PPPoE  [ses 0x104a] LCP, Conf-Request (0x01), id 1, length 16
15:35:03.958371 PPPoE  [ses 0x104a] LCP, Conf-Request (0x01), id 113, length 21
15:35:03.958829 PPPoE  [ses 0x104a] LCP, Conf-Ack (0x02), id 113, length 21
15:35:03.961091 PPPoE  [ses 0x104a] LCP, Conf-Ack (0x02), id 1, length 16
15:35:04.016830 PPPoE  [ses 0x104a] CHAP, Challenge (0x01), id 22, Value ff40f7a275c2a4792ce99b25447d7c2a, Name BSMSO108
15:35:04.017743 PPPoE  [ses 0x104a] CHAP, Response (0x02), id 22, Value 748b4c144654d0ff50853491582a6a17, Name bouyer@net1.nerim.nerim
15:35:04.362918 PPPoE  [ses 0x104a] LCP, Conf-Request (0x01), id 1, length 17
15:35:04.363401 PPPoE  [ses 0x104a] LCP, Conf-Ack (0x02), id 1, length 17
15:35:04.363746 PPPoE  [ses 0x104a] LCP, Conf-Request (0x01), id 2, length 16
15:35:04.422696 PPPoE  [ses 0x104a] LCP, Conf-Nack (0x03), id 2, length 10
15:35:04.423209 PPPoE  [ses 0x104a] LCP, Conf-Request (0x01), id 3, length 16
15:35:04.475556 PPPoE  [ses 0x104a] LCP, Conf-Nack (0x03), id 3, length 10
15:35:04.476004 PPPoE  [ses 0x104a] LCP, Conf-Request (0x01), id 4, length 16
15:35:04.531192 PPPoE  [ses 0x104a] LCP, Conf-Nack (0x03), id 4, length 10
15:35:04.531779 PPPoE  [ses 0x104a] LCP, Conf-Request (0x01), id 5, length 16
15:35:04.585438 PPPoE  [ses 0x104a] LCP, Conf-Nack (0x03), id 5, length 10
15:35:04.586005 PPPoE  [ses 0x104a] LCP, Conf-Request (0x01), id 6, length 16
15:35:04.642439 PPPoE  [ses 0x104a] LCP, Conf-Nack (0x03), id 6, length 10
15:35:04.643141 PPPoE  [ses 0x104a] LCP, Conf-Request (0x01), id 7, length 16
15:35:04.694636 PPPoE  [ses 0x104a] LCP, Conf-Reject (0x04), id 7, length 10
15:35:04.695074 PPPoE  [ses 0x104a] LCP, Conf-Request (0x01), id 8, length 12
15:35:04.750559 PPPoE  [ses 0x104a] LCP, Conf-Ack (0x02), id 8, length 12
15:35:04.753007 PPPoE  [ses 0x104a] CHAP, Challenge (0x01), id 23, Value c7707a87f9e0df56aa137099f6c6961e, Name lns303-tip-courbevoie
15:35:04.753528 PPPoE  [ses 0x104a] CHAP, Response (0x02), id 23, Value b02dbd3ffedc96e7d42166e83c8bc657, Name bouyer@net1.nerim.nerim
15:35:04.822418 PPPoE  [ses 0x104a] CHAP, Success (0x03), id 23, Msg
15:35:04.822963 PPPoE  [ses 0x104a] IPCP, Conf-Request (0x01), id 1, length 6

so it looks like the PADO sent by BSMSO108 is ignored by the kernel in the
non-working case.
It should also be noted that the telco forces a disconnect every 24 hours,
and in this case the kernel has no problems reconnecting. It only
happens when it detects a "LCP keepalive timed out", in some case
(if I power off the modem, the kernel also detects the LCP timeout, but
powering on the modem again gives a working connection again after a few
minutes).

I didn't have this issue with the previous (4.99.20, June 2007) kernel.

>How-To-Repeat:
	use pppoe ?
>Fix:
	unknown

>Release-Note:

>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/38637: pppoe fails to reconnect sometimes
Date: Mon, 12 May 2008 16:43:05 +0200

 --5mCyUwZo2JvN/JJP
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: inline

 Could you try if the printf in the following patch fires in the non
 working case?

 Thanks,

 Martin

 --5mCyUwZo2JvN/JJP
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: attachment; filename=patch

 Index: if_pppoe.c
 ===================================================================
 RCS file: /cvsroot/src/sys/net/if_pppoe.c,v
 retrieving revision 1.86
 diff -u -r1.86 if_pppoe.c
 --- if_pppoe.c	28 Apr 2008 20:24:09 -0000	1.86
 +++ if_pppoe.c	12 May 2008 14:41:56 -0000
 @@ -663,8 +663,12 @@
  				free(sc->sc_ac_cookie, M_DEVBUF);
  			sc->sc_ac_cookie = malloc(ac_cookie_len, M_DEVBUF,
  			    M_DONTWAIT);
 -			if (sc->sc_ac_cookie == NULL)
 +			if (sc->sc_ac_cookie == NULL) {
 +				printf("%s: FATAL: could not allocate memory "
 +				    "for AC cookie\n",
 +				    sc->sc_sppp.pp_if.if_xname);
  				goto done;
 +			}
  			sc->sc_ac_cookie_len = ac_cookie_len;
  			memcpy(sc->sc_ac_cookie, ac_cookie, ac_cookie_len);
  		}

 --5mCyUwZo2JvN/JJP--

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/38637: pppoe fails to reconnect sometimes
Date: Mon, 12 May 2008 23:09:08 +0200

 On Mon, May 12, 2008 at 02:45:02PM +0000, Martin Husemann wrote:
 > The following reply was made to PR kern/38637; it has been noted by GNATS.
 > 
 > From: Martin Husemann <martin@duskware.de>
 > To: gnats-bugs@NetBSD.org
 > Cc: 
 > Subject: Re: kern/38637: pppoe fails to reconnect sometimes
 > Date: Mon, 12 May 2008 16:43:05 +0200
 > 
 >  --5mCyUwZo2JvN/JJP
 >  Content-Type: text/plain; charset=us-ascii
 >  Content-Disposition: inline
 >  
 >  Could you try if the printf in the following patch fires in the non
 >  working case?

 New kernel installed, now waiting for the issue to show up again.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/38637: pppoe fails to reconnect sometimes
Date: Sun, 25 May 2008 22:22:56 +0200

 On Mon, May 12, 2008 at 11:09:08PM +0200, Manuel Bouyer wrote:
 > On Mon, May 12, 2008 at 02:45:02PM +0000, Martin Husemann wrote:
 > > The following reply was made to PR kern/38637; it has been noted by GNATS.
 > > 
 > > From: Martin Husemann <martin@duskware.de>
 > > To: gnats-bugs@NetBSD.org
 > > Cc: 
 > > Subject: Re: kern/38637: pppoe fails to reconnect sometimes
 > > Date: Mon, 12 May 2008 16:43:05 +0200
 > > 
 > >  --5mCyUwZo2JvN/JJP
 > >  Content-Type: text/plain; charset=us-ascii
 > >  Content-Disposition: inline
 > >  
 > >  Could you try if the printf in the following patch fires in the non
 > >  working case?
 > 
 > New kernel installed, now waiting for the issue to show up again.

 I did, and the printf didn't fire.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Martin Husemann <martin@duskware.de>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/38637: pppoe fails to reconnect sometimes
Date: Thu, 2 Oct 2008 12:45:14 +0200

 On Wed, Oct 01, 2008 at 10:10:27PM +0200, Manuel Bouyer wrote:
 > (all other queue length are 0). So it looks like the pppoe queues stop
 > being processed, for unknown reason. This may even be the reason of the
 > disconnect in the first place.

 Sorry, I don't see how this could happen besides the softint handling
 breaking completely (something corrupting the call wheel?):

 	ether_input does:

                 if (etype == ETHERTYPE_PPPOEDISC)
                         inq = &ppoediscinq;
                 else
                         inq = &ppoeinq;
                 if (IF_QFULL(inq)) {
                         IF_DROP(inq);
                         m_freem(m);
                 } else
                         IF_ENQUEUE(inq, m);
                 softint_schedule(pppoe_softintr);
                 return;

 This should make pppoe_softintr_handler() run, which just does:

         mutex_enter(softnet_lock);
         pppoe_input();
         mutex_exit(softnet_lock);

 and pppoe_input() loops untill the queues are both empty.

 Can you find out if pppoe_softintr_handler is getting called at all?

 Martin

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/38637: pppoe fails to reconnect sometimes
Date: Sat, 4 Oct 2008 13:00:19 +0200

 --5vNYLRcllDrimb99
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: inline

 On Thu, Oct 02, 2008 at 10:50:05AM +0000, Martin Husemann wrote:
 >  On Wed, Oct 01, 2008 at 10:10:27PM +0200, Manuel Bouyer wrote:
 >  > (all other queue length are 0). So it looks like the pppoe queues stop
 >  > being processed, for unknown reason. This may even be the reason of the
 >  > disconnect in the first place.
 >  
 >  Sorry, I don't see how this could happen besides the softint handling
 >  breaking completely (something corrupting the call wheel?):
 >  
 >  	ether_input does:
 >  
 >                  if (etype == ETHERTYPE_PPPOEDISC)
 >                          inq = &ppoediscinq;
 >                  else
 >                          inq = &ppoeinq;
 >                  if (IF_QFULL(inq)) {
 >                          IF_DROP(inq);
 >                          m_freem(m);
 >                  } else
 >                          IF_ENQUEUE(inq, m);
 >                  softint_schedule(pppoe_softintr);
 >                  return;
 >  
 >  This should make pppoe_softintr_handler() run, which just does:
 >  
 >          mutex_enter(softnet_lock);
 >          pppoe_input();
 >          mutex_exit(softnet_lock);
 >  
 >  and pppoe_input() loops untill the queues are both empty.
 >  
 >  Can you find out if pppoe_softintr_handler is getting called at all?

 It happened again, I did a break to ddb and set a breakpoint on
 pppoe_softintr_handler. It didn't fire.
 What's more strange is that tcpdump on the interface shows the host sending
 PADI paquests, and the BAS replying with PADO. tcpdump shows the ethernet
 addresses in the packet header are right. Despite this, pppoe_softintr_handler
 seems not to be called, and more strange, netstat -q doen't show the
 queue lengths for the pppoe queues growing, nor queue drops.

 So I set a breakpoint on ether_input() and single-stepped here.
 softint_schedule(pppoe_softintr) is being called. But on the return
 from softint_schedule() ddb entered clockintr_4(), and then
 sparc_interrupt44c(), and the system did reset with a watchdog here.
 Attached is a trace of my ddb session. Hope it helps ...

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

 --5vNYLRcllDrimb99
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: attachment; filename=ether_input

 db> c
 Breakpoint in pid 0.2 (system) at       netbsd:ether_input:     save            % sp, -0x70, %sp
 db> s
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x4: lduh            [%i0 + 0
 x30], %o3
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x8: andcc           %o3, 0x1
 , %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0xc: be              netbsd:e
 ther_input+0x20c
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x10:        st              %
 i1, [%fp + 0x48]
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x14:        ld              [
 %i1 + 0x8], %l1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x18:        sethi           %
 hi(0x1000), %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x1c:        ldub            [
 %l1 + 0xc], %g2
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x20:        sll             %
 g2, 0x8, %g2
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x24:        ld              [
 %i1 + 0x14], %l0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x28:        ld              [
 %i1 + 0x28], %g4
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x2c:        andcc           %
 l0, %g1, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x30:        or              %
 g0, 0x0, %o4
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x34:        ldub            [
 %l1 + 0xd], %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x38:        or              %
 g0, 0x4, %o5
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x3c:        or              %
 g1, %g2, %l2
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x40:        be              n
 etbsd:ether_input+0x21c
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x44:        ldd             [
 %i0 + 0x40], %g2
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x21c:       or              %
 g0, 0x0, %o5
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x220:       addcc           %
 o5, %g3, %o5
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x224:       or              %
 g0, 0x0, %o4
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x228:       sethi           %
 hi(0x8000), %g1
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x22c:       addc            %
 o4, %g2, %o4
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x230:       or              %
 g1, 0x100, %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x234:       sll             %
 l2, 0x10, %g2
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x238:       srl             %
 g2, 0x10, %g2
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x23c:       subcc           %
 g2, %g1, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x240:       bne             n
 etbsd:ether_input+0x70
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x244:       sethi           %
 hi(0x8800), %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x70:        or              %
 g0, 0x16, %g3
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x74:        or              %
 g1, 0x64, %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x78:        subcc           %
 g2, %g1, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x7c:        be              n
 etbsd:ether_input+0x8c
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x80:        or              %
 g0, 0x0, %g2
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x84:        or              %
 g0, 0x0, %g2
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x88:        or              %
 g0, 0xe, %g3
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x8c:        addcc           %
 g3, %o5, %o5
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x90:        addc            %
 g2, %o4, %o4
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x94:        sra             %
 g4, 0x1f, %g2
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x98:        subcc           %
 g2, %o4, %g0
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x9c:        bgu             n
 etbsd:ether_input+0x260
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0xa0:        sethi           %
 hi(0xf02f7000), %o0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0xa4:        subcc           %
 g2, %o4, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0xa8:        be              n
 etbsd:ether_input+0x254
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0xac:        subcc           %
 g4, %o5, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x254:       bleu,a          n
 etbsd:ether_input+0xb4
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x258:       ldub            [
 %l1 + %g0], %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0xb4:        andcc           %
 g1, 0x1, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0xb8:        be,a            n
 etbsd:ether_input+0x12c
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0xbc:        ld              [
 %fp + 0x48], %i1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x12c:       sethi           %
 hi(0x1000), %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x130:       ld              [
 %i1 + 0x14], %g2
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x134:       andcc           %
 g2, %g1, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x138:       bne             n
 etbsd:ether_input+0x314
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x13c:       or              %
 g0, %i1, %o0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x140:       ld              [
 %i1 + 0x28], %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x144:       ldd             [
 %i0 + 0x80], %o4
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x148:       addcc           %
 o5, %g1, %o5
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x14c:       sra             %
 g1, 0x1f, %g2
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x150:       sethi           %
 hi(0x2000), %l3
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x154:       addc            %
 o4, %g2, %o4
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x158:       or              %
 l3, 0x300, %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x15c:       std             %
 o4, [%i0 + 0x80]
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x160:       ld              [
 %i1 + 0x14], %l0
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x164:       andcc           %
 l0, %g1, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x168:       be,a            n
 etbsd:ether_input+0x290
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x16c:       lduh            [
 %i0 + 0x30], %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x290:       andcc           %
 g1, 0x100, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x294:       be,a            n
 etbsd:ether_input+0x174
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x298:       ld              [
 %i1 + 0x14], %g2
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x174:       sethi           %
 hi(0x2000), %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x178:       andcc           %
 g2, %g1, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x17c:       be              n
 etbsd:ether_input+0x2d4
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x180:       add             %
 i0, 0x12c, %o0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x2d4:       add             %
 fp, 0x48, %o1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x2d8:       or              %
 g0, %i0, %o2
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x2dc:       call            n
 etbsd:pfil_run_hooks
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x2e0:       or              %
 g0, 0x1, %o3
 db> 
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks:  save            %sp, -0x
 70, %sp
 db> 
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x4:      and             %
 i3, 0x3, %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x8:      subcc           %
 g0, %g1, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0xc:      addc            %
 g0, 0x0, %g2
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x10:     subcc           %
 g0, %i1, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x14:     addc            %
 g0, 0x0, %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x18:     andcc           %
 g2, %g1, %l0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x1c:     be              n
 etbsd:pfil_run_hooks+0x2c
 db> 
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x20:     st              %
 g0, [%fp - 0xc]
 db> 
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x24:     ld              [
 %i1 + %g0], %g1
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x28:     st              %
 g1, [%fp - 0xc]
 db> 
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x2c:     subcc           %
 i3, 0x1, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x30:     be              n
 etbsd:pfil_run_hooks+0x6c
 db> 
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x34:     subcc           %
 i3, 0x2, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x6c:     ld              [
 %i0 + %g0], %i0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x70:     subcc           %
 i0, 0x0, %g0
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x74:     be              n
 etbsd:pfil_run_hooks+0x50
 db> 
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x78:     or              %
 g0, 0x0, %o0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x50:     or              %
 g0, 0x0, %o0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x54:     subcc           %
 l0, 0x0, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x58:     be              n
 etbsd:pfil_run_hooks+0x64
 db> 
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x5c:     ld              [
 %fp - 0xc], %g1
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x60:     st              %
 g1, [%i1 + %g0]
 db> 
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x64:     jmpl            [
 %i7 + 0x8], %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:pfil_run_hooks+0x68:     restore         %
 g0, %o0, %o0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x2e4:       subcc           %
 o0, 0x0, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x2e8:       bne             n
 etbsd:ether_input+0x214
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x2ec:       ld              [
 %fp + 0x48], %i1
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x2f0:       subcc           %
 i1, 0x0, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x2f4:       be              n
 etbsd:ether_input+0x634
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x2f8:       nop
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x2fc:       ld              [
 %i1 + 0x8], %l1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x300:       ldub            [
 %l1 + 0xc], %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x304:       ldub            [
 %l1 + 0xd], %g2
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x308:       sll             %
 g1, 0x8, %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x30c:       ba              n
 etbsd:ether_input+0x184
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x310:       or              %
 g2, %g1, %l2
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x184:       ld              [
 %i0 + 0x220], %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x188:       subcc           %
 g1, 0x0, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x18c:       bne             n
 etbsd:ether_input+0x338
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x190:       or              %
 g0, %i1, %o0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x194:       sll             %
 l2, 0x10, %l0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x198:       sethi           %
 hi(0x8800), %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x19c:       srl             %
 l0, 0x10, %g2
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x1a0:       or              %
 g1, 0x9, %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x1a4:       subcc           %
 g2, %g1, %g0
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x1a8:       be              n
 etbsd:ether_input+0x42c
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x1ac:       or              %
 g0, %i1, %o0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x1b0:       bgu             n
 etbsd:ether_input+0x384
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x1b4:       sethi           %
 hi(0xffff7400), %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x384:       or              %
 g1, 0x39d, %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x388:       add             %
 g2, %g1, %g1
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x38c:       subcc           %
 g1, 0x1, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x390:       bgu             n
 etbsd:ether_input+0x1d0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x394:       ld              [
 %i1 + 0x14], %g2
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x398:       sethi           %
 hi(0x2000), %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x39c:       andcc           %
 g2, %g1, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x3a0:       bne             n
 etbsd:ether_input+0x20c
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x3a4:       andcc           %
 g2, 0x300, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x3a8:       bne             n
 etbsd:ether_input+0x20c
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x3ac:       srl             %
 l0, 0x10, %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x3b0:       sethi           %
 hi(0x8800), %g2
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x3b4:       sethi           %
 hi(0xf02b6c00), %g3
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x3b8:       or              %
 g2, 0x63, %g2
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x3bc:       subcc           %
 g1, %g2, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x3c0:       be              n
 etbsd:ether_input+0x3d0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x3c4:       or              %
 g3, 0x58, %g3
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x3d0:       ld              [
 %g3 + 0x8], %g2
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x3d4:       ld              [
 %g3 + 0xc], %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x3d8:       subcc           %
 g2, %g1, %g0
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x3dc:       bge,a           n
 etbsd:ether_input+0x63c
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x3e4:       st              %
 g0, [%i1 + 0x4]
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x3e8:       ld              [
 %g3 + 0x4], %g2
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x3ec:       subcc           %
 g2, 0x0, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x3f0:       be              n
 etbsd:ether_input+0x67c
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x3f4:       ld              [
 %fp + 0x48], %g1
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x3f8:       st              %
 g1, [%g2 + 0x4]
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x3fc:       ld              [
 %fp + 0x48], %g2
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x400:       ld              [
 %g3 + 0x8], %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x404:       st              %
 g2, [%g3 + 0x4]
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x408:       add             %
 g1, 0x1, %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x40c:       st              %
 g1, [%g3 + 0x8]
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x410:       sethi           %
 hi(0xf02b6c00), %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x414:       call            n
 etbsd:softint_schedule
 db> 
 Stopped in pid 0.2 (system) at  netbsd:ether_input+0x418:       ld              [
 %g1 + 0x80], %o0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:softint_schedule:        save            %
 sp, -0x68, %sp
 db> 
 Stopped in pid 0.2 (system) at  netbsd:softint_schedule+0x4:    sethi           %
 hi(0xf0002000), %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:softint_schedule+0x8:    ld              [
 %g1 + 0x100], %g2
 Stopped in pid 0.2 (system) at  netbsd:softint_schedule+0xc:    ld              [
 %g2 + 0xc0], %g2
 db> 
 Stopped in pid 0.2 (system) at  netbsd:softint_schedule+0x10:   add             %
 i0, %g2, %g3
 db> 
 Stopped in pid 0.2 (system) at  netbsd:softint_schedule+0x14:   ld              [
 %g3 + 0x10], %g1
 db> 
 Stopped in pid 0.2 (system) at  netbsd:softint_schedule+0x18:   subcc           %
 g1, 0x0, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:softint_schedule+0x1c:   bne             n
 etbsd:softint_schedule+0x7c
 db> 
 Stopped in pid 0.2 (system) at  netbsd:softint_schedule+0x20:   nop
 Stopped in pid 0.2 (system) at  netbsd:softint_schedule+0x7c:   jmpl            [
 %i7 + 0x8], %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:softint_schedule+0x80:   restore         %
 g0, %g0, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:clockintr_4+0x24:        jmpl            [
 %i7 + 0x8], %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:clockintr_4+0x28:        restore         %
 g0, %g0, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:sparc_interrupt44c+0x130:        orcc            %
 o0, %g0, %g0
 db> 
 Stopped in pid 0.2 (system) at  netbsd:sparc_interrupt44c+0x134:        bne             n
 etbsd:sparc_interrupt44c+0x164
 Stopped in pid 0.2 (system) at  netbsd:sparc_interrupt44c+0x138:        nop
 db> 
 Stopped in pid 0.2 (system) at  netbsd:sparc_interrupt44c+0x164:        sethi           %
 hi(0xf0002000), %o2
 db> 
 Stopped in pid 0.2 (system) at  netbsd:sparc_interrupt44c+0x168:        ld              [
 %o2 + 0x1b4], %o3
 db> 
 Stopped in pid 0.2 (system) at  netbsd:sparc_interrupt44c+0x16c:        sub             %
 o3, 0x1, %o3
 db> 
 Stopped in pid 0.2 (system) at  netbsd:sparc_interrupt44c+0x170:        st              %
 o3, [%o2 + 0x1b4]
 db> 
 Stopped in pid 0.2 (system) at  netbsd:sparc_interrupt44c+0x174:        or              %
 Stopped in pid 0.2 (system) at  netbsd:sparc_interrupt44c+0x178:        wr              %
 l6, 0x0, %y
 db>  
 Stopped in pid 0.2 (system) at  netbsd:sparc_interrupt44c+0x17c:        ldd             [
 %sp + 0x78], %g2
 db> 
 Stopped in pid 0.2 (system) at  netbsd:sparc_interrupt44c+0x180:        ldd             [
 %sp + 0x80], %g4
 db> 
 Stopped in pid 0.2 (system) at  netbsd:sparc_interrupt44c+0x184:        ldd             [
 %sp + 0x88], %g6
 db> 
 Stopped in pid 0.2 (system) at  netbsd:sparc_interrupt44c+0x188:        ba              n
 etbsd:sparc_interrupt44c+0x760
 db> 
 Stopped in pid 0.2 (system) at  netbsd:sparc_interrupt44c+0x18c:        wr              %
 l0, 0x0, %psr


 --5vNYLRcllDrimb99--

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/38637: pppoe fails to reconnect sometimes
Date: Fri, 3 Apr 2009 20:06:00 +0200

 On Thu, Oct 02, 2008 at 10:50:05AM +0000, Martin Husemann wrote:
 >  Sorry, I don't see how this could happen besides the softint handling
 >  breaking completely (something corrupting the call wheel?):
 > [...]

 Update on this: this time (with a 5.0_RC3 kernel), it's not the ppoeinq
 which stopped being processed and did overflow, but ipintrq !
 So it's really related to soft interrupt not being called anymore ...

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/38637: pppoe fails to reconnect sometimes
Date: Tue, 5 May 2009 22:20:21 +0200

 On Fri, Apr 03, 2009 at 08:06:00PM +0200, Manuel Bouyer wrote:
 > On Thu, Oct 02, 2008 at 10:50:05AM +0000, Martin Husemann wrote:
 > >  Sorry, I don't see how this could happen besides the softint handling
 > >  breaking completely (something corrupting the call wheel?):
 > > [...]
 > 
 > Update on this: this time (with a 5.0_RC3 kernel), it's not the ppoeinq
 > which stopped being processed and did overflow, but ipintrq !
 > So it's really related to soft interrupt not being called anymore ...

 I think I found the reason for this (or at last a possible reason).
 In sys/kern/kern_softint.c, sh_flags should be declared volatile.
 Without it, on ports where splhigh() is inline, the compiler will optimise
 the second SOFTINT_PENDING test in softint_schedule(). A dissasembly
 of softint_schedule() with and without the volatile sh_flags confirm this
 on sparc.
 Because of this there is a race that could lead to the softhand_t
 being enqueued twice on si_q, leading to a corrupted queue and
 some handler being SOFTINT_PENDING but never called.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/38637: pppoe fails to reconnect sometimes
Date: Tue, 5 May 2009 22:20:21 +0200

 On Fri, Apr 03, 2009 at 08:06:00PM +0200, Manuel Bouyer wrote:
 > On Thu, Oct 02, 2008 at 10:50:05AM +0000, Martin Husemann wrote:
 > >  Sorry, I don't see how this could happen besides the softint handling
 > >  breaking completely (something corrupting the call wheel?):
 > > [...]
 > 
 > Update on this: this time (with a 5.0_RC3 kernel), it's not the ppoeinq
 > which stopped being processed and did overflow, but ipintrq !
 > So it's really related to soft interrupt not being called anymore ...

 I think I found the reason for this (or at last a possible reason).
 In sys/kern/kern_softint.c, sh_flags should be declared volatile.
 Without it, on ports where splhigh() is inline, the compiler will optimise
 the second SOFTINT_PENDING test in softint_schedule(). A dissasembly
 of softint_schedule() with and without the volatile sh_flags confirm this
 on sparc.
 Because of this there is a race that could lead to the softhand_t
 being enqueued twice on si_q, leading to a corrupted queue and
 some handler being SOFTINT_PENDING but never called.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Manuel Bouyer <bouyer@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/38637 CVS commit: src/sys/kern
Date: Tue, 5 May 2009 20:26:37 +0000

 Module Name:	src
 Committed By:	bouyer
 Date:		Tue May  5 20:26:36 UTC 2009

 Modified Files:
 	src/sys/kern: kern_softint.c

 Log Message:
 Declare sh_flags volatile.
 Without it, on ports where splhigh() is inline, the compiler will optimise
 the second SOFTINT_PENDING test in softint_schedule(). A dissasembly
 of softint_schedule() with and without the volatile sh_flags confirm this
 on sparc.
 Because of this there is a race that could lead to the softhand_t
 being enqueued twice on si_q, leading to a corrupted queue and
 some handler being SOFTINT_PENDING but never called.

 Should fix PR kern/38637


 To generate a diff of this commit:
 cvs rdiff -u -r1.26 -r1.27 src/sys/kern/kern_softint.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Andrew Doran <ad@netbsd.org>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org,
	gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/38637: pppoe fails to reconnect sometimes
Date: Tue, 5 May 2009 20:52:45 +0000

 On Tue, May 05, 2009 at 10:20:21PM +0200, Manuel Bouyer wrote:

 > On Fri, Apr 03, 2009 at 08:06:00PM +0200, Manuel Bouyer wrote:
 > > On Thu, Oct 02, 2008 at 10:50:05AM +0000, Martin Husemann wrote:
 > > >  Sorry, I don't see how this could happen besides the softint handling
 > > >  breaking completely (something corrupting the call wheel?):
 > > > [...]
 > > 
 > > Update on this: this time (with a 5.0_RC3 kernel), it's not the ppoeinq
 > > which stopped being processed and did overflow, but ipintrq !
 > > So it's really related to soft interrupt not being called anymore ...
 > 
 > I think I found the reason for this (or at last a possible reason).
 > In sys/kern/kern_softint.c, sh_flags should be declared volatile.
 > Without it, on ports where splhigh() is inline, the compiler will optimise
 > the second SOFTINT_PENDING test in softint_schedule(). A dissasembly
 > of softint_schedule() with and without the volatile sh_flags confirm this
 > on sparc.
 > Because of this there is a race that could lead to the softhand_t
 > being enqueued twice on si_q, leading to a corrupted queue and
 > some handler being SOFTINT_PENDING but never called.

 Nice diagnosis! However the softint code is correct.

 splhigh/splx on sparc64 should be real functions, or if one believes there
 is a performance advantage, should be inlines with __insn_barrier in the
 correct spots. Consider the potential effects elsewhere. I pointed this
 issue out to Martin or Matthew a couple of years ago and then promptly
 forgot about it.

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Andrew Doran <ad@netbsd.org>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
        netbsd-bugs@netbsd.org
Subject: Re: kern/38637: pppoe fails to reconnect sometimes
Date: Tue, 5 May 2009 23:26:46 +0200

 On Tue, May 05, 2009 at 08:52:45PM +0000, Andrew Doran wrote:
 > > I think I found the reason for this (or at last a possible reason).
 > > In sys/kern/kern_softint.c, sh_flags should be declared volatile.
 > > Without it, on ports where splhigh() is inline, the compiler will optimise
 > > the second SOFTINT_PENDING test in softint_schedule(). A dissasembly
 > > of softint_schedule() with and without the volatile sh_flags confirm this
 > > on sparc.
 > > Because of this there is a race that could lead to the softhand_t
 > > being enqueued twice on si_q, leading to a corrupted queue and
 > > some handler being SOFTINT_PENDING but never called.
 > 
 > Nice diagnosis! However the softint code is correct.
 > 
 > splhigh/splx on sparc64 should be real functions, or if one believes there
 > is a performance advantage, should be inlines with __insn_barrier in the
 > correct spots. Consider the potential effects elsewhere. I pointed this
 > issue out to Martin or Matthew a couple of years ago and then promptly
 > forgot about it.

 Yes, it seems __insn_barrier() would do it too. All ports needs to be audited
 then; I've seen this on sparc not sparc64 (I don't know if they share
 their spl functions) and macallan@ reported this could be an issue on
 macppc too.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Martin Husemann <martin@duskware.de>
To: Andrew Doran <ad@netbsd.org>
Cc: Manuel Bouyer <bouyer@antioche.eu.org>, gnats-bugs@NetBSD.org
Subject: Re: kern/38637: pppoe fails to reconnect sometimes
Date: Wed, 6 May 2009 00:14:40 +0200

 On Tue, May 05, 2009 at 08:52:45PM +0000, Andrew Doran wrote:
 > splhigh/splx on sparc64 should be real functions, or if one believes there
 > is a performance advantage, should be inlines with __insn_barrier in the
 > correct spots. Consider the potential effects elsewhere. I pointed this
 > issue out to Martin or Matthew a couple of years ago and then promptly
 > forgot about it.

 You did, and I added the barriers to sparc64/include/psl.h:

 ----------------------------
 revision 1.38
 date: 2007/12/11 08:59:57;  author: martin;  state: Exp;  lines: +9 -9
 branches:  1.38.4;  1.38.8;
 Make sure the compiler does not reorder stores accross spl...() calls
 [or, for completeness, calls changing the processor state]. The mutex
 code has small race windows otherwise.
 ----------------------------

 Martin

State-Changed-From-To: open->closed
State-Changed-By: bouyer@NetBSD.org
State-Changed-When: Sun, 20 Sep 2009 12:47:12 +0000
State-Changed-Why:
Fixed by making sure that spl functions act as a memory barrier.


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.