NetBSD Problem Report #38637
From bouyer@chassiron.antioche.eu.org Mon May 12 13:51:39 2008
Return-Path: <bouyer@chassiron.antioche.eu.org>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by narn.NetBSD.org (Postfix) with ESMTP id ED68663B8BC
for <gnats-bugs@gnats.NetBSD.org>; Mon, 12 May 2008 13:51:38 +0000 (UTC)
Message-Id: <200805121351.m4CDpWo8001949@chassiron.antioche.eu.org>
Date: Mon, 12 May 2008 15:51:32 +0200 (MEST)
From: bouyer@antioche.eu.org
Reply-To: bouyer@antioche.eu.org
To: gnats-bugs@gnats.NetBSD.org
Subject: pppoe fails to reconnect sometimes
X-Send-Pr-Version: 3.95
>Number: 38637
>Category: kern
>Synopsis: pppoe fails to reconnect sometimes
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon May 12 13:55:00 +0000 2008
>Closed-Date: Sun Sep 20 12:47:12 +0000 2009
>Last-Modified: Sun Sep 20 12:47:12 +0000 2009
>Originator: Manuel Bouyer
>Release: NetBSD 4.99.62
>Organization:
>Environment:
System: NetBSD chassiron.antioche.eu.org 4.99.62 NetBSD 4.99.62 (CHASSIRON) #1: Sun May 4 21:57:20 CEST 2008 bouyer@rock:/dsk/l1/misc/bouyer/tmp/sparc/obj/dsk/l1/misc/bouyer/current/src/sys/arch/sparc/compile/CHASSIRON sparc
Architecture: sparc
Machine: sparc
>Description:
Since I upgraded to 4.99.62, pppoe sometimes fails to reconnect.
The kernel says:
May 12 14:22:48 chassiron /netbsd: pppoe0: LCP keepalive timed out, going to restart the connection
and then, the only activity on the ethernet link is:
15:17:53.128532 PPPoE PADI [Service-Name] [Host-Uniq 0xF0B16000]
15:17:53.186377 PPPoE PADO [AC-Name "BSMSO108"] [Host-Uniq 0xF0B16000] [Service-Name] [AC-Cookie 0xF017FC9DEF801E29EFF08EEFC116CACE]
once per minute. Here it has been in this state for about 55 minutes.
A reboot cause the pppoe device to reconnect immediatly. A down/up
or destroy/create of the pppoe0 device doesn't help, a reboot is
required. This seems to only happens when the link is actively used
(e.g. pop3 and imap clients active, a web browser open with
self-refershing pages loaded, some ssh sessions, ...). The link has
been up for several days while I was not home ...
A normal connect (after a fresh reboot) looks like
15:35:03.563740 PPPoE PADI [Service-Name] [Host-Uniq 0xF0B13000]
15:35:03.678298 PPPoE PADO [AC-Name "BSMSO108"] [Host-Uniq 0xF0B13000] [Service-Name] [AC-Cookie 0xF017FC9DEF801E29EFF08EEFC116CACE]
15:35:03.678667 PPPoE PADR [Service-Name] [AC-Cookie 0xF017FC9DEF801E29EFF08EEFC116CACE] [Host-Uniq 0xF0B13000]
15:35:03.838963 PPPoE PADS [ses 0x104a] [Service-Name] [Host-Uniq 0xF0B13000] [AC-Name "BSMSO108"] [AC-Cookie 0xF017FC9DEF801E29EFF08EEFC116CACE]
15:35:03.850141 PPPoE [ses 0x104a] LCP, Conf-Request (0x01), id 1, length 16
15:35:03.958371 PPPoE [ses 0x104a] LCP, Conf-Request (0x01), id 113, length 21
15:35:03.958829 PPPoE [ses 0x104a] LCP, Conf-Ack (0x02), id 113, length 21
15:35:03.961091 PPPoE [ses 0x104a] LCP, Conf-Ack (0x02), id 1, length 16
15:35:04.016830 PPPoE [ses 0x104a] CHAP, Challenge (0x01), id 22, Value ff40f7a275c2a4792ce99b25447d7c2a, Name BSMSO108
15:35:04.017743 PPPoE [ses 0x104a] CHAP, Response (0x02), id 22, Value 748b4c144654d0ff50853491582a6a17, Name bouyer@net1.nerim.nerim
15:35:04.362918 PPPoE [ses 0x104a] LCP, Conf-Request (0x01), id 1, length 17
15:35:04.363401 PPPoE [ses 0x104a] LCP, Conf-Ack (0x02), id 1, length 17
15:35:04.363746 PPPoE [ses 0x104a] LCP, Conf-Request (0x01), id 2, length 16
15:35:04.422696 PPPoE [ses 0x104a] LCP, Conf-Nack (0x03), id 2, length 10
15:35:04.423209 PPPoE [ses 0x104a] LCP, Conf-Request (0x01), id 3, length 16
15:35:04.475556 PPPoE [ses 0x104a] LCP, Conf-Nack (0x03), id 3, length 10
15:35:04.476004 PPPoE [ses 0x104a] LCP, Conf-Request (0x01), id 4, length 16
15:35:04.531192 PPPoE [ses 0x104a] LCP, Conf-Nack (0x03), id 4, length 10
15:35:04.531779 PPPoE [ses 0x104a] LCP, Conf-Request (0x01), id 5, length 16
15:35:04.585438 PPPoE [ses 0x104a] LCP, Conf-Nack (0x03), id 5, length 10
15:35:04.586005 PPPoE [ses 0x104a] LCP, Conf-Request (0x01), id 6, length 16
15:35:04.642439 PPPoE [ses 0x104a] LCP, Conf-Nack (0x03), id 6, length 10
15:35:04.643141 PPPoE [ses 0x104a] LCP, Conf-Request (0x01), id 7, length 16
15:35:04.694636 PPPoE [ses 0x104a] LCP, Conf-Reject (0x04), id 7, length 10
15:35:04.695074 PPPoE [ses 0x104a] LCP, Conf-Request (0x01), id 8, length 12
15:35:04.750559 PPPoE [ses 0x104a] LCP, Conf-Ack (0x02), id 8, length 12
15:35:04.753007 PPPoE [ses 0x104a] CHAP, Challenge (0x01), id 23, Value c7707a87f9e0df56aa137099f6c6961e, Name lns303-tip-courbevoie
15:35:04.753528 PPPoE [ses 0x104a] CHAP, Response (0x02), id 23, Value b02dbd3ffedc96e7d42166e83c8bc657, Name bouyer@net1.nerim.nerim
15:35:04.822418 PPPoE [ses 0x104a] CHAP, Success (0x03), id 23, Msg
15:35:04.822963 PPPoE [ses 0x104a] IPCP, Conf-Request (0x01), id 1, length 6
so it looks like the PADO sent by BSMSO108 is ignored by the kernel in the
non-working case.
It should also be noted that the telco forces a disconnect every 24 hours,
and in this case the kernel has no problems reconnecting. It only
happens when it detects a "LCP keepalive timed out", in some case
(if I power off the modem, the kernel also detects the LCP timeout, but
powering on the modem again gives a working connection again after a few
minutes).
I didn't have this issue with the previous (4.99.20, June 2007) kernel.
>How-To-Repeat:
use pppoe ?
>Fix:
unknown
>Release-Note:
>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/38637: pppoe fails to reconnect sometimes
Date: Mon, 12 May 2008 16:43:05 +0200
--5mCyUwZo2JvN/JJP
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Could you try if the printf in the following patch fires in the non
working case?
Thanks,
Martin
--5mCyUwZo2JvN/JJP
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=patch
Index: if_pppoe.c
===================================================================
RCS file: /cvsroot/src/sys/net/if_pppoe.c,v
retrieving revision 1.86
diff -u -r1.86 if_pppoe.c
--- if_pppoe.c 28 Apr 2008 20:24:09 -0000 1.86
+++ if_pppoe.c 12 May 2008 14:41:56 -0000
@@ -663,8 +663,12 @@
free(sc->sc_ac_cookie, M_DEVBUF);
sc->sc_ac_cookie = malloc(ac_cookie_len, M_DEVBUF,
M_DONTWAIT);
- if (sc->sc_ac_cookie == NULL)
+ if (sc->sc_ac_cookie == NULL) {
+ printf("%s: FATAL: could not allocate memory "
+ "for AC cookie\n",
+ sc->sc_sppp.pp_if.if_xname);
goto done;
+ }
sc->sc_ac_cookie_len = ac_cookie_len;
memcpy(sc->sc_ac_cookie, ac_cookie, ac_cookie_len);
}
--5mCyUwZo2JvN/JJP--
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/38637: pppoe fails to reconnect sometimes
Date: Mon, 12 May 2008 23:09:08 +0200
On Mon, May 12, 2008 at 02:45:02PM +0000, Martin Husemann wrote:
> The following reply was made to PR kern/38637; it has been noted by GNATS.
>
> From: Martin Husemann <martin@duskware.de>
> To: gnats-bugs@NetBSD.org
> Cc:
> Subject: Re: kern/38637: pppoe fails to reconnect sometimes
> Date: Mon, 12 May 2008 16:43:05 +0200
>
> --5mCyUwZo2JvN/JJP
> Content-Type: text/plain; charset=us-ascii
> Content-Disposition: inline
>
> Could you try if the printf in the following patch fires in the non
> working case?
New kernel installed, now waiting for the issue to show up again.
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/38637: pppoe fails to reconnect sometimes
Date: Sun, 25 May 2008 22:22:56 +0200
On Mon, May 12, 2008 at 11:09:08PM +0200, Manuel Bouyer wrote:
> On Mon, May 12, 2008 at 02:45:02PM +0000, Martin Husemann wrote:
> > The following reply was made to PR kern/38637; it has been noted by GNATS.
> >
> > From: Martin Husemann <martin@duskware.de>
> > To: gnats-bugs@NetBSD.org
> > Cc:
> > Subject: Re: kern/38637: pppoe fails to reconnect sometimes
> > Date: Mon, 12 May 2008 16:43:05 +0200
> >
> > --5mCyUwZo2JvN/JJP
> > Content-Type: text/plain; charset=us-ascii
> > Content-Disposition: inline
> >
> > Could you try if the printf in the following patch fires in the non
> > working case?
>
> New kernel installed, now waiting for the issue to show up again.
I did, and the printf didn't fire.
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Martin Husemann <martin@duskware.de>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/38637: pppoe fails to reconnect sometimes
Date: Thu, 2 Oct 2008 12:45:14 +0200
On Wed, Oct 01, 2008 at 10:10:27PM +0200, Manuel Bouyer wrote:
> (all other queue length are 0). So it looks like the pppoe queues stop
> being processed, for unknown reason. This may even be the reason of the
> disconnect in the first place.
Sorry, I don't see how this could happen besides the softint handling
breaking completely (something corrupting the call wheel?):
ether_input does:
if (etype == ETHERTYPE_PPPOEDISC)
inq = &ppoediscinq;
else
inq = &ppoeinq;
if (IF_QFULL(inq)) {
IF_DROP(inq);
m_freem(m);
} else
IF_ENQUEUE(inq, m);
softint_schedule(pppoe_softintr);
return;
This should make pppoe_softintr_handler() run, which just does:
mutex_enter(softnet_lock);
pppoe_input();
mutex_exit(softnet_lock);
and pppoe_input() loops untill the queues are both empty.
Can you find out if pppoe_softintr_handler is getting called at all?
Martin
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/38637: pppoe fails to reconnect sometimes
Date: Sat, 4 Oct 2008 13:00:19 +0200
--5vNYLRcllDrimb99
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
On Thu, Oct 02, 2008 at 10:50:05AM +0000, Martin Husemann wrote:
> On Wed, Oct 01, 2008 at 10:10:27PM +0200, Manuel Bouyer wrote:
> > (all other queue length are 0). So it looks like the pppoe queues stop
> > being processed, for unknown reason. This may even be the reason of the
> > disconnect in the first place.
>
> Sorry, I don't see how this could happen besides the softint handling
> breaking completely (something corrupting the call wheel?):
>
> ether_input does:
>
> if (etype == ETHERTYPE_PPPOEDISC)
> inq = &ppoediscinq;
> else
> inq = &ppoeinq;
> if (IF_QFULL(inq)) {
> IF_DROP(inq);
> m_freem(m);
> } else
> IF_ENQUEUE(inq, m);
> softint_schedule(pppoe_softintr);
> return;
>
> This should make pppoe_softintr_handler() run, which just does:
>
> mutex_enter(softnet_lock);
> pppoe_input();
> mutex_exit(softnet_lock);
>
> and pppoe_input() loops untill the queues are both empty.
>
> Can you find out if pppoe_softintr_handler is getting called at all?
It happened again, I did a break to ddb and set a breakpoint on
pppoe_softintr_handler. It didn't fire.
What's more strange is that tcpdump on the interface shows the host sending
PADI paquests, and the BAS replying with PADO. tcpdump shows the ethernet
addresses in the packet header are right. Despite this, pppoe_softintr_handler
seems not to be called, and more strange, netstat -q doen't show the
queue lengths for the pppoe queues growing, nor queue drops.
So I set a breakpoint on ether_input() and single-stepped here.
softint_schedule(pppoe_softintr) is being called. But on the return
from softint_schedule() ddb entered clockintr_4(), and then
sparc_interrupt44c(), and the system did reset with a watchdog here.
Attached is a trace of my ddb session. Hope it helps ...
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
--5vNYLRcllDrimb99
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=ether_input
db> c
Breakpoint in pid 0.2 (system) at netbsd:ether_input: save % sp, -0x70, %sp
db> s
Stopped in pid 0.2 (system) at netbsd:ether_input+0x4: lduh [%i0 + 0
x30], %o3
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x8: andcc %o3, 0x1
, %g0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0xc: be netbsd:e
ther_input+0x20c
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x10: st %
i1, [%fp + 0x48]
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x14: ld [
%i1 + 0x8], %l1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x18: sethi %
hi(0x1000), %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x1c: ldub [
%l1 + 0xc], %g2
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x20: sll %
g2, 0x8, %g2
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x24: ld [
%i1 + 0x14], %l0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x28: ld [
%i1 + 0x28], %g4
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x2c: andcc %
l0, %g1, %g0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x30: or %
g0, 0x0, %o4
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x34: ldub [
%l1 + 0xd], %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x38: or %
g0, 0x4, %o5
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x3c: or %
g1, %g2, %l2
Stopped in pid 0.2 (system) at netbsd:ether_input+0x40: be n
etbsd:ether_input+0x21c
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x44: ldd [
%i0 + 0x40], %g2
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x21c: or %
g0, 0x0, %o5
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x220: addcc %
o5, %g3, %o5
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x224: or %
g0, 0x0, %o4
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x228: sethi %
hi(0x8000), %g1
Stopped in pid 0.2 (system) at netbsd:ether_input+0x22c: addc %
o4, %g2, %o4
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x230: or %
g1, 0x100, %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x234: sll %
l2, 0x10, %g2
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x238: srl %
g2, 0x10, %g2
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x23c: subcc %
g2, %g1, %g0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x240: bne n
etbsd:ether_input+0x70
Stopped in pid 0.2 (system) at netbsd:ether_input+0x244: sethi %
hi(0x8800), %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x70: or %
g0, 0x16, %g3
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x74: or %
g1, 0x64, %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x78: subcc %
g2, %g1, %g0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x7c: be n
etbsd:ether_input+0x8c
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x80: or %
g0, 0x0, %g2
Stopped in pid 0.2 (system) at netbsd:ether_input+0x84: or %
g0, 0x0, %g2
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x88: or %
g0, 0xe, %g3
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x8c: addcc %
g3, %o5, %o5
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x90: addc %
g2, %o4, %o4
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x94: sra %
g4, 0x1f, %g2
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x98: subcc %
g2, %o4, %g0
Stopped in pid 0.2 (system) at netbsd:ether_input+0x9c: bgu n
etbsd:ether_input+0x260
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0xa0: sethi %
hi(0xf02f7000), %o0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0xa4: subcc %
g2, %o4, %g0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0xa8: be n
etbsd:ether_input+0x254
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0xac: subcc %
g4, %o5, %g0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x254: bleu,a n
etbsd:ether_input+0xb4
Stopped in pid 0.2 (system) at netbsd:ether_input+0x258: ldub [
%l1 + %g0], %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0xb4: andcc %
g1, 0x1, %g0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0xb8: be,a n
etbsd:ether_input+0x12c
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0xbc: ld [
%fp + 0x48], %i1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x12c: sethi %
hi(0x1000), %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x130: ld [
%i1 + 0x14], %g2
Stopped in pid 0.2 (system) at netbsd:ether_input+0x134: andcc %
g2, %g1, %g0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x138: bne n
etbsd:ether_input+0x314
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x13c: or %
g0, %i1, %o0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x140: ld [
%i1 + 0x28], %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x144: ldd [
%i0 + 0x80], %o4
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x148: addcc %
o5, %g1, %o5
Stopped in pid 0.2 (system) at netbsd:ether_input+0x14c: sra %
g1, 0x1f, %g2
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x150: sethi %
hi(0x2000), %l3
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x154: addc %
o4, %g2, %o4
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x158: or %
l3, 0x300, %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x15c: std %
o4, [%i0 + 0x80]
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x160: ld [
%i1 + 0x14], %l0
Stopped in pid 0.2 (system) at netbsd:ether_input+0x164: andcc %
l0, %g1, %g0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x168: be,a n
etbsd:ether_input+0x290
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x16c: lduh [
%i0 + 0x30], %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x290: andcc %
g1, 0x100, %g0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x294: be,a n
etbsd:ether_input+0x174
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x298: ld [
%i1 + 0x14], %g2
Stopped in pid 0.2 (system) at netbsd:ether_input+0x174: sethi %
hi(0x2000), %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x178: andcc %
g2, %g1, %g0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x17c: be n
etbsd:ether_input+0x2d4
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x180: add %
i0, 0x12c, %o0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x2d4: add %
fp, 0x48, %o1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x2d8: or %
g0, %i0, %o2
Stopped in pid 0.2 (system) at netbsd:ether_input+0x2dc: call n
etbsd:pfil_run_hooks
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x2e0: or %
g0, 0x1, %o3
db>
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks: save %sp, -0x
70, %sp
db>
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x4: and %
i3, 0x3, %g1
db>
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x8: subcc %
g0, %g1, %g0
db>
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0xc: addc %
g0, 0x0, %g2
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x10: subcc %
g0, %i1, %g0
db>
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x14: addc %
g0, 0x0, %g1
db>
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x18: andcc %
g2, %g1, %l0
db>
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x1c: be n
etbsd:pfil_run_hooks+0x2c
db>
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x20: st %
g0, [%fp - 0xc]
db>
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x24: ld [
%i1 + %g0], %g1
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x28: st %
g1, [%fp - 0xc]
db>
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x2c: subcc %
i3, 0x1, %g0
db>
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x30: be n
etbsd:pfil_run_hooks+0x6c
db>
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x34: subcc %
i3, 0x2, %g0
db>
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x6c: ld [
%i0 + %g0], %i0
db>
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x70: subcc %
i0, 0x0, %g0
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x74: be n
etbsd:pfil_run_hooks+0x50
db>
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x78: or %
g0, 0x0, %o0
db>
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x50: or %
g0, 0x0, %o0
db>
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x54: subcc %
l0, 0x0, %g0
db>
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x58: be n
etbsd:pfil_run_hooks+0x64
db>
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x5c: ld [
%fp - 0xc], %g1
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x60: st %
g1, [%i1 + %g0]
db>
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x64: jmpl [
%i7 + 0x8], %g0
db>
Stopped in pid 0.2 (system) at netbsd:pfil_run_hooks+0x68: restore %
g0, %o0, %o0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x2e4: subcc %
o0, 0x0, %g0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x2e8: bne n
etbsd:ether_input+0x214
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x2ec: ld [
%fp + 0x48], %i1
Stopped in pid 0.2 (system) at netbsd:ether_input+0x2f0: subcc %
i1, 0x0, %g0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x2f4: be n
etbsd:ether_input+0x634
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x2f8: nop
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x2fc: ld [
%i1 + 0x8], %l1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x300: ldub [
%l1 + 0xc], %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x304: ldub [
%l1 + 0xd], %g2
Stopped in pid 0.2 (system) at netbsd:ether_input+0x308: sll %
g1, 0x8, %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x30c: ba n
etbsd:ether_input+0x184
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x310: or %
g2, %g1, %l2
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x184: ld [
%i0 + 0x220], %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x188: subcc %
g1, 0x0, %g0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x18c: bne n
etbsd:ether_input+0x338
Stopped in pid 0.2 (system) at netbsd:ether_input+0x190: or %
g0, %i1, %o0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x194: sll %
l2, 0x10, %l0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x198: sethi %
hi(0x8800), %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x19c: srl %
l0, 0x10, %g2
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x1a0: or %
g1, 0x9, %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x1a4: subcc %
g2, %g1, %g0
Stopped in pid 0.2 (system) at netbsd:ether_input+0x1a8: be n
etbsd:ether_input+0x42c
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x1ac: or %
g0, %i1, %o0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x1b0: bgu n
etbsd:ether_input+0x384
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x1b4: sethi %
hi(0xffff7400), %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x384: or %
g1, 0x39d, %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x388: add %
g2, %g1, %g1
Stopped in pid 0.2 (system) at netbsd:ether_input+0x38c: subcc %
g1, 0x1, %g0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x390: bgu n
etbsd:ether_input+0x1d0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x394: ld [
%i1 + 0x14], %g2
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x398: sethi %
hi(0x2000), %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x39c: andcc %
g2, %g1, %g0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x3a0: bne n
etbsd:ether_input+0x20c
Stopped in pid 0.2 (system) at netbsd:ether_input+0x3a4: andcc %
g2, 0x300, %g0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x3a8: bne n
etbsd:ether_input+0x20c
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x3ac: srl %
l0, 0x10, %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x3b0: sethi %
hi(0x8800), %g2
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x3b4: sethi %
hi(0xf02b6c00), %g3
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x3b8: or %
g2, 0x63, %g2
Stopped in pid 0.2 (system) at netbsd:ether_input+0x3bc: subcc %
g1, %g2, %g0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x3c0: be n
etbsd:ether_input+0x3d0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x3c4: or %
g3, 0x58, %g3
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x3d0: ld [
%g3 + 0x8], %g2
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x3d4: ld [
%g3 + 0xc], %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x3d8: subcc %
g2, %g1, %g0
Stopped in pid 0.2 (system) at netbsd:ether_input+0x3dc: bge,a n
etbsd:ether_input+0x63c
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x3e4: st %
g0, [%i1 + 0x4]
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x3e8: ld [
%g3 + 0x4], %g2
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x3ec: subcc %
g2, 0x0, %g0
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x3f0: be n
etbsd:ether_input+0x67c
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x3f4: ld [
%fp + 0x48], %g1
Stopped in pid 0.2 (system) at netbsd:ether_input+0x3f8: st %
g1, [%g2 + 0x4]
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x3fc: ld [
%fp + 0x48], %g2
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x400: ld [
%g3 + 0x8], %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x404: st %
g2, [%g3 + 0x4]
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x408: add %
g1, 0x1, %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x40c: st %
g1, [%g3 + 0x8]
Stopped in pid 0.2 (system) at netbsd:ether_input+0x410: sethi %
hi(0xf02b6c00), %g1
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x414: call n
etbsd:softint_schedule
db>
Stopped in pid 0.2 (system) at netbsd:ether_input+0x418: ld [
%g1 + 0x80], %o0
db>
Stopped in pid 0.2 (system) at netbsd:softint_schedule: save %
sp, -0x68, %sp
db>
Stopped in pid 0.2 (system) at netbsd:softint_schedule+0x4: sethi %
hi(0xf0002000), %g1
db>
Stopped in pid 0.2 (system) at netbsd:softint_schedule+0x8: ld [
%g1 + 0x100], %g2
Stopped in pid 0.2 (system) at netbsd:softint_schedule+0xc: ld [
%g2 + 0xc0], %g2
db>
Stopped in pid 0.2 (system) at netbsd:softint_schedule+0x10: add %
i0, %g2, %g3
db>
Stopped in pid 0.2 (system) at netbsd:softint_schedule+0x14: ld [
%g3 + 0x10], %g1
db>
Stopped in pid 0.2 (system) at netbsd:softint_schedule+0x18: subcc %
g1, 0x0, %g0
db>
Stopped in pid 0.2 (system) at netbsd:softint_schedule+0x1c: bne n
etbsd:softint_schedule+0x7c
db>
Stopped in pid 0.2 (system) at netbsd:softint_schedule+0x20: nop
Stopped in pid 0.2 (system) at netbsd:softint_schedule+0x7c: jmpl [
%i7 + 0x8], %g0
db>
Stopped in pid 0.2 (system) at netbsd:softint_schedule+0x80: restore %
g0, %g0, %g0
db>
Stopped in pid 0.2 (system) at netbsd:clockintr_4+0x24: jmpl [
%i7 + 0x8], %g0
db>
Stopped in pid 0.2 (system) at netbsd:clockintr_4+0x28: restore %
g0, %g0, %g0
db>
Stopped in pid 0.2 (system) at netbsd:sparc_interrupt44c+0x130: orcc %
o0, %g0, %g0
db>
Stopped in pid 0.2 (system) at netbsd:sparc_interrupt44c+0x134: bne n
etbsd:sparc_interrupt44c+0x164
Stopped in pid 0.2 (system) at netbsd:sparc_interrupt44c+0x138: nop
db>
Stopped in pid 0.2 (system) at netbsd:sparc_interrupt44c+0x164: sethi %
hi(0xf0002000), %o2
db>
Stopped in pid 0.2 (system) at netbsd:sparc_interrupt44c+0x168: ld [
%o2 + 0x1b4], %o3
db>
Stopped in pid 0.2 (system) at netbsd:sparc_interrupt44c+0x16c: sub %
o3, 0x1, %o3
db>
Stopped in pid 0.2 (system) at netbsd:sparc_interrupt44c+0x170: st %
o3, [%o2 + 0x1b4]
db>
Stopped in pid 0.2 (system) at netbsd:sparc_interrupt44c+0x174: or %
Stopped in pid 0.2 (system) at netbsd:sparc_interrupt44c+0x178: wr %
l6, 0x0, %y
db>
Stopped in pid 0.2 (system) at netbsd:sparc_interrupt44c+0x17c: ldd [
%sp + 0x78], %g2
db>
Stopped in pid 0.2 (system) at netbsd:sparc_interrupt44c+0x180: ldd [
%sp + 0x80], %g4
db>
Stopped in pid 0.2 (system) at netbsd:sparc_interrupt44c+0x184: ldd [
%sp + 0x88], %g6
db>
Stopped in pid 0.2 (system) at netbsd:sparc_interrupt44c+0x188: ba n
etbsd:sparc_interrupt44c+0x760
db>
Stopped in pid 0.2 (system) at netbsd:sparc_interrupt44c+0x18c: wr %
l0, 0x0, %psr
--5vNYLRcllDrimb99--
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/38637: pppoe fails to reconnect sometimes
Date: Fri, 3 Apr 2009 20:06:00 +0200
On Thu, Oct 02, 2008 at 10:50:05AM +0000, Martin Husemann wrote:
> Sorry, I don't see how this could happen besides the softint handling
> breaking completely (something corrupting the call wheel?):
> [...]
Update on this: this time (with a 5.0_RC3 kernel), it's not the ppoeinq
which stopped being processed and did overflow, but ipintrq !
So it's really related to soft interrupt not being called anymore ...
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/38637: pppoe fails to reconnect sometimes
Date: Tue, 5 May 2009 22:20:21 +0200
On Fri, Apr 03, 2009 at 08:06:00PM +0200, Manuel Bouyer wrote:
> On Thu, Oct 02, 2008 at 10:50:05AM +0000, Martin Husemann wrote:
> > Sorry, I don't see how this could happen besides the softint handling
> > breaking completely (something corrupting the call wheel?):
> > [...]
>
> Update on this: this time (with a 5.0_RC3 kernel), it's not the ppoeinq
> which stopped being processed and did overflow, but ipintrq !
> So it's really related to soft interrupt not being called anymore ...
I think I found the reason for this (or at last a possible reason).
In sys/kern/kern_softint.c, sh_flags should be declared volatile.
Without it, on ports where splhigh() is inline, the compiler will optimise
the second SOFTINT_PENDING test in softint_schedule(). A dissasembly
of softint_schedule() with and without the volatile sh_flags confirm this
on sparc.
Because of this there is a race that could lead to the softhand_t
being enqueued twice on si_q, leading to a corrupted queue and
some handler being SOFTINT_PENDING but never called.
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/38637: pppoe fails to reconnect sometimes
Date: Tue, 5 May 2009 22:20:21 +0200
On Fri, Apr 03, 2009 at 08:06:00PM +0200, Manuel Bouyer wrote:
> On Thu, Oct 02, 2008 at 10:50:05AM +0000, Martin Husemann wrote:
> > Sorry, I don't see how this could happen besides the softint handling
> > breaking completely (something corrupting the call wheel?):
> > [...]
>
> Update on this: this time (with a 5.0_RC3 kernel), it's not the ppoeinq
> which stopped being processed and did overflow, but ipintrq !
> So it's really related to soft interrupt not being called anymore ...
I think I found the reason for this (or at last a possible reason).
In sys/kern/kern_softint.c, sh_flags should be declared volatile.
Without it, on ports where splhigh() is inline, the compiler will optimise
the second SOFTINT_PENDING test in softint_schedule(). A dissasembly
of softint_schedule() with and without the volatile sh_flags confirm this
on sparc.
Because of this there is a race that could lead to the softhand_t
being enqueued twice on si_q, leading to a corrupted queue and
some handler being SOFTINT_PENDING but never called.
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Manuel Bouyer <bouyer@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/38637 CVS commit: src/sys/kern
Date: Tue, 5 May 2009 20:26:37 +0000
Module Name: src
Committed By: bouyer
Date: Tue May 5 20:26:36 UTC 2009
Modified Files:
src/sys/kern: kern_softint.c
Log Message:
Declare sh_flags volatile.
Without it, on ports where splhigh() is inline, the compiler will optimise
the second SOFTINT_PENDING test in softint_schedule(). A dissasembly
of softint_schedule() with and without the volatile sh_flags confirm this
on sparc.
Because of this there is a race that could lead to the softhand_t
being enqueued twice on si_q, leading to a corrupted queue and
some handler being SOFTINT_PENDING but never called.
Should fix PR kern/38637
To generate a diff of this commit:
cvs rdiff -u -r1.26 -r1.27 src/sys/kern/kern_softint.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Andrew Doran <ad@netbsd.org>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org,
gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/38637: pppoe fails to reconnect sometimes
Date: Tue, 5 May 2009 20:52:45 +0000
On Tue, May 05, 2009 at 10:20:21PM +0200, Manuel Bouyer wrote:
> On Fri, Apr 03, 2009 at 08:06:00PM +0200, Manuel Bouyer wrote:
> > On Thu, Oct 02, 2008 at 10:50:05AM +0000, Martin Husemann wrote:
> > > Sorry, I don't see how this could happen besides the softint handling
> > > breaking completely (something corrupting the call wheel?):
> > > [...]
> >
> > Update on this: this time (with a 5.0_RC3 kernel), it's not the ppoeinq
> > which stopped being processed and did overflow, but ipintrq !
> > So it's really related to soft interrupt not being called anymore ...
>
> I think I found the reason for this (or at last a possible reason).
> In sys/kern/kern_softint.c, sh_flags should be declared volatile.
> Without it, on ports where splhigh() is inline, the compiler will optimise
> the second SOFTINT_PENDING test in softint_schedule(). A dissasembly
> of softint_schedule() with and without the volatile sh_flags confirm this
> on sparc.
> Because of this there is a race that could lead to the softhand_t
> being enqueued twice on si_q, leading to a corrupted queue and
> some handler being SOFTINT_PENDING but never called.
Nice diagnosis! However the softint code is correct.
splhigh/splx on sparc64 should be real functions, or if one believes there
is a performance advantage, should be inlines with __insn_barrier in the
correct spots. Consider the potential effects elsewhere. I pointed this
issue out to Martin or Matthew a couple of years ago and then promptly
forgot about it.
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Andrew Doran <ad@netbsd.org>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: kern/38637: pppoe fails to reconnect sometimes
Date: Tue, 5 May 2009 23:26:46 +0200
On Tue, May 05, 2009 at 08:52:45PM +0000, Andrew Doran wrote:
> > I think I found the reason for this (or at last a possible reason).
> > In sys/kern/kern_softint.c, sh_flags should be declared volatile.
> > Without it, on ports where splhigh() is inline, the compiler will optimise
> > the second SOFTINT_PENDING test in softint_schedule(). A dissasembly
> > of softint_schedule() with and without the volatile sh_flags confirm this
> > on sparc.
> > Because of this there is a race that could lead to the softhand_t
> > being enqueued twice on si_q, leading to a corrupted queue and
> > some handler being SOFTINT_PENDING but never called.
>
> Nice diagnosis! However the softint code is correct.
>
> splhigh/splx on sparc64 should be real functions, or if one believes there
> is a performance advantage, should be inlines with __insn_barrier in the
> correct spots. Consider the potential effects elsewhere. I pointed this
> issue out to Martin or Matthew a couple of years ago and then promptly
> forgot about it.
Yes, it seems __insn_barrier() would do it too. All ports needs to be audited
then; I've seen this on sparc not sparc64 (I don't know if they share
their spl functions) and macallan@ reported this could be an issue on
macppc too.
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Martin Husemann <martin@duskware.de>
To: Andrew Doran <ad@netbsd.org>
Cc: Manuel Bouyer <bouyer@antioche.eu.org>, gnats-bugs@NetBSD.org
Subject: Re: kern/38637: pppoe fails to reconnect sometimes
Date: Wed, 6 May 2009 00:14:40 +0200
On Tue, May 05, 2009 at 08:52:45PM +0000, Andrew Doran wrote:
> splhigh/splx on sparc64 should be real functions, or if one believes there
> is a performance advantage, should be inlines with __insn_barrier in the
> correct spots. Consider the potential effects elsewhere. I pointed this
> issue out to Martin or Matthew a couple of years ago and then promptly
> forgot about it.
You did, and I added the barriers to sparc64/include/psl.h:
----------------------------
revision 1.38
date: 2007/12/11 08:59:57; author: martin; state: Exp; lines: +9 -9
branches: 1.38.4; 1.38.8;
Make sure the compiler does not reorder stores accross spl...() calls
[or, for completeness, calls changing the processor state]. The mutex
code has small race windows otherwise.
----------------------------
Martin
State-Changed-From-To: open->closed
State-Changed-By: bouyer@NetBSD.org
State-Changed-When: Sun, 20 Sep 2009 12:47:12 +0000
State-Changed-Why:
Fixed by making sure that spl functions act as a memory barrier.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.