NetBSD Problem Report #50186
From www@NetBSD.org Tue Sep 1 03:36:13 2015
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
by mollari.NetBSD.org (Postfix) with ESMTPS id F2741A65BA
for <gnats-bugs@gnats.NetBSD.org>; Tue, 1 Sep 2015 03:36:12 +0000 (UTC)
Message-Id: <20150901033611.B83C5A65BB@mollari.NetBSD.org>
Date: Tue, 1 Sep 2015 03:36:11 +0000 (UTC)
From: jdbaker@mylinuxisp.com
Reply-To: jdbaker@mylinuxisp.com
To: gnats-bugs@NetBSD.org
Subject: sparc memfault panic after 7.99.21 ARP changes
X-Send-Pr-Version: www-1.0
>Number: 50186
>Category: kern
>Synopsis: sparc memfault panic after 7.99.21 ARP changes
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Sep 01 03:40:00 +0000 2015
>Closed-Date: Tue Sep 15 08:49:04 +0000 2015
>Last-Modified: Tue Sep 15 08:49:04 +0000 2015
>Originator: John D. Baker
>Release: NetBSD/sparc-7.99.21
>Organization:
>Environment:
NetBSD jean.technoskunk.fur 7.99.21 NetBSD 7.99.21 (JEAN) #0: Mon Aug 31 20:21:50 CDT 2015 sysop@skuld.technoskunk.fur:/d0/build/current/obj/sparc/sys/arch/sparc/compile/JEAN sparc
NetBSD jean.technoskunk.fur 7.99.21 NetBSD 7.99.21 (GENERIC) #19: Mon Aug 31 20:03:50 CDT 2015 sysop@skuld.technoskunk.fur:/d0/build/current/obj/sparc/sys/arch/sparc/compile/GENERIC sparc
>Description:
Following the changes to ARP cache handling beginning with the
following commit:
http://mail-index.netbsd.org/source-changes/2015/08/31/msg068612.html
sparc platform will panic after an indeterminate time (probably when
about to expire an ARP entry) as follows:
From custom kernel JEAN:
cpu0: data fault: pc=0xf008350c addr=0x10 sfsr=0x326<PERR=0x0,LVL=0x3,AT=0x1,FT=0x1,FAV,OW>
panic: kernel fault
Stopped in pid 0.5 (system) at netbsd:cpu_Debugger+0x4: or %o7, %g0, %g1
db> bt
cpu_Debugger(0xf03a4758, 0xf99efd20, 0xf0432400, 0xf04331a8, 0xf0433000, 0x104) at netbsd:panic+0x20
panic(0xf03a4758, 0x0, 0xf008350c, 0x10, 0xf99efd40, 0xf040cc00) at netbsd:mem_access_fault4m+0x5a4
mem_access_fault4m(0x9, 0x326, 0x10, 0xf99efde0, 0xf0409ff0, 0xf0a0d540) at netbsd:memfault_sun4m+0xe8
memfault_sun4m(0xf0b366ac, 0x1, 0x0, 0xf041e318, 0xf0a0d544, 0xf0a0d544) at netbsd:arptimer+0x6c
arptimer(0xf0b36600, 0xf0a0d540, 0xf0b39008, 0x0, 0xf0b366ac, 0xf0437800) at netbsd:callout_softclock+0x154
callout_softclock(0xf041e31c, 0x1000000, 0x10000, 0xf041e318, 0xf0b36600, 0xf0083478) at netbsd:softint_thread+0x94
softint_thread(0xf0a0d540, 0x3000, 0x2000, 0x0, 0x0, 0xf99e8218) at netbsd:lwp_trampoline+0x8
db>
From GENERIC:
cpu0: data fault: pc=0xf00a626c addr=0x10 sfsr=0x326<PERR=0x0,LVL=0x3,AT=0x1,FT=0x1,FAV,OW>
panic: kernel fault
Stopped in pid 0.5 (system) at netbsd:cpu_Debugger+0x4: or %o7, %g0, %g1
db> bt
cpu_Debugger(0xf03efb58, 0xf9ac7d20, 0xf0482c00, 0xf0483a58, 0xf0483800, 0x104) at netbsd:panic+0x20
panic(0xf03efb58, 0x0, 0xf00a626c, 0x10, 0xf9ac7d40, 0xf045c800) at netbsd:mem_access_fault4m+0x5b0
mem_access_fault4m(0x9, 0x326, 0x10, 0xf9ac7de0, 0xf0459b20, 0xf0a60540) at netbsd:memfault_sun4m+0xe8
memfault_sun4m(0xf0b8852c, 0x1, 0x0, 0xf04712a0, 0xf0a60544, 0xf0a60544) at netbsd:arptimer+0x6c
arptimer(0xf0b88480, 0xf0a60540, 0xf0b8c808, 0x0, 0xf0b8852c, 0xf0488800) at netbsd:callout_softclock+0x154
callout_softclock(0xf04712a4, 0x1000000, 0x10000, 0xf04712a0, 0xf0b88480, 0xf00a61d8) at netbsd:softint_thread+0x94
softint_thread(0xf0a60540, 0x3000, 0x2000, 0x0, 0x0, 0xf9ac0218) at netbsd:lwp_trampoline+0x8
db>
Machine is SPARCstation 5, 110MHz, 256MB RAM. Operating diskless.
(NetBSD-7.0_RC3 on local disk)
I hope to confirm this observation on another system, but it is
engaged in another task at this time.
>How-To-Repeat:
Build sparc release from 201509010100 or later and boot GENERIC.
>Fix:
>Release-Note:
>Audit-Trail:
From: Ryota Ozaki <ozaki-r@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
Date: Tue, 1 Sep 2015 15:52:45 +0900
Hi,
I investigated where it happens:
----
$ ~/git/netbsd-src/work.tools/sparc--netbsdelf/bin/nm -n work.sparc/sys/arch/sparc/compile/GENERIC/netbsd |grep arptimer
f00a61d8 t arptimer
$ ruby -e 'puts (0xf00a61d8 + 0x6c).to_s(16)'
f00a6244
$ ~/git/netbsd-src/work.tools/sparc--netbsdelf/bin/objdump -d -S work.sparc/sys/arch/sparc/compile/GENERIC/netbsd.gdb |grep -10 f00a6244
ifp = lle->lle_tbl->llt_ifp;
f00a6234: c2 06 20 40 ld [ %i0 + 0x40 ], %g1
callout_stop(&lle->la_timer);
f00a6238: 90 10 00 1b mov %i3, %o0
f00a623c: 40 03 34 68 call f01733dc <callout_stop>
f00a6240: f4 00 60 10 ld [ %g1 + 0x10 ], %i2
/* XXX: LOR avoidance. We still have ref on lle. */
LLE_WUNLOCK(lle);
f00a6244: 40 02 f7 63 call f0163fd0 <rw_exit>
f00a6248: 90 10 00 1c mov %i4, %o0
/*
* Free an arp entry.
*/
static void arptfree(struct llentry *la)
{
struct rtentry *rt = la->la_rt;
f00a624c: f6 06 20 b0 ld [ %i0 + 0xb0 ], %i3
KASSERT(rt != NULL);
----
Hmm, the place calling rw_exit? Or just before/after it?
I'm not familiar with sparc so I may be wrong on the
investigation.
Thanks,
ozaki-r
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
Date: Tue, 01 Sep 2015 17:32:57 +1000
> From GENERIC:
>
> cpu0: data fault: pc=0xf00a626c addr=0x10 sfsr=0x326<PERR=0x0,LVL=0x3,AT=0x1,FT=0x1,FAV,OW>
> panic: kernel fault
> Stopped in pid 0.5 (system) at netbsd:cpu_Debugger+0x4: or %o7, %g0, %g1
> db> bt
> cpu_Debugger(0xf03efb58, 0xf9ac7d20, 0xf0482c00, 0xf0483a58, 0xf0483800, 0x104) at netbsd:panic+0x20
> panic(0xf03efb58, 0x0, 0xf00a626c, 0x10, 0xf9ac7d40, 0xf045c800) at netbsd:mem_access_fault4m+0x5b0
> mem_access_fault4m(0x9, 0x326, 0x10, 0xf9ac7de0, 0xf0459b20, 0xf0a60540) at netbsd:memfault_sun4m+0xe8
> memfault_sun4m(0xf0b8852c, 0x1, 0x0, 0xf04712a0, 0xf0a60544, 0xf0a60544) at netbsd:arptimer+0x6c
> arptimer(0xf0b88480, 0xf0a60540, 0xf0b8c808, 0x0, 0xf0b8852c, 0xf0488800) at netbsd:callout_softclock+0x154
> callout_softclock(0xf04712a4, 0x1000000, 0x10000, 0xf04712a0, 0xf0b88480, 0xf00a61d8) at netbsd:softint_thread+0x94
> softint_thread(0xf0a60540, 0x3000, 0x2000, 0x0, 0x0, 0xf9ac0218) at netbsd:lwp_trampoline+0x8
> db>
OK, so i built my own GENERIC. i get this:
(gdb) l *(arptimer+0x6c)
0xf00a6244 is in arptimer (/usr/src4/sys/netinet/if_arp.c:352).
347 ifp = lle->lle_tbl->llt_ifp;
348
349 callout_stop(&lle->la_timer);
350
351 /* XXX: LOR avoidance. We still have ref on lle. */
352 LLE_WUNLOCK(lle);
353
354 /* We have to call this w/o lock */
355 arptfree(lle);
disass/m arptimer gives:
351 /* XXX: LOR avoidance. We still have ref on lle. */
352 LLE_WUNLOCK(lle);
0xf00a6244 <+108>: call 0xf0163fd0 <rw_vector_exit> <--- [a]
0xf00a6248 <+112>: mov %i4, %o0
but my addresses don't match yours entirely. [a] is the instruction that
appears to me to be faulting... which makes little sense.
John, can you try the above gdb commands for yourself? thanks.
.mrg.
From: "John D. Baker" <jdbaker@mylinuxisp.com>
To: gnats-bugs@NetBSD.org
Cc:
Subject: re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
Date: Tue, 1 Sep 2015 04:19:07 -0500 (CDT)
On Tue, 1 Sep 2015, matthew green wrote:
> John, can you try the above gdb commands for yourself? thanks.
I need to build a DEBUG-enabled GENERIC first. That is next on the
list once my build host finishes its current task.
--
|/"\ John D. Baker, KN5UKS NetBSD Darwin/MacOS X
|\ / jdbaker[snail]mylinuxisp[flyspeck]com OpenBSD FreeBSD
| X No HTML/proprietary data in email. BSD just sits there and works!
|/ \ GPGkeyID: D703 4A7E 479F 63F8 D3F4 BD99 9572 8F23 E4AD 1645
From: "John D. Baker" <jdbaker@mylinuxisp.com>
To: gnats-bugs@NetBSD.org
Cc:
Subject: re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
Date: Tue, 1 Sep 2015 10:35:04 -0500 (CDT)
On Tue, 1 Sep 2015, matthew green wrote:
> John, can you try the above gdb commands for yourself? thanks.
My freshly-built DEBUG-enabled GENERIC behaves the same. The panic:
cpu0: data fault: pc=0xf00a626c addr=0x10 sfsr=0x326<PERR=0x0,LVL=0x3,AT=0x1,FT=0x1,FAV,OW>
panic: kernel fault
Stopped in pid 0.5 (system) at netbsd:cpu_Debugger+0x4: or %o7, %g0, %g1
db> bt
cpu_Debugger(0xf03efba0, 0xf9ac3d20, 0xf0482c00, 0xf0483a98, 0xf0483800, 0x104) at netbsd:panic+0x20
panic(0xf03efba0, 0x0, 0xf00a626c, 0x10, 0xf9ac3d40, 0xf045c800) at netbsd:mem_access_fault4m+0x5b0
mem_access_fault4m(0x9, 0x326, 0x10, 0xf9ac3de0, 0xf0459b60, 0xf0a5c540) at netbsd:memfault_sun4m+0xe8
memfault_sun4m(0xf0b8452c, 0x1, 0x0, 0xf04712e0, 0xf0a5c544, 0xf0a5c544) at netbsd:arptimer+0x6c
arptimer(0xf0b84480, 0xf0a5c540, 0xf0b88808, 0x0, 0xf0b8452c, 0xf0488800) at netbsd:callout_softclock+0x154
callout_softclock(0xf04712e4, 0x1000000, 0x10000, 0xf04712e0, 0xf0b84480, 0xf00a61d8) at netbsd:softint_thread+0x94
softint_thread(0xf0a5c540, 0x3000, 0x2000, 0x0, 0x0, 0xf9a3b218) at netbsd:lwp_trampoline+0x8
db>
Loading into 'gdb' gives the same as you observed:
Reading symbols from netbsd.gdb...done.
(gdb) l *(arptimer+0x6c)
0xf00a6244 is in arptimer (/x/current/src/sys/netinet/if_arp.c:352).
347 ifp = lle->lle_tbl->llt_ifp;
348
349 callout_stop(&lle->la_timer);
350
351 /* XXX: LOR avoidance. We still have ref on lle. */
352 LLE_WUNLOCK(lle);
353
354 /* We have to call this w/o lock */
355 arptfree(lle);
356
(gdb) disass/m arptimer
Dump of assembler code for function arptimer:
[...]
350
351 /* XXX: LOR avoidance. We still have ref on lle. */
352 LLE_WUNLOCK(lle);
0xf00a6244 <+108>: call 0xf0163fd0 <rw_vector_exit>
0xf00a6248 <+112>: mov %i4, %o0
The program counter reported in the initial fault message:
0xf00a626c
gives:
(gdb) l *0xf00a626c
0xf00a626c is in arptimer (/x/current/src/sys/netinet/if_arp.c:1438).
1433 if (la->la_rt != NULL) {
1434 rtfree(la->la_rt);
1435 la->la_rt = NULL;
1436 }
1437
1438 rtrequest(RTM_DELETE, rt_getkey(rt), NULL, rt_mask(rt), 0, NULL);
1439 }
1440
1441 /*
1442 * Lookup or enter a new address in arptab.
and disassembling there gives:
1437
1438 rtrequest(RTM_DELETE, rt_getkey(rt), NULL, rt_mask(rt), 0, NULL);
0xf00a6268 <+144>: clr %o2
0xf00a626c <+148>: ld [ %i3 + 0x10 ], %o3
0xf00a6270 <+152>: clr %o4
0xf00a6274 <+156>: clr %o5
0xf00a6278 <+160>: ld [ %i3 + 0xb4 ], %o1
0xf00a627c <+164>: call 0xf025a39c <rtrequest>
0xf00a6280 <+168>: mov 2, %o0
I don't know SPARC assembly or the register usage conventions, but it
looks to me like there is an expected load at offset 0x10 from an address
in "i3", but since the address reported in the fault message is "0x10",
it would seem that "i3" contains 0 (zero).
--
|/"\ John D. Baker, KN5UKS NetBSD Darwin/MacOS X
|\ / jdbaker[snail]mylinuxisp[flyspeck]com OpenBSD FreeBSD
| X No HTML/proprietary data in email. BSD just sits there and works!
|/ \ GPGkeyID: D703 4A7E 479F 63F8 D3F4 BD99 9572 8F23 E4AD 1645
From: "John D. Baker" <jdbaker@mylinuxisp.com>
To: gnats-bugs@NetBSD.org
Cc:
Subject: re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
Date: Tue, 1 Sep 2015 10:52:15 -0500 (CDT)
Actually, looking back earlier in the disassembly:
1424 /*
1425 * Free an arp entry.
1426 */
1427 static void arptfree(struct llentry *la)
1428 {
1429 struct rtentry *rt = la->la_rt;
0xf00a624c <+116>: ld [ %i0 + 0xb0 ], %i3
1430
1431 KASSERT(rt != NULL);
1432
1433 if (la->la_rt != NULL) {
0xf00a6250 <+120>: cmp %i3, 0
0xf00a6254 <+124>: be 0xf00a626c <arptimer+148>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
0xf00a6258 <+128>: clr %o2
1434 rtfree(la->la_rt);
0xf00a625c <+132>: call 0xf025a784 <rtfree>
0xf00a6260 <+136>: mov %i3, %o0
1435 la->la_rt = NULL;
0xf00a6264 <+140>: clr [ %i0 + 0xb0 ]
1436 }
The fault appears to be a KASSERT in disguise? Register "i3" is compared
with zero and if equal (i.e., zero) branch to the address reported in
the fault message. This would indicate that the arp entry requested to
be freed is NULL (or a wild pointer)?
--
|/"\ John D. Baker, KN5UKS NetBSD Darwin/MacOS X
|\ / jdbaker[snail]mylinuxisp[flyspeck]com OpenBSD FreeBSD
| X No HTML/proprietary data in email. BSD just sits there and works!
|/ \ GPGkeyID: D703 4A7E 479F 63F8 D3F4 BD99 9572 8F23 E4AD 1645
From: "John D. Baker" <jdbaker@mylinuxisp.com>
To: gnats-bugs@NetBSD.org
Cc:
Subject: re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
Date: Wed, 2 Sep 2015 06:44:47 -0500 (CDT)
I've repeated my observations on a few other machines:
SPARCstation 5, 85MHz, 256MB
SPARCstation 10, 40MHz (SM41 SuperSPARC), 128MB
and most recently:
NetBSD dpe2950 7.99.21 NetBSD 7.99.21 (GENERIC) #63: Mon Aug 31 19:52:42 CDT 2015 sysop@yggdrasil.technoskunk.fur:/r0/build/current/obj/amd64/sys/arch/amd64/compile/GENERIC amd64
The panic:
panic: kernel diagnostic assertion "rt != NULL" failed: file "/r0/nbsd/current/src/sys/netinet/if_arp.c", line 1431
fatal breakpoint trap in supervisor mode
trap type 1 code 0 rip ffffffff8028a395 cs 8 rflags 246 cr2 ffff8000ddb92800 ilevel 2 rsp fffffe8115ef0e80
curlwp 0xfffffe8115f1ca40 pid 0.50 lowest kstack 0xfffffe8115eed2c0
Stopped in pid 0.50 (system) at netbsd:breakpoint+0x5: leave
db{6}> bt
breakpoint() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x13c
kern_assert() at netbsd:kern_assert+0x4f
arptimer() at netbsd:arptimer+0x1d0
callout_softclock() at netbsd:callout_softclock+0x1d0
softint_dispatch() at netbsd:softint_dispatch+0xd3
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfffffe8115ef0ff0
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
0:
db{6}>
Excerpt from 'dmesg':
NetBSD 7.99.21 (GENERIC) #63: Mon Aug 31 19:52:42 CDT 2015
sysop@yggdrasil.technoskunk.fur:/r0/build/current/obj/amd64/sys/arch/amd64/compile/GENERIC
total memory = 12287 MB
avail memory = 11911 MB
[...]
ACPI: RSDP 0x00000000000F2620 000024 (v02 DELL )
ACPI: XSDT 0x00000000000F26A0 00004C (v01 DELL PE_SC3 00000001 DELL 00000001)
ACPI: FACP 0x00000000000F27A8 0000F4 (v03 DELL PE_SC3 00000001 DELL 00000001)
ACPI: DSDT 0x00000000CFFA8000 003C53 (v01 DELL PE_SC3 00000001 MSFT 0100000E)
ACPI: FACS 0x00000000CFFB7C00 000040
ACPI: FACS 0x00000000CFFB7C00 000040
ACPI: APIC 0x00000000000F289C 0000C8 (v01 DELL PE_SC3 00000001 DELL 00000001)
ACPI: SPCR 0x00000000000F297D 000050 (v01 DELL PE_SC3 00000001 DELL 00000001)
ACPI: HPET 0x00000000000F29CD 000038 (v01 DELL PE_SC3 00000001 DELL 00000001)
ACPI: MCFG 0x00000000000F2A05 00003C (v01 DELL PE_SC3 00000001 DELL 00000001)
ACPI: All ACPI Tables successfully acquired
ioapic0 at mainbus0 apid 8
ioapic1 at mainbus0 apid 9
cpu0 at mainbus0 apid 0
cpu0: Intel(R) Xeon(R) CPU E5310 @ 1.60GHz, id 0x6f7
cpu1 at mainbus0 apid 4
cpu1: Intel(R) Xeon(R) CPU E5310 @ 1.60GHz, id 0x6f7
cpu2 at mainbus0 apid 1
cpu2: Intel(R) Xeon(R) CPU E5310 @ 1.60GHz, id 0x6f7
cpu3 at mainbus0 apid 5
cpu3: Intel(R) Xeon(R) CPU E5310 @ 1.60GHz, id 0x6f7
cpu4 at mainbus0 apid 2
cpu4: Intel(R) Xeon(R) CPU E5310 @ 1.60GHz, id 0x6f7
cpu5 at mainbus0 apid 6
cpu5: Intel(R) Xeon(R) CPU E5310 @ 1.60GHz, id 0x6f7
cpu6 at mainbus0 apid 3
cpu6: Intel(R) Xeon(R) CPU E5310 @ 1.60GHz, id 0x6f7
cpu7 at mainbus0 apid 7
cpu7: Intel(R) Xeon(R) CPU E5310 @ 1.60GHz, id 0x6f7
So, this is not just a SPARC issue.
I see that KASSERTs in "if_arp.c" were recently disabled.
--
|/"\ John D. Baker, KN5UKS NetBSD Darwin/MacOS X
|\ / jdbaker[snail]mylinuxisp[flyspeck]com OpenBSD FreeBSD
| X No HTML/proprietary data in email. BSD just sits there and works!
|/ \ GPGkeyID: D703 4A7E 479F 63F8 D3F4 BD99 9572 8F23 E4AD 1645
From: Ryota Ozaki <ozaki-r@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
jdbaker@mylinuxisp.com
Subject: Re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
Date: Thu, 3 Sep 2015 19:35:39 +0900
Hi,
Thank you for your investigation.
On Wed, Sep 2, 2015 at 8:50 PM, John D. Baker <jdbaker@mylinuxisp.com> wrote:
> The following reply was made to PR kern/50186; it has been noted by GNATS.
>
> From: "John D. Baker" <jdbaker@mylinuxisp.com>
> To: gnats-bugs@NetBSD.org
> Cc:
> Subject: re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
> Date: Wed, 2 Sep 2015 06:44:47 -0500 (CDT)
>
> I've repeated my observations on a few other machines:
>
> SPARCstation 5, 85MHz, 256MB
> SPARCstation 10, 40MHz (SM41 SuperSPARC), 128MB
>
> and most recently:
>
> NetBSD dpe2950 7.99.21 NetBSD 7.99.21 (GENERIC) #63: Mon Aug 31 19:52:42 CDT 2015 sysop@yggdrasil.technoskunk.fur:/r0/build/current/obj/amd64/sys/arch/amd64/compile/GENERIC amd64
>
> The panic:
>
> panic: kernel diagnostic assertion "rt != NULL" failed: file "/r0/nbsd/current/src/sys/netinet/if_arp.c", line 1431
> fatal breakpoint trap in supervisor mode
> trap type 1 code 0 rip ffffffff8028a395 cs 8 rflags 246 cr2 ffff8000ddb92800 ilevel 2 rsp fffffe8115ef0e80
> curlwp 0xfffffe8115f1ca40 pid 0.50 lowest kstack 0xfffffe8115eed2c0
> Stopped in pid 0.50 (system) at netbsd:breakpoint+0x5: leave
> db{6}> bt
> breakpoint() at netbsd:breakpoint+0x5
> vpanic() at netbsd:vpanic+0x13c
> kern_assert() at netbsd:kern_assert+0x4f
> arptimer() at netbsd:arptimer+0x1d0
> callout_softclock() at netbsd:callout_softclock+0x1d0
> softint_dispatch() at netbsd:softint_dispatch+0xd3
> DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfffffe8115ef0ff0
> Xsoftintr() at netbsd:Xsoftintr+0x4f
> --- interrupt ---
> 0:
> db{6}>
How do you produce the panic? I can't reproduce it on my amd64 machines yet.
>
>
> Excerpt from 'dmesg':
>
> NetBSD 7.99.21 (GENERIC) #63: Mon Aug 31 19:52:42 CDT 2015
> sysop@yggdrasil.technoskunk.fur:/r0/build/current/obj/amd64/sys/arch/amd64/compile/GENERIC
> total memory = 12287 MB
> avail memory = 11911 MB
> [...]
> ACPI: RSDP 0x00000000000F2620 000024 (v02 DELL )
> ACPI: XSDT 0x00000000000F26A0 00004C (v01 DELL PE_SC3 00000001 DELL 00000001)
> ACPI: FACP 0x00000000000F27A8 0000F4 (v03 DELL PE_SC3 00000001 DELL 00000001)
> ACPI: DSDT 0x00000000CFFA8000 003C53 (v01 DELL PE_SC3 00000001 MSFT 0100000E)
> ACPI: FACS 0x00000000CFFB7C00 000040
> ACPI: FACS 0x00000000CFFB7C00 000040
> ACPI: APIC 0x00000000000F289C 0000C8 (v01 DELL PE_SC3 00000001 DELL 00000001)
> ACPI: SPCR 0x00000000000F297D 000050 (v01 DELL PE_SC3 00000001 DELL 00000001)
> ACPI: HPET 0x00000000000F29CD 000038 (v01 DELL PE_SC3 00000001 DELL 00000001)
> ACPI: MCFG 0x00000000000F2A05 00003C (v01 DELL PE_SC3 00000001 DELL 00000001)
> ACPI: All ACPI Tables successfully acquired
> ioapic0 at mainbus0 apid 8
> ioapic1 at mainbus0 apid 9
> cpu0 at mainbus0 apid 0
> cpu0: Intel(R) Xeon(R) CPU E5310 @ 1.60GHz, id 0x6f7
> cpu1 at mainbus0 apid 4
> cpu1: Intel(R) Xeon(R) CPU E5310 @ 1.60GHz, id 0x6f7
> cpu2 at mainbus0 apid 1
> cpu2: Intel(R) Xeon(R) CPU E5310 @ 1.60GHz, id 0x6f7
> cpu3 at mainbus0 apid 5
> cpu3: Intel(R) Xeon(R) CPU E5310 @ 1.60GHz, id 0x6f7
> cpu4 at mainbus0 apid 2
> cpu4: Intel(R) Xeon(R) CPU E5310 @ 1.60GHz, id 0x6f7
> cpu5 at mainbus0 apid 6
> cpu5: Intel(R) Xeon(R) CPU E5310 @ 1.60GHz, id 0x6f7
> cpu6 at mainbus0 apid 3
> cpu6: Intel(R) Xeon(R) CPU E5310 @ 1.60GHz, id 0x6f7
> cpu7 at mainbus0 apid 7
> cpu7: Intel(R) Xeon(R) CPU E5310 @ 1.60GHz, id 0x6f7
>
>
> So, this is not just a SPARC issue.
>
> I see that KASSERTs in "if_arp.c" were recently disabled.
Does the fix suppress the panic on SPARC machines?
Thanks,
ozaki-r
From: Christos Zoulas <christos@zoulas.com>
To: "gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>
Cc: "kern-bug-people@netbsd.org" <kern-bug-people@netbsd.org>,
"gnats-admin@netbsd.org" <gnats-admin@netbsd.org>,
"netbsd-bugs@netbsd.org" <netbsd-bugs@netbsd.org>,
"jdbaker@mylinuxisp.com" <jdbaker@mylinuxisp.com>
Subject: Re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
Date: Thu, 3 Sep 2015 14:38:39 +0300
I just crashed in arptimer() so there are more locking problems in the code.
Can you document the locking discipline for la_rt and changing the lists?
From: Ryota Ozaki <ozaki-r@netbsd.org>
To: Christos Zoulas <christos@zoulas.com>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>,
"kern-bug-people@netbsd.org" <kern-bug-people@netbsd.org>,
"gnats-admin@netbsd.org" <gnats-admin@netbsd.org>, "netbsd-bugs@netbsd.org" <netbsd-bugs@netbsd.org>,
"jdbaker@mylinuxisp.com" <jdbaker@mylinuxisp.com>
Subject: Re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
Date: Fri, 4 Sep 2015 13:50:52 +0900
On Thu, Sep 3, 2015 at 8:38 PM, Christos Zoulas <christos@zoulas.com> wrote:
> I just crashed in arptimer() so there are more locking problems in the code. Can you document the locking discipline for la_rt and changing the lists?
>
I'm sorry for the defect.
An ARP cache list of an interface is protected by a rwlock
(IF_AFDATA_*LOCK) and each ARP cache is protected by a rwlock
and reference counting (LLE_*LOCK). However, la_rt still needs
softnet_lock; if la_rt is accessed or modified without
softnet_lock, it's a bug. And I found a bug :( lltable_free
accesses la_rt but it's called without softnet_lock.
Here is a patch:
http://www.netbsd.org/~ozaki-r/lltable_free-softnet_lock.diff
Could you try it?
Thanks,
ozaki-r
From: christos@zoulas.com (Christos Zoulas)
To: Ryota Ozaki <ozaki-r@netbsd.org>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>,
"kern-bug-people@netbsd.org" <kern-bug-people@netbsd.org>,
"gnats-admin@netbsd.org" <gnats-admin@netbsd.org>,
"netbsd-bugs@netbsd.org" <netbsd-bugs@netbsd.org>,
"jdbaker@mylinuxisp.com" <jdbaker@mylinuxisp.com>
Subject: Re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
Date: Fri, 4 Sep 2015 04:40:42 -0400
On Sep 4, 1:50pm, ozaki-r@netbsd.org (Ryota Ozaki) wrote:
-- Subject: Re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
| On Thu, Sep 3, 2015 at 8:38 PM, Christos Zoulas <christos@zoulas.com> wrote:
| > I just crashed in arptimer() so there are more locking problems in the code. Can you document the locking discipline for la_rt and changing the lists?
| >
|
| I'm sorry for the defect.
|
| An ARP cache list of an interface is protected by a rwlock
| (IF_AFDATA_*LOCK) and each ARP cache is protected by a rwlock
| and reference counting (LLE_*LOCK). However, la_rt still needs
| softnet_lock; if la_rt is accessed or modified without
| softnet_lock, it's a bug. And I found a bug :( lltable_free
| accesses la_rt but it's called without softnet_lock.
|
| Here is a patch:
| http://www.netbsd.org/~ozaki-r/lltable_free-softnet_lock.diff
| Could you try it?
|
Thanks, I am running with it now. Should we revert the KASSERT change
too?
christos
From: Ryota Ozaki <ozaki-r@netbsd.org>
To: Christos Zoulas <christos@zoulas.com>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>,
"kern-bug-people@netbsd.org" <kern-bug-people@netbsd.org>,
"gnats-admin@netbsd.org" <gnats-admin@netbsd.org>, "netbsd-bugs@netbsd.org" <netbsd-bugs@netbsd.org>,
"jdbaker@mylinuxisp.com" <jdbaker@mylinuxisp.com>
Subject: Re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
Date: Fri, 4 Sep 2015 18:10:39 +0900
On Fri, Sep 4, 2015 at 5:40 PM, Christos Zoulas <christos@zoulas.com> wrote:
> On Sep 4, 1:50pm, ozaki-r@netbsd.org (Ryota Ozaki) wrote:
> -- Subject: Re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
>
> | On Thu, Sep 3, 2015 at 8:38 PM, Christos Zoulas <christos@zoulas.com> wrote:
> | > I just crashed in arptimer() so there are more locking problems in the code. Can you document the locking discipline for la_rt and changing the lists?
> | >
> |
> | I'm sorry for the defect.
> |
> | An ARP cache list of an interface is protected by a rwlock
> | (IF_AFDATA_*LOCK) and each ARP cache is protected by a rwlock
> | and reference counting (LLE_*LOCK). However, la_rt still needs
> | softnet_lock; if la_rt is accessed or modified without
> | softnet_lock, it's a bug. And I found a bug :( lltable_free
> | accesses la_rt but it's called without softnet_lock.
> |
> | Here is a patch:
> | http://www.netbsd.org/~ozaki-r/lltable_free-softnet_lock.diff
> | Could you try it?
> |
>
> Thanks, I am running with it now. Should we revert the KASSERT change
> too?
Well, yes and no. Because KASSERT was actually wrong; la_rt can be NULL
at the point according to my investigation for PR 50184. So anyway we
have to get rid of it.
I made a patch for the bug:
http://www.netbsd.org/~ozaki-r/fix-PR50184.take2.diff
which was for PR 50184. So reverting your commit and applying the patch
instead might be easy for me. Of course rebasing my patch on the HEAD
makes no difference though.
ozaki-r
From: christos@zoulas.com (Christos Zoulas)
To: Ryota Ozaki <ozaki-r@netbsd.org>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>,
"kern-bug-people@netbsd.org" <kern-bug-people@netbsd.org>,
"gnats-admin@netbsd.org" <gnats-admin@netbsd.org>,
"netbsd-bugs@netbsd.org" <netbsd-bugs@netbsd.org>,
"jdbaker@mylinuxisp.com" <jdbaker@mylinuxisp.com>
Subject: Re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
Date: Fri, 4 Sep 2015 05:23:26 -0400
On Sep 4, 6:10pm, ozaki-r@netbsd.org (Ryota Ozaki) wrote:
-- Subject: Re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
| On Fri, Sep 4, 2015 at 5:40 PM, Christos Zoulas <christos@zoulas.com> wrote:
| > On Sep 4, 1:50pm, ozaki-r@netbsd.org (Ryota Ozaki) wrote:
| > -- Subject: Re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
| >
| > | On Thu, Sep 3, 2015 at 8:38 PM, Christos Zoulas <christos@zoulas.com> wrote:
| > | > I just crashed in arptimer() so there are more locking problems in the code. Can you document the locking discipline for la_rt and changing the lists?
| > | >
| > |
| > | I'm sorry for the defect.
| > |
| > | An ARP cache list of an interface is protected by a rwlock
| > | (IF_AFDATA_*LOCK) and each ARP cache is protected by a rwlock
| > | and reference counting (LLE_*LOCK). However, la_rt still needs
| > | softnet_lock; if la_rt is accessed or modified without
| > | softnet_lock, it's a bug. And I found a bug :( lltable_free
| > | accesses la_rt but it's called without softnet_lock.
| > |
| > | Here is a patch:
| > | http://www.netbsd.org/~ozaki-r/lltable_free-softnet_lock.diff
| > | Could you try it?
| > |
| >
| > Thanks, I am running with it now. Should we revert the KASSERT change
| > too?
|
| Well, yes and no. Because KASSERT was actually wrong; la_rt can be NULL
| at the point according to my investigation for PR 50184. So anyway we
| have to get rid of it.
|
| I made a patch for the bug:
| http://www.netbsd.org/~ozaki-r/fix-PR50184.take2.diff
| which was for PR 50184. So reverting your commit and applying the patch
| instead might be easy for me. Of course rebasing my patch on the HEAD
| makes no difference though.
Why don't you commit both of them?
christos
From: Ryota Ozaki <ozaki-r@netbsd.org>
To: Christos Zoulas <christos@zoulas.com>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>,
"kern-bug-people@netbsd.org" <kern-bug-people@netbsd.org>,
"gnats-admin@netbsd.org" <gnats-admin@netbsd.org>, "netbsd-bugs@netbsd.org" <netbsd-bugs@netbsd.org>,
"jdbaker@mylinuxisp.com" <jdbaker@mylinuxisp.com>
Subject: Re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
Date: Fri, 4 Sep 2015 18:30:40 +0900
On Fri, Sep 4, 2015 at 6:23 PM, Christos Zoulas <christos@zoulas.com> wrote:
> On Sep 4, 6:10pm, ozaki-r@netbsd.org (Ryota Ozaki) wrote:
> -- Subject: Re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
>
> | On Fri, Sep 4, 2015 at 5:40 PM, Christos Zoulas <christos@zoulas.com> wrote:
> | > On Sep 4, 1:50pm, ozaki-r@netbsd.org (Ryota Ozaki) wrote:
> | > -- Subject: Re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
> | >
> | > | On Thu, Sep 3, 2015 at 8:38 PM, Christos Zoulas <christos@zoulas.com> wrote:
> | > | > I just crashed in arptimer() so there are more locking problems in the code. Can you document the locking discipline for la_rt and changing the lists?
> | > | >
> | > |
> | > | I'm sorry for the defect.
> | > |
> | > | An ARP cache list of an interface is protected by a rwlock
> | > | (IF_AFDATA_*LOCK) and each ARP cache is protected by a rwlock
> | > | and reference counting (LLE_*LOCK). However, la_rt still needs
> | > | softnet_lock; if la_rt is accessed or modified without
> | > | softnet_lock, it's a bug. And I found a bug :( lltable_free
> | > | accesses la_rt but it's called without softnet_lock.
> | > |
> | > | Here is a patch:
> | > | http://www.netbsd.org/~ozaki-r/lltable_free-softnet_lock.diff
> | > | Could you try it?
> | > |
> | >
> | > Thanks, I am running with it now. Should we revert the KASSERT change
> | > too?
> |
> | Well, yes and no. Because KASSERT was actually wrong; la_rt can be NULL
> | at the point according to my investigation for PR 50184. So anyway we
> | have to get rid of it.
> |
> | I made a patch for the bug:
> | http://www.netbsd.org/~ozaki-r/fix-PR50184.take2.diff
> | which was for PR 50184. So reverting your commit and applying the patch
> | instead might be easy for me. Of course rebasing my patch on the HEAD
> | makes no difference though.
>
> Why don't you commit both of them?
I want to confirm that they really fix the bug(s). I have failed to reproduce
the panic on my machines. Does my softnet_lock patch fix your issue?
ozaki-r
From: christos@zoulas.com (Christos Zoulas)
To: Ryota Ozaki <ozaki-r@netbsd.org>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>,
"kern-bug-people@netbsd.org" <kern-bug-people@netbsd.org>,
"gnats-admin@netbsd.org" <gnats-admin@netbsd.org>,
"netbsd-bugs@netbsd.org" <netbsd-bugs@netbsd.org>,
"jdbaker@mylinuxisp.com" <jdbaker@mylinuxisp.com>
Subject: Re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
Date: Fri, 4 Sep 2015 05:43:57 -0400
On Sep 4, 6:30pm, ozaki-r@netbsd.org (Ryota Ozaki) wrote:
-- Subject: Re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
| > Why don't you commit both of them?
|
| I want to confirm that they really fix the bug(s). I have failed to reproduce
| the panic on my machines. Does my softnet_lock patch fix your issue?
There are races, so they are not 100% reproducible... If it does not
crash in the next couple of days...
christos
From: "John D. Baker" <jdbaker@mylinuxisp.com>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
Date: Fri, 4 Sep 2015 08:40:44 -0500 (CDT)
I meant to send this to gnats-bugs@, but forgot to reply-all and edit
the To: list:
On Thu, 3 Sep 2015, John D. Baker wrote:
> On Thu, 3 Sep 2015, Ryota Ozaki wrote:
>
> > Thank you for your investigation.
> > [snip]
> > How do you produce the panic? I don't reproduce on my amd64 machines yet.
>
> I boot the system, collect a couple of transient ARP entries (ping some
> other hosts) then just wait.
>
> > Does the fix suppress the panic on SPARC machines?
>
> My last build attempts failed due to ongoing changes to 'config' and
> related infrastructure. I am away from the systems until 8 September.
> I will try again then.
--
|/"\ John D. Baker, KN5UKS NetBSD Darwin/MacOS X
|\ / jdbaker[snail]mylinuxisp[flyspeck]com OpenBSD FreeBSD
| X No HTML/proprietary data in email. BSD just sits there and works!
|/ \ GPGkeyID: D703 4A7E 479F 63F8 D3F4 BD99 9572 8F23 E4AD 1645
From: Ryota Ozaki <ozaki-r@netbsd.org>
To: Christos Zoulas <christos@zoulas.com>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>,
"kern-bug-people@netbsd.org" <kern-bug-people@netbsd.org>,
"gnats-admin@netbsd.org" <gnats-admin@netbsd.org>, "netbsd-bugs@netbsd.org" <netbsd-bugs@netbsd.org>,
"jdbaker@mylinuxisp.com" <jdbaker@mylinuxisp.com>
Subject: Re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
Date: Tue, 8 Sep 2015 17:49:21 +0900
On Fri, Sep 4, 2015 at 6:43 PM, Christos Zoulas <christos@zoulas.com> wrote:
> On Sep 4, 6:30pm, ozaki-r@netbsd.org (Ryota Ozaki) wrote:
> -- Subject: Re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
>
> | > Why don't you commit both of them?
> |
> | I want to confirm that they really fix the bug(s). I have failed to reproduce
> | the panic on my machines. Does my softnet_lock patch fix your issue?
>
> There are races, so they are not 100% reproducible... If it does not
> crash in the next couple of days...
Sure... So how are things going now?
ozaki-r
From: christos@zoulas.com (Christos Zoulas)
To: Ryota Ozaki <ozaki-r@netbsd.org>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>,
"kern-bug-people@netbsd.org" <kern-bug-people@netbsd.org>,
"gnats-admin@netbsd.org" <gnats-admin@netbsd.org>,
"netbsd-bugs@netbsd.org" <netbsd-bugs@netbsd.org>,
"jdbaker@mylinuxisp.com" <jdbaker@mylinuxisp.com>
Subject: Re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
Date: Tue, 8 Sep 2015 07:27:32 -0400
On Sep 8, 5:49pm, ozaki-r@netbsd.org (Ryota Ozaki) wrote:
-- Subject: Re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
| On Fri, Sep 4, 2015 at 6:43 PM, Christos Zoulas <christos@zoulas.com> wrote:
| > On Sep 4, 6:30pm, ozaki-r@netbsd.org (Ryota Ozaki) wrote:
| > -- Subject: Re: kern/50186: sparc memfault panic after 7.99.21 ARP changes
| >
| > | > Why don't you commit both of them?
| > |
| > | I want to confirm that they really fix the bug(s). I have failed to reproduce
| > | the panic on my machines. Does my softnet_lock patch fix your issue?
| >
| > There are races, so they are not 100% reproducible... If it does not
| > crash in the next couple of days...
|
| Sure... So how are things going now?
It has not crashed... I guess commit everything... I think I will try to write a
synthetic test to cause the problem.
Best,
christos
From: "Ryota Ozaki" <ozaki-r@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/50186 CVS commit: src/sys/netinet
Date: Wed, 9 Sep 2015 01:24:01 +0000
Module Name: src
Committed By: ozaki-r
Date: Wed Sep 9 01:24:01 UTC 2015
Modified Files:
src/sys/netinet: if_arp.c
Log Message:
Remove wrong KASSERT in arptfree
la_rt can be NULL because arptimer that calls arptfree doesn't always
free llentry so llentry can remain with la_rt == NULL. So we instead
check whether la_rt is NULL or not and do arptfree if not.
This fixes PR kern/50184 (confirmed by martin@) and
PR kern/50186 (maybe).
To generate a diff of this commit:
cvs rdiff -u -r1.179 -r1.180 src/sys/netinet/if_arp.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
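The change described in the log message above, replacing an assertion that la_rt is non-NULL with a runtime check, can be sketched as follows. This is a minimal illustration and not the code from if_arp.c: the structures, the function arp_release_route_sketch, and its return convention are hypothetical. The diagnostic text mirrors the console message reported later in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins; not the real NetBSD structures. */
struct rtentry_sketch {
	int rt_refcnt;
};

struct llentry_sketch {
	struct rtentry_sketch *la_rt;
};

/* Returns 1 if a route reference was released, 0 if there was none. */
static int
arp_release_route_sketch(struct llentry_sketch *lle)
{
	if (lle->la_rt == NULL) {
		/*
		 * Before the fix this case would have tripped an
		 * assertion; here it is reported and skipped instead.
		 */
		fprintf(stderr, "sketch: llentry without rt\n");
		return 0;
	}
	/* Drop the reference and clear the pointer. */
	lle->la_rt->rt_refcnt--;
	lle->la_rt = NULL;
	return 1;
}
```

Calling the function twice on the same entry releases the route the first time and only logs the second time, which is the situation the log message describes for an entry that the timer left in place.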
From: "John D. Baker" <jdbaker@mylinuxisp.com>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: PR/50186 CVS commit: src/sys/netinet
Date: Wed, 9 Sep 2015 07:54:08 -0500 (CDT)
On Wed, 9 Sep 2015, Ryota Ozaki wrote:
> Remove wrong KASSERT in arptfree
>
> la_rt can be NULL because arptimer that calls arptfree doesn't always
> free llentry so llentry can remain with la_rt == NULL. So we instead
> check whether la_rt is NULL or not and do arptfree if not.
>
> This fixes PR kern/50184 (confirmed by martin@) and
> PR kern/50186 (maybe).
I've updated and built new release with these changes. So far, no panic
(testing on sparc right now) but instead the following message on the
console:
arptfree: llentry without rt
I will check with amd64 soon.
From: "John D. Baker" <jdbaker@mylinuxisp.com>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: PR/50186 CVS commit: src/sys/netinet
Date: Wed, 9 Sep 2015 15:01:27 -0500 (CDT)
On Wed, 9 Sep 2015, John D. Baker wrote:
> I've updated and built new release with these changes. So far, no panic
> (testing on sparc right now) but instead the following message on the
> console:
>
> arptfree: llentry without rt
>
> I will check with amd64 soon.
I've been running the amd64 system that previously panicked with the
KASSERT and so far (several hours uptime now) it has not panicked, nor
emitted the message shown above.
The sparc system has also been up several hours and the message above
has not been repeated.
From: Ryota Ozaki <ozaki-r@netbsd.org>
To: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
"John D. Baker" <jdbaker@mylinuxisp.com>
Subject: Re: PR/50186 CVS commit: src/sys/netinet
Date: Tue, 15 Sep 2015 17:42:47 +0900
Hi,
On Thu, Sep 10, 2015 at 5:05 AM, John D. Baker <jdbaker@mylinuxisp.com> wrote:
> The following reply was made to PR kern/50186; it has been noted by GNATS.
>
> From: "John D. Baker" <jdbaker@mylinuxisp.com>
> To: gnats-bugs@NetBSD.org
> Cc:
> Subject: Re: PR/50186 CVS commit: src/sys/netinet
> Date: Wed, 9 Sep 2015 15:01:27 -0500 (CDT)
>
> On Wed, 9 Sep 2015, John D. Baker wrote:
>
> > I've updated and built new release with these changes. So far, no panic
> > (testing on sparc right now) but instead the following message on the
> > console:
> >
> > arptfree: llentry without rt
> >
> > I will check with amd64 soon.
>
> I've been running the amd64 system that previously panicked with the
> KASSERT and so far (several hours uptime now) it has not panicked, nor
> emitted the message shown above.
>
> The sparc system has also been up several hours and the message above
> has not been repeated.
Thank you for testing! I'll close the ticket.
ozaki-r
State-Changed-From-To: open->closed
State-Changed-By: ozaki-r@NetBSD.org
State-Changed-When: Tue, 15 Sep 2015 08:49:04 +0000
State-Changed-Why:
Fix confirmed by the reporter.
>Unformatted: