NetBSD Problem Report #42539
From martin@aprisoft.de Tue Dec 29 09:36:53 2009
Return-Path: <martin@aprisoft.de>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by www.NetBSD.org (Postfix) with ESMTP id 1E20563B844
for <gnats-bugs@gnats.NetBSD.org>; Tue, 29 Dec 2009 09:36:53 +0000 (UTC)
Message-Id: <20091229093645.52AC8AF580E@emmas.aprisoft.de>
Date: Tue, 29 Dec 2009 10:36:45 +0100 (CET)
From: martin@NetBSD.org
Reply-To: martin@NetBSD.org
To: gnats-bugs@gnats.NetBSD.org
Subject: softint may fire on wrong cpu
X-Send-Pr-Version: 3.95
>Number: 42539
>Category: kern
>Synopsis: softint may fire on wrong cpu
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Dec 29 09:40:00 +0000 2009
>Closed-Date: Sat Jan 16 13:18:59 +0000 2010
>Last-Modified: Sat Jan 16 17:40:09 +0000 2010
>Originator: Martin Husemann
>Release: NetBSD 5.99.22
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD nelly.aprisoft.de 5.99.22 NetBSD 5.99.22 (NELLY.MP) #96: Mon Dec 28 22:26:09 CET 2009 martin@emmas.aprisoft.de:/nelly/usr/src/sys/arch/sparc64/compile/NELLY.MP sparc64
Architecture: sparc64
Machine: sparc64
>Description:
I started seeing a few KASSERTs some time ago, from softint_execute():
KASSERT(si->si_active);
The problem is pretty reliably reproducable for me, so I tried to find out
a bit more details. It turned out that the softint in question always was
"ser/1", the serial softint for cpu1. Due to the way interrupts are currently
handled in sparc64, there will never be a hard serial interrupt dispatched
to cpu1 (since cpu0 establishes the interrupt handler and we do not move
them). Now that explains why the softint is not marked as active, but how
could it fire?
I added some instrumentation and caught it like this:
panic: ci = 0x1814000, l = 0x1317c7c0, l->l_cpu = 0xd728000, l->l_ctxswtch = 1
Stopped in pid 473.1 (xulrunner-bin) at netbsd:cpu_Debugger+0x4: nop
softint_trigger(1814000, 1317c7c0, 3c2120a, d6c6d58, 46842ba0, 800) at netbsd:softint_trigger+0x74
softint_schedule(978, 1, 48d38548, 4684bda0, 46a34d60, 46a34d38) at netbsd:softint_schedule+0xdc
zshard(1, 4, ffffffffffe00000, ffffffff, 0, 0) at netbsd:zshard+0x54
intr_list_handler(0, a, e0017ed0, ffffffffb72bd290, 106f220, 4ed281e8) at netbsd:intr_list_handler+0x14
sparc_interrupt(1, 11eb898, 48b, 11eb938, 0, ffffffffffff6088) at netbsd:sparc_interrupt+0x238
lwp_setlock(1317c7c0, e5f7420, 0, ce10dc0, 0, 1317c810) at netbsd:lwp_setlock+0x2c
mi_switch(1317c7c0, 11ee2b8, 16d, 11eba80, 4, ffffffffffff6040) at netbsd:mi_switch+0x1d8
preempt(1317c7c0, 4050737c, 1322de00, 0, 0, 1411710) at netbsd:preempt+0xa0
trap(1322ded0, fffffffffffffffe, 4050737c, 99820092, 46842ba0, 800) at netbsd:trap+0x770
?(0, 1, 48d38548, 4684bda0, 46a34d60, 46a34d38) at 0x1009070
db{0}> mach cpu
cpu0: self 0x01814000 lwp 0x1317c7c0 pcb 0x1322a000 fplwp 0x1317c7c0
cpu1: self 0x0d728000 lwp 0x0ce18bc0 pcb 0x0d712000 fplwp 0x00000000
db{0}> ps
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
694 1 3 0 80 e338860 mpg123 aud_wr
533 1 3 0 80 13252fa0 mpg123 select
467 1 3 0 80 13253380 irc-20081115 select
464 1 3 1 80 133bc800 rxvt select
337 1 3 0 80 133bcbe0 ssh select
534 1 3 0 80 e5f7040 sh wait
410 1 3 1 80 1317d740 ssh select
473 35 3 1 80 133bc040 xulrunner-bin parked
473 34 3 0 80 13252bc0 xulrunner-bin parked
473 33 3 1 80 132527e0 xulrunner-bin parked
473 18 3 1 80 e5f64a0 xulrunner-bin parked
473 17 3 1 80 1317c3e0 xulrunner-bin parked
473 13 3 0 80 133bcfc0 xulrunner-bin parked
473 > 4 7 0 0 e5f7420 xulrunner-bin
473 3 3 0 80 13252020 xulrunner-bin parked
473 2 3 1 80 13252400 xulrunner-bin select
473 > 1 7 1 0 1317c7c0 xulrunner-bin
362 1 3 0 80 e5f7be0 qvwm select
356 1 3 1 80 1317c000 tcsh ttyraw
323 1 3 1 80 e5f60c0 tcsh pause
97 1 3 0 80 1317db20 xclock select
96 1 3 1 80 1317d360 rxvt select
136 1 3 1 80 13253760 rxvt select
134 1 3 1 80 13253b40 xload select
105 1 3 1 80 1317cba0 qvwm select
441 1 3 1 80 1317cf80 Xorg select
461 1 3 0 80 e338c40 xinit wait
381 1 3 1 80 e09fba0 ssh-agent select
373 1 3 0 80 e5f6c60 sh wait
371 1 3 1 80 ce24040 getty ttyraw
368 1 3 0 80 e09f7c0 getty ttyraw
370 1 3 1 80 ce24800 getty ttyraw
376 1 3 1 80 ce297a0 login wait
357 1 3 0 80 e5f7800 cron nanoslp
348 1 3 1 80 e3380a0 inetd kqueue
300 1 3 1 80 e338480 sshd select
287 1 3 0 80 e09f3e0 upsmon nanoslp
303 1 3 0 80 e339bc0 upsmon piperd
291 1 3 1 80 e3397e0 upsd select
260 1 3 0 80 e09e080 apcsmart select
266 1 3 1 80 e339400 ntpd pause
247 1 3 0 80 e09ec20 lpd select
238 1 3 0 80 e339020 mserv select
116 1 3 1 80 e09f000 syslogd kqueue
1 1 3 1 80 ce29b80 init wait
0 41 3 1 200 e09e840 nfskqpoll nfskqpw
0 40 3 1 200 e09e460 swapiod swapiod
0 37 3 1 200 ce28060 vmem_rehash vmem_rehash
0 36 3 0 200 ce28440 aiodoned aiodoned
0 35 3 0 200 ce28820 ioflush syncer
0 34 3 1 200 ce28c00 pgdaemon pgdaemon
0 33 3 0 200 ce28fe0 nfsio nfsiod
0 32 3 0 200 ce24be0 nfsio nfsiod
0 31 3 0 200 ce24420 nfsio nfsiod
0 30 3 1 200 ce24fc0 nfsio nfsiod
0 29 3 1 200 ce293c0 unpgc unpgc
0 20 3 0 200 ce253a0 scsibus0 sccomp
0 19 3 1 200 ce25780 xcall/1 xcall
0 18 1 1 200 ce25b60 softser/1
0 17 1 1 200 ce18020 softclk/1
0 16 1 1 200 ce18400 softbio/1
0 15 1 1 200 ce187e0 softnet/1
0 14 1 1 201 ce18bc0 idle/1
0 13 3 0 200 ce18fa0 pmfsuspend pmfsuspend
0 12 3 0 200 ce19380 pmfevent pmfevent
0 11 3 0 200 ce19760 nfssilly nfssilly
0 10 3 0 200 ce19b40 cachegc cachegc
0 9 3 0 200 ce12000 vrele vrele
0 8 3 0 200 ce123e0 modunload modunload
0 7 3 0 200 ce127c0 xcall/0 xcall
0 6 1 0 200 ce12ba0 softser/0
0 5 1 0 200 ce12f80 softclk/0
0 4 1 0 200 ce13360 softbio/0
0 3 1 0 200 ce13740 softnet/0
0 2 1 0 201 ce13b20 idle/0
0 1 3 1 200 140f5e0 swapper uvm
db{0}>
As you can see we are interrupted (zshard, the serial hardware interrupt)
midway in mi_switch, the old lwp has already been marked l_ctxswtch.
We run softint_trigger() on cpu0 (that is curcpu(), the ci= value in the
panic message), but curlwp (the l in the panic message) is being switched
and has l_cpu == cpu1. Since effectively the softint will fire on curlwp->l_cpu
we end up in the softint_execute for this softint on the wrong cpu.
>How-To-Repeat:
Use a multithreaded application (like firefox) with a serial mouse - at least
that pretty reliably triggers it on one of my machines (diskless with root
on NFS, audio playing, both may also be relevant).
Probably you need an arch that does not __HAVE_FAST_SOFTINTS.
>Fix:
Not sure, kill l_cpu in the migrated lwp earlier?
just use curcpu() in softint_trigger?
Or is it a MD bug and sparc64 code needs to do something I'm overlooking?
>Release-Note:
>Audit-Trail:
From: Mindaugas Rasiukevicius <rmind@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/42539 CVS commit: src/sys/kern
Date: Wed, 30 Dec 2009 23:49:59 +0000
Module Name: src
Committed By: rmind
Date: Wed Dec 30 23:49:59 UTC 2009
Modified Files:
src/sys/kern: kern_runq.c
Log Message:
sched_catchlwp: fix the case when other CPU might see curlwp->l_cpu != curcpu()
while LWP is finishing context switch. Should fix PR/42539, tested by martin@.
To generate a diff of this commit:
cvs rdiff -u -r1.27 -r1.28 src/sys/kern/kern_runq.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->feedback
State-Changed-By: rmind@NetBSD.org
State-Changed-When: Wed, 30 Dec 2009 23:59:41 +0000
State-Changed-Why:
martin@ cannot reproduce with a fix anymore. macallan@ still gets panics
on ppc, but likely unrelated - let's wait for more data from him.
State-Changed-From-To: feedback->closed
State-Changed-By: martin@NetBSD.org
State-Changed-When: Sat, 16 Jan 2010 13:18:59 +0000
State-Changed-Why:
fixed
From: Manuel Bouyer <bouyer@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/42539 CVS commit: [netbsd-5] src/sys/kern
Date: Sat, 16 Jan 2010 17:39:01 +0000
Module Name: src
Committed By: bouyer
Date: Sat Jan 16 17:39:01 UTC 2010
Modified Files:
src/sys/kern [netbsd-5]: kern_runq.c
Log Message:
Pull up following revision(s) (requested by rmind in ticket #1240):
sys/kern/kern_runq.c: revision 1.28
sched_catchlwp: fix the case when other CPU might see curlwp->l_cpu != curcpu()
while LWP is finishing context switch. Should fix PR/42539, tested by martin@.
To generate a diff of this commit:
cvs rdiff -u -r1.22.4.3 -r1.22.4.4 src/sys/kern/kern_runq.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Manuel Bouyer <bouyer@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/42539 CVS commit: [netbsd-5-0] src/sys/kern
Date: Sat, 16 Jan 2010 17:39:07 +0000
Module Name: src
Committed By: bouyer
Date: Sat Jan 16 17:39:07 UTC 2010
Modified Files:
src/sys/kern [netbsd-5-0]: kern_runq.c
Log Message:
Pull up following revision(s) (requested by rmind in ticket #1240):
sys/kern/kern_runq.c: revision 1.28
sched_catchlwp: fix the case when other CPU might see curlwp->l_cpu != curcpu()
while LWP is finishing context switch. Should fix PR/42539, tested by martin@.
To generate a diff of this commit:
cvs rdiff -u -r1.22.4.3 -r1.22.4.3.2.1 src/sys/kern/kern_runq.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.