NetBSD Problem Report #38335
From jarle@festningen.uninett.no Sat Mar 29 21:57:01 2008
Return-Path: <jarle@festningen.uninett.no>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by narn.NetBSD.org (Postfix) with ESMTP id 789EB63B8BC
for <gnats-bugs@gnats.NetBSD.org>; Sat, 29 Mar 2008 21:57:01 +0000 (UTC)
Message-Id: <20080329215658.3AE8180B63@festningen.uninett.no>
Date: Sat, 29 Mar 2008 22:56:58 +0100 (CET)
From: jarle@uninett.no
Reply-To: jarle@uninett.no
To: gnats-bugs@gnats.NetBSD.org
Subject: kernel freeze on alpha MP system
X-Send-Pr-Version: 3.95
>Number: 38335
>Category: port-alpha
>Synopsis: kernel freeze on alpha MP system
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: thorpej
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Mar 29 22:00:01 +0000 2008
>Closed-Date: Tue Oct 13 15:58:55 +0000 2020
>Last-Modified: Tue Oct 13 15:58:55 +0000 2020
>Originator: Jarle Greipsland
>Release: NetBSD -current
>Organization:
>Environment:
System: NetBSD sweetheart.urc.uninett.no 4.99.58 NetBSD 4.99.58 (GENERIC-$Revision: 1.314 $) #1: Sat Mar 29 20:46:15 CET 2008 jarle@sweetheart.urc.uninett.no:/usr/obj/sys/arch/alpha/compile/CS20.MP alpha
Architecture: alpha
Machine: alpha
>Description:
A -current alpha kernel on a CS20 dual processor system got stuck in the
kernel during a NetBSD build job. After more than three minutes without
any response from the system, I managed to break into ddb, and gather some
data:
Stopped in pid 27607.1 (sh) at netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> trace
cpu_Debugger() at netbsd:cpu_Debugger+0x4
comintr() at netbsd:comintr+0x718
alpha_shared_intr_dispatch() at netbsd:alpha_shared_intr_dispatch+0x5c
sio_iointr() at netbsd:sio_iointr+0x38
interrupt() at netbsd:interrupt+0x1c0
XentInt() at netbsd:XentInt+0x1c
--- interrupt (from ipl 0) ---
rw_vector_exit() at netbsd:rw_vector_exit+0x90
vm_map_unlock_read() at netbsd:vm_map_unlock_read+0x1c
uvm_fault_internal() at netbsd:uvm_fault_internal+0x16c8
trap() at netbsd:trap+0x6c8
XentMM() at netbsd:XentMM+0x20
--- memory management fault ---
memcpy() at netbsd:memcpy+0x54
copyout() at netbsd:copyout+0x50
copyout_vmspace() at netbsd:copyout_vmspace+0x7c
uiomove() at netbsd:uiomove+0xec
ubc_uiomove() at netbsd:ubc_uiomove+0xc8
ffs_read() at netbsd:ffs_read+0x3a8
VOP_READ() at netbsd:VOP_READ+0xcc
vn_rdwr() at netbsd:vn_rdwr+0xd0
vmcmd_readvn() at netbsd:vmcmd_readvn+0x64
execve1() at netbsd:execve1+0x790
syscall_plain() at netbsd:syscall_plain+0x1c4
XentSys() at netbsd:XentSys+0x60
--- syscall (59) ---
--- user mode ---
db{0}> show reg
v0 0xf9
t0 0xfffffc0000686a74 db_fromconsole
t1 0x1
t2 0xff00
t3 0xfffffc00006c0088 cn_magic
t4 0x2
t5 0xa0b
t6 0
t7 0xfffffc00006e11e8 vm_physmem
s0 0xf9
s1 0xc6
s2 0xfffffe00004cd000
s3 0xfffffe00004cd070
s4 0xfffffe00004d5008
s5 0x7fd
s6 0xfffffc00006a6b60 idstate+0x1ec20
a0 0xfffffc00006ea690 tsp_configuration
a1 0xfffffd01fc0003fd
a2 0x5
a3 0xfffffe0013d0b298
a4 0x22
a5 0
t8 0xfffffc00006e11e8 vm_physmem
t9 0xfffffc00004e38ac bt_hashhead+0x5c
t10 0x600f51dc
t11 0x1a5
ra 0xfffffc00003b1e78 comintr+0x718
t12 0xfffffc00005b7340 cpu_Debugger
at 0xfffffe0013d08000
gp 0xfffffc0000686b60 nfsrtton
sp 0xfffffe0013d0b1c8
pc 0xfffffc00005b7344 cpu_Debugger+0x4
ps 0x6
ai 0x1a5
pv 0xfffffc00005b7340 cpu_Debugger
netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> ps
PID PPID PGRP UID S FLAGS LWPS COMMAND WAIT
>27607 11306 20939 0 2 0 1 sh
11306 8909 20939 0 2 0 1 sh wait
8909 4701 20939 0 2 0x4000 1 sh piperd
4701 20692 20939 0 2 0x4000 1 sh wait
20692 20939 20939 0 2 0x4000 1 nbgmake wait
20939 11274 20939 0 2 0x4000 1 sh wait
11274 18129 18129 0 2 0x4000 1 nbmake select
18129 22431 18129 0 2 0x4000 1 sh wait
22431 5699 5699 0 2 0x4000 1 nbmake select
5699 3213 5699 0 2 0x4000 1 sh wait
3213 1188 1178 0 2 0x4000 1 nbmake select
1084 463 1084 1000 2 0x4000 1 top select
1188 1178 1178 0 2 0x4000 1 sh wait
1182 389 1182 0 2 0x4000 1 tail kqueue
1178 389 1178 0 2 0 1 tcsh pause
463 435 463 1000 2 0x4000 1 bash wait
435 443 443 1000 2 0x4000 1 xterm select
443 437 443 1000 2 0x4000 1 csh pause
437 66 66 1000 2 0x100 1 sshd select
66 248 66 0 2 0x4100 1 sshd netio
389 381 389 0 2 0x4000 1 tcsh pause
381 383 381 1000 2 0x4000 1 bash wait
383 294 294 1000 2 0x4000 1 xterm select
294 301 294 1000 2 0x4000 1 csh pause
301 327 327 1000 2 0x100 1 sshd select
327 248 327 0 2 0x4100 1 sshd netio
377 342 342 12 2 0x4100 1 qmgr kqueue
370 342 342 12 2 0x4100 1 pickup kqueue
369 1 369 0 2 0x4000 1 getty tty
372 1 372 0 2 0 1 cron nanoslp
355 1 355 0 2 0 1 inetd kqueue
342 1 342 0 2 0x4100 1 master kqueue
248 1 248 0 2 0 1 sshd select
236 1 236 0 2 0 1 ntpd pause
156 1 156 0 2 0 1 mount_mfs mfsidl
73 1 73 0 2 0 1 syslogd kqueue
1 0 1 0 2 0x4001 1 init wait
0 -1 0 0 2 0x20002 25 system *
db{0}> mach cpu 1
Using CPU 1
db{0}> trace
CPU 0: fatal kernel trap:
CPU 0 trap entry = 0x4 (unaligned access fault)
CPU 0 a0 = 0xfffffc00006e110d
CPU 0 a1 = 0x29
CPU 0 a2 = 0xa
CPU 0 pc = 0xfffffc00005b7b78
CPU 0 ra = 0xfffffc0000399044
CPU 0 pv = 0xfffffc00005b7ac0
CPU 0 curlwp = 0xfffffc00391713a0
CPU 0 pid = 27607, comm = sh
Caught exception in ddb.
db{0}> show reg
v0 0
t0 0
t1 0
t2 0
t3 0xfffffe0013243880
t4 0
t5 0xfffffc00006ecdc8 kernel_pmap_store
t6 0xfffffc00006ece28 kernel_pmap_store+0x60
t7 0x1
s0 0x1
s1 0xfffffc003f815980
s2 0xfffffc00006e1135 uvm+0xd
s3 0xfffffc00006e1128 uvm
s4 0xfffffe00004a3c00
s5 0x1
s6 0xfffffe001324fdc0
a0 0
a1 0x1
a2 0xfffffe0013bae000
a3 0x1
a4 0
a5 0xfffffe0013bb1dcc
t8 0x80
t9 0x67
t10 0xffffffff
t11 0xfffffc003b1e9328
ra 0xfffffc000049e95c idle_loop+0x12c
t12 0xfffffc00004cb360 sched_curcpu_runnable_p
at 0x3
gp 0xfffffc0000686b60 nfsrtton
sp 0xfffffc00006e1135 uvm+0xd
pc 0xfffffc000049e934 idle_loop+0x104
ps 0
ai 0xfffffc003b1e9328
pv 0xfffffc00004cb360 sched_curcpu_runnable_p
netbsd:idle_loop+0x104: ldq_u t0,d(s3)
>How-To-Repeat:
Run an MP kernel on an alpha system?
>Fix:
>Release-Note:
>Audit-Trail:
State-Changed-From-To: open->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sat, 19 Sep 2009 21:51:28 +0000
State-Changed-Why:
A bunch of MP alpha fixes just went into -current; might want to give them a
try.
From: Jarle Greipsland <jarle@uninett.no>
To: gnats-bugs@NetBSD.org, dholland@NetBSD.org
Cc: port-alpha-maintainer@netbsd.org, netbsd-bugs@netbsd.org,
gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Tue, 22 Sep 2009 19:00:07 +0200 (CEST)
dholland@NetBSD.org writes:
> Synopsis: kernel freeze on alpha MP system
>
> State-Changed-From-To: open->feedback
> State-Changed-By: dholland@NetBSD.org
> State-Changed-When: Sat, 19 Sep 2009 21:51:28 +0000
> State-Changed-Why:
> A bunch of MP alpha fixes just went into -current; might want to give them a
> try.
The problem is still there. I installed and booted GENERIC.MP,
and built a full release. During extraction of the newly built
sets, the system wedged again. This time I could not break into
DDB from the console.
I power-cycled the system, and started to extract the sets again.
During extraction I noticed that the IO performance through the
file system seemed abysmal (according to 'systat vm'). The 'tar
zxBpf' command consumed almost no CPU, but the disk IO hovered
around 500KB/s. Doing a plain dd from/to a single file on the
same disk gets around 20-30MB/s.
Also, right after root file system detection, the kernel complains:
WARNING: negative runtime; monotonic clock has gone backwards
I don't know if this might be related to the wedging or not.
If you want me to dig out more information, please ask, and I'll
see what I can do.
-jarle
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: Jarle Greipsland <jarle@uninett.no>
Cc: gnats-bugs@NetBSD.org, dholland@NetBSD.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Sat, 26 Sep 2009 11:04:37 -0600 (MDT)
On Tue, 22 Sep 2009, Jarle Greipsland wrote:
> The problem is still there. I installed and booted GENERIC.MP,
> and built a full release. During extraction of the newly built
> sets, the system wedged again. This time I could not break into
> DDB from the console.
This was a CS20, according to the original PR? If so, you can get
into DDB using the halt switch [it's hidden in a small hole in the front
panel to the right]. After the machine is halted, you can enter
'continue', which will enter into a console trap. The halt should
show the PC of cpu 0 at the time of the halt, and I think that address
should be related to the backtrace of cpu 0. One problem will be that
it's very likely cpu 1 is not halted, and is likely spinning with
interrupts blocked, so the IPI sent by cpu 0 to halt it will not be
delivered and there won't be any registers available [I'm not even sure
that works now even when the IPI can be delivered - I need to look into that].
> Also, right after root file system detection, the kernel complains:
>
> WARNING: negative runtime; monotonic clock has gone backwards
>
> I don't know if this might be related to the wedging or not.
Not very likely, I think. I see this every MP boot, and it is related
to using the PCC timecounter. I haven't figured out exactly how that
is supposed to work on an MP system yet.
> If you want me to dig out more information, please ask, and I'll
> see what I can do.
The problem is likely some kind of locking deadlock, which I've gotten
a few times. I've been trying to debug a problem with the TLB shootdown
code where it gets a corrupted pool_cache entry and ends up with a
list that links to itself. You can try this patch I'm using to attempt
to detect this and work around it:
@@ -3700,6 +3700,12 @@ pmap_tlb_shootdown(pmap_t pmap, vaddr_t
* don't really have to do anything else.
*/
mutex_spin_enter(&pq->pq_lock);
+/**/ if (pj && pj == pq->pq_head.tqh_first) {
+/**/ printf("Whoa! pool_cache_get returned an in-use entry! ci_index %d pj %p\n",
+ self->ci_index, pj);
+/**/ /*panic("Oops");*/
+/**/ pj = NULL; /* XXX */
+/**/ }
pq->pq_pte |= pte;
if (pq->pq_tbia) {
mutex_spin_exit(&pq->pq_lock);
An alternative workaround is to set PMAP_TLB_SHOOTDOWN_MAXJOBS to 0,
which will invalidate all tlbs instead of trying to invalidate
single entries.
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
From: Jarle Greipsland <jarle@uninett.no>
To: mhitch@lightning.msu.montana.edu
Cc: gnats-bugs@NetBSD.org, dholland@NetBSD.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Thu, 01 Oct 2009 13:34:50 +0200 (CEST)
"Michael L. Hitch" <mhitch@lightning.msu.montana.edu> writes:
> This was a CS20, according to the original PR? If so, you can get
> into DDB using the halt switch [it's hidden in a small hole in the front
> panel to the right]. After the machine is halted, you can enter
> 'continue', which will enter into a console trap. The halt should
> show the PC of cpu 0 at the time of the halt, and I think that address
> should be related to the backtrace of cpu 0.
Annotated console log:
[ I pressed the halt switch here ]
halted CPU 0
CPU 1 is not halted
halt code = 1
operator initiated halt
PC = fffffc0000736b08
P00>>>continue^U
P00>>>help
[ ... some help text here ... ]
P00>>>halt 1
P00>>>continue
continuing CPU 0
CP - RESTORE_TERM routine to be called
panic: user requested console halt
cpu0: Begin traceback...
alpha trace requires known PC =eject=
cpu0: End traceback...
WARNING: Unable to halt secondary CPUs (0x3)
dumping to dev 8,1 offset 2097597
dump i/o error
sd0(esiop0:0:0:0): polling command not done
panic: scsipi_execute_xs
cpu0: Begin traceback...
alpha trace requires known PC =eject=
cpu0: End traceback...
WARNING: Unable to halt secondary CPUs (0x3)
dumping to dev 8,1 offset 2097597
dump device not ready
rebooting...
CP - RESTORE_TERM exited with hlt_req = 3, r0 = 00000001.00000000
halted CPU 0
halt code = 0
PC = fffffc0000736b08
P00>>>cont
Slot context is not valid
P00>>>boot^U
P00>>>init
Initializing...
.
!1024 Meg of system memory
initializing GCT/FRU at 1d4000
Testing the System
Testing the Memory
Testing the Disks (read only)
Testing ei* devices.
API CS20D 833 MHz Console A5.8-81, 08-MAY-2001 00:09:13
[ .. end of console log .. ]
This seems less helpful than I had imagined; oh well. I hope you
can make some kind of sense out of it. According to gdb, the PC
is somewhere in pmap_do_tlb_shootdown (offset +296 in the listing
below):
(gdb) x/80i pmap_do_tlb_shootdown
0xfffffc00007369e0 <pmap_do_tlb_shootdown>: ldah gp,80(t12)
0xfffffc00007369e4 <pmap_do_tlb_shootdown+4>: lda gp,-31544(gp)
0xfffffc00007369e8 <pmap_do_tlb_shootdown+8>: lda sp,-64(sp)
0xfffffc00007369ec <pmap_do_tlb_shootdown+12>: stq ra,0(sp)
0xfffffc00007369f0 <pmap_do_tlb_shootdown+16>: stq s0,8(sp)
0xfffffc00007369f4 <pmap_do_tlb_shootdown+20>: stq s1,16(sp)
0xfffffc00007369f8 <pmap_do_tlb_shootdown+24>: stq s2,24(sp)
0xfffffc00007369fc <pmap_do_tlb_shootdown+28>: stq s3,32(sp)
0xfffffc0000736a00 <pmap_do_tlb_shootdown+32>: lda t0,48(sp)
0xfffffc0000736a04 <pmap_do_tlb_shootdown+36>: ldq s1,456(a0)
0xfffffc0000736a08 <pmap_do_tlb_shootdown+40>: stq t0,56(sp)
0xfffffc0000736a0c <pmap_do_tlb_shootdown+44>: ldah t0,4(gp)
0xfffffc0000736a10 <pmap_do_tlb_shootdown+48>: lda t0,-25256(t0)
0xfffffc0000736a14 <pmap_do_tlb_shootdown+52>: ldq t12,14240(gp)
0xfffffc0000736a18 <pmap_do_tlb_shootdown+56>: sll s1,0x6,t2
0xfffffc0000736a1c <pmap_do_tlb_shootdown+60>: stq zero,48(sp)
0xfffffc0000736a20 <pmap_do_tlb_shootdown+64>: lda t1,1
0xfffffc0000736a24 <pmap_do_tlb_shootdown+68>: addq t2,t0,s0
0xfffffc0000736a28 <pmap_do_tlb_shootdown+72>: lda s3,16(s0)
0xfffffc0000736a2c <pmap_do_tlb_shootdown+76>: sll t1,s1,s2
0xfffffc0000736a30 <pmap_do_tlb_shootdown+80>: mov s3,a0
0xfffffc0000736a34 <pmap_do_tlb_shootdown+84>: jsr ra,(t12),0xfffffc0000735b80 <pmap_do_reactivate+96>
0xfffffc0000736a38 <pmap_do_tlb_shootdown+88>: ldah gp,80(ra)
0xfffffc0000736a3c <pmap_do_tlb_shootdown+92>: lda gp,-31632(gp)
0xfffffc0000736a40 <pmap_do_tlb_shootdown+96>: ldq t1,0(s0)
0xfffffc0000736a44 <pmap_do_tlb_shootdown+100>: beq t1,0xfffffc0000736a6c <pmap_do_tlb_shootdown+140>
0xfffffc0000736a48 <pmap_do_tlb_shootdown+104>: ldq t0,56(sp)
0xfffffc0000736a4c <pmap_do_tlb_shootdown+108>: stq t1,0(t0)
0xfffffc0000736a50 <pmap_do_tlb_shootdown+112>: ldq t0,56(sp)
0xfffffc0000736a54 <pmap_do_tlb_shootdown+116>: ldq t1,0(s0)
0xfffffc0000736a58 <pmap_do_tlb_shootdown+120>: stq t0,8(t1)
0xfffffc0000736a5c <pmap_do_tlb_shootdown+124>: ldq t2,8(s0)
0xfffffc0000736a60 <pmap_do_tlb_shootdown+128>: stq zero,0(s0)
0xfffffc0000736a64 <pmap_do_tlb_shootdown+132>: stq s0,8(s0)
0xfffffc0000736a68 <pmap_do_tlb_shootdown+136>: stq t2,56(sp)
0xfffffc0000736a6c <pmap_do_tlb_shootdown+140>: ldl t0,40(s0)
0xfffffc0000736a70 <pmap_do_tlb_shootdown+144>: beq t0,0xfffffc0000736a9c <pmap_do_tlb_shootdown+188>
0xfffffc0000736a74 <pmap_do_tlb_shootdown+148>: ldl t0,32(s0)
0xfffffc0000736a78 <pmap_do_tlb_shootdown+152>: and t0,0x10,t0
0xfffffc0000736a7c <pmap_do_tlb_shootdown+156>: sextl t0,t0
0xfffffc0000736a80 <pmap_do_tlb_shootdown+160>: beq t0,0xfffffc0000736b84 <pmap_do_tlb_shootdown+420>
0xfffffc0000736a84 <pmap_do_tlb_shootdown+164>: lda a0,-2
0xfffffc0000736a88 <pmap_do_tlb_shootdown+168>: clr a1
0xfffffc0000736a8c <pmap_do_tlb_shootdown+172>: call_pal 0x33
0xfffffc0000736a90 <pmap_do_tlb_shootdown+176>: stl zero,40(s0)
0xfffffc0000736a94 <pmap_do_tlb_shootdown+180>: stl zero,32(s0)
0xfffffc0000736a98 <pmap_do_tlb_shootdown+184>: br 0xfffffc0000736b14 <pmap_do_tlb_shootdown+308>
0xfffffc0000736a9c <pmap_do_tlb_shootdown+188>: ldq t3,48(sp)
0xfffffc0000736aa0 <pmap_do_tlb_shootdown+192>: beq t3,0xfffffc0000736b10 <pmap_do_tlb_shootdown+304>
0xfffffc0000736aa4 <pmap_do_tlb_shootdown+196>: sll s1,0x4,t2
0xfffffc0000736aa8 <pmap_do_tlb_shootdown+200>: ldah t0,4(gp)
0xfffffc0000736aac <pmap_do_tlb_shootdown+204>: lda t0,-30440(t0)
0xfffffc0000736ab0 <pmap_do_tlb_shootdown+208>: addq t2,t0,t5
0xfffffc0000736ab4 <pmap_do_tlb_shootdown+212>: br 0xfffffc0000736af0 <pmap_do_tlb_shootdown+272>
0xfffffc0000736ab8 <pmap_do_tlb_shootdown+216>: nop
0xfffffc0000736abc <pmap_do_tlb_shootdown+220>: unop
0xfffffc0000736ac0 <pmap_do_tlb_shootdown+224>: ldq t1,24(t3)
0xfffffc0000736ac4 <pmap_do_tlb_shootdown+228>: addq t2,t1,t4
0xfffffc0000736ac8 <pmap_do_tlb_shootdown+232>: ldq t0,64(t1)
0xfffffc0000736acc <pmap_do_tlb_shootdown+236>: and s2,t0,t0
0xfffffc0000736ad0 <pmap_do_tlb_shootdown+240>: bne t0,0xfffffc0000736b00 <pmap_do_tlb_shootdown+288>
0xfffffc0000736ad4 <pmap_do_tlb_shootdown+244>: ldq t0,88(t4)
0xfffffc0000736ad8 <pmap_do_tlb_shootdown+248>: ldq t1,8(t5)
0xfffffc0000736adc <pmap_do_tlb_shootdown+252>: cmpeq t0,t1,t0
0xfffffc0000736ae0 <pmap_do_tlb_shootdown+256>: bne t0,0xfffffc0000736b94 <pmap_do_tlb_shootdown+436>
0xfffffc0000736ae4 <pmap_do_tlb_shootdown+260>: ldq t3,0(t3)
0xfffffc0000736ae8 <pmap_do_tlb_shootdown+264>: beq t3,0xfffffc0000736b10 <pmap_do_tlb_shootdown+304>
0xfffffc0000736aec <pmap_do_tlb_shootdown+268>: unop
0xfffffc0000736af0 <pmap_do_tlb_shootdown+272>: lda a0,3
0xfffffc0000736af4 <pmap_do_tlb_shootdown+276>: ldq t0,32(t3)
0xfffffc0000736af8 <pmap_do_tlb_shootdown+280>: and t0,0x10,t0
0xfffffc0000736afc <pmap_do_tlb_shootdown+284>: beq t0,0xfffffc0000736ac0 <pmap_do_tlb_shootdown+224>
0xfffffc0000736b00 <pmap_do_tlb_shootdown+288>: ldq a1,16(t3)
0xfffffc0000736b04 <pmap_do_tlb_shootdown+292>: call_pal 0x33
0xfffffc0000736b08 <pmap_do_tlb_shootdown+296>: ldq t3,0(t3)
0xfffffc0000736b0c <pmap_do_tlb_shootdown+300>: bne t3,0xfffffc0000736af0 <pmap_do_tlb_shootdown+272>
0xfffffc0000736b10 <pmap_do_tlb_shootdown+304>: stl zero,32(s0)
0xfffffc0000736b14 <pmap_do_tlb_shootdown+308>: mov s3,a0
0xfffffc0000736b18 <pmap_do_tlb_shootdown+312>: ldq t12,-29928(gp)
0xfffffc0000736b1c <pmap_do_tlb_shootdown+316>: stl zero,36(s0)
-jarle
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: gnats-admin@netbsd.org, jarle@uninett.no
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Thu, 1 Oct 2009 12:20:43 -0600 (MDT)
On Thu, 1 Oct 2009, Jarle Greipsland wrote:
> Annotated console log:
> [ I pressed the halt switch here ]
> halted CPU 0
> CPU 1 is not halted
>
> halt code = 1
> operator initiated halt
> PC = fffffc0000736b08
> P00>>>continue^U
> P00>>>help
> [ ... some help text here ... ]
> P00>>>halt 1
Halting the other cpu is not needed at this point, and might have caused
the problem when continuing cpu 0. Also, do you have ddb.onpanic set to
0? If so, you probably want it set to 1 so that you enter DDB when it panics.
> WARNING: Unable to halt secondary CPUs (0x3)
This would be caused by your halting cpu 1 earlier, although I'm pretty
sure the same thing would have happened even if you hadn't halted it.
> [ .. end of console log .. ]
> This seems less helpful than I had imagined; oh well. I hope you
> can make some kind of sense out of it. According to gdb, the PC
> is somewhere in pmap_do_tlb_shootdown (offset +296 in the listing
> below):
It does show that what I thought had happened did indeed happen.
> 0xfffffc0000736ac0 <pmap_do_tlb_shootdown+224>: ldq t1,24(t3)
> 0xfffffc0000736ac4 <pmap_do_tlb_shootdown+228>: addq t2,t1,t4
> 0xfffffc0000736ac8 <pmap_do_tlb_shootdown+232>: ldq t0,64(t1)
> 0xfffffc0000736acc <pmap_do_tlb_shootdown+236>: and s2,t0,t0
> 0xfffffc0000736ad0 <pmap_do_tlb_shootdown+240>: bne t0,0xfffffc0000736b00 <pmap_do_tlb_shootdown+288>
> 0xfffffc0000736ad4 <pmap_do_tlb_shootdown+244>: ldq t0,88(t4)
> 0xfffffc0000736ad8 <pmap_do_tlb_shootdown+248>: ldq t1,8(t5)
> 0xfffffc0000736adc <pmap_do_tlb_shootdown+252>: cmpeq t0,t1,t0
> 0xfffffc0000736ae0 <pmap_do_tlb_shootdown+256>: bne t0,0xfffffc0000736b94 <pmap_do_tlb_shootdown+436>
> 0xfffffc0000736ae4 <pmap_do_tlb_shootdown+260>: ldq t3,0(t3)
> 0xfffffc0000736ae8 <pmap_do_tlb_shootdown+264>: beq t3,0xfffffc0000736b10 <pmap_do_tlb_shootdown+304>
> 0xfffffc0000736aec <pmap_do_tlb_shootdown+268>: unop
> 0xfffffc0000736af0 <pmap_do_tlb_shootdown+272>: lda a0,3
> 0xfffffc0000736af4 <pmap_do_tlb_shootdown+276>: ldq t0,32(t3)
> 0xfffffc0000736af8 <pmap_do_tlb_shootdown+280>: and t0,0x10,t0
> 0xfffffc0000736afc <pmap_do_tlb_shootdown+284>: beq t0,0xfffffc0000736ac0 <pmap_do_tlb_shootdown+224>
> 0xfffffc0000736b00 <pmap_do_tlb_shootdown+288>: ldq a1,16(t3)
> 0xfffffc0000736b04 <pmap_do_tlb_shootdown+292>: call_pal 0x33
> 0xfffffc0000736b08 <pmap_do_tlb_shootdown+296>: ldq t3,0(t3)
> 0xfffffc0000736b0c <pmap_do_tlb_shootdown+300>: bne t3,0xfffffc0000736af0 <pmap_do_tlb_shootdown+272>
This is the loop where pmap_do_tlb_shootdown is processing the job queue
for this cpu (cpu 0). T3 is the current queue entry and almost certainly
is linked to itself - meaning it keeps invalidating the same tlb entry
over and over. At this point, cpu 0 will have the job queue
locked, and I'm certain that cpu 1 wants to invalidate another tlb entry,
and is attempting to acquire the lock that cpu 0 holds.
This is the problem I'm still trying to figure out and fix. The
patch I posted previously is a workaround to detect this particular
problem, and will display a message if it occurs. The other alternative
I posted (changing PMAP_TLB_SHOOTDOWN_MAXJOBS to 0) makes it never
allocate queue entries, so it should never loop.
What I think may be happening is that the IPI interrupt may be occurring
at level 5, which is the same level as the clock interrupt. The
mutex for the tlb shootdown queue uses IPL_VM, which I think is at level
4. If I understand the mutex locking mechanism, this would allow
an IPI request from another cpu to interrupt the code that manipulates the
shootdown queue entries, and could possibly corrupt the job queue.
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
From: Jarle Greipsland <jarle@uninett.no>
To: mhitch@lightning.msu.montana.edu
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Fri, 02 Oct 2009 08:54:35 +0200 (CEST)
"Michael L. Hitch" <mhitch@lightning.msu.montana.edu> writes:
> It does show that what I thought had happened did indeed happen.
> > 0xfffffc0000736ac0 <pmap_do_tlb_shootdown+224>: ldq t1,24(t3)
[ ... ]
Is there any more info I can gather for you, or can you just as
easily reproduce this yourself?
> This is the problem I'm still in the process of trying to figure out
> what the problem is and how to fix it. The patch I posted previously is a
> workaround to detect this particular problem, and will display a message
> if it occurs.
It _does_ occur. I applied the patch, and a 'build.sh -j4'
triggered the panic after a while.
-jarle
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: Jarle Greipsland <jarle@uninett.no>
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Sat, 3 Oct 2009 10:25:43 -0600 (MDT)
On Fri, 2 Oct 2009, Jarle Greipsland wrote:
> "Michael L. Hitch" <mhitch@lightning.msu.montana.edu> writes:
>> It does show that what I thought had happened did indeed happen.
>>> 0xfffffc0000736ac0 <pmap_do_tlb_shootdown+224>: ldq t1,24(t3)
> [ ... ]
> Is there any more info I can gather for you, or can you just as
> easily reproduce this yourself?
I'm quite certain in this particular case the job queue entry is linked
to itself, resulting in a hung cpu (which then hangs other cpus because it
has the shootdown queue locked for that cpu).
I haven't been able to easily reproduce this. I can complete full
builds fairly often, although a lot of the time I will get a segmentation
fault in one of the tools (typically grotty or install).
I also got another deadlock situation yesterday: cpu 0 had acquired the
lock for the shootdown queue (presumably for a different cpu - it should
be skipping the current cpu) and got interrupted by a shootdown IPI. The
IPI routine was trying to acquire the lock for the current cpu's queue,
which was currently locked (can't tell what held the lock though).
One thing I haven't tried yet is a LOCKDEBUG kernel, which should do
some additional checking on locking, and should be able to provide
information on what holds the lock. This last time I tried LOCKDEBUG, I
ran into problems and quickly got lost in the locking morass.
>> This is the problem I'm still in the process of trying to figure out
>> what the problem is and how to fix it. The patch I posted previously is a
>> workaround to detect this particular problem, and will display a message
>> if it occurs.
> It _does_ occur. I applied the patch, and a 'build.sh -j4'
> triggered the panic after a while.
What was the panic you got?
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: Jarle Greipsland <jarle@uninett.no>
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Sun, 4 Oct 2009 11:08:29 -0600 (MDT)
I have just committed the following patch, which seems to help
considerably. I've been doing successive builds using 2 cpus on an ES40.
I haven't gotten any deadlocks yet, but I did just get a segmentation
violation in the grotty program on the previous build this morning.
Index: sys/arch/alpha/alpha/pmap.c
===================================================================
RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
retrieving revision 1.242
diff -u -p -r1.242 pmap.c
--- sys/arch/alpha/alpha/pmap.c 10 Sep 2009 22:27:11 -0000 1.242
+++ sys/arch/alpha/alpha/pmap.c 4 Oct 2009 16:58:00 -0000
@@ -962,7 +962,7 @@ pmap_bootstrap(paddr_t ptaddr, u_int max
for (i = 0; i < ALPHA_MAXPROCS; i++) {
TAILQ_INIT(&pmap_tlb_shootdown_q[i].pq_head);
mutex_init(&pmap_tlb_shootdown_q[i].pq_lock, MUTEX_DEFAULT,
- IPL_VM);
+ IPL_SCHED);
}
#endif
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
From: Jarle Greipsland <jarle@uninett.no>
To: mhitch@lightning.msu.montana.edu
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Tue, 06 Oct 2009 11:17:09 +0200 (CEST)
"Michael L. Hitch" <mhitch@lightning.msu.montana.edu> writes:
> One thing I haven't tried yet is a LOCKDEBUG kernel, which should do
> some additional checking on locking, and should be able to provide
> information on what holds the lock. This last time I tried LOCKDEBUG, I
> ran into problems and quickly got lost in the locking morass.
Some time ago I tried to boot a LOCKDEBUG kernel, but it panicked
during autoconfig with an assertion:
Mutex error: lockdebug_alloc: already initialized
This is registered as port-alpha/38358
> >> This is the problem I'm still in the process of trying to figure out
> >> what the problem is and how to fix it. The patch I posted previously is a
> >> workaround to detect this particular problem, and will display a message
> >> if it occurs.
> > It _does_ occur. I applied the patch, and a 'build.sh -j4'
> > triggered the panic after a while.
>
> What was the panic you got?
The one from your patch:
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ee740
panic: Oops
Stopped in pid 9356.1 (as) at netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> trace
cpu_Debugger() at netbsd:cpu_Debugger+0x4
panic() at netbsd:panic+0x268
pmap_tlb_shootdown() at netbsd:pmap_tlb_shootdown+0x274
pmap_kenter_pa() at netbsd:pmap_kenter_pa+0xe8
uvm_pagermapin() at netbsd:uvm_pagermapin+0x1b4
genfs_gop_write() at netbsd:genfs_gop_write+0x48
genfs_do_putpages() at netbsd:genfs_do_putpages+0x74c
VOP_PUTPAGES() at netbsd:VOP_PUTPAGES+0xb0
ffs_write() at netbsd:ffs_write+0x728
VOP_WRITE() at netbsd:VOP_WRITE+0xb0
vn_write() at netbsd:vn_write+0x10c
dofilewrite() at netbsd:dofilewrite+0x90
sys_write() at netbsd:sys_write+0xa8
syscall_plain() at netbsd:syscall_plain+0x160
XentSys() at netbsd:XentSys+0x60
--- syscall (4) ---
--- user mode ---
-jarle
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: Jarle Greipsland <jarle@uninett.no>
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Tue, 6 Oct 2009 07:25:11 -0600 (MDT)
On Tue, 6 Oct 2009, Jarle Greipsland wrote:
>> What was the panic you got?
> The one from your patch:
> Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ee740
> panic: Oops
The patch I posted had the panic call commented out (the panic was from
earlier debug testing). The following line was the workaround that would
skip that entry and do a full tlb invalidate. Things should keep running
(for a while longer anyway) after that.
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
From: Jarle Greipsland <jarle@uninett.no>
To: mhitch@lightning.msu.montana.edu
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Tue, 06 Oct 2009 15:32:15 +0200 (CEST)
"Michael L. Hitch" <mhitch@lightning.msu.montana.edu> writes:
> On Tue, 6 Oct 2009, Jarle Greipsland wrote:
>
> >> What was the panic you got?
> > The one from your patch:
> > Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ee740
> > panic: Oops
>
> The patch I posted had the panic call commented out (the panic was from
> earlier debug testing). The following line was the workaround that would
> skip that entry and do a full tlb invalidate. Things should keep running
> (for a while longer anyway) after that.
Ah. That's what happens when one manually applies a patch :-(
Oh well. I have since updated my kernel, and am now running a
-current GENERIC.MP kernel with the IPL_VM -> IPL_SCHED
modification. I have started a "build -j4" job, and I'll
see how that works out.
-jarle
--
"There's too much blood in my caffeine system."
From: Jarle Greipsland <jarle@uninett.no>
To: mhitch@lightning.msu.montana.edu
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Tue, 06 Oct 2009 18:36:59 +0200 (CEST)
Jarle Greipsland <jarle@uninett.no> writes:
> Ah. That's what happens when one manually applies a patch :-(
> Oh well. I have since updated my kernel, and am now running a
> -current GENERIC.MP kernel with the IPL_VM -> IPL_SCHED
> modification. I have started a "build -j4" job, and I'll
> see how that works out.
Another data point: It crashed again; still with the same panic
from your patch, which I forgot to comment out when I updated my
sources. So, even with the IPL_VM->IPL_SCHED change in, it still
reaches the "Whoa! pool_cache_get returned an in-use entry!"
printf statement. I have no idea whether that's significant or
not.
I'll comment out the "Oops"-panic now, and see what happens
then.
-jarle
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: Jarle Greipsland <jarle@uninett.no>
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Tue, 6 Oct 2009 11:28:59 -0600 (MDT)
On Tue, 6 Oct 2009, Jarle Greipsland wrote:
> Another data point: It crashed again; still with the same panic
> from your patch, which I forgot to comment out when I updated my
> sources. So, even with the IPL_VM->IPL_SCHED change in, it still
> reaches the "Whoa! pool_cache_get returned an in-use entry!"
> printf statement. I have no idea whether that's significant or
> not.
>
> I'll comment out the "Oops"-panic now, and see what happens
> then.
That's interesting - I've been running a number of -j4 builds on an
ES40 with only 2 cpus enabled (to more closely simulate the CS20), and
haven't been able to hit that message, or some checks I've got in the
pool_cache code.
What I was trying to do was to see if just changing the mutex
initialization to IPL_SCHED was enough, or if it needed to be IPL_HIGH.
I was also trying IPL_SCHED and IPL_HIGH in the pool_cache_bootstrap()
just prior to the mutex_init.
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
From: Jarle Greipsland <jarle@uninett.no>
To: mhitch@lightning.msu.montana.edu
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Thu, 08 Oct 2009 16:16:03 +0200 (CEST)
Jarle Greipsland <jarle@uninett.no> writes:
> Jarle Greipsland <jarle@uninett.no> writes:
> > Ah. That's what happens when one manually applies a patch :-(
> > Oh well. I have since updated my kernel, and am now running a
> > -current GENERIC.MP kernel with the IPL_VM -> IPL_SCHED
> > modification. I have started a "build -j4" job, and I'll
> > see how that works out.
> Another data point: It crashed again; still with the same panic
> from your patch, which I forgot to comment out when I updated my
> sources. So, even with the IPL_VM->IPL_SCHED change in, it still
> reaches the "Whoa! pool_cache_get returned an in-use entry!"
> printf statement. I have no idea whether that's significant or
> not.
>
> I'll comment out the "Oops"-panic now, and see what happens
> then.
The system then keeps on running, though for every "build -j4"
job I run I get approximately one
"Whoa! pool_cache_get returned an in-use entry!"
message on the console. I guess this means we have a
papered-over bug now?
-jarle
--
Q: What's the difference between programming and bug collecting?
A: None.
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: Jarle Greipsland <jarle@uninett.no>
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Sun, 11 Oct 2009 12:13:57 -0600 (MDT)
On Thu, 8 Oct 2009, Jarle Greipsland wrote:
> The system then keeps on running. Though for every "build -j4"
> job I run I get approximately one
> "Whoa! pool_cache_get returned an in-use entry!"
> message on the console. I guess this means we have a
> papered-over bug now?
Yes - that patch just attempts to work around a duplicate pool cache
entry. I have not been able to replicate this on my ES40, and don't have
remote power capability on my CS20, which makes it harder to debug when
things can hang hard.
I'm wondering if using IPL_HIGH for the mutex changes anything:
Index: sys/arch/alpha/alpha/pmap.c
===================================================================
RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
retrieving revision 1.243
diff -u -p -r1.243 pmap.c
--- sys/arch/alpha/alpha/pmap.c 4 Oct 2009 17:00:31 -0000 1.243
+++ sys/arch/alpha/alpha/pmap.c 11 Oct 2009 18:01:59 -0000
@@ -962,7 +962,7 @@ pmap_bootstrap(paddr_t ptaddr, u_int max
for (i = 0; i < ALPHA_MAXPROCS; i++) {
TAILQ_INIT(&pmap_tlb_shootdown_q[i].pq_head);
mutex_init(&pmap_tlb_shootdown_q[i].pq_lock, MUTEX_DEFAULT,
- IPL_SCHED);
+ IPL_HIGH);
}
#endif
If that doesn't change anything, then I'd try increasing the IPL level
of the pool_cache entry:
Index: sys/arch/alpha/alpha/pmap.c
===================================================================
RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
retrieving revision 1.243
diff -u -p -r1.243 pmap.c
--- sys/arch/alpha/alpha/pmap.c 4 Oct 2009 17:00:31 -0000 1.243
+++ sys/arch/alpha/alpha/pmap.c 11 Oct 2009 17:59:33 -0000
@@ -958,7 +958,7 @@ pmap_bootstrap(paddr_t ptaddr, u_int max
*/
pool_cache_bootstrap(&pmap_tlb_shootdown_job_cache,
sizeof(struct pmap_tlb_shootdown_job), CACHE_LINE_SIZE,
- 0, PR_LARGECACHE, "pmaptlb", NULL, IPL_VM, NULL, NULL, NULL);
+ 0, PR_LARGECACHE, "pmaptlb", NULL, IPL_SCHED, NULL, NULL, NULL);
for (i = 0; i < ALPHA_MAXPROCS; i++) {
TAILQ_INIT(&pmap_tlb_shootdown_q[i].pq_head);
mutex_init(&pmap_tlb_shootdown_q[i].pq_lock, MUTEX_DEFAULT,
And:
Index: sys/arch/alpha/alpha/pmap.c
===================================================================
RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
retrieving revision 1.243
diff -u -p -r1.243 pmap.c
--- sys/arch/alpha/alpha/pmap.c 4 Oct 2009 17:00:31 -0000 1.243
+++ sys/arch/alpha/alpha/pmap.c 11 Oct 2009 17:59:56 -0000
@@ -958,7 +958,7 @@ pmap_bootstrap(paddr_t ptaddr, u_int max
*/
pool_cache_bootstrap(&pmap_tlb_shootdown_job_cache,
sizeof(struct pmap_tlb_shootdown_job), CACHE_LINE_SIZE,
- 0, PR_LARGECACHE, "pmaptlb", NULL, IPL_VM, NULL, NULL, NULL);
+ 0, PR_LARGECACHE, "pmaptlb", NULL, IPL_HIGH, NULL, NULL, NULL);
for (i = 0; i < ALPHA_MAXPROCS; i++) {
TAILQ_INIT(&pmap_tlb_shootdown_q[i].pq_head);
mutex_init(&pmap_tlb_shootdown_q[i].pq_lock, MUTEX_DEFAULT,
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
From: Jarle Greipsland <jarle@uninett.no>
To: gnats-bugs@NetBSD.org, mhitch@lightning.msu.montana.edu
Cc: port-alpha-maintainer@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Sun, 18 Oct 2009 17:20:33 +0200 (CEST)
"Michael L. Hitch" <mhitch@lightning.msu.montana.edu> writes:
> Yes - that patch just attempts to work around a duplicate pool cache
> entry. I have not been able to replicate this on my ES40, and don't have
> remote power capability on my CS20 which makes it harder to debug when
> things can hang hard.
Some more results:
I ran a few more 'build.sh -j4' jobs without changing anything,
and the kernel continued to emit spurious "Whoa! pool_cache_get
returned an in-use entry!" messages.
Then, on Oct 14, I synced my source tree to -current again (this
might have been a stupid move on my part...), and built and
booted a new GENERIC.MP kernel. After about an hour into a
'build.sh -j4' job, it panicked with a new (to me) message:
panic: pool_get: pmaptlb: page empty
Stopped in pid 25249.1 (cc1plus) at netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> tr
cpu_Debugger() at netbsd:cpu_Debugger+0x4
panic() at netbsd:panic+0x268
pool_get() at netbsd:pool_get+0x408
pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x1ac
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x168
pmap_tlb_shootdown() at netbsd:pmap_tlb_shootdown+0xe8
pmap_kenter_pa() at netbsd:pmap_kenter_pa+0xe8
uvm_km_alloc() at netbsd:uvm_km_alloc+0x170
kmem_backend_alloc() at netbsd:kmem_backend_alloc+0x48
vmem_xalloc() at netbsd:vmem_xalloc+0x318
vmem_alloc() at netbsd:vmem_alloc+0x90
kmem_poolpage_alloc() at netbsd:kmem_poolpage_alloc+0x48
pool_grow() at netbsd:pool_grow+0x64
pool_get() at netbsd:pool_get+0x5c
pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x1ac
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x168
kmem_alloc() at netbsd:kmem_alloc+0xc8
amap_extend() at netbsd:amap_extend+0x1b8
uvm_mapent_trymerge() at netbsd:uvm_mapent_trymerge+0x3dc
amap_copy() at netbsd:amap_copy+0x410
uvm_fault_internal() at netbsd:uvm_fault_internal+0xbb0
trap() at netbsd:trap+0x548
XentMM() at netbsd:XentMM+0x20
--- memory management fault (from ipl 0) ---
--- user mode ---
db{0}> show reg
v0 0x6
t0 0x1
t1 0x1
t2 0xfffffc003ff48000
t3 0
t4 0
t5 0xfffffc0000b46a2d __func__.21238+0x91c
t6 0
t7 0
s0 0xfffffc0000c378e0 msgbufenabled
s1 0x104
s2 0xfffffc0000c350a8 db_onpanic
s3 0xfffffc0000c6b8e8 pmap_tlb_shootdown_job_cache+0x1a8
s4 0
s5 0x2
s6 0
a0 0
a1 0xfffffd01fc0003f8
a2 0
a3 0x8
a4 0x3
a5 0xfffffc0000000008
t8 0
t9 0x8
t10 0xfffffc0000c941a8 uvmexp
t11 0
ra 0xfffffc000080c5a8 panic+0x268
t12 0xfffffc00003eb590 cpu_Debugger
at 0x1
gp 0xfffffc0000c30928 __link_set_prop_linkpools_sym__link__prop_array_pool+0x8008
sp 0xfffffe0013c7f5e8
pc 0xfffffc00003eb594 cpu_Debugger+0x4
ps 0
ai 0
pv 0xfffffc00003eb590 cpu_Debugger
netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> ps
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
21308> 1 7 1 40004 fffffc0024916c00 cc1plus
9441 1 3 0 84 fffffc0024916000 alpha--netbsd-c+ wait
21710 1 3 0 84 fffffc0024813000 sh wait
26347 1 2 0 40004 fffffc002de34800 cc1plus
18814 1 3 0 84 fffffc0024813400 alpha--netbsd-c+ wait
24418 1 3 0 84 fffffc002daf2800 sh wait
25249> 1 7 0 4 fffffc002d29e400 cc1plus
19917 1 3 0 84 fffffc0024917c00 alpha--netbsd-c+ wait
2263 1 3 1 84 fffffc002daf3000 sh wait
16091 1 2 0 40004 fffffc002caa9800 cc1plus
19836 1 3 0 84 fffffc003edac400 alpha--netbsd-c+ wait
7537 1 3 1 84 fffffc00233e9400 sh wait
18485 1 3 0 84 fffffc0024d52000 nbmake select
8826 1 3 0 84 fffffc002daf2000 sh wait
28933 1 3 1 84 fffffc002cab8000 nbmake select
15439 1 3 1 84 fffffc0024812000 sh wait
22920 1 3 1 84 fffffc0024917800 nbmake select
16006 1 3 1 84 fffffc0024d53400 sh wait
22712 1 3 1 84 fffffc0024d52800 nbmake select
28515 1 3 1 84 fffffc002caa9c00 sh wait
24901 1 3 1 84 fffffc00233e8800 nbmake select
22708 1 3 1 84 fffffc0024917000 pickup kqueue
20418 1 3 0 84 fffffc002d29ec00 bash ttyraw
17752 1 3 0 84 fffffc002caa9400 xterm select
29471 1 3 1 84 fffffc003f678000 csh pause
8486 1 3 0 84 fffffc002cab8c00 sshd select
23567 1 3 1 84 fffffc002d29f400 sshd netio
4106 1 3 0 84 fffffc00233e8000 sh wait
3271 1 3 1 84 fffffc00233e8c00 nbmake select
3779 1 3 0 84 fffffc0024916400 sh wait
3702 1 3 1 84 fffffc00233e9000 nbmake select
3935 1 3 1 84 fffffc003ca9e400 sh wait
3603 1 3 1 84 fffffc003ca9f000 nbmake select
750 1 3 0 84 fffffc003ca9e800 sh wait
730 1 3 1 84 fffffc003ca9ec00 tail kqueue
733 1 3 0 84 fffffc003f678400 tcsh pause
507 1 3 0 84 fffffc003ca9f800 tcsh pause
40 1 3 1 84 fffffc003ca9fc00 tcsh pause
41 1 3 1 84 fffffc003edad000 xterm select
459 1 3 0 84 fffffc003f9fe800 csh pause
473 1 3 1 84 fffffc003edac800 sshd select
528 1 3 0 84 fffffc003f960000 sshd netio
501 1 3 1 84 fffffc003f679800 qmgr kqueue
350 1 3 1 84 fffffc003f961c00 getty ttyraw
484 1 3 1 84 fffffc003f679400 cron nanoslp
498 1 3 1 84 fffffc003edac000 inetd kqueue
468 1 3 1 84 fffffc003edacc00 master kqueue
300 1 3 0 84 fffffc003edad400 sshd select
301 1 3 0 84 fffffc003edad800 ntpd pause
183 1 3 0 84 fffffc003edadc00 mount_mfs mfsidl
152 1 3 1 84 fffffc003f679000 syslogd kqueue
1 1 3 0 84 fffffc003f961000 init wait
0 39 3 0 204 fffffc003f678800 physiod physiod
0 38 3 1 204 fffffc003f960400 vmem_rehash vmem_rehash
0 37 3 0 204 fffffc003f960800 aiodoned aiodoned
0 36 3 1 204 fffffc003f9fe400 ioflush syncer
0 35 3 1 204 fffffc003f9fe000 pgdaemon pgdaemon
0 34 3 0 204 fffffc003f961400 cryptoret crypto_wait
0 33 3 0 204 fffffc003f961800 atapibus0 sccomp
0 31 3 0 204 fffffc003f960c00 unpgc unpgc
0 22 3 1 204 fffffc003f9fec00 atabus1 atath
0 21 3 0 204 fffffc003f9ff000 atabus0 atath
0 20 3 1 204 fffffc003f9ff400 scsibus0 sccomp
0 19 3 1 204 fffffc003f9ff800 xcall/1 xcall
0 18 1 1 204 fffffc003f9ffc00 softser/1
0 17 1 1 204 fffffc003fcf8000 softclk/1
0 16 1 1 204 fffffc003fcf8400 softbio/1
0 15 1 1 204 fffffc003fcf8800 softnet/1
0 14 1 1 205 fffffc003fcf8c00 idle/1
0 13 3 1 204 fffffc003fcf9000 pmfsuspend pmfsuspend
0 12 3 0 204 fffffc003fcf9400 pmfevent pmfevent
0 11 3 1 204 fffffc003fcf9800 nfssilly nfssilly
0 10 3 0 204 fffffc003fcf9c00 cachegc cachegc
0 9 3 1 204 fffffc003fe28000 vrele vrele
0 8 3 1 204 fffffc003fe28400 modunload modunload
0 7 3 0 204 fffffc003fe28800 xcall/0 xcall
0 6 1 0 204 fffffc003fe28c00 softser/0
0 5 1 0 204 fffffc003fe29000 softclk/0
0 4 1 0 204 fffffc003fe29400 softbio/0
0 3 1 0 204 fffffc003fe29800 softnet/0
0 2 1 0 205 fffffc003fe29c00 idle/0
0 1 3 0 204 fffffc0000c16660 swapper schedule
db{0}> mach cpu 1
Using CPU 1
db{0}> tr
CPU 0: fatal kernel trap:
CPU 0 trap entry = 0x2 (memory management fault)
CPU 0 a0 = 0xffffffffffffffd9
CPU 0 a1 = 0x1
CPU 0 a2 = 0x0
CPU 0 pc = 0xfffffc00003ee944
CPU 0 ra = 0xfffffc00003e8104
CPU 0 pv = 0xfffffc00003ee890
CPU 0 curlwp = 0xfffffc002d29e400
CPU 0 pid = 25249, comm = cc1plus
Caught exception in ddb.
db{0}> show reg
v0 0x1
t0 0x5b93a000
t1 0x2000
t2 0
t3 0x2000
t4 0xf
t5 0x16055e050
t6 0x2
t7 0
s0 0x2
s1 0x5b93a000
s2 0x161214008
s3 0x161214000
s4 0x16055e1d0
s5 0x16055e040
s6 0x50
a0 0x5b93a000
a1 0x48
a2 0
a3 0
a4 0x5
a5 0x5
t8 0x160814400
t9 0x3
t10 0x120582e58
t11 0x120581c28
ra 0x16063d168
t12 0x16064db44
at 0x2
gp 0x1606fb790
sp 0x1
pc 0x16064db70
ps 0x8
ai 0x120581c28
pv 0x16064db44
0x16064db70: and t1,#0xff,t4
I also got a kernel core dump from this one. Please give some
instructions if you want me to dig up some data from the dump.
Later on, after booting the same kernel, I got some more
"Whoa! pool_cache_get returned an in-use entry!" messages. Also
at some point (I don't think it correlated in time with any of
the kernel messages) the system hung hard again. I pressed the
halt button, and entered DDB:
halted CPU 0
CPU 1 is not halted
halt code = 1
operator initiated halt
PC = fffffc00005f6320
P00>>>cont
continuing CPU 0
CP - RESTORE_TERM routine to be called
panic: user requested console halt
Stopped in pid 0.35 (system) at netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> trace
cpu_Debugger() at netbsd:cpu_Debugger+0x4
panic() at netbsd:panic+0x268
console_restart() at netbsd:console_restart+0x78
XentRestart() at netbsd:XentRestart+0x90
--- console restart (from ipl 2) ---
mutex_spin_enter() at netbsd:mutex_spin_enter+0x200
pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x138
pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x168
pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x178
alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
interrupt() at netbsd:interrupt+0x88
XentInt() at netbsd:XentInt+0x1c
--- interrupt (from ipl 4) ---
mutex_exit() at netbsd:mutex_exit+0x10
pool_cache_invalidate() at netbsd:pool_cache_invalidate+0x6c
pool_reclaim() at netbsd:pool_reclaim+0x68
pool_drain_end() at netbsd:pool_drain_end+0x44
uvm_pageout() at netbsd:uvm_pageout+0x740
exception_return() at netbsd:exception_return
--- root of call graph ---
db{0}> show reg
v0 0xfffffc0000c0de28 cpu_info_primary
t0 0x1
t1 0x1
t2 0xfffffc003ff48000
t3 0
t4 0
t5 0xfffffc0000b46a2d __func__.21238+0x91c
t6 0xfffffc002e386888
t7 0
s0 0xfffffc0000c378e0 msgbufenabled
s1 0x104
s2 0xfffffc0000c350a8 db_onpanic
s3 0xfffffc0000c8ef18 panicstr
s4 0
s5 0xfffffc0000808e48 pool_cache_put_slow+0x138
s6 0
a0 0
a1 0xfffffd01fc0003f8
a2 0
a3 0x8
a4 0x3
a5 0xfffffc0000000008
t8 0xb
t9 0x8
t10 0x7fff
t11 0
ra 0xfffffc000080c5a8 panic+0x268
t12 0xfffffc00003eb590 cpu_Debugger
at 0x160328000
gp 0xfffffc0000c30928 __link_set_prop_linkpools_sym__link__prop_array_pool+0x8008
sp 0xfffffe00130e5908
pc 0xfffffc00003eb594 cpu_Debugger+0x4
ps 0x7
ai 0
pv 0xfffffc00003eb590 cpu_Debugger
netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> mach cpu 1
Using CPU 1
db{0}> trace
CPU 0: fatal kernel trap:
CPU 0 trap entry = 0x2 (memory management fault)
CPU 0 a0 = 0xffffffffffffffd9
CPU 0 a1 = 0x1
CPU 0 a2 = 0x0
CPU 0 pc = 0xfffffc00003ee944
CPU 0 ra = 0xfffffc00003e8104
CPU 0 pv = 0xfffffc00003ee890
CPU 0 curlwp = 0xfffffc003f9fe800
CPU 0 pid = 0, comm = system
Caught exception in ddb.
db{0}> c
cpu1: shutting down...
dumping to dev 8,1 offset 2097597
dump i/o error
sd0(esiop0:0:0:0): polling command not done
panic: scsipi_execute_xs
Stopped in pid 0.35 (system) at netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> reboot 4
> I'm wondering if using IPL_HIGH for the mutex changes anything:
>
> Index: sys/arch/alpha/alpha/pmap.c
> ===================================================================
> RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
> retrieving revision 1.243
> diff -u -p -r1.243 pmap.c
> --- sys/arch/alpha/alpha/pmap.c 4 Oct 2009 17:00:31 -0000 1.243
> +++ sys/arch/alpha/alpha/pmap.c 11 Oct 2009 18:01:59 -0000
> @@ -962,7 +962,7 @@ pmap_bootstrap(paddr_t ptaddr, u_int max
> for (i = 0; i < ALPHA_MAXPROCS; i++) {
> TAILQ_INIT(&pmap_tlb_shootdown_q[i].pq_head);
> mutex_init(&pmap_tlb_shootdown_q[i].pq_lock, MUTEX_DEFAULT,
> - IPL_SCHED);
> + IPL_HIGH);
> }
> #endif
I then applied this patch. I tried several 'build.sh -j4' jobs,
but all of them seemed to abort with some host tool tripping up:
/usr/tools/bin/nbgroff: grotty: Illegal instruction (core dumped)
/usr/tools/bin/nbgroff: troff: Illegal instruction (core dumped)
Then on one occasion, the kernel started to repeatedly spew
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ee00
messages to the console. The pj value was identical in all the
messages, but the ci_index value varied (0 or 1).
Do you still think I should try and increase the IPL level of the
pool_cache entry as specified in your message?
-jarle
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: Jarle Greipsland <jarle@uninett.no>
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Sun, 18 Oct 2009 11:37:20 -0600 (MDT)
On Sun, 18 Oct 2009, Jarle Greipsland wrote:
> I ran a few more 'build.sh -j4' jobs without changing anything,
> and the kernel continued to emit spurious "Whoa! pool_cache_get
> returned an in-use entry!" messages.
>
> Then, on Oct 14, I synced my source tree to -current again (this
> might have been a stupid move on my part...), and built and
> booted a new GENERIC.MP kernel. After about an hour into a
> 'build.sh -j4' job, it panicked with a new (to me) message:
>
> panic: pool_get: pmaptlb: page empty
This would likely indicate pool_cache corruption, which would be in line
with the one that my patch worked around.
> db{0}> mach cpu 1
> Using CPU 1
> db{0}> tr
>
> CPU 0: fatal kernel trap:
There's a bug somewhere in handling the registers of the cpus other than
the one that's running DDB. I can't remember if this ever worked
properly, although I'm sure I've at least tried it in the past.
> I also got a kernel core dump from this one. Please give some
> instructions if you want me to dig up some data from the dump.
I don't think there's a way to get much information out of a core dump
since 5.0. The new gdb doesn't yet have the code to access a core dump.
One more thing to try fixing some day.
> Later on, after booting the same kernel, I got some more
> "Whoa! pool_cache_get returned an in-use entry!" messages. Also
> at some point (I don't think it correlated in time with any of
> the kernel messages) the system hung hard again. I pressed the
> halt button, and entered DDB:
> halted CPU 0
> CPU 1 is not halted
>
> halt code = 1
> operator initiated halt
> PC = fffffc00005f6320
> P00>>>cont
>
> continuing CPU 0
> CP - RESTORE_TERM routine to be called
> panic: user requested console halt
> Stopped in pid 0.35 (system) at netbsd:cpu_Debugger+0x4: ret zero,(ra)
> db{0}> trace
> cpu_Debugger() at netbsd:cpu_Debugger+0x4
> panic() at netbsd:panic+0x268
> console_restart() at netbsd:console_restart+0x78
> XentRestart() at netbsd:XentRestart+0x90
> --- console restart (from ipl 2) ---
> mutex_spin_enter() at netbsd:mutex_spin_enter+0x200
> pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x138
> pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x168
> pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x178
> alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
> interrupt() at netbsd:interrupt+0x88
> XentInt() at netbsd:XentInt+0x1c
> --- interrupt (from ipl 4) ---
> mutex_exit() at netbsd:mutex_exit+0x10
> pool_cache_invalidate() at netbsd:pool_cache_invalidate+0x6c
> pool_reclaim() at netbsd:pool_reclaim+0x68
> pool_drain_end() at netbsd:pool_drain_end+0x44
> uvm_pageout() at netbsd:uvm_pageout+0x740
> exception_return() at netbsd:exception_return
This also appears to be a pool cache corruption problem. In this
particular case, CPU0 has received an IPI interrupt to shoot down its tlb
entries, is trying to release a pool cache entry used for this
call, and is hung trying to acquire a mutex related to the pool cache.
The other cpu presumably holds that lock (and may be trying to acquire a
lock held by CPU0, leading to the classic deadlock).
>> I'm wondering if using IPL_HIGH for the mutex changes anything:
>>
>> Index: sys/arch/alpha/alpha/pmap.c
>> ===================================================================
>> RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
>> retrieving revision 1.243
>> diff -u -p -r1.243 pmap.c
>> --- sys/arch/alpha/alpha/pmap.c 4 Oct 2009 17:00:31 -0000 1.243
>> +++ sys/arch/alpha/alpha/pmap.c 11 Oct 2009 18:01:59 -0000
>> @@ -962,7 +962,7 @@ pmap_bootstrap(paddr_t ptaddr, u_int max
>> for (i = 0; i < ALPHA_MAXPROCS; i++) {
>> TAILQ_INIT(&pmap_tlb_shootdown_q[i].pq_head);
>> mutex_init(&pmap_tlb_shootdown_q[i].pq_lock, MUTEX_DEFAULT,
>> - IPL_SCHED);
>> + IPL_HIGH);
>> }
>> #endif
>
> I then applied this patch. I tried several 'build.sh -j4' jobs,
> but all of them seemed to abort with some host tool tripping up:
> /usr/tools/bin/nbgroff: grotty: Illegal instruction (core dumped)
> /usr/tools/bin/nbgroff: troff: Illegal instruction (core dumped)
This is the problem I get now on my ES45. I haven't been able to figure
out anything from the process core dump, other than memory seems corrupted
or incorrect. I'm not sure if it might be related to the tlb shootdown
code not working properly, or perhaps a missing memory barrier call
somewhere.
> Then on one occasion, the kernel started to repeatedly spew
> Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ee00
> messages to the console. The pj value was identical in all the
> messages, but the ci_index value varied (0 or 1).
>
> Do you still think I should try and increase the IPL level of the
> pool_cache entry as specified in your message?
Try the higher IPL on the pmap_tlb_shootdown_job_cache. I'm not real
clear on how that IPL is used, but I'm guessing it might be the IPL used
by any locking done by the pool cache routines, and may be needed to
prevent the IPI interrupt from interrupting a pool cache operation. [That
might have caused the deadlock you observed above.] Try IPL_CLOCK first,
and then IPL_HIGH if that still has problems relating to the pool cache.
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
State-Changed-From-To: feedback->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Mon, 19 Oct 2009 00:58:46 +0000
State-Changed-Why:
This does not need to be in feedback any more. Glad to see you guys working
on it :-)
From: Jarle Greipsland <jarle@uninett.no>
To: gnats-bugs@NetBSD.org, mhitch@lightning.msu.montana.edu
Cc: port-alpha-maintainer@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Mon, 26 Oct 2009 15:34:14 +0100 (CET)
"Michael L. Hitch" <mhitch@lightning.msu.montana.edu> writes:
> > Then on one occasion, the kernel started to repeatedly spew
> > Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ee00
> > messages to the console. The pj value was identical in all the
> > messages, but the ci_index value varied (0 or 1).
> >
> > Do you still think I should try and increase the IPL level of the
> > pool_cache entry as specified in your message?
>
> Try the higher IPL on the pmap_tlb_shootdown_job_cache. I'm not real
> clear on how that IPL is used, but I'm guessing that might be the IPL used
> by any locking using by the pool cache routines, and may be needed to
> prevent the IPI interrupt from interrupting a pool cache operation. [That
> might have caused the deadlock you observed above.] Try IPL_CLOCK first,
> and then IPL_HIGH if that still has problems relating to the pool cache.
Results for the IPL_HIGH setting: it still has problems with the
pool_cache_get stuff. A number of consecutive 'build.sh -j4' runs
resulted in console messages:
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9efa00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee080
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee440
Also, during one of the builds, the system hung completely,
and I had to press the reset button:
----------------------------------------------------------------------
Stopped in pid 0.2 (system) at netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> tr
cpu_Debugger() at netbsd:cpu_Debugger+0x4
comintr() at netbsd:comintr+0x720
alpha_shared_intr_dispatch() at netbsd:alpha_shared_intr_dispatch+0x5c
sio_iointr() at netbsd:sio_iointr+0x38
interrupt() at netbsd:interrupt+0x1c0
XentInt() at netbsd:XentInt+0x1c
--- interrupt (from ipl 0) ---
sched_curcpu_runnable_p() at netbsd:sched_curcpu_runnable_p+0x1c
idle_loop() at netbsd:idle_loop+0x1b8
exception_return() at netbsd:exception_return
--- root of call graph ---
db{0}> mach cpu 1
Using CPU 1
db{0}> tra
CPU 0: fatal kernel trap:
CPU 0 trap entry = 0x2 (memory management fault)
CPU 0 a0 = 0xffffffffffffffd9
CPU 0 a1 = 0x1
CPU 0 a2 = 0x0
CPU 0 pc = 0xfffffc00003ee944
CPU 0 ra = 0xfffffc00003e8104
CPU 0 pv = 0xfffffc00003ee890
CPU 0 curlwp = 0xfffffc003fe29c00
CPU 0 pid = 0, comm = system
Caught exception in ddb.
db{0}> mach cpu 0
CPU 0 not paused
db{0}> show reg
v0 0x1
t0 0xfffffc0000b44e9c uvmadvice
t1 0xfffffc0000b44e9c uvmadvice
t2 0
t3 0x2000
t4 0
t5 0x3
t6 0x4
t7 0x1
s0 0xfffffc0038581698
s1 0
s2 0x2
s3 0xe78
s4 0xfffffe0013bc16e0
s5 0x160042000
s6 0xfffffc0031b69000
a0 0xfffffc0038581698
a1 0xfffffc0014c0f760
a2 0xfffffe0013bc1448
a3 0x160042000
a4 0xfffffe0013bc16e0
a5 0
t8 0x1
t9 0
t10 0xfffffc0000c941e8 uvmexp
t11 0x80
ra 0xfffffc00008e0bd4 uvm_fault_internal+0x124
t12 0xfffffc00008e5f50 uvm_map_lookup_entry
at 0xfffffe0013bbe000
gp 0xfffffc0000c30968 __link_set_prop_linkpools_sym__link__prop_array_pool+0x8008
sp 0x1
pc 0xfffffc00008e0e50 uvm_fault_internal+0x3a0
ps 0
ai 0x80
pv 0xfffffc00008e5f50 uvm_map_lookup_entry
netbsd:uvm_fault_internal+0x3a0: srl t3,#0xd,t3
db{0}> tr
CPU 0: fatal kernel trap:
CPU 0 trap entry = 0x2 (memory management fault)
CPU 0 a0 = 0xffffffffffffffd9
CPU 0 a1 = 0x1
CPU 0 a2 = 0x0
CPU 0 pc = 0xfffffc00003ee944
CPU 0 ra = 0xfffffc00003e8104
CPU 0 pv = 0xfffffc00003ee890
CPU 0 curlwp = 0xfffffc003fe29c00
CPU 0 pid = 0, comm = system
Caught exception in ddb.
db{0}> reboot 4
----------------------------------------------------------------------
Also, for another build, I got a
----------------------------------------------------------------------
panic: fpsave ipi didn't
Stopped in pid 22623.1 (sh) at netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> trace
cpu_Debugger() at netbsd:cpu_Debugger+0x4
panic() at netbsd:panic+0x268
fpusave_proc() at netbsd:fpusave_proc+0x1b4
cpu_lwp_free() at netbsd:cpu_lwp_free+0x28
exit1() at netbsd:exit1+0x568
sys_exit() at netbsd:sys_exit+0x7c
syscall_plain() at netbsd:syscall_plain+0x160
XentSys() at netbsd:XentSys+0x60
--- syscall (1) ---
--- user mode ---
db{0}> mach cpu 1
CPU 1 not paused
db{0}> show reg
v0 0x6
t0 0x1
t1 0x1
t2 0xfffffc003ff48000
t3 0
t4 0
t5 0xfffffc0000b46a65 __func__.21238+0x91c
t6 0xc5343806
t7 0xfffffffffffffcbe
s0 0xfffffc0000c37920 msgbufenabled
s1 0x104
s2 0xfffffc0000c350e8 db_onpanic
s3 0xfffffc003b546800
s4 0xfffffc0037d7b458
s5 0xfffffc0000c7bff0 initproc
s6 0x12003dab0
a0 0x6
a1 0xfffffd01fc0003f8
a2 0
a3 0x8
a4 0x3
a5 0x8
t8 0x2
t9 0x8
t10 0
t11 0x7
ra 0xfffffc000080c5a8 panic+0x268
t12 0xfffffc00003eb590 cpu_Debugger
at 0xfffffe0013b1a000
gp 0xfffffc0000c30968 __link_set_prop_linkpools_sym__link__prop_array_pool+0x8008
sp 0xfffffe00139cbcc8
pc 0xfffffc00003eb594 cpu_Debugger+0x4
ps 0x6
ai 0x7
pv 0xfffffc00003eb590 cpu_Debugger
netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> reboot 4
----------------------------------------------------------------------
Hope this helps.
-jarle
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: jarle@uninett.no
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Tue, 27 Oct 2009 15:33:06 -0600 (MDT)
On Mon, 26 Oct 2009, Jarle Greipsland wrote:
> Results for the IPL_HIGH setting: It still has problems with the
> pool_cache_get stuff. A number of consecutive 'build.sh -j4'
> resulted in console messages:
> Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9efa00
> Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee080
> Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee440
OK, here's something else to try. I was looking through the alpha
hardware reference manual and checking some of the multiprocessor
information. I noted that it showed the use of memory barriers when
sending/receiving interrupts between processors. It looks like the atomic
operations that were used in the netbsd-4 branch included the memory
barrier, but the ones used in netbsd-5 and later do not. This patch
should add back the memory barriers needed for the IPI stuff.
Index: sys/arch/alpha/alpha/ipifuncs.c
===================================================================
RCS file: /cvsroot/src/sys/arch/alpha/alpha/ipifuncs.c,v
retrieving revision 1.40
diff -u -p -r1.40 ipifuncs.c
--- sys/arch/alpha/alpha/ipifuncs.c 28 Apr 2008 20:23:10 -0000 1.40
+++ sys/arch/alpha/alpha/ipifuncs.c 27 Oct 2009 21:23:05 -0000
@@ -131,6 +131,7 @@ alpha_ipi_process(struct cpu_info *ci, s
}
#endif
+ alpha_mb();
pending_ipis = atomic_swap_ulong(&ci->ci_ipis, 0);
/*
@@ -167,6 +168,7 @@ alpha_send_ipi(u_long cpu_id, u_long ipi
#endif
atomic_or_ulong(&cpu_info[cpu_id]->ci_ipis, ipimask);
+ alpha_mb();
alpha_pal_wripir(cpu_id);
}
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
From: Jarle Greipsland <jarle@uninett.no>
To: mhitch@lightning.msu.montana.edu
Cc: gnats-bugs@NetBSD.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Wed, 28 Oct 2009 16:01:33 +0100 (CET)
"Michael L. Hitch" <mhitch@lightning.msu.montana.edu> writes:
> OK, here's something else to try. I was looking through the alpha
> hardware reference manual and checking some of the multiprocessor
> information. I noted that it showed the use of memory barriers when
> sending/receiving interrupts between processors. It looks like the atomic
> operations that were used in the netbsd-4 branch included the memory
> barrier, but the ones used in netbsd-5 and later do not. This patch
> should add back the memory barriers needed for the IPI stuff.
[ ... ]
OK, I have applied your patch (and removed the old ones except
for the one that generates the "Whoa!" warnings). The kernel
I'm running is GENERIC.MP based on -current from Oct 14th, with
the following diff:
----------------------------------------------------------------------
Index: arch/alpha/alpha/ipifuncs.c
===================================================================
RCS file: /cvsroot/src/sys/arch/alpha/alpha/ipifuncs.c,v
retrieving revision 1.40
diff -u -r1.40 ipifuncs.c
--- arch/alpha/alpha/ipifuncs.c 28 Apr 2008 20:23:10 -0000 1.40
+++ arch/alpha/alpha/ipifuncs.c 28 Oct 2009 14:55:46 -0000
@@ -130,7 +130,7 @@
return;
}
#endif
-
+ alpha_mb();
pending_ipis = atomic_swap_ulong(&ci->ci_ipis, 0);
/*
@@ -167,6 +167,7 @@
#endif
atomic_or_ulong(&cpu_info[cpu_id]->ci_ipis, ipimask);
+ alpha_mb();
alpha_pal_wripir(cpu_id);
}
Index: arch/alpha/alpha/pmap.c
===================================================================
RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
retrieving revision 1.243
diff -u -r1.243 pmap.c
--- arch/alpha/alpha/pmap.c 4 Oct 2009 17:00:31 -0000 1.243
+++ arch/alpha/alpha/pmap.c 28 Oct 2009 14:55:46 -0000
@@ -3699,6 +3699,12 @@
* don't really have to do anything else.
*/
mutex_spin_enter(&pq->pq_lock);
+ if (pj && pj == pq->pq_head.tqh_first) {
+ printf("Whoa! pool_cache_get returned an in-use entry! ci_index %d pj %p\n",
+ self->ci_index, pj);
+/**/ /* panic("Oops"); */
+ pj = NULL;
+ }
pq->pq_pte |= pte;
if (pq->pq_tbia) {
mutex_spin_exit(&pq->pq_lock);
----------------------------------------------------------------------
And it both "Whoa!"s and panics:
----------------------------------------------------------------------
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ef980
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee400
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee400
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee400
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee400
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee400
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee400
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee400
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee400
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee400
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ee400
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee400
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ee400
Whoa! pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee400
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ee400
CPU 1: fatal kernel trap:
CPU 1 trap entry = 0x2 (memory management fault)
CPU 1 a0 = 0x40
CPU 1 a1 = 0x1
CPU 1 a2 = 0x0
CPU 1 pc = 0xfffffc00007371a8
CPU 1 ra = 0xfffffc0000737118
CPU 1 pv = 0xfffffc00005f6130
CPU 1 curlwp = 0xfffffc003f960800
CPU 1 pid = 0, comm = system
panic: trap
Stopped in pid 0.37 (system) at netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{1}> tr
cpu_Debugger() at netbsd:cpu_Debugger+0x4
panic() at netbsd:panic+0x268
trap() at netbsd:trap+0x35c
XentMM() at netbsd:XentMM+0x20
--- memory management fault (from ipl 5) ---
pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0xe8
alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
interrupt() at netbsd:interrupt+0x88
XentInt() at netbsd:XentInt+0x1c
--- interrupt (from ipl 0) ---
mutex_spin_exit() at netbsd:mutex_spin_exit+0x5c
pmap_tlb_shootdown() at netbsd:pmap_tlb_shootdown+0x170
pmap_kremove() at netbsd:pmap_kremove+0xac
uvm_pagermapout() at netbsd:uvm_pagermapout+0x40
uvm_aio_aiodone() at netbsd:uvm_aio_aiodone+0xd4
db{1}> show reg
v0 0xfffffe0000034800
t0 0x1
t1 0x1
t2 0xfffffc003ff48000
t3 0
t4 0
t5 0xfffffc0000b46a3d __func__.21238+0x91c
t6 0
t7 0
s0 0xfffffc0000c378e0 msgbufenabled
s1 0x104
s2 0xfffffc0000c350a8 db_onpanic
s3 0xfffffc00009bf5fc reg_to_frame+0x5c8
s4 0xfffffe0013923a38
s5 0x40
s6 0xfffffc003f960800
a0 0x5
a1 0xfffffd01fc0003f8
a2 0
a3 0x8
a4 0x3
a5 0xfffffe0000000008
t8 0xfffffe00139237ff
t9 0x8
t10 0x3ea0a5
t11 0x1ff800
ra 0xfffffc000080c5b8 panic+0x268
t12 0xfffffc00003eb590 cpu_Debugger
at 0x12002438c
gp 0xfffffc0000c30928 __link_set_prop_linkpools_sym__link__prop_array_pool+0x8008
sp 0xfffffe0013923888
pc 0xfffffc00003eb594 cpu_Debugger+0x4
ps 0x6
ai 0x1ff800
pv 0xfffffc00003eb590 cpu_Debugger
netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{1}> mach cpu 0
Using CPU 0
db{1}> tr
CPU 1: fatal kernel trap:
CPU 1 trap entry = 0x2 (memory management fault)
CPU 1 a0 = 0xffffffffffffffd9
CPU 1 a1 = 0x1
CPU 1 a2 = 0x0
CPU 1 pc = 0xfffffc00003ee944
CPU 1 ra = 0xfffffc00003e8104
CPU 1 pv = 0xfffffc00003ee890
CPU 1 curlwp = 0xfffffc003f960800
CPU 1 pid = 0, comm = system
Caught exception in ddb.
db{1}> show reg
v0 0
t0 0
t1 0xfffffc003fe29c00
t2 0xfffffe0012c8a400
t3 0
t4 0
t5 0xfffffc003fe29c60
t6 0xfffffc0000c68410 kernel_pmap_store+0x50
t7 0
s0 0x1
s1 0xfffffc003fe29c00
s2 0xfffffc0000c0de60 cpu_info_primary+0x38
s3 0xfffffc0000c0de28 cpu_info_primary
s4 0
s5 0
s6 0
a0 0
a1 0
a2 0xfffffe0012c9a000
a3 0x1
a4 0xfffffc0000c94308 uvm_fpageqlock
a5 0
t8 0x1604db790
t9 0
t10 0xffffffff
t11 0xfffffc003f90a8b8
ra 0xfffffc00005eac88 idle_loop+0x1b8
t12 0xfffffc000060e200 kpreempt_enable
at 0xfffffe0013984000
gp 0xfffffc0000c30928 __link_set_prop_linkpools_sym__link__prop_array_pool+0x8008
sp 0x1
pc 0xfffffc00005eac34 idle_loop+0x164
ps 0
ai 0xfffffc003f90a8b8
pv 0xfffffc000060e200 kpreempt_enable
netbsd:idle_loop+0x164: ldq pv,-1d30(gp)
----------------------------------------------------------------------
So, no cigar this time. Anything else I should try?
-jarle
--
Q: What's the difference between programming and bug collecting?
A: None.
From: "Michael L. Hitch" <mhitch@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/38335 CVS commit: src/sys/arch/alpha/alpha
Date: Mon, 9 Nov 2009 04:31:04 +0000
Module Name: src
Committed By: mhitch
Date: Mon Nov 9 04:31:03 UTC 2009
Modified Files:
src/sys/arch/alpha/alpha: pmap.c
Log Message:
The tlb shootdown IPI can interrupt the pool_cache_get() in places that
can cause a deadlock or pool cache corruption. Take the shootdown job
queue mutex before calling pool_cache_get(), which will block the IPI
interrupts and seems to fix the remaining tlb shootdown deadlocks and
pool cache corruption I've been seeing. Should address both
PR port-alpha/38335 and PR port-alpha/42174.
To generate a diff of this commit:
cvs rdiff -u -r1.249 -r1.250 src/sys/arch/alpha/alpha/pmap.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: jarle@uninett.no
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Sun, 8 Nov 2009 21:34:34 -0700 (MST)
On Wed, 28 Oct 2009, Jarle Greipsland wrote:
> So, no cigar this time. Anything else I should try?
>
> -jarle
I've committed a fix to current that I think will fix this.
Index: sys/arch/alpha/alpha/pmap.c
===================================================================
RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
retrieving revision 1.249
diff -u -p -r1.249 pmap.c
--- sys/arch/alpha/alpha/pmap.c 7 Nov 2009 07:27:40 -0000 1.249
+++ sys/arch/alpha/alpha/pmap.c 9 Nov 2009 04:21:12 -0000
@@ -3581,6 +3581,7 @@ pmap_tlb_shootdown(pmap_t pmap, vaddr_t
cpumask |= 1UL << ci->ci_cpuid;
pq = &pmap_tlb_shootdown_q[ci->ci_cpuid];
+ mutex_spin_enter(&pq->pq_lock);
/*
* Allocate a job.
@@ -3596,7 +3597,6 @@ pmap_tlb_shootdown(pmap_t pmap, vaddr_t
* If a global flush is already pending, we
* don't really have to do anything else.
*/
- mutex_spin_enter(&pq->pq_lock);
pq->pq_pte |= pte;
if (pq->pq_tbia) {
mutex_spin_exit(&pq->pq_lock);
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
From: Soren Jacobsen <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/38335 CVS commit: [netbsd-5] src/sys/arch/alpha/alpha
Date: Sun, 15 Nov 2009 05:54:38 +0000
Module Name: src
Committed By: snj
Date: Sun Nov 15 05:54:38 UTC 2009
Modified Files:
src/sys/arch/alpha/alpha [netbsd-5]: pmap.c
Log Message:
Pull up following revision(s) (requested by mhitch in ticket #1137):
sys/arch/alpha/alpha/pmap.c: revision 1.250
The tlb shootdown IPI can interrupt the pool_cache_get() in places that
can cause a deadlock or pool cache corruption. Take the shootdown job
queue mutex before calling pool_cache_get(), which will block the IPI
interrupts and seems to fix the remaining tlb shootdown deadlocks and
pool cache corruption I've been seeing. Should address both
PR port-alpha/38335 and PR port-alpha/42174.
To generate a diff of this commit:
cvs rdiff -u -r1.236.10.3 -r1.236.10.4 src/sys/arch/alpha/alpha/pmap.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Soren Jacobsen <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/38335 CVS commit: [netbsd-5-0] src/sys/arch/alpha/alpha
Date: Sun, 15 Nov 2009 05:55:01 +0000
Module Name: src
Committed By: snj
Date: Sun Nov 15 05:55:00 UTC 2009
Modified Files:
src/sys/arch/alpha/alpha [netbsd-5-0]: pmap.c
Log Message:
Pull up following revision(s) (requested by mhitch in ticket #1137):
sys/arch/alpha/alpha/pmap.c: revision 1.250
The tlb shootdown IPI can interrupt the pool_cache_get() in places that
can cause a deadlock or pool cache corruption. Take the shootdown job
queue mutex before calling pool_cache_get(), which will block the IPI
interrupts and seems to fix the remaining tlb shootdown deadlocks and
pool cache corruption I've been seeing. Should address both
PR port-alpha/38335 and PR port-alpha/42174.
To generate a diff of this commit:
cvs rdiff -u -r1.236.14.2 -r1.236.14.3 src/sys/arch/alpha/alpha/pmap.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->feedback
State-Changed-By: snj@NetBSD.org
State-Changed-When: Thu, 04 Feb 2010 02:41:25 +0000
State-Changed-Why:
Did the committed fix take care of the issue you're seeing?
From: Jarle Greipsland <jarle@uninett.no>
To: gnats-bugs@NetBSD.org, snj@NetBSD.org
Cc: port-alpha-maintainer@netbsd.org, netbsd-bugs@netbsd.org,
gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Fri, 05 Feb 2010 09:20:03 +0100 (CET)
snj@NetBSD.org writes:
> Synopsis: kernel freeze on alpha MP system
>
> State-Changed-From-To: open->feedback
> State-Changed-By: snj@NetBSD.org
> State-Changed-When: Thu, 04 Feb 2010 02:41:25 +0000
> State-Changed-Why:
> Did the committed fix take care of the issue you're seeing?
I don't think so. I have been running with a LOCKDEBUG-kernel,
and yesterday, while trying to do a 'build.sh release', it
panicked.
login: Mutex error: lockdebug_wantlock: locking against myself
lock address : 0xfffffc0000882f68 type : spin
initialized : 0xfffffc00006735d8
shared holds : 0 exclusive: 1
shares wanted: 0 exclusive: 1
current cpu : 0 last held: 0
current l : 0xfffffc003fab2400 last held: 0xfffffc003fab2400
last locked : 0xfffffc0000674690 unlocked : 0xfffffc0000672478
owner field : 0x0000000000000400 wait/sn: 0/1
panic: LOCKDEBUG
Stopped in pid 0.35 (system) at netbsd:cpu_Debugger+0x4: ret zero,(ra
)
db{0}>
db{0}> trace
cpu_Debugger() at netbsd:cpu_Debugger+0x4
panic() at netbsd:panic+0x278
lockdebug_abort1() at netbsd:lockdebug_abort1+0x150
mutex_enter() at netbsd:mutex_enter+0x38c
pool_get() at netbsd:pool_get+0x58
pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x2b4
pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x1e0
pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x178
alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
interrupt() at netbsd:interrupt+0x88
XentInt() at netbsd:XentInt+0x1c
--- interrupt (from ipl 4) ---
lockdebug_mem_check() at netbsd:lockdebug_mem_check+0x1b4
pool_put() at netbsd:pool_put+0x7c
pool_cache_invalidate_groups() at netbsd:pool_cache_invalidate_groups+0xf4
pool_cache_invalidate() at netbsd:pool_cache_invalidate+0x8c
pool_reclaim() at netbsd:pool_reclaim+0x68
pool_drain_end() at netbsd:pool_drain_end+0x60
uvm_pageout() at netbsd:uvm_pageout+0x880
exception_return() at netbsd:exception_return
--- root of call graph ---
db{0}> ps
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
23711 1 3 1 0 fffffc003d13cc00 alpha--netbsd-in biowait
28502 1 3 1 80 fffffc002abf4800 sh wait
26148 1 2 0 0 fffffc003fab2000 sh
25393 1 3 1 0 fffffc0013211c00 alpha--netbsd-in nodebug
13042 1 3 1 80 fffffc002abf5c00 sh wait
13735 1 3 1 0 fffffc003d6a9c00 alpha--netbsd-in biowait
8778 1 3 1 80 fffffc002c220400 sh wait
20723 1 3 1 80 fffffc0029f73c00 nbmake select
3839 1 3 1 80 fffffc003ecbe400 sh wait
10654 1 3 1 80 fffffc0028789800 nbmake select
27117 1 3 0 40080 fffffc002aff8400 sh wait
15901 1 3 1 80 fffffc0028788c00 nbmake select
18873 1 3 1 80 fffffc003d13c800 sh wait
21384 1 3 0 80 fffffc003d6a8800 nbmake select
18742 1 3 1 80 fffffc0029f73000 sh wait
4474 1 3 1 80 fffffc0028789400 nbmake select
20634 1 3 0 80 fffffc001ba59400 sh wait
4601 1 3 1 80 fffffc0035c89c00 nbmake select
18174 1 3 1 80 fffffc002ab44800 sh wait
23525 1 3 1 80 fffffc0035c89000 nbmake select
16471 1 3 1 80 fffffc0029f72000 sh wait
18058 1 3 0 80 fffffc002aff9000 nbmake select
1991 1 3 1 80 fffffc002aff8800 sh wait
3250 1 3 0 80 fffffc003fab2800 nbmake select
1383 1 3 0 80 fffffc003d6a8000 sh wait
22718 1 3 0 80 fffffc0035c89800 nbmake select
18050 1 3 1 80 fffffc003d533400 sh wait
22901 1 3 1 80 fffffc002aff9c00 nbmake select
27793 1 3 0 80 fffffc001ba58c00 nbmake select
2373 1 3 1 80 fffffc0035c88800 sh wait
26438 1 3 1 80 fffffc002aff9800 sh wait
11413 1 3 0 80 fffffc002abf4400 nbmake select
11875 1 3 1 80 fffffc002c220c00 pickup kqueue
4144 1 3 1 80 fffffc002c221800 bash ttyraw
12438 1 3 1 80 fffffc003d13d400 xterm select
5290 1 3 1 80 fffffc003d532800 csh pause
1592 1 3 0 80 fffffc002c221c00 sshd select
21396 1 3 1 80 fffffc001ba58400 sshd netio
26068 1 3 0 80 fffffc0028788800 sh wait
18777 1 3 0 80 fffffc003d533c00 nbmake select
24640 1 3 1 80 fffffc002ab45000 sh wait
13734 1 3 0 80 fffffc002c221400 nbmake select
26943 1 3 0 80 fffffc003ecbe800 sh wait
3019 1 3 0 80 fffffc002abf4000 nbmake select
18256 1 3 0 80 fffffc003d6a9800 sh wait
6532 1 3 0 80 fffffc003d532000 nbmake select
23787 1 3 0 80 fffffc002abf5800 sh wait
20208 1 3 1 80 fffffc003d6a8c00 tcsh pause
527 1 3 1 80 fffffc003fa09800 getty ttyraw
515 1 3 0 80 fffffc003d533000 cron nanoslp
508 1 3 1 80 fffffc003ecbec00 python2.5 select
484 1 3 0 80 fffffc003f6fc000 qmgr kqueue
443 1 3 1 80 fffffc003d533800 inetd kqueue
478 1 3 1 80 fffffc003f6fc400 master kqueue
300 1 3 1 80 fffffc003ecbf400 sshd select
301 1 3 0 80 fffffc003ecbf800 ntpd pause
183 1 3 0 80 fffffc003ecbfc00 mount_mfs mfsidl
152 1 3 1 80 fffffc003f6fd000 syslogd kqueue
1 1 3 1 80 fffffc003fa09400 init wait
0 39 3 1 200 fffffc003f6fc800 physiod physiod
0 38 3 1 200 fffffc003fa08000 vmem_rehash vmem_rehash
0 37 3 1 200 fffffc003fa08400 aiodoned aiodoned
0 36 3 0 200 fffffc003fa08800 ioflush syncer
0 > 35 7 0 200 fffffc003fab2400 pgdaemon
0 34 3 1 200 fffffc003fa09000 cryptoret crypto_wait
0 33 3 1 200 fffffc003fa09c00 atapibus0 sccomp
0 31 3 1 200 fffffc003fa08c00 unpgc unpgc
0 22 3 0 200 fffffc003fab2c00 atabus1 atath
0 21 3 1 200 fffffc003fab3000 atabus0 atath
0 20 3 0 200 fffffc003fab3400 scsibus0 sccomp
0 19 3 1 200 fffffc003fab3800 xcall/1 xcall
0 18 1 1 200 fffffc003fab3c00 softser/1
0 17 1 1 200 fffffc003fcea000 softclk/1
0 16 1 1 200 fffffc003fcea400 softbio/1
0 15 1 1 200 fffffc003fcea800 softnet/1
0 > 14 7 1 201 fffffc003fceac00 idle/1
0 13 3 1 200 fffffc003fceb000 pmfsuspend pmfsuspend
0 12 3 0 200 fffffc003fceb400 pmfevent pmfevent
0 11 3 1 200 fffffc003fceb800 nfssilly nfssilly
0 10 3 1 200 fffffc003fcebc00 cachegc cachegc
0 9 3 1 200 fffffc003fe18000 vrele vrele
0 8 3 0 200 fffffc003fe18400 modunload modunload
0 7 3 0 200 fffffc003fe18800 xcall/0 xcall
0 6 1 0 200 fffffc003fe18c00 softser/0
0 5 1 0 200 fffffc003fe19000 softclk/0
0 4 1 0 200 fffffc003fe19400 softbio/0
0 3 1 0 200 fffffc003fe19800 softnet/0
0 2 1 0 201 fffffc003fe19c00 idle/0
0 1 3 1 200 fffffc0000841580 swapper uvm
db{0}> show reg
v0 0xfffffc000083b980 cpu_info_primary
t0 0x1
t1 0x1
t2 0xfffffc003ff48000
t3 0xfffffc000084f201 kprintf_inited
t4 0
t5 0xfffffc00007e1fd2 sun_noname+0x286
t6 0
t7 0
s0 0xfffffc00008503c4 msgbufenabled
s1 0x104
s2 0xfffffc000084e720 db_onpanic
s3 0xfffffc00007e23e5 __func__.8678+0x2cc
s4 0x1
s5 0x5
s6 0xfffffc003fa936c0
a0 0
a1 0xfffffc0000677ca4 kprintf_unlock+0x54
a2 0
a3 0x8
a4 0x3
a5 0xfffffc0000000008
t8 0xfffffe0013041787
t9 0x8
t10 0x3a46e6e0
t11 0x9
ra 0xfffffc0000679328 panic+0x278
t12 0xfffffc000047b0b0 cpu_Debugger
at 0x17
gp 0xfffffc000084e3d0 __link_set_prop_linkpools_sym__link__prop_array_pool+0x8008
sp 0xfffffe0013041870
pc 0xfffffc000047b0b4 cpu_Debugger+0x4
ps 0x6
ai 0x9
pv 0xfffffc000047b0b0 cpu_Debugger
netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> mach cpu 1
Using CPU 1
db{0}> trace
db{0}> show reg
v0 0xfffffe0000032000
t0 0
t1 0xfffffe0012fd2000
t2 0xfffffe0012fd2000
t3 0
t4 0
t5 0xfffffc003fceac60
t6 0
t7 0xfffffc00008aae18 vm_physmem
s0 0xfffffc003fe20cc0
s1 0
s2 0xfffffe0000032000
s3 0xfffffe0000032038
s4 0xfffffc000078e3d0 config+0x437e
s5 0xfffffc00007ae3d0 wildcard.5079+0xd88
s6 0xfffffc00007ae3d0 wildcard.5079+0xd88
a0 0xfffffc0000793f8c config+0x9f3a
a1 0xfffffc00007aee79 __func__.5392+0xe3
a2 0x1dd
a3 0xfffffc00007aee90 __func__.5392+0xfa
a4 0
a5 0
t8 0xfffffc00008aae18 vm_physmem
t9 0xfffffc00005835b4 calcru+0x204
t10 0
t11 0xa
ra 0xfffffc0000585390 sched_curcpu_runnable_p+0x70
t12 0xfffffc0000594950 kpreempt_enable
at 0x1601b306c
gp 0xfffffc000084e3d0 __link_set_prop_linkpools_sym__link__prop_array_pool+0x8008
sp 0xfffffc0000593ca0 kpreempt_disable+0x20
pc 0xfffffc0000594950 kpreempt_enable
ps 0
ai 0xa
pv 0xfffffc0000594950 kpreempt_enable
netbsd:kpreempt_enable: ldah gp,2c(pv)
db{0}> c
cpu1: shutting down...
dumping to dev 8,1 offset 2097597
dump i/o error
sd0(esiop0:0:0:0): polling command not done
panic: scsipi_execute_xs
panic: alpha_send_ipi: CPU 1 not running
panic: alpha_send_ipi: CPU 1 not running
[ ... ]
-jarle
State-Changed-From-To: feedback->open
State-Changed-By: snj@NetBSD.org
State-Changed-When: Fri, 05 Feb 2010 17:46:44 +0000
State-Changed-Why:
Feedback received.
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: Jarle Greipsland <jarle@uninett.no>
Cc: gnats-bugs@NetBSD.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Thu, 18 Feb 2010 15:04:28 -0700 (MST)
On Fri, 5 Feb 2010, Jarle Greipsland wrote:
>> Did the committed fix take care of the issue you're seeing?
> I don't think so. I have been running with a LOCKDEBUG-kernel,
> and yesterday, while trying to do a 'build.sh release', it
> panicked.
>
> login: Mutex error: lockdebug_wantlock: locking against myself
>
> lock address : 0xfffffc0000882f68 type : spin
> initialized : 0xfffffc00006735d8
> shared holds : 0 exclusive: 1
> shares wanted: 0 exclusive: 1
> current cpu : 0 last held: 0
> current l : 0xfffffc003fab2400 last held: 0xfffffc003fab2400
> last locked : 0xfffffc0000674690 unlocked : 0xfffffc0000672478
> owner field : 0x0000000000000400 wait/sn: 0/1
>
> panic: LOCKDEBUG
> Stopped in pid 0.35 (system) at netbsd:cpu_Debugger+0x4: ret zero,(ra
> )
> db{0}>
> db{0}> trace
> cpu_Debugger() at netbsd:cpu_Debugger+0x4
> panic() at netbsd:panic+0x278
> lockdebug_abort1() at netbsd:lockdebug_abort1+0x150
> mutex_enter() at netbsd:mutex_enter+0x38c
> pool_get() at netbsd:pool_get+0x58
> pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x2b4
> pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x1e0
> pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x178
> alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
> interrupt() at netbsd:interrupt+0x88
> XentInt() at netbsd:XentInt+0x1c
> --- interrupt (from ipl 4) ---
> lockdebug_mem_check() at netbsd:lockdebug_mem_check+0x1b4
> pool_put() at netbsd:pool_put+0x7c
> pool_cache_invalidate_groups() at netbsd:pool_cache_invalidate_groups+0xf4
> pool_cache_invalidate() at netbsd:pool_cache_invalidate+0x8c
> pool_reclaim() at netbsd:pool_reclaim+0x68
> pool_drain_end() at netbsd:pool_drain_end+0x60
> uvm_pageout() at netbsd:uvm_pageout+0x880
> exception_return() at netbsd:exception_return
This looks like the tlb shootdown IPI interrupted a pool_put()
operation, which likely holds a mutex related to some pool. The
tlb shootdown is attempting to return an entry to the shootdown
job pool, and tries to acquire the same lock.
I hadn't seen this type of deadlock (which is what it would have
been without LOCKDEBUG) before, but I recently started seeing several
deadlocks just grepping a number of large files. Trying to build
a current LOCKDEBUG kernel resulted in a hard hang when booting the
kernel, requiring a power cycle of my CS20, and I had trouble recovering
after that. I switched to a 4-CPU ES40 to debug the LOCKDEBUG problem,
eventually found what it was, and with a workaround was finally able
to get a working LOCKDEBUG kernel. My grepping was able to cause
deadlocks fairly easily, and I had been trying to keep the CS20 from
hanging in the deadlock by enabling the SPINLOCK_SPINOUT part of
LOCKDEBUG. Some of the deadlocks looked very similar to the above.
Try the following change, which will use IPL_SCHED for the pool cache
used by the tlb shootdown code. I've done this on my CS20, and have been
able to run numerous greps that would previously deadlock. In the above
traceback, I can't quite see how that would work, unless the
pool_drain_end() is going to be locking the tlb shootdown job pool.
Index: sys/arch/alpha/alpha/pmap.c
===================================================================
RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
retrieving revision 1.252
diff -u -p -r1.252 pmap.c
--- sys/arch/alpha/alpha/pmap.c 26 Nov 2009 00:19:11 -0000 1.252
+++ sys/arch/alpha/alpha/pmap.c 18 Feb 2010 21:52:16 -0000
@@ -937,7 +937,7 @@ pmap_bootstrap(paddr_t ptaddr, u_int max
*/
pool_cache_bootstrap(&pmap_tlb_shootdown_job_cache,
sizeof(struct pmap_tlb_shootdown_job), CACHE_LINE_SIZE,
- 0, PR_LARGECACHE, "pmaptlb", NULL, IPL_VM, NULL, NULL, NULL);
+ 0, PR_LARGECACHE, "pmaptlb", NULL, IPL_SCHED, NULL, NULL, NULL);
for (i = 0; i < ALPHA_MAXPROCS; i++) {
TAILQ_INIT(&pmap_tlb_shootdown_q[i].pq_head);
mutex_init(&pmap_tlb_shootdown_q[i].pq_lock, MUTEX_DEFAULT,
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: jarle@uninett.no
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Fri, 19 Feb 2010 11:10:17 -0700 (MST)
On Thu, 18 Feb 2010, Michael L. Hitch wrote:
> > db{0}> trace
> > cpu_Debugger() at netbsd:cpu_Debugger+0x4
> > panic() at netbsd:panic+0x278
> > lockdebug_abort1() at netbsd:lockdebug_abort1+0x150
> > mutex_enter() at netbsd:mutex_enter+0x38c
> > pool_get() at netbsd:pool_get+0x58
> > pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x2b4
> > pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x1e0
> > pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x178
> > alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
> > interrupt() at netbsd:interrupt+0x88
> > XentInt() at netbsd:XentInt+0x1c
> > --- interrupt (from ipl 4) ---
> > lockdebug_mem_check() at netbsd:lockdebug_mem_check+0x1b4
> > pool_put() at netbsd:pool_put+0x7c
> > pool_cache_invalidate_groups() at netbsd:pool_cache_invalidate_groups+0xf4
> > pool_cache_invalidate() at netbsd:pool_cache_invalidate+0x8c
> > pool_reclaim() at netbsd:pool_reclaim+0x68
> > pool_drain_end() at netbsd:pool_drain_end+0x60
> > uvm_pageout() at netbsd:uvm_pageout+0x880
> > exception_return() at netbsd:exception_return
...
> Try the following change, which will use IPL_SCHED for the pool cache
> used by the tlb shootdown code. I've done this on my CS20, and have been
> able to run numerous greps that would previously deadlock. In the above
> traceback, I can't quite see how that would work, unless the
> pool_drain_end() is going to be locking the tlb shootdown job pool.
The patch I suggested is not going to help. I just ran one more grep
today and got a panic just like the above one.
The problem is that the lock being held is for the pcg_large_pool, which
uses IPL_VM and can therefore be interrupted by the tlb shootdown IPI.
In this case, the lock has been taken by the cpu doing the pool cache
invalidate, which then gets interrupted by the IPI on the same cpu. The
shootdown routine, trying to put a pool cache entry back into the cache,
needs to do a pool_get(), and that also needs to lock
pcg_large_pool.
I don't really know all that much about the pool cache or pools, and
don't see any easy solution to this.
A quick and dirty workaround would be to change the maximum number of
'jobs' the tlb shootdown uses to 0, which effectively forces it to always
invalidate the entire TLB and never use any pool cache entries.
An alternative that I looked at quite some time ago (2 years maybe?) was
to replace the pool cache usage with statically allocated entries in the
pmap_tlb_shootdown_q[].
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
Responsible-Changed-From-To: port-alpha-maintainer->thorpej
Responsible-Changed-By: thorpej@NetBSD.org
Responsible-Changed-When: Tue, 13 Oct 2020 15:58:55 +0000
Responsible-Changed-Why:
Take.
State-Changed-From-To: open->closed
State-Changed-By: thorpej@NetBSD.org
State-Changed-When: Tue, 13 Oct 2020 15:58:55 +0000
State-Changed-Why:
Ancient PR, many changes in this area in the intervening time. I have
done an awful lot of MP stress testing on Alpha recently and fixed many
bugs in this area.
>Unformatted: