NetBSD Problem Report #38335

From jarle@festningen.uninett.no  Sat Mar 29 21:57:01 2008
Return-Path: <jarle@festningen.uninett.no>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id 789EB63B8BC
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 29 Mar 2008 21:57:01 +0000 (UTC)
Message-Id: <20080329215658.3AE8180B63@festningen.uninett.no>
Date: Sat, 29 Mar 2008 22:56:58 +0100 (CET)
From: jarle@uninett.no
Reply-To: jarle@uninett.no
To: gnats-bugs@gnats.NetBSD.org
Subject: kernel freeze on alpha MP system
X-Send-Pr-Version: 3.95

>Number:         38335
>Category:       port-alpha
>Synopsis:       kernel freeze on alpha MP system
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    thorpej
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Mar 29 22:00:01 +0000 2008
>Closed-Date:    Tue Oct 13 15:58:55 +0000 2020
>Last-Modified:  Tue Oct 13 15:58:55 +0000 2020
>Originator:     Jarle Greipsland
>Release:        NetBSD -current
>Organization:

>Environment:


System: NetBSD sweetheart.urc.uninett.no 4.99.58 NetBSD 4.99.58 (GENERIC-$Revision: 1.314 $) #1: Sat Mar 29 20:46:15 CET 2008  jarle@sweetheart.urc.uninett.no:/usr/obj/sys/arch/alpha/compile/CS20.MP alpha
Architecture: alpha
Machine: alpha
>Description:
A -current alpha kernel on a CS20 dual processor system got stuck in the
kernel during a NetBSD build job.  After more than three minutes without
any response from the system, I managed to break into ddb, and gather some
data:

Stopped in pid 27607.1 (sh) at  netbsd:cpu_Debugger+0x4:        ret     zero,(ra
)
db{0}> trace
cpu_Debugger() at netbsd:cpu_Debugger+0x4
comintr() at netbsd:comintr+0x718
alpha_shared_intr_dispatch() at netbsd:alpha_shared_intr_dispatch+0x5c
sio_iointr() at netbsd:sio_iointr+0x38
interrupt() at netbsd:interrupt+0x1c0
XentInt() at netbsd:XentInt+0x1c
--- interrupt (from ipl 0) ---
rw_vector_exit() at netbsd:rw_vector_exit+0x90
vm_map_unlock_read() at netbsd:vm_map_unlock_read+0x1c
uvm_fault_internal() at netbsd:uvm_fault_internal+0x16c8
trap() at netbsd:trap+0x6c8
XentMM() at netbsd:XentMM+0x20
--- memory management fault ---
memcpy() at netbsd:memcpy+0x54
copyout() at netbsd:copyout+0x50
copyout_vmspace() at netbsd:copyout_vmspace+0x7c
uiomove() at netbsd:uiomove+0xec
ubc_uiomove() at netbsd:ubc_uiomove+0xc8
ffs_read() at netbsd:ffs_read+0x3a8
VOP_READ() at netbsd:VOP_READ+0xcc
vn_rdwr() at netbsd:vn_rdwr+0xd0
vmcmd_readvn() at netbsd:vmcmd_readvn+0x64
execve1() at netbsd:execve1+0x790
syscall_plain() at netbsd:syscall_plain+0x1c4
XentSys() at netbsd:XentSys+0x60
--- syscall (59) ---
--- user mode ---
db{0}> show reg
v0          0xf9
t0          0xfffffc0000686a74  db_fromconsole
t1          0x1
t2          0xff00
t3          0xfffffc00006c0088  cn_magic
t4          0x2
t5          0xa0b
t6          0
t7          0xfffffc00006e11e8  vm_physmem
s0          0xf9
s1          0xc6
s2          0xfffffe00004cd000
s3          0xfffffe00004cd070
s4          0xfffffe00004d5008
s5          0x7fd
s6          0xfffffc00006a6b60  idstate+0x1ec20
a0          0xfffffc00006ea690  tsp_configuration
a1          0xfffffd01fc0003fd
a2          0x5
a3          0xfffffe0013d0b298
a4          0x22
a5          0
t8          0xfffffc00006e11e8  vm_physmem
t9          0xfffffc00004e38ac  bt_hashhead+0x5c
t10         0x600f51dc
t11         0x1a5
ra          0xfffffc00003b1e78  comintr+0x718
t12         0xfffffc00005b7340  cpu_Debugger
at          0xfffffe0013d08000
gp          0xfffffc0000686b60  nfsrtton
sp          0xfffffe0013d0b1c8
pc          0xfffffc00005b7344  cpu_Debugger+0x4
ps          0x6
ai          0x1a5
pv          0xfffffc00005b7340  cpu_Debugger
netbsd:cpu_Debugger+0x4:        ret     zero,(ra)
db{0}> ps
 PID           PPID     PGRP        UID S   FLAGS LWPS          COMMAND    WAIT
>27607        11306    20939          0 2       0    1               sh
 11306         8909    20939          0 2       0    1               sh    wait
 8909          4701    20939          0 2  0x4000    1               sh  piperd
 4701         20692    20939          0 2  0x4000    1               sh    wait
 20692        20939    20939          0 2  0x4000    1          nbgmake    wait
 20939        11274    20939          0 2  0x4000    1               sh    wait
 11274        18129    18129          0 2  0x4000    1           nbmake  select
 18129        22431    18129          0 2  0x4000    1               sh    wait
 22431         5699     5699          0 2  0x4000    1           nbmake  select
 5699          3213     5699          0 2  0x4000    1               sh    wait
 3213          1188     1178          0 2  0x4000    1           nbmake  select
 1084           463     1084       1000 2  0x4000    1              top  select
 1188          1178     1178          0 2  0x4000    1               sh    wait
 1182           389     1182          0 2  0x4000    1             tail  kqueue
 1178           389     1178          0 2       0    1             tcsh   pause
 463            435      463       1000 2  0x4000    1             bash    wait
 435            443      443       1000 2  0x4000    1            xterm  select
 443            437      443       1000 2  0x4000    1              csh   pause
 437             66       66       1000 2   0x100    1             sshd  select
 66             248       66          0 2  0x4100    1             sshd   netio
 389            381      389          0 2  0x4000    1             tcsh   pause
 381            383      381       1000 2  0x4000    1             bash    wait
 383            294      294       1000 2  0x4000    1            xterm  select
 294            301      294       1000 2  0x4000    1              csh   pause
 301            327      327       1000 2   0x100    1             sshd  select
 327            248      327          0 2  0x4100    1             sshd   netio
 377            342      342         12 2  0x4100    1             qmgr  kqueue
 370            342      342         12 2  0x4100    1           pickup  kqueue
 369              1      369          0 2  0x4000    1            getty     tty
 372              1      372          0 2       0    1             cron nanoslp
 355              1      355          0 2       0    1            inetd  kqueue
 342              1      342          0 2  0x4100    1           master  kqueue
 248              1      248          0 2       0    1             sshd  select
 236              1      236          0 2       0    1             ntpd   pause
 156              1      156          0 2       0    1        mount_mfs  mfsidl
 73               1       73          0 2       0    1          syslogd  kqueue
 1                0        1          0 2  0x4001    1             init    wait
 0               -1        0          0 2 0x20002   25           system       *
db{0}> mach cpu 1 
Using CPU 1
db{0}> trace

CPU 0: fatal kernel trap:

CPU 0    trap entry = 0x4 (unaligned access fault)
CPU 0    a0         = 0xfffffc00006e110d
CPU 0    a1         = 0x29
CPU 0    a2         = 0xa
CPU 0    pc         = 0xfffffc00005b7b78
CPU 0    ra         = 0xfffffc0000399044
CPU 0    pv         = 0xfffffc00005b7ac0
CPU 0    curlwp     = 0xfffffc00391713a0
CPU 0        pid = 27607, comm = sh

Caught exception in ddb.
db{0}> show reg
v0          0
t0          0
t1          0
t2          0
t3          0xfffffe0013243880
t4          0
t5          0xfffffc00006ecdc8  kernel_pmap_store
t6          0xfffffc00006ece28  kernel_pmap_store+0x60
t7          0x1
s0          0x1
s1          0xfffffc003f815980
s2          0xfffffc00006e1135  uvm+0xd
s3          0xfffffc00006e1128  uvm
s4          0xfffffe00004a3c00
s5          0x1
s6          0xfffffe001324fdc0
a0          0
a1          0x1
a2          0xfffffe0013bae000
a3          0x1
a4          0
a5          0xfffffe0013bb1dcc
t8          0x80
t9          0x67
t10         0xffffffff
t11         0xfffffc003b1e9328
ra          0xfffffc000049e95c  idle_loop+0x12c
t12         0xfffffc00004cb360  sched_curcpu_runnable_p
at          0x3
gp          0xfffffc0000686b60  nfsrtton
sp          0xfffffc00006e1135  uvm+0xd
pc          0xfffffc000049e934  idle_loop+0x104
ps          0
ai          0xfffffc003b1e9328
pv          0xfffffc00004cb360  sched_curcpu_runnable_p
netbsd:idle_loop+0x104: ldq_u   t0,d(s3)


>How-To-Repeat:
Run an MP kernel on an alpha system?

>Fix:


>Release-Note:

>Audit-Trail:

State-Changed-From-To: open->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sat, 19 Sep 2009 21:51:28 +0000
State-Changed-Why:
A bunch of MP alpha fixes just went into -current; might want to give them a
try.


From: Jarle Greipsland <jarle@uninett.no>
To: gnats-bugs@NetBSD.org, dholland@NetBSD.org
Cc: port-alpha-maintainer@netbsd.org, netbsd-bugs@netbsd.org,
	gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Tue, 22 Sep 2009 19:00:07 +0200 (CEST)

 dholland@NetBSD.org writes:
 > Synopsis: kernel freeze on alpha MP system
 > 
 > State-Changed-From-To: open->feedback
 > State-Changed-By: dholland@NetBSD.org
 > State-Changed-When: Sat, 19 Sep 2009 21:51:28 +0000
 > State-Changed-Why:
 > A bunch of MP alpha fixes just went into -current; might want to give them a
 > try.
 The problem is still there.  I installed and booted GENERIC.MP,
 and built a full release.  During extraction of the newly built
 sets, the system wedged again.  This time I could not break into
 DDB from the console.

 I power-cycled the system, and started to extract the sets again.
 During extraction I noticed that the IO performance through the
 file system seemed abyssmal (according to 'systat vm').  The 'tar
 zxBpf' command consumed almost no CPU, but the disk IO hovered
 around 500KB/s.  Doing a plain dd from/to a single file on the
 same disk gets around 20-30MB/s.

 Also, right after root file system detection, the kernel complains:

 WARNING: negative runtime; monotonic clock has gone backwards

 I don't know if this might be related to the wedging or not.

 If you want me to dig out more information, please ask, and I'll
 see what I can do.
 					-jarle

From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: Jarle Greipsland <jarle@uninett.no>
Cc: gnats-bugs@NetBSD.org, dholland@NetBSD.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Sat, 26 Sep 2009 11:04:37 -0600 (MDT)

 On Tue, 22 Sep 2009, Jarle Greipsland wrote:

 > The problem is still there.  I installed and booted GENERIC.MP,
 > and built a full release.  During extraction of the newly built
 > sets, the system wedged again.  This time I could not break into
 > DDB from the console.

    This was a CS20, according to the original PR?  If so, you can get
 into DDB using the halt switch [it's hidden in a small hole in the front
 panel to the right].  After the machine is halted, you can enter
 'continue', which will enter into a console trap.  The halt should
 show the PC of cpu 0 at the time of the halt, and I think that address
 should be related to the backtrace of cpu 0.  One problem will be that
 it's very likely cpu 1 is not halted, and is likely spinning with
 interrupts block so the IPI send by cpu 0 to halt will not occur and
 there won't be any register available [I'm not ever sure that works
 now even when the IPI can be delivered - I need to look into that].

 > Also, right after root file system detection, the kernel complains:
 >
 > WARNING: negative runtime; monotonic clock has gone backwards
 >
 > I don't know if this might be related to the wedging or not.

    Not very likely, I think.  I see this every MP boot, and it is related
 to using the the PCC timecounter.  I haven't figured out exactly how that
 is supposed to work on an MP system yet.

 > If you want me to dig out more information, please ask, and I'll
 > see what I can do.

    The problem is likely some kind of locking deadlock, which I've gotten
 a few times.  I've been trying to debug a problem with the TLB shootdown
 code where it gets a corrupted pool_cache entry and ends up with a
 list that links to itself.  You can try this patch I'm using to attempt
 to detect this and work around it:

 @@ -3700,6 +3700,12 @@ pmap_tlb_shootdown(pmap_t pmap, vaddr_t
                   * don't really have to do anything else.
                   */
                  mutex_spin_enter(&pq->pq_lock);
 +/**/           if (pj && pj == pq->pq_head.tqh_first) {
 +/**/                   printf("Whoa!  pool_cache_get returned an in-use 
 entry! ci_index %d pj %p\n",
 +                           self->ci_index,  pj);
 +/**/                   /*panic("Oops");*/
 +/**/                   pj = NULL;      /* XXX */
 +/**/           }
                  pq->pq_pte |= pte;
                  if (pq->pq_tbia) {
                          mutex_spin_exit(&pq->pq_lock);

    An alternative workaround is to set PMAP_TLB_SHOOTDOWN_MAXJOBS to 0, 
 which will invalidate all tlbs instead of trying to invalidate
 single entries.

 --
 Michael L. Hitch			mhitch@montana.edu
 Computer Consultant
 Information Technology Center
 Montana State University	Bozeman, MT	USA

From: Jarle Greipsland <jarle@uninett.no>
To: mhitch@lightning.msu.montana.edu
Cc: gnats-bugs@NetBSD.org, dholland@NetBSD.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Thu, 01 Oct 2009 13:34:50 +0200 (CEST)

 "Michael L. Hitch" <mhitch@lightning.msu.montana.edu> writes:
 >    This was a CS20, according to the original PR?  If so, you can get
 > into DDB using the halt switch [it's hidden in a small hole in the front
 > panel to the right].  After the machine is halted, you can enter
 > 'continue', which will enter into a console trap.  The halt should
 > show the PC of cpu 0 at the time of the halt, and I think that address
 > should be related to the backtrace of cpu 0.

 Annotated console log:
 [ I pressed the halt switch here ]
 halted CPU 0
 CPU 1 is not halted

 halt code = 1
 operator initiated halt
 PC = fffffc0000736b08
 P00>>>continue^U
 P00>>>help
 [ ... some help text here ... ]
 P00>>>halt 1
 P00>>>continue 

 continuing CPU 0
 CP - RESTORE_TERM routine to be called
 panic: user requested console halt
 cpu0: Begin traceback...
 alpha trace requires known PC =eject=
 cpu0: End traceback...
 WARNING: Unable to halt secondary CPUs (0x3)

 dumping to dev 8,1 offset 2097597
 dump i/o error


 sd0(esiop0:0:0:0): polling command not done
 panic: scsipi_execute_xs
 cpu0: Begin traceback...
 alpha trace requires known PC =eject=
 cpu0: End traceback...
 WARNING: Unable to halt secondary CPUs (0x3)

 dumping to dev 8,1 offset 2097597
 dump device not ready


 rebooting...

 CP - RESTORE_TERM exited with hlt_req = 3, r0 = 00000001.00000000

 halted CPU 0

 halt code = 0
 PC = fffffc0000736b08
 P00>>>cont
 Slot context is not valid
 P00>>>boot^U
 P00>>>init
 Initializing...
 .
 !1024 Meg of system memory
 initializing GCT/FRU at 1d4000
 Testing the System
 Testing the Memory
 Testing the Disks (read only)
 Testing ei* devices.

 API CS20D 833 MHz Console A5.8-81, 08-MAY-2001 00:09:13

 [ .. end of console log .. ]
 This seems less helpful than I had imagined; oh well.  I hope you
 can make some kind of sense out of it.  According to gdb, the PC
 is somewhere in pmap_do_tlb_shootdown (offset +296 in the listing
 below):

 (gdb) x/60i pmap_do_tlb_shootdown
 0xfffffc00007369e0 <pmap_do_tlb_shootdown>:     ldah    gp,80(t12)
 0xfffffc00007369e4 <pmap_do_tlb_shootdown+4>:   lda     gp,-31544(gp)
 0xfffffc00007369e8 <pmap_do_tlb_shootdown+8>:   lda     sp,-64(sp)
 0xfffffc00007369ec <pmap_do_tlb_shootdown+12>:  stq     ra,0(sp)
 0xfffffc00007369f0 <pmap_do_tlb_shootdown+16>:  stq     s0,8(sp)
 0xfffffc00007369f4 <pmap_do_tlb_shootdown+20>:  stq     s1,16(sp)
 0xfffffc00007369f8 <pmap_do_tlb_shootdown+24>:  stq     s2,24(sp)
 0xfffffc00007369fc <pmap_do_tlb_shootdown+28>:  stq     s3,32(sp)
 0xfffffc0000736a00 <pmap_do_tlb_shootdown+32>:  lda     t0,48(sp)
 0xfffffc0000736a04 <pmap_do_tlb_shootdown+36>:  ldq     s1,456(a0)
 0xfffffc0000736a08 <pmap_do_tlb_shootdown+40>:  stq     t0,56(sp)
 0xfffffc0000736a0c <pmap_do_tlb_shootdown+44>:  ldah    t0,4(gp)
 0xfffffc0000736a10 <pmap_do_tlb_shootdown+48>:  lda     t0,-25256(t0)
 0xfffffc0000736a14 <pmap_do_tlb_shootdown+52>:  ldq     t12,14240(gp)
 0xfffffc0000736a18 <pmap_do_tlb_shootdown+56>:  sll     s1,0x6,t2
 0xfffffc0000736a1c <pmap_do_tlb_shootdown+60>:  stq     zero,48(sp)
 0xfffffc0000736a20 <pmap_do_tlb_shootdown+64>:  lda     t1,1
 0xfffffc0000736a24 <pmap_do_tlb_shootdown+68>:  addq    t2,t0,s0
 0xfffffc0000736a28 <pmap_do_tlb_shootdown+72>:  lda     s3,16(s0)
 0xfffffc0000736a2c <pmap_do_tlb_shootdown+76>:  sll     t1,s1,s2
 0xfffffc0000736a30 <pmap_do_tlb_shootdown+80>:  mov     s3,a0
 0xfffffc0000736a34 <pmap_do_tlb_shootdown+84>:  jsr     ra,(t12),0xfffffc0000735b80 <pmap_do_reactivate+96>
 0xfffffc0000736a38 <pmap_do_tlb_shootdown+88>:  ldah    gp,80(ra)
 0xfffffc0000736a3c <pmap_do_tlb_shootdown+92>:  lda     gp,-31632(gp)
 0xfffffc0000736a40 <pmap_do_tlb_shootdown+96>:  ldq     t1,0(s0)
 0xfffffc0000736a44 <pmap_do_tlb_shootdown+100>: beq     t1,0xfffffc0000736a6c <pmap_do_tlb_shootdown+140>
 0xfffffc0000736a48 <pmap_do_tlb_shootdown+104>: ldq     t0,56(sp)
 0xfffffc0000736a4c <pmap_do_tlb_shootdown+108>: stq     t1,0(t0)
 0xfffffc0000736a50 <pmap_do_tlb_shootdown+112>: ldq     t0,56(sp)
 0xfffffc0000736a54 <pmap_do_tlb_shootdown+116>: ldq     t1,0(s0)
 0xfffffc0000736a58 <pmap_do_tlb_shootdown+120>: stq     t0,8(t1)
 0xfffffc0000736a5c <pmap_do_tlb_shootdown+124>: ldq     t2,8(s0)
 0xfffffc0000736a60 <pmap_do_tlb_shootdown+128>: stq     zero,0(s0)
 0xfffffc0000736a64 <pmap_do_tlb_shootdown+132>: stq     s0,8(s0)
 0xfffffc0000736a68 <pmap_do_tlb_shootdown+136>: stq     t2,56(sp)
 0xfffffc0000736a6c <pmap_do_tlb_shootdown+140>: ldl     t0,40(s0)
 0xfffffc0000736a70 <pmap_do_tlb_shootdown+144>: beq     t0,0xfffffc0000736a9c <pmap_do_tlb_shootdown+188>
 0xfffffc0000736a74 <pmap_do_tlb_shootdown+148>: ldl     t0,32(s0)
 0xfffffc0000736a78 <pmap_do_tlb_shootdown+152>: and     t0,0x10,t0
 0xfffffc0000736a7c <pmap_do_tlb_shootdown+156>: sextl   t0,t0
 0xfffffc0000736a80 <pmap_do_tlb_shootdown+160>: beq     t0,0xfffffc0000736b84 <pmap_do_tlb_shootdown+420>
 0xfffffc0000736a84 <pmap_do_tlb_shootdown+164>: lda     a0,-2
 0xfffffc0000736a88 <pmap_do_tlb_shootdown+168>: clr     a1
 0xfffffc0000736a8c <pmap_do_tlb_shootdown+172>: call_pal        0x33
 0xfffffc0000736a90 <pmap_do_tlb_shootdown+176>: stl     zero,40(s0)
 0xfffffc0000736a94 <pmap_do_tlb_shootdown+180>: stl     zero,32(s0)
 0xfffffc0000736a98 <pmap_do_tlb_shootdown+184>: br      0xfffffc0000736b14 <pmap_do_tlb_shootdown+308>
 0xfffffc0000736a9c <pmap_do_tlb_shootdown+188>: ldq     t3,48(sp)
 0xfffffc0000736aa0 <pmap_do_tlb_shootdown+192>: beq     t3,0xfffffc0000736b10 <pmap_do_tlb_shootdown+304>
 0xfffffc0000736aa4 <pmap_do_tlb_shootdown+196>: sll     s1,0x4,t2
 0xfffffc0000736aa8 <pmap_do_tlb_shootdown+200>: ldah    t0,4(gp)
 0xfffffc0000736aac <pmap_do_tlb_shootdown+204>: lda     t0,-30440(t0)
 0xfffffc0000736ab0 <pmap_do_tlb_shootdown+208>: addq    t2,t0,t5
 0xfffffc0000736ab4 <pmap_do_tlb_shootdown+212>: br      0xfffffc0000736af0 <pmap_do_tlb_shootdown+272>
 0xfffffc0000736ab8 <pmap_do_tlb_shootdown+216>: nop     
 0xfffffc0000736abc <pmap_do_tlb_shootdown+220>: unop    
 0xfffffc0000736ac0 <pmap_do_tlb_shootdown+224>: ldq     t1,24(t3)
 0xfffffc0000736ac4 <pmap_do_tlb_shootdown+228>: addq    t2,t1,t4
 0xfffffc0000736ac8 <pmap_do_tlb_shootdown+232>: ldq     t0,64(t1)
 0xfffffc0000736acc <pmap_do_tlb_shootdown+236>: and     s2,t0,t0
 (gdb) x/80i pmap_do_tlb_shootdown
 0xfffffc00007369e0 <pmap_do_tlb_shootdown>:     ldah    gp,80(t12)
 0xfffffc00007369e4 <pmap_do_tlb_shootdown+4>:   lda     gp,-31544(gp)
 0xfffffc00007369e8 <pmap_do_tlb_shootdown+8>:   lda     sp,-64(sp)
 0xfffffc00007369ec <pmap_do_tlb_shootdown+12>:  stq     ra,0(sp)
 0xfffffc00007369f0 <pmap_do_tlb_shootdown+16>:  stq     s0,8(sp)
 0xfffffc00007369f4 <pmap_do_tlb_shootdown+20>:  stq     s1,16(sp)
 0xfffffc00007369f8 <pmap_do_tlb_shootdown+24>:  stq     s2,24(sp)
 0xfffffc00007369fc <pmap_do_tlb_shootdown+28>:  stq     s3,32(sp)
 0xfffffc0000736a00 <pmap_do_tlb_shootdown+32>:  lda     t0,48(sp)
 0xfffffc0000736a04 <pmap_do_tlb_shootdown+36>:  ldq     s1,456(a0)
 0xfffffc0000736a08 <pmap_do_tlb_shootdown+40>:  stq     t0,56(sp)
 0xfffffc0000736a0c <pmap_do_tlb_shootdown+44>:  ldah    t0,4(gp)
 0xfffffc0000736a10 <pmap_do_tlb_shootdown+48>:  lda     t0,-25256(t0)
 0xfffffc0000736a14 <pmap_do_tlb_shootdown+52>:  ldq     t12,14240(gp)
 0xfffffc0000736a18 <pmap_do_tlb_shootdown+56>:  sll     s1,0x6,t2
 0xfffffc0000736a1c <pmap_do_tlb_shootdown+60>:  stq     zero,48(sp)
 0xfffffc0000736a20 <pmap_do_tlb_shootdown+64>:  lda     t1,1
 0xfffffc0000736a24 <pmap_do_tlb_shootdown+68>:  addq    t2,t0,s0
 0xfffffc0000736a28 <pmap_do_tlb_shootdown+72>:  lda     s3,16(s0)
 0xfffffc0000736a2c <pmap_do_tlb_shootdown+76>:  sll     t1,s1,s2
 0xfffffc0000736a30 <pmap_do_tlb_shootdown+80>:  mov     s3,a0
 0xfffffc0000736a34 <pmap_do_tlb_shootdown+84>:  jsr     ra,(t12),0xfffffc0000735b80 <pmap_do_reactivate+96>
 0xfffffc0000736a38 <pmap_do_tlb_shootdown+88>:  ldah    gp,80(ra)
 0xfffffc0000736a3c <pmap_do_tlb_shootdown+92>:  lda     gp,-31632(gp)
 0xfffffc0000736a40 <pmap_do_tlb_shootdown+96>:  ldq     t1,0(s0)
 0xfffffc0000736a44 <pmap_do_tlb_shootdown+100>: beq     t1,0xfffffc0000736a6c <pmap_do_tlb_shootdown+140>
 0xfffffc0000736a48 <pmap_do_tlb_shootdown+104>: ldq     t0,56(sp)
 0xfffffc0000736a4c <pmap_do_tlb_shootdown+108>: stq     t1,0(t0)
 0xfffffc0000736a50 <pmap_do_tlb_shootdown+112>: ldq     t0,56(sp)
 0xfffffc0000736a54 <pmap_do_tlb_shootdown+116>: ldq     t1,0(s0)
 0xfffffc0000736a58 <pmap_do_tlb_shootdown+120>: stq     t0,8(t1)
 0xfffffc0000736a5c <pmap_do_tlb_shootdown+124>: ldq     t2,8(s0)
 0xfffffc0000736a60 <pmap_do_tlb_shootdown+128>: stq     zero,0(s0)
 0xfffffc0000736a64 <pmap_do_tlb_shootdown+132>: stq     s0,8(s0)
 0xfffffc0000736a68 <pmap_do_tlb_shootdown+136>: stq     t2,56(sp)
 0xfffffc0000736a6c <pmap_do_tlb_shootdown+140>: ldl     t0,40(s0)
 0xfffffc0000736a70 <pmap_do_tlb_shootdown+144>: beq     t0,0xfffffc0000736a9c <pmap_do_tlb_shootdown+188>
 0xfffffc0000736a74 <pmap_do_tlb_shootdown+148>: ldl     t0,32(s0)
 0xfffffc0000736a78 <pmap_do_tlb_shootdown+152>: and     t0,0x10,t0
 0xfffffc0000736a7c <pmap_do_tlb_shootdown+156>: sextl   t0,t0
 0xfffffc0000736a80 <pmap_do_tlb_shootdown+160>: beq     t0,0xfffffc0000736b84 <pmap_do_tlb_shootdown+420>
 0xfffffc0000736a84 <pmap_do_tlb_shootdown+164>: lda     a0,-2
 0xfffffc0000736a88 <pmap_do_tlb_shootdown+168>: clr     a1
 0xfffffc0000736a8c <pmap_do_tlb_shootdown+172>: call_pal        0x33
 0xfffffc0000736a90 <pmap_do_tlb_shootdown+176>: stl     zero,40(s0)
 0xfffffc0000736a94 <pmap_do_tlb_shootdown+180>: stl     zero,32(s0)
 0xfffffc0000736a98 <pmap_do_tlb_shootdown+184>: br      0xfffffc0000736b14 <pmap_do_tlb_shootdown+308>
 0xfffffc0000736a9c <pmap_do_tlb_shootdown+188>: ldq     t3,48(sp)
 0xfffffc0000736aa0 <pmap_do_tlb_shootdown+192>: beq     t3,0xfffffc0000736b10 <pmap_do_tlb_shootdown+304>
 0xfffffc0000736aa4 <pmap_do_tlb_shootdown+196>: sll     s1,0x4,t2
 0xfffffc0000736aa8 <pmap_do_tlb_shootdown+200>: ldah    t0,4(gp)
 0xfffffc0000736aac <pmap_do_tlb_shootdown+204>: lda     t0,-30440(t0)
 0xfffffc0000736ab0 <pmap_do_tlb_shootdown+208>: addq    t2,t0,t5
 0xfffffc0000736ab4 <pmap_do_tlb_shootdown+212>: br      0xfffffc0000736af0 <pmap_do_tlb_shootdown+272>
 0xfffffc0000736ab8 <pmap_do_tlb_shootdown+216>: nop     
 0xfffffc0000736abc <pmap_do_tlb_shootdown+220>: unop    
 0xfffffc0000736ac0 <pmap_do_tlb_shootdown+224>: ldq     t1,24(t3)
 0xfffffc0000736ac4 <pmap_do_tlb_shootdown+228>: addq    t2,t1,t4
 0xfffffc0000736ac8 <pmap_do_tlb_shootdown+232>: ldq     t0,64(t1)
 0xfffffc0000736acc <pmap_do_tlb_shootdown+236>: and     s2,t0,t0
 0xfffffc0000736ad0 <pmap_do_tlb_shootdown+240>: bne     t0,0xfffffc0000736b00 <pmap_do_tlb_shootdown+288>
 0xfffffc0000736ad4 <pmap_do_tlb_shootdown+244>: ldq     t0,88(t4)
 0xfffffc0000736ad8 <pmap_do_tlb_shootdown+248>: ldq     t1,8(t5)
 0xfffffc0000736adc <pmap_do_tlb_shootdown+252>: cmpeq   t0,t1,t0
 0xfffffc0000736ae0 <pmap_do_tlb_shootdown+256>: bne     t0,0xfffffc0000736b94 <pmap_do_tlb_shootdown+436>
 0xfffffc0000736ae4 <pmap_do_tlb_shootdown+260>: ldq     t3,0(t3)
 0xfffffc0000736ae8 <pmap_do_tlb_shootdown+264>: beq     t3,0xfffffc0000736b10 <pmap_do_tlb_shootdown+304>
 0xfffffc0000736aec <pmap_do_tlb_shootdown+268>: unop    
 0xfffffc0000736af0 <pmap_do_tlb_shootdown+272>: lda     a0,3
 0xfffffc0000736af4 <pmap_do_tlb_shootdown+276>: ldq     t0,32(t3)
 0xfffffc0000736af8 <pmap_do_tlb_shootdown+280>: and     t0,0x10,t0
 0xfffffc0000736afc <pmap_do_tlb_shootdown+284>: beq     t0,0xfffffc0000736ac0 <pmap_do_tlb_shootdown+224>
 0xfffffc0000736b00 <pmap_do_tlb_shootdown+288>: ldq     a1,16(t3)
 0xfffffc0000736b04 <pmap_do_tlb_shootdown+292>: call_pal        0x33
 0xfffffc0000736b08 <pmap_do_tlb_shootdown+296>: ldq     t3,0(t3)
 0xfffffc0000736b0c <pmap_do_tlb_shootdown+300>: bne     t3,0xfffffc0000736af0 <pmap_do_tlb_shootdown+272>
 0xfffffc0000736b10 <pmap_do_tlb_shootdown+304>: stl     zero,32(s0)
 0xfffffc0000736b14 <pmap_do_tlb_shootdown+308>: mov     s3,a0
 0xfffffc0000736b18 <pmap_do_tlb_shootdown+312>: ldq     t12,-29928(gp)
 0xfffffc0000736b1c <pmap_do_tlb_shootdown+316>: stl     zero,36(s0)

 					-jarle

From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: gnats-admin@netbsd.org, jarle@uninett.no
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Thu, 1 Oct 2009 12:20:43 -0600 (MDT)

 On Thu, 1 Oct 2009, Jarle Greipsland wrote:

 > Annotated console log:
 > [ I pressed the halt switch here ]
 > halted CPU 0
 > CPU 1 is not halted
 >
 > halt code = 1
 > operator initiated halt
 > PC = fffffc0000736b08
 > P00>>>continue^U
 > P00>>>help
 > [ ... some help text here ... ]
 > P00>>>halt 1

    Halting the other cpu is not needed at this point, and might have caused
 the problem when continuing cpu 0.  Also, do you have ddb.onpanic set to 
 0?  If so, you probably want it 1 so that you enter DDB when it panics.

 > WARNING: Unable to halt secondary CPUs (0x3)

    This would be caused by your halting cpu 1 earlier, although I'm pretty
 sure the same thing would have happened even if you hadn't halted it.

 > [ .. end of console log .. ]
 > This seems less helpful than I had imagined; oh well.  I hope you
 > can make some kind of sense out of it.  According to gdb, the PC
 > is somewhere in pmap_do_tlb_shootdown (offset +296 in the listing
 > below):

    It does show that what I thought had happened did indeed happen.

 > 0xfffffc0000736ac0 <pmap_do_tlb_shootdown+224>: ldq     t1,24(t3)
 > 0xfffffc0000736ac4 <pmap_do_tlb_shootdown+228>: addq    t2,t1,t4
 > 0xfffffc0000736ac8 <pmap_do_tlb_shootdown+232>: ldq     t0,64(t1)
 > 0xfffffc0000736acc <pmap_do_tlb_shootdown+236>: and     s2,t0,t0
 > 0xfffffc0000736ad0 <pmap_do_tlb_shootdown+240>: bne     t0,0xfffffc0000736b00 <pmap_do_tlb_shootdown+288>
 > 0xfffffc0000736ad4 <pmap_do_tlb_shootdown+244>: ldq     t0,88(t4)
 > 0xfffffc0000736ad8 <pmap_do_tlb_shootdown+248>: ldq     t1,8(t5)
 > 0xfffffc0000736adc <pmap_do_tlb_shootdown+252>: cmpeq   t0,t1,t0
 > 0xfffffc0000736ae0 <pmap_do_tlb_shootdown+256>: bne     t0,0xfffffc0000736b94 <pmap_do_tlb_shootdown+436>
 > 0xfffffc0000736ae4 <pmap_do_tlb_shootdown+260>: ldq     t3,0(t3)
 > 0xfffffc0000736ae8 <pmap_do_tlb_shootdown+264>: beq     t3,0xfffffc0000736b10 <pmap_do_tlb_shootdown+304>
 > 0xfffffc0000736aec <pmap_do_tlb_shootdown+268>: unop
 > 0xfffffc0000736af0 <pmap_do_tlb_shootdown+272>: lda     a0,3
 > 0xfffffc0000736af4 <pmap_do_tlb_shootdown+276>: ldq     t0,32(t3)
 > 0xfffffc0000736af8 <pmap_do_tlb_shootdown+280>: and     t0,0x10,t0
 > 0xfffffc0000736afc <pmap_do_tlb_shootdown+284>: beq     t0,0xfffffc0000736ac0 <pmap_do_tlb_shootdown+224>
 > 0xfffffc0000736b00 <pmap_do_tlb_shootdown+288>: ldq     a1,16(t3)
 > 0xfffffc0000736b04 <pmap_do_tlb_shootdown+292>: call_pal        0x33
 > 0xfffffc0000736b08 <pmap_do_tlb_shootdown+296>: ldq     t3,0(t3)
 > 0xfffffc0000736b0c <pmap_do_tlb_shootdown+300>: bne     t3,0xfffffc0000736af0 <pmap_do_tlb_shootdown+272>

    This is the loop where pmap_do_tlb_shootdown is processing the job queue 
 for this cpu (cpu 0).  T3 is the current queue entry and almost certainly 
 is linked to itself - meaning it keeps invalidating the same tlb entry 
 over and over and over......  At this point, cpu 0 will have the job queue 
 locked, and I'm certain that cpu 1 wants to invalidate another tlb entry, 
 and is attempting to acquire the lock that cpu 0 holds.

    This is the problem I'm still in the process of trying to figure out 
 what the problem is and how to fix it.  The patch I posted previously is a 
 workaround to detect this particular problem, and will display a message 
 if it occurs.  The other alternative I posted (changing 
 PMAP_TLB_SHOOTDOWN_MAXJOBS to 0) make it never allocate queue entries so 
 it should never loop.

    What I think may be happening is that the IPI interrupt may be occuring 
 at level 5, which is the same level as the clock interrupt.  The 
 mutex for the tlb shootdown queue uses IPL_VM, which I think is at level 
 4.  If I understand the mutex locking mechanism, I think this would allow 
 an IPI request from another cpu interrupt the code that manipulates the 
 shootdown queue entries, and could possibly corrupt the job queue.

 --
 Michael L. Hitch			mhitch@montana.edu
 Computer Consultant
 Information Technology Center
 Montana State University	Bozeman, MT	USA

From: Jarle Greipsland <jarle@uninett.no>
To: mhitch@lightning.msu.montana.edu
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Fri, 02 Oct 2009 08:54:35 +0200 (CEST)

 "Michael L. Hitch" <mhitch@lightning.msu.montana.edu> writes:
 >    It does show that what I thought had happened did indeed happen.
 > > 0xfffffc0000736ac0 <pmap_do_tlb_shootdown+224>: ldq     t1,24(t3)
 [ ... ]
 Is there any more info I can gather for you, or can you just as
 easily reproduce this yourself?

 >    This is the problem I'm still in the process of trying to figure out 
 > what the problem is and how to fix it.  The patch I posted previously is a 
 > workaround to detect this particular problem, and will display a message 
 > if it occurs.
 It _does_ occur.  I applied the patch, and a 'build.sh -j4'
 triggered the panic after a while.
 					-jarle

From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: Jarle Greipsland <jarle@uninett.no>
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Sat, 3 Oct 2009 10:25:43 -0600 (MDT)

 On Fri, 2 Oct 2009, Jarle Greipsland wrote:

 > "Michael L. Hitch" <mhitch@lightning.msu.montana.edu> writes:
 >>    It does show that what I thought had happened did indeed happen.
 >>> 0xfffffc0000736ac0 <pmap_do_tlb_shootdown+224>: ldq     t1,24(t3)
 > [ ... ]
 > Is there any more info I can gather for you, or can you just as
 > easily reproduce this yourself?

    I'm quite certain in this particular case the job queue entry is linked
 to itself resulting in a hung cpu (which then hangs other cpus because it 
 has the shootdown queue locked for that cpu).

    I haven't been able to easily reproduce this.  I can complete full 
 builds fairly often, although a lot of the time I will get a segment fault 
 in one of the tools (typically grotty or install).

    I also got another deadlock situation yesterday:  cpu 0 had acquired the 
 lock for the shootdown queue (presumably for a different cpu - it should 
 be skipping the current cpu) and got interrupted by a shootdown IPI.  The 
 IPI routine was trying to acquire the lock for the current cpu's queue, 
 which was currently locked (can't tell what held the lock though).

    One thing I haven't tried yet is a LOCKDEBUG kernel, which should do 
 some additional checking on locking, and should be able to provide 
 information on what holds the lock.  This last time I tried LOCKDEBUG, I 
 ran into problems and quickly got lost in the locking morass.

 >>    This is the problem I'm still in the process of trying to figure out
 >> what the problem is and how to fix it.  The patch I posted previously is a
 >> workaround to detect this particular problem, and will display a message
 >> if it occurs.
 > It _does_ occur.  I applied the patch, and a 'build.sh -j4'
 > triggered the panic after a while.

    What was the panic you got?


 --
 Michael L. Hitch			mhitch@montana.edu
 Computer Consultant
 Information Technology Center
 Montana State University	Bozeman, MT	USA

From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: Jarle Greipsland <jarle@uninett.no>
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Sun, 4 Oct 2009 11:08:29 -0600 (MDT)

    I have just committed the following patch, which seems to help 
 considerably.  I've been doing sucessive builds using 2 cpus on an ES40.
 I haven't gotten any deadlocks yet, but I did just get a segment violation 
 on the grotty program on the previous build this morning.

 Index: sys/arch/alpha/alpha/pmap.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
 retrieving revision 1.242
 diff -u -p -r1.242 pmap.c
 --- sys/arch/alpha/alpha/pmap.c 10 Sep 2009 22:27:11 -0000      1.242
 +++ sys/arch/alpha/alpha/pmap.c 4 Oct 2009 16:58:00 -0000
 @@ -962,7 +962,7 @@ pmap_bootstrap(paddr_t ptaddr, u_int max
          for (i = 0; i < ALPHA_MAXPROCS; i++) {
                  TAILQ_INIT(&pmap_tlb_shootdown_q[i].pq_head);
                  mutex_init(&pmap_tlb_shootdown_q[i].pq_lock, MUTEX_DEFAULT,
 -                   IPL_VM);
 +                   IPL_SCHED);
          }
   #endif



 --
 Michael L. Hitch			mhitch@montana.edu
 Computer Consultant
 Information Technology Center
 Montana State University	Bozeman, MT	USA

From: Jarle Greipsland <jarle@uninett.no>
To: mhitch@lightning.msu.montana.edu
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Tue, 06 Oct 2009 11:17:09 +0200 (CEST)

 "Michael L. Hitch" <mhitch@lightning.msu.montana.edu> writes:
 >    One thing I haven't tried yet is a LOCKDEBUG kernel, which should do 
 > some additional checking on locking, and should be able to provide 
 > information on what holds the lock.  This last time I tried LOCKDEBUG, I 
 > ran into problems and quickly got lost in the locking morass.
 Some time ago I tried to boot a LOCKDEBUG kernel, but it paniced
 during autoconfig with an assertion:
 Mutex error: lockdebug_alloc: already initialized
 This is registered as port-alpha/38358

 > >>    This is the problem I'm still in the process of trying to figure out
 > >> what the problem is and how to fix it.  The patch I posted previously is a
 > >> workaround to detect this particular problem, and will display a message
 > >> if it occurs.
 > > It _does_ occur.  I applied the patch, and a 'build.sh -j4'
 > > triggered the panic after a while.
 > 
 >    What was the panic you got?
 The one from your patch:
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ee740
 panic: Oops
 Stopped in pid 9356.1 (as) at   netbsd:cpu_Debugger+0x4:        ret     zero,(ra)
 db{0}> trace
 cpu_Debugger() at netbsd:cpu_Debugger+0x4
 panic() at netbsd:panic+0x268
 pmap_tlb_shootdown() at netbsd:pmap_tlb_shootdown+0x274
 pmap_kenter_pa() at netbsd:pmap_kenter_pa+0xe8
 uvm_pagermapin() at netbsd:uvm_pagermapin+0x1b4
 genfs_gop_write() at netbsd:genfs_gop_write+0x48
 genfs_do_putpages() at netbsd:genfs_do_putpages+0x74c
 VOP_PUTPAGES() at netbsd:VOP_PUTPAGES+0xb0
 ffs_write() at netbsd:ffs_write+0x728
 VOP_WRITE() at netbsd:VOP_WRITE+0xb0
 vn_write() at netbsd:vn_write+0x10c
 dofilewrite() at netbsd:dofilewrite+0x90
 sys_write() at netbsd:sys_write+0xa8
 syscall_plain() at netbsd:syscall_plain+0x160
 XentSys() at netbsd:XentSys+0x60
 --- syscall (4) ---
 --- user mode ---

 					-jarle

From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: Jarle Greipsland <jarle@uninett.no>
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Tue, 6 Oct 2009 07:25:11 -0600 (MDT)

 On Tue, 6 Oct 2009, Jarle Greipsland wrote:

 >>    What was the panic you got?
 > The one from your patch:
 > Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ee740
 > panic: Oops

    The patch I posted had the panic call commented out (the panic was from 
 earlier debug testing).  The following line was the workaround that would 
 skip that entry and do a full tlb invalidate.  Things should keep running 
 (for a while longer anyway) after that.

 --
 Michael L. Hitch			mhitch@montana.edu
 Computer Consultant
 Information Technology Center
 Montana State University	Bozeman, MT	USA

From: Jarle Greipsland <jarle@uninett.no>
To: mhitch@lightning.msu.montana.edu
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Tue, 06 Oct 2009 15:32:15 +0200 (CEST)

 "Michael L. Hitch" <mhitch@lightning.msu.montana.edu> writes:
 > On Tue, 6 Oct 2009, Jarle Greipsland wrote:
 > 
 > >>    What was the panic you got?
 > > The one from your patch:
 > > Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ee740
 > > panic: Oops
 > 
 >    The patch I posted had the panic call commented out (the panic was from 
 > earlier debug testing).  The following line was the workaround that would 
 > skip that entry and do a full tlb invalidate.  Things should keep running 
 > (for a while longer anyway) after that.

 Ah.  That's what happens when one manually applies a patch :-(
 Oh well.  I have since updated my kernel, and am now running a
 -current GENERIC.MP kernel with the IPL_VM -> IPL_SCHED
 modification.  I have started a "build -j4" job, and I'll
 see how that works out.
 					-jarle
 -- 
 "There's too much blood in my caffeine system."

From: Jarle Greipsland <jarle@uninett.no>
To: mhitch@lightning.msu.montana.edu
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Tue, 06 Oct 2009 18:36:59 +0200 (CEST)

 Jarle Greipsland <jarle@uninett.no> writes:
 > Ah.  That's what happens when one manually applies a patch :-(
 > Oh well.  I have since updated my kernel, and am now running a
 > -current GENERIC.MP kernel with the IPL_VM -> IPL_SCHED
 > modification.  I have started a "build -j4" job, and I'll
 > see how that works out.
 Another data point: It crashed again; still with the same panic
 from your patch, which I forgot to comment out when I updated my
 sources.  So, even with the IPL_VM->IPL_SCHED change in, it still
 reaches the "Whoa!  pool_cache_get returned an in-use entry!"
 printf statement.  I have no idea whether that's significant or
 not.

 I'll comment out the "Oops"-panic now, and see what happens
 then.
 				-jarle

From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: Jarle Greipsland <jarle@uninett.no>
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Tue, 6 Oct 2009 11:28:59 -0600 (MDT)

 On Tue, 6 Oct 2009, Jarle Greipsland wrote:

 > Another data point: It crashed again; still with the same panic
 > from your patch, which I forgot to comment out when I updated my
 > sources.  So, even with the IPL_VM->IPL_SCHED change in, it still
 > reaches the "Whoa!  pool_cache_get returned an in-use entry!"
 > printf statement.  I have no idea whether that's significant or
 > not.
 >
 > I'll comment out the "Oops"-panic now, and see what happens
 > then.

    That's interesting - I've been running a number of -j4 builds on an
 ES40 with only 2 cpus enabled (to more closely simulate the CS20), and 
 haven't been able to hit that message, or some checks I've got in the
 pool_cache code.

    What I was trying to do was to see if just changing the mutex 
 initialization to IPL_SCHED was enough, or if it needed to be IPL_HIGH.
 I was also trying IPL_SCHED and IPL_HIGH in the pool_cache_bootstrap()
 just prior to the mutex_init.

 --
 Michael L. Hitch			mhitch@montana.edu
 Computer Consultant
 Information Technology Center
 Montana State University	Bozeman, MT	USA

From: Jarle Greipsland <jarle@uninett.no>
To: mhitch@lightning.msu.montana.edu
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Thu, 08 Oct 2009 16:16:03 +0200 (CEST)

 Jarle Greipsland <jarle@uninett.no> writes:
 > Jarle Greipsland <jarle@uninett.no> writes:
 > > Ah.  That's what happens when one manually applies a patch :-(
 > > Oh well.  I have since updated my kernel, and am now running a
 > > -current GENERIC.MP kernel with the IPL_VM -> IPL_SCHED
 > > modification.  I have started a "build -j4" job, and I'll
 > > see how that works out.
 > Another data point: It crashed again; still with the same panic
 > from your patch, which I forgot to comment out when I updated my
 > sources.  So, even with the IPL_VM->IPL_SCHED change in, it still
 > reaches the "Whoa!  pool_cache_get returned an in-use entry!"
 > printf statement.  I have no idea whether that's significant or
 > not.
 > 
 > I'll comment out the "Oops"-panic now, and see what happens
 > then.
 The system then keeps on running.  Though for every "build -j4"
 job I run I get approximately one
   "Whoa!  pool_cache_get returned an in-use entry!"
 message on the console.  I guess this means we have a
 papered-over bug now?
 					-jarle
 -- 
 Q: What's the difference between programming and bug collecting?
 A: None.

From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: Jarle Greipsland <jarle@uninett.no>
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Sun, 11 Oct 2009 12:13:57 -0600 (MDT)

 On Thu, 8 Oct 2009, Jarle Greipsland wrote:

 > The system then keeps on running.  Though for every "build -j4"
 > job I run I get approximately one
 >  "Whoa!  pool_cache_get returned an in-use entry!"
 > message on the console.  I guess this means we have a
 > papered-over bug now?

    Yes - that patch just attempts to work around a duplicate pool cache 
 entry.  I have not been able to replicate this on my ES40, and don't have 
 remote power capability on my CS20 which makes it harder to debug when 
 things can hang hard.

    I'm wondering if using IPL_HIGH for the mutex changes anything:

 Index: sys/arch/alpha/alpha/pmap.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
 retrieving revision 1.243
 diff -u -p -r1.243 pmap.c
 --- sys/arch/alpha/alpha/pmap.c 4 Oct 2009 17:00:31 -0000       1.243
 +++ sys/arch/alpha/alpha/pmap.c 11 Oct 2009 18:01:59 -0000
 @@ -962,7 +962,7 @@ pmap_bootstrap(paddr_t ptaddr, u_int max
          for (i = 0; i < ALPHA_MAXPROCS; i++) {
                  TAILQ_INIT(&pmap_tlb_shootdown_q[i].pq_head);
                  mutex_init(&pmap_tlb_shootdown_q[i].pq_lock, MUTEX_DEFAULT,
 -                   IPL_SCHED);
 +                   IPL_HIGH);
          }
   #endif


    If that doesn't change anything, then I'd try increasing the IPL level 
 of the pool_cache entry:

 Index: sys/arch/alpha/alpha/pmap.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
 retrieving revision 1.243
 diff -u -p -r1.243 pmap.c
 --- sys/arch/alpha/alpha/pmap.c 4 Oct 2009 17:00:31 -0000       1.243
 +++ sys/arch/alpha/alpha/pmap.c 11 Oct 2009 17:59:33 -0000
 @@ -958,7 +958,7 @@ pmap_bootstrap(paddr_t ptaddr, u_int max
           */
          pool_cache_bootstrap(&pmap_tlb_shootdown_job_cache,
              sizeof(struct pmap_tlb_shootdown_job), CACHE_LINE_SIZE,
 -            0, PR_LARGECACHE, "pmaptlb", NULL, IPL_VM, NULL, NULL, NULL);
 +            0, PR_LARGECACHE, "pmaptlb", NULL, IPL_SCHED, NULL, NULL, NULL);
          for (i = 0; i < ALPHA_MAXPROCS; i++) {
                  TAILQ_INIT(&pmap_tlb_shootdown_q[i].pq_head);
                  mutex_init(&pmap_tlb_shootdown_q[i].pq_lock, MUTEX_DEFAULT,

    And:

 Index: sys/arch/alpha/alpha/pmap.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
 retrieving revision 1.243
 diff -u -p -r1.243 pmap.c
 --- sys/arch/alpha/alpha/pmap.c 4 Oct 2009 17:00:31 -0000       1.243
 +++ sys/arch/alpha/alpha/pmap.c 11 Oct 2009 17:59:56 -0000
 @@ -958,7 +958,7 @@ pmap_bootstrap(paddr_t ptaddr, u_int max
           */
          pool_cache_bootstrap(&pmap_tlb_shootdown_job_cache,
              sizeof(struct pmap_tlb_shootdown_job), CACHE_LINE_SIZE,
 -            0, PR_LARGECACHE, "pmaptlb", NULL, IPL_VM, NULL, NULL, NULL);
 +            0, PR_LARGECACHE, "pmaptlb", NULL, IPL_HIGH, NULL, NULL, NULL);
          for (i = 0; i < ALPHA_MAXPROCS; i++) {
                  TAILQ_INIT(&pmap_tlb_shootdown_q[i].pq_head);
                  mutex_init(&pmap_tlb_shootdown_q[i].pq_lock, MUTEX_DEFAULT,


 --
 Michael L. Hitch			mhitch@montana.edu
 Computer Consultant
 Information Technology Center
 Montana State University	Bozeman, MT	USA

From: Jarle Greipsland <jarle@uninett.no>
To: gnats-bugs@NetBSD.org, mhitch@lightning.msu.montana.edu
Cc: port-alpha-maintainer@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Sun, 18 Oct 2009 17:20:33 +0200 (CEST)

 "Michael L. Hitch" <mhitch@lightning.msu.montana.edu> writes:
 >     Yes - that patch just attempts to work around a duplicate pool cache 
 >  entry.  I have not been able to replicate this on my ES40, and don't have 
 >  remote power capability on my CS20 which makes it harder to debug when 
 >  things can hang hard.
 Some more results:

 I ran a few more 'build.sh -j4' jobs without changing anything,
 and the kernel continued to emit spurious "Whoa!  pool_cache_get
 returned an in-use entry!"  messages.

 Then, on Oct 14, I synced my source tree to -current again (this
 might have been a stupid move on my part...), and built and
 booted a new GENERIC.MP kernel.  After about an hour into a
 'build.sh -j4' job, it paniced with a new (to me) message:

 panic: pool_get: pmaptlb: page empty

 Stopped in pid 25249.1 (cc1plus) at     netbsd:cpu_Debugger+0x4:        ret     zero,(ra)

 db{0}> tr
 cpu_Debugger() at netbsd:cpu_Debugger+0x4
 panic() at netbsd:panic+0x268
 pool_get() at netbsd:pool_get+0x408
 pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x1ac
 pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x168
 pmap_tlb_shootdown() at netbsd:pmap_tlb_shootdown+0xe8
 pmap_kenter_pa() at netbsd:pmap_kenter_pa+0xe8
 uvm_km_alloc() at netbsd:uvm_km_alloc+0x170
 kmem_backend_alloc() at netbsd:kmem_backend_alloc+0x48
 vmem_xalloc() at netbsd:vmem_xalloc+0x318
 vmem_alloc() at netbsd:vmem_alloc+0x90
 kmem_poolpage_alloc() at netbsd:kmem_poolpage_alloc+0x48
 pool_grow() at netbsd:pool_grow+0x64
 pool_get() at netbsd:pool_get+0x5c
 pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x1ac
 pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x168
 kmem_alloc() at netbsd:kmem_alloc+0xc8
 amap_extend() at netbsd:amap_extend+0x1b8
 uvm_mapent_trymerge() at netbsd:uvm_mapent_trymerge+0x3dc
 amap_copy() at netbsd:amap_copy+0x410
 uvm_fault_internal() at netbsd:uvm_fault_internal+0xbb0
 trap() at netbsd:trap+0x548
 XentMM() at netbsd:XentMM+0x20
 --- memory management fault (from ipl 0) ---
 --- user mode ---
 db{0}> show reg
 v0          0x6
 t0          0x1
 t1          0x1
 t2          0xfffffc003ff48000
 t3          0
 t4          0
 t5          0xfffffc0000b46a2d  __func__.21238+0x91c
 t6          0
 t7          0
 s0          0xfffffc0000c378e0  msgbufenabled
 s1          0x104
 s2          0xfffffc0000c350a8  db_onpanic
 s3          0xfffffc0000c6b8e8  pmap_tlb_shootdown_job_cache+0x1a8
 s4          0
 s5          0x2
 s6          0
 a0          0
 a1          0xfffffd01fc0003f8
 a2          0
 a3          0x8
 a4          0x3
 a5          0xfffffc0000000008
 t8          0
 t9          0x8
 t10         0xfffffc0000c941a8  uvmexp
 t11         0
 ra          0xfffffc000080c5a8  panic+0x268
 t12         0xfffffc00003eb590  cpu_Debugger
 at          0x1
 gp          0xfffffc0000c30928  __link_set_prop_linkpools_sym__link__prop_array_pool+0x8008
 sp          0xfffffe0013c7f5e8
 pc          0xfffffc00003eb594  cpu_Debugger+0x4
 ps          0
 ai          0
 pv          0xfffffc00003eb590  cpu_Debugger
 netbsd:cpu_Debugger+0x4:        ret     zero,(ra)
 db{0}> ps
 PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
 21308>   1 7   1     40004   fffffc0024916c00            cc1plus
 9441     1 3   0        84   fffffc0024916000   alpha--netbsd-c+ wait
 21710    1 3   0        84   fffffc0024813000                 sh wait
 26347    1 2   0     40004   fffffc002de34800            cc1plus
 18814    1 3   0        84   fffffc0024813400   alpha--netbsd-c+ wait
 24418    1 3   0        84   fffffc002daf2800                 sh wait
 25249>   1 7   0         4   fffffc002d29e400            cc1plus
 19917    1 3   0        84   fffffc0024917c00   alpha--netbsd-c+ wait
 2263     1 3   1        84   fffffc002daf3000                 sh wait
 16091    1 2   0     40004   fffffc002caa9800            cc1plus
 19836    1 3   0        84   fffffc003edac400   alpha--netbsd-c+ wait
 7537     1 3   1        84   fffffc00233e9400                 sh wait
 18485    1 3   0        84   fffffc0024d52000             nbmake select
 8826     1 3   0        84   fffffc002daf2000                 sh wait
 28933    1 3   1        84   fffffc002cab8000             nbmake select
 15439    1 3   1        84   fffffc0024812000                 sh wait
 22920    1 3   1        84   fffffc0024917800             nbmake select
 16006    1 3   1        84   fffffc0024d53400                 sh wait
 22712    1 3   1        84   fffffc0024d52800             nbmake select
 28515    1 3   1        84   fffffc002caa9c00                 sh wait
 24901    1 3   1        84   fffffc00233e8800             nbmake select
 22708    1 3   1        84   fffffc0024917000             pickup kqueue
 20418    1 3   0        84   fffffc002d29ec00               bash ttyraw
 17752    1 3   0        84   fffffc002caa9400              xterm select
 29471    1 3   1        84   fffffc003f678000                csh pause
 8486     1 3   0        84   fffffc002cab8c00               sshd select
 23567    1 3   1        84   fffffc002d29f400               sshd netio
 4106     1 3   0        84   fffffc00233e8000                 sh wait
 3271     1 3   1        84   fffffc00233e8c00             nbmake select
 3779     1 3   0        84   fffffc0024916400                 sh wait
 3702     1 3   1        84   fffffc00233e9000             nbmake select
 3935     1 3   1        84   fffffc003ca9e400                 sh wait
 3603     1 3   1        84   fffffc003ca9f000             nbmake select
 750      1 3   0        84   fffffc003ca9e800                 sh wait
 730      1 3   1        84   fffffc003ca9ec00               tail kqueue
 733      1 3   0        84   fffffc003f678400               tcsh pause
 507      1 3   0        84   fffffc003ca9f800               tcsh pause
 40       1 3   1        84   fffffc003ca9fc00               tcsh pause
 41       1 3   1        84   fffffc003edad000              xterm select
 459      1 3   0        84   fffffc003f9fe800                csh pause
 473      1 3   1        84   fffffc003edac800               sshd select
 528      1 3   0        84   fffffc003f960000               sshd netio
 501      1 3   1        84   fffffc003f679800               qmgr kqueue
 350      1 3   1        84   fffffc003f961c00              getty ttyraw
 484      1 3   1        84   fffffc003f679400               cron nanoslp
 498      1 3   1        84   fffffc003edac000              inetd kqueue
 468      1 3   1        84   fffffc003edacc00             master kqueue
 300      1 3   0        84   fffffc003edad400               sshd select
 301      1 3   0        84   fffffc003edad800               ntpd pause
 183      1 3   0        84   fffffc003edadc00          mount_mfs mfsidl
 152      1 3   1        84   fffffc003f679000            syslogd kqueue
 1        1 3   0        84   fffffc003f961000               init wait
 0       39 3   0       204   fffffc003f678800            physiod physiod
 0       38 3   1       204   fffffc003f960400        vmem_rehash vmem_rehash
 0       37 3   0       204   fffffc003f960800           aiodoned aiodoned
 0       36 3   1       204   fffffc003f9fe400            ioflush syncer
 0       35 3   1       204   fffffc003f9fe000           pgdaemon pgdaemon
 0       34 3   0       204   fffffc003f961400          cryptoret crypto_wait
 0       33 3   0       204   fffffc003f961800          atapibus0 sccomp
 0       31 3   0       204   fffffc003f960c00              unpgc unpgc
 0       22 3   1       204   fffffc003f9fec00            atabus1 atath
 0       21 3   0       204   fffffc003f9ff000            atabus0 atath
 0       20 3   1       204   fffffc003f9ff400           scsibus0 sccomp
 0       19 3   1       204   fffffc003f9ff800            xcall/1 xcall
 0       18 1   1       204   fffffc003f9ffc00          softser/1
 0       17 1   1       204   fffffc003fcf8000          softclk/1
 0       16 1   1       204   fffffc003fcf8400          softbio/1
 0       15 1   1       204   fffffc003fcf8800          softnet/1
 0       14 1   1       205   fffffc003fcf8c00             idle/1
 0       13 3   1       204   fffffc003fcf9000         pmfsuspend pmfsuspend
 0       12 3   0       204   fffffc003fcf9400           pmfevent pmfevent
 0       11 3   1       204   fffffc003fcf9800           nfssilly nfssilly
 0       10 3   0       204   fffffc003fcf9c00            cachegc cachegc
 0        9 3   1       204   fffffc003fe28000              vrele vrele
 0        8 3   1       204   fffffc003fe28400          modunload modunload
 0        7 3   0       204   fffffc003fe28800            xcall/0 xcall
 0        6 1   0       204   fffffc003fe28c00          softser/0
 0        5 1   0       204   fffffc003fe29000          softclk/0
 0        4 1   0       204   fffffc003fe29400          softbio/0
 0        3 1   0       204   fffffc003fe29800          softnet/0
 0        2 1   0       205   fffffc003fe29c00             idle/0
 0        1 3   0       204   fffffc0000c16660            swapper schedule
 db{0}> mach cpu 1
 Using CPU 1
 db{0}> tr

 CPU 0: fatal kernel trap:

 CPU 0    trap entry = 0x2 (memory management fault)
 CPU 0    a0         = 0xffffffffffffffd9
 CPU 0    a1         = 0x1
 CPU 0    a2         = 0x0
 CPU 0    pc         = 0xfffffc00003ee944
 CPU 0    ra         = 0xfffffc00003e8104
 CPU 0    pv         = 0xfffffc00003ee890
 CPU 0    curlwp     = 0xfffffc002d29e400
 CPU 0        pid = 25249, comm = cc1plus

 Caught exception in ddb.
 db{0}> show reg
 v0          0x1
 t0          0x5b93a000
 t1          0x2000
 t2          0
 t3          0x2000
 t4          0xf
 t5          0x16055e050
 t6          0x2
 t7          0
 s0          0x2
 s1          0x5b93a000
 s2          0x161214008
 s3          0x161214000
 s4          0x16055e1d0
 s5          0x16055e040
 s6          0x50
 a0          0x5b93a000
 a1          0x48
 a2          0
 a3          0
 a4          0x5
 a5          0x5
 t8          0x160814400
 t9          0x3
 t10         0x120582e58
 t11         0x120581c28
 ra          0x16063d168
 t12         0x16064db44
 at          0x2
 gp          0x1606fb790
 sp          0x1
 pc          0x16064db70
 ps          0x8
 ai          0x120581c28
 pv          0x16064db44
 0x16064db70:    and     t1,#0xff,t4

 I also got a kernel core dump from this one.  Please give some
 instructions if you want me to dig up some data from the dump.

 Later on, after booting the same kernel, I got some more
 "Whoa!  pool_cache_get returned an in-use entry!" messages.  Also
 at some point (I don't think it correlated in time with any of
 the kernel messages) the system hung hard again.  I pressed the
 halt button, and entered DDB:
 halted CPU 0
 CPU 1 is not halted

 halt code = 1
 operator initiated halt
 PC = fffffc00005f6320
 P00>>>cont

 continuing CPU 0
 CP - RESTORE_TERM routine to be called
 panic: user requested console halt
 Stopped in pid 0.35 (system) at netbsd:cpu_Debugger+0x4:        ret     zero,(ra
 )
 db{0}> trace
 cpu_Debugger() at netbsd:cpu_Debugger+0x4
 panic() at netbsd:panic+0x268
 console_restart() at netbsd:console_restart+0x78
 XentRestart() at netbsd:XentRestart+0x90
 --- console restart (from ipl 2) ---
 mutex_spin_enter() at netbsd:mutex_spin_enter+0x200
 pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x138
 pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x168
 pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x178
 alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
 interrupt() at netbsd:interrupt+0x88
 XentInt() at netbsd:XentInt+0x1c
 --- interrupt (from ipl 4) ---
 mutex_exit() at netbsd:mutex_exit+0x10
 pool_cache_invalidate() at netbsd:pool_cache_invalidate+0x6c
 pool_reclaim() at netbsd:pool_reclaim+0x68
 pool_drain_end() at netbsd:pool_drain_end+0x44
 uvm_pageout() at netbsd:uvm_pageout+0x740
 exception_return() at netbsd:exception_return
 --- root of call graph ---
 db{0}> show reg
 v0          0xfffffc0000c0de28  cpu_info_primary
 t0          0x1
 t1          0x1
 t2          0xfffffc003ff48000
 t3          0
 t4          0
 t5          0xfffffc0000b46a2d  __func__.21238+0x91c
 t6          0xfffffc002e386888
 t7          0
 s0          0xfffffc0000c378e0  msgbufenabled
 s1          0x104
 s2          0xfffffc0000c350a8  db_onpanic
 s3          0xfffffc0000c8ef18  panicstr
 s4          0
 s5          0xfffffc0000808e48  pool_cache_put_slow+0x138
 s6          0
 a0          0
 a1          0xfffffd01fc0003f8
 a2          0
 a3          0x8
 a4          0x3
 a5          0xfffffc0000000008
 t8          0xb
 t9          0x8
 t10         0x7fff
 t11         0
 ra          0xfffffc000080c5a8  panic+0x268
 t12         0xfffffc00003eb590  cpu_Debugger
 at          0x160328000
 gp          0xfffffc0000c30928  __link_set_prop_linkpools_sym__link__prop_array_pool+0x8008
 sp          0xfffffe00130e5908
 pc          0xfffffc00003eb594  cpu_Debugger+0x4
 ps          0x7
 ai          0
 pv          0xfffffc00003eb590  cpu_Debugger
 netbsd:cpu_Debugger+0x4:        ret     zero,(ra)
 db{0}> mach cpu 1
 Using CPU 1
 db{0}> trace

 CPU 0: fatal kernel trap:

 CPU 0    trap entry = 0x2 (memory management fault)
 CPU 0    a0         = 0xffffffffffffffd9
 CPU 0    a1         = 0x1
 CPU 0    a2         = 0x0
 CPU 0    pc         = 0xfffffc00003ee944
 CPU 0    ra         = 0xfffffc00003e8104
 CPU 0    pv         = 0xfffffc00003ee890
 CPU 0    curlwp     = 0xfffffc003f9fe800
 CPU 0        pid = 0, comm = system

 Caught exception in ddb.
 db{0}> c     
 cpu1: shutting down...

 dumping to dev 8,1 offset 2097597
 dump i/o error


 sd0(esiop0:0:0:0): polling command not done
 panic: scsipi_execute_xs
 Stopped in pid 0.35 (system) at netbsd:cpu_Debugger+0x4:        ret     zero,(ra
 )
 db{0}> reboot 4

 >     I'm wondering if using IPL_HIGH for the mutex changes anything:
 >  
 >  Index: sys/arch/alpha/alpha/pmap.c
 >  ===================================================================
 >  RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
 >  retrieving revision 1.243
 >  diff -u -p -r1.243 pmap.c
 >  --- sys/arch/alpha/alpha/pmap.c 4 Oct 2009 17:00:31 -0000       1.243
 >  +++ sys/arch/alpha/alpha/pmap.c 11 Oct 2009 18:01:59 -0000
 >  @@ -962,7 +962,7 @@ pmap_bootstrap(paddr_t ptaddr, u_int max
 >           for (i = 0; i < ALPHA_MAXPROCS; i++) {
 >                   TAILQ_INIT(&pmap_tlb_shootdown_q[i].pq_head);
 >                   mutex_init(&pmap_tlb_shootdown_q[i].pq_lock, MUTEX_DEFAULT,
 >  -                   IPL_SCHED);
 >  +                   IPL_HIGH);
 >           }
 >    #endif

 I then applied this patch.  I tried several 'build.sh -j4' jobs,
 but all of them seemed to abort with some host tool tripping up:
 /usr/tools/bin/nbgroff: grotty: Illegal instruction (core dumped)
 /usr/tools/bin/nbgroff: troff: Illegal instruction (core dumped)

 Then on one occasion, the kernel started to repeatedly spew 
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ee00
 messages to the console.  The pj value were identical in all the
 messages, but the ci_index value varied (0 or 1).

 Do you still think I should try and increase the IPL level of the
 pool_cache entry as specified in your message?

 					-jarle

From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: Jarle Greipsland <jarle@uninett.no>
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Sun, 18 Oct 2009 11:37:20 -0600 (MDT)

 On Sun, 18 Oct 2009, Jarle Greipsland wrote:

 > I ran a few more 'build.sh -j4' jobs without changing anything,
 > and the kernel continued to emit spurious "Whoa!  pool_cache_get
 > returned an in-use entry!"  messages.
 >
 > Then, on Oct 14, I synced my source tree to -current again (this
 > might have been a stupid move on my part...), and built and
 > booted a new GENERIC.MP kernel.  After about an hour into a
 > 'build.sh -j4' job, it paniced with a new (to me) message:
 >
 > panic: pool_get: pmaptlb: page empty

    This would likely indicate pool_cache corruption, which would be in line 
 with the one that my patch worked around.

 > db{0}> mach cpu 1
 > Using CPU 1
 > db{0}> tr
 >
 > CPU 0: fatal kernel trap:

    There's a bug somewhere in handling the registers of the cpus other than 
 the one that's running DDB.  I can't remember if this ever worked 
 properly, although I'm sure I've at least tried it in the past.

 > I also got a kernel core dump from this one.  Please give some
 > instructions if you want me to dig up some data from the dump.

    I don't think there's a way to get much information out of a core dump 
 since 5.0.  The new gdb doesn't yet have the code to access a core dump.
 One more thing to try fixing some day.

 > Later on, after booting the same kernel, I got some more
 > "Whoa!  pool_cache_get returned an in-use entry!" messages.  Also
 > at some point (I don't think it correlated in time with any of
 > the kernel messages) the system hung hard again.  I pressed the
 > halt button, and entered DDB:
 > halted CPU 0
 > CPU 1 is not halted
 >
 > halt code = 1
 > operator initiated halt
 > PC = fffffc00005f6320
 > P00>>>cont
 >
 > continuing CPU 0
 > CP - RESTORE_TERM routine to be called
 > panic: user requested console halt
 > Stopped in pid 0.35 (system) at netbsd:cpu_Debugger+0x4:        ret     zero,(ra
 > )
 > db{0}> trace
 > cpu_Debugger() at netbsd:cpu_Debugger+0x4
 > panic() at netbsd:panic+0x268
 > console_restart() at netbsd:console_restart+0x78
 > XentRestart() at netbsd:XentRestart+0x90
 > --- console restart (from ipl 2) ---
 > mutex_spin_enter() at netbsd:mutex_spin_enter+0x200
 > pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x138
 > pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x168
 > pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x178
 > alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
 > interrupt() at netbsd:interrupt+0x88
 > XentInt() at netbsd:XentInt+0x1c
 > --- interrupt (from ipl 4) ---
 > mutex_exit() at netbsd:mutex_exit+0x10
 > pool_cache_invalidate() at netbsd:pool_cache_invalidate+0x6c
 > pool_reclaim() at netbsd:pool_reclaim+0x68
 > pool_drain_end() at netbsd:pool_drain_end+0x44
 > uvm_pageout() at netbsd:uvm_pageout+0x740
 > exception_return() at netbsd:exception_return

    This also appears to be pool cache corruption problems.  In this 
 particular case, CPU0 has received an IPI interrupt to shootdown its tlb 
 entries, and is trying to to release a pool cache entry used for this 
 call, and is hung trying to acquire a mutex related to the pool cache.
 The other cpu presumably holds that lock (and may be trying to acquire a 
 lock held by CPU0, leading to the classic deadlock).

 >>     I'm wondering if using IPL_HIGH for the mutex changes anything:
 >>
 >>  Index: sys/arch/alpha/alpha/pmap.c
 >>  ===================================================================
 >>  RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
 >>  retrieving revision 1.243
 >>  diff -u -p -r1.243 pmap.c
 >>  --- sys/arch/alpha/alpha/pmap.c 4 Oct 2009 17:00:31 -0000       1.243
 >>  +++ sys/arch/alpha/alpha/pmap.c 11 Oct 2009 18:01:59 -0000
 >>  @@ -962,7 +962,7 @@ pmap_bootstrap(paddr_t ptaddr, u_int max
 >>           for (i = 0; i < ALPHA_MAXPROCS; i++) {
 >>                   TAILQ_INIT(&pmap_tlb_shootdown_q[i].pq_head);
 >>                   mutex_init(&pmap_tlb_shootdown_q[i].pq_lock, MUTEX_DEFAULT,
 >>  -                   IPL_SCHED);
 >>  +                   IPL_HIGH);
 >>           }
 >>    #endif
 >
 > I then applied this patch.  I tried several 'build.sh -j4' jobs,
 > but all of them seemed to abort with some host tool tripping up:
 > /usr/tools/bin/nbgroff: grotty: Illegal instruction (core dumped)
 > /usr/tools/bin/nbgroff: troff: Illegal instruction (core dumped)

    This is the problem I get now on my ES45.  I haven't been able to figure 
 out anything from the process core dump, other than memory seems corrupted 
 or incorrect.  I'm not sure if it might be related to the tlb shootdown 
 code not working properly, or perhaps a missing memory barrier call 
 somewhere.

 > Then on one occasion, the kernel started to repeatedly spew
 > Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ee00
 > messages to the console.  The pj value were identical in all the
 > messages, but the ci_index value varied (0 or 1).
 >
 > Do you still think I should try and increase the IPL level of the
 > pool_cache entry as specified in your message?

    Try the higher IPL on the pmap_tlb_shootdown_job_cache.  I'm not real 
 clear on how that IPL is used, but I'm guessing that might be the IPL used 
 by any locking using by the pool cache routines, and may be needed to 
 prevent the IPI interrupt from interrupting a pool cache operation.  [That 
 might have caused the deadlock you observed above.]  Try IPL_CLOCK first, 
 and then IPL_HIGH if that still has problems relating to the pool cache.

 --
 Michael L. Hitch			mhitch@montana.edu
 Computer Consultant
 Information Technology Center
 Montana State University	Bozeman, MT	USA

State-Changed-From-To: feedback->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Mon, 19 Oct 2009 00:58:46 +0000
State-Changed-Why:
This does not need to be in feedback any more. Glad to see you guys working
on it :-)


From: Jarle Greipsland <jarle@uninett.no>
To: gnats-bugs@NetBSD.org, mhitch@lightning.msu.montana.edu
Cc: port-alpha-maintainer@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Mon, 26 Oct 2009 15:34:14 +0100 (CET)

 "Michael L. Hitch" <mhitch@lightning.msu.montana.edu> writes:
 >  > Then on one occasion, the kernel started to repeatedly spew
 >  > Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ee00
 >  > messages to the console.  The pj value were identical in all the
 >  > messages, but the ci_index value varied (0 or 1).
 >  >
 >  > Do you still think I should try and increase the IPL level of the
 >  > pool_cache entry as specified in your message?
 >  
 >     Try the higher IPL on the pmap_tlb_shootdown_job_cache.  I'm not real 
 >  clear on how that IPL is used, but I'm guessing that might be the IPL used 
 >  by any locking using by the pool cache routines, and may be needed to 
 >  prevent the IPI interrupt from interrupting a pool cache operation.  [That 
 >  might have caused the deadlock you observed above.]  Try IPL_CLOCK first, 
 >  and then IPL_HIGH if that still has problems relating to the pool cache.
 Results for the IPL_HIGH setting: It still has problems with the
 pool_cache_get stuff.  A number of consecutive 'build.sh -j4'
 resulted in console messages:
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9efa00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee080
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee440

 Also, at during one of the builds, the system hung completely,
 and I had to press the reset button:
 ----------------------------------------------------------------------
 Stopped in pid 0.2 (system) at  netbsd:cpu_Debugger+0x4:	ret     zero,(ra)
 db{0}> tr
 cpu_Debugger() at netbsd:cpu_Debugger+0x4
 comintr() at netbsd:comintr+0x720
 alpha_shared_intr_dispatch() at netbsd:alpha_shared_intr_dispatch+0x5c
 sio_iointr() at netbsd:sio_iointr+0x38
 interrupt() at netbsd:interrupt+0x1c0
 XentInt() at netbsd:XentInt+0x1c
 --- interrupt (from ipl 0) ---
 sched_curcpu_runnable_p() at netbsd:sched_curcpu_runnable_p+0x1c
 idle_loop() at netbsd:idle_loop+0x1b8
 exception_return() at netbsd:exception_return
 --- root of call graph ---
 db{0}> mach cpu 1
 Using CPU 1
 db{0}> tra

 CPU 0: fatal kernel trap:

 CPU 0    trap entry = 0x2 (memory management fault)
 CPU 0    a0         = 0xffffffffffffffd9
 CPU 0    a1         = 0x1
 CPU 0    a2         = 0x0
 CPU 0    pc         = 0xfffffc00003ee944
 CPU 0    ra         = 0xfffffc00003e8104
 CPU 0    pv         = 0xfffffc00003ee890
 CPU 0    curlwp     = 0xfffffc003fe29c00
 CPU 0        pid = 0, comm = system

 Caught exception in ddb.
 db{0}> mach cpu 0
 CPU 0 not paused
 db{0}> show reg
 v0          0x1
 t0          0xfffffc0000b44e9c  uvmadvice
 t1          0xfffffc0000b44e9c  uvmadvice
 t2          0
 t3          0x2000
 t4          0
 t5          0x3
 t6          0x4
 t7          0x1
 s0          0xfffffc0038581698
 s1          0
 s2          0x2
 s3          0xe78
 s4          0xfffffe0013bc16e0
 s5          0x160042000
 s6          0xfffffc0031b69000
 a0          0xfffffc0038581698
 a1          0xfffffc0014c0f760
 a2          0xfffffe0013bc1448
 a3          0x160042000
 a4          0xfffffe0013bc16e0
 a5          0
 t8          0x1
 t9          0
 t10         0xfffffc0000c941e8  uvmexp
 t11         0x80
 ra          0xfffffc00008e0bd4  uvm_fault_internal+0x124
 t12         0xfffffc00008e5f50  uvm_map_lookup_entry
 at          0xfffffe0013bbe000
 gp          0xfffffc0000c30968  __link_set_prop_linkpools_sym__link__prop_array_pool+0x8008
 sp          0x1
 pc          0xfffffc00008e0e50  uvm_fault_internal+0x3a0
 ps          0
 ai          0x80
 pv          0xfffffc00008e5f50  uvm_map_lookup_entry
 netbsd:uvm_fault_internal+0x3a0:        srl     t3,#0xd,t3
 db{0}> tr

 CPU 0: fatal kernel trap:

 CPU 0    trap entry = 0x2 (memory management fault)
 CPU 0    a0         = 0xffffffffffffffd9
 CPU 0    a1         = 0x1
 CPU 0    a2         = 0x0
 CPU 0    pc         = 0xfffffc00003ee944
 CPU 0    ra         = 0xfffffc00003e8104
 CPU 0    pv         = 0xfffffc00003ee890
 CPU 0    curlwp     = 0xfffffc003fe29c00
 CPU 0        pid = 0, comm = system

 Caught exception in ddb.
 db{0}> reboot 4
 ----------------------------------------------------------------------

 Also, for another build, I got a 

 ----------------------------------------------------------------------
 panic: fpsave ipi didn't
 Stopped in pid 22623.1 (sh) at  netbsd:cpu_Debugger+0x4:        ret     zero,(ra)
 db{0}> trace 
 cpu_Debugger() at netbsd:cpu_Debugger+0x4
 panic() at netbsd:panic+0x268
 fpusave_proc() at netbsd:fpusave_proc+0x1b4
 cpu_lwp_free() at netbsd:cpu_lwp_free+0x28
 exit1() at netbsd:exit1+0x568
 sys_exit() at netbsd:sys_exit+0x7c
 syscall_plain() at netbsd:syscall_plain+0x160
 XentSys() at netbsd:XentSys+0x60
 --- syscall (1) ---
 --- user mode ---
 db{0}> mach cpu 1
 CPU 1 not paused
 db{0}> show reg
 v0          0x6
 t0          0x1
 t1          0x1
 t2          0xfffffc003ff48000
 t3          0
 t4          0
 t5          0xfffffc0000b46a65  __func__.21238+0x91c
 t6          0xc5343806
 t7          0xfffffffffffffcbe
 s0          0xfffffc0000c37920  msgbufenabled
 s1          0x104
 s2          0xfffffc0000c350e8  db_onpanic
 s3          0xfffffc003b546800
 s4          0xfffffc0037d7b458
 s5          0xfffffc0000c7bff0  initproc
 s6          0x12003dab0
 a0          0x6
 a1          0xfffffd01fc0003f8
 a2          0
 a3          0x8
 a4          0x3
 a5          0x8
 t8          0x2
 t9          0x8
 t10         0
 t11         0x7
 ra          0xfffffc000080c5a8  panic+0x268
 t12         0xfffffc00003eb590  cpu_Debugger
 at          0xfffffe0013b1a000
 gp          0xfffffc0000c30968  __link_set_prop_linkpools_sym__link__prop_array_pool+0x8008
 sp          0xfffffe00139cbcc8
 pc          0xfffffc00003eb594  cpu_Debugger+0x4
 ps          0x6
 ai          0x7
 pv          0xfffffc00003eb590  cpu_Debugger
 netbsd:cpu_Debugger+0x4:        ret     zero,(ra)
 db{0}> reboot 4
 ----------------------------------------------------------------------

 Hope this helps.
 					-jarle

From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: jarle@uninett.no
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Tue, 27 Oct 2009 15:33:06 -0600 (MDT)

 On Mon, 26 Oct 2009, Jarle Greipsland wrote:

 > Results for the IPL_HIGH setting: It still has problems with the
 > pool_cache_get stuff.  A number of consecutive 'build.sh -j4'
 > resulted in console messages:
 > Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9efa00
 > Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee080
 > Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee440

    OK, here's something else to try.  I was looking through the alpha 
 hardware reference manual and checking some of the multiprocessor 
 information.  I noted that it showed the use of memory barriers when 
 sending/receiving interrupts between processors.  It looks like the atomic 
 operations that were used in the netbsd-4 branch included the memory 
 barrier, but the ones used in netbsd-5 and later do not.  This patch 
 should add back the memory barriers need for the IPI stuff.

 Index: sys/arch/alpha/alpha/ipifuncs.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/alpha/alpha/ipifuncs.c,v
 retrieving revision 1.40
 diff -u -p -r1.40 ipifuncs.c
 --- sys/arch/alpha/alpha/ipifuncs.c	28 Apr 2008 20:23:10 -0000	1.40
 +++ sys/arch/alpha/alpha/ipifuncs.c	27 Oct 2009 21:23:05 -0000
 @@ -131,6 +131,7 @@ alpha_ipi_process(struct cpu_info *ci, s
   	}
   #endif

 +	alpha_mb();
   	pending_ipis = atomic_swap_ulong(&ci->ci_ipis, 0);

   	/*
 @@ -167,6 +168,7 @@ alpha_send_ipi(u_long cpu_id, u_long ipi
   #endif

   	atomic_or_ulong(&cpu_info[cpu_id]->ci_ipis, ipimask);
 +	alpha_mb();
   	alpha_pal_wripir(cpu_id);
   }


 --
 Michael L. Hitch			mhitch@montana.edu
 Computer Consultant
 Information Technology Center
 Montana State University	Bozeman, MT	USA

From: Jarle Greipsland <jarle@uninett.no>
To: mhitch@lightning.msu.montana.edu
Cc: gnats-bugs@NetBSD.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Wed, 28 Oct 2009 16:01:33 +0100 (CET)

 "Michael L. Hitch" <mhitch@lightning.msu.montana.edu> writes:
 >    OK, here's something else to try.  I was looking through the alpha 
 > hardware reference manual and checking some of the multiprocessor 
 > information.  I noted that it showed the use of memory barriers when 
 > sending/receiving interrupts between processors.  It looks like the atomic 
 > operations that were used in the netbsd-4 branch included the memory 
 > barrier, but the ones used in netbsd-5 and later do not.  This patch 
 > should add back the memory barriers need for the IPI stuff.
 [ ... ]
 OK, I have applied your patch (and removed the old ones except
 for the one that generates the "Whoa!"-warnings.).  The kernel
 I'm running is GENERIC.MP based on -current from Oct 14th, with
 the following diff:
 ----------------------------------------------------------------------
 Index: arch/alpha/alpha/ipifuncs.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/alpha/alpha/ipifuncs.c,v
 retrieving revision 1.40
 diff -u -r1.40 ipifuncs.c
 --- arch/alpha/alpha/ipifuncs.c	28 Apr 2008 20:23:10 -0000	1.40
 +++ arch/alpha/alpha/ipifuncs.c	28 Oct 2009 14:55:46 -0000
 @@ -130,7 +130,7 @@
  		return;
  	}
  #endif
 -
 +	alpha_mb();
  	pending_ipis = atomic_swap_ulong(&ci->ci_ipis, 0);

  	/*
 @@ -167,6 +167,7 @@
  #endif

  	atomic_or_ulong(&cpu_info[cpu_id]->ci_ipis, ipimask);
 +	alpha_mb();
  	alpha_pal_wripir(cpu_id);
  }

 Index: arch/alpha/alpha/pmap.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
 retrieving revision 1.243
 diff -u -r1.243 pmap.c
 --- arch/alpha/alpha/pmap.c	4 Oct 2009 17:00:31 -0000	1.243
 +++ arch/alpha/alpha/pmap.c	28 Oct 2009 14:55:46 -0000
 @@ -3699,6 +3699,12 @@
  		 * don't really have to do anything else.
  		 */
  		mutex_spin_enter(&pq->pq_lock);
 +		if (pj && pj == pq->pq_head.tqh_first) {
 +			printf("Whoa!  pool_cache_get returned an in-use entry! ci_index %d pj %p\n",
 +			    self->ci_index, pj);
 +/**/		/*	panic("Oops"); */
 +			pj = NULL;
 +		}
  		pq->pq_pte |= pte;
  		if (pq->pq_tbia) {
  			mutex_spin_exit(&pq->pq_lock);
 ----------------------------------------------------------------------

 And it both "Whoa!"s and panics:
 ----------------------------------------------------------------------
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ef980
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003d8bfc00
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee400
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee400
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee400
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee400
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee400
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee400
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee400
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee400
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee400
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ee400
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee400
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ee400
 Whoa!  pool_cache_get returned an in-use entry! ci_index 1 pj 0xfffffc003f9ee400
 Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ee400

 CPU 1: fatal kernel trap:

 CPU 1    trap entry = 0x2 (memory management fault)
 CPU 1    a0         = 0x40
 CPU 1    a1         = 0x1
 CPU 1    a2         = 0x0
 CPU 1    pc         = 0xfffffc00007371a8
 CPU 1    ra         = 0xfffffc0000737118
 CPU 1    pv         = 0xfffffc00005f6130
 CPU 1    curlwp     = 0xfffffc003f960800
 CPU 1        pid = 0, comm = system

 panic: trap
 Stopped in pid 0.37 (system) at netbsd:cpu_Debugger+0x4:        ret     zero,(ra)
 db{1}> tr
 cpu_Debugger() at netbsd:cpu_Debugger+0x4
 panic() at netbsd:panic+0x268
 trap() at netbsd:trap+0x35c
 XentMM() at netbsd:XentMM+0x20
 --- memory management fault (from ipl 5) ---
 pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0xe8
 alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
 interrupt() at netbsd:interrupt+0x88
 XentInt() at netbsd:XentInt+0x1c
 --- interrupt (from ipl 0) ---
 mutex_spin_exit() at netbsd:mutex_spin_exit+0x5c
 pmap_tlb_shootdown() at netbsd:pmap_tlb_shootdown+0x170
 pmap_kremove() at netbsd:pmap_kremove+0xac
 uvm_pagermapout() at netbsd:uvm_pagermapout+0x40
 uvm_aio_aiodone() at netbsd:uvm_aio_aiodone+0xd4
 db{1}> show reg
 v0          0xfffffe0000034800
 t0          0x1
 t1          0x1
 t2          0xfffffc003ff48000
 t3          0
 t4          0
 t5          0xfffffc0000b46a3d  __func__.21238+0x91c
 t6          0
 t7          0
 s0          0xfffffc0000c378e0  msgbufenabled
 s1          0x104
 s2          0xfffffc0000c350a8  db_onpanic
 s3          0xfffffc00009bf5fc  reg_to_frame+0x5c8
 s4          0xfffffe0013923a38
 s5          0x40
 s6          0xfffffc003f960800
 a0          0x5
 a1          0xfffffd01fc0003f8
 a2          0
 a3          0x8
 a4          0x3
 a5          0xfffffe0000000008
 t8          0xfffffe00139237ff
 t9          0x8
 t10         0x3ea0a5
 t11         0x1ff800
 ra          0xfffffc000080c5b8  panic+0x268
 t12         0xfffffc00003eb590  cpu_Debugger
 at          0x12002438c
 gp          0xfffffc0000c30928  __link_set_prop_linkpools_sym__link__prop_array_pool+0x8008
 sp          0xfffffe0013923888
 pc          0xfffffc00003eb594  cpu_Debugger+0x4
 ps          0x6
 ai          0x1ff800
 pv          0xfffffc00003eb590  cpu_Debugger
 netbsd:cpu_Debugger+0x4:        ret     zero,(ra)
 db{1}> mach cpu 0
 Using CPU 0
 db{1}> tr

 CPU 1: fatal kernel trap:

 CPU 1    trap entry = 0x2 (memory management fault)
 CPU 1    a0         = 0xffffffffffffffd9
 CPU 1    a1         = 0x1
 CPU 1    a2         = 0x0
 CPU 1    pc         = 0xfffffc00003ee944
 CPU 1    ra         = 0xfffffc00003e8104
 CPU 1    pv         = 0xfffffc00003ee890
 CPU 1    curlwp     = 0xfffffc003f960800
 CPU 1        pid = 0, comm = system

 Caught exception in ddb.
 db{1}> show reg
 v0          0
 t0          0
 t1          0xfffffc003fe29c00
 t2          0xfffffe0012c8a400
 t3          0
 t4          0
 t5          0xfffffc003fe29c60
 t6          0xfffffc0000c68410  kernel_pmap_store+0x50
 t7          0
 s0          0x1
 s1          0xfffffc003fe29c00
 s2          0xfffffc0000c0de60  cpu_info_primary+0x38
 s3          0xfffffc0000c0de28  cpu_info_primary
 s4          0
 s5          0
 s6          0
 a0          0
 a1          0
 a2          0xfffffe0012c9a000
 a3          0x1
 a4          0xfffffc0000c94308  uvm_fpageqlock
 a5          0
 t8          0x1604db790
 t9          0
 t10         0xffffffff
 t11         0xfffffc003f90a8b8
 ra          0xfffffc00005eac88  idle_loop+0x1b8
 t12         0xfffffc000060e200  kpreempt_enable
 at          0xfffffe0013984000
 gp          0xfffffc0000c30928  __link_set_prop_linkpools_sym__link__prop_array_pool+0x8008
 sp          0x1
 pc          0xfffffc00005eac34  idle_loop+0x164
 ps          0
 ai          0xfffffc003f90a8b8
 pv          0xfffffc000060e200  kpreempt_enable
 netbsd:idle_loop+0x164: ldq     pv,-1d30(gp)
 ----------------------------------------------------------------------

 So, no cigar this time.  Anything else I should try?

 					-jarle
 -- 
 Q: What's the difference between programming and bug collecting?
 A: None.

From: "Michael L. Hitch" <mhitch@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/38335 CVS commit: src/sys/arch/alpha/alpha
Date: Mon, 9 Nov 2009 04:31:04 +0000

 Module Name:	src
 Committed By:	mhitch
 Date:		Mon Nov  9 04:31:03 UTC 2009

 Modified Files:
 	src/sys/arch/alpha/alpha: pmap.c

 Log Message:
 The tlb shootdown IPI can interrupt the pool_cache_get() in places that
 can cause a deadlock or pool cache corruption.  Take the shootdown job
 queue mutex before calling pool_cache_get(), which will block the IPI
 interrupts and seems to fix the remaining tlb shootdown deadlocks and
 pool cache corruption I've been seeing.  Should address both
 PR port-amiga/38335 and PR port-amiga/42174.


 To generate a diff of this commit:
 cvs rdiff -u -r1.249 -r1.250 src/sys/arch/alpha/alpha/pmap.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: jarle@uninett.no
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Sun, 8 Nov 2009 21:34:34 -0700 (MST)

 On Wed, 28 Oct 2009, Jarle Greipsland wrote:

 > So, no cigar this time.  Anything else I should try?
 >
 > 					-jarle

    I've committed a fix to current that I think will fix this.

 Index: sys/arch/alpha/alpha/pmap.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
 retrieving revision 1.249
 diff -u -p -r1.249 pmap.c
 --- sys/arch/alpha/alpha/pmap.c 7 Nov 2009 07:27:40 -0000       1.249
 +++ sys/arch/alpha/alpha/pmap.c 9 Nov 2009 04:21:12 -0000
 @@ -3581,6 +3581,7 @@ pmap_tlb_shootdown(pmap_t pmap, vaddr_t
                  cpumask |= 1UL << ci->ci_cpuid;

                  pq = &pmap_tlb_shootdown_q[ci->ci_cpuid];
 +               mutex_spin_enter(&pq->pq_lock);

                  /*
                   * Allocate a job.
 @@ -3596,7 +3597,6 @@ pmap_tlb_shootdown(pmap_t pmap, vaddr_t
                   * If a global flush is already pending, we
                   * don't really have to do anything else.
                   */
 -               mutex_spin_enter(&pq->pq_lock);
                  pq->pq_pte |= pte;
                  if (pq->pq_tbia) {
                          mutex_spin_exit(&pq->pq_lock);


 --
 Michael L. Hitch			mhitch@montana.edu
 Computer Consultant
 Information Technology Center
 Montana State University	Bozeman, MT	USA

From: Soren Jacobsen <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/38335 CVS commit: [netbsd-5] src/sys/arch/alpha/alpha
Date: Sun, 15 Nov 2009 05:54:38 +0000

 Module Name:	src
 Committed By:	snj
 Date:		Sun Nov 15 05:54:38 UTC 2009

 Modified Files:
 	src/sys/arch/alpha/alpha [netbsd-5]: pmap.c

 Log Message:
 Pull up following revision(s) (requested by mhitch in ticket #1137):
 	sys/arch/alpha/alpha/pmap.c: revision 1.250
 The tlb shootdown IPI can interrupt the pool_cache_get() in places that
 can cause a deadlock or pool cache corruption.  Take the shootdown job
 queue mutex before calling pool_cache_get(), which will block the IPI
 interrupts and seems to fix the remaining tlb shootdown deadlocks and
 pool cache corruption I've been seeing.  Should address both
 PR port-alpha/38335 and PR port-alpha/42174.


 To generate a diff of this commit:
 cvs rdiff -u -r1.236.10.3 -r1.236.10.4 src/sys/arch/alpha/alpha/pmap.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Soren Jacobsen <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/38335 CVS commit: [netbsd-5-0] src/sys/arch/alpha/alpha
Date: Sun, 15 Nov 2009 05:55:01 +0000

 Module Name:	src
 Committed By:	snj
 Date:		Sun Nov 15 05:55:00 UTC 2009

 Modified Files:
 	src/sys/arch/alpha/alpha [netbsd-5-0]: pmap.c

 Log Message:
 Pull up following revision(s) (requested by mhitch in ticket #1137):
 	sys/arch/alpha/alpha/pmap.c: revision 1.250
 The tlb shootdown IPI can interrupt the pool_cache_get() in places that
 can cause a deadlock or pool cache corruption.  Take the shootdown job
 queue mutex before calling pool_cache_get(), which will block the IPI
 interrupts and seems to fix the remaining tlb shootdown deadlocks and
 pool cache corruption I've been seeing.  Should address both
 PR port-alpha/38335 and PR port-alpha/42174.


 To generate a diff of this commit:
 cvs rdiff -u -r1.236.14.2 -r1.236.14.3 src/sys/arch/alpha/alpha/pmap.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->feedback
State-Changed-By: snj@NetBSD.org
State-Changed-When: Thu, 04 Feb 2010 02:41:25 +0000
State-Changed-Why:
Did the committed fix take care of the issue you're seeing?


From: Jarle Greipsland <jarle@uninett.no>
To: gnats-bugs@NetBSD.org, snj@NetBSD.org
Cc: port-alpha-maintainer@netbsd.org, netbsd-bugs@netbsd.org,
	gnats-admin@netbsd.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Fri, 05 Feb 2010 09:20:03 +0100 (CET)

 snj@NetBSD.org writes:
 > Synopsis: kernel freeze on alpha MP system
 > 
 > State-Changed-From-To: open->feedback
 > State-Changed-By: snj@NetBSD.org
 > State-Changed-When: Thu, 04 Feb 2010 02:41:25 +0000
 > State-Changed-Why:
 > Did the committed fix take care of the issue you're seeing?
 I don't think so.  I have been running with a LOCKDEBUG-kernel,
 and yesterday, while trying to do a 'build.sh release', it
 panicked.

 login: Mutex error: lockdebug_wantlock: locking against myself

 lock address : 0xfffffc0000882f68 type     :               spin
 initialized  : 0xfffffc00006735d8
 shared holds :                  0 exclusive:                  1
 shares wanted:                  0 exclusive:                  1
 current cpu  :                  0 last held:                  0
 current l  : 0xfffffc003fab2400 last held: 0xfffffc003fab2400
 last locked  : 0xfffffc0000674690 unlocked : 0xfffffc0000672478
 owner field  : 0x0000000000000400 wait/sn:                0/1

 panic: LOCKDEBUG
 Stopped in pid 0.35 (system) at netbsd:cpu_Debugger+0x4:        ret     zero,(ra
 )
 db{0}> 
 db{0}> trace
 cpu_Debugger() at netbsd:cpu_Debugger+0x4
 panic() at netbsd:panic+0x278
 lockdebug_abort1() at netbsd:lockdebug_abort1+0x150
 mutex_enter() at netbsd:mutex_enter+0x38c
 pool_get() at netbsd:pool_get+0x58
 pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x2b4
 pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x1e0
 pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x178
 alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
 interrupt() at netbsd:interrupt+0x88
 XentInt() at netbsd:XentInt+0x1c
 --- interrupt (from ipl 4) ---
 lockdebug_mem_check() at netbsd:lockdebug_mem_check+0x1b4
 pool_put() at netbsd:pool_put+0x7c
 pool_cache_invalidate_groups() at netbsd:pool_cache_invalidate_groups+0xf4
 pool_cache_invalidate() at netbsd:pool_cache_invalidate+0x8c
 pool_reclaim() at netbsd:pool_reclaim+0x68
 pool_drain_end() at netbsd:pool_drain_end+0x60
 uvm_pageout() at netbsd:uvm_pageout+0x880
 exception_return() at netbsd:exception_return
 --- root of call graph ---
 db{0}> ps
 PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
 23711    1 3   1         0   fffffc003d13cc00   alpha--netbsd-in biowait
 28502    1 3   1        80   fffffc002abf4800                 sh wait
 26148    1 2   0         0   fffffc003fab2000                 sh
 25393    1 3   1         0   fffffc0013211c00   alpha--netbsd-in nodebug
 13042    1 3   1        80   fffffc002abf5c00                 sh wait
 13735    1 3   1         0   fffffc003d6a9c00   alpha--netbsd-in biowait
 8778     1 3   1        80   fffffc002c220400                 sh wait
 20723    1 3   1        80   fffffc0029f73c00             nbmake select
 3839     1 3   1        80   fffffc003ecbe400                 sh wait
 10654    1 3   1        80   fffffc0028789800             nbmake select
 27117    1 3   0     40080   fffffc002aff8400                 sh wait
 15901    1 3   1        80   fffffc0028788c00             nbmake select
 18873    1 3   1        80   fffffc003d13c800                 sh wait
 21384    1 3   0        80   fffffc003d6a8800             nbmake select
 18742    1 3   1        80   fffffc0029f73000                 sh wait
 4474     1 3   1        80   fffffc0028789400             nbmake select
 20634    1 3   0        80   fffffc001ba59400                 sh wait
 4601     1 3   1        80   fffffc0035c89c00             nbmake select
 18174    1 3   1        80   fffffc002ab44800                 sh wait
 23525    1 3   1        80   fffffc0035c89000             nbmake select
 16471    1 3   1        80   fffffc0029f72000                 sh wait
 18058    1 3   0        80   fffffc002aff9000             nbmake select
 1991     1 3   1        80   fffffc002aff8800                 sh wait
 3250     1 3   0        80   fffffc003fab2800             nbmake select
 1383     1 3   0        80   fffffc003d6a8000                 sh wait
 22718    1 3   0        80   fffffc0035c89800             nbmake select
 18050    1 3   1        80   fffffc003d533400                 sh wait
 22901    1 3   1        80   fffffc002aff9c00             nbmake select
 27793    1 3   0        80   fffffc001ba58c00             nbmake select
 2373     1 3   1        80   fffffc0035c88800                 sh wait
 26438    1 3   1        80   fffffc002aff9800                 sh wait
 11413    1 3   0        80   fffffc002abf4400             nbmake select
 11875    1 3   1        80   fffffc002c220c00             pickup kqueue
 4144     1 3   1        80   fffffc002c221800               bash ttyraw
 12438    1 3   1        80   fffffc003d13d400              xterm select
 5290     1 3   1        80   fffffc003d532800                csh pause
 1592     1 3   0        80   fffffc002c221c00               sshd select
 21396    1 3   1        80   fffffc001ba58400               sshd netio
 26068    1 3   0        80   fffffc0028788800                 sh wait
 18777    1 3   0        80   fffffc003d533c00             nbmake select
 24640    1 3   1        80   fffffc002ab45000                 sh wait
 13734    1 3   0        80   fffffc002c221400             nbmake select
 26943    1 3   0        80   fffffc003ecbe800                 sh wait
 3019     1 3   0        80   fffffc002abf4000             nbmake select
 18256    1 3   0        80   fffffc003d6a9800                 sh wait
 6532     1 3   0        80   fffffc003d532000             nbmake select
 23787    1 3   0        80   fffffc002abf5800                 sh wait
 20208    1 3   1        80   fffffc003d6a8c00               tcsh pause
 527      1 3   1        80   fffffc003fa09800              getty ttyraw
 515      1 3   0        80   fffffc003d533000               cron nanoslp
 508      1 3   1        80   fffffc003ecbec00          python2.5 select
 484      1 3   0        80   fffffc003f6fc000               qmgr kqueue
 443      1 3   1        80   fffffc003d533800              inetd kqueue
 478      1 3   1        80   fffffc003f6fc400             master kqueue
 300      1 3   1        80   fffffc003ecbf400               sshd select
 301      1 3   0        80   fffffc003ecbf800               ntpd pause
 183      1 3   0        80   fffffc003ecbfc00          mount_mfs mfsidl
 152      1 3   1        80   fffffc003f6fd000            syslogd kqueue
 1        1 3   1        80   fffffc003fa09400               init wait
 0       39 3   1       200   fffffc003f6fc800            physiod physiod
 0       38 3   1       200   fffffc003fa08000        vmem_rehash vmem_rehash
 0       37 3   1       200   fffffc003fa08400           aiodoned aiodoned
 0       36 3   0       200   fffffc003fa08800            ioflush syncer
 0    >  35 7   0       200   fffffc003fab2400           pgdaemon
 0       34 3   1       200   fffffc003fa09000          cryptoret crypto_wait
 0       33 3   1       200   fffffc003fa09c00          atapibus0 sccomp
 0       31 3   1       200   fffffc003fa08c00              unpgc unpgc
 0       22 3   0       200   fffffc003fab2c00            atabus1 atath
 0       21 3   1       200   fffffc003fab3000            atabus0 atath
 0       20 3   0       200   fffffc003fab3400           scsibus0 sccomp
 0       19 3   1       200   fffffc003fab3800            xcall/1 xcall
 0       18 1   1       200   fffffc003fab3c00          softser/1
 0       17 1   1       200   fffffc003fcea000          softclk/1
 0       16 1   1       200   fffffc003fcea400          softbio/1
 0       15 1   1       200   fffffc003fcea800          softnet/1
 0    >  14 7   1       201   fffffc003fceac00             idle/1
 0       13 3   1       200   fffffc003fceb000         pmfsuspend pmfsuspend
 0       12 3   0       200   fffffc003fceb400           pmfevent pmfevent
 0       11 3   1       200   fffffc003fceb800           nfssilly nfssilly
 0       10 3   1       200   fffffc003fcebc00            cachegc cachegc
 0        9 3   1       200   fffffc003fe18000              vrele vrele
 0        8 3   0       200   fffffc003fe18400          modunload modunload
 0        7 3   0       200   fffffc003fe18800            xcall/0 xcall
 0        6 1   0       200   fffffc003fe18c00          softser/0
 0        5 1   0       200   fffffc003fe19000          softclk/0
 0        4 1   0       200   fffffc003fe19400          softbio/0
 0        3 1   0       200   fffffc003fe19800          softnet/0
 0        2 1   0       201   fffffc003fe19c00             idle/0
 0        1 3   1       200   fffffc0000841580            swapper uvm
 db{0}> show reg
 v0          0xfffffc000083b980  cpu_info_primary
 t0          0x1
 t1          0x1
 t2          0xfffffc003ff48000
 t3          0xfffffc000084f201  kprintf_inited
 t4          0
 t5          0xfffffc00007e1fd2  sun_noname+0x286
 t6          0
 t7          0
 s0          0xfffffc00008503c4  msgbufenabled
 s1          0x104
 s2          0xfffffc000084e720  db_onpanic
 s3          0xfffffc00007e23e5  __func__.8678+0x2cc
 s4          0x1
 s5          0x5
 s6          0xfffffc003fa936c0
 a0          0
 a1          0xfffffc0000677ca4  kprintf_unlock+0x54
 a2          0
 a3          0x8
 a4          0x3
 a5          0xfffffc0000000008
 t8          0xfffffe0013041787
 t9          0x8
 t10         0x3a46e6e0
 t11         0x9
 ra          0xfffffc0000679328  panic+0x278
 t12         0xfffffc000047b0b0  cpu_Debugger
 at          0x17
 gp          0xfffffc000084e3d0  __link_set_prop_linkpools_sym__link__prop_array_pool+0x8008
 sp          0xfffffe0013041870
 pc          0xfffffc000047b0b4  cpu_Debugger+0x4
 ps          0x6
 ai          0x9
 pv          0xfffffc000047b0b0  cpu_Debugger
 netbsd:cpu_Debugger+0x4:        ret     zero,(ra)
 db{0}> mach cpu 1
 Using CPU 1
 db{0}> trace
 db{0}> show reg
 v0          0xfffffe0000032000
 t0          0
 t1          0xfffffe0012fd2000
 t2          0xfffffe0012fd2000
 t3          0
 t4          0
 t5          0xfffffc003fceac60
 t6          0
 t7          0xfffffc00008aae18  vm_physmem
 s0          0xfffffc003fe20cc0
 s1          0
 s2          0xfffffe0000032000
 s3          0xfffffe0000032038
 s4          0xfffffc000078e3d0  config+0x437e
 s5          0xfffffc00007ae3d0  wildcard.5079+0xd88
 s6          0xfffffc00007ae3d0  wildcard.5079+0xd88
 a0          0xfffffc0000793f8c  config+0x9f3a
 a1          0xfffffc00007aee79  __func__.5392+0xe3
 a2          0x1dd
 a3          0xfffffc00007aee90  __func__.5392+0xfa
 a4          0
 a5          0
 t8          0xfffffc00008aae18  vm_physmem
 t9          0xfffffc00005835b4  calcru+0x204
 t10         0
 t11         0xa
 ra          0xfffffc0000585390  sched_curcpu_runnable_p+0x70
 t12         0xfffffc0000594950  kpreempt_enable
 at          0x1601b306c
 gp          0xfffffc000084e3d0  __link_set_prop_linkpools_sym__link__prop_array_pool+0x8008
 sp          0xfffffc0000593ca0  kpreempt_disable+0x20
 pc          0xfffffc0000594950  kpreempt_enable
 ps          0
 ai          0xa
 pv          0xfffffc0000594950  kpreempt_enable
 netbsd:kpreempt_enable: ldah    gp,2c(pv)
 db{0}> c
 cpu1: shutting down...

 dumping to dev 8,1 offset 2097597
 dump i/o error


 sd0(esiop0:0:0:0): polling command not done
 panic: scsipi_execute_xs
 panic: alpha_send_ipi: CPU 1 not running
 panic: alpha_send_ipi: CPU 1 not running
 [ ... ]

 					-jarle

State-Changed-From-To: feedback->open
State-Changed-By: snj@NetBSD.org
State-Changed-When: Fri, 05 Feb 2010 17:46:44 +0000
State-Changed-Why:
Feedback received.


From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: Jarle Greipsland <jarle@uninett.no>
Cc: gnats-bugs@NetBSD.org
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Thu, 18 Feb 2010 15:04:28 -0700 (MST)

 On Fri, 5 Feb 2010, Jarle Greipsland wrote:

 >> Did the committed fix take care of the issue you're seeing?
 > I don't think so.  I have been running with a LOCKDEBUG-kernel,
 > and yesterday, while trying to do a 'build.sh release', it
 > panicked.
 >
 > login: Mutex error: lockdebug_wantlock: locking against myself
 >
 > lock address : 0xfffffc0000882f68 type     :               spin
 > initialized  : 0xfffffc00006735d8
 > shared holds :                  0 exclusive:                  1
 > shares wanted:                  0 exclusive:                  1
 > current cpu  :                  0 last held:                  0
 > current l  : 0xfffffc003fab2400 last held: 0xfffffc003fab2400
 > last locked  : 0xfffffc0000674690 unlocked : 0xfffffc0000672478
 > owner field  : 0x0000000000000400 wait/sn:                0/1
 >
 > panic: LOCKDEBUG
 > Stopped in pid 0.35 (system) at netbsd:cpu_Debugger+0x4:        ret     zero,(ra
 > )
 > db{0}>
 > db{0}> trace
 > cpu_Debugger() at netbsd:cpu_Debugger+0x4
 > panic() at netbsd:panic+0x278
 > lockdebug_abort1() at netbsd:lockdebug_abort1+0x150
 > mutex_enter() at netbsd:mutex_enter+0x38c
 > pool_get() at netbsd:pool_get+0x58
 > pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x2b4
 > pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x1e0
 > pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x178
 > alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
 > interrupt() at netbsd:interrupt+0x88
 > XentInt() at netbsd:XentInt+0x1c
 > --- interrupt (from ipl 4) ---
 > lockdebug_mem_check() at netbsd:lockdebug_mem_check+0x1b4
 > pool_put() at netbsd:pool_put+0x7c
 > pool_cache_invalidate_groups() at netbsd:pool_cache_invalidate_groups+0xf4
 > pool_cache_invalidate() at netbsd:pool_cache_invalidate+0x8c
 > pool_reclaim() at netbsd:pool_reclaim+0x68
 > pool_drain_end() at netbsd:pool_drain_end+0x60
 > uvm_pageout() at netbsd:uvm_pageout+0x880
 > exception_return() at netbsd:exception_return

    This looks like the tlb shootdown IPI interrupted a pool_put() 
 operation, which likley holds a mutex related to some pool.  The
 tlb shootdown is attempting to return an entry to the shootdown
 job pool, and tries to acquire the same lock.

    I hadn't seen this type of deadlock (which it would have without
 LOCKDEBUG) before, but I recently started seeing seeing several
 deadlocks just grepping a number of large files.  Trying to build
 a current LOCKDEBUG kernel resulted in a hard hang when booting the 
 kernel, requiring a power cycle of my CS20 and I had trouble recovering 
 after that.  I switched to a 4 CPU ES40 to debug the LOCKDEBUG problem, 
 and eventually found what that was and using a workaround was finally able 
 to get a working LOCKDEBUG kernel.  My grepping was able to cause 
 deadlocks fairly easily, and I had been trying to to keep the CS20 from 
 hanging in the deadlock by enabling the SPINLOCK_SPINOUT part of 
 LOCKDEBUG.  Some of the deadlocks looked very similar to the above.

    Try the following change, which will use IPL_SCHED for the pool cache 
 used by the tlb shootdown code.  I've done this on my CS20, and have been 
 able to run numereous greps that would previously deadlock.  In the above 
 traceback, I can't quite see how that would work, unless the 
 pool_drain_end() is going to be locking the tlb shootdown job pool.

 Index: sys/arch/alpha/alpha/pmap.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
 retrieving revision 1.252
 diff -u -p -r1.252 pmap.c
 --- sys/arch/alpha/alpha/pmap.c 26 Nov 2009 00:19:11 -0000      1.252
 +++ sys/arch/alpha/alpha/pmap.c 18 Feb 2010 21:52:16 -0000
 @@ -937,7 +937,7 @@ pmap_bootstrap(paddr_t ptaddr, u_int max
           */
          pool_cache_bootstrap(&pmap_tlb_shootdown_job_cache,
              sizeof(struct pmap_tlb_shootdown_job), CACHE_LINE_SIZE,
 -            0, PR_LARGECACHE, "pmaptlb", NULL, IPL_VM, NULL, NULL, NULL);
 +            0, PR_LARGECACHE, "pmaptlb", NULL, IPL_SCHED, NULL, NULL, NULL);
          for (i = 0; i < ALPHA_MAXPROCS; i++) {
                  TAILQ_INIT(&pmap_tlb_shootdown_q[i].pq_head);
                  mutex_init(&pmap_tlb_shootdown_q[i].pq_lock, MUTEX_DEFAULT,


 --
 Michael L. Hitch			mhitch@montana.edu
 Computer Consultant
 Information Technology Center
 Montana State University	Bozeman, MT	USA

From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: jarle@uninett.no
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Fri, 19 Feb 2010 11:10:17 -0700 (MST)

 On Thu, 18 Feb 2010, Michael L. Hitch wrote:

 > > db{0}> trace
 > > cpu_Debugger() at netbsd:cpu_Debugger+0x4
 > > panic() at netbsd:panic+0x278
 > > lockdebug_abort1() at netbsd:lockdebug_abort1+0x150
 > > mutex_enter() at netbsd:mutex_enter+0x38c
 > > pool_get() at netbsd:pool_get+0x58
 > > pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x2b4
 > > pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x1e0
 > > pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x178
 > > alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
 > > interrupt() at netbsd:interrupt+0x88
 > > XentInt() at netbsd:XentInt+0x1c
 > > --- interrupt (from ipl 4) ---
 > > lockdebug_mem_check() at netbsd:lockdebug_mem_check+0x1b4
 > > pool_put() at netbsd:pool_put+0x7c
 > > pool_cache_invalidate_groups() at netbsd:pool_cache_invalidate_groups+0xf4
 > > pool_cache_invalidate() at netbsd:pool_cache_invalidate+0x8c
 > > pool_reclaim() at netbsd:pool_reclaim+0x68
 > > pool_drain_end() at netbsd:pool_drain_end+0x60
 > > uvm_pageout() at netbsd:uvm_pageout+0x880
 > > exception_return() at netbsd:exception_return
 ...
 >    Try the following change, which will use IPL_SCHED for the pool cache
 > used by the tlb shootdown code.  I've done this on my CS20, and have been
 > able to run numereous greps that would previously deadlock.  In the above
 > traceback, I can't quite see how that would work, unless the
 > pool_drain_end() is going to be locking the tlb shootdown job pool.

    The patch I suggested is not going to help.  I just ran one more grep 
 today and got a panic just like the above one.

    The problem is that the lock being held is for the pcg_large_pool, which 
 uses IPL_VM, which can be interrupted by the tlb shootdown IPI.  In this 
 case, the lock has been taken by the cpu when doing the pool cache 
 invalidate, and gets interrupted by the IPI on the same cpu.  It needs to 
 do a pool_get() when the tlb shootdown routine is trying to put a pool 
 cache entry back into the cache, and that needs to also lock 
 pcg_large_pool.

    I don't really know all that much about the pool cache or pools, and 
 don't see any easy solution to this.

    A quick and dirty workaround would be to change the maximum number of 
 'jobs' the tlb shutdown does to 0, which effectively forces it to always 
 invalid the entire TLB and never use any pool cache entries.

    An alternative that I looked at quite some time ago (2 years maybe?) was 
 to replace the pool cache usage with statically allocated entries in the 
 pmap_tlb_shootdown_q[].

 --
 Michael L. Hitch			mhitch@montana.edu
 Computer Consultant
 Information Technology Center
 Montana State University	Bozeman, MT	USA

Responsible-Changed-From-To: port-alpha-maintainer->thorpej
Responsible-Changed-By: thorpej@NetBSD.org
Responsible-Changed-When: Tue, 13 Oct 2020 15:58:55 +0000
Responsible-Changed-Why:
Take.


State-Changed-From-To: open->closed
State-Changed-By: thorpej@NetBSD.org
State-Changed-When: Tue, 13 Oct 2020 15:58:55 +0000
State-Changed-Why:
Ancient PR, many changes in this area in the intervening time.  I have
done an awful lot of MP stress testing on Alpha recently and fixed many
bugs in this area.


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: gnats-precook-prs,v 1.4 2018/12/21 14:20:20 maya Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2017 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.