NetBSD Problem Report #42174

From www@NetBSD.org  Mon Oct 12 06:46:04 2009
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by www.NetBSD.org (Postfix) with ESMTP id 2F47363B8B6
	for <gnats-bugs@gnats.netbsd.org>; Mon, 12 Oct 2009 06:46:04 +0000 (UTC)
Message-Id: <20091012064604.012E563B877@www.NetBSD.org>
Date: Mon, 12 Oct 2009 06:46:03 +0000 (UTC)
From: dmarquess@gmail.com
Reply-To: dmarquess@gmail.com
To: gnats-bugs@NetBSD.org
Subject: GENERIC.MP on Alpha results in hard lock
X-Send-Pr-Version: www-1.0

>Number:         42174
>Category:       port-alpha
>Synopsis:       GENERIC.MP on Alpha results in hard lock
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    port-alpha-maintainer
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Oct 12 06:50:00 +0000 2009
>Closed-Date:    Sun Nov 15 06:12:05 +0000 2009
>Last-Modified:  Sun Nov 15 06:12:05 +0000 2009
>Originator:     Dustin Marquess
>Release:        5.0_STABLE
>Organization:
>Environment:
NetBSD 5.0_STABLE (GENERIC-$Revision: 1.325.4.1 $) #0: Sun Oct 11 02:37:40 UTC 2009
        builds@b8.netbsd.org:/home/builds/ab/netbsd-5/alpha/200910110000Z-obj/home/builds/ab/netbsd-5/src/sys/arch/alpha/compile/GENERIC.MP
>Description:
Installed 5.0_STABLE on an API CS20 / HP DS20L & then booted a GENERIC.MP kernel.  SSHd into the machine, downloaded the latest pkgsrc.tar.bz2, and proceeded to uncompress it.  About 70% of the way into it, the machine promptly hard locked.  No response is given on the network, or the serial console.  Not sure how to debug w/o serial console.

Michael L. Hitch documented this issue when fixing PR #41106, however no PR seems to have been created for this problem.

On a side note, the machine was setup using FFSv1 + WAPBL.  When rebooting, it appears that the filesystem was hosed.  Not sure if this is related or not.
>How-To-Repeat:
Use a MP kernel on an Alpha.
>Fix:
Unknown.  Not sure how to debug w/o serial console.

>Release-Note:

>Audit-Trail:
From: Dustin Marquess <dmarquess@gmail.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-alpha/42174
Date: Fri, 16 Oct 2009 22:52:13 -0500

 Got the machine to deadlock again.  Used the halt button to get into
 ddb and this is the stack trace that I got:

 db{0}> trace
 cpu_Debugger() at netbsd:cpu_Debugger+0x4
 panic() at netbsd:panic+0x244
 console_restart() at netbsd:console_restart+0x78
 XentRestart() at netbsd:XentRestart+0x90
 --- console restart (from ipl 4) ---
 nullop() at netbsd:nullop
 _kernel_lock() at netbsd:_kernel_lock+0x1c0
 interrupt() at netbsd:interrupt+0x2a4
 XentInt() at netbsd:XentInt+0x1c
 --- interrupt (from ipl 0) ---
 kpreempt_disable() at netbsd:kpreempt_disable
 sched_curcpu_runnable_p() at netbsd:sched_curcpu_runnable_p+0x1c
 idle_loop() at netbsd:idle_loop+0x17c
 exception_return() at netbsd:exception_return
 --- root of call graph ---

From: Dustin Marquess <dmarquess@gmail.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-alpha/42174
Date: Sun, 25 Oct 2009 12:45:54 -0500

 I've triggered this bug about 4 times now, and every single time it
 hits, when I reboot the disklabel on the drive is completely gone.
 This seems like a huge coincidence to me.

From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: gnats-admin@netbsd.org, dmarquess@gmail.com
Subject: Re: port-alpha/42174
Date: Mon, 26 Oct 2009 11:50:24 -0600 (MDT)

 On Sat, 17 Oct 2009, Dustin Marquess wrote:

 > Got the machine to deadlock again.  Used the halt button to get into
 > ddb and this is the stack trace that I got:
 >
 > db{0}> trace
 > cpu_Debugger() at netbsd:cpu_Debugger+0x4
 > panic() at netbsd:panic+0x244
 > console_restart() at netbsd:console_restart+0x78
 > XentRestart() at netbsd:XentRestart+0x90
 > --- console restart (from ipl 4) ---
 > nullop() at netbsd:nullop
 > _kernel_lock() at netbsd:_kernel_lock+0x1c0

     This would indicate that CPU0 is attempting to acquire the kernel lock,
 but presumably CPU1 currently has it and is unable to release it for some 
 reasone.

    I don't know of any easy way to determine what CPU1 is currently doing. 
 It most like is unable to process the IPI interrupt sent by entering DDB
 and pausing.  Even if it was able to, something is broken in looking at a 
 backtrace in the other cpus, although I think the register contents can be 
 displayed when the cpu is able to respond to the IPI pause request.

    A LOCKDEBUG kernel would provide more information on what is currently 
 locked, but I don't think that's working on the alpha at the moment.



 --
 Michael L. Hitch			mhitch@montana.edu
 Computer Consultant
 Information Technology Center
 Montana State University	Bozeman, MT	USA

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-alpha/42174
Date: Fri, 30 Oct 2009 06:40:35 +0000

 On Sun, Oct 25, 2009 at 05:55:02PM +0000, Dustin Marquess wrote:
  >  I've triggered this bug about 4 times now, and every single time it
  >  hits, when I reboot the disklabel on the drive is completely gone.
  >  This seems like a huge coincidence to me.

 The *disklabel*? Properly speaking that shouldn't be possible - the
 disklabel should be read-only during all normal operation...

 -- 
 David A. Holland
 dholland@netbsd.org

From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: dmarquess@gmail.com
Subject: Re: port-alpha/42174
Date: Sun, 1 Nov 2009 20:36:38 -0700 (MST)

 On Mon, 26 Oct 2009, Michael L. Hitch wrote:

 >    A LOCKDEBUG kernel would provide more information on what is currently
 > locked, but I don't think that's working on the alpha at the moment.

    I've committed a fix for the LOCKDEBUG problem on the CS20/DS20L, and it
 has been pulled up to the netbsd-5 branch.  One of the things that 
 LOCKDEBUG will do is to limit how long the cpu spins waiting for a lock, 
 so should eventually panic instead of hanging forever deadlocked.  It also 
 keeps more information about the locks, including which cpu holds the lock
 and the address where the lock was acquired.


 --
 Michael L. Hitch			mhitch@montana.edu
 Computer Consultant
 Information Technology Center
 Montana State University	Bozeman, MT	USA

From: Dustin Marquess <dmarquess@gmail.com>
To: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
Cc: gnats-bugs@netbsd.org
Subject: Re: port-alpha/42174
Date: Mon, 2 Nov 2009 11:42:08 -0600

 I just hit:

 panic: kernel diagnostic assertion "pcg->pcg_avail == 0" failed: file
 "../../../../kern/subr_pool.c", line 2580
 Stopped in pid 0.12 (system) at netbsd:cpu_Debugger+0x4:        ret     zero,(ra
 )
 db{1}> bt
 cpu_Debugger() at netbsd:cpu_Debugger+0x4
 panic() at netbsd:panic+0x244
 __kernassert() at netbsd:__kernassert+0x4c
 pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x2f0
 pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x1b0
 pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x174
 alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
 interrupt() at netbsd:interrupt+0x84
 XentInt() at netbsd:XentInt+0x1c
 --- interrupt (from ipl 0) ---
 idle_loop() at netbsd:idle_loop+0x19c
 cpu_spinup_trampoline() at netbsd:cpu_spinup_trampoline+0x5c
 --- root of call graph ---

 Not sure if that's related to DIAGNOSTIC or LOCKDEBUG.  If it's
 related to DIAGNOSTIC I'll recompile with that disabled for now.

From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: dmarquess@gmail.com
Subject: Re: port-alpha/42174
Date: Mon, 2 Nov 2009 11:22:33 -0700 (MST)

 On Mon, 2 Nov 2009, Dustin Marquess wrote:

 > I just hit:
 >
 > panic: kernel diagnostic assertion "pcg->pcg_avail == 0" failed: file
 > "../../../../kern/subr_pool.c", line 2580

    Hmm, I wonder if this might be related to some of the pool cache
 corruption I've seen at times.

 > Not sure if that's related to DIAGNOSTIC or LOCKDEBUG.  If it's
 > related to DIAGNOSTIC I'll recompile with that disabled for now.

    The KASSERT() is enabled by DIAGNOSTIC, which I now see isn't enabled
 in my kernel and would explain why I haven't seen this happen.  I
 would presume that the condition being tested shouldn't have happened,
 and ignoring it by turning off DIAGNOSTIC would likely just result in
 problems later.

    There's a couple of things you can try that essentially disable the
 use of using pool cache entries to shootdown tlbs.  One is a variable
 you can patch on boot or change in the source to disable all the pool
 caching.  The variable is pool_cache_disable in sys/kern/subr_pool.c
 about line 182.  This can also be set temporarily on boot by using
 the -d flag to enter the debugger on startup and setting it with the
 command "write pool_cache_disable 1" and continuing.

    The other thing is to change the maximum number of TLB entries to
 shootdown from 6 to 0 in sys/arch/alpha/alpha/pmap.c, about line 433.
 This will force the tlb shootdown routine to always invalid all tlb
 entries and never attempt to use pool cache entries.

 --
 Michael L. Hitch			mhitch@montana.edu
 Computer Consultant
 Information Technology Center
 Montana State University	Bozeman, MT	USA

From: Dustin Marquess <dmarquess@gmail.com>
To: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
Cc: gnats-bugs@netbsd.org
Subject: Re: port-alpha/42174
Date: Tue, 3 Nov 2009 20:35:52 -0600

 Okay, got it back up and running and within minutes hit:

 Mutex error: mutex_vector_enter: spinout

 lock address : 0xfffffc0000c85fe8 type     :               spin
 initialized  : 0xfffffc00006eb9b8
 shared holds :                  0 exclusive:                  0
 shares wanted:                  0 exclusive:                  3
 current cpu  :                  0 last held:                  1
 current lwp  : 0xfffffc007f69d800 last held: 000000000000000000
 last locked  : 0xfffffc00006eccc8 unlocked : 0xfffffc00006ece90
 owner field  : 0x0000000000000400 wait/spin:                0/1

 panic: LOCKDEBUG
 Stopped in pid 0.36 (system) at netbsd:cpu_Debugger+0x4:        ret     zero,(ra
 )
 db{0}> bt
 cpu_Debugger() at netbsd:cpu_Debugger+0x4
 panic() at netbsd:panic+0x244
 lockdebug_abort1() at netbsd:lockdebug_abort1+0x150
 mutex_abort() at netbsd:mutex_abort+0x34
 mutex_vector_enter() at netbsd:mutex_vector_enter+0x324
 pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x198
 pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x1b0
 pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x174
 alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
 interrupt() at netbsd:interrupt+0x84
 XentInt() at netbsd:XentInt+0x1c
 --- interrupt (from ipl 4) ---
 lockdebug_wantlock() at netbsd:lockdebug_wantlock+0x218
 mutex_tryenter() at netbsd:mutex_tryenter+0x274
 pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x78
 pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x26c
 pmap_tlb_shootdown() at netbsd:pmap_tlb_shootdown+0x248
 pmap_kremove() at netbsd:pmap_kremove+0x124
 uvm_pagermapout() at netbsd:uvm_pagermapout+0x40
 uvm_aio_aiodone() at netbsd:uvm_aio_aiodone+0xf4

 I'm still in ddb if I need to do anything else.

From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: dmarquess@gmail.com
Subject: Re: port-alpha/42174
Date: Thu, 5 Nov 2009 11:59:39 -0700 (MST)

 On Wed, 4 Nov 2009, Dustin Marquess wrote:

 > panic: LOCKDEBUG
 > Stopped in pid 0.36 (system) at netbsd:cpu_Debugger+0x4:        ret     zero,(ra
 > )
 > db{0}> bt
 > cpu_Debugger() at netbsd:cpu_Debugger+0x4
 > panic() at netbsd:panic+0x244
 > lockdebug_abort1() at netbsd:lockdebug_abort1+0x150
 > mutex_abort() at netbsd:mutex_abort+0x34
 > mutex_vector_enter() at netbsd:mutex_vector_enter+0x324
 > pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x198
 > pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x1b0
 > pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x174
 > alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
 > interrupt() at netbsd:interrupt+0x84
 > XentInt() at netbsd:XentInt+0x1c
 > --- interrupt (from ipl 4) ---
 > lockdebug_wantlock() at netbsd:lockdebug_wantlock+0x218
 > mutex_tryenter() at netbsd:mutex_tryenter+0x274
 > pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x78
 > pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x26c
 > pmap_tlb_shootdown() at netbsd:pmap_tlb_shootdown+0x248
 > pmap_kremove() at netbsd:pmap_kremove+0x124
 > uvm_pagermapout() at netbsd:uvm_pagermapout+0x40
 > uvm_aio_aiodone() at netbsd:uvm_aio_aiodone+0xf4
 >
 > I'm still in ddb if I need to do anything else.

    It confirms one of the deadlocks I suspected.  I should soon have a
 patch you can try out.

 --
 Michael L. Hitch			mhitch@montana.edu
 Computer Consultant
 Information Technology Center
 Montana State University	Bozeman, MT	USA

From: "Michael L. Hitch" <mhitch@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/42174 CVS commit: src/sys/arch/alpha/alpha
Date: Mon, 9 Nov 2009 04:31:04 +0000

 Module Name:	src
 Committed By:	mhitch
 Date:		Mon Nov  9 04:31:03 UTC 2009

 Modified Files:
 	src/sys/arch/alpha/alpha: pmap.c

 Log Message:
 The tlb shootdown IPI can interrupt the pool_cache_get() in places that
 can cause a deadlock or pool cache corruption.  Take the shootdown job
 queue mutex before calling pool_cache_get(), which will block the IPI
 interrupts and seems to fix the remaining tlb shootdown deadlocks and
 pool cache corruption I've been seeing.  Should address both
 PR port-amiga/38335 and PR port-amiga/42174.


 To generate a diff of this commit:
 cvs rdiff -u -r1.249 -r1.250 src/sys/arch/alpha/alpha/pmap.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: dmarquess@gmail.com
Subject: Re: port-alpha/42174
Date: Sun, 8 Nov 2009 21:36:36 -0700 (MST)

 On Thu, 5 Nov 2009, Michael L. Hitch wrote:

 > > I'm still in ddb if I need to do anything else.
 >
 >    It confirms one of the deadlocks I suspected.  I should soon have a
 > patch you can try out.

    I've committed a fix to current that should take care of this.

 Index: sys/arch/alpha/alpha/pmap.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
 retrieving revision 1.249
 diff -u -p -r1.249 pmap.c
 --- sys/arch/alpha/alpha/pmap.c 7 Nov 2009 07:27:40 -0000       1.249
 +++ sys/arch/alpha/alpha/pmap.c 9 Nov 2009 04:21:12 -0000
 @@ -3581,6 +3581,7 @@ pmap_tlb_shootdown(pmap_t pmap, vaddr_t
                  cpumask |= 1UL << ci->ci_cpuid;

                  pq = &pmap_tlb_shootdown_q[ci->ci_cpuid];
 +               mutex_spin_enter(&pq->pq_lock);

                  /*
                   * Allocate a job.
 @@ -3596,7 +3597,6 @@ pmap_tlb_shootdown(pmap_t pmap, vaddr_t
                   * If a global flush is already pending, we
                   * don't really have to do anything else.
                   */
 -               mutex_spin_enter(&pq->pq_lock);
                  pq->pq_pte |= pte;
                  if (pq->pq_tbia) {
                          mutex_spin_exit(&pq->pq_lock);


 --
 Michael L. Hitch			mhitch@montana.edu
 Computer Consultant
 Information Technology Center
 Montana State University	Bozeman, MT	USA

From: Dustin Marquess <dmarquess@gmail.com>
To: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
Cc: gnats-bugs@netbsd.org
Subject: Re: port-alpha/42174
Date: Tue, 10 Nov 2009 09:54:18 -0600

 I applied Michael's patch along w/ Jason Thorpe's L1 patch to a fresh
 netbsd-5 checkout.  Recompiled w/ DEBUG/DIAGNOSTIC/LOCKDEBUG.  So far
 the new kernel has been banged on for 24 hours w/o even a slight
 hiccup.  Before the patches, it was lucky if it made it 5 minutes w/
 DIAGNOSTIC/LOCKDEBUG.

 So while it hasn't had in-depth testing, it is definitely a *LOT* more
 stable than what it was, so hopefully this PR can be resolved now.

 On a side note, the disk issue I was having seems to be unrelated.  It
 appears that something (not sure if disk itself or controller) would
 get in some kind of weird stat.  That would cause problems and make it
 appear that the disk had no label.  I swapped the drive from the
 Adaptec 29160 that I had installed to make Tru64 happy back to the
 internal 53c1010, and so far the problem hasn't reoccurred.  So it
 seems to be completely unrelated, sorry!

State-Changed-From-To: open->pending-pullups
State-Changed-By: mhitch@NetBSD.org
State-Changed-When: Tue, 10 Nov 2009 18:48:40 +0000
State-Changed-Why:
MP is much more stable after recent changes.
Pullups have been requested to the netbsd 5 branches.


From: Soren Jacobsen <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/42174 CVS commit: [netbsd-5] src/sys/arch/alpha/alpha
Date: Sun, 15 Nov 2009 05:54:38 +0000

 Module Name:	src
 Committed By:	snj
 Date:		Sun Nov 15 05:54:38 UTC 2009

 Modified Files:
 	src/sys/arch/alpha/alpha [netbsd-5]: pmap.c

 Log Message:
 Pull up following revision(s) (requested by mhitch in ticket #1137):
 	sys/arch/alpha/alpha/pmap.c: revision 1.250
 The tlb shootdown IPI can interrupt the pool_cache_get() in places that
 can cause a deadlock or pool cache corruption.  Take the shootdown job
 queue mutex before calling pool_cache_get(), which will block the IPI
 interrupts and seems to fix the remaining tlb shootdown deadlocks and
 pool cache corruption I've been seeing.  Should address both
 PR port-alpha/38335 and PR port-alpha/42174.


 To generate a diff of this commit:
 cvs rdiff -u -r1.236.10.3 -r1.236.10.4 src/sys/arch/alpha/alpha/pmap.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Soren Jacobsen <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/42174 CVS commit: [netbsd-5-0] src/sys/arch/alpha/alpha
Date: Sun, 15 Nov 2009 05:55:01 +0000

 Module Name:	src
 Committed By:	snj
 Date:		Sun Nov 15 05:55:00 UTC 2009

 Modified Files:
 	src/sys/arch/alpha/alpha [netbsd-5-0]: pmap.c

 Log Message:
 Pull up following revision(s) (requested by mhitch in ticket #1137):
 	sys/arch/alpha/alpha/pmap.c: revision 1.250
 The tlb shootdown IPI can interrupt the pool_cache_get() in places that
 can cause a deadlock or pool cache corruption.  Take the shootdown job
 queue mutex before calling pool_cache_get(), which will block the IPI
 interrupts and seems to fix the remaining tlb shootdown deadlocks and
 pool cache corruption I've been seeing.  Should address both
 PR port-alpha/38335 and PR port-alpha/42174.


 To generate a diff of this commit:
 cvs rdiff -u -r1.236.14.2 -r1.236.14.3 src/sys/arch/alpha/alpha/pmap.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: pending-pullups->closed
State-Changed-By: snj@NetBSD.org
State-Changed-When: Sun, 15 Nov 2009 06:12:05 +0000
State-Changed-Why:
Pullups complete.


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.