NetBSD Problem Report #42174
From www@NetBSD.org Mon Oct 12 06:46:04 2009
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by www.NetBSD.org (Postfix) with ESMTP id 2F47363B8B6
for <gnats-bugs@gnats.netbsd.org>; Mon, 12 Oct 2009 06:46:04 +0000 (UTC)
Message-Id: <20091012064604.012E563B877@www.NetBSD.org>
Date: Mon, 12 Oct 2009 06:46:03 +0000 (UTC)
From: dmarquess@gmail.com
Reply-To: dmarquess@gmail.com
To: gnats-bugs@NetBSD.org
Subject: GENERIC.MP on Alpha results in hard lock
X-Send-Pr-Version: www-1.0
>Number: 42174
>Category: port-alpha
>Synopsis: GENERIC.MP on Alpha results in hard lock
>Confidential: no
>Severity: non-critical
>Priority: medium
>Responsible: port-alpha-maintainer
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Oct 12 06:50:00 +0000 2009
>Closed-Date: Sun Nov 15 06:12:05 +0000 2009
>Last-Modified: Sun Nov 15 06:12:05 +0000 2009
>Originator: Dustin Marquess
>Release: 5.0_STABLE
>Organization:
>Environment:
NetBSD 5.0_STABLE (GENERIC-$Revision: 1.325.4.1 $) #0: Sun Oct 11 02:37:40 UTC 2009
builds@b8.netbsd.org:/home/builds/ab/netbsd-5/alpha/200910110000Z-obj/home/builds/ab/netbsd-5/src/sys/arch/alpha/compile/GENERIC.MP
>Description:
Installed 5.0_STABLE on an API CS20 / HP DS20L & then booted a GENERIC.MP kernel. SSHd into the machine, downloaded the latest pkgsrc.tar.bz2, and proceeded to uncompress it. About 70% of the way into it, the machine promptly hard locked. No response is given on the network, or the serial console. Not sure how to debug w/o serial console.
Michael L. Hitch documented this issue when fixing PR #41106, however no PR seems to have been created for this problem.
On a side note, the machine was setup using FFSv1 + WAPBL. When rebooting, it appears that the filesystem was hosed. Not sure if this is related or not.
>How-To-Repeat:
Use a MP kernel on an Alpha.
>Fix:
Unknown. Not sure how to debug w/o serial console.
>Release-Note:
>Audit-Trail:
From: Dustin Marquess <dmarquess@gmail.com>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: port-alpha/42174
Date: Fri, 16 Oct 2009 22:52:13 -0500
Got the machine to deadlock again. Used the halt button to get into
ddb and this is the stack trace that I got:
db{0}> trace
cpu_Debugger() at netbsd:cpu_Debugger+0x4
panic() at netbsd:panic+0x244
console_restart() at netbsd:console_restart+0x78
XentRestart() at netbsd:XentRestart+0x90
--- console restart (from ipl 4) ---
nullop() at netbsd:nullop
_kernel_lock() at netbsd:_kernel_lock+0x1c0
interrupt() at netbsd:interrupt+0x2a4
XentInt() at netbsd:XentInt+0x1c
--- interrupt (from ipl 0) ---
kpreempt_disable() at netbsd:kpreempt_disable
sched_curcpu_runnable_p() at netbsd:sched_curcpu_runnable_p+0x1c
idle_loop() at netbsd:idle_loop+0x17c
exception_return() at netbsd:exception_return
--- root of call graph ---
From: Dustin Marquess <dmarquess@gmail.com>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: port-alpha/42174
Date: Sun, 25 Oct 2009 12:45:54 -0500
I've triggered this bug about 4 times now, and every single time it
hits, when I reboot the disklabel on the drive is completely gone.
This seems like a huge coincidence to me.
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: gnats-admin@netbsd.org, dmarquess@gmail.com
Subject: Re: port-alpha/42174
Date: Mon, 26 Oct 2009 11:50:24 -0600 (MDT)
On Sat, 17 Oct 2009, Dustin Marquess wrote:
> Got the machine to deadlock again. Used the halt button to get into
> ddb and this is the stack trace that I got:
>
> db{0}> trace
> cpu_Debugger() at netbsd:cpu_Debugger+0x4
> panic() at netbsd:panic+0x244
> console_restart() at netbsd:console_restart+0x78
> XentRestart() at netbsd:XentRestart+0x90
> --- console restart (from ipl 4) ---
> nullop() at netbsd:nullop
> _kernel_lock() at netbsd:_kernel_lock+0x1c0
This would indicate that CPU0 is attempting to acquire the kernel lock,
but presumably CPU1 currently has it and is unable to release it for some
reasone.
I don't know of any easy way to determine what CPU1 is currently doing.
It most like is unable to process the IPI interrupt sent by entering DDB
and pausing. Even if it was able to, something is broken in looking at a
backtrace in the other cpus, although I think the register contents can be
displayed when the cpu is able to respond to the IPI pause request.
A LOCKDEBUG kernel would provide more information on what is currently
locked, but I don't think that's working on the alpha at the moment.
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: port-alpha/42174
Date: Fri, 30 Oct 2009 06:40:35 +0000
On Sun, Oct 25, 2009 at 05:55:02PM +0000, Dustin Marquess wrote:
> I've triggered this bug about 4 times now, and every single time it
> hits, when I reboot the disklabel on the drive is completely gone.
> This seems like a huge coincidence to me.
The *disklabel*? Properly speaking that shouldn't be possible - the
disklabel should be read-only during all normal operation...
--
David A. Holland
dholland@netbsd.org
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: dmarquess@gmail.com
Subject: Re: port-alpha/42174
Date: Sun, 1 Nov 2009 20:36:38 -0700 (MST)
On Mon, 26 Oct 2009, Michael L. Hitch wrote:
> A LOCKDEBUG kernel would provide more information on what is currently
> locked, but I don't think that's working on the alpha at the moment.
I've committed a fix for the LOCKDEBUG problem on the CS20/DS20L, and it
has been pulled up to the netbsd-5 branch. One of the things that
LOCKDEBUG will do is to limit how long the cpu spins waiting for a lock,
so should eventually panic instead of hanging forever deadlocked. It also
keeps more information about the locks, including which cpu holds the lock
and the address where the lock was acquired.
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
From: Dustin Marquess <dmarquess@gmail.com>
To: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
Cc: gnats-bugs@netbsd.org
Subject: Re: port-alpha/42174
Date: Mon, 2 Nov 2009 11:42:08 -0600
I just hit:
panic: kernel diagnostic assertion "pcg->pcg_avail == 0" failed: file
"../../../../kern/subr_pool.c", line 2580
Stopped in pid 0.12 (system) at netbsd:cpu_Debugger+0x4: ret zero,(ra
)
db{1}> bt
cpu_Debugger() at netbsd:cpu_Debugger+0x4
panic() at netbsd:panic+0x244
__kernassert() at netbsd:__kernassert+0x4c
pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x2f0
pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x1b0
pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x174
alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
interrupt() at netbsd:interrupt+0x84
XentInt() at netbsd:XentInt+0x1c
--- interrupt (from ipl 0) ---
idle_loop() at netbsd:idle_loop+0x19c
cpu_spinup_trampoline() at netbsd:cpu_spinup_trampoline+0x5c
--- root of call graph ---
Not sure if that's related to DIAGNOSTIC or LOCKDEBUG. If it's
related to DIAGNOSTIC I'll recompile with that disabled for now.
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: dmarquess@gmail.com
Subject: Re: port-alpha/42174
Date: Mon, 2 Nov 2009 11:22:33 -0700 (MST)
On Mon, 2 Nov 2009, Dustin Marquess wrote:
> I just hit:
>
> panic: kernel diagnostic assertion "pcg->pcg_avail == 0" failed: file
> "../../../../kern/subr_pool.c", line 2580
Hmm, I wonder if this might be related to some of the pool cache
corruption I've seen at times.
> Not sure if that's related to DIAGNOSTIC or LOCKDEBUG. If it's
> related to DIAGNOSTIC I'll recompile with that disabled for now.
The KASSERT() is enabled by DIAGNOSTIC, which I now see isn't enabled
in my kernel and would explain why I haven't seen this happen. I
would presume that the condition being tested shouldn't have happened,
and ignoring it by turning off DIAGNOSTIC would likely just result in
problems later.
There's a couple of things you can try that essentially disable the
use of using pool cache entries to shootdown tlbs. One is a variable
you can patch on boot or change in the source to disable all the pool
caching. The variable is pool_cache_disable in sys/kern/subr_pool.c
about line 182. This can also be set temporarily on boot by using
the -d flag to enter the debugger on startup and setting it with the
command "write pool_cache_disable 1" and continuing.
The other thing is to change the maximum number of TLB entries to
shootdown from 6 to 0 in sys/arch/alpha/alpha/pmap.c, about line 433.
This will force the tlb shootdown routine to always invalid all tlb
entries and never attempt to use pool cache entries.
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
From: Dustin Marquess <dmarquess@gmail.com>
To: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
Cc: gnats-bugs@netbsd.org
Subject: Re: port-alpha/42174
Date: Tue, 3 Nov 2009 20:35:52 -0600
Okay, got it back up and running and within minutes hit:
Mutex error: mutex_vector_enter: spinout
lock address : 0xfffffc0000c85fe8 type : spin
initialized : 0xfffffc00006eb9b8
shared holds : 0 exclusive: 0
shares wanted: 0 exclusive: 3
current cpu : 0 last held: 1
current lwp : 0xfffffc007f69d800 last held: 000000000000000000
last locked : 0xfffffc00006eccc8 unlocked : 0xfffffc00006ece90
owner field : 0x0000000000000400 wait/spin: 0/1
panic: LOCKDEBUG
Stopped in pid 0.36 (system) at netbsd:cpu_Debugger+0x4: ret zero,(ra
)
db{0}> bt
cpu_Debugger() at netbsd:cpu_Debugger+0x4
panic() at netbsd:panic+0x244
lockdebug_abort1() at netbsd:lockdebug_abort1+0x150
mutex_abort() at netbsd:mutex_abort+0x34
mutex_vector_enter() at netbsd:mutex_vector_enter+0x324
pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x198
pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x1b0
pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x174
alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
interrupt() at netbsd:interrupt+0x84
XentInt() at netbsd:XentInt+0x1c
--- interrupt (from ipl 4) ---
lockdebug_wantlock() at netbsd:lockdebug_wantlock+0x218
mutex_tryenter() at netbsd:mutex_tryenter+0x274
pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x78
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x26c
pmap_tlb_shootdown() at netbsd:pmap_tlb_shootdown+0x248
pmap_kremove() at netbsd:pmap_kremove+0x124
uvm_pagermapout() at netbsd:uvm_pagermapout+0x40
uvm_aio_aiodone() at netbsd:uvm_aio_aiodone+0xf4
I'm still in ddb if I need to do anything else.
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: dmarquess@gmail.com
Subject: Re: port-alpha/42174
Date: Thu, 5 Nov 2009 11:59:39 -0700 (MST)
On Wed, 4 Nov 2009, Dustin Marquess wrote:
> panic: LOCKDEBUG
> Stopped in pid 0.36 (system) at netbsd:cpu_Debugger+0x4: ret zero,(ra
> )
> db{0}> bt
> cpu_Debugger() at netbsd:cpu_Debugger+0x4
> panic() at netbsd:panic+0x244
> lockdebug_abort1() at netbsd:lockdebug_abort1+0x150
> mutex_abort() at netbsd:mutex_abort+0x34
> mutex_vector_enter() at netbsd:mutex_vector_enter+0x324
> pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x198
> pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x1b0
> pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x174
> alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
> interrupt() at netbsd:interrupt+0x84
> XentInt() at netbsd:XentInt+0x1c
> --- interrupt (from ipl 4) ---
> lockdebug_wantlock() at netbsd:lockdebug_wantlock+0x218
> mutex_tryenter() at netbsd:mutex_tryenter+0x274
> pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x78
> pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x26c
> pmap_tlb_shootdown() at netbsd:pmap_tlb_shootdown+0x248
> pmap_kremove() at netbsd:pmap_kremove+0x124
> uvm_pagermapout() at netbsd:uvm_pagermapout+0x40
> uvm_aio_aiodone() at netbsd:uvm_aio_aiodone+0xf4
>
> I'm still in ddb if I need to do anything else.
It confirms one of the deadlocks I suspected. I should soon have a
patch you can try out.
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
From: "Michael L. Hitch" <mhitch@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/42174 CVS commit: src/sys/arch/alpha/alpha
Date: Mon, 9 Nov 2009 04:31:04 +0000
Module Name: src
Committed By: mhitch
Date: Mon Nov 9 04:31:03 UTC 2009
Modified Files:
src/sys/arch/alpha/alpha: pmap.c
Log Message:
The tlb shootdown IPI can interrupt the pool_cache_get() in places that
can cause a deadlock or pool cache corruption. Take the shootdown job
queue mutex before calling pool_cache_get(), which will block the IPI
interrupts and seems to fix the remaining tlb shootdown deadlocks and
pool cache corruption I've been seeing. Should address both
PR port-amiga/38335 and PR port-amiga/42174.
To generate a diff of this commit:
cvs rdiff -u -r1.249 -r1.250 src/sys/arch/alpha/alpha/pmap.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc: dmarquess@gmail.com
Subject: Re: port-alpha/42174
Date: Sun, 8 Nov 2009 21:36:36 -0700 (MST)
On Thu, 5 Nov 2009, Michael L. Hitch wrote:
> > I'm still in ddb if I need to do anything else.
>
> It confirms one of the deadlocks I suspected. I should soon have a
> patch you can try out.
I've committed a fix to current that should take care of this.
Index: sys/arch/alpha/alpha/pmap.c
===================================================================
RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
retrieving revision 1.249
diff -u -p -r1.249 pmap.c
--- sys/arch/alpha/alpha/pmap.c 7 Nov 2009 07:27:40 -0000 1.249
+++ sys/arch/alpha/alpha/pmap.c 9 Nov 2009 04:21:12 -0000
@@ -3581,6 +3581,7 @@ pmap_tlb_shootdown(pmap_t pmap, vaddr_t
cpumask |= 1UL << ci->ci_cpuid;
pq = &pmap_tlb_shootdown_q[ci->ci_cpuid];
+ mutex_spin_enter(&pq->pq_lock);
/*
* Allocate a job.
@@ -3596,7 +3597,6 @@ pmap_tlb_shootdown(pmap_t pmap, vaddr_t
* If a global flush is already pending, we
* don't really have to do anything else.
*/
- mutex_spin_enter(&pq->pq_lock);
pq->pq_pte |= pte;
if (pq->pq_tbia) {
mutex_spin_exit(&pq->pq_lock);
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
From: Dustin Marquess <dmarquess@gmail.com>
To: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
Cc: gnats-bugs@netbsd.org
Subject: Re: port-alpha/42174
Date: Tue, 10 Nov 2009 09:54:18 -0600
I applied Michael's patch along w/ Jason Thorpe's L1 patch to a fresh
netbsd-5 checkout. Recompiled w/ DEBUG/DIAGNOSTIC/LOCKDEBUG. So far
the new kernel has been banged on for 24 hours w/o even a slight
hiccup. Before the patches, it was lucky if it made it 5 minutes w/
DIAGNOSTIC/LOCKDEBUG.
So while it hasn't had in-depth testing, it is definitely a *LOT* more
stable than what it was, so hopefully this PR can be resolved now.
On a side note, the disk issue I was having seems to be unrelated. It
appears that something (not sure if disk itself or controller) would
get in some kind of weird stat. That would cause problems and make it
appear that the disk had no label. I swapped the drive from the
Adaptec 29160 that I had installed to make Tru64 happy back to the
internal 53c1010, and so far the problem hasn't reoccurred. So it
seems to be completely unrelated, sorry!
State-Changed-From-To: open->pending-pullups
State-Changed-By: mhitch@NetBSD.org
State-Changed-When: Tue, 10 Nov 2009 18:48:40 +0000
State-Changed-Why:
MP is much more stable after recent changes.
Pullups have been requested to the netbsd 5 branches.
From: Soren Jacobsen <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/42174 CVS commit: [netbsd-5] src/sys/arch/alpha/alpha
Date: Sun, 15 Nov 2009 05:54:38 +0000
Module Name: src
Committed By: snj
Date: Sun Nov 15 05:54:38 UTC 2009
Modified Files:
src/sys/arch/alpha/alpha [netbsd-5]: pmap.c
Log Message:
Pull up following revision(s) (requested by mhitch in ticket #1137):
sys/arch/alpha/alpha/pmap.c: revision 1.250
The tlb shootdown IPI can interrupt the pool_cache_get() in places that
can cause a deadlock or pool cache corruption. Take the shootdown job
queue mutex before calling pool_cache_get(), which will block the IPI
interrupts and seems to fix the remaining tlb shootdown deadlocks and
pool cache corruption I've been seeing. Should address both
PR port-alpha/38335 and PR port-alpha/42174.
To generate a diff of this commit:
cvs rdiff -u -r1.236.10.3 -r1.236.10.4 src/sys/arch/alpha/alpha/pmap.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Soren Jacobsen <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/42174 CVS commit: [netbsd-5-0] src/sys/arch/alpha/alpha
Date: Sun, 15 Nov 2009 05:55:01 +0000
Module Name: src
Committed By: snj
Date: Sun Nov 15 05:55:00 UTC 2009
Modified Files:
src/sys/arch/alpha/alpha [netbsd-5-0]: pmap.c
Log Message:
Pull up following revision(s) (requested by mhitch in ticket #1137):
sys/arch/alpha/alpha/pmap.c: revision 1.250
The tlb shootdown IPI can interrupt the pool_cache_get() in places that
can cause a deadlock or pool cache corruption. Take the shootdown job
queue mutex before calling pool_cache_get(), which will block the IPI
interrupts and seems to fix the remaining tlb shootdown deadlocks and
pool cache corruption I've been seeing. Should address both
PR port-alpha/38335 and PR port-alpha/42174.
To generate a diff of this commit:
cvs rdiff -u -r1.236.14.2 -r1.236.14.3 src/sys/arch/alpha/alpha/pmap.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: pending-pullups->closed
State-Changed-By: snj@NetBSD.org
State-Changed-When: Sun, 15 Nov 2009 06:12:05 +0000
State-Changed-Why:
Pullups complete.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.