NetBSD Problem Report #33234

From jld@panix.com  Mon Apr 10 22:52:29 2006
Return-Path: <jld@panix.com>
Received: from mail3.panix.com (mail3.panix.com [166.84.1.74])
	by narn.netbsd.org (Postfix) with ESMTP id E89CD63B9C2
	for <gnats-bugs@gnats.NetBSD.org>; Mon, 10 Apr 2006 22:52:28 +0000 (UTC)
Message-Id: <200604102252.k3AMqNL12237@byzantium.nyc.access.net>
Date: Mon, 10 Apr 2006 18:52:23 -0400 (EDT)
From: jld@panix.com
Reply-To: jld@panix.com
To: gnats-bugs@netbsd.org
Subject: LOCKDEBUG MP panic in i386 pmap_load from sys_execve
X-Send-Pr-Version: 3.95

>Number:         33234
>Category:       kern
>Synopsis:       MP LOCKDEBUG assertion failure, due to uninitialized pmap during exec
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Apr 10 22:55:00 +0000 2006
>Closed-Date:    Wed Nov 01 13:18:37 +0000 2017
>Last-Modified:  Wed Nov 01 13:18:37 +0000 2017
>Originator:     Jed Davis
>Release:        NetBSD 3.0
>Organization:
PANIX Public Access Internet and UNIX, NYC
>Environment:
System: NetBSD mail2.panix.com 3.0 NetBSD 3.0 (PANIX-STD-MP-DEBUG) #0: Fri Apr  7 04:35:36 EDT 2006  root@juggler.panix.com:/devel/netbsd/3.0/src/sys/arch/i386/compile/PANIX-STD-MP-DEBUG i386
Architecture: i386
Machine: i386
>Description:

We're trying to track down some MP-related panic and deadlock problems
in 3.0/i386 that weren't an issue under 2.0.x, so this host was
running a kernel with DIAGNOSTIC, DEBUG, and LOCKDEBUG.  Here's the
panic and trace:

panic: kernel debugging assertion "(v == __SIMPLELOCK_LOCKED) || (v == __SIMPLELOCK_UNLOCKED)" failed: file "../../../../arch/x86/x86/lock_machdep.c", line 83
Begin traceback...
__main(c042a890,c0468d20,53,c0468ce0,1) at netbsd:__main
__cpu_simple_lock(cec89cf4,c049b420,1,286,c049b420) at netbsd:__cpu_simple_lock+0xd5
_simple_lock(cec89cf4,c046a3e0,73b,c049b420,cec89cf4) at netbsd:_simple_lock+0x7a
pmap_reference(cec89cf4,c049489c,480,297,282) at netbsd:pmap_reference+0x1a
pmap_load(c0266bb7,cce38000,804a000,480,cec5d1a4) at netbsd:pmap_load+0xc4
copyout(cce38000,480,cec03d14,282,1000) at netbsd:copyout+0xf
ffs_read(cec03cb4,ce7041f4,10001,20001,c03b5360) at netbsd:ffs_read+0x4a6
VOP_READ(ce7041f4,cec03d14,1,ccd20000,0) at netbsd:VOP_READ+0x34
vn_rdwr(0,ce7041f4,804a000,480,1000) at netbsd:vn_rdwr+0xb4
vmcmd_readvn(cedb422c,c227521c,bfc00000,0,0) at netbsd:vmcmd_readvn+0x2f
sys_execve(cec5d1a4,cec03f64,cec03f5c,c04930c4,282) at netbsd:sys_execve+0x620
syscall_plain() at netbsd:syscall_plain+0x1a5

The value at cec89cf4 (the "v" in the assert) is 0xDEADBEEF, as is much
of the rest of the struct pmap that should reside at that location.
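
To illustrate why a poisoned lock word trips this assertion, here is a
minimal standalone sketch. It is not the code from lock_machdep.c: the
names are hypothetical, and the values 0 and 1 for the unlocked and
locked states are an assumption. Only 0xDEADBEEF and the shape of the
asserted condition come from the report above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed lock-word values; hypothetical names, not the kernel's macros. */
#define LOCK_UNLOCKED 0u
#define LOCK_LOCKED   1u
/* Fill pattern seen in the freed struct pmap in this report. */
#define POOL_POISON   0xdeadbeefu

/* Same shape as the asserted condition: the word must be one of two states. */
static bool lock_word_is_sane(uint32_t v)
{
    return v == LOCK_LOCKED || v == LOCK_UNLOCKED;
}

/* Heuristic: does the lock word look like freed, poisoned memory? */
static bool lock_word_is_poisoned(uint32_t v)
{
    return v == POOL_POISON;
}

/* Stand-in for the debugging assertion; aborts on a corrupt lock word. */
static void check_lock_word(uint32_t v)
{
    assert(lock_word_is_sane(v));
}
```

Under these assumptions, a lock embedded in a structure that has been
freed and overwritten with the fill pattern can never pass the check,
so the assertion points at a use-after-free rather than at lock
contention.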

We have a core, but it wasn't dumped until after the host panicked again
trying to sync disks (ddb.onpanic was 0); there, stack traces from both
CPUs were written to the serial console at the same time, making them
partially illegible.

>How-To-Repeat:

Letting this host (a mail relay) run a DIAGNOSTIC/DEBUG/LOCKDEBUG kernel
for a few hours usually makes it panic from something; with a regular MP
kernel, it takes longer.

The sys_execve -> vmcmd_readvn -> vn_rdwr -> copyout path seems
particularly troubled.

>Fix:

A seeming workaround: run a uniprocessor kernel.  This costs us a CPU (a 
real one, too, not a "hyperthread"), but it improves stability.

>Release-Note:

>Audit-Trail:
From: Jed Davis <jld@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/33234
Date: Tue, 11 Apr 2006 02:01:39 -0400

 This one is happening fairly frequently, and so I've determined (by
 chasing pointers by hand in ddb; ick) that, after the panic, the vmspace
 in question (ci->ci_curlwp->l_proc->p_vmspace) has been returned to the
 pool and deadbeef'ed.  And that clearly hasn't happened at the top of
 pmap_load, where it gets a valid pointer to the pmap (which has been
 destroyed by the time pmap_reference is reached).  (I also now have a
 core that's not potentially tainted by the double-panic on sync.)

 This really does seem like insufficient locking on the vmspace, as if
 one process thinks it's removed the last reference and destroys it while
 another is switching to it on the other CPU; except that doesn't make
 much sense in this context, assuming I understand the context correctly.
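
 The suspected window can be sketched with a generic reference count.
 This is only an illustration in portable C11, not the NetBSD pmap or
 vmspace code; struct obj, obj_tryref and obj_unref are hypothetical
 names. It shows a "take a reference only if the object is still live"
 operation, which is what the path from the top of pmap_load to
 pmap_reference would need if the race is what it appears to be.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted object such as a pmap. */
struct obj {
    atomic_int refs;    /* 0 means the last holder has begun teardown */
};

/* Take a reference only if the object is still live. */
static bool obj_tryref(struct obj *o)
{
    int old = atomic_load(&o->refs);

    while (old > 0) {
        /* On failure, old is reloaded with the current count. */
        if (atomic_compare_exchange_weak(&o->refs, &old, old + 1))
            return true;
    }
    return false;       /* count already hit zero; do not revive */
}

/* Drop a reference; returns true if the caller dropped the last one. */
static bool obj_unref(struct obj *o)
{
    int old = atomic_fetch_sub(&o->refs, 1);

    assert(old > 0);    /* dropping below zero would be a bug */
    return old == 1;
}
```

 Note that this only helps while the memory holding the counter is
 still valid. If the object has already gone back to the pool, as seen
 here, the pointer itself must be protected by a lock or by a reference
 taken earlier.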

State-Changed-From-To: open->feedback
State-Changed-By: chs@NetBSD.org
State-Changed-When: Mon, 19 Mar 2012 01:10:07 +0000
State-Changed-Why:
Have you seen this problem more recently, with NetBSD 5.0 or later?


State-Changed-From-To: feedback->closed
State-Changed-By: maya@NetBSD.org
State-Changed-When: Wed, 01 Nov 2017 13:18:37 +0000
State-Changed-Why:
Feedback timeout. A lot of this code has changed since, and these problems don't happen much. Even if they did, the information in the bug report is dated. Feel free to report another bug if you are still having issues.


>Unformatted: