NetBSD Problem Report #36608
From sverre@abbor.fesk.com Thu Jul 5 02:28:20 2007
Return-Path: <sverre@abbor.fesk.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by narn.NetBSD.org (Postfix) with ESMTP id 7361563B882
for <gnats-bugs@gnats.NetBSD.org>; Thu, 5 Jul 2007 02:28:20 +0000 (UTC)
Message-Id: <20070705002834.A81D71D0@abbor.fesk.com>
Date: Wed, 4 Jul 2007 18:28:34 -0600 (MDT)
From: sverre@viewmark.com
Reply-To: sverre@viewmark.com
To: gnats-bugs@NetBSD.org
Subject: LFS related panic with LOCKDEBUG
X-Send-Pr-Version: 3.95
>Number: 36608
>Category: kern
>Synopsis: Panic in LFS with LOCKDEBUG defined (since mid April)
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: kern-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Thu Jul 05 02:30:00 +0000 2007
>Closed-Date: Wed Nov 18 17:54:14 +0000 2009
>Last-Modified: Wed Nov 18 17:54:14 +0000 2009
>Originator: Sverre Froyen
>Release: NetBSD 4.99.22 (2007-07-04)
>Organization:
Viewmark
>Environment:
System: NetBSD abbor.fesk.com 4.99.22 NetBSD 4.99.22 (GENERIC_LAPTOP) #1: Wed Jul 4 09:10:15 MDT 2007 toor@abbor.fesk.com:/usr/src/sys/arch/i386/compile/GENERIC_LAPTOP i386 (GENERIC_LAPTOP has LOCKDEBUG defined)
Architecture: i386
Machine: i386
>Description:
I get reproducible LFS related panics when running the command
bogofilter -n < mailmessage
where mailmessage is a single plain email message.
In order for the panics to occur, LOCKDEBUG has to be defined and the
bogofilter database (.bogofilter/wordlist.db) has to be in a certain state.
The result is (copied from the screen):
switching with held simple_lock 0xcd155b2c CPU 0 ../../../../ufs/lfs/lfs_vnops.c:1742
0xccf9bd80:
Stopped in pid 679.1 (bogofilter) at netbsd:cpu_Debugger+0x4: popl %ebp
db> bt
cpu_Debugger(0,c07259a6,ccf7c9ac,c03bbf4a,c0815aa8) at netbsd:cpu_Debugger+0x4
simple_lock_switchcheck(c0815aa8,1,ccf7c9bc,c03cc255,c081f400) at netbsd:simple_lock_switchcheck+0x1b
mi_switch(ccf9bd80,11,ccf7c9dc,c03aed6f,c07b276a) at netbsd:mi_switch+0x2a
sleepq_block(0,0,b1,c0723def,c0795570) at netbsd:sleepq_block+0x10a
ltsleep(ccb30fa4,11,c0723def,0,c16e9754) at netbsd:ltsleep+0x151
lfs_segunlock(c16e9000,0,8f2,7fffffff,45c000) at netbsd:lfs_segunlock+0x224
lfs_putpages(ccf7cb30,ccefea84,1,c064cd40,cd155b2c) at netbsd:lfs_putpages+0xcb1
VOP_PUTPAGES(cd155b2c,0,0,0,0) at netbsd:VOP_PUTPAGES+0x40
lfs_fsync(ccf7cbb8,10002,ccf7cbdc,c040b48f,cd155b2c) at netbsd:lfs_fsync+0x13f
VOP_FSYNC(cd155b2c,cd00faa8,3,0,0) at netbsd:VOP_FSYNC+0x49
sys_fdatasync(ccf9bd80,ccf7cc48,ccf7cc68,640,ccf9bd80) at netbsd:sys_fdatasync+0x95
syscall_plain at netbsd:syscall_plain+0x116
--- syscall (number 241) ---
0xbb88c16b:
db> show lock ufs_hashlock
lock address : 0x00000000c0811554 type : sleep/adaptive
shared holds : 0 exclusive: 0
shares wanted: 0 exclusive: 0
current cpu : 0 last held: 0
current lwp : 0x00000000ccf9bd80 last held: 000000000000000000
last locked : 0x00000000c034005a unlocked : 0x00000000c03400fe
owner field : 000000000000000000 wait/spin: 0/0
Turnstile chain at 0xc0817040.
=> No active turnstile for this lock.
db> show lock ufs_ihash_lock
lock address : 0x00000000c081154c type : sleep/adaptive
shared holds : 0 exclusive: 0
shares wanted: 0 exclusive: 0
current cpu : 0 last held: 0
current lwp : 0x00000000ccf9bd80 last held: 000000000000000000
last locked : 0x00000000c034ada7 unlocked : 0x00000000c03400f2
owner field : 000000000000000000 wait/spin: 0/0
Turnstile chain at 0xc0817020.
=> No active turnstile for this lock.
db> ps/l
PID LID S FLAGS STRUCT LWP * UAREA * WAIT
>679 > 1 3 0x20000004 0xccf9bd80 0xccf7cce0 seg_iocount
...
It looks like the problem was introduced on 2007-04-17 or 18. Kernels
before and including 2007-04-16 do not panic. Kernels after and including
2007-04-19 panic consistently.
See
http://mail-index.netbsd.org/current-users/2007/05/21/0028.html
http://mail-index.netbsd.org/current-users/2007/05/25/0004.html
for more information.
>How-To-Repeat:
Use the known bad bogofilter DB:
cp .bogofilter/wordlist.db.bad .bogofilter/wordlist.db
Reboot with an affected kernel and type:
bogofilter -n < mailmessage
>Fix:
http://mail-index.netbsd.org/source-changes/2009/11/17/msg003309.html
>Release-Note:
>Audit-Trail:
From: Sverre Froyen <sverre@viewmark.com>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/36608: LFS related panic with LOCKDEBUG
Date: Mon, 30 Jul 2007 16:03:04 -0600
In lfs_vnops.c there is a comment about genfs_putpages stating:
* (2) It needs to explicitly send blocks to be written when it is done.
* If VOP_PUTPAGES is called without the seglock held, we simply take
* the seglock and let lfs_segunlock wait for us.
* XXX There might be a bad situation if we have to flush a vnode while
* XXX lfs_markv is in operation. As of this writing we panic in this
* XXX case.
I have done a little more investigation and find that I consistently get a
double-lock panic on the vnode(?) that is locked immediately before the call
to lfs_segunlock, around line 2290 in lfs_vnops.c:
    simple_unlock(&vp->v_interlock);
    simple_lock(&vp->v_interlock);
    write_and_wait(fs, vp, busypg, seglocked, NULL);
*** vp is locked at this point
    if (!seglocked) {
        lfs_release_finfo(fs);
        lfs_segunlock(fs);
*** I get the panic before the call to lfs_segunlock returns
    }
    sp->vp = NULL;
    goto get_seglock;
It looks like lfs_segunlock is sleeping in the second while loop in this code
snippet from lfs_subr.c:
    simple_lock(&fs->lfs_interlock);
    while (ckp && sync && fs->lfs_iocount)
        (void)ltsleep(&fs->lfs_iocount, PRIBIO + 1,
            "lfs_iocount", 0, &fs->lfs_interlock);
    while (sync && sp->seg_iocount) {
        (void)ltsleep(&sp->seg_iocount, PRIBIO + 1,
            "seg_iocount", 0, &fs->lfs_interlock);
        DLOG((DLOG_SEG, "sleeping on iocount %x == %d\n", sp,
            sp->seg_iocount));
    }
    simple_unlock(&fs->lfs_interlock);
I do not know if the comment above refers to the case I'm seeing or not, but
while lfs_segunlock is sleeping some other code comes along and attempts to
lock the vnode that was locked in genfs_putpages.
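The pattern described above can be illustrated with a small user-space model.
This is only a sketch, not NetBSD kernel code: struct model_lock, held_count,
model_sleep and the two putpages_* functions are hypothetical names invented
for illustration. It assumes that the check behind "switching with held
simple_lock" amounts to asking whether any simple lock is still held at the
moment of a context switch. The model shows that a sleep which drops only its
own interlock (as ltsleep does with fs->lfs_interlock) still switches with a
lock held if the caller took another lock beforehand (here standing in for
vp->v_interlock).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of a simple lock; not a kernel type. */
struct model_lock {
    const char *name;   /* label for readability only */
    bool held;
};

/* Number of model locks currently held by the (single) model thread. */
static int held_count;

static void model_lock_acquire(struct model_lock *l)
{
    assert(!l->held);   /* taking a held lock would be a double lock */
    l->held = true;
    held_count++;
}

static void model_lock_release(struct model_lock *l)
{
    assert(l->held);
    l->held = false;
    held_count--;
}

/*
 * Models a sleep that releases `interlock` while asleep and retakes it
 * on wakeup. Returns true if no lock was held at the switch point,
 * false if the assumed "switching with held lock" check would trip.
 */
static bool model_sleep(struct model_lock *interlock)
{
    model_lock_release(interlock);
    bool switch_ok = (held_count == 0);
    model_lock_acquire(interlock);
    return switch_ok;
}

/* Ordering as reported: the vnode lock stays held across the sleep. */
static bool putpages_sleep_with_vnode_lock(struct model_lock *vnode,
    struct model_lock *fs)
{
    model_lock_acquire(vnode);
    model_lock_acquire(fs);
    bool ok = model_sleep(fs);  /* drops fs only; vnode is still held */
    model_lock_release(fs);
    model_lock_release(vnode);
    return ok;
}

/*
 * One possible alternative ordering (an assumption for illustration,
 * not the committed fix): release the vnode lock before sleeping.
 */
static bool putpages_drop_vnode_lock_first(struct model_lock *vnode,
    struct model_lock *fs)
{
    model_lock_acquire(vnode);
    model_lock_release(vnode);
    model_lock_acquire(fs);
    bool ok = model_sleep(fs);  /* nothing else is held here */
    model_lock_release(fs);
    return ok;
}
```

In this model the first function reports an unsafe switch and the second a
safe one. Note also that the lock address in the panic message (0xcd155b2c)
matches the first argument of VOP_PUTPAGES in the backtrace, which is
consistent with the vnode's lock being the one held across the sleep.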
State-Changed-From-To: open->closed
State-Changed-By: eeh@NetBSD.org
State-Changed-When: Wed, 18 Nov 2009 17:54:14 +0000
State-Changed-Why:
I think the fix for a hang also fixes this problem. Someone should verify the fix.
>Unformatted: