NetBSD Problem Report #36608

From sverre@abbor.fesk.com  Thu Jul  5 02:28:20 2007
Return-Path: <sverre@abbor.fesk.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id 7361563B882
	for <gnats-bugs@gnats.NetBSD.org>; Thu,  5 Jul 2007 02:28:20 +0000 (UTC)
Message-Id: <20070705002834.A81D71D0@abbor.fesk.com>
Date: Wed,  4 Jul 2007 18:28:34 -0600 (MDT)
From: sverre@viewmark.com
Reply-To: sverre@viewmark.com
To: gnats-bugs@NetBSD.org
Subject: LFS related panic with LOCKDEBUG
X-Send-Pr-Version: 3.95

>Number:         36608
>Category:       kern
>Synopsis:       Panic in LFS with LOCKDEBUG defined (since mid April)
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Jul 05 02:30:00 +0000 2007
>Closed-Date:    Wed Nov 18 17:54:14 +0000 2009
>Last-Modified:  Wed Nov 18 17:54:14 +0000 2009
>Originator:     Sverre Froyen
>Release:        NetBSD 4.99.22 (2007-07-04)
>Organization:
	Viewmark
>Environment:
System: NetBSD abbor.fesk.com 4.99.22 NetBSD 4.99.22 (GENERIC_LAPTOP) #1: Wed Jul 4 09:10:15 MDT 2007 toor@abbor.fesk.com:/usr/src/sys/arch/i386/compile/GENERIC_LAPTOP i386 (GENERIC_LAPTOP has LOCKDEBUG defined)
Architecture: i386
Machine: i386
>Description:
I get reproducible LFS-related panics when running the command
bogofilter -n < mailmessage
where mailmessage is a single plain email message.
For the panics to occur, LOCKDEBUG must be defined and the
bogofilter database (.bogofilter/wordlist.db) must be in a certain state.
The result is (copied from the screen):

switching with held simple_lock 0xcd155b2c CPU 0 ../../../../ufs/lfs/lfs_vnops.c:1742
0xccf9bd80:
Stopped in pid 679.1 (bogofilter) at netbsd:cpu_Debugger+0x4: popl %ebp
db> bt
cpu_Debugger(0,c07259a6,ccf7c9ac,c03bbf4a,c0815aa8) at netbsd:cpu_Debugger+0x4
simple_lock_switchcheck(c0815aa8,1,ccf7c9bc,c03cc255,c081f400) at netbsd:simple_lock_switchcheck+0x1b
mi_switch(ccf9bd80,11,ccf7c9dc,c03aed6f,c07b276a) at netbsd:mi_switch+0x2a
sleepq_block(0,0,b1,c0723def,c0795570) at netbsd:sleepq_block+0x10a
ltsleep(ccb30fa4,11,c0723def,0,c16e9754) at netbsd:ltsleep+0x151
lfs_segunlock(c16e9000,0,8f2,7fffffff,45c000) at netbsd:lfs_segunlock+0x224
lfs_putpages(ccf7cb30,ccefea84,1,c064cd40,cd155b2c) at netbsd:lfs_putpages+0xcb1
VOP_PUTPAGES(cd155b2c,0,0,0,0) at netbsd:VOP_PUTPAGES+0x40
lfs_fsync(ccf7cbb8,10002,ccf7cbdc,c040b48f,cd155b2c) at netbsd:lfs_fsync+0x13f
VOP_FSYNC(cd155b2c,cd00faa8,3,0,0) at netbsd:VOP_FSYNC+0x49
sys_fdatasync(ccf9bd80,ccf7cc48,ccf7cc68,640,ccf9bd80) at netbsd:sys_fdatasync+0x95
syscall_plain at netbsd:syscall_plain+0x116
--- syscall (number 241) ---
0xbb88c16b:
db> show lock ufs_hashlock
lock address : 0x00000000c0811554 type     :     sleep/adaptive
shared holds :                  0 exclusive:                  0
shares wanted:                  0 exclusive:                  0
current cpu  :                  0 last held:                  0
current lwp  : 0x00000000ccf9bd80 last held: 000000000000000000
last locked  : 0x00000000c034005a unlocked : 0x00000000c03400fe
owner field  : 000000000000000000 wait/spin:                0/0

Turnstile chain at 0xc0817040.
=> No active turnstile for this lock.
db> show lock ufs_ihash_lock
lock address : 0x00000000c081154c type     :     sleep/adaptive
shared holds :                  0 exclusive:                  0
shares wanted:                  0 exclusive:                  0
current cpu  :                  0 last held:                  0
current lwp  : 0x00000000ccf9bd80 last held: 000000000000000000
last locked  : 0x00000000c034ada7 unlocked : 0x00000000c03400f2
owner field  : 000000000000000000 wait/spin:                0/0

Turnstile chain at 0xc0817020.
=> No active turnstile for this lock.
db> ps/l
 PID         LID S     FLAGS       STRUCT LWP *             UAREA * WAIT
>679       >   1 3 0x20000004        0xccf9bd80          0xccf7cce0 seg_iocount
...

It looks like the problem was introduced on 2007-04-17 or 2007-04-18.
Kernels built from sources up to and including 2007-04-16 do not panic;
kernels from 2007-04-19 onward panic consistently.

See

http://mail-index.netbsd.org/current-users/2007/05/21/0028.html
http://mail-index.netbsd.org/current-users/2007/05/25/0004.html

for more information.
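
For context, the backtrace fits a classic LOCKDEBUG pattern: ltsleep()
atomically releases only the single interlock that is passed to it, so any
other simple_lock the thread still holds remains held across mi_switch(),
and a LOCKDEBUG kernel then stops in ddb with "switching with held
simple_lock".  A minimal sketch of that pattern, assuming a vnode vp, an
LFS mount fs, and a segment pointer sp as in the LFS code (illustrative
only, not the actual lfs_putpages path):

	/*
	 * Illustrative only: vp->v_interlock is still held when
	 * ltsleep() puts the thread to sleep, because ltsleep()
	 * drops just its interlock argument, fs->lfs_interlock.
	 */
	simple_lock(&vp->v_interlock);		/* first simple_lock */
	simple_lock(&fs->lfs_interlock);	/* second simple_lock */
	(void)ltsleep(&sp->seg_iocount, PRIBIO + 1, "seg_iocount",
		      0, &fs->lfs_interlock);	/* drops lfs_interlock only */
	/* mi_switch() runs here with v_interlock still held */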

>How-To-Repeat:
Use the known-bad bogofilter DB:
cp .bogofilter/wordlist.db.bad .bogofilter/wordlist.db
Reboot with an affected kernel and run:
bogofilter -n < mailmessage
>Fix:
http://mail-index.netbsd.org/source-changes/2009/11/17/msg003309.html

>Release-Note:

>Audit-Trail:
From: Sverre Froyen <sverre@viewmark.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/36608: LFS related panic with LOCKDEBUG
Date: Mon, 30 Jul 2007 16:03:04 -0600

 In lfs_vnops.c there is a comment about genfs_putpages stating:

  * (2) It needs to explicitly send blocks to be written when it is done.
  *     If VOP_PUTPAGES is called without the seglock held, we simply take
  *     the seglock and let lfs_segunlock wait for us.
  *     XXX There might be a bad situation if we have to flush a vnode while
  *     XXX lfs_markv is in operation.  As of this writing we panic in this
  *     XXX case.
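
 A condensed sketch of the protocol that comment describes (the seglocked
 test and the SEGM_PROT flag are assumptions here, not the verbatim
 lfs_putpages logic):

	if (!seglocked)
		lfs_seglock(fs, SEGM_PROT);	/* take the seglock ourselves */
	/* ... queue the pages for writing ... */
	if (!seglocked)
		lfs_segunlock(fs);		/* and let lfs_segunlock wait */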

 I have done a little more investigation, and I find that I consistently get
 a double-lock panic on the vnode(?) that is locked immediately before the
 call to lfs_segunlock, around line 2290 in lfs_vnops.c:

                         simple_unlock(&vp->v_interlock);

                         simple_lock(&vp->v_interlock);
                         write_and_wait(fs, vp, busypg, seglocked, NULL);
 *** vp is locked at this point
                         if (!seglocked) {
                                 lfs_release_finfo(fs);
                                 lfs_segunlock(fs);
 *** I get the panic before the call to lfs_segunlock returns
                         }
                         sp->vp = NULL;
                         goto get_seglock;
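
 One way to avoid tripping LOCKDEBUG at this spot (a sketch of the general
 direction only, not the committed fix linked under >Fix: above) would be
 to drop the vnode interlock before any path that can sleep:

                         simple_lock(&vp->v_interlock);
                         write_and_wait(fs, vp, busypg, seglocked, NULL);
                         /* drop v_interlock before anything that sleeps */
                         simple_unlock(&vp->v_interlock);
                         if (!seglocked) {
                                 lfs_release_finfo(fs);
                                 lfs_segunlock(fs); /* may sleep in ltsleep */
                         }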

 It looks like lfs_segunlock is sleeping in the second while loop in this code 
 snippet from lfs_subr.c:

                 simple_lock(&fs->lfs_interlock);
                 while (ckp && sync && fs->lfs_iocount)
                         (void)ltsleep(&fs->lfs_iocount, PRIBIO + 1,
                                       "lfs_iocount", 0, &fs->lfs_interlock);
                 while (sync && sp->seg_iocount) {
                         (void)ltsleep(&sp->seg_iocount, PRIBIO + 1,
                                      "seg_iocount", 0, &fs->lfs_interlock);
                         DLOG((DLOG_SEG, "sleeping on iocount %x == %d\n",
                               sp, sp->seg_iocount));
                 }
                 simple_unlock(&fs->lfs_interlock);

 I do not know whether the comment above refers to the case I am seeing,
 but while lfs_segunlock is sleeping, some other code comes along and
 attempts to lock the vnode that was locked in genfs_putpages.
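
 A minimal sketch of that collision (illustrative only; "LWP A" and
 "LWP B" merely label the two threads involved):

	/* LWP A: the lfs_putpages path, now asleep in lfs_segunlock */
	simple_lock(&vp->v_interlock);		/* taken in genfs_putpages */
	(void)ltsleep(&sp->seg_iocount, PRIBIO + 1, "seg_iocount",
		      0, &fs->lfs_interlock);	/* v_interlock NOT released */

	/* LWP B: any other path that needs the same vnode */
	simple_lock(&vp->v_interlock);		/* blocks / trips LOCKDEBUG */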


State-Changed-From-To: open->closed
State-Changed-By: eeh@NetBSD.org
State-Changed-When: Wed, 18 Nov 2009 17:54:14 +0000
State-Changed-Why:
I think the fix for a hang also fixes this problem.  Someone should verify the fix.


>Unformatted:
