NetBSD Problem Report #35198

From stix@stix.id.au  Thu Dec  7 11:58:48 2006
Return-Path: <stix@stix.id.au>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id 5D97263BA6D
	for <gnats-bugs@gnats.NetBSD.org>; Thu,  7 Dec 2006 11:58:48 +0000 (UTC)
Message-Id: <20061207115835.C27A977@hactar.stix.org.au>
Date: Thu,  7 Dec 2006 22:58:35 +1100 (EST)
From: stix@stix.id.au
Reply-To: stix@stix.id.au
To: gnats-bugs@NetBSD.org
Subject: lfs_pchain corruption causing hang or panic
X-Send-Pr-Version: 3.95

>Number:         35198
>Category:       kern
>Synopsis:       lfs_pchain corruption causing hang or panic
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Dec 07 12:00:01 +0000 2006
>Closed-Date:    Wed Jan 23 11:33:51 +0000 2008
>Last-Modified:  Wed Jan 23 13:30:00 +0000 2008
>Originator:     Paul Ripke
>Release:        NetBSD 4.99.5
>Organization:
>Environment:
System: NetBSD hactar.stix.org.au 4.99.5 NetBSD 4.99.5 (GENERIC) #0: Sun Dec 3 15:54:49 EST 2006 stix@zion.stix.org.au:/export/netbsd/current/obj.i386/export/netbsd/current/src/sys/arch/i386/compile/GENERIC i386
Architecture: i386
Machine: i386
>Description:

I've been investigating a reproducible lfs hang and crash that appear to be
related. From my investigation, it appears that the lfs_pchain tailqueue is
getting corrupted somehow.

The hang I'm seeing has lfs_pchain with one or more inodes on the chain
(usually just one), but none have IN_PAGING set, and it spins on the goto in
lfs_flush_pchain():

(gdb) bt
...
#14 0xc017f5e4 in pckbcintr (vsc=0xc08c7300) at /export/netbsd/current/src/sys/dev/ic/pckbc.c:640
#15 0xc030cb7f in intr_biglock_wrapper (vp=0xc0908040) at /export/netbsd/current/src/sys/arch/x86/x86/intr.c:544
#16 0xc0102f8c in Xintr_ioapic_edge1 ()
#17 0xc0232754 in lfs_flush_pchain (fs=0xc0d45800) at /export/netbsd/current/src/sys/ufs/lfs/lfs_vnops.c:1344
#18 0xc022b866 in lfs_writerd (arg=0xca2cd5c8) at /export/netbsd/current/src/sys/ufs/lfs/lfs_vfsops.c:236
#19 0xc01002e7 in proc_trampoline ()
(gdb) f 17
#17 0xc0232754 in lfs_flush_pchain (fs=0xc0d45800) at /export/netbsd/current/src/sys/ufs/lfs/lfs_vnops.c:1344
warning: Source file is more recent than executable.
1344                    nip = TAILQ_NEXT(ip, i_lfs_pchain);
(gdb) list
1339             * fast and async.
1340             */
1341            simple_lock(&fs->lfs_interlock);
1342        top:
1343            for (ip = TAILQ_FIRST(&fs->lfs_pchainhd); ip != NULL; ip = nip) {
1344                    nip = TAILQ_NEXT(ip, i_lfs_pchain);
1345                    vp = ITOV(ip);
1346
1347                    if (!(ip->i_flags & IN_PAGING))
1348                            goto top;
(gdb) p fs->lfs_pchainhd
$1 = {tqh_first = 0xcb91ea40, tqh_last = 0xcb919044}
(gdb) p ip
$2 = (struct inode *) 0xcb91ea40
(gdb) p ip.i_flags
$4 = 0
(gdb) p ip->inode_ext.lfs->lfs_pchain.tqe_next
$5 = (struct inode *) 0x0

The crash I'm seeing is attempting to remove an inode from an empty
lfs_pchain after clearing IN_PAGING, in lfs_putpages():

(gdb) bt
...
#8  0xc0337e8e in trap (frame=0xcb2c5804) at /export/netbsd/current/src/sys/arch/i386/i386/trap.c:313
#9  0xc010c01e in calltrap ()
#10 0xc02382a5 in lfs_putpages (v=0xcb2c58f0) at /export/netbsd/current/src/sys/ufs/lfs/lfs_vnops.c:1920
#11 0xc02f5510 in VOP_PUTPAGES (vp=0xcc2d359c, offlo=0, offhi=0, flags=27)
    at /export/netbsd/current/src/sys/kern/vnode_if.c:1592
#12 0xc02e81f8 in vinvalbuf (vp=0xcc2d359c, flags=1, cred=0xffffffff, l=0xcb029564, slpflag=0, slptimeo=0)
    at /export/netbsd/current/src/sys/kern/vfs_subr.c:715
#13 0xc02e862e in vclean (vp=0xcc2d359c, flags=<value optimized out>, l=0xcb029564)
    at /export/netbsd/current/src/sys/kern/vfs_subr.c:1555
#14 0xc02e8bc8 in vgonel (vp=0xcc2d359c, l=0xcb029564) at /export/netbsd/current/src/sys/kern/vfs_subr.c:1734
#15 0xc02e9044 in getcleanvnode (l=0xcb029564) at /export/netbsd/current/src/sys/kern/vfs_subr.c:272
#16 0xc02e91f0 in getnewvnode (tag=VT_LFS, mp=0xc0a74000, vops=0xc0967500, vpp=0xcb2c5bb8)
    at /export/netbsd/current/src/sys/kern/vfs_subr.c:561
#17 0xc0234f76 in lfs_set_dirop_create (dvp=0xcc22ba54, vpp=0xcb2c5bb8)
    at /export/netbsd/current/src/sys/ufs/lfs/lfs_vnops.c:466
#18 0xc0235787 in lfs_create (v=0xcb2c5ab8) at /export/netbsd/current/src/sys/ufs/lfs/lfs_vnops.c:674
#19 0xc02f4c11 in VOP_CREATE (dvp=0xcc22ba54, vpp=0xcb2c5bb8, cnp=0xcb2c5bcc, vap=0xcb2c5afc)
    at /export/netbsd/current/src/sys/kern/vnode_if.c:164
#20 0xc02f28bd in vn_open (ndp=0xcb2c5ba8, fmode=1538, cmode=384)
    at /export/netbsd/current/src/sys/kern/vfs_vnops.c:169
#21 0xc02ef2a2 in sys_open (l=0xcb029564, v=0xcb2c5c48, retval=0xcb2c5c68)
    at /export/netbsd/current/src/sys/kern/vfs_syscalls.c:1171
#22 0xc03374f8 in syscall_plain (frame=0xcb2c5c88) at /export/netbsd/current/src/sys/arch/i386/i386/syscall.c:144
#23 0xc01006e0 in syscall1 ()
(gdb) f 10
#10 0xc02382a5 in lfs_putpages (v=0xcb2c58f0) at /export/netbsd/current/src/sys/ufs/lfs/lfs_vnops.c:1919
warning: Source file is more recent than executable.
1920                            TAILQ_REMOVE(&fs->lfs_pchainhd, ip, i_lfs_pchain);
(gdb) list
1914                    simple_unlock(&vp->v_interlock);
1915
1916                    /* Remove us from paging queue, if we were on it */
1917                    simple_lock(&fs->lfs_interlock);
1918                    if (ip->i_flags & IN_PAGING) {
1919                            ip->i_flags &= ~IN_PAGING;
1920                            TAILQ_REMOVE(&fs->lfs_pchainhd, ip, i_lfs_pchain);
1921                    }
1922                    simple_unlock(&fs->lfs_interlock);
(gdb) p ip
$11 = (struct inode *) 0xcc2d4bf0
(gdb) p ip->i_flags
$12 = 0
(gdb) p fs->lfs_pchainhd
$13 = {tqh_first = 0x0, tqh_last = 0x0}

Reading the source, all lfs_pchain operations appear to be protected by
lfs_interlock, and very little code touches i_flags, so I'm at a loss as
to how this is happening.

FYI: line numbers mightn't quite match with current, I've been sprinkling
a few KASSERTs around to try to track this down. I can reproduce both the
hang and crash with both stock GENERIC and a custom kernel config based on
GENERIC.MP.

As you can see from the above, I have plenty of dumps available, and can
readily reproduce the problem as required.

>How-To-Repeat:
Both problems appear to be exacerbated by low memory; this ancient test
system only has 128 MB RAM, and even then, I can reduce the time to failure
from minutes to seconds by filling up RAM with processes. I have also tuned
down the ubc file cache:
ksh$ sysctl -a | egrep 'vm\.....m(in|ax)'
vm.anonmin = 10
vm.filemin = 5
vm.execmin = 5
vm.anonmax = 80
vm.filemax = 10
vm.execmax = 30

Both problems will occur when either unpacking pkgsrc onto a pristine LFS
file system or when /etc/daily runs a find(1) over an LFS file system
containing pkgsrc. Keeping as much memory free as possible allows the
unpacking to succeed, occasionally.

>Fix:
unknown

>Release-Note:

>Audit-Trail:
From: Paul Ripke <stix@stix.id.au>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/35198: lfs_pchain corruption causing hang or panic
Date: Wed, 23 Jan 2008 11:15:46 +1100

 Although this was 100% repeatable on this specific computer, I was
 unable to reproduce hang or panic under QEMU. I now suspect that this
 may have been due to a dodgy driver stomping very specifically on these
 structs - possibly disk?

 In any case, I think this can be closed.

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, stix@stix.id.au
Subject: Re: kern/35198: lfs_pchain corruption causing hang or panic
Date: Wed, 23 Jan 2008 00:38:41 +0000

 On Wed, Jan 23, 2008 at 12:20:04AM +0000, Paul Ripke wrote:
  >  Although this was 100% repeatable on this specific computer, I was
  >  unable to reproduce hang or panic under QEMU. I now suspect that this
  >  may have been due to a dodgy driver stomping very specifically on these
  >  structs - possibly disk?
  >  
  >  In any case, I think this can be closed.

 Well, if there's a driver corrupting things it would be nice to fix
 that... but if you don't have the time/patience/whatnot to track it
 down, I can close the PR.

 -- 
 David A. Holland
 dholland@netbsd.org

From: Paul Ripke <stix@stix.id.au>
To: David Holland <dholland-bugs@netbsd.org>
Cc: NetBSD gnats-bugs <gnats-bugs@NetBSD.org>
Subject: Re: kern/35198: lfs_pchain corruption causing hang or panic
Date: Wed, 23 Jan 2008 11:58:36 +1100

 On Wed, Jan 23, 2008 at 12:38:41AM +0000, David Holland wrote:
 > On Wed, Jan 23, 2008 at 12:20:04AM +0000, Paul Ripke wrote:
 >  >  Although this was 100% repeatable on this specific computer, I was
 >  >  unable to reproduce hang or panic under QEMU. I now suspect that this
 >  >  may have been due to a dodgy driver stomping very specifically on these
 >  >  structs - possibly disk?
 >  >  
 >  >  In any case, I think this can be closed.
 > 
 > Well, if there's a driver corrupting things it would be nice to fix
 > that... but if you don't have the time/patience/whatnot to track it
 > down, I can close the PR.

 I'd like to chase down the issue, but the system is pretty old, the
 disk is a:

 amr0 at pci1 dev 10 function 1: AMI RAID <Series 466>
 amr0: interrupting at ioapic1 pin 8 (irq 14)
 amr0: firmware <3.01>, BIOS <1.36>, 16MB RAM
 ld0 at amr0 unit 0: RAID 5, optimal
 ld0: 42840 MB, 10880 cyl, 128 head, 63 sec, 512 bytes/sect x 87736320 sectors

 which is pretty ancient, and the system is about 16000km away from
 me right now, and yes, I don't have the time, either. I say close it,
 and if I feel the inclination in the future, I'll open another, more
 relevant PR.

 Thanks,
 -- 
 Paul

State-Changed-From-To: open->closed
State-Changed-By: jnemeth@narn.netbsd.org
State-Changed-When: Wed, 23 Jan 2008 11:33:51 +0000
State-Changed-Why:
submitter can no longer test and requested that the PR be closed


From: David Holland <dholland-bugs@netbsd.org>
To: Paul Ripke <stix@stix.id.au>
Cc: David Holland <dholland-bugs@netbsd.org>,
	NetBSD gnats-bugs <gnats-bugs@NetBSD.org>
Subject: Re: kern/35198: lfs_pchain corruption causing hang or panic
Date: Wed, 23 Jan 2008 13:26:40 +0000

 On Wed, Jan 23, 2008 at 11:58:36AM +1100, Paul Ripke wrote:
  > >  >  In any case, I think this can be closed.
  > > 
  > > Well, if there's a driver corrupting things it would be nice to fix
  > > that... but if you don't have the time/patience/whatnot to track it
  > > down, I can close the PR.
  > 
  > I'd like to chase down the issue, but the system is pretty old, the
  > [...]
  > 
  > which is pretty ancient, and the system is about 16000km away from
  > me right now, and yes, I don't have the time, either. I say close it,
  > and if I feel the inclination in the future, I'll open another, more
  > relevant PR.

 Understood. Thanks for your input.

 -- 
 David A. Holland
 dholland@netbsd.org

>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.