NetBSD Problem Report #35198
From stix@stix.id.au Thu Dec 7 11:58:48 2006
Return-Path: <stix@stix.id.au>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by narn.NetBSD.org (Postfix) with ESMTP id 5D97263BA6D
for <gnats-bugs@gnats.NetBSD.org>; Thu, 7 Dec 2006 11:58:48 +0000 (UTC)
Message-Id: <20061207115835.C27A977@hactar.stix.org.au>
Date: Thu, 7 Dec 2006 22:58:35 +1100 (EST)
From: stix@stix.id.au
Reply-To: stix@stix.id.au
To: gnats-bugs@NetBSD.org
Subject: lfs_pchain corruption causing hang or panic
X-Send-Pr-Version: 3.95
>Number: 35198
>Category: kern
>Synopsis: lfs_pchain corruption causing hang or panic
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Thu Dec 07 12:00:01 +0000 2006
>Closed-Date: Wed Jan 23 11:33:51 +0000 2008
>Last-Modified: Wed Jan 23 13:30:00 +0000 2008
>Originator: Paul Ripke
>Release: NetBSD 4.99.5
>Organization:
>Environment:
System: NetBSD hactar.stix.org.au 4.99.5 NetBSD 4.99.5 (GENERIC) #0: Sun Dec 3 15:54:49 EST 2006 stix@zion.stix.org.au:/export/netbsd/current/obj.i386/export/netbsd/current/src/sys/arch/i386/compile/GENERIC i386
Architecture: i386
Machine: i386
>Description:
I've been investigating a reproducible LFS hang and a crash that appear to
be related. Both point to the lfs_pchain tail queue (fs->lfs_pchainhd)
getting corrupted somehow.
In the hang, lfs_pchain holds one or more inodes (usually just one), none
of which has IN_PAGING set, so lfs_flush_pchain() spins forever on its
"goto top":
(gdb) bt
...
#14 0xc017f5e4 in pckbcintr (vsc=0xc08c7300) at /export/netbsd/current/src/sys/dev/ic/pckbc.c:640
#15 0xc030cb7f in intr_biglock_wrapper (vp=0xc0908040) at /export/netbsd/current/src/sys/arch/x86/x86/intr.c:544
#16 0xc0102f8c in Xintr_ioapic_edge1 ()
#17 0xc0232754 in lfs_flush_pchain (fs=0xc0d45800) at /export/netbsd/current/src/sys/ufs/lfs/lfs_vnops.c:1344
#18 0xc022b866 in lfs_writerd (arg=0xca2cd5c8) at /export/netbsd/current/src/sys/ufs/lfs/lfs_vfsops.c:236
#19 0xc01002e7 in proc_trampoline ()
(gdb) f 17
#17 0xc0232754 in lfs_flush_pchain (fs=0xc0d45800) at /export/netbsd/current/src/sys/ufs/lfs/lfs_vnops.c:1344
warning: Source file is more recent than executable.
1344 nip = TAILQ_NEXT(ip, i_lfs_pchain);
(gdb) list
1339 * fast and async.
1340 */
1341 simple_lock(&fs->lfs_interlock);
1342 top:
1343 for (ip = TAILQ_FIRST(&fs->lfs_pchainhd); ip != NULL; ip = nip) {
1344 nip = TAILQ_NEXT(ip, i_lfs_pchain);
1345 vp = ITOV(ip);
1346
1347 if (!(ip->i_flags & IN_PAGING))
1348 goto top;
(gdb) p fs->lfs_pchainhd
$1 = {tqh_first = 0xcb91ea40, tqh_last = 0xcb919044}
(gdb) p ip
$2 = (struct inode *) 0xcb91ea40
(gdb) p ip.i_flags
$4 = 0
(gdb) p ip->inode_ext.lfs->lfs_pchain.tqe_next
$5 = (struct inode *) 0x0
The crash occurs in lfs_putpages(), when it attempts to remove an inode
from an empty lfs_pchain after clearing IN_PAGING:
(gdb) bt
...
#8 0xc0337e8e in trap (frame=0xcb2c5804) at /export/netbsd/current/src/sys/arch/i386/i386/trap.c:313
#9 0xc010c01e in calltrap ()
#10 0xc02382a5 in lfs_putpages (v=0xcb2c58f0) at /export/netbsd/current/src/sys/ufs/lfs/lfs_vnops.c:1920
#11 0xc02f5510 in VOP_PUTPAGES (vp=0xcc2d359c, offlo=0, offhi=0, flags=27)
at /export/netbsd/current/src/sys/kern/vnode_if.c:1592
#12 0xc02e81f8 in vinvalbuf (vp=0xcc2d359c, flags=1, cred=0xffffffff, l=0xcb029564, slpflag=0, slptimeo=0)
at /export/netbsd/current/src/sys/kern/vfs_subr.c:715
#13 0xc02e862e in vclean (vp=0xcc2d359c, flags=<value optimized out>, l=0xcb029564)
at /export/netbsd/current/src/sys/kern/vfs_subr.c:1555
#14 0xc02e8bc8 in vgonel (vp=0xcc2d359c, l=0xcb029564) at /export/netbsd/current/src/sys/kern/vfs_subr.c:1734
#15 0xc02e9044 in getcleanvnode (l=0xcb029564) at /export/netbsd/current/src/sys/kern/vfs_subr.c:272
#16 0xc02e91f0 in getnewvnode (tag=VT_LFS, mp=0xc0a74000, vops=0xc0967500, vpp=0xcb2c5bb8)
at /export/netbsd/current/src/sys/kern/vfs_subr.c:561
#17 0xc0234f76 in lfs_set_dirop_create (dvp=0xcc22ba54, vpp=0xcb2c5bb8)
at /export/netbsd/current/src/sys/ufs/lfs/lfs_vnops.c:466
#18 0xc0235787 in lfs_create (v=0xcb2c5ab8) at /export/netbsd/current/src/sys/ufs/lfs/lfs_vnops.c:674
#19 0xc02f4c11 in VOP_CREATE (dvp=0xcc22ba54, vpp=0xcb2c5bb8, cnp=0xcb2c5bcc, vap=0xcb2c5afc)
at /export/netbsd/current/src/sys/kern/vnode_if.c:164
#20 0xc02f28bd in vn_open (ndp=0xcb2c5ba8, fmode=1538, cmode=384)
at /export/netbsd/current/src/sys/kern/vfs_vnops.c:169
#21 0xc02ef2a2 in sys_open (l=0xcb029564, v=0xcb2c5c48, retval=0xcb2c5c68)
at /export/netbsd/current/src/sys/kern/vfs_syscalls.c:1171
#22 0xc03374f8 in syscall_plain (frame=0xcb2c5c88) at /export/netbsd/current/src/sys/arch/i386/i386/syscall.c:144
#23 0xc01006e0 in syscall1 ()
(gdb) f 10
#10 0xc02382a5 in lfs_putpages (v=0xcb2c58f0) at /export/netbsd/current/src/sys/ufs/lfs/lfs_vnops.c:1919
warning: Source file is more recent than executable.
1920 TAILQ_REMOVE(&fs->lfs_pchainhd, ip, i_lfs_pchain);
(gdb) list
1914 simple_unlock(&vp->v_interlock);
1915
1916 /* Remove us from paging queue, if we were on it */
1917 simple_lock(&fs->lfs_interlock);
1918 if (ip->i_flags & IN_PAGING) {
1919 ip->i_flags &= ~IN_PAGING;
1920 TAILQ_REMOVE(&fs->lfs_pchainhd, ip, i_lfs_pchain);
1921 }
1922 simple_unlock(&fs->lfs_interlock);
(gdb) p ip
$11 = (struct inode *) 0xcc2d4bf0
(gdb) p ip->i_flags
$12 = 0
(gdb) p fs->lfs_pchainhd
$13 = {tqh_first = 0x0, tqh_last = 0x0}
Reading the source, all lfs_pchain operations appear to be protected by
lfs_interlock, and very little code touches i_flags, so I'm at a loss as
to how this is happening.
FYI: line numbers may not quite match -current, since I've been sprinkling
a few KASSERTs around to try to track this down. I can reproduce both the
hang and the crash with stock GENERIC and with a custom kernel config based
on GENERIC.MP.
As you can see from the above, I have plenty of dumps available, and can
readily reproduce the problem as required.
>How-To-Repeat:
Both problems appear to be exacerbated by low memory; this ancient test
system only has 128 MB RAM, and even then, I can reduce the time to failure
from minutes to seconds by filling up RAM with processes. I have also tuned
down the UBC file cache (vm.filemin and vm.filemax):
ksh$ sysctl -a | egrep 'vm\.....m(in|ax)'
vm.anonmin = 10
vm.filemin = 5
vm.execmin = 5
vm.anonmax = 80
vm.filemax = 10
vm.execmax = 30
Both problems will occur when either unpacking pkgsrc onto a pristine LFS
file system or when /etc/daily runs a find(1) over an LFS file system
containing pkgsrc. Keeping as much memory free as possible allows the
unpacking to succeed, occasionally.
>Fix:
unknown
>Release-Note:
>Audit-Trail:
From: Paul Ripke <stix@stix.id.au>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/35198: lfs_pchain corruption causing hang or panic
Date: Wed, 23 Jan 2008 11:15:46 +1100
Although this was 100% repeatable on this specific computer, I was
unable to reproduce hang or panic under QEMU. I now suspect that this
may have been due to a dodgy driver stomping very specifically on these
structs - possibly disk?
In any case, I think this can be closed.
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org, stix@stix.id.au
Subject: Re: kern/35198: lfs_pchain corruption causing hang or panic
Date: Wed, 23 Jan 2008 00:38:41 +0000
On Wed, Jan 23, 2008 at 12:20:04AM +0000, Paul Ripke wrote:
> Although this was 100% repeatable on this specific computer, I was
> unable to reproduce hang or panic under QEMU. I now suspect that this
> may have been due to a dodgy driver stomping very specifically on these
> structs - possibly disk?
>
> In any case, I think this can be closed.
Well, if there's a driver corrupting things it would be nice to fix
that... but if you don't have the time/patience/whatnot to track it
down, I can close the PR.
--
David A. Holland
dholland@netbsd.org
From: Paul Ripke <stix@stix.id.au>
To: David Holland <dholland-bugs@netbsd.org>
Cc: NetBSD gnats-bugs <gnats-bugs@NetBSD.org>
Subject: Re: kern/35198: lfs_pchain corruption causing hang or panic
Date: Wed, 23 Jan 2008 11:58:36 +1100
On Wed, Jan 23, 2008 at 12:38:41AM +0000, David Holland wrote:
> On Wed, Jan 23, 2008 at 12:20:04AM +0000, Paul Ripke wrote:
> > Although this was 100% repeatable on this specific computer, I was
> > unable to reproduce hang or panic under QEMU. I now suspect that this
> > may have been due to a dodgy driver stomping very specifically on these
> > structs - possibly disk?
> >
> > In any case, I think this can be closed.
>
> Well, if there's a driver corrupting things it would be nice to fix
> that... but if you don't have the time/patience/whatnot to track it
> down, I can close the PR.
I'd like to chase down the issue, but the system is pretty old, the
disk is a:
amr0 at pci1 dev 10 function 1: AMI RAID <Series 466>
amr0: interrupting at ioapic1 pin 8 (irq 14)
amr0: firmware <3.01>, BIOS <1.36>, 16MB RAM
ld0 at amr0 unit 0: RAID 5, optimal
ld0: 42840 MB, 10880 cyl, 128 head, 63 sec, 512 bytes/sect x 87736320 sectors
which is pretty ancient, and the system is about 16000km away from
me right now, and yes, I don't have the time, either. I say close it,
and if I feel the inclination in the future, I'll open another, more
relevant PR.
Thanks,
--
Paul
State-Changed-From-To: open->closed
State-Changed-By: jnemeth@narn.netbsd.org
State-Changed-When: Wed, 23 Jan 2008 11:33:51 +0000
State-Changed-Why:
submitter can no longer test and requested that the PR be closed
From: David Holland <dholland-bugs@netbsd.org>
To: Paul Ripke <stix@stix.id.au>
Cc: David Holland <dholland-bugs@netbsd.org>,
NetBSD gnats-bugs <gnats-bugs@NetBSD.org>
Subject: Re: kern/35198: lfs_pchain corruption causing hang or panic
Date: Wed, 23 Jan 2008 13:26:40 +0000
On Wed, Jan 23, 2008 at 11:58:36AM +1100, Paul Ripke wrote:
> > > In any case, I think this can be closed.
> >
> > Well, if there's a driver corrupting things it would be nice to fix
> > that... but if you don't have the time/patience/whatnot to track it
> > down, I can close the PR.
>
> I'd like to chase down the issue, but the system is pretty old, the
> [...]
>
> which is pretty ancient, and the system is about 16000km away from
> me right now, and yes, I don't have the time, either. I say close it,
> and if I feel the inclination in the future, I'll open another, more
> relevant PR.
Understood. Thanks for your input.
--
David A. Holland
dholland@netbsd.org
>Unformatted: