NetBSD Problem Report #41024
From apb@cequrux.com Mon Mar 16 07:07:03 2009
Return-Path: <apb@cequrux.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by www.NetBSD.org (Postfix) with ESMTP id 59B5363B8EC
for <gnats-bugs@gnats.NetBSD.org>; Mon, 16 Mar 2009 07:07:03 +0000 (UTC)
Message-Id: <20090316070525.16CBAE93431@apb-laptoy.apb.alt.za>
Date: Mon, 16 Mar 2009 07:05:25 +0000 (UTC)
From: apb@cequrux.com
To: gnats-bugs@gnats.NetBSD.org
Subject: wapbl causes file system corruption
X-Send-Pr-Version: 3.95
>Number: 41024
>Category: kern
>Synopsis: usb causes file system corruption, triggered with wapbl
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Mar 16 07:10:00 +0000 2009
>Closed-Date: Mon Mar 12 19:57:17 +0000 2018
>Last-Modified: Mon Mar 12 19:57:17 +0000 2018
>Originator: Alan Barrett
>Release: NetBSD 5.99.8
>Organization:
Not much
>Environment:
System: NetBSD 5.99.8 i386
>Description:
I have an external USB disk that I use for backups. Very frequently,
while attempting to make a backup, the system panics, usually with a
message like this:
/mnt: bad dir ino 16170501 at offset 0: mangled entry
panic: bad dir
The file system is ffs+wapbl on cgd. The kernel includes the recent
change to make cgd pass the DIOCCACHESYNC ioctl through to the
underlying disk (see PR 41016).
Backups are made using rsync, but I can replicate the panics simply
using find(1) to read the file system.
The disk and its parents are attached as follows:
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
ehci0 at pci0 dev 29 function 7: vendor 0x8086 product 0x27cc (rev. 0x01)
ehci0: interrupting at ioapic0 pin 20
ehci0: EHCI version 1.0
ehci0: companion controllers, 2 ports each: uhci0 uhci1 uhci2 uhci3
usb4 at ehci0: USB revision 2.0
uhub2 at usb4: vendor 0x8086 EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub2: 8 ports with 8 removable, self powered
umass0 at uhub2 port 6 configuration 1 interface 0
umass0: Western Digital External HDD, rev 2.00/2.06, addr 5
umass0: using SCSI over Bulk-Only
scsibus0 at umass0: 2 targets, 1 lun per target
sd0 at scsibus0 target 0 lun 0: <WD, 3200JB External, 0107> disk fixed
sd0: fabricating a geometry
sd0: 298 GB, 305245 cyl, 64 head, 32 sec, 512 bytes/sect x 625142448 sectors
sd0: fabricating a geometry
Most of the disk space is allocated to the sd0e partition, which is
configured as a cgd device (cgd2 in the following description).
The whole of the cgd2 device is allocated to the cgd2a file system,
which is formatted as ffs+wapbl.
>How-To-Repeat:
Ensure that the file system is clean:
$ fsck -f -y -P /dev/rcgd2a
[fsck fixes several problems from a previous crash]
$ fsck -f -y -P /dev/rcgd2a
[no problems]
Verify that mounting without wapbl does not cause problems:
$ mount -o nolog /dev/cgd2a /mnt
$ find /mnt -type d -print | tail
[no problems]
$ umount /mnt
$ fsck -f -y -P /dev/rcgd2a
[no problems]
Verify that mounting with wapbl + noatime does not cause problems:
$ mount -o log,noatime /dev/cgd2a /mnt
$ find /mnt -type d -print | tail
[no problems]
$ umount /mnt
$ fsck -f -y -P /dev/rcgd2a
[no problems]
Verify that mounting with wapbl causes a crash:
$ mount -o log /dev/cgd2a /mnt
$ find /mnt -type d -print | tail
/mnt: bad dir ino 16170501 at offset 0: mangled entry
panic: bad dir
[...]
stoped in pid 14915.1 (find) [...]
Reboot and examine the crash dump:
$ crash -M netbsd.203.core -N netbsd.203
crash> bt
[...]
panic(c0ad6b07,ce09d0f8,f6be05,0,0,c0a7f762,200,d771967c,0,dbcaf000) at 0xc06f25fa
ufs_dirbad(dadc3e84,0,c0a7f762,0,cca07aa8,0,0,0,0,c4335218) at 0xc076dc1a
ufs_lookup(cca07ad8,1,cca07adc,c07e372d,d771967c,c0a51140,d771967c,cca07c14,cca07c28,d771967c) at 0xc076e60b
VOP_LOOKUP(d771967c,cca07c14,cca07c28,c07f4890,cca07b1c,1000,1,0,20,0) at 0xc07f56ec
lookup(cca07c00,20002,400,cca07c1c,1,c3a88000,cca07c1c,c0771adf,c0bab0a0,c0b3f5c0) at 0xc07d850c
namei(cca07c00,e00,cca07bfc,c07e372d,d771967c,ce09d000,cca07c1c,bb916090,0,0) at 0xc07d8c0d
do_sys_stat(bb916090,0,cca07c68,cca07c98,0,0,0,ceb57760,cca07c90,1) at 0xc07dff47
sys___lstat50(cf027000,cca07d00,cca07d28,bb916090,bb9160ac,bfbfeb98,bbb2a66d,1,bb268c80,804ee6c) at 0xc07dffac
syscall(cca07d48,bb9200b3,bb9000ab,bb90001f,bfbf001f,bb9160ac,bb916040,bfbfebf8,bbbc21dc,bb916040) at 0xc0711e7d
crash> quit
Examime the file system:
$ fsdb -f /dev/rcgd2a
fsdb (inum: 2)> inode 16170501
current inode: directory
I=16170501 MODE=40755 SIZE=512
MTIME=Mar 30 23:23:50 2005 [0 nsec]
CTIME=Feb 27 15:21:12 2009 [658332227 nsec]
ATIME=Mar 15 21:09:23 2009 [866735678 nsec]
OWNER=apb GRP=apb LINKCNT=2 FLAGS=0x0 BLKCNT=0x4 GEN=0x72b4af9f
fsdb (inum: 16170501)> ls
fsdb (inum: 16170501)> blks
I=16170501 4 blocks
Direct blocks:
0: 65573431
fsdb (inum: 16170501)> quit
*** FILE SYSTEM MARKED DIRTY
*** BE SURE TO RUN FSCK TO CLEAN UP ANY DAMAGE
*** IF IT WAS MOUNTED, RE-MOUNT WITH -u -o reload
$ dd if=/dev/rcgd2a bs=512 skip=65573431 count=1 | hexdump -C
1+0 records in
1+0 records out
512 bytes transferred in 0.010 secs (51200 bytes/sec)
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000200
$ tunefs -N /dev/rcgd2a
tunefs: tuning /dev/rcgd2a
tunefs: current settings of /dev/rcgd2a
maximum contiguous block count 4
maximum blocks per file in a cylinder group 4096
minimum percentage of free space 5%
optimization preference: time
average file size: 16384
expected number of files per directory: 64
journal log file location: in filesystem
journal log file size: 64MB (67108864 bytes)
journal log flags:
tunefs: no changes made
$ dumpfs -s /dev/rcgd2a
file system: /dev/rcgd2a
endian little-endian
magic 11954 (UFS1) time Sun Mar 15 22:40:18 2009
superblock location 8192 id [ 4960ee56 5d043c18 ]
cylgrp dynamic inodes 4.4BSD sblock FFSv2 fslevel 4
nbfree 11412090 ndir 642379 nifree 30237094 nffree 947460
ncg 1391 size 131285595 blocks 129238024
bsize 16384 shift 14 mask 0xffffc000
fsize 2048 shift 11 mask 0xfffff800
frag 8 shift 3 fsbtodb 2
bpg 11798 fpg 94384 ipg 23296
minfree 5% optim time maxcontig 4 maxbpg 4096
symlinklen 60 contigsumsize 4
maxfilesize 0x000400400402ffff
nindir 4096 inopb 128
avgfilesize 16384 avgfpdir 64
sblkno 8 cblkno 16 iblkno 24 dblkno 1480
sbsize 2048 cgsize 16384
csaddr 1480 cssize 22528
cgrotor 0 fmod 0 ronly 0 clean 0x00
wapbl version 0x1 location 2 flags 0x0
wapbl loc0 262179520 loc1 131072 loc2 512 loc3 3
flags none
fsmnt /mnt
volname swuid 0
>Fix:
Unknown. I will keep the crash dump for some weeks or months, and I
will keep the file system unmodified for a few days in case anybody
needs information from them.
>Release-Note:
>Audit-Trail:
From: Scott Ellis <scotte@warped.com>
To: gnats-bugs@NetBSD.org
Cc: netbsd-bugs@netbsd.org
Subject: Re: kern/41024: wapbl causes file system corruption
Date: Mon, 16 Mar 2009 19:17:44 -0700
I see similar behavior on a non-CGD system (doing rsync backups, or
heavy disk I/O to an external USB disk). It seems like my issue got
more severe after upgrading the server from 4GB to 8GB of RAM (this is
amd64), but it could have coincided with recent changes in -current as well.
From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/41024: wapbl causes file system corruption
Date: Tue, 17 Mar 2009 22:05:19 +0000
I changed the sysopsis for this PR:
[wapbl causes file system corruption]
->
[usb causes file system corruption, triggered with wapbl]
From: Scott Ellis <scotte@warped.com>
To: gnats-bugs@gnats.netbsd.org
Cc:
Subject: Re: kern/41024
Date: Tue, 17 Mar 2009 18:43:54 -0700
I'm not sure wapbl is even involved here.
Mounting my (broken, ala the original PR) filesystem and doing an "fsck
-fy" on it with a debug kernel yields a number of "usbd_do_request: not
in process context" messages. This is with -current amd64 as of "Sun
Mar 15 09:29:16 PDT 2009".
Wapbl is out of the picture when fsck'ing, right?
From: Scott Ellis <scotte@warped.com>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org, apb@cequrux.com
Subject: Re: kern/41024
Date: Tue, 17 Mar 2009 19:53:10 -0700
Doing an rsync to my USB drive (with wapbl, etc.), after freshly
formatting it eventually yielded:
panic: kernel diagnostic assertion "de->td_node == node" failed: file
"/nbu/source/netbsd/src/sys/fs/tmpfs/tmpfs_vnops.c", line 657
fatal breakpoint trap in supervisor mode
trap type 1 code 0 rip ffffffff80152215 cs 8 rflags 246 cr2
ffff800057d4d670 cpl 0 rsp ffff8000790339d0
Stopped in pid 18134.1 (cc) at netbsd:breakpoint+0x5: leave
db{1}> bt
breakpoint() at netbsd:breakpoint+0x5
panic() at netbsd:panic+0x272
__kernassert() at netbsd:__kernassert+0x2d
tmpfs_remove() at netbsd:tmpfs_remove+0x2e7
VOP_REMOVE() at netbsd:VOP_REMOVE+0x69
do_sys_unlink() at netbsd:do_sys_unlink+0xa8
syscall() at netbsd:syscall+0x92
db{1}>
Given that I wasn't doing anything with tmpfs at the time (or
rather...rsync wasn't...maybe some other process was, since that's
/tmp), it seems like a strange backtrace.
Maybe there's more than one issue going on here?
From: "Scott Ellis" <scotte@warped.com>
To: scotte@warped.com
Cc: gnats-bugs@netbsd.org,
kern-bug-people@netbsd.org,
gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org,
apb@cequrux.com
Subject: Re: kern/41024
Date: Mon, 23 Mar 2009 09:33:34 -0700 (PDT)
Ignore my tmpfs crash and backtrace. That problem has been tracked down
to faulty RAM in the system (which has since been replaced). I'm still
seeing the occasional crash when rsync'ing to an EHCI-connected drive
though. Will try and get a trace from that when it happens again (which
is unfortunately part of daily/weekly/monthly.local!).
State-Changed-From-To: open->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sat, 05 Sep 2015 20:27:06 +0000
State-Changed-Why:
Is this problem still live? Haven't seen or heard about anything like this
in a long time...
State-Changed-From-To: feedback->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sun, 01 May 2016 05:37:48 +0000
State-Changed-Why:
new question...
State-Changed-From-To: open->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sun, 01 May 2016 05:38:34 +0000
State-Changed-Why:
(in addition to the old one, namely, is this problem still current)
Were/are you using -o discard when seeing this problem? If so, it's
probably the same as 50725.
State-Changed-From-To: feedback->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Mon, 12 Mar 2018 00:13:46 +0000
State-Changed-Why:
apb's mail has been bouncing for some time :-(
State-Changed-From-To: open->closed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Mon, 12 Mar 2018 19:57:17 +0000
State-Changed-Why:
Closing, it's possible it's fixed with the WAPBL fixes, and there was no
other reports about similar issue recently.
Thank you.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.43 2018/01/16 07:36:43 maya Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2017
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.