NetBSD Problem Report #41024

From apb@cequrux.com  Mon Mar 16 07:07:03 2009
Return-Path: <apb@cequrux.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by www.NetBSD.org (Postfix) with ESMTP id 59B5363B8EC
	for <gnats-bugs@gnats.NetBSD.org>; Mon, 16 Mar 2009 07:07:03 +0000 (UTC)
Message-Id: <20090316070525.16CBAE93431@apb-laptoy.apb.alt.za>
Date: Mon, 16 Mar 2009 07:05:25 +0000 (UTC)
From: apb@cequrux.com
To: gnats-bugs@gnats.NetBSD.org
Subject: wapbl causes file system corruption
X-Send-Pr-Version: 3.95

>Number:         41024
>Category:       kern
>Synopsis:       usb causes file system corruption, triggered with wapbl
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Mar 16 07:10:00 +0000 2009
>Closed-Date:    Mon Mar 12 19:57:17 +0000 2018
>Last-Modified:  Mon Mar 12 19:57:17 +0000 2018
>Originator:     Alan Barrett
>Release:        NetBSD 5.99.8
>Organization:
Not much
>Environment:
System: NetBSD 5.99.8 i386
>Description:
I have an external USB disk that I use for backups.  Very frequently,
while attempting to make a backup, the system panics, usually with a
message like this:

    /mnt: bad dir ino 16170501 at offset 0: mangled entry
    panic: bad dir

The file system is ffs+wapbl on cgd.  The kernel includes the recent
change to make cgd pass the DIOCCACHESYNC ioctl through to the
underlying disk (see PR 41016).

Backups are made using rsync, but I can replicate the panics simply
using find(1) to read the file system.

The disk and its parents are attached as follows:

    pci0 at mainbus0 bus 0: configuration mode 1
    pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
    ehci0 at pci0 dev 29 function 7: vendor 0x8086 product 0x27cc (rev. 0x01)
    ehci0: interrupting at ioapic0 pin 20
    ehci0: EHCI version 1.0
    ehci0: companion controllers, 2 ports each: uhci0 uhci1 uhci2 uhci3
    usb4 at ehci0: USB revision 2.0
    uhub2 at usb4: vendor 0x8086 EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
    uhub2: 8 ports with 8 removable, self powered
    umass0 at uhub2 port 6 configuration 1 interface 0
    umass0: Western Digital External HDD, rev 2.00/2.06, addr 5
    umass0: using SCSI over Bulk-Only
    scsibus0 at umass0: 2 targets, 1 lun per target
    sd0 at scsibus0 target 0 lun 0: <WD, 3200JB External, 0107> disk fixed
    sd0: fabricating a geometry
    sd0: 298 GB, 305245 cyl, 64 head, 32 sec, 512 bytes/sect x 625142448 sectors
    sd0: fabricating a geometry

Most of the disk space is allocated to the sd0e partition, which is
configured as a cgd device (cgd2 in the following description).

The whole of the cgd2 device is allocated to the cgd2a file system,
which is formatted as ffs+wapbl.

>How-To-Repeat:
Ensure that the file system is clean:

    $ fsck -f -y -P /dev/rcgd2a 
	[fsck fixes several problems from a previous crash]
    $ fsck -f -y -P /dev/rcgd2a
	[no problems]

Verify that mounting without wapbl does not cause problems:

    $ mount -o nolog /dev/cgd2a /mnt
    $ find /mnt -type d -print | tail
	[no problems]
    $ umount /mnt
    $ fsck -f -y -P /dev/rcgd2a
	[no problems]

Verify that mounting with wapbl + noatime does not cause problems:

    $ mount -o log,noatime /dev/cgd2a /mnt
    $ find /mnt -type d -print | tail
	[no problems]
    $ umount /mnt
    $ fsck -f -y -P /dev/rcgd2a
	[no problems]

Verify that mounting with wapbl causes a crash:

    $ mount -o log /dev/cgd2a /mnt
    $ find /mnt -type d -print | tail

    /mnt: bad dir ino 16170501 at offset 0: mangled entry
    panic: bad dir
    [...]
    stoped in pid 14915.1 (find) [...]

Reboot and examine the crash dump:

$ crash -M netbsd.203.core -N netbsd.203
crash> bt
[...]
panic(c0ad6b07,ce09d0f8,f6be05,0,0,c0a7f762,200,d771967c,0,dbcaf000) at 0xc06f25fa
ufs_dirbad(dadc3e84,0,c0a7f762,0,cca07aa8,0,0,0,0,c4335218) at 0xc076dc1a
ufs_lookup(cca07ad8,1,cca07adc,c07e372d,d771967c,c0a51140,d771967c,cca07c14,cca07c28,d771967c) at 0xc076e60b
VOP_LOOKUP(d771967c,cca07c14,cca07c28,c07f4890,cca07b1c,1000,1,0,20,0) at 0xc07f56ec
lookup(cca07c00,20002,400,cca07c1c,1,c3a88000,cca07c1c,c0771adf,c0bab0a0,c0b3f5c0) at 0xc07d850c
namei(cca07c00,e00,cca07bfc,c07e372d,d771967c,ce09d000,cca07c1c,bb916090,0,0) at 0xc07d8c0d
do_sys_stat(bb916090,0,cca07c68,cca07c98,0,0,0,ceb57760,cca07c90,1) at 0xc07dff47
sys___lstat50(cf027000,cca07d00,cca07d28,bb916090,bb9160ac,bfbfeb98,bbb2a66d,1,bb268c80,804ee6c) at 0xc07dffac
syscall(cca07d48,bb9200b3,bb9000ab,bb90001f,bfbf001f,bb9160ac,bb916040,bfbfebf8,bbbc21dc,bb916040) at 0xc0711e7d
crash> quit

Examime the file system:

$ fsdb -f /dev/rcgd2a
fsdb (inum: 2)> inode 16170501
current inode: directory
I=16170501 MODE=40755 SIZE=512
        MTIME=Mar 30 23:23:50 2005 [0 nsec]
        CTIME=Feb 27 15:21:12 2009 [658332227 nsec]
        ATIME=Mar 15 21:09:23 2009 [866735678 nsec]
OWNER=apb GRP=apb LINKCNT=2 FLAGS=0x0 BLKCNT=0x4 GEN=0x72b4af9f
fsdb (inum: 16170501)> ls
fsdb (inum: 16170501)> blks
I=16170501 4 blocks
Direct blocks:
0: 65573431
fsdb (inum: 16170501)> quit
*** FILE SYSTEM MARKED DIRTY
*** BE SURE TO RUN FSCK TO CLEAN UP ANY DAMAGE
*** IF IT WAS MOUNTED, RE-MOUNT WITH -u -o reload

$ dd if=/dev/rcgd2a bs=512 skip=65573431 count=1 | hexdump -C
1+0 records in
1+0 records out
512 bytes transferred in 0.010 secs (51200 bytes/sec)
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000200

$ tunefs -N /dev/rcgd2a
tunefs: tuning /dev/rcgd2a
tunefs: current settings of /dev/rcgd2a
        maximum contiguous block count 4
        maximum blocks per file in a cylinder group 4096
        minimum percentage of free space 5%
        optimization preference: time
        average file size: 16384
        expected number of files per directory: 64
        journal log file location: in filesystem
        journal log file size: 64MB (67108864 bytes)
        journal log flags:
tunefs: no changes made

$ dumpfs -s /dev/rcgd2a
file system: /dev/rcgd2a
endian	little-endian
magic	11954 (UFS1)	time	Sun Mar 15 22:40:18 2009
superblock location	8192	id	[ 4960ee56 5d043c18 ]
cylgrp	dynamic	inodes	4.4BSD	sblock	FFSv2	fslevel 4
nbfree	11412090	ndir	642379	nifree	30237094	nffree	947460
ncg	1391	size	131285595	blocks	129238024
bsize	16384	shift	14	mask	0xffffc000
fsize	2048	shift	11	mask	0xfffff800
frag	8	shift	3	fsbtodb	2
bpg	11798	fpg	94384	ipg	23296
minfree	5%	optim	time	maxcontig 4	maxbpg	4096
symlinklen 60	contigsumsize 4
maxfilesize 0x000400400402ffff
nindir	4096	inopb	128
avgfilesize 16384	avgfpdir 64
sblkno	8	cblkno	16	iblkno	24	dblkno	1480
sbsize	2048	cgsize	16384
csaddr	1480	cssize	22528
cgrotor	0	fmod	0	ronly	0	clean	0x00
wapbl version 0x1	location 2	flags 0x0
wapbl loc0 262179520	loc1 131072	loc2 512	loc3 3
flags	none
fsmnt	/mnt
volname		swuid	0

>Fix:
Unknown.  I will keep the crash dump for some weeks or months, and I
will keep the file system unmodified for a few days in case anybody
needs information from them.

>Release-Note:

>Audit-Trail:
From: Scott Ellis <scotte@warped.com>
To: gnats-bugs@NetBSD.org
Cc: netbsd-bugs@netbsd.org
Subject: Re: kern/41024: wapbl causes file system corruption
Date: Mon, 16 Mar 2009 19:17:44 -0700

 I see similar behavior on a non-CGD system (doing rsync backups, or 
 heavy disk I/O to an external USB disk).  It seems like my issue got 
 more severe after upgrading the server from 4GB to 8GB of RAM (this is 
 amd64), but it could have coincided with recent changes in -current as well.

From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/41024: wapbl causes file system corruption
Date: Tue, 17 Mar 2009 22:05:19 +0000

 I changed the sysopsis for this PR:

 [wapbl causes file system corruption]
 ->
 [usb causes file system corruption, triggered with wapbl]

From: Scott Ellis <scotte@warped.com>
To: gnats-bugs@gnats.netbsd.org
Cc: 
Subject: Re: kern/41024
Date: Tue, 17 Mar 2009 18:43:54 -0700

 I'm not sure wapbl is even involved here.

 Mounting my (broken, ala the original PR) filesystem and doing an "fsck 
 -fy" on it with a debug kernel yields a number of "usbd_do_request: not 
 in process context" messages.  This is with -current amd64 as of "Sun 
 Mar 15 09:29:16 PDT 2009".

 Wapbl is out of the picture when fsck'ing, right?

From: Scott Ellis <scotte@warped.com>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, 
 netbsd-bugs@netbsd.org, apb@cequrux.com
Subject: Re: kern/41024
Date: Tue, 17 Mar 2009 19:53:10 -0700

 Doing an rsync to my USB drive (with wapbl, etc.), after freshly 
 formatting it eventually yielded:

 panic: kernel diagnostic assertion "de->td_node == node" failed: file 
 "/nbu/source/netbsd/src/sys/fs/tmpfs/tmpfs_vnops.c", line 657
 fatal breakpoint trap in supervisor mode
 trap type 1 code 0 rip ffffffff80152215 cs 8 rflags 246 cr2 
 ffff800057d4d670 cpl 0 rsp ffff8000790339d0
 Stopped in pid 18134.1 (cc) at  netbsd:breakpoint+0x5:  leave
 db{1}> bt
 breakpoint() at netbsd:breakpoint+0x5
 panic() at netbsd:panic+0x272
 __kernassert() at netbsd:__kernassert+0x2d
 tmpfs_remove() at netbsd:tmpfs_remove+0x2e7
 VOP_REMOVE() at netbsd:VOP_REMOVE+0x69
 do_sys_unlink() at netbsd:do_sys_unlink+0xa8
 syscall() at netbsd:syscall+0x92
 db{1}>

 Given that I wasn't doing anything with tmpfs at the time (or 
 rather...rsync wasn't...maybe some other process was, since that's 
 /tmp), it seems like a strange backtrace.

 Maybe there's more than one issue going on here?

From: "Scott Ellis" <scotte@warped.com>
To: scotte@warped.com
Cc: gnats-bugs@netbsd.org,
 kern-bug-people@netbsd.org,
 gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org,
 apb@cequrux.com
Subject: Re: kern/41024
Date: Mon, 23 Mar 2009 09:33:34 -0700 (PDT)

 Ignore my tmpfs crash and backtrace.  That problem has been tracked down
 to faulty RAM in the system (which has since been replaced).  I'm still
 seeing the occasional crash when rsync'ing to an EHCI-connected drive
 though.  Will try and get a trace from that when it happens again (which
 is unfortunately part of daily/weekly/monthly.local!).

State-Changed-From-To: open->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sat, 05 Sep 2015 20:27:06 +0000
State-Changed-Why:
Is this problem still live? Haven't seen or heard about anything like this
in a long time...


State-Changed-From-To: feedback->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sun, 01 May 2016 05:37:48 +0000
State-Changed-Why:
new question...


State-Changed-From-To: open->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sun, 01 May 2016 05:38:34 +0000
State-Changed-Why:
(in addition to the old one, namely, is this problem still current)

Were/are you using -o discard when seeing this problem? If so, it's
probably the same as 50725.


State-Changed-From-To: feedback->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Mon, 12 Mar 2018 00:13:46 +0000
State-Changed-Why:
apb's mail has been bouncing for some time :-(


State-Changed-From-To: open->closed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Mon, 12 Mar 2018 19:57:17 +0000
State-Changed-Why:
Closing, it's possible it's fixed with the WAPBL fixes, and there was no
other reports about similar issue recently.
Thank you.


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.43 2018/01/16 07:36:43 maya Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2017 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.