NetBSD Problem Report #50108

From www@NetBSD.org  Fri Jul 31 21:52:30 2015
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 302A5A6551
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 31 Jul 2015 21:52:30 +0000 (UTC)
Message-Id: <20150731215228.C525CA6551@mollari.NetBSD.org>
Date: Fri, 31 Jul 2015 21:52:28 +0000 (UTC)
From: dudinea@gmail.com
Reply-To: dudinea@gmail.com
To: gnats-bugs@NetBSD.org
Subject: fsck_ffs fails replaying wapbl journal on filesystem with 4k sectors 
X-Send-Pr-Version: www-1.0

>Number:         50108
>Category:       bin
>Synopsis:       fsck_ffs fails replaying wapbl journal on filesystem with 4k sectors
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    mlelstv
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Jul 31 21:55:00 +0000 2015
>Closed-Date:    
>Last-Modified:  Sun Sep 06 08:25:00 +0000 2015
>Originator:     Eugene Doudine
>Release:        NetBSD 7.99.20
>Organization:
>Environment:
NetBSD nbsd1.trekflow.com 7.99.20 NetBSD 7.99.20 (GENERIC) #1: Wed Jul 29 09:54:29 IDT 2015  root@nbsd1.trekflow.com:/usr/obj/sys/arch/amd64/compile/GENERIC amd64


fsck_ffs ident:
     $NetBSD: crt0.S,v 1.3 2011/07/01 02:59:05 joerg Exp $
     $NetBSD: crt0-common.c,v 1.13 2013/01/31 22:24:25 matt Exp $
     $NetBSD: crti.S,v 1.1 2010/08/07 18:01:35 joerg Exp $
     $NetBSD: crtbegin.S,v 1.2 2010/11/30 18:37:59 joerg Exp $
     $NetBSD: dir.c,v 1.57 2013/06/23 07:28:36 dholland Exp $
     $NetBSD: inode.c,v 1.71 2014/04/05 12:32:27 justin Exp $
     $NetBSD: main.c,v 1.83 2015/06/16 23:58:30 christos Exp $
     $NetBSD: pass1.c,v 1.56 2013/10/19 01:09:58 christos Exp $
     $NetBSD: pass1b.c,v 1.23 2013/01/22 09:39:12 dholland Exp $
     $NetBSD: pass2.c,v 1.50 2013/06/09 17:57:09 dholland Exp $
     $NetBSD: pass3.c,v 1.21 2013/01/22 09:39:12 dholland Exp $
     $NetBSD: pass4.c,v 1.28 2013/06/23 22:03:34 dholland Exp $
     $NetBSD: pass5.c,v 1.54 2013/06/23 22:03:34 dholland Exp $
     $NetBSD: fsutil.c,v 1.26 2015/06/21 04:01:40 dholland Exp $
     $NetBSD: setup.c,v 1.100 2013/06/23 07:28:36 dholland Exp $
     $NetBSD: utilities.c,v 1.64 2013/10/19 01:09:58 christos Exp $
     $NetBSD: ffs_bswap.c,v 1.39 2015/05/20 18:21:17 riastradh Exp $
     $NetBSD: ffs_subr.c,v 1.48 2013/10/20 00:29:10 htodd Exp $
     $NetBSD: ffs_tables.c,v 1.9 2005/12/11 12:25:25 christos Exp $
     $NetBSD: ffs_appleufs.c,v 1.15 2015/02/15 11:04:43 maxv Exp $
     $NetBSD: partutil.c,v 1.15 2015/06/03 17:53:23 martin Exp $
     $NetBSD: quota2_subr.c,v 1.5 2012/02/05 14:19:04 dholland Exp $
     $NetBSD: progress.c,v 1.5 2009/04/11 06:48:36 lukem Exp $
     $NetBSD: vfs_wapbl.c,v 1.61 2014/10/18 08:33:29 snj Exp $
     $NetBSD: wapbl.c,v 1.5 2010/03/06 11:31:40 mlelstv Exp $
     $NetBSD: crtend.S,v 1.1 2010/08/07 18:01:34 joerg Exp $
     $NetBSD: crtn.S,v 1.1 2010/08/07 18:01:35 joerg Exp $

>Description:
When trying to run fsck on a FFS filesystem that has been created with 4k sectors, mounted with -o log and then not cleanly unmounted, 
fsck_ffs fails when it's replaying the journal and tries to write wrong blocks off the device end:

fsck_ffs  -d -d /dev/rvnd0a 
** /dev/rvnd0a
wapbl_replay_start: vp=0x0 off=91808 count=8192 blksize=4096
wapbl_read: 8192 bytes at block 91808 on fd 0x3
wapbl_replay: head=5332992 tail=5095424 off=8192 len=33546240 used=237568
wapbl_read: 4096 bytes at block 93052 on fd 0x3
wapbl_read: 4096 bytes at block 93108 on fd 0x3
wapbl_read: 4096 bytes at block 93109 on fd 0x3
** File system is journaled; replaying journal
wapbl_read: 4096 bytes at block 93092 on fd 0x3
wapbl_write: 4096 bytes at block 367616 on fd 0x4

CANNOT WRITE: BLK 367616
CONTINUE? [yn] n

Looks like it's trying to replay journal into blocks
that lie off the device ends. I suppose that it can
corrupt file system as well when wrong blocks to be written
lie within device boundaries.

Details of the device and filesystem above:

device created with:
vnconfig -c vnd0 ./diskimg.bin 4096/50/20/100

dumpfs -s /dev/rvnd0a:

file system: /dev/rvnd0a
format  FFSv1
endian  little-endian
magic   11954           time    Sun Jul 26 21:16:49 2015
superblock location     8192    id      [ 55b51630 27fdad44 ]
cylgrp  dynamic inodes  4.4BSD  sblock  FFSv2   fslevel 4
nbfree  11377   ndir    1       nifree  22522   nffree  11
ncg     4       size    91808   blocks  91031
bsize   32768   shift   15      mask    0xffff8000
fsize   4096    shift   12      mask    0xfffff000
frag    8       shift   3       fsbtodb 0
bpg     2869    fpg     22952   ipg     5632
minfree 5%      optim   time    maxcontig 2     maxbpg  8192
symlinklen 60   contigsumsize 2
maxfilesize 0x004002001005ffff
nindir  8192    inopb   256
avgfilesize 16384       avgfpdir 64
sblkno  8       cblkno  16      iblkno  24      dblkno  200
sbsize  4096    cgsize  8192
csaddr  200     cssize  4096
cgrotor 0       fmod    0       ronly   0       clean   0x02
wapbl version 0x1       location 1      flags 0x0
wapbl loc0 91808        loc1 8192       loc2 4096       loc3 0
flags   wapbl 
fsmnt   /mnt/test_fsck
volname         swuid   0

>How-To-Repeat:
Create an FFS filesystem on 4k system disk or with -S 4096 option

Mount with -o log

Crash

fsck /dev/{DEVNAME}

I've been able to reproduce problem on both real and vnode
devices. On the other hand kernel can replay journal 
on such filesystems without errors. 


>Fix:

>Release-Note:

>Audit-Trail:
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: bin/50108: fsck_ffs fails replaying wapbl journal on filesystem
 with 4k sectors
Date: Sun, 2 Aug 2015 07:48:59 +0000

 On Fri, Jul 31, 2015 at 09:55:00PM +0000, dudinea@gmail.com wrote:
  > When trying to run fsck on a FFS filesystem that has been created
  > with 4k sectors, mounted with -o log and then not cleanly
  > unmounted, fsck_ffs fails when it's replaying the journal and tries
  > to write wrong blocks off the device end:
  > 
  > fsck_ffs  -d -d /dev/rvnd0a 
  > ** /dev/rvnd0a
  > wapbl_replay_start: vp=0x0 off=91808 count=8192 blksize=4096
  > wapbl_read: 8192 bytes at block 91808 on fd 0x3
  > wapbl_replay: head=5332992 tail=5095424 off=8192 len=33546240 used=237568
  > wapbl_read: 4096 bytes at block 93052 on fd 0x3
  > wapbl_read: 4096 bytes at block 93108 on fd 0x3
  > wapbl_read: 4096 bytes at block 93109 on fd 0x3
  > ** File system is journaled; replaying journal
  > wapbl_read: 4096 bytes at block 93092 on fd 0x3
  > wapbl_write: 4096 bytes at block 367616 on fd 0x4
  > 
  > CANNOT WRITE: BLK 367616
  > CONTINUE? [yn] n
  > 
  > Looks like it's trying to replay journal into blocks
  > that lie off the device ends. I suppose that it can
  > corrupt file system as well when wrong blocks to be written
  > lie within device boundaries.

 The cause of this is almost certainly wrong unit conversion
 somewhere.

 However, I notice that the dumpfs output says that there are 91808
 blocks in the fs. So reading from block 91808 (not to mention past
 that) and succeeding seems odd - does this volume have the journal
 past the end of the volume data?

 Anyway, 367616 is 91904 * 4, and 91904 is reasonably close to the
 other blocks it's accessing so it's a plausible correct block value.
 Which would imply that there's a stray factor of 4x somewhere. I'd
 have expected 8x (4096 / 512) but whatever...

 If you (or someone) can catch this in the debugger and then trace the
 provenance of that 367616 my guess is that it's been multiplied by the
 wrong thing.

 Although this is not trivial; for the moment I can't tell for sure if
 wapbl_read and wapbl_write are supposed to be getting block numbers in
 frags or device-level blocks or DEV_BSIZE blocks - it looks like in
 fsck it's supposed to be device-level blocks, but I'm not sure it's
 the same in the kernel... and I'm not sure that's what it's getting
 either (from reading the code) (which is what one would expect given
 the bug report I suppose).

 unfortunately the units are obscure and most of the conversion macros
 are named unhelpful things like 'sntod' or 'fsbtodb' where the letters
 aren't necessarily related much to the units they convert.

 -- 
 David A. Holland
 dholland@netbsd.org

From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/50108: fsck_ffs fails replaying wapbl journal on filesystem with 4k sectors
Date: Sun, 2 Aug 2015 10:45:07 +0000 (UTC)

 dholland-bugs@netbsd.org (David Holland) writes:

 > If you (or someone) can catch this in the debugger and then trace the
 > provenance of that 367616 my guess is that it's been multiplied by the
 > wrong thing.

 > Although this is not trivial; for the moment I can't tell for sure if
 > wapbl_read and wapbl_write are supposed to be getting block numbers in
 > frags or device-level blocks or DEV_BSIZE blocks

 It depends on wether it is kernel (DEV_BSIZE) or userland (sector size).

 The fsck_ffs output shows that the journal blocks can be read
 successfully. The write however failed, and assuming this is
 a unit problem and the real block is 91904 or even 45952, it's
 a block inside the filesystem.

 Writes to blocks inside the filesystem using wapl_write are
 caused by replaying the journal... and the block numbers that
 appear inside the journal are written by the kernel using
 its DEV_BSIZE units. When replayed by fsck_ffs in userland
 however, these are interpreted as sector size units.

 You can dump the journal with:

 dumpfs -sjv vnd0a

 This should show a block with 'daddr 367616' in the journal.


 There are 4 places in the kernel code in vfs_wapbl.c (also used by
 fsck_ffs) that retrieve block numbers from the journal.

 wapbl_blkhash_ins(wr, wc->wc_blocks[i].wc_daddr + btodb(j * fsblklen), *offp);
 wapbl_blkhash_rem(wr, wc->wc_blocks[i].wc_daddr + btodb(j * fsblklen));

 and twice in the disabled verify function:
 wapbl_blkhash_get(wr, wc->wc_blocks[i].wc_daddr + btodb(j * fsblklen));


 Unfortunately there is no magic to transparently get the right
 unit conversion here.

 You could replace

 	wc->wc_blocks[i].wc_daddr + btodb(j * fsblklen)

 with

 	wapbl_block_daddr(wc, i, j, fsblklen)

 for something like:

 daddr_t
 wapbl_block_daddr(struct wapbl_wc_blocklist *wc, int i, int j, int blen)
 {
 #ifdef _KERNEL
 	return wc->wc_blocks[i].wc_daddr + btodb(j * blen);
 #else
 	return dbtob(wc->wc_blocks[i].wc_daddr) / blen + j;
 #endif
 }



 -- 
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

Responsible-Changed-From-To: bin-bug-people->mlelstv
Responsible-Changed-By: mlelstv@NetBSD.org
Responsible-Changed-When: Sun, 09 Aug 2015 07:36:15 +0000
Responsible-Changed-Why:
I have a fix


State-Changed-From-To: open->analyzed
State-Changed-By: mlelstv@NetBSD.org
State-Changed-When: Sun, 09 Aug 2015 07:36:15 +0000
State-Changed-Why:
problem is understood


State-Changed-From-To: analyzed->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Mon, 17 Aug 2015 18:23:03 +0000
State-Changed-Why:
problem is not understood unless the explanation is posted to gnats :-)


From: Eugene <dudinea@gmail.com>
To: gnats-bugs@NetBSD.org
Cc: dholland@NetBSD.org, mlelstv@NetBSD.org, netbsd-bugs@netbsd.org,
 gnats-admin@netbsd.org
Subject: Re: bin/50108 (fsck_ffs fails replaying wapbl journal on filesystem
 with 4k sectors)
Date: Mon, 17 Aug 2015 23:02:40 +0300

 On Mon, 17 Aug 2015 18:23:03 +0000 (UTC)
 dholland@NetBSD.org wrote:

 > Synopsis: fsck_ffs fails replaying wapbl journal on filesystem with
 > 4k sectors
 [...]
 > problem is not understood unless the explanation is posted to
 > gnats :-)
 Why? Michael's explanation and fix look right: I've tested his proposed
 solution, and it works right for both 4k and regular 512 byte sectors. 

 BTW, on the other hand, the idea of the fix - passing same function
 argument with different units depending on user/kernel mode -  does not
 look elegant to me. 

 Wouldn't it be better to modify fsck_ffs code (that is the
 wapbl_read()/wapbl_write() functions in sbin/fsck_ffs/wapbl.c -
 kern/vfs_wapbl.c calls them back for IO when working in user mode) 
 to use the same DEV_BSIZE unit as kernel when replaying the journal? 

 -- 
 Regards,
 Eugene

 mailto:dudinea@gmail.com



From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/50108 (fsck_ffs fails replaying wapbl journal on filesystem with 4k sectors)
Date: Mon, 17 Aug 2015 23:11:51 +0000 (UTC)

 dudinea@gmail.com (Eugene) writes:

 >Why? Michael's explanation and fix look right: I've tested his proposed
 >solution, and it works right for both 4k and regular 512 byte sectors. 

 It has already been committed to -current and netbsd-7.

 >BTW, on the other hand, the idea of the fix - passing same function
 >argument with different units depending on user/kernel mode -  does not
 >look elegant to me. 

 All disk I/O code originally addressed a block by physical block
 number and DEV_BSIZE was the (one and only) physical block size.

 At some point, about when the SCSI subsystem was integrated into
 the kernel, the model was changed, the kernel now uses the fixed
 DEV_BSIZE=512 coordinates and the driver translates that into
 physical blocks.
 Userland however still maintains the old view and there was lots
 of code in the kernel that treated DEV_BSIZE as the physical block
 size.

 This wasn't a problem as virtually all disks used 512byte blocks
 at that time. Both views were effectively the same.

 When disks with other block sizes became popular, it was necessary
 to decide how to handle this in kernel and userland.

 Userland using physical blocks for addressing had no problems
 except for determining the block size. This is done by querying
 the device driver for low-level tools and by checking filesystem
 parameters (== superblock) for filesystem tools.

 The kernel mainly needed to fix all places where filesystem
 to disk block translations still didn't use DEV_BSIZE.

 One problem is shared code between kernel and userland. But
 since all this code only dealt with the translation between filesytem
 coordinates and disk coordinates, this is handled in exactly one
 place, the fsbtodb/dbtofsb macros which are now defined differently
 for kernel and userland.

 WAPBL is an exception. It's the only filesystem-related code that
 does not use filesystem coordinates but disk coordinates (in kernel
 that's DEV_BSIZE units) but which are then passed in the journal to
 userland code.
 And since WAPBL is again shared code between kernel and userland the
 different views need to be reflected by that code.


 >Wouldn't it be better to modify fsck_ffs code (that is the
 >wapbl_read()/wapbl_write() functions in sbin/fsck_ffs/wapbl.c -
 >kern/vfs_wapbl.c calls them back for IO when working in user mode) 
 >to use the same DEV_BSIZE unit as kernel when replaying the journal? 

 The functions are used for both types of coordinates. When reading/writing
 the journal it's filesystem coordinates. When transfering data between
 filesystem and journal its disk coordinates (as written by the kernel).

 The only way to get out of this split is to have the same view on
 both sides again. Either revert the kernel to use physical block
 numbers again (i.e. change all drivers and most filesystems)
 or change userland to use DEV_BSIZE coordinates (become incompatibel
 with other BSDs and third party software).

 On the other hand, it doesn't really add complexity. Unit translations
 are done all over the place and you just have to keep in mind that
 kernel and userland have different views.


 -- 
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: Eugene <dudinea@gmail.com>
To: gnats-bugs@NetBSD.org
Cc: mlelstv@serpens.de, mlelstv@NetBSD.org, gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org
Subject: Re: bin/50108 (fsck_ffs fails replaying wapbl journal on filesystem
 with 4k sectors)
Date: Tue, 18 Aug 2015 06:59:39 +0300

 On Mon, 17 Aug 2015 23:20:01 +0000 (UTC)
 mlelstv@serpens.de (Michael van Elst) wrote:

 >  It has already been committed to -current and netbsd-7.
 Thanks! But your commit states that it fixes bin/45933, not this one.
 Is this by error or bin/45933 is really a manifestation of the same
 bug? 

 >  
 >  >BTW, on the other hand, the idea of the fix - passing same function
 [ skipped ]
 >  On the other hand, it doesn't really add complexity. Unit
 > translations are done all over the place and you just have to keep in
 > mind that kernel and userland have different views.

 Thank you for your great explanation.

 -- 
 Regards,
 Eugene

 mailto:dudinea@gmail.com



From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/50108 (fsck_ffs fails replaying wapbl journal on filesystem with 4k sectors)
Date: Tue, 18 Aug 2015 16:16:21 +0000 (UTC)

 dudinea@gmail.com (Eugene) writes:

 >On Mon, 17 Aug 2015 23:20:01 +0000 (UTC)
 >mlelstv@serpens.de (Michael van Elst) wrote:

 >>  It has already been committed to -current and netbsd-7.
 >Thanks! But your commit states that it fixes bin/45933, not this one.
 >Is this by error or bin/45933 is really a manifestation of the same
 >bug? 

 45933 is probably a different error, but even with correct superblock
 it would have failed (and then fsck might have trashed the filesystem).

 -- 
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: bin/50108 (fsck_ffs fails replaying wapbl journal on filesystem
 with 4k sectors)
Date: Sat, 5 Sep 2015 23:43:52 +0000

 On Mon, Aug 17, 2015 at 08:05:01PM +0000, Eugene wrote:
  >  [...]
  >  > problem is not understood unless the explanation is posted to
  >  > gnats :-)
  >  Why? Michael's explanation and fix look right: I've tested his proposed
  >  solution, and it works right for both 4k and regular 512 byte sectors. 

 It's not included in the PR. It *is* included in 45933, but I'd
 forgotten about that one and it wasn't referenced in here here until
 later.

 If the issue is "understood" but doesn't actually get fixed, what
 happens is that someone coming by months (or years) later has no idea
 what was understood and has to redo the work. (Or if they're lucky
 they can ask the original person, but things get forgotten.)

 so it was just a procedural complaint.

  >  Wouldn't it be better to modify fsck_ffs code (that is the
  >  wapbl_read()/wapbl_write() functions in sbin/fsck_ffs/wapbl.c -
  >  kern/vfs_wapbl.c calls them back for IO when working in user mode) 
  >  to use the same DEV_BSIZE unit as kernel when replaying the journal? 

 This is a thorny and ugly issue and there isn't any one good solution
 to it that doesn't involve digging up and chewing over all of ffs. :(

 -- 
 David A. Holland
 dholland@netbsd.org

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: bin/50108 (fsck_ffs fails replaying wapbl journal on filesystem
 with 4k sectors)
Date: Sat, 5 Sep 2015 23:46:05 +0000

 On Tue, Aug 18, 2015 at 04:20:00PM +0000, Michael van Elst wrote:
  >  >Thanks! But your commit states that it fixes bin/45933, not this one.
  >  >Is this by error or bin/45933 is really a manifestation of the same
  >  >bug? 
  >  
  >  45933 is probably a different error, but even with correct superblock
  >  it would have failed (and then fsck might have trashed the filesystem).

 ...so, what else do we understand about how this PR is different from
 45933?

 -- 
 David A. Holland
 dholland@netbsd.org

From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/50108 (fsck_ffs fails replaying wapbl journal on filesystem with 4k sectors)
Date: Sun, 6 Sep 2015 08:19:40 +0000 (UTC)

 dholland-bugs@netbsd.org (David Holland) writes:

 > ...so, what else do we understand about how this PR is different from
 > 45933?

 45933 is caused by a bad superblock. The superblock encodes the
 physical sector size as seen by newfs. Userland tools like fsck
 use that information when reading a filesystem image. The kernel
 ignores that information, the disk driver has its own data to
 rely on.

 So when you make a filesystem image visible to multiple kernels
 (like in: Xen host vs. Xen guest) it doesn't matter. But the
 userland running under either kernel still cares.

 -- 
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

>Unformatted:

 Code similar to the suggested has been committed in vfs_wapl.c 1.62
 and pulled up to netbsd-7.

 This probably needs to be applied to netbsd-6 as well.

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2014 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.