NetBSD Problem Report #53152
From email@example.com Tue Apr 3 21:31:13 2018
Date: Tue, 3 Apr 2018 17:31:07 -0400 (EDT)
Subject: fsync error chain lossage
>Synopsis: fsync error chain lossage
>Arrival-Date: Tue Apr 03 21:35:00 +0000 2018
>Last-Modified: Wed Apr 04 08:00:01 +0000 2018
>Release: NetBSD 8.99.14 (20180402)
System: NetBSD macaran 8.99.14 NetBSD 8.99.14 (MACARAN) #48: Mon Apr 2 18:37:50 EDT 2018 dholland@macaran:/usr/src/sys/arch/amd64/compile/MACARAN amd64
Pursuant to the recent discussion about broken fsync behavior in
Linux, I've found four problems so far:
(1) I/O errors are not reported back to fsync at all.
(2) Write errors during genfs_putpages that fail for any reason other
than ENOMEM cause the data to be semi-silently discarded.
(3) It appears that UVM pages are marked clean when they're selected to be
written out, not after the write succeeds; so there are a bunch of
potential races when writes fail.
(4) It appears that write errors for buffercache buffers are semi-silently
discarded as well.
(1) fsync implementations (e.g. ffs_full_fsync) use vflushbuf() to do
the syncing. Unfortunately, vflushbuf() discards the error return from
VOP_PUTPAGES. So even if the error reporting chain from biodone
through putpages works (I think it does, but there are various
opportunities for things to go wrong) it gets lost at this point.
Furthermore, for non-uvm pages, the loop immediately below in
vflushbuf() checks for error only sometimes: it uses bawrite() for
blocks that belong to the vnode and waits for them to complete via
v_numoutput, which doesn't involve an error check.
(2) The chain of buffer I/O completion calls eventually leads to
uvm_aio_aiodone_pages(), which contains this comment:
* process errors. for reads, just mark the page to be freed.
* for writes, if the error was ENOMEM, we assume this was
* a transient failure so we mark the page dirty so that
* we'll try to write it again later. for all other write
* errors, we assume the error is permanent, thus the data
* in the page is lost. bummer.
and the code returns with the page marked clean, so the fact that it
wasn't written is forgotten. The error is discarded at this point;
note, however, that it's also extracted from the buffer by biowait()
and returned. So there is one chance to see the error; unfortunately,
as already noted, it's lost.
(3) There are only a small number of places where PG_CLEAN is applied;
the one that occurs on the write path is in uvn_findpage() in
uvm_vnode.c, when called with the UFP_DIRTYONLY flag.
genfs_do_putpages does this to figure out what pages to write.
Moreover, the code in uvm_aio_aiodone_pages() that retries on ENOMEM
matches this in that it explicitly removes PG_CLEAN to cause a retry.
Ideally, the page will be locked/busy/something during the I/O so the
PG_CLEAN bit cannot be observed, but it's far from clear a priori
whether that's actually true. It would be bad if e.g. another fsync of
the same file running concurrently could observe the clean flag and
succeed incorrectly believing the page had already been written.
(4) For buffercache buffers that are written with B_ASYNC (which
includes any that go through bawrite, such as many of the ones
explicitly written by vflushbuf as noted above) biodone2() goes
through the second of its three dispatch cases, which calls brelse()
without checking for error. brelse() at that point will invalidate and
release the buffer, so the error is never reported.
The discards are only semi-silent because hard I/O errors print kernel
messages at the device level; but that's not really an adequate
substitute for application error reporting.
From: David Holland <firstname.lastname@example.org>
Subject: Re: kern/53152: fsync error chain lossage
Date: Wed, 4 Apr 2018 07:34:21 +0000
On Tue, Apr 03, 2018 at 09:35:00PM +0000, email@example.com wrote:
> (1) fsync implementations (e.g. ffs_full_fsync) use vflushbuf() to do
> the syncing. Unfortunately, vflushbuf() discards the error return from
> VOP_PUTPAGES. So even if the error reporting chain from biodone
> through putpages works (I think it does, but there are various
> opportunities for things to go wrong) it gets lost at this point.
This much is pretty easy to fix: I don't see any reason we can't
return any errors from VOP_PUTPAGES at this point, except that it
would be better to try writing out all pages and buffers first and
only fail afterward.
The rest isn't so easy, though. Also note that if a write error is incurred by
the syncer or by a general sync() call, the page or buffer will be
discarded then and a subsequent call to fsync by the application will
improperly succeed regardless.
It isn't entirely clear what we should do in general, since (outside
of lfs) we have no ability after a hard write error to put the
contents somewhere else and keeping a doomed block around forever is a
useless waste of memory; but on the other hand it seems like the user
should be able to retry fsync if they want to... and if they do, it
should really retry and in particular not bogusly report success.
Maybe it would be sufficient to hold failed buffers and pages until
the last close, and then discard them?
(Also, mrg says point (3) should be immaterial as long as the page
stays marked busy the whole time, which hopefully it does.)
David A. Holland
From: firstname.lastname@example.org (Michael van Elst)
Subject: Re: kern/53152: fsync error chain lossage
Date: Wed, 4 Apr 2018 07:54:27 -0000 (UTC)
email@example.com (David Holland) writes:
> It isn't entirely clear what we should do in general, since (outside
> of lfs) we have no ability after a hard write error to put the
> contents somewhere else and keeping a doomed block around forever is a
> useless waste of memory; but on the other hand it seems like the user
> should be able to retry fsync if they want to... and if they do, it
> should really retry and in particular not bogusly report success.
A simple retry for the same block on that level is unlikely to succeed.
You'd need some kind of bad-block mapping or other recovery, which is
probably closer to the device level than putpages.
Please also note that a successful write doesn't mean that the data
can be read back. Putting lots of effort into retrying failed writes
isn't that useful, and on the FFS level, as you said, there is no
other place to put the data anyway.
So the alternative to recovery is to record the error with the
mount point and trigger other actions, like making the mount
read-only, or even issuing a panic().
Michael van Elst
"A potential Snark may lurk in every tree."