NetBSD Problem Report #53152

From dholland@macaran.eecs.harvard.edu  Tue Apr  3 21:31:13 2018
Return-Path: <dholland@macaran.eecs.harvard.edu>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 1EAF57A14F
	for <gnats-bugs@gnats.NetBSD.org>; Tue,  3 Apr 2018 21:31:13 +0000 (UTC)
Message-Id: <20180403213107.8D9EA6E245@macaran.eecs.harvard.edu>
Date: Tue,  3 Apr 2018 17:31:07 -0400 (EDT)
From: dholland@eecs.harvard.edu
Reply-To: dholland@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: fsync error chain lossage
X-Send-Pr-Version: 3.95

>Number:         53152
>Category:       kern
>Synopsis:       fsync error chain lossage
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Apr 03 21:35:00 +0000 2018
>Last-Modified:  Wed Apr 04 08:00:01 +0000 2018
>Originator:     dholland@NetBSD.org
>Release:        NetBSD 8.99.14 (20180402)
>Organization:
>Environment:
System: NetBSD macaran 8.99.14 NetBSD 8.99.14 (MACARAN) #48: Mon Apr 2 18:37:50 EDT 2018 dholland@macaran:/usr/src/sys/arch/amd64/compile/MACARAN amd64
Architecture: x86_64
Machine: amd64
>Description:

Pursuant to the recent discussion about broken fsync behavior in Linux
(see 
https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com#CAMsr+YHh+5Oq4xziwwoEfhoTZgr07vdGG+hu=1adXx59aTeaoQ@mail.gmail.com
)

I've found four problems so far:

(1) I/O errors are not reported back to fsync at all.
(2) Writes during genfs_putpages that fail for any reason other
    than ENOMEM cause the data to be semi-silently discarded.
(3) It appears that UVM pages are marked clean when they're selected to be
    written out, not after the write succeeds; so there are a bunch of
    potential races when writes fail.
(4) It appears that write errors for buffercache buffers are semi-silently
    discarded as well.

Further details:


(1) fsync implementations (e.g. ffs_full_fsync) use vflushbuf() to do
the syncing. Unfortunately, vflushbuf() discards the error return from
VOP_PUTPAGES. So even if the error reporting chain from biodone
through putpages works (I think it does, but there are various
opportunities for things to go wrong) it gets lost at this point.

Furthermore, for non-UVM data (buffercache buffers), the loop
immediately below in vflushbuf() checks for errors only sometimes: it
uses bawrite() for blocks that belong to the vnode and waits for them
to complete via v_numoutput, which involves no error check.


(2) The chain of bufferio completion calls eventually leads to
uvm_aio_aiodone_pages(), which contains this comment:
  /*
   * process errors.  for reads, just mark the page to be freed.
   * for writes, if the error was ENOMEM, we assume this was
   * a transient failure so we mark the page dirty so that
   * we'll try to write it again later.  for all other write
   * errors, we assume the error is permanent, thus the data
   * in the page is lost.  bummer.
   */
and the code returns with the page marked clean, so the fact that it
wasn't written is forgotten. The error is discarded at this point; but
note that it's also extracted from the buffer by biowait() and
returned. So there is one chance to see the error; unfortunately, as
already noted, it's lost.


(3) There are only a small number of places where PG_CLEAN is applied;
the one that occurs on the write path is in uvn_findpage() in
uvm_vnode.c, when called with the UFP_DIRTYONLY flag.
genfs_do_putpages does this to figure out what pages to write.
Moreover, the code in uvm_aio_aiodone_pages() that retries on ENOMEM
matches this in that it explicitly removes PG_CLEAN to cause a retry.
Ideally, the page will be locked/busy/something during the I/O so the
PG_CLEAN bit cannot be observed, but it's far from clear a priori
whether that's actually true. It would be bad if e.g. another fsync of
the same file running concurrently could observe the clean flag and 
succeed incorrectly believing the page had already been written.


(4) For buffercache buffers that are written with B_ASYNC (which
includes any that go through bawrite, such as many of the ones
explicitly written by vflushbuf as noted above) biodone2() goes
through the second of its three dispatch cases, which calls brelse()
without checking for error. brelse() at that point will invalidate and
discard it.


The discards are only semi-silent because hard I/O errors print kernel
messages at the device level; but that's not really an adequate
substitute for application error reporting.

>How-To-Repeat:

Code analysis.

>Fix:

Looks unpleasant.


>Audit-Trail:
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/53152: fsync error chain lossage
Date: Wed, 4 Apr 2018 07:34:21 +0000

 On Tue, Apr 03, 2018 at 09:35:00PM +0000, dholland@eecs.harvard.edu wrote:
  > (1) fsync implementations (e.g. ffs_full_fsync) use vflushbuf() to do
  > the syncing. Unfortunately, vflushbuf() discards the error return from
  > VOP_PUTPAGES. So even if the error reporting chain from biodone
  > through putpages works (I think it does, but there are various
  > opportunities for things to go wrong) it gets lost at this point.

 This much is pretty easy to fix: I don't see any reason we can't
 return any errors from VOP_PUTPAGES at this point. Except that it
 would be better to try writing out all pages and buffers and only fail
 afterwards.

 The rest isn't, though. Also note that if a write error is incurred by
 the syncer or by a general sync() call, the page or buffer will be
 discarded then and a subsequent call to fsync by the application will
 improperly succeed regardless.

 It isn't entirely clear what we should do in general, since (outside
 of lfs) we have no ability after a hard write error to put the
 contents somewhere else and keeping a doomed block around forever is a
 useless waste of memory; but on the other hand it seems like the user
 should be able to retry fsync if they want to... and if they do, it
 should really retry and in particular not bogusly report success.

 Maybe it would be sufficient to hold failed buffers and pages until
 the last close, and then discard them?

 (Also, mrg says point (3) should be immaterial as long as the page
 stays marked busy the whole time, which hopefully it does.)


 -- 
 David A. Holland
 dholland@netbsd.org

From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/53152: fsync error chain lossage
Date: Wed, 4 Apr 2018 07:54:27 -0000 (UTC)

 dholland-bugs@netbsd.org (David Holland) writes:

 > It isn't entirely clear what we should do in general, since (outside
 > of lfs) we have no ability after a hard write error to put the
 > contents somewhere else and keeping a doomed block around forever is a
 > useless waste of memory; but on the other hand it seems like the user
 > should be able to retry fsync if they want to... and if they do, it
 > should really retry and in particular not bogusly report success.

 A simple retry for the same block on that level is unlikely to succeed.
 You'd need some kind of bad-block mapping or other recovery, which is
 probably closer to the device level than putpages.

 Please also note that a successful write doesn't mean that the data
 can be read back. Putting lots of effort into retrying failed writes
 isn't that useful, and at the FFS level, as you said, there is no
 other recovery possible.

 So the alternative to recovery is to record the error with the
 mount point and trigger other actions, like making the mount
 read-only, or even issuing a panic().

 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

>Unformatted:
