NetBSD Problem Report #58553
From www@netbsd.org Sun Aug 4 15:26:08 2024
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256
client-signature RSA-PSS (2048 bits) client-digest SHA256)
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 6E63C1A923C
for <gnats-bugs@gnats.NetBSD.org>; Sun, 4 Aug 2024 15:26:08 +0000 (UTC)
Message-Id: <20240804152607.06A561A923E@mollari.NetBSD.org>
Date: Sun, 4 Aug 2024 15:26:06 +0000 (UTC)
From: campbell+netbsd@mumble.net
Reply-To: campbell+netbsd@mumble.net
To: gnats-bugs@NetBSD.org
Subject: ffs: garbage data appended after crash
X-Send-Pr-Version: www-1.0
>Number: 58553
>Category: kern
>Synopsis: ffs: garbage data appended after crash
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sun Aug 04 15:30:01 +0000 2024
>Last-Modified: Sun Aug 04 18:55:01 +0000 2024
>Originator: Taylor R Campbell
>Release: current, 10, 9, 8, 7, 6, ...
>Organization:
The NetBSD Foundation
>Environment:
>Description:
When ffs appends data to a file, it does three things:
1. allocate data blocks
2. write to data blocks
3. increase inode size
During normal operation, these steps are taken under the exclusive vnode lock, so other threads and processes can't see the intermediate states.
But if the system crashes in the middle of the steps, it may end up with data blocks that have been allocated, an inode that has been extended, and garbage in the data blocks because step (2) never finished.
This can happen because although steps (1) and (3) are metadata updates, which traditional ffs issues synchronously and wapbl issues in a transaction with a write-ahead log, step (2) is a data update which is largely unordered with respect to metadata updates.
The issue is exacerbated by wapbl, which accelerates metadata updates without changing the rate of data updates.
>How-To-Repeat:
1. start a write-heavy workload
2. crash the system in the middle
>Fix:
1. Create new type of logical log record truncate(n,k) for `truncate inode n to byte k'. (This requires versioning -- older versions of NetBSD won't be able to replay these logs, so it'll require a newfs or tunefs option to opt in.)
2. Change ffs_write (WRITE in sys/ufs/ufs/ufs_readwrite.c) so that when it extends a file from length k0 to length k1, it creates a record truncate(n,k0) in the next transaction.
3. Change ffs_fsync and ffs_full_fsync so that if they are syncing any prefix of the interval [k0, k1], say to byte k, they change the record to truncate(n,k) in the next transaction. If they are syncing the whole interval, they delete the record in the next transaction.
(We can also use truncate(n,k) records to make truncate itself atomic. Currently truncation is split over multiple transactions, in order to avoid overflowing a transaction when truncating a very large file, which requires deallocating large numbers of data blocks; so if you crash in the middle of truncating a 100000-byte file to 100 bytes, you might find the file larger than 100 bytes but smaller than 100000 bytes.)
FreeBSD avoids the problem by enforcing a partial ordering with a more elaborate system of block dependencies called soft updates, or softdep. We used softdep, but after years of struggling with it we concluded it was unmaintainable and undebuggable, and removed it 15 years ago: https://mail-index.netbsd.org/source-changes/2009/02/22/msg217531.html, https://mail-index.netbsd.org/netbsd-announce/2008/12/14/msg000051.html
>Audit-Trail:
From: Robert Elz <kre@munnari.OZ.AU>
To: campbell+netbsd@mumble.net
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/58553: ffs: garbage data appended after crash
Date: Mon, 05 Aug 2024 01:51:02 +0700
Date: Sun, 4 Aug 2024 15:30:01 +0000 (UTC)
From: campbell+netbsd@mumble.net
Message-ID: <20240804153001.C637C1A923F@mollari.NetBSD.org>
| 1. start a write-heavy workload
That's not necessarily needed ... I've seen cases where this kind of
thing happens on an almost idle system, where metadata updates were
all done, but data hadn't been written to files when the system crashed
(sudden complete power loss I think it was) about 12 hours after the
data had been written. The data writes depend upon something in the
system bothering to do them, and while if you have a write-heavy workload
that's likely to not take too long, if you don't, it can sometimes be
a very long time.
In my case I could easily tell as the data that was "lost" (not really,
I had copies) was incoming e-mail - the mail files all looked to be there,
had appropriate modify times, sizes, etc, but garbage contents.
[Since then I have my own replacement for update(8) running all the time!]
| >Fix:
I doubt that can be called a fix. A hack which might work around some
of the issues - perhaps the most common case - but not a fix.
Two major issues I can see .. first, nothing in your proposal covers
the case of data overwrites, where the metadata (other than the mtime)
isn't being altered at all, but several blocks of data are being written
somewhere in the middle of a file - some of those might be written, and
others not, leading to garbage in the file which is neither its before
nor intended after state. Your "at the end" case is just the common
case of that, but to be considered a fix, all of it would need fixing.
And:
| 3. Change ffs_fsync and ffs_full_fsync so that if they are syncing any
| prefix of the interval [k0, k1],
And if not syncing a prefix - but some data in the middle? Easy to just
not update things in that case, but sometime later, when the earlier part
of the interval has been written, the record would need to grow all of these
other blocks, as they won't happen again. The typical solution to that is
to split the record into two on any write to a segment in the interval, one
for what is still to come before, and one for what comes after, omitting
either, or both, of those if empty. In hard cases that can deteriorate
into a real mess.
However:
| (We can also use truncate(n,k) records to make truncate itself atomic
that one probably would be a benefit, though whether it is sufficiently
useful to add this extra mechanism, and forgo backward compat, I doubt.
After all, everything needed to finish a truncate is in the metadata, if
the size says the file should be 100 bytes, and there are blocks allocated
beyond that, those can easily be removed during file system cleanup, after
the crash. That is, we can deduce what was happening from the state
that remains.
This is all much much harder than it looks. If we really believe some
kind of better method is needed, we should probably bribe Kirk to come
and make softdeps work in NetBSD. Not that even that is a full solution,
data corruption after a crash is extremely hard to avoid without doing
fully synchronous (all the way to the flash or platter) I/O - which is
not something most people would tolerate most of the time.
kre