NetBSD Problem Report #50725

From dholland@macaran.eecs.harvard.edu  Fri Jan 29 18:09:02 2016
Return-Path: <dholland@macaran.eecs.harvard.edu>
Received: from mail.netbsd.org (mail.NetBSD.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id DF0D47ABFC
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 29 Jan 2016 18:09:02 +0000 (UTC)
Message-Id: <20160129180831.73AFD6E25D@macaran.eecs.harvard.edu>
Date: Fri, 29 Jan 2016 13:08:31 -0500 (EST)
From: dholland@eecs.harvard.edu
Reply-To: dholland@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: ffs -o discard crash/unmount safety issues
X-Send-Pr-Version: 3.95

>Number:         50725
>Category:       kern
>Synopsis:       ffs -o discard crash/unmount safety issues
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Jan 29 18:10:00 +0000 2016
>Last-Modified:  Sun Aug 13 21:05:01 +0000 2017
>Originator:     David A. Holland
>Release:        NetBSD 7.99.25 (20151222)
>Organization:
>Environment:
System: NetBSD macaran 7.99.25 NetBSD 7.99.25 (MACARAN) #34: Tue Dec 22 23:55:33 EST 2015 dholland@macaran:/usr/src/sys/arch/amd64/compile/MACARAN amd64
Architecture: x86_64
Machine: amd64
>Description:

When mounted with -o discard, freed blocks are discarded in the
background. They aren't actually marked free in the block bitmap until
the discard is completed. (This is necessary because if such a block
is reallocated and written to before the discard is done, it'll
potentailly be lost.)

riz@ found today that if you discard a lot (he deleted ~3G) the
background discarding happens slowly enough that you can sit and wait
and watch the free space count increase.

This is partly because there's no logic for coalescing the discard
requests (more on that in a second PR) but what happens if the system
is shut down or crashes while the discards are still being processed?

Some quick inspection of the code reveals the following problems:

 * sync does not wait for pending discards to complete.
 * unmount does, but if the discards take more than five seconds it
   times out, prints a warning, and plows ahead. Given the observed
   behavior this is not long enough.

Therefore, if you unmount while there are discards pending and it
takes more than five seconds, the blocks not processed will not be
freed; and after unmount the volume will be marked clean, so those
blocks will disappear until you do a full fsck -f.

Crashing is not as big a problem as the fsck after crashing should
repair things. (However, does fsck do fdiscard on blocks it releases?
I bet not.) I don't remember offhand if traditional fsck will fail in
preen mode if it finds unreferenced blocks, but if it does that's a
further problem for unattended crashes.

>How-To-Repeat:

As above.

>Fix:

Increasing the unmount timeout is easy, it's on line 1643 of
ffs_alloc.c in ffs_discard_finish(). Probably it should wait five
seconds, print one warning, and then wait some substantially longer
time before giving up.

Having sync wait for discards to finish should also be fairly
straightforward; the logic at the top of ffs_discard_finish can be
shared (provided one is careful about races if more than one such wait
is tried at once, and that the cv_signal in ffs_discardcb is changed
to a broadcast) and it's only necessary to have the fs-level sync op
call that logic.

If in the case of a crash fsck fails afterwards that will be
substantially harder to deal with but I don't think it's the case.

This is without wapbl. With wapbl all the above is still true, except
that because crashing doesn't result in fsck and (AFAIK) nothing is
done to register the pending deletions on disk, the unprocessed blocks
will disappear until a full fsck -f is done.

>Audit-Trail:
From: =?UTF-8?B?SmFyb23DrXIgRG9sZcSNZWs=?= <jaromir.dolecek@gmail.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/50725 ffs -o discard crash/unmount safety issues
Date: Sun, 25 Sep 2016 00:40:58 +0200

 There is another interesting thing regarding wapbl + discard, which
 needs to be taken care of.

 In ffs_wapbl_sync_metadata() called from wapbl_flush(), the code loops
 over deallocblks and calls ffs_blkfree(). Then it calls ffs_cgupdate()
 to save the cg descriptors to disk, which makes it in turn queued for
 journal write.

 However, if discard is active, those ffs_blkfree() just register the
 discard and do not modify the cg's right away. This means that cgs are
 not saved as part of the transaction in the flush, and get to saved
 only on the next sync.

 Jaromir

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/50725 ffs -o discard crash/unmount safety issues
Date: Wed, 28 Sep 2016 10:38:45 +0000

 On Sat, Sep 24, 2016 at 10:45:01PM +0000, Jarom?r Dole?ek wrote:
  >  There is another interesting thing regarding wapbl + discard, which
  >  needs to be taken care of.
  >  
  >  In ffs_wapbl_sync_metadata() called from wapbl_flush(), the code loops
  >  over deallocblks and calls ffs_blkfree(). Then it calls ffs_cgupdate()
  >  to save the cg descriptors to disk, which makes it in turn queued for
  >  journal write.
  >  
  >  However, if discard is active, those ffs_blkfree() just register the
  >  discard and do not modify the cg's right away. This means that cgs are
  >  not saved as part of the transaction in the flush, and get to saved
  >  only on the next sync.

 It's been (unrelatedly) pointed out in several places recently that in
 addition to the problem that many drives ignore small discards,
 apparently the underlying SATA operation requires waiting for all
 other pending I/Os to complete. This is going to trash performance no
 matter what software-level steps are taken.

 Between these three problems I'm starting to think the current code
 for -o discard should be deprecated/removed, and instead we want code
 that runs once in a while and discards all free areas, maybe one
 cylinder group at a time.

 -- 
 David A. Holland
 dholland@netbsd.org

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/50725 CVS commit: src/sys/ufs/ffs
Date: Thu, 10 Nov 2016 22:19:23 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Thu Nov 10 22:19:23 UTC 2016

 Modified Files:
 	src/sys/ufs/ffs: ffs_wapbl.c

 Log Message:
 disable discard when log is enabled to preserve log consistency promise

 PR kern/50725


 To generate a diff of this commit:
 cvs rdiff -u -r1.36 -r1.37 src/sys/ufs/ffs/ffs_wapbl.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Michael van Elst" <mlelstv@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/50725 CVS commit: src/sys/ufs/ffs
Date: Sun, 13 Aug 2017 21:00:58 +0000

 Module Name:	src
 Committed By:	mlelstv
 Date:		Sun Aug 13 21:00:58 UTC 2017

 Modified Files:
 	src/sys/ufs/ffs: ffs_alloc.c

 Log Message:
 Don't time out the discard work queue here. Either destroying a work queue
 with pending work items panics or accessing freed resources from the work
 item will crash. The timeout needs to be handled gracefully by the driver
 that implements the discard operation.

 Fixes parts of PR 50725.


 To generate a diff of this commit:
 cvs rdiff -u -r1.157 -r1.158 src/sys/ufs/ffs/ffs_alloc.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2014 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.