NetBSD Problem Report #39564

From jmcneill@black.invisible.ca  Tue Sep 16 16:10:58 2008
Return-Path: <jmcneill@black.invisible.ca>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id 6BEEF63B842
	for <gnats-bugs@gnats.NetBSD.org>; Tue, 16 Sep 2008 16:10:58 +0000 (UTC)
Message-Id: <20080916161048.243F17D96D4@black.invisible.ca>
Date: Tue, 16 Sep 2008 12:10:47 -0400 (EDT)
From: jmcneill@invisible.ca
Reply-To: jmcneill@invisible.ca
To: gnats-bugs@gnats.NetBSD.org
Subject: wapbl performance issues with disk cache flushing
X-Send-Pr-Version: 3.95

>Number:         39564
>Category:       kern
>Synopsis:       wapbl performance issues with disk cache flushing
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    ad
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Sep 16 16:15:01 +0000 2008
>Closed-Date:    Sun Feb 22 20:12:17 +0000 2009
>Last-Modified:  Tue Feb 24 04:15:02 +0000 2009
>Originator:     Jared D. McNeill
>Release:        NetBSD 4.99.72
>Organization:

>Environment:


System: NetBSD black.invisible.ca 4.99.72 NetBSD 4.99.72 (GENERIC) #0: Mon Sep 15 11:17:37 EDT 2008 jmcneill@black.invisible.ca:/export/home/jmcneill/branches/HEAD/src/sys/arch/amd64/compile/GENERIC amd64
Architecture: x86_64
Machine: amd64

System: NetBSD blue.invisible.ca 4.99.72 NetBSD 4.99.72 (GENERIC) #1: Tue Sep 16 11:07:56 EDT 2008  jmcneill@blue.invisible.ca:/export/home/jmcneill/branches/HEAD/src/sys/arch/i386/compile/GENERIC i386
Architecture: i386
Machine: i386
>Description:
	Disk cache flushing can severely impact performance on a wapbl-enabled
	filesystem.
>How-To-Repeat:
	1. Grab qt3 and qt4 distfiles from pkgsrc:
-rw-r--r--  1 jmcneill  users  14360483 Mar 15  2007 qt-x11-free-3.3.8.tar.bz2
-rw-r--r--  1 jmcneill  users  98584925 May  2 08:03 qt-x11-opensource-src-4.4.0.tar.bz2

	2. In one terminal, extract 4.4.0 tarball in a loop
$ while true; do tar -zxvf qt-x11-opensource-src-4.4.0.tar.bz2; done

	3. In another terminal, time how long it takes to extract
	   the 3.3.8 distfile:
$ time tar -zxvf qt-x11-free-3.3.8.tar.bz2

	4. Results with and without disk cache flush (patch from simonb):

flush_disk_cache=0:   18.85s real    10.12s user     2.55s system
flush_disk_cache=1:  662.49s real    10.35s user     3.13s system

>Fix:


>Release-Note:

>Audit-Trail:
From: "Jared D. McNeill" <jmcneill@invisible.ca>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, 
 netbsd-bugs@netbsd.org
Subject: Re: kern/39564: wapbl performance issues with disk cache flushing
Date: Tue, 16 Sep 2008 12:38:09 -0400

 As requested offline, here are the disks I'm using:

 wd0 at atabus2 drive 0: <WDC WD800JB-00CRA1> (@ viaide)
 wd0 at atabus1 drive 0: <HITACHI HTS541612J9SA00> (@ ahcisata)

Responsible-Changed-From-To: kern-bug-people->simonb
Responsible-Changed-By: jmcneill@NetBSD.org
Responsible-Changed-When: Tue, 16 Sep 2008 16:44:37 +0000
Responsible-Changed-Why:
Simon volunteered.


From: Havard Eidnes <he@uninett.no>
To: jmcneill@invisible.ca
Cc: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/39564: wapbl performance issues with disk cache flushing
Date: Sun, 12 Oct 2008 19:47:52 +0200 (CEST)

 Hi,

 I have some additional data points and comments related to this
 issue.  I'm testing by doing parallel builds on a HP ProLiant
 460c G1 blade, which is equipped with 16GB memory, dual quad-
 core Xeon processors at 2.83GHz, and with an on-board ciss(4)
 controller with a single file system (root, /usr, /var etc.).

 I'm testing by doing "build.sh -j 16" on this system.  The
 different variants and the result of "time" from csh were as
 follows:

 First a few attempts without simon's patch to enable one to turn
 off cache flushing via sysctl:

 1) Plain "build.sh -j 16":

    6515.491u 2108.338s 2:19:19.68 103.1%   0+0k 3+688241io 32279pf+0w

    (yes, that's almost 2 hours and 20 minutes(!))

 2) As above, but with a "time sh -c 'while true; do sync; sleep 1; done=
 '
    loop running in parallel with the build:

    5777.841u 2284.086s 21:59.48 610.9%     0+0k 627+44423io 32323pf+0w

    while the sync loop got

    0.878u 62.666s 24:11.92 4.3%  0+0k 0+48589io 0pf+0w

 3) Just for comparison, a "plain" build.sh without -j specified:

    5312.852u 970.429s 2:52:58.37 60.5%     0+0k 103+190425io 32351pf+0w=


 Some observations / comments:

  o Even when taking the i/o's triggered by the sync loop in 2)
    into account, the amount of IO under 1) is massively larger
    than in 2), with a factor of about 7.4(!)

  o Running "top" and "systat vm" in two windows while the build
    in 1) is running shows that in periods of low cpu utilization,
    lots of processes appear "stuck" in "tstile/<n>" state.

  o The "tstile/<n>" status is kind of non-descript, it basically just
    says "I'm waiting on some unspecified kernel-internal resource".
    Pinpointing exactly which resource there is contention for cannot
    easily be done by using these normal system state monitor tools.
    Some other developers hinted that it would be better if the
    resource was described by the wait channel string.  It appears that
    to do this, the locking primitives which end up calling
    turnstile_block() would need to pass down some informative string
    which better describes the resource locked instead of using a
    constant "tstile" string as is done in the turnstile_block()
    function when calling sleepq_enqueue().

 So, in the next round I installed simon's patch and turned off wapbl
 disk cache flushing, and repeated the tests 1) and 2) above.  Contrary
 to what one might beleive, these actually revealed a difference(!):

 4) A standard "build.sh -j 16"

    5987.961u 2149.343s 28:30.74 475.6%     0+0k 11678+562510io 56375pf+=
 0w

 5) as in 4), but with the sync shell loop:

    5746.054u 2302.403s 21:08.90 634.2%     0+0k 1+46619io 32331pf+0w

    and for the sync loop itself:

    0.774u 52.742s 21:21.80 4.1%  0+0k 0+49910io 0pf+0w

 Observations / comments:

  o While the difference in wall-clock time is much less than between
    1) and 2) above, the difference in the amount of IO is still very
    significant, with a factor around 5.8(!)

  o Observing the "top" display in times of ebbs in the CPU
    utilization, I still see many processes in "tstile/<n>" state.

 I've repeated this experiment on a non-wapbl file system on the same
 system, but there the difference was in the "sync loop" disfavour:

 5799.240u 2232.222s 24:57.93 536.1%     0+0k 494+710112io 32612pf+0w
 vs.
 5881.837u 2435.640s 27:14.19 508.9%     0+0k 0+720501io 32646pf+0w

 and the number of IOs is comparable.

 So...  It seems to me that something or other is causing the wapbl
 file systems to experience massive lock contention with parallel
 activity on the file system.  Also, this seems to indicate that the
 patch to make it possible to turn off disk cache flushing with wapbl
 via sysctl may just paper over an underlying issue.  However, due to
 the non-descript nature of the "tstile" status, it's difficult to give
 any further hints as to which resource is actually contested.

 Best regards,

 - H=E5vard

From: Havard Eidnes <he@uninett.no>
To: jmcneill@invisible.ca
Cc: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/39564: wapbl performance issues with disk cache flushing
Date: Mon, 27 Oct 2008 00:52:59 +0100 (CET)

 Hi,

 here's some more data on this case.  I increased the amount of
 buffering in lockstat.c to be able to do a lockstat over an
 entire build (increased LOCKSTAT_MAXBUFS tenfold, and bumped
 LOCKSTAT_HASH_SIZE by 8), and with vfs.wapbl.flush_disk_cache=3D0,
 I repeated these two tests:

 4) A standard "build.sh -j 16":

    6001.172u 2184.995s 29:04.68 469.2%     0+0k 12721+559454io 56360pf+=
 0w

 5) as in 4), but with the sync shell loop, 1 second delay:

    5751.959u 2341.618s 21:22.68 630.9%     0+0k 8+45947io 32327pf+0w

 The corresponding lockstat files are, respectively

    ftp://quattro.urc.uninett.no/pub/NetBSD/lockstat.25
 and
    ftp://quattro.urc.uninett.no/pub/NetBSD/lockstat.26

 I'm no expert in interpreting the lockstat output, but perhaps
 someone with more experience in the kernel innards can dig out
 some more useful information from this.

 Best regards,

 - H=E5vard

Responsible-Changed-From-To: simonb->joerg
Responsible-Changed-By: joerg@NetBSD.org
Responsible-Changed-When: Tue, 18 Nov 2008 23:47:57 +0000
Responsible-Changed-Why:
Related to the discussion
http://mail-index.netbsd.org/current-users/2008/10/31/msg005580.html


Responsible-Changed-From-To: joerg->ad
Responsible-Changed-By: ad@NetBSD.org
Responsible-Changed-When: Sat, 21 Feb 2009 20:02:16 +0000
Responsible-Changed-Why:
i have a fix for this


State-Changed-From-To: open->closed
State-Changed-By: ad@NetBSD.org
State-Changed-When: Sun, 22 Feb 2009 20:12:17 +0000
State-Changed-Why:
fixed
we should batch transactions and allow new generations of transactions
to run in parallel with flushes, but that would be a new feature.


From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/39564 CVS commit: src/sys
Date: Sun, 22 Feb 2009 20:10:25 +0000 (UTC)

 Module Name:	src
 Committed By:	ad
 Date:		Sun Feb 22 20:10:25 UTC 2009

 Modified Files:
 	src/sys/kern: vfs_wapbl.c
 	src/sys/miscfs/syncfs: sync_subr.c sync_vnops.c
 	src/sys/ufs/ffs: ffs_alloc.c ffs_vfsops.c ffs_vnops.c

 Log Message:
 PR kern/39564 wapbl performance issues with disk cache flushing
 PR kern/40361 WAPBL locking panic in -current
 PR kern/40361 WAPBL locking panic in -current
 PR kern/40470 WAPBL corrupts ext2fs
 PR kern/40562 busy loop in ffs_sync when unmounting a file system
 PR kern/40525 panic: ffs_valloc: dup alloc

 - A fix for an issue that can lead to "ffs_valloc: dup" due to dirty cg
   buffers being invalidated. Problem discovered and patch by dholland@.

 - If the syncer fails to lazily sync a vnode due to lock contention,
   retry 1 second later instead of 30 seconds later.

 - Flush inode atime updates every ~10 seconds (this makes most sense with
   logging). Presently they didn't hit the disk for read-only files or
   devices until the file system was unmounted. It would be better to trickle
   the updates out but that would require more extensive changes.

 - Fix issues with file system corruption, busy looping and other nasty
   problems when logging and non-logging file systems are intermixed,
   with one being the root file system.

 - For logging, do not flush metadata on an inode-at-a-time basis if the sync
   has been requested by ioflush. Previously, we could try hundreds of log
   sync operations a second due to inode update activity, causing the syncer
   to fall behind and metadata updates to be serialized across the entire
   file system. Instead, burst out metadata and log flushes at a minimum
   interval of every 10 seconds on an active file system (happens more often
   if the log becomes full). Note this does not change the operation of
   fsync() etc.

 - With the flush issue fixed, re-enable concurrent metadata updates in
   vfs_wapbl.c.


 To generate a diff of this commit:
 cvs rdiff -r1.22 -r1.23 src/sys/kern/vfs_wapbl.c
 cvs rdiff -r1.35 -r1.36 src/sys/miscfs/syncfs/sync_subr.c
 cvs rdiff -r1.25 -r1.26 src/sys/miscfs/syncfs/sync_vnops.c
 cvs rdiff -r1.120 -r1.121 src/sys/ufs/ffs/ffs_alloc.c
 cvs rdiff -r1.241 -r1.242 src/sys/ufs/ffs/ffs_vfsops.c
 cvs rdiff -r1.109 -r1.110 src/sys/ufs/ffs/ffs_vnops.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Soren Jacobsen <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/39564 CVS commit: [netbsd-5] src/sys
Date: Tue, 24 Feb 2009 04:13:35 +0000 (UTC)

 Module Name:	src
 Committed By:	snj
 Date:		Tue Feb 24 04:13:35 UTC 2009

 Modified Files:
 	src/sys/kern [netbsd-5]: vfs_wapbl.c
 	src/sys/miscfs/syncfs [netbsd-5]: sync_subr.c sync_vnops.c
 	src/sys/ufs/ffs [netbsd-5]: ffs_alloc.c ffs_vfsops.c ffs_vnops.c

 Log Message:
 Pull up following revision(s) (requested by ad in ticket #490):
 	sys/kern/vfs_wapbl.c: revision 1.23
 	sys/miscfs/syncfs/sync_subr.c: revision 1.36
 	sys/miscfs/syncfs/sync_vnops.c: revision 1.26
 	sys/ufs/ffs/ffs_alloc.c: revision 1.121
 	sys/ufs/ffs/ffs_vfsops.c: revision 1.242
 	sys/ufs/ffs/ffs_vnops.c: revision 1.110
 PR kern/39564 wapbl performance issues with disk cache flushing
 PR kern/40361 WAPBL locking panic in -current
 PR kern/40361 WAPBL locking panic in -current
 PR kern/40470 WAPBL corrupts ext2fs
 PR kern/40562 busy loop in ffs_sync when unmounting a file system
 PR kern/40525 panic: ffs_valloc: dup alloc
 - A fix for an issue that can lead to "ffs_valloc: dup" due to dirty cg
   buffers being invalidated. Problem discovered and patch by dholland@.
 - If the syncer fails to lazily sync a vnode due to lock contention,
   retry 1 second later instead of 30 seconds later.
 - Flush inode atime updates every ~10 seconds (this makes most sense with
   logging). Presently they didn't hit the disk for read-only files or
   devices until the file system was unmounted. It would be better to trickle
   the updates out but that would require more extensive changes.
 - Fix issues with file system corruption, busy looping and other nasty
   problems when logging and non-logging file systems are intermixed,
   with one being the root file system.
 - For logging, do not flush metadata on an inode-at-a-time basis if the sync
   has been requested by ioflush. Previously, we could try hundreds of log
   sync operations a second due to inode update activity, causing the syncer
   to fall behind and metadata updates to be serialized across the entire
   file system. Instead, burst out metadata and log flushes at a minimum
   interval of every 10 seconds on an active file system (happens more often
   if the log becomes full). Note this does not change the operation of
   fsync() etc.
 - With the flush issue fixed, re-enable concurrent metadata updates in
   vfs_wapbl.c.


 To generate a diff of this commit:
 cvs rdiff -r1.3 -r1.3.8.1 src/sys/kern/vfs_wapbl.c
 cvs rdiff -r1.34 -r1.34.20.1 src/sys/miscfs/syncfs/sync_subr.c
 cvs rdiff -r1.25 -r1.25.10.1 src/sys/miscfs/syncfs/sync_vnops.c
 cvs rdiff -r1.113 -r1.113.4.1 src/sys/ufs/ffs/ffs_alloc.c
 cvs rdiff -r1.239 -r1.239.2.1 src/sys/ufs/ffs/ffs_vfsops.c
 cvs rdiff -r1.104.4.5 -r1.104.4.6 src/sys/ufs/ffs/ffs_vnops.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.