NetBSD Problem Report #35187

From www@NetBSD.org  Tue Dec  5 06:24:38 2006
Return-Path: <www@NetBSD.org>
Received: by narn.NetBSD.org (Postfix, from userid 31301)
	id B868E63BA4D; Tue,  5 Dec 2006 06:24:38 +0000 (UTC)
Message-Id: <20061205062438.B868E63BA4D@narn.NetBSD.org>
Date: Tue,  5 Dec 2006 06:24:38 +0000 (UTC)
From: blair.sadewitz@gmail.com
Reply-To: blair.sadewitz@gmail.com
To: gnats-bugs@NetBSD.org
Subject: Certain file operations--such as chown--take inordinate amounts of time on LFS
X-Send-Pr-Version: www-1.0

>Number:         35187
>Category:       kern
>Synopsis:       Certain file operations--such as chown--take inordinate amounts of time on LFS
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Dec 05 06:25:00 +0000 2006
>Last-Modified:  Tue Jan 02 21:05:01 +0000 2007
>Originator:     Blair Sadewitz
>Release:        4.99.5 as of Dec 3-4
>Organization:
Organization?  Maybe next year ...
>Environment:
NetBSD amd64 4.99.5
>Description:
root    1128 98.0  0.1   172 1088 ttyE1 D     1:00AM 13:19.82 chown -R blair pkgsrc

This process has been running for virtually the same amount of time as the CPU time column indicates.  All I'm doing is chown -R blair:wsrc /usr/pkgsrc.  On ffs, this would be of no consequence.

I find LFS advantageous for a number of reasons, but I'm hesitant to use it if things such as this, the tail end of a pax/tar job, chmod, etc. take this much CPU time.  I have a 3GHz Pentium D, and if it weren't for the other core, my machine would be virtually unusable, at least interactively.

Yay, it finally finished 20 minutes later! ;)

What is going on here?  I have tried setting vfs.lfs.pagetrip as instructed, as well as different buffer-queue strategies, etc., all to no avail.
Obviously, this problem is beyond my comprehension.  I can't be the only one annoyed by this, especially considering that LFS shines in situations (such as large RAIDframe arrays) where one may want to perform operations like this on 2TB of files.
>How-To-Repeat:
Perform filesystem metadata operations on many files in an LFS filesystem.
>Fix:
I wish I knew; if I ever meet the person who fixes this, they can have a few beers on me. ;)

Please ignore the last incomplete PR I filed; that happened because my keyboard response was lagging due to the aforementioned chown process.

>Audit-Trail:
From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/35187 (Certain file operations--such as chown--take inordinate amounts of time on LFS)
Date: Mon, 1 Jan 2007 22:37:17 +0000

 I have been seeing this too. Basic setup:

 Filesystem  1K-blocks      Used     Avail Capacity  iused    ifree  %iused  Mounted on
 /dev/ld0e    14535448   3111227   9969928    23%   342102 97543026     0%   /local

 maxvnodes ~= 128k

 I did a bit of profiling, and it seems that the problem is that
 lfs_writevnodes() spends an unreasonable amount of time traversing the
 vnode list for the mount, much of it in lfs_vref() -> vget().
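
 To make it concrete, the scan has roughly the shape sketched below. This
 is a hand-written simplification, not the in-tree source; the list type,
 the VNODE_IS_DIRTY() predicate and the exact call signatures are guesses,
 and the real function also handles dirops, locking and error cases. The
 fragment assumes the locals of lfs_writevnodes() (fs, mp, sp, vp):

 	/*
 	 * Illustration only: every segment write walks the entire
 	 * per-mount vnode list, visiting clean and dirty vnodes alike.
 	 */
 	TAILQ_FOREACH(vp, &mp->mnt_vnodelist, v_mntvnodes) {
 		if (lfs_vref(vp) != 0)		/* vget() under the hood */
 			continue;		/* busy or being reclaimed */
 		if (!VNODE_IS_DIRTY(vp)) {	/* hypothetical predicate */
 			lfs_vunref(vp);
 			continue;		/* nothing to write */
 		}
 		lfs_writefile(fs, sp, vp);	/* flush the file's dirty pages */
 		lfs_writeinode(fs, sp, VTOI(vp)); /* and its inode */
 		lfs_vunref(vp);
 	}

 With maxvnodes around 128k, each call can visit tens of thousands of
 vnodes; the profile below shows roughly 175 million lfs_vref() calls for
 only 3948 calls to lfs_writevnodes(), about 45,000 per call.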

 Output from gprof is below. The numbers for Xspllower() and kernel_lock()
 are a result of the activity in lfs_writevnodes().

   %   cumulative   self              self     total
  time   seconds   seconds    calls  ms/call  ms/call  name
  29.29    162.03   162.03    19092     8.49     8.49  Xspllower
  24.39    296.94   134.91  1122369     0.12     0.23  _kernel_lock
  23.68    427.91   130.97     3948    33.17    55.33  lfs_writevnodes
   7.29    468.25    40.34 176313571     0.00     0.00  vget
   4.43    492.74    24.49 175754664     0.00     0.00  lfs_vref
   4.06    515.21    22.47 175750428     0.00     0.00  lfs_vunref
   2.01    526.34    11.13                             mpidle

 index % time    self  children    called     name

                                  550             lfs_vflush <cycle 3> [740]
                                 3398             lfs_segwrite <cycle 3> [61]
 [5]     39.5  130.97   87.48    3948         lfs_writevnodes <cycle 3> [5]
                24.49   40.52 175745415/175754664     lfs_vref [10]
                22.47    0.00 175743764/175750428     lfs_vunref [23]
                               223127             lfs_writefile <cycle 3> [267]
                               223093             lfs_writeinode <cycle 3> [111]
                                   12             lfs_writeseg <cycle 3> [211]

 index % time    self  children    called     name

                 0.00    0.00    3131/175754664     lfs_mark_vnode [436]
                 0.00    0.00    6118/175754664     lfs_reserve [539]
                24.49   40.52 175745415/175754664     lfs_writevnodes <cycle 3> [5]
 [10]    11.8   24.49   40.52 175754664         lfs_vref [10]
                40.21    0.31 175761681/176313571     vget [12]

 index % time    self  children    called     name

                 0.00    0.00      16/176313571     checkalias [279]
                 0.00    0.00    4145/176313571     ffs_sync <cycle 3> [583]
                 0.00    0.00    4177/176313571     getcwd_common [114]
                 0.01    0.00   46692/176313571     cache_lookup [112]
                 0.11    0.00  496860/176313571     ufs_ihashget [125]
                40.21    0.31 175761681/176313571     lfs_vref [10]
 [12]     7.3   40.34    0.31 176313571         vget [12]
                 0.07    0.24  505279/1809247     vn_lock [69]

 To repeat this:

 1. Associate a large number of vnodes with the filesystem (untarring
    pkgsrc.tar.gz and checking out src/sys, for example).
 2. Start a few "MAKEDEV all" in the background.
 3. Start a few "chown -R ..." and "chmod -R ..." runs in the background.
 4. The number of disk xfers resulting from LFS activity drops to around
    10/s, and the system becomes unusable:

 CPU0 states:  0.0% user,  0.0% nice,  100% system,  0.0% interrupt,  0.0% idle
 CPU1 states:  0.0% user,  0.0% nice,  100% system,  0.0% interrupt,  0.0% idle

 To fix, maybe alter lfs_writevnodes() so that it only has to scan the
 subset of vnodes which are dirty or potentially dirty. I'm not sure.
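
 One way to picture that: give each LFS mount a list that a vnode joins
 when its inode first becomes dirty and leaves once it has been written
 out, and have lfs_writevnodes() walk that list instead of mnt_vnodelist.
 The sketch below is only an illustration of the idea; lfs_dirtylist,
 i_dirtylist and IN_ON_DIRTYLIST are invented names, and it ignores the
 locking and dirop ordering the real code would need:

 	/*
 	 * Hypothetical additions (all names invented for illustration):
 	 *   struct lfs:   TAILQ_HEAD(, inode) lfs_dirtylist;
 	 *   struct inode: TAILQ_ENTRY(inode) i_dirtylist;
 	 *   i_flag bit:   IN_ON_DIRTYLIST
 	 */

 	/* Called from wherever an inode first becomes dirty. */
 	void
 	lfs_mark_dirty(struct lfs *fs, struct inode *ip)
 	{
 		if ((ip->i_flag & IN_ON_DIRTYLIST) == 0) {
 			ip->i_flag |= IN_ON_DIRTYLIST;
 			TAILQ_INSERT_TAIL(&fs->lfs_dirtylist, ip, i_dirtylist);
 		}
 	}

 	/* In lfs_writevnodes(): scan only the vnodes known to be dirty. */
 	struct inode *ip, *nip;
 	struct vnode *vp;

 	TAILQ_FOREACH_SAFE(ip, &fs->lfs_dirtylist, i_dirtylist, nip) {
 		vp = ITOV(ip);
 		if (lfs_vref(vp) != 0)
 			continue;	/* busy; catch it on the next pass */
 		lfs_writefile(fs, sp, vp);
 		lfs_writeinode(fs, sp, ip);
 		TAILQ_REMOVE(&fs->lfs_dirtylist, ip, i_dirtylist);
 		ip->i_flag &= ~IN_ON_DIRTYLIST;
 		lfs_vunref(vp);
 	}

 The cost of a segment write would then scale with the number of dirty
 vnodes rather than with kern.maxvnodes.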

From: "Blair Sadewitz" <blair.sadewitz@gmail.com>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org
Subject: Re: kern/35187 (Certain file operations--such as chown--take inordinate amounts of time on LFS)
Date: Tue, 2 Jan 2007 16:00:47 -0500

 There is a simple way to mitigate this to some extent (it seems to help
 with chmod/chown, but not with the queue backup when untarring
 something), though I doubt it is safe, if only because it is purposely
 disabled in the code:

 At the beginning of lfs_writevnodes() in lfs_segment.c, change the #if 0
 to #if 1 so that the loop starts with the newest instead of the oldest
 vnode.
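
 Conceptually that #if just flips the direction in which the loop walks
 the per-mount vnode list, something like the fragment below. I'm
 paraphrasing, so the list type, the macros and which end of the list is
 "newest" are guesses rather than the actual code:

 	#if 1	/* was 0: start from the newest vnodes on the mount */
 		TAILQ_FOREACH_REVERSE(vp, &mp->mnt_vnodelist, vnodelst,
 		    v_mntvnodes)
 	#else	/* default: start from the oldest */
 		TAILQ_FOREACH(vp, &mp->mnt_vnodelist, v_mntvnodes)
 	#endif
 		{
 			/* ... lfs_vref(), write if dirty, lfs_vunref() ... */
 		}

 I assume it helps because recently created or modified vnodes tend to
 sit at that end of the list, so the scan reaches dirty work before it
 has waded through the long tail of clean vnodes.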

 Unfortunately, I haven't thought of any better way to do this yet; most
 likely doing so is beyond my ability. :(

 --Blair

 -- 
 Support WFMU-FM: free-form radio for the masses!

 <http://www.wfmu.org/>
 91.1 FM Jersey City, NJ
 90.1 FM Mt. Hope, NY

 "The Reggae Schoolroom":
 <http://www.wfmu.org/playlists/RS/>
