NetBSD Problem Report #39016
From apb@cequrux.com Sun Jun 22 09:29:56 2008
Return-Path: <apb@cequrux.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by narn.NetBSD.org (Postfix) with ESMTP id E066463B89F
for <gnats-bugs@gnats.NetBSD.org>; Sun, 22 Jun 2008 09:29:55 +0000 (UTC)
Message-Id: <20080622092844.0F3DBE9307C@apb-laptoy.apb.alt.za>
Date: Sun, 22 Jun 2008 09:28:44 +0000 (UTC)
From: apb@cequrux.com
Reply-To: apb@cequrux.com
To: gnats-bugs@gnats.NetBSD.org
Subject: processes stuck in "tstile"
X-Send-Pr-Version: 3.95
>Number: 39016
>Category: kern
>Synopsis: processes stuck in "tstile"
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: joerg
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sun Jun 22 09:30:00 +0000 2008
>Closed-Date: Mon May 03 04:16:27 +0000 2010
>Last-Modified: Mon May 03 04:16:27 +0000 2010
>Originator: Alan Barrett
>Release: NetBSD 4.99.63
>Organization:
Not much
>Environment:
System: NetBSD 4.99.63
Architecture: i386
Machine: i386
>Description:
The system sometimes gets into a state such that all processes that
attempt to perform disk I/O get stuck waiting for "tstile".
>How-To-Repeat:
Just use the system for a while.
For example, most recently, I was running X, a mail reader,
systat vm, a few idle shells, "cp -RPp" copying from one
external USB disk to another, a "make" job using rsync to copy
from an external machine to the local machine, and a recursive
"chown". I noticed that the window manager's clock had stopped
updating. Upon further investigation, I noticed:
* The window manager's clock display had frozen;
* The external USB disks were still busy;
* The systat display had frozen;
* Pressing enter in an idle shell still worked;
* Attempting to run any new command hung forever;
* The "make" job that was synchronising from an external machine
had printed a message saying that it was performing a "rm" command
to delete a lock file, but then appeared to be stuck;
* Pressing alt-control-F1 to get to the primary console worked;
* Pressing alt-control-escape to get into DDB worked;
* Inside ddb, the "ps" and "ps /w" commands indicated that many
processes were waiting for "tstile";
* One of the stuck processes was "rm", which I presume was the
command invoked from the make job mentioned above;
* The ddb "c" command worked;
* At the login prompt on the primary console, typing a username and
pressing enter caused the "login" process to get stuck waiting
for "tstile" (or perhaps it was the "getty" process that was stuck,
I am not certain);
* The ddb "sync" command caused the machine to freeze, requiring
a power cycle.
Prior to this event, I had noticed from the "systat bufcache"
display that there was no free memory; most memory was in use by
cached file system data.
>Fix:
Unknown.
>Release-Note:
>Audit-Trail:
State-Changed-From-To: open->analyzed
State-Changed-By: ad@NetBSD.org
State-Changed-When: Thu, 25 Sep 2008 10:14:48 +0000
State-Changed-Why:
Unfortunately more information is required to come up with a useful
diagnosis. We need backtraces from the stuck threads using 't/a' in DDB:
those stuck in tstile or anything else interesting, like biowait.
Alternatively a mini crash dump and a copy of the kernel image should be
enough to diagnose the issue ('call dumpsys' from ddb).
Also, what is the file system configuration on this system?
From: Alan Barrett <apb@cequrux.com>
To: gnats-bugs@NetBSD.org
Cc: netbsd-bugs@NetBSD.org
Subject: Re: kern/39016 (processes stuck in "tstile")
Date: Fri, 26 Sep 2008 17:36:43 +0200
It's been a long time since I saw the problem, but...
On Thu, 25 Sep 2008, ad@NetBSD.org wrote:
> Unfortunately more information is required to come up with a useful
> diagnosis. We need backtraces from the stuck threads using 't/a' in DDB:
> those stuck in tstile or anything else interesting, like biowait.
> Alternatively a mini crash dump and a copy of the kernel image should be
> enough to diagnose the issue ('call dumpsys' from ddb).
"call dumpsys", OK, I have made a note of that.
> Also, what is the file system configuration on this system?
small root ffs on wd0a. large cgd on wd0e. ffs inside the cgd is
mounted on "/cgd1a". several nullfs mounts make several top level
directories inside the cgd appear like top level directories inside
the root (e.g. for dir in home usr var blah blah; do mount -t null
/cgd1a/${dir} /${dir}; done).
--apb (Alan Barrett)
From: Paul Ripke <stix@stix.id.au>
To: NetBSD gnats-bugs <gnats-bugs@NetBSD.org>
Cc: ad@NetBSD.org
Subject: Re: kern/39016 (processes stuck in "tstile")
Date: Sat, 27 Sep 2008 10:58:42 +1000
I've just tripped over this, too. I had a build system running NetBSD
4.0 x86 on a quad Intel, with softdep filesystems. I "upgraded" to
amd64 current (20080926T0002 UTC) with filesystems mounted with "log".
Previously, release build (-j 8) of 5 archs on this system would take
about 3.5 hrs. After the upgrade, 12+ hrs. I switched the build
filesystem to 'async' (I like living on the edge...) and it took the
normal time.
Watching processes, there are many in 'tstile' and the disk with the
build fs stays almost permanently ~5% busy.
I took a full crash dump, but I can't figure how to get arbitrary
backtraces from LWPs. http://www.netbsd.org/docs/kernel/ says to
use 'proc' in gdb, but that just runs the user-defined procs function
from src/sys/gdbscripts/procs. If I get a chance, I'll get some from
ddb.
I also noticed that crash dumping didn't appear to want to work
with AHCI enabled - maybe it was just bad timing.
FYI: current mounts are:
/dev/wd0a on / type ffs (log, local)
/dev/wd0f on /var type ffs (log, local)
/dev/wd0e on /usr type ffs (log, local)
tmpfs on /tmp type tmpfs (local)
/dev/wd1a on /u type ffs (log, local) <-- build fs
kernfs on /kern type kernfs (local)
zion:/export on /export type nfs
zion:/usr/local on /usr/local type nfs
zion:/usr/pkg on /usr/pkg type nfs
zion:/usr/pkg64 on /usr/pkg64 type nfs
--
Paul Ripke
From: Paul Ripke <stix@stix.id.au>
To: NetBSD gnats-bugs <gnats-bugs@NetBSD.org>
Cc: ad@NetBSD.org
Subject: Re: kern/39016 (processes stuck in "tstile")
Date: Sat, 27 Sep 2008 15:58:48 +1000
OK, just grabbed some backtraces from ddb. All the following processes
were in 'tstile' wchan. I didn't see any other processes in interesting
wchans (all others were wait, select, etc). I've also grabbed a crash
dump from this state.
Hand typed from ddb:
pid: 28901 (collect2)
sleepq_block()
turnstile_block()
rw_vector_enter()
wapbl_begin()
ffs_full_fsync()
ffs_fsync()
VOP_FSYNC()
vinvalbuf()
vclean()
getcleanvnode()
getnewvnode()
tmpfs_alloc_vp()
tmpfs_alloc_file()
VOP_CREATE()
vn_open()
sys_open()
syscall()
pid: 21586 (sh)
sleepq_block()
turnstile_block()
rw_vector_enter()
wapbl_begin()
ufs_makeinode()
ufs_create()
VOP_CREATE()
vn_open()
sys_open()
syscall()
pid: 29705 (vax--netbsdelf-r...)
sleepq_block()
turnstile_block()
rw_vector_enter()
wapbl_begin()
ffs_write()
VOP_WRITE()
vn_write()
dofilewrite()
sys_write()
syscall()
pid: 266 (vax--netbsdelf-o...)
sleepq_block()
turnstile_block()
rw_vector_enter()
wapbl_begin()
ffs_write()
VOP_WRITE()
vn_write()
dofilewrite()
sys_write()
syscall()
pid: 26891 (as)
sleepq_block()
turnstile_block()
rw_vector_enter()
wapbl_begin()
ffs_write()
VOP_WRITE()
vn_write()
dofilewrite()
sys_write()
syscall()
pid: 8763 (ld)
sleepq_block()
turnstile_block()
rw_vector_enter()
wapbl_begin()
ffs_write()
VOP_WRITE()
vn_write()
dofilewrite()
sys_write()
syscall()
pid: 5184 (nbmkdep)
sleepq_block()
turnstile_block()
rw_vector_enter()
wapbl_begin()
ffs_write()
VOP_WRITE()
vn_write()
dofilewrite()
sys_write()
syscall()
pid: 16502 (nbmkdep)
sleepq_block()
turnstile_block()
rw_vector_enter()
wapbl_begin()
ffs_write()
VOP_WRITE()
vn_write()
dofilewrite()
sys_write()
syscall()
I looked around in DDB a few times - this was the only occasion that I
saw ffs_fsync, which had me worried the first time I saw it. Anyway,
this definitely looks WAPBL related. BTW: this is a stock GENERIC
kernel.
--
Paul Ripke
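The backtraces above can be summarised mechanically. The following is a minimal Python sketch, not part of the original report, that groups hand-typed ddb backtraces by the function that called rw_vector_enter(), i.e. the code path waiting for the lock. The input layout (a "pid: N (name)" line followed by one frame per line, innermost frame first) is an assumption based on the message above, and the helper names parse_traces and lock_callers are hypothetical.

```python
import re
from collections import Counter

# Two of the backtraces from the message above, truncated after a few frames.
SAMPLE = """\
pid: 28901 (collect2)
sleepq_block()
turnstile_block()
rw_vector_enter()
wapbl_begin()
ffs_full_fsync()
ffs_fsync()
pid: 21586 (sh)
sleepq_block()
turnstile_block()
rw_vector_enter()
wapbl_begin()
ufs_makeinode()
ufs_create()
"""

PID_LINE = re.compile(r"^pid:\s*(\d+)\s*\((.*)\)\s*$")


def parse_traces(text):
    """Return a list of (pid, name, frames), frames innermost first."""
    traces = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = PID_LINE.match(line)
        if m:
            traces.append((int(m.group(1)), m.group(2), []))
        elif traces:
            # Drop a trailing "()" so "VOP_FSYNC" and "VOP_FSYNC()" compare equal.
            frame = line[:-2] if line.endswith("()") else line
            traces[-1][2].append(frame)
    return traces


def lock_callers(traces, lock_fn="rw_vector_enter"):
    """Count the caller of lock_fn (the next frame in the list) per trace."""
    counts = Counter()
    for _pid, _name, frames in traces:
        if lock_fn in frames:
            idx = frames.index(lock_fn)
            if idx + 1 < len(frames):
                counts[frames[idx + 1]] += 1
    return counts


if __name__ == "__main__":
    for caller, n in lock_callers(parse_traces(SAMPLE)).most_common():
        print(f"{n:3d}  {caller}")
```

Fed the full set of eight traces, a tally like this makes it easy to check whether every blocked process reached the lock through the same function.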
From: "S.P.Zeidler" <spz@serpens.de>
To: gnats-bugs@gnats.netbsd.org
Cc:
Subject: Re: kern/39016
Date: Fri, 24 Oct 2008 22:56:12 +0200
some debug info from ftp:
db{0}> ps /l
PID   LID S FLAGS STRUCT LWP *     NAME         WAIT
4551  1   3 84    ffff80007ad97040 ftpd         netio
3393  1   3 84    ffff80005912b000 ftpd         netio
25674 1   3 84    ffff800048147020 tcsh         ttyraw
18784 1   3 84    ffff80004d453040 ftpd         netio
15618 1   3 84    ffff80004d453420 csh          ttyraw
9666  1   3 84    ffff80004d453800 screen-4.0.3 select
27313 1   3 84    ffff80004d453be0 screen-4.0.3 pause
10939 1   3 4     ffff80005ce5abc0 ftpd         tstile
12258 1   3 84    ffff800048147bc0 csh          pause
13562 1   3 84    ffff80007c7a0400 sshd         select
26015 1   3 84    ffff80006ff5a400 sshd         netio
2159  1   3 4     ffff80007c7a0bc0 ftpd         tstile
22748 1   3 84    ffff800070d58be0 ftpd         netio
14391 1   3 4     ffff800070d58800 ftpd         tstile
13340 1   3 4     ffff8000601bf000 ftpd         tstile
24232 1   3 4     ffff8000601bf3e0 ftpd         tstile
4033  1   3 4     ffff8000601bf7c0 ftpd         tstile
23843 1   3 4     ffff8000698f27c0 ftpd         tstile
13403 1   3 4     ffff8000601bfba0 ftpd         tstile
13947 1   3 4     ffff8000698f2000 ftpd         tstile
11806 1   3 4     ffff800070d58420 ftpd         tstile
db{0}> t/a ffff80005ce5abc0
trace: pid 10939 lid 1 at 0xffff800046d6c740
sleepq_block() at netbsd:sleepq_block+0x10d
turnstile_block() at netbsd:turnstile_block+0x2e1
rw_vector_enter() at netbsd:rw_vector_enter+0x28c
vlockmgr() at netbsd:vlockmgr+0xf6
layer_lock() at netbsd:layer_lock+0x40
VOP_LOCK() at netbsd:VOP_LOCK+0x64
vn_lock() at netbsd:vn_lock+0xd9
layerfs_root() at netbsd:layerfs_root+0x4d
VFS_ROOT() at netbsd:VFS_ROOT+0x2a
lookup() at netbsd:lookup+0x3c9
namei() at netbsd:namei+0x1a4
do_sys_stat() at netbsd:do_sys_stat+0x44
sys___stat30() at netbsd:sys___stat30+0x2d
syscall() at netbsd:syscall+0x9a
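To see at a glance how many LWPs are parked on each wait channel, a "ps /l" listing like the one above can be tallied. The following is a minimal Python sketch, not from the original report. It assumes the listing was captured as text (for example from a serial console log), that process names contain no spaces, and that the wait channel is the last whitespace-separated field on each row. The function name wait_channels is hypothetical.

```python
from collections import Counter

# A few rows copied from the "ps /l" listing above.
SAMPLE = """\
PID LID S FLAGS STRUCT LWP * NAME WAIT
4551 1 3 84 ffff80007ad97040 ftpd netio
9666 1 3 84 ffff80004d453800 screen-4.0.3 select
10939 1 3 4 ffff80005ce5abc0 ftpd tstile
2159 1 3 4 ffff80007c7a0bc0 ftpd tstile
"""


def wait_channels(text):
    """Tally the WAIT column and collect LWP addresses waiting in tstile."""
    counts = Counter()
    tstile_lwps = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric PID and have seven fields; this
        # skips the header, ddb prompts, and rows with an empty WAIT column.
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        wait = fields[-1]
        counts[wait] += 1
        if wait == "tstile":
            tstile_lwps.append(fields[4])  # the STRUCT LWP * column
    return counts, tstile_lwps


if __name__ == "__main__":
    counts, lwps = wait_channels(SAMPLE)
    for wait, n in counts.most_common():
        print(f"{n:3d}  {wait}")
    for addr in lwps:
        # "t/a <address>" is the ddb command used above to trace one LWP.
        print(f"t/a {addr}")
```

The second loop prints one "t/a" line per stuck LWP, which may save retyping addresses when collecting backtraces by hand.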
Responsible-Changed-From-To: kern-bug-people->joerg
Responsible-Changed-By: joerg@NetBSD.org
Responsible-Changed-When: Tue, 18 Nov 2008 23:47:30 +0000
Responsible-Changed-Why:
WAPBL issue related to the thread
http://mail-index.netbsd.org/current-users/2008/10/31/msg005580.html
State-Changed-From-To: analyzed->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Thu, 16 Jul 2009 06:36:18 +0000
State-Changed-Why:
Did this get fixed?
From: Alan Barrett <apb@cequrux.com>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/39016: processes stuck in "tstile"
Date: Tue, 20 Oct 2009 08:07:16 +0200
It's been a long time since I last noticed this problem. Perhaps
the PR can be closed.
--apb (Alan Barrett)
From: Joerg Sonnenberger <joerg@britannica.bec.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/39016: processes stuck in "tstile"
Date: Tue, 20 Oct 2009 13:17:02 +0200
On Tue, Oct 20, 2009 at 06:10:05AM +0000, Alan Barrett wrote:
> The following reply was made to PR kern/39016; it has been noted by GNATS.
>
> From: Alan Barrett <apb@cequrux.com>
> To: gnats-bugs@NetBSD.org
> Cc:
> Subject: Re: kern/39016: processes stuck in "tstile"
> Date: Tue, 20 Oct 2009 08:07:16 +0200
>
> It's been a long time since I last noticed this problem. Perhaps
> the PR can be closed.
I've still seen it a number of times at least with slightly older
kernels. Tron also said he could trigger it, but couldn't provide
details yet.
Joerg
State-Changed-From-To: feedback->closed
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Mon, 03 May 2010 04:16:27 +0000
State-Changed-Why:
This particular problem has been fixed. Other problems that lead to
"tstile syndrome" still exist, because "tstile syndrome" is any generic
deadlock.
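As a userland analogy only (it does not model the kernel's turnstiles or WAPBL), the following Python sketch shows why one lock that is never released makes unrelated work pile up behind it: every thread that needs the lock blocks, much as the processes in this report all appear in "tstile". All names are hypothetical, and timeouts are used so that the demonstration terminates instead of hanging.

```python
import threading


def holder(lock, acquired, release):
    """Take the lock and keep it until the demo says to let go."""
    with lock:
        acquired.set()
        release.wait()


def worker(lock, results, idx, timeout):
    """Try to take the lock; record whether it was obtained in time."""
    got = lock.acquire(timeout=timeout)
    results[idx] = got
    if got:
        lock.release()


def run_demo(n_workers=4, timeout=0.2):
    """Return one bool per worker: True if it got the lock, else False."""
    lock = threading.Lock()        # stands in for one shared kernel lock
    acquired = threading.Event()
    release = threading.Event()

    h = threading.Thread(target=holder, args=(lock, acquired, release))
    h.start()
    acquired.wait()                # the holder now owns the lock

    results = [None] * n_workers
    workers = [
        threading.Thread(target=worker, args=(lock, results, i, timeout))
        for i in range(n_workers)
    ]
    for t in workers:
        t.start()
    for t in workers:
        t.join()

    release.set()                  # un-wedge the holder so the program exits
    h.join()
    return results


if __name__ == "__main__":
    print(run_demo())
```

While the holder keeps the lock, no worker obtains it, so every entry in the returned list is False. In the kernel there is no timeout, so the waiters simply stay asleep.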
>Unformatted: