NetBSD Problem Report #39016

From apb@cequrux.com  Sun Jun 22 09:29:56 2008
Return-Path: <apb@cequrux.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id E066463B89F
	for <gnats-bugs@gnats.NetBSD.org>; Sun, 22 Jun 2008 09:29:55 +0000 (UTC)
Message-Id: <20080622092844.0F3DBE9307C@apb-laptoy.apb.alt.za>
Date: Sun, 22 Jun 2008 09:28:44 +0000 (UTC)
From: apb@cequrux.com
Reply-To: apb@cequrux.com
To: gnats-bugs@gnats.NetBSD.org
Subject: processes stuck in "tstile"
X-Send-Pr-Version: 3.95

>Number:         39016
>Category:       kern
>Synopsis:       processes stuck in "tstile"
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    joerg
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Jun 22 09:30:00 +0000 2008
>Closed-Date:    Mon May 03 04:16:27 +0000 2010
>Last-Modified:  Mon May 03 04:16:27 +0000 2010
>Originator:     Alan Barrett
>Release:        NetBSD 4.99.63
>Organization:
Not much
>Environment:
System: NetBSD 4.99.63
Architecture: i386
Machine: i386
>Description:
	The system sometimes gets into a state such that all processes that
	attempt to perform disk I/O get stuck waiting for "tstile".
>How-To-Repeat:
	Just use the system for a while.

        For example, most recently, I was running X, a mail reader,
        systat vm, a few idle shells, "cp -RPp" copying from one
        external USB disk to another, a "make" job using rsync to copy
        from an external machine to the local machine, and a recursive
        "chown".  I noticed that the window manager's clock had stopped
        updating.  Upon further investigation, I noticed:

	* The window manager's clock display had frozen;
	* The external USB disks were still busy;
	* The systat display had frozen;
	* Pressing enter in an idle shell still worked;
	* Attempting to run any new command hung forever;
	* The "make" job that was synchronising from an external machine
	  had printed a message saying that it was performing a "rm" command
	  to delete a lock file, but then appeared to be stuck;
	* Pressing alt-control-F1 to get to the primary console worked;
	* Pressing alt-control-escape to get into DDB worked;
	* Inside ddb, the "ps" and "ps /w" commands indicated that many
	  processes were waiting for "tstile";
	* One of the stuck processes was "rm", which I presume was the
	  command invoked from the make job mentioned above;
	* The ddb "c" command worked;
	* At the login prompt on the primary console, typing a username and
	  pressing enter caused the "login" process to get stuck waiting
	  for "tstile" (or perhaps it was the "getty" process that was stuck,
	  I am not certain);
	* the ddb "sync" command caused the machine to freeze, requiring
	  a power cycle.

        Prior to this event, I had noticed from the "systat bufcache"
        display that there was no free memory; most memory was in use by
        cached file system data.
>Fix:
	Unknown.

>Release-Note:

>Audit-Trail:

State-Changed-From-To: open->analyzed
State-Changed-By: ad@NetBSD.org
State-Changed-When: Thu, 25 Sep 2008 10:14:48 +0000
State-Changed-Why:
Unfortunately, more information is required to come up with a useful
diagnosis. We need backtraces from the stuck threads using 't/a' in DDB:
those stuck in tstile or anything else interesting, like biowait.
Alternatively a mini crash dump and a copy of the kernel image should be
enough to diagnose the issue ('call dumpsys' from ddb).

Also, what is the file system configuration on this system?


From: Alan Barrett <apb@cequrux.com>
To: gnats-bugs@NetBSD.org
Cc: netbsd-bugs@NetBSD.org
Subject: Re: kern/39016 (processes stuck in "tstile")
Date: Fri, 26 Sep 2008 17:36:43 +0200

 It's been a long time since I saw the problem, but...

 On Thu, 25 Sep 2008, ad@NetBSD.org wrote:
 > Unfortunately, more information is required to come up with a useful
 > diagnosis. We need backtraces from the stuck threads using 't/a' in DDB:
 > those stuck in tstile or anything else interesting, like biowait.
 > Alternatively a mini crash dump and a copy of the kernel image should be
 > enough to diagnose the issue ('call dumpsys' from ddb).

 "call dumpsys", OK, I have made a note of that.

 > Also, what is the file system configuration on this system?

 small root ffs on wd0a.  large cgd on wd0e.  ffs inside the cgd is
 mounted on "/cgd1a".  several nullfs mounts make several top level
 directories inside the cgd appear like top level directories inside
 the root (e.g. for dir in home usr var blah blah; do mount -t null
 /cgd1a/${dir} /${dir}; done).

 --apb (Alan Barrett)

From: Paul Ripke <stix@stix.id.au>
To: NetBSD gnats-bugs <gnats-bugs@NetBSD.org>
Cc: ad@NetBSD.org
Subject: Re: kern/39016 (processes stuck in "tstile")
Date: Sat, 27 Sep 2008 10:58:42 +1000

 I've just tripped over this, too. I had a build system running NetBSD
 4.0 x86 on a quad Intel, with softdep filesystems. I "upgraded" to
 amd64 current (20080926T0002 UTC) with filesystems mounted with "log".

 Previously, release build (-j 8) of 5 archs on this system would take
 about 3.5 hrs. After the upgrade, 12+ hrs. I switched the build
 filesystem to 'async' (I like living on the edge...) and it took the
 normal time.

 Watching processes, there are many in 'tstile' and the disk with the
 build fs stays almost permanently ~5% busy.

 I took a full crash dump, but I can't figure out how to get arbitrary
 backtraces from LWPs. http://www.netbsd.org/docs/kernel/ says to
 use 'proc' in gdb, but that just runs the user-defined procs function
 from src/sys/gdbscripts/procs. If I get a chance, I'll get some from
 ddb.

 I also noticed that crash dumping didn't appear to want to work
 with AHCI enabled - maybe it was just bad timing.

 FYI: current mounts are:
 /dev/wd0a on / type ffs (log, local)
 /dev/wd0f on /var type ffs (log, local)
 /dev/wd0e on /usr type ffs (log, local)
 tmpfs on /tmp type tmpfs (local)
 /dev/wd1a on /u type ffs (log, local)	<-- build fs
 kernfs on /kern type kernfs (local)
 zion:/export on /export type nfs
 zion:/usr/local on /usr/local type nfs
 zion:/usr/pkg on /usr/pkg type nfs
 zion:/usr/pkg64 on /usr/pkg64 type nfs

 -- 
 Paul Ripke

From: Paul Ripke <stix@stix.id.au>
To: NetBSD gnats-bugs <gnats-bugs@NetBSD.org>
Cc: ad@NetBSD.org
Subject: Re: kern/39016 (processes stuck in "tstile")
Date: Sat, 27 Sep 2008 15:58:48 +1000

 OK, just grabbed some backtraces from ddb. All the following processes
 were in 'tstile' wchan. I didn't see any other processes in interesting
 wchans (all others were wait, select, etc). I've also grabbed a crash
 dump from this state.

 Hand typed from ddb:

 pid: 28901 (collect2)
 sleepq_block()
 turnstile_block()
 rw_vector_enter()
 wapbl_begin()
 ffs_full_fsync()
 ffs_fsync()
 VOP_FSYNC()
 vinvalbuf()
 vclean()
 getcleanvnode()
 getnewvnode()
 tmpfs_alloc_vp()
 tmpfs_alloc_file()
 VOP_CREATE()
 vn_open()
 sys_open()
 syscall()

 pid: 21586 (sh)
 sleepq_block()
 turnstile_block()
 rw_vector_enter()
 wapbl_begin()
 ufs_makeinode()
 ufs_create()
 VOP_CREATE()
 vn_open()
 sys_open()
 syscall()

 pid: 29705 (vax--netbsdelf-r...)
 sleepq_block()
 turnstile_block()
 rw_vector_enter()
 wapbl_begin()
 ffs_write()
 VOP_WRITE()
 vn_write()
 dofilewrite()
 sys_write()
 syscall()

 pid: 266 (vax--netbsdelf-o...)
 sleepq_block()
 turnstile_block()
 rw_vector_enter()
 wapbl_begin()
 ffs_write()
 VOP_WRITE()
 vn_write()
 dofilewrite()
 sys_write()
 syscall()

 pid: 26891 (as)
 sleepq_block()
 turnstile_block()
 rw_vector_enter()
 wapbl_begin()
 ffs_write()
 VOP_WRITE()
 vn_write()
 dofilewrite()
 sys_write()
 syscall()

 pid: 8763 (ld)
 sleepq_block()
 turnstile_block()
 rw_vector_enter()
 wapbl_begin()
 ffs_write()
 VOP_WRITE()
 vn_write()
 dofilewrite()
 sys_write()
 syscall()

 pid: 5184 (nbmkdep)
 sleepq_block()
 turnstile_block()
 rw_vector_enter()
 wapbl_begin()
 ffs_write()
 VOP_WRITE()
 vn_write()
 dofilewrite()
 sys_write()
 syscall()

 pid: 16502 (nbmkdep)
 sleepq_block()
 turnstile_block()
 rw_vector_enter()
 wapbl_begin()
 ffs_write()
 VOP_WRITE()
 vn_write()
 dofilewrite()
 sys_write()
 syscall()

 I looked around in DDB a few times - this was the only occasion that I
 saw ffs_fsync, which had me worried the first time I saw it. Anyway,
 this definitely looks WAPBL related. BTW: this is a stock GENERIC
 kernel.

 -- 
 Paul Ripke
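
Note: the reasoning in the message above ("this definitely looks WAPBL
related") rests on every stuck stack sharing the same innermost frames.
The following is a minimal Python sketch of that comparison, not part of
the original report. It assumes the hand-typed stacks are listed
innermost frame first, and it copies only three of the eight stacks to
stay short. The helper name common_prefix is hypothetical.

```python
# Three of the hand-typed ddb stacks from the message above,
# innermost frame first. The remaining five are ffs_write() stacks
# identical to the one for pid 29705.
STACKS = {
    28901: ["sleepq_block", "turnstile_block", "rw_vector_enter",
            "wapbl_begin", "ffs_full_fsync", "ffs_fsync", "VOP_FSYNC",
            "vinvalbuf", "vclean", "getcleanvnode", "getnewvnode",
            "tmpfs_alloc_vp", "tmpfs_alloc_file", "VOP_CREATE",
            "vn_open", "sys_open", "syscall"],
    21586: ["sleepq_block", "turnstile_block", "rw_vector_enter",
            "wapbl_begin", "ufs_makeinode", "ufs_create", "VOP_CREATE",
            "vn_open", "sys_open", "syscall"],
    29705: ["sleepq_block", "turnstile_block", "rw_vector_enter",
            "wapbl_begin", "ffs_write", "VOP_WRITE", "vn_write",
            "dofilewrite", "sys_write", "syscall"],
}


def common_prefix(stacks):
    """Return the leading (innermost) frames shared by every stack."""
    shared = []
    for frames in zip(*stacks):
        if len(set(frames)) != 1:
            break
        shared.append(frames[0])
    return shared


shared = common_prefix(list(STACKS.values()))
# The first frame after the shared prefix is the caller that asked for
# the lock in each process.
callers = {pid: frames[len(shared)] for pid, frames in STACKS.items()}

print("shared blocking path:", " <- ".join(shared))
for pid, caller in callers.items():
    print(f"pid {pid}: blocked on behalf of {caller}()")
```

For these three stacks the shared path ends at wapbl_begin(), which is
consistent with the suspicion that all of them are waiting on the same
lock taken on entry to a WAPBL transaction. It does not show which
thread holds that lock.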

From: "S.P.Zeidler" <spz@serpens.de>
To: gnats-bugs@gnats.netbsd.org
Cc: 
Subject: Re: kern/39016
Date: Fri, 24 Oct 2008 22:56:12 +0200

 some debug info from ftp:

 db{0}> ps /l
  PID         LID S     FLAGS       STRUCT LWP *               NAME WAIT
  4551          1 3        84   ffff80007ad97040               ftpd netio
  3393          1 3        84   ffff80005912b000               ftpd netio
  25674         1 3        84   ffff800048147020               tcsh ttyraw
  18784         1 3        84   ffff80004d453040               ftpd netio
  15618         1 3        84   ffff80004d453420                csh ttyraw
  9666          1 3        84   ffff80004d453800       screen-4.0.3 select
  27313         1 3        84   ffff80004d453be0       screen-4.0.3 pause
  10939         1 3         4   ffff80005ce5abc0               ftpd tstile
  12258         1 3        84   ffff800048147bc0                csh pause
  13562         1 3        84   ffff80007c7a0400               sshd select
  26015         1 3        84   ffff80006ff5a400               sshd netio
  2159          1 3         4   ffff80007c7a0bc0               ftpd tstile
  22748         1 3        84   ffff800070d58be0               ftpd netio
  14391         1 3         4   ffff800070d58800               ftpd tstile
  13340         1 3         4   ffff8000601bf000               ftpd tstile
  24232         1 3         4   ffff8000601bf3e0               ftpd tstile
  4033          1 3         4   ffff8000601bf7c0               ftpd tstile
  23843         1 3         4   ffff8000698f27c0               ftpd tstile
  13403         1 3         4   ffff8000601bfba0               ftpd tstile
  13947         1 3         4   ffff8000698f2000               ftpd tstile
  11806         1 3         4   ffff800070d58420               ftpd tstile
 db{0}> t/a ffff80005ce5abc0
 trace: pid 10939 lid 1 at 0xffff800046d6c740
 sleepq_block() at netbsd:sleepq_block+0x10d
 turnstile_block() at netbsd:turnstile_block+0x2e1
 rw_vector_enter() at netbsd:rw_vector_enter+0x28c
 vlockmgr() at netbsd:vlockmgr+0xf6
 layer_lock() at netbsd:layer_lock+0x40
 VOP_LOCK() at netbsd:VOP_LOCK+0x64
 vn_lock() at netbsd:vn_lock+0xd9
 layerfs_root() at netbsd:layerfs_root+0x4d
 VFS_ROOT() at netbsd:VFS_ROOT+0x2a
 lookup() at netbsd:lookup+0x3c9
 namei() at netbsd:namei+0x1a4
 do_sys_stat() at netbsd:do_sys_stat+0x44
 sys___stat30() at netbsd:sys___stat30+0x2d
 syscall() at netbsd:syscall+0x9a
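
Note: when many LWPs are listed, it can help to tally the WAIT column of
the "ps /l" output and generate the "t/a" commands for the ones in
tstile. The following is a minimal Python sketch, not part of the
original report. It assumes the seven-column layout shown in the header
above (PID, LID, S, FLAGS, STRUCT LWP *, NAME, WAIT), that process names
contain no spaces, and that the listing was captured to text (for
example from a serial console). Only five rows are copied here. The
function name parse_ps_l is hypothetical.

```python
from collections import Counter

# A few rows copied from the "ps /l" listing above.
PS_OUTPUT = """\
 PID         LID S     FLAGS       STRUCT LWP *               NAME WAIT
 4551          1 3        84   ffff80007ad97040               ftpd netio
 25674         1 3        84   ffff800048147020               tcsh ttyraw
 10939         1 3         4   ffff80005ce5abc0               ftpd tstile
 2159          1 3         4   ffff80007c7a0bc0               ftpd tstile
 14391         1 3         4   ffff800070d58800               ftpd tstile
"""


def parse_ps_l(text):
    """Yield (pid, lwp_addr, name, wchan) for each 7-column data row."""
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and any row without exactly seven columns
        # (for example an LWP with an empty WAIT column).
        if len(fields) != 7 or not fields[0].isdigit():
            continue
        pid, _lid, _state, _flags, lwp, name, wchan = fields
        yield int(pid), lwp, name, wchan


rows = list(parse_ps_l(PS_OUTPUT))
by_wchan = Counter(wchan for _, _, _, wchan in rows)
tstile_lwps = [lwp for _, lwp, _, wchan in rows if wchan == "tstile"]

print(by_wchan.most_common())
for lwp in tstile_lwps:
    # "t/a <address>" is the ddb trace command used in the message above.
    print("t/a", lwp)
```

The generated commands still have to be typed into ddb by hand. The
sketch only saves the step of picking the tstile addresses out of a long
listing.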

Responsible-Changed-From-To: kern-bug-people->joerg
Responsible-Changed-By: joerg@NetBSD.org
Responsible-Changed-When: Tue, 18 Nov 2008 23:47:30 +0000
Responsible-Changed-Why:
WAPBL issue related to the thread
http://mail-index.netbsd.org/current-users/2008/10/31/msg005580.html


State-Changed-From-To: analyzed->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Thu, 16 Jul 2009 06:36:18 +0000
State-Changed-Why:
Did this get fixed?


From: Alan Barrett <apb@cequrux.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/39016: processes stuck in "tstile"
Date: Tue, 20 Oct 2009 08:07:16 +0200

 It's been a long time since I last noticed this problem.  Perhaps
 the PR can be closed.

 --apb (Alan Barrett)

From: Joerg Sonnenberger <joerg@britannica.bec.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/39016: processes stuck in "tstile"
Date: Tue, 20 Oct 2009 13:17:02 +0200

 On Tue, Oct 20, 2009 at 06:10:05AM +0000, Alan Barrett wrote:
 > The following reply was made to PR kern/39016; it has been noted by GNATS.
 > 
 > From: Alan Barrett <apb@cequrux.com>
 > To: gnats-bugs@NetBSD.org
 > Cc: 
 > Subject: Re: kern/39016: processes stuck in "tstile"
 > Date: Tue, 20 Oct 2009 08:07:16 +0200
 > 
 >  It's been a long time since I last noticed this problem.  Perhaps
 >  the PR can be closed.

 I've still seen it a number of times at least with slightly older
 kernels. Tron also said he could trigger it, but couldn't provide
 details yet.

 Joerg

State-Changed-From-To: feedback->closed
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Mon, 03 May 2010 04:16:27 +0000
State-Changed-Why:
This particular problem has been fixed. Other problems that lead to
"tstile syndrome" still exist, because "tstile syndrome" is any generic
deadlock.


>Unformatted:

Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.