NetBSD Problem Report #58317

From manu@netbsd.org  Thu Jun  6 12:56:30 2024
Return-Path: <manu@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256
	 client-signature RSA-PSS (2048 bits) client-digest SHA256)
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 6EFCC1A9238
	for <gnats-bugs@gnats.NetBSD.org>; Thu,  6 Jun 2024 12:56:30 +0000 (UTC)
Message-Id: <20240606125629.9837A84E78@mail.netbsd.org>
Date: Thu,  6 Jun 2024 12:56:29 +0000 (UTC)
From: manu@netbsd.org
Reply-To: manu@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: hang in vcache_vget()
X-Send-Pr-Version: 3.95

>Number:         58317
>Category:       kern
>Synopsis:       hang in vcache_vget() on NetBSD-10.0/i386 XEN3PAE_DOMU
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Jun 06 13:00:00 +0000 2024
>Last-Modified:  Mon Jul 01 01:30:01 +0000 2024
>Originator:     Emmanuel Dreyfus
>Release:        NetBSD 10.0_STABLE
>Organization:
NetBSD
>Environment:
NetBSD w4.net.espci.fr 10.0 NetBSD 10.0 (XEN3PAE_DOMU) #2: Mon May 13 14:12:32 CEST 2024  root@duplo:/pkg_comp/NetBSD-10stable-i386/src/sys/arch/i386/compile/XEN3PAE_DOMU i386
Architecture: i386
Machine: i386
>Description:
	Machine hangs under moderate load. Many processes are waiting on tstile.
	One is waiting on vnode with this backtrace:

sleepq_block
cv_wait   
vcache_vget
vcache_get
ufs_lookup
VOP_LOOKUP
lookup_once
namei_tryemulroot.constprop.0
namei 
vn_open   
do_open   
do_sys_openat 

>How-To-Repeat:
	No idea yet
>Fix:
	No idea yet

>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/58317: hang in vcache_vget()
Date: Thu, 6 Jun 2024 15:47:06 +0200

 On Thu, Jun 06, 2024 at 01:00:01PM +0000, manu@netbsd.org wrote:
 > >How-To-Repeat:
 > 	No idea yet

 In the broken state, can you check "show uvmexp" from ddb, continue,
 and then break into ddb and repeat again after a few seconds?

 Also a full ps output from ddb might be helpful.

 Martin

From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/58317: hang in vcache_vget()
Date: Fri, 07 Jun 2024 00:53:51 +0700

 I have seen that kind of thing as well - my pet (purely guessed)
 victim (imagined cause) was USB drive I/O, but it probably isn't related.

 One thing you should test is time - ie: just wait.   When I'm
 patient, things usually eventually recover.   That is, it isn't
 really hung, just waiting for something which takes a long time
 to complete.   And I do mean a long time, not "long by computer
 standards" (but not astronomic, or even archaeological long either).
 I mean tens of minutes, perhaps even an hour. 

 Things which are time sensitive (like keeping transfers running
 without the remote end deciding you've vanished as no more data
 is being taken - the TCP window sits at 0 for too long) tend to fail,
 but the system itself in my experience, tends to recover.

 That is, even though processes report being stuck in tstile
 waits, and that in the past that often represented a deadlock
 somewhere, I have not seen one of those now in a long time - these are,
 or seem to be, just waits that are waiting much longer than we'd want
 them to.

 And to answer (from my experience) Greg's question - yes, in my
 cases this tends to happen when there's memory pressure, but not
 the kind of pressure that should be bothering anything (in over 2
 years I've yet to see my system page anything out, ever - I have
 swap space, plenty, but it always shows 0 used) - there is
 something broken in some of the ubc/buffer cleanup code somewhere I think.
 Much of the used memory should simply have been flushed ages ago.

 My system has 64GB, which is quite a lot, but even if all of that
 (every single page) was data waiting to be written, which it
 obviously isn't, the slowest of my drives can handle 60MB/sec most
 of the time (and others much more) so all 64GB could be written in not
 much more than 1024 seconds, wherever it was destined, even if all to
 one of the slower drives, which is unlikely, but that is less than some
 of the hangs I have seen (30 mins or more).
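
 As a rough check of the estimate above, here is a minimal Python sketch.
 It assumes 1 GB = 1024 MB and a sustained 60 MB/s write rate (the figures
 quoted in the text); the function name flush_seconds is hypothetical.

```python
# Back-of-the-envelope check of the flush-time estimate.
# Assumptions: 1 GB = 1024 MB, and the drive sustains the quoted rate.

def flush_seconds(memory_gb: float, rate_mb_per_s: float) -> float:
    """Seconds needed to write memory_gb of data at rate_mb_per_s."""
    return memory_gb * 1024 / rate_mb_per_s

seconds = flush_seconds(64, 60)
print(f"{seconds:.0f} s (~{seconds / 60:.1f} min)")  # prints "1092 s (~18.2 min)"
```

 Even under this worst-case assumption the write-out takes about 18 minutes,
 which is shorter than the 30-minute hangs described.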

 Further, even in that case, as some of that data is written, it should
 be discarded, freeing memory, which should allow progress elsewhere
 well before the whole flush is done.  But apparently not.

 My guess with that is that something is being locked, and stays locked,
 while all of this gets cleaned up - and that lock prevents almost all
 useful progress elsewhere from continuing.   It is worth noting that
 processes not doing actual I/O (like clocks, but also things like vmstat
 and iostat in xterms) keep on working.   So it looks like a tstile/deadlock
 issue, but isn't.

 I haven't yet managed to find which kernel thread is causing the problems.
 And that assumes there is just one of them, acting alone.

 Oh, and once it recovers, it is recovered (until next time) - it isn't as
 if lots of memory is being lost somewhere (or not enough that I can
 detect it anyway).


 Also, as it might be related, a while ago now, the system had a sudden stop
 due to power failure (either there was no UPS at the time, or the UPS
 gave up, I forget ... I'm not currently running any UPS monitoring system,
 which I know I should be, but that's not the issue here).   When the
 power returned, and the system rebooted, everything looked fine.   That is,
 except that data in some files was missing.   Now that's expected, data is
 buffered, and while we make sure (one way or another) that the meta-data
 (directories, inodes, etc) are all consistent, there's no guarantee that
 file data will have made it to disc.   What was surprising here is that
 when this happened, nothing much had been happening (it was an idle
 day for me as far as computer work was concerned).   Some of the files
 that ended up with no valid data were e-mail messages I had fetched & read
 more than 12 hours earlier (that is, before the power loss).   I use nmh,
 which uses one file for each message, and (as the meta data was all saved)
 the mod times of those files were there (and correct, or at least, close to
 when I'd expected they would be, I hadn't been making notes!)   But the
 message contents (the block contents) was just binary trash - whatever
 had been in the blocks when they were last used for something else, no
 signs of the e-mail message contents.   (I lost nothing, I just had to
 fetch and process that e-mail again, after deleting all the broken files).
 Most probably there was other damaged data as well, but that would have
 been harder to detect, and was probably unimportant.

 NetBSD "lost" update(8) some time ago - but ever since the above I have been
 running a sh script that does sync(2) (via sync(8)) every 30 secs (randomised
 a bit).   No more issues like that (though sync is one of the processes
 that will sleep for lengthy periods when hangs happen).   The syncs don't stop
 the hangs however (though they might sometimes seem to trigger a short one,
 if there has been lots of I/O happening recently - lots of file copying).
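
 The periodic sync described above could be sketched as follows. This is
 not the actual sh script, which is not shown in this report, but a
 Python illustration. The 5-second jitter and the names next_delay and
 sync_forever are assumptions made for the example; os.sync() calls sync(2).

```python
import os
import random
import time

def next_delay(base: float = 30.0, jitter: float = 5.0) -> float:
    """Return the base interval plus or minus a random jitter, in seconds."""
    # The jitter width is an assumption; the text only says "randomised a bit".
    return base + random.uniform(-jitter, jitter)

def sync_forever(base: float = 30.0, jitter: float = 5.0) -> None:
    """Flush dirty buffers via sync(2), then sleep a jittered interval, forever."""
    while True:
        os.sync()
        time.sleep(next_delay(base, jitter))

# To run as a daemon-like loop, call sync_forever() explicitly.
```

 As noted above, such a loop limits how much unwritten data can accumulate
 but does not prevent the hangs themselves.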

 kre

From: Ryo ONODERA <ryo@tetera.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/58317: hang in vcache_vget()
Date: Sun, 30 Jun 2024 22:54:02 +0900

 Hi,

 I can reproduce this problem during the build of pkgsrc/lang/perl5.
 After reverting rev 1.562 of src/sys/kern/vfs_syscalls.c to rev 1.561,
 I have no problem anymore.

From: Rin Okuyama <rokuyama.rk@gmail.com>
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, manu@netbsd.org
Cc: 
Subject: Re: kern/58317: hang in vcache_vget()
Date: Mon, 1 Jul 2024 09:38:48 +0900

 On 2024/06/30 22:55, Ryo ONODERA wrote:
 >   I can reproduce this problem during the build of pkgsrc/lang/perl5.
 >   Reverting rev 1.562 of src/sys/kern/vfs_syscalls.c to rev 1.561
 >   and I have no problem anymore.

 Similar symptoms but different bugs?

 This PR is for 10.0_STABLE as of 2024-06-06, while vfs_syscalls.c
 rev 1.562 is for -current.

 Thanks,
 rin

From: Ryo ONODERA <ryo@tetera.org>
To: Rin Okuyama <rokuyama.rk@gmail.com>, gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/58317: hang in vcache_vget()
Date: Mon, 01 Jul 2024 10:27:53 +0900

 On July 1, 2024 9:38:48 AM GMT+09:00, Rin Okuyama <rokuyama.rk@gmail.com> wrote:
 >On 2024/06/30 22:55, Ryo ONODERA wrote:
 >>   I can reproduce this problem during the build of pkgsrc/lang/perl5.
 >>   Reverting rev 1.562 of src/sys/kern/vfs_syscalls.c to rev 1.561
 >>   and I have no problem anymore.
 >
 >Similar symptoms but different bugs?
 >
 >This PR is for 10.0_STABLE as of 2024-06-06, while vfs_syscalls.c
 >rev 1.562 is for -current.
 >
 >Thanks,
 >rin

 Hi,

 I am sorry. I had misunderstood the problem.

 Thank you.
 -- 
 Ryo ONODERA // ryo@tetera.org
 PGP fingerprint = 82A2 DC91 76E0 A10A 8ABB  FD1B F404 27FA C7D1 15F3

>Unformatted:
 	NetBSD-10.0/i386 XEN3PAE_DOMU
 	with this addition:
 	https://releng.netbsd.org/cgi-bin/req-10.cgi?show=701
