NetBSD Problem Report #58317
From manu@netbsd.org Thu Jun 6 12:56:30 2024
Return-Path: <manu@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256
client-signature RSA-PSS (2048 bits) client-digest SHA256)
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 6EFCC1A9238
for <gnats-bugs@gnats.NetBSD.org>; Thu, 6 Jun 2024 12:56:30 +0000 (UTC)
Message-Id: <20240606125629.9837A84E78@mail.netbsd.org>
Date: Thu, 6 Jun 2024 12:56:29 +0000 (UTC)
From: manu@netbsd.org
Reply-To: manu@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: hang in vcache_vget()
X-Send-Pr-Version: 3.95
>Number: 58317
>Category: kern
>Synopsis: hang in vcache_vget() on NetBSD-10.0/i386 XEN3PAE_DOMU
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Thu Jun 06 13:00:00 +0000 2024
>Last-Modified: Mon Jul 01 01:30:01 +0000 2024
>Originator: Emmanuel Dreyfus
>Release: NetBSD 10.0_STABLE
>Organization:
NetBSD
>Environment:
NetBSD w4.net.espci.fr 10.0 NetBSD 10.0 (XEN3PAE_DOMU) #2: Mon May 13 14:12:32 CEST 2024 root@duplo:/pkg_comp/NetBSD-10stable-i386/src/sys/arch/i386/compile/XEN3PAE_DOMU i386
Architecture: i386
Machine: i386
>Description:
Machine hangs under moderate load. Many processes are waiting on tstile.
One is waiting on vnode with this backtrace:
sleepq_block
cv_wait
vcache_vget
vcache_get
ufs_lookup
VOP_LOOKUP
lookup_once
namei_tryemulroot.constprop.0
namei
vn_open
do_open
do_sys_openat
>How-To-Repeat:
No idea yet
>Fix:
No idea yet
>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/58317: hang in vcache_vget()
Date: Thu, 6 Jun 2024 15:47:06 +0200
On Thu, Jun 06, 2024 at 01:00:01PM +0000, manu@netbsd.org wrote:
> >How-To-Repeat:
> No idea yet
In the broken state, can you check "show uvmexp" from ddb, continue
and then break into ddb and repeat again after a few seconds?
Also a full ps output from ddb might be helpful.
Martin
From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/58317: hang in vcache_vget()
Date: Fri, 07 Jun 2024 00:53:51 +0700
I have seen that kind of thing as well - my pet (purely guessed)
victim (imagined cause) was USB drive I/O, but it probably isn't related.
One thing you should test is time - ie: just wait. When I'm
patient, things usually eventually recover. That is, it isn't
really hung, just waiting for something which takes a long time
to complete. And I do mean a long time, not "long by computer
standards" (but not astronomically, or even archaeologically, long either).
I mean tens of minutes, perhaps even an hour.
Things which are time sensitive (like keeping transfers running
without the remote end deciding you've vanished as no more data
is being taken - the TCP window sits at 0 for too long) tend to fail,
but the system itself in my experience, tends to recover.
That is, even though processes report being stuck in tstile
waits, and that in the past that often represented a deadlock
somewhere, I have not seen one of those now in a long time - these are,
or seem to be, just waits that are waiting much longer than we'd want
them to.
And to answer (from my experience) Greg's question - yes, in my
cases this tends to happen when there's memory pressure, but not
the kind of pressure that should be bothering anything (in over 2
years I've yet to see my system page anything out, ever - I have
swap space, plenty, but it always shows 0 used) - there is
something broken in some of the ubc/buffer cleanup code somewhere I think.
Much of the used memory should simply have been flushed ages ago.
My system has 64GB, which is quite a lot, but even if all of that
(every single page) was data waiting to be written, which it
obviously isn't, the slowest of my drives can handle 60MB/sec most
of the time (and others much more) so all 64GB could be written in not
much more than 1024 seconds, wherever it was destined, even if all to
one of the slower drives, which is unlikely, but that is less than some
of the hangs I have seen (30 mins or more).
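
As a rough check of the estimate above, the arithmetic can be sketched in sh. This assumes binary units (1 GB = 1024 MB) and a sustained 60 MB/sec for the whole transfer; the function name write_time_secs is made up for illustration.

```sh
#!/bin/sh
# Rough time needed to write a given amount of data at a sustained rate.
# Assumption: binary units (1 GB = 1024 MB). The two input values come
# from the paragraph above (64 GB of RAM, 60 MB/sec for the slowest drive).
write_time_secs() {
    gb=$1          # amount of data, in GB
    mb_per_sec=$2  # sustained write rate, in MB/sec
    echo $(( gb * 1024 / mb_per_sec ))   # integer division, rounds down
}

secs=$(write_time_secs 64 60)
echo "$secs seconds (about $(( secs / 60 )) minutes)"
```

With these inputs the result is 1092 seconds, roughly 18 minutes, which is still well short of the 30 minute hangs described.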
Further, even in that case, as some of that is written, it should
be being discarded, leading to available memory, which should be
allowing progress elsewhere, well before it is done. But apparently not.
My guess with that is that something is being locked, and stays locked,
while all of this gets cleaned up - and that lock prevents almost all
useful progress elsewhere from continuing. It is worth noting that
processes not doing actual I/O (like clocks, but also things like vmstat
and iostat in xterms) keep on working. So it looks like a tstile/deadlock
issue, but isn't.
I haven't yet managed to find which kernel thread is causing the problems.
And that assumes there is just one of them, acting alone.
Oh, and once it recovers, it is recovered (until next time) - it isn't as
if lots of memory is being lost somewhere (or not enough that I can
detect it anyway).
Also, as it might be related, a while ago now, the system had a sudden stop
due to power failure (either there was no UPS at the time, or the UPS
gave up, I forget ... I'm not currently running any UPS monitoring system,
which I know I should be, but that's not the issue here). When the
power returned, and the system rebooted, everything looked fine. That is,
except that data in some files was missing. Now that's expected: data is
buffered, and while we make sure (one way or another) that the meta-data
(directories, inodes, etc) are all consistent, there's no guarantee that
file data will have made it to disc. What was surprising here is that
when this happened, nothing much had been happening (it was an idle
day for me as far as computer work was concerned). Some of the files
that ended up with no valid data were e-mail messages I had fetched & read
more than 12 hours earlier (that is, before the power loss). I use nmh,
which uses one file for each message, and (as the meta data was all saved)
the mod times of those files were there (and correct, or at least, close to
when I'd expected they would be, I hadn't been making notes!) But the
message contents (the block contents) was just binary trash - whatever
had been in the blocks when they were last used for something else, no
signs of the e-mail message contents. (I lost nothing, I just had to
fetch and process that e-mail again, after deleting all the broken files).
Most probably there was other damaged data as well, but that would have
been harder to detect, and was probably unimportant.
NetBSD "lost" update(8) some time ago - but since the above I have been running
a sh script that does sync(2) (via sync(8)) every 30 secs (randomised a bit).
No more issues like that (though sync is one of the processes
that will sleep for lengthy periods when hangs happen). The syncs don't stop
the hangs however (though they might sometimes seem to trigger a short one,
if there has been lots of I/O happening recently - lots of file copying).
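
The script itself is not included in this PR. A minimal sketch of such a periodic sync loop might look like the following; the names BASE, JITTER, next_interval and sync_forever, and the 0 to 10 second jitter range, are assumptions made for illustration only.

```sh
#!/bin/sh
# Hypothetical reconstruction; the original script is not shown in this PR.
BASE=${BASE:-30}       # nominal interval between syncs, in seconds
JITTER=${JITTER:-10}   # assumed maximum extra delay, in seconds

# Print the next sleep time: BASE plus a random 0..JITTER seconds.
# awk supplies the random number so the script does not rely on $RANDOM.
# srand() with no argument seeds from the time of day, which is adequate
# here because successive calls are at least BASE seconds apart.
next_interval() {
    awk -v base="$BASE" -v jitter="$JITTER" \
        'BEGIN { srand(); print base + int(rand() * (jitter + 1)) }'
}

# Run sync(8), which calls sync(2), then sleep for a jittered interval.
# This loops forever, so it is only defined here and not invoked.
sync_forever() {
    while :; do
        sync
        sleep "$(next_interval)"
    done
}
```

To use it, call sync_forever at the end of the script and start the script in the background, for example from a boot-time rc script.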
kre
From: Ryo ONODERA <ryo@tetera.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/58317: hang in vcache_vget()
Date: Sun, 30 Jun 2024 22:54:02 +0900
Hi,
I can reproduce this problem during the build of pkgsrc/lang/perl5.
After reverting rev 1.562 of src/sys/kern/vfs_syscalls.c to rev 1.561,
I have no problem anymore.
From: Rin Okuyama <rokuyama.rk@gmail.com>
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, manu@netbsd.org
Cc:
Subject: Re: kern/58317: hang in vcache_vget()
Date: Mon, 1 Jul 2024 09:38:48 +0900
On 2024/06/30 22:55, Ryo ONODERA wrote:
> I can reproduce this problem during the build of pkgsrc/lang/perl5.
> Reverting rev 1.562 of src/sys/kern/vfs_syscalls.c to rev 1.561
> and I have no problem anymore.
Similar symptoms but different bugs?
This PR is for 10.0_STABLE as of 2024-06-06, while vfs_syscalls.c
rev 1.562 is for -current.
Thanks,
rin
From: Ryo ONODERA <ryo@tetera.org>
To: Rin Okuyama <rokuyama.rk@gmail.com>, gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/58317: hang in vcache_vget()
Date: Mon, 01 Jul 2024 10:27:53 +0900
On July 1, 2024 9:38:48 AM GMT+09:00, Rin Okuyama <rokuyama.rk@gmail.com> wrote:
>On 2024/06/30 22:55, Ryo ONODERA wrote:
>> I can reproduce this problem during the build of pkgsrc/lang/perl5.
>> Reverting rev 1.562 of src/sys/kern/vfs_syscalls.c to rev 1.561
>> and I have no problem anymore.
>
>Similar symptoms but different bugs?
>
>This PR is for 10.0_STABLE as of 2024-06-06, while vfs_syscalls.c
>rev 1.562 is for -current.
>
>Thanks,
>rin
Hi,
I am sorry. I had misunderstood the problem.
Thank you.
-- 
Ryo ONODERA // ryo@tetera.org
PGP fingerprint = 82A2 DC91 76E0 A10A 8ABB FD1B F404 27FA C7D1 15F3
>Unformatted:
NetBSD-10.0/i386 XEN3PAE_DOMU
with this addition:
https://releng.netbsd.org/cgi-bin/req-10.cgi?show=701