NetBSD Problem Report #57507
From www@netbsd.org Thu Jul 6 19:56:25 2023
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 4B04A1A923D
for <gnats-bugs@gnats.NetBSD.org>; Thu, 6 Jul 2023 19:56:25 +0000 (UTC)
Message-Id: <20230706195623.420431A923E@mollari.NetBSD.org>
Date: Thu, 6 Jul 2023 19:56:23 +0000 (UTC)
From: jspath55@gmail.com
Reply-To: jspath55@gmail.com
To: gnats-bugs@NetBSD.org
Subject: Automated test case slimit_le_1_user fails intermittently on some architectures
X-Send-Pr-Version: www-1.0
>Number: 57507
>Category: misc
>Synopsis: Automated test case slimit_le_1_user fails intermittently on some architectures
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: misc-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Thu Jul 06 20:00:01 +0000 2023
>Last-Modified: Wed Jul 12 22:40:01 +0000 2023
>Originator: Jim Spath
>Release: 10.0_BETA
>Organization:
>Environment:
NetBSD etc.etc 10.0_BETA NetBSD 10.0_BETA (GENERIC) #0: Sun Feb 12 12:39:37 UTC 2023 mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64 x86_64 AMD 686-class NetBSD
>Description:
This test case fails intermittently on some of the architectures I tested, but not all.
Several tests were chosen from the full test suite after inconsistent pass/fail results were observed over many runs.
Tests were run periodically from cron, timed to avoid overlapping runs while completing as many attempts as each system's speed allowed.
The set reported here in July 2023 has failure rates around one percent or below; reports on tests that fail more frequently will follow.
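For context, the cron-driven sweeps described above can be sketched roughly as follows. This is an illustrative wrapper, not the exact script used in the report; the lock directory and log paths are hypothetical, and the mkdir-based lock is just one simple way to keep periodic runs from overlapping:

```shell
#!/bin/sh
# Hypothetical cron wrapper: serialize test sweeps with a mkdir lock
# so a slow run is never overlapped by the next scheduled one.
LOCKDIR=/tmp/atf-sweep.lock     # hypothetical path

if ! mkdir "$LOCKDIR" 2>/dev/null; then
    echo "previous sweep still running, skipping" >&2
    exit 0
fi
trap 'rmdir "$LOCKDIR"' EXIT

cd /usr/tests
atf-run fs/ffs/t_quotalimit:slimit_le_1_user \
    > "/root/sweep.$(date +%s).log" 2>&1
```

mkdir is atomic, so only one invocation can create the lock directory; a second invocation started while the first is running exits quietly.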
The results for the systems used:

  System              Passes  Fails
  amd64                 3998      8
  Pi3                    267      5
  Pi4                    791      0
  Pi0W                   595      2
  Pi02W                  888      2
  i386 (Intel Atom)      707      3
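The implied per-system failure rate is fails / (passes + fails); a convenience calculation (not part of the original report) for three of the systems:

```shell
# Implied failure rates from the table above, as percentages.
awk 'BEGIN {
    printf "amd64 %.2f%%\n", 8 / (3998 + 8) * 100
    printf "Pi3   %.2f%%\n", 5 / (267 + 5) * 100
    printf "Pi0W  %.2f%%\n", 2 / (595 + 2) * 100
}'
```

So the observed rates range from roughly 0.2% (amd64) up to about 1.8% (Pi3), consistent with the "around one percent" description.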
Example failure message:
tps-count: 1
tp-start: 1685596261.645414, fs/ffs/t_quotalimit, 1
tc-start: 1685596261.645555, slimit_le_1_user
tc-so:Executing command [ newfs -q user -B le -O 1 -s 4000 -F fsimage ]
tc-so:Executing command [ /usr/tests/fs/ffs/h_ffs_server -b fsimage unix://test ]
tc-so:Executing command [ env LD_PRELOAD=/usr/lib/librumphijack.so RUMPHIJACK=vfs=getvfsstat,blanket=/mnt edquota -u -s2k/4 -h3k/6 -t 1s/1d 1 ]
tc-so:Executing command [ rump.halt ]
tc-so:Executing command [ /usr/tests/fs/ffs/h_quota2_tests -b 1 fsimage unix://test ]
tc-so:Executing command [ env LD_PRELOAD=/usr/lib/librumphijack.so RUMPHIJACK=vfs=getvfsstat,blanket=/mnt quota -u -h 1 ]
tc-so:Executing command [ env LD_PRELOAD=/usr/lib/librumphijack.so RUMPHIJACK=vfs=getvfsstat,blanket=/mnt repquota -u /mnt ]
tc-so:Executing command [ env RUMP_SERVER=unix://test rump.halt ]
tc-so:Executing command [ fsck_ffs -nf -F fsimage ]
tc-end: 1685596622.716306, slimit_le_1_user, failed, Test case timed out after 360 seconds
tp-end: 1685596622.742382, fs/ffs/t_quotalimit
info: time.end, Thu Jun 1 05:17:02 UTC 2023
Second example failure:
tps-count: 18
tp-start: 1685840405.888079, fs/ffs/t_quotalimit, 1
tc-start: 1685840405.888403, slimit_le_1_user
tc-so:Executing command [ newfs -q user -B le -O 1 -s 4000 -F fsimage ]
tc-so:Executing command [ /usr/tests/fs/ffs/h_ffs_server -b fsimage unix://test ]
tc-so:Executing command [ env LD_PRELOAD=/usr/lib/librumphijack.so RUMPHIJACK=vfs=getvfsstat,blanket=/mnt edquota -u -s2k/4 -h3k/6 -t 1s/1d 1 ]
tc-so:Executing command [ rump.halt ]
tc-so:Executing command [ /usr/tests/fs/ffs/h_quota2_tests -b 1 fsimage unix://test ]
tc-se:Fail: regexp test 1: write beyond the soft limit after grace time returned 69: Disc quota exceeded not in stderr
tc-end: 1685840410.766193, slimit_le_1_user, failed, atf-check failed; see the output of the test for details
tp-end: 1685840410.783617, fs/ffs/t_quotalimit
Example pass message:
tps-count: 1
tp-start: 1685599861.881615, fs/ffs/t_quotalimit, 1
tc-start: 1685599861.881763, slimit_le_1_user
tc-so:Executing command [ newfs -q user -B le -O 1 -s 4000 -F fsimage ]
tc-so:Executing command [ /usr/tests/fs/ffs/h_ffs_server -b fsimage unix://test ]
tc-so:Executing command [ env LD_PRELOAD=/usr/lib/librumphijack.so RUMPHIJACK=vfs=getvfsstat,blanket=/mnt edquota -u -s2k/4 -h3k/6 -t 1s/1d 1 ]
tc-so:Executing command [ rump.halt ]
tc-so:Executing command [ /usr/tests/fs/ffs/h_quota2_tests -b 1 fsimage unix://test ]
tc-so:Executing command [ env LD_PRELOAD=/usr/lib/librumphijack.so RUMPHIJACK=vfs=getvfsstat,blanket=/mnt quota -u -h 1 ]
tc-so:Executing command [ env LD_PRELOAD=/usr/lib/librumphijack.so RUMPHIJACK=vfs=getvfsstat,blanket=/mnt repquota -u /mnt ]
tc-so:Executing command [ env RUMP_SERVER=unix://test rump.halt ]
tc-so:Executing command [ fsck_ffs -nf -F fsimage ]
tc-end: 1685599865.543896, slimit_le_1_user, passed
tp-end: 1685599865.561248, fs/ffs/t_quotalimit
info: time.end, Thu Jun 1 06:11:05 UTC 2023
>How-To-Repeat:
$ atf-run fs/ffs/t_quotalimit:slimit_le_1_user
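Because the failure rate is well under a few percent, a single invocation rarely reproduces the problem; a repetition loop along these lines may help (a sketch, not from the original report — the run count is illustrative, and this assumes atf-run returns a nonzero exit status when the test case fails):

```shell
#!/bin/sh
# Repeat the single test case and tally pass/fail counts.
# 500 runs is illustrative; observed failure rates were roughly 0.2-2%.
cd /usr/tests
pass=0; fail=0; i=1
while [ "$i" -le 500 ]; do
    if atf-run fs/ffs/t_quotalimit:slimit_le_1_user >/dev/null 2>&1; then
        pass=$((pass + 1))
    else
        fail=$((fail + 1))
    fi
    i=$((i + 1))
done
echo "passes=$pass fails=$fail"
```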
>Fix:
Unknown.
>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: misc/57507: Automated test case slimit_le_1_user fails
intermittently on some architectures
Date: Fri, 7 Jul 2023 10:49:54 +0200
On Thu, Jul 06, 2023 at 08:00:01PM +0000, jspath55@gmail.com wrote:
> Tests were run periodically from cron, timed to avoid overlapping
> runs while completing as many attempts as feasible based on the system
> speed.
Did the cron job run the single test you are reporting, or the full
atf test suite?
Were any processes left over (e.g. rump_server idle spinning) after the
test run, and was there something making sure those were all killed
after the run (or before the next one)?
Martin
From: Jim Spath <jspath55@gmail.com>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: misc/57507: Automated test case slimit_le_1_user fails
intermittently on some architectures
Date: Wed, 12 Jul 2023 18:37:11 -0400
On Fri, Jul 7, 2023 at 4:50 AM Martin Husemann <martin@duskware.de> wrote:
> Did the cron job run the single test you are reporting, or the full
> atf test suite?
Thank you for the guidance. I had one cron job that ran a script with
successive single ATF cases, such as:
cd /usr/tests
atf-run \
fs/ffs/t_quotalimit:slimit_le_1_user
atf-run \
fs/vfs/t_full:nfs_fillfs
[...]
> Were any processes left over (e.g. rump_server idle spinning) after the
> test run, and was there something making sure those were all killed
> after the run (or before the next one)?
I looked back over one of the systems which had infrequent failures,
and found these processes:
root 4502 0.0 0.2 82596 6068 ? Ssl Thu12AM 18:13.56
/usr/tests/fs/ffs/h_quota2_tests -b 1 fsimage unix://test
root 5350 0.0 0.2 83332 6020 ? Isl Thu07PM 15:55.09
/usr/tests/fs/ffs/h_quota2_tests -b 1 fsimage unix://test
root 20843 0.0 0.2 82932 6040 ? Isl 25Jun23 30:40.67
/usr/tests/fs/ffs/h_quota2_tests -b 1 fsimage unix://test
These start times correspond to the 3 failures of this test case on that system.
info: atf.version, Automated Testing Framework 0.20 (atf-0.20)
info: tests.root, /usr/tests
info: time.start, Sun Jun 25 03:11:00 UTC 2023
[...]
tps-count: 1
tp-start: 1687662660.702304, fs/ffs/t_quotalimit, 1
tc-start: 1687662660.702379, slimit_le_1_user
tc-so:Executing command [ newfs -q user -B le -O 1 -s 4000 -F fsimage ]
tc-so:Executing command [ /usr/tests/fs/ffs/h_ffs_server -b fsimage
unix://test ]
tc-so:Executing command [ env LD_PRELOAD=/usr/lib/librumphijack.so
RUMPHIJACK=vfs=getvfsstat,blanket=/mnt edquota -u -s2k/4 -h3k/6 -t
1s/1d 1 ]
tc-so:Executing command [ rump.halt ]
tc-so:Executing command [ /usr/tests/fs/ffs/h_quota2_tests -b 1
fsimage unix://test ]
tc-se:Fail: regexp test 1: write beyond the soft limit after grace
time returned 69: Disc quota exceeded not in stderr
tc-end: 1687662663.27629, slimit_le_1_user, failed, atf-check failed;
see the output of the test for details
tp-end: 1687662663.28139, fs/ffs/t_quotalimit
info: time.end, Sun Jun 25 03:11:03 UTC 2023
I did not have any steps in place to remove the orphan processes.
I did not see any other unexplained processes on that system; I will
review the others that are still "up". A couple of the smaller systems
ran out of one resource or another and I've rebooted them since the
test sweeps.
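A cleanup step like the following could run before each sweep to clear leftover helpers (a sketch, not something the original runs did; the helper names are taken from the process listing above, and killing by exact name assumes no unrelated processes share those names):

```shell
#!/bin/sh
# Kill leftover test helper processes from a previous sweep so an
# orphaned rump server cannot interfere with the next run.
for helper in h_quota2_tests h_ffs_server rump_server; do
    pkill -x "$helper" 2>/dev/null
done
exit 0
```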
Jim