NetBSD Problem Report #57507

From www@netbsd.org  Thu Jul  6 19:56:25 2023
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 4B04A1A923D
	for <gnats-bugs@gnats.NetBSD.org>; Thu,  6 Jul 2023 19:56:25 +0000 (UTC)
Message-Id: <20230706195623.420431A923E@mollari.NetBSD.org>
Date: Thu,  6 Jul 2023 19:56:23 +0000 (UTC)
From: jspath55@gmail.com
Reply-To: jspath55@gmail.com
To: gnats-bugs@NetBSD.org
Subject: Automated test case slimit_le_1_user fails intermittently on some architectures
X-Send-Pr-Version: www-1.0

>Number:         57507
>Category:       misc
>Synopsis:       Automated test case slimit_le_1_user fails intermittently on some architectures
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    misc-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Jul 06 20:00:01 +0000 2023
>Last-Modified:  Wed Jul 12 22:40:01 +0000 2023
>Originator:     Jim Spath
>Release:        10.0_BETA
>Organization:
>Environment:
NetBSD etc.etc 10.0_BETA NetBSD 10.0_BETA (GENERIC) #0: Sun Feb 12 12:39:37 UTC 2023  mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64 x86_64 AMD 686-class NetBSD

>Description:
This test case fails intermittently on some architectures I tested but not all.
Several tests were chosen from the full test suite after observing inconsistent pass/fail results over many runs.
Tests were run periodically from cron, timed to avoid overlapping runs while completing as many attempts as feasible based on the system speed.

The set being reported in July 2023 includes failure rates in the one-percent ranges, to be followed by other tests that fail more frequently.

The results for systems used:

System Passes Fails

amd64 3998 8
Pi3 267 5
Pi4 791 0
Pi0W 595 2
Pi02W 888 2
i386 (Intel Atom) 707 3


Example failure message:

tps-count: 1
tp-start: 1685596261.645414, fs/ffs/t_quotalimit, 1
tc-start: 1685596261.645555, slimit_le_1_user
tc-so:Executing command [ newfs -q user -B le -O 1 -s 4000 -F fsimage ]
tc-so:Executing command [ /usr/tests/fs/ffs/h_ffs_server -b fsimage unix://test ]
tc-so:Executing command [ env LD_PRELOAD=/usr/lib/librumphijack.so RUMPHIJACK=vfs=getvfsstat,blanket=/mnt edquota -u -s2k/4 -h3k/6 -t 1s/1d 1 ]
tc-so:Executing command [ rump.halt ]
tc-so:Executing command [ /usr/tests/fs/ffs/h_quota2_tests -b 1 fsimage unix://test ]
tc-so:Executing command [ env LD_PRELOAD=/usr/lib/librumphijack.so RUMPHIJACK=vfs=getvfsstat,blanket=/mnt quota -u -h 1 ]
tc-so:Executing command [ env LD_PRELOAD=/usr/lib/librumphijack.so RUMPHIJACK=vfs=getvfsstat,blanket=/mnt repquota -u /mnt ]
tc-so:Executing command [ env RUMP_SERVER=unix://test rump.halt ]
tc-so:Executing command [ fsck_ffs -nf -F fsimage ]
tc-end: 1685596622.716306, slimit_le_1_user, failed, Test case timed out after 360 seconds
tp-end: 1685596622.742382, fs/ffs/t_quotalimit
info: time.end, Thu Jun 1 05:17:02 UTC 2023


Example 2 failure:

tps-count: 18
tp-start: 1685840405.888079, fs/ffs/t_quotalimit, 1
tc-start: 1685840405.888403, slimit_le_1_user
tc-so:Executing command [ newfs -q user -B le -O 1 -s 4000 -F fsimage ]
tc-so:Executing command [ /usr/tests/fs/ffs/h_ffs_server -b fsimage unix://test ]
tc-so:Executing command [ env LD_PRELOAD=/usr/lib/librumphijack.so RUMPHIJACK=vfs=getvfsstat,blanket=/mnt edquota -u -s2k/4 -h3k/6 -t 1s/1d 1 ]
tc-so:Executing command [ rump.halt ]
tc-so:Executing command [ /usr/tests/fs/ffs/h_quota2_tests -b 1 fsimage unix://test ]
tc-se:Fail: regexp test 1: write beyond the soft limit after grace time returned 69: Disc quota exceeded not in stderr
tc-end: 1685840410.766193, slimit_le_1_user, failed, atf-check failed; see the output of the test for details
tp-end: 1685840410.783617, fs/ffs/t_quotalimit


Example pass message:

tps-count: 1
tp-start: 1685599861.881615, fs/ffs/t_quotalimit, 1
tc-start: 1685599861.881763, slimit_le_1_user
tc-so:Executing command [ newfs -q user -B le -O 1 -s 4000 -F fsimage ]
tc-so:Executing command [ /usr/tests/fs/ffs/h_ffs_server -b fsimage unix://test ]
tc-so:Executing command [ env LD_PRELOAD=/usr/lib/librumphijack.so RUMPHIJACK=vfs=getvfsstat,blanket=/mnt edquota -u -s2k/4 -h3k/6 -t 1s/1d 1 ]
tc-so:Executing command [ rump.halt ]
tc-so:Executing command [ /usr/tests/fs/ffs/h_quota2_tests -b 1 fsimage unix://test ]
tc-so:Executing command [ env LD_PRELOAD=/usr/lib/librumphijack.so RUMPHIJACK=vfs=getvfsstat,blanket=/mnt quota -u -h 1 ]
tc-so:Executing command [ env LD_PRELOAD=/usr/lib/librumphijack.so RUMPHIJACK=vfs=getvfsstat,blanket=/mnt repquota -u /mnt ]
tc-so:Executing command [ env RUMP_SERVER=unix://test rump.halt ]
tc-so:Executing command [ fsck_ffs -nf -F fsimage ]
tc-end: 1685599865.543896, slimit_le_1_user, passed
tp-end: 1685599865.561248, fs/ffs/t_quotalimit
info: time.end, Thu Jun 1 06:11:05 UTC 2023



>How-To-Repeat:
$ atf-run fs/ffs/t_quotalimit:slimit_le_1_user

>Fix:
Unknown.

>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: misc/57507: Automated test case slimit_le_1_user fails
 intermittently on some architectures
Date: Fri, 7 Jul 2023 10:49:54 +0200

 On Thu, Jul 06, 2023 at 08:00:01PM +0000, jspath55@gmail.com wrote:
 > Tests were run periodically from cron, timed to avoid overlapping
 > runs while completing as many attempts as feasible based on the system
 > speed.

 Did the cron job run the single test you are reporting, or the full
 atf test suite?

 Were any processes left over (e.g. rump_server idle spinning) after the
 test run, and was there something making sure those were all killed
 after the run (or before the next one)?

 Martin

From: Jim Spath <jspath55@gmail.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: misc/57507: Automated test case slimit_le_1_user fails
 intermittently on some architectures
Date: Wed, 12 Jul 2023 18:37:11 -0400

 On Fri, Jul 7, 2023 at 4:50=E2=80=AFAM Martin Husemann <martin@duskware.de>=
  wrote:
 >  Did the cron job run the single test you are reporting, or the full
 >  atf test suite?

 Thank you for the guidance. I had one cron job that ran a script with
 successive single ATF cases, such as:

 cd /usr/tests
 atf-run \
  fs/ffs/t_quotalimit:slimit_le_1_user
 atf-run \
  fs/vfs/t_full:nfs_fillfs
 [...]

 >  Were any processes left over (e.g. rump_server idle spinning) after the
 >  test run, and was there something making sure those were all killed
 >  after the run (or before the next one)?

 I looked back over one of the systems which had infrequent failures,
 and found these processes:

 root       4502  0.0  0.2  82596  6068 ?      Ssl  Thu12AM  18:13.56
 /usr/tests/fs/ffs/h_quota2_tests -b 1 fsimage unix://test
 root       5350  0.0  0.2  83332  6020 ?      Isl  Thu07PM  15:55.09
 /usr/tests/fs/ffs/h_quota2_tests -b 1 fsimage unix://test
 root      20843  0.0  0.2  82932  6040 ?      Isl  25Jun23  30:40.67
 /usr/tests/fs/ffs/h_quota2_tests -b 1 fsimage unix://test

 These times equate to the 3 failures on that system of this test case.

 info: atf.version, Automated Testing Framework 0.20 (atf-0.20)
 info: tests.root, /usr/tests
 info: time.start, Sun Jun 25 03:11:00 UTC 2023
 [...]
 tps-count: 1
 tp-start: 1687662660.702304, fs/ffs/t_quotalimit, 1
 tc-start: 1687662660.702379, slimit_le_1_user
 tc-so:Executing command [ newfs -q user -B le -O 1 -s 4000 -F fsimage ]
 tc-so:Executing command [ /usr/tests/fs/ffs/h_ffs_server -b fsimage
 unix://test ]
 tc-so:Executing command [ env LD_PRELOAD=3D/usr/lib/librumphijack.so
 RUMPHIJACK=3Dvfs=3Dgetvfsstat,blanket=3D/mnt edquota -u -s2k/4 -h3k/6 -t
 1s/1d 1 ]
 tc-so:Executing command [ rump.halt ]
 tc-so:Executing command [ /usr/tests/fs/ffs/h_quota2_tests -b 1
 fsimage unix://test ]
 tc-se:Fail: regexp test 1: write beyond the soft limit after grace
 time returned 69: Disc quota exceeded not in stderr
 tc-end: 1687662663.27629, slimit_le_1_user, failed, atf-check failed;
 see the output of the test for details
 tp-end: 1687662663.28139, fs/ffs/t_quotalimit
 info: time.end, Sun Jun 25 03:11:03 UTC 2023

 I did not have any steps in place to remove the orphan processes.

 I did not see any other unexplained processes on that system; I will
 review the others that are still "up". A couple of the smaller systems
 ran out of one resource or another and I've rebooted them since the
 test sweeps.

 Jim

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.47 2022/09/11 19:34:41 kim Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2023 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.