NetBSD Problem Report #48282
From gson@gson.org Sun Oct 6 12:47:11 2013
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "Postmaster NetBSD.org" (verified OK))
by mollari.NetBSD.org (Postfix) with ESMTPS id B19917109D
for <gnats-bugs@gnats.NetBSD.org>; Sun, 6 Oct 2013 12:47:11 +0000 (UTC)
Message-Id: <20131006124707.8DCBD75E4E@guava.gson.org>
Date: Sun, 6 Oct 2013 15:47:07 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@gnats.NetBSD.org
Subject: Running ATF tests on 6.1.2 in KVM VPS leaves processes running
X-Send-Pr-Version: 3.95
>Number: 48282
>Category: misc
>Synopsis: Running ATF tests on 6.1.2 in KVM VPS leaves processes running
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: misc-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sun Oct 06 12:50:01 +0000 2013
>Closed-Date: Sat Mar 29 14:01:04 +0000 2014
>Last-Modified: Sat Sep 19 10:45:01 +0000 2020
>Originator: Andreas Gustafsson
>Release: NetBSD 6.1.2 + patches to work around PRs 45671 and 48277
>Organization:
>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:
After working around PRs 45671 and 48277, I now have NetBSD 6.1.2
running on an edis.at KVM VPS. However, when I run the ATF tests with
cd /usr/tests
atf-run | atf-report
a number of failures are reported:
Failed test cases:
fs/nfs/t_rquotad:get_nfs_be_1_group, fs/nfs/t_rquotad:get_nfs_be_1_user, fs/nfs/t_rquotad:get_nfs_le_1_both, fs/vfs/t_renamerace:lfs_renamerace_dirs,
lib/libc/stdlib/t_strtod:strtold_inf, lib/libc/stdlib/t_strtod:strtold_nan, lib/libm/t_infinity:infinity_long_double
Summary for 500 test programs:
2970 passed test cases.
7 failed test cases.
35 expected failed test cases.
67 skipped test cases.
The strtold_inf, strtold_nan, and infinity_long_double failures are
known problems and also fail elsewhere (e.g., today's git QEMU), but
the get_nfs_be_1_group, get_nfs_be_1_user, get_nfs_le_1_both, and
lfs_renamerace_dirs failures are unexpected.
Running ps after the tests have finished, it shows a number of
rump_server processes and some NFS related processes still running:
gunvor# ps -glaxww
UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
0 0 0 0 125 0 0 2380 uvm DKl ? 0:08.11 [system]
0 1 0 10503 83 0 12864 1296 wait Is ? 0:00.04 init
0 96 1 0 85 0 23268 1544 kqueue Is ? 0:00.03 /usr/libexec/rpc.rquotad
0 158 1 0 85 0 23832 1860 kqueue Ss ? 0:00.07 /usr/sbin/syslogd -s
0 294 1 0 85 0 12880 1004 kqueue Is ? 0:00.00 /usr/sbin/powerd
0 318 1 13685 43 0 178832 7588 parked Isl ? 0:07.46 rump_server -lrumpvfs -lrumpdev -lrumpnet -lrumpnet_net -lrumpnet_netinet -lrumpnet_local -lrumpnet_shmif -lrumpdev_disk -lrumpfs_ffs -lrumpfs_nfs -lrumpfs_nfsserver -d key unix://test
0 323 1 0 85 0 57304 2856 select Is ? 0:00.01 /usr/sbin/sshd
0 383 1 0 85 0 10828 1284 nanoslp Ss ? 0:00.03 /usr/sbin/cron
0 419 1 16894 43 0 121020 6520 parked Isl ? 0:07.25 rump_server -lrumpvfs -lrumpnet -lrumpnet_net -lrumpnet_netinet -lrumpnet_shmif -lrumpfs_nfs unix://clientsock
0 505 1 0 85 0 49088 3852 kqueue Is ? 0:00.06 /usr/libexec/postfix/master
0 518 1 10373 43 0 35472 1576 parked Isl ? 0:00.01 nfsd -tu
0 519 1 0 85 0 14980 1080 kqueue Is ? 0:00.00 /usr/sbin/inetd -l
0 534 323 0 85 0 76380 5748 select Ss ? 0:01.10 sshd: root@pts/0 (sshd)
12 548 505 0 85 0 49208 3772 kqueue I ? 0:00.02 qmgr -l -t unix -u
12 581 505 0 85 0 49148 3772 kqueue I ? 0:00.02 pickup -l -t fifo -u
0 588 1 156264 38 0 178832 7584 parked Isl ? 0:08.52 rump_server -lrumpvfs -lrumpdev -lrumpnet -lrumpnet_net -lrumpnet_netinet -lrumpnet_local -lrumpnet_shmif -lrumpdev_disk -lrumpfs_ffs -lrumpfs_nfs -lrumpfs_nfsserver -d key unix://test
0 721 1 12865 43 0 178832 7584 parked Isl ? 0:07.89 rump_server -lrumpvfs -lrumpdev -lrumpnet -lrumpnet_net -lrumpnet_netinet -lrumpnet_local -lrumpnet_shmif -lrumpdev_disk -lrumpfs_ffs -lrumpfs_nfs -lrumpfs_nfsserver -d key unix://test
0 782 1 23453 42 0 116924 6504 parked Isl ? 0:08.25 rump_server -lrumpvfs -lrumpnet -lrumpnet_net -lrumpnet_netinet -lrumpnet_shmif -lrumpfs_nfs unix://clientsock
0 813 1 0 85 0 23340 1552 kqueue Is ? 0:00.04 rpcbind
0 1360 1 0 85 0 23356 1664 kqueue Is ? 0:00.03 mountd /rump/etc/exports
0 26263 1 0 85 0 23356 1664 kqueue Is ? 0:00.03 mountd /rump/etc/exports
0 26504 1 0 85 0 23268 1544 kqueue Is ? 0:00.03 /usr/libexec/rpc.rquotad
0 26754 1 56990 38 0 35472 1576 parked Isl ? 0:00.00 nfsd -tu
0 26792 1 0 85 0 23268 1544 kqueue Is ? 0:00.03 /usr/libexec/rpc.rquotad
0 27042 1 0 85 0 23340 1560 kqueue Is ? 0:00.04 rpcbind
0 27243 1 0 85 0 23356 1664 kqueue Is ? 0:00.03 mountd /rump/etc/exports
0 28061 1 0 85 0 23340 1556 kqueue Is ? 0:00.03 rpcbind
0 28308 1 102582 38 0 116924 6504 parked Isl ? 0:07.50 rump_server -lrumpvfs -lrumpnet -lrumpnet_net -lrumpnet_netinet -lrumpnet_shmif -lrumpfs_nfs unix://clientsock
0 29519 1 12065 42 0 35472 1576 parked Isl ? 0:00.01 nfsd -tu
0 562 534 0 85 0 13116 1796 wait Ss ttyp0 0:00.06 -sh
0 3677 562 0 43 0 12912 1140 - O+ ttyp0 0:00.00 ps -glaxww
0 538 1 0 85 0 12888 1288 ttyraw Is+ ttyE0 0:00.00 /usr/libexec/getty Pc console
0 444 1 0 85 0 12888 1288 ttyraw Is+ ttyE1 0:00.00 /usr/libexec/getty Pc ttyE1
0 531 1 0 85 0 12888 1288 ttyraw Is+ ttyE2 0:00.00 /usr/libexec/getty Pc ttyE2
0 558 1 0 85 0 12888 1288 ttyraw Is+ ttyE3 0:00.00 /usr/libexec/getty Pc ttyE3
This is a newly installed machine that is not configured as an NFS
server or client, so any NFS related processes above are leftovers
from the test, not deliberately started.
I don't know if this is a bug in NetBSD or one in KVM.
>How-To-Repeat:
>Fix:
>Release-Note:
>Audit-Trail:
From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: gson@gson.org (Andreas Gustafsson)
Subject: Re: misc/48282: Running ATF tests on 6.1.2 in KVM VPS leaves processes running
Date: Mon, 7 Oct 2013 14:55:25 +0300
I did some more debugging of this. First of all, I tracked
down the problem to the fs/nfs/t_rquotad test script.
Instead of running the full test suite, the problem can be triggered
manually by running:
cd /usr/tests/fs/nfs/
atf-run t_rquotad
This will yield a bunch of output including
tc-so:Executing command [ rump_server -lrumpvfs -lrumpnet -lrumpnet_net -lrumpnet_netinet -lrumpnet_shmif -lrumpfs_nfs unix://clientsock ]
tc-so:Executing command [ rump.ifconfig shmif0 create ]
tc-so:Executing command [ rump.ifconfig shmif0 linkstr shmbus ]
tc-so:Executing command [ rump.ifconfig shmif0 inet 10.1.1.100 ]
tc-so:Executing command [ mkdir /rump/mnt ]
(at this point, there is a one-minute pause)
tc-end: 1381146179.891394, get_nfs_be_1_user, failed, Test case timed out after 60 seconds
Although this output may look like the "mkdir" command is timing out,
the timeout actually happens in the following command,
mount_nfs 10.1.1.1:/export /rump/mnt
which ATF fails to log (that's PR 48285).
If I add a "sleep 1" between the "mkdir /rump/mnt" line and the
"mount_nfs" line, the test passes.
I still don't know what causes this, but perhaps the above contains
enough clues for someone to figure it out.
--
Andreas Gustafsson, gson@gson.org
From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: gson@gson.org (Andreas Gustafsson)
Subject: Re: misc/48282: Running ATF tests on 6.1.2 in KVM VPS leaves processes running
Date: Wed, 9 Oct 2013 16:48:51 +0300
I have now determined the cause of the problem. It's a race
condition: nfsd calls daemon() early in its main(), and by the time it
gets around to registering its RPCs, the test has already proceeded to
execute mount_nfs, which fails because the RPCs are not yet
registered.
Other daemons in NetBSD also suffer from the same general problem of
going into the background before they are ready to provide service.
Not that the problem is in any way limited to NetBSD; see for example
this article, which I mostly agree with:
http://welz.org.za/notes/on-starting-daemons.html
It would be good if we could come up with a standard solution to this
problem rather than solving it separately and differently in each
daemon. Once have a concrete proposal, I will post it on
tech-userlevel.
--
Andreas Gustafsson, gson@gson.org
From: "Andreas Gustafsson" <gson@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/48282 CVS commit: src/tests/fs/nfs
Date: Thu, 13 Mar 2014 12:45:15 +0000
Module Name: src
Committed By: gson
Date: Thu Mar 13 12:45:14 UTC 2014
Modified Files:
src/tests/fs/nfs: t_rquotad.sh
Log Message:
Add -lrumpnet_netinet6 to the rump_server invocation, to stop
rpc.rquotad from printing the error message "rpc.rquotad:
_svc_tli_create: could not open connection for udp6: Address family
not supported by protocol family". The error message currently ends
up in /dev/null because it is printed to stderr after calling daemon()
with noclose=0, but if the daemonization is moved to a later point to
fix the race condition of PR misc/48282, it will actually appear on
stderr and break the test, which is expecting stderr to be empty.
To generate a diff of this commit:
cvs rdiff -u -r1.3 -r1.4 src/tests/fs/nfs/t_rquotad.sh
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Andreas Gustafsson" <gson@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/48282 CVS commit: src/libexec/rpc.rquotad
Date: Tue, 18 Mar 2014 11:00:20 +0000
Module Name: src
Committed By: gson
Date: Tue Mar 18 11:00:20 UTC 2014
Modified Files:
src/libexec/rpc.rquotad: rquotad.c
Log Message:
Delay daemonization until RPC services have been registered.
Fixes one of the race conditions of PR misc/48282.
To generate a diff of this commit:
cvs rdiff -u -r1.32 -r1.33 src/libexec/rpc.rquotad/rquotad.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Andreas Gustafsson" <gson@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/48282 CVS commit: src/usr.sbin/nfsd
Date: Sat, 29 Mar 2014 13:50:53 +0000
Module Name: src
Committed By: gson
Date: Sat Mar 29 13:50:53 UTC 2014
Modified Files:
src/usr.sbin/nfsd: nfsd.c
Log Message:
When daemonizing, delay the parent exit until the daemon is
ready to provide service. Fixes one of the race conditions
of PR misc/48282.
To generate a diff of this commit:
cvs rdiff -u -r1.61 -r1.62 src/usr.sbin/nfsd/nfsd.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Sat, 29 Mar 2014 14:01:04 +0000
State-Changed-Why:
Races in both rpc.rquotad and nfsd are now fixed.
From: "Andreas Gustafsson" <gson@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/48282 CVS commit: src/usr.sbin/mountd
Date: Fri, 4 Apr 2014 12:45:59 +0000
Module Name: src
Committed By: gson
Date: Fri Apr 4 12:45:59 UTC 2014
Modified Files:
src/usr.sbin/mountd: mountd.c
Log Message:
When daemonizing, delay the parent exit until the daemon is ready to
provide service. Fixes a race that is a third possible cause of
PR misc/48282, in addition to the ones in rpc.rquotad and nfsd that
were fixed earlier.
To generate a diff of this commit:
cvs rdiff -u -r1.125 -r1.126 src/usr.sbin/mountd/mountd.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org, christos@NetBSD.org
Cc:
Subject: Re: misc/48282: Running ATF tests on 6.1.2 in KVM VPS leaves processes running
Date: Wed, 26 Aug 2020 17:11:08 +0300
The old race condition in nfsd reported in PR 48282 is back; when nfsd
daemonizes, the parent process can once again exit before the child
process is ready to provide service, because of the line
(void)write(detach_msg_pipe[1], "", 1);
added in 2016 by Christos' commit of nfsd.c 1.67, with the message
PR/51437: Hauke Fath: write a byte to the socket so that nfsd -u does not die.
The added line makes no sense to me, and I don't see how it could
possibly have fixed PR/51437 as claimed (but the commit of 1.68 the
next day probably did).
Since removing the line might be considered "reverting another
developer's commit" under section 11 of the NetBSD Commit Guidelines,
I'm hereby asking Christos to remove it.
--
Andreas Gustafsson, gson@gson.org
From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: misc/48282: Running ATF tests on 6.1.2 in KVM VPS leaves processes running
Date: Sat, 19 Sep 2020 13:41:47 +0300
The race condition in nfsd is fixed once again with christos' commit
of src/usr.sbin/nfsd/nfsd.c 1.73, so the rquotad tests should no
longer randomly fail.
And even if they do fail, they should no longer leave rump processes
running since my commit of src/tests/fs/nfs/t_rquotad.sh 1.8.
--
Andreas Gustafsson, gson@gson.org
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.