NetBSD Problem Report #48282

From gson@gson.org  Sun Oct  6 12:47:11 2013
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id B19917109D
	for <gnats-bugs@gnats.NetBSD.org>; Sun,  6 Oct 2013 12:47:11 +0000 (UTC)
Message-Id: <20131006124707.8DCBD75E4E@guava.gson.org>
Date: Sun,  6 Oct 2013 15:47:07 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@gnats.NetBSD.org
Subject: Running ATF tests on 6.1.2 in KVM VPS leaves processes running
X-Send-Pr-Version: 3.95

>Number:         48282
>Category:       misc
>Synopsis:       Running ATF tests on 6.1.2 in KVM VPS leaves processes running
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    misc-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Oct 06 12:50:01 +0000 2013
>Closed-Date:    Sat Mar 29 14:01:04 +0000 2014
>Last-Modified:  Sat Sep 19 10:45:01 +0000 2020
>Originator:     Andreas Gustafsson
>Release:        NetBSD 6.1.2 + patches to work around PRs 45671 and 48277
>Organization:
>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:

After working around PRs 45671 and 48277, I now have NetBSD 6.1.2
running on an edis.at KVM VPS.  However, when I run the ATF tests with

   cd /usr/tests
   atf-run | atf-report

a number of failures are reported:

    Failed test cases:
	fs/nfs/t_rquotad:get_nfs_be_1_group, fs/nfs/t_rquotad:get_nfs_be_1_user, fs/nfs/t_rquotad:get_nfs_le_1_both, fs/vfs/t_renamerace:lfs_renamerace_dirs,
	lib/libc/stdlib/t_strtod:strtold_inf, lib/libc/stdlib/t_strtod:strtold_nan, lib/libm/t_infinity:infinity_long_double

    Summary for 500 test programs:
	2970 passed test cases.
	7 failed test cases.
	35 expected failed test cases.
	67 skipped test cases.

The strtold_inf, strtold_nan, and infinity_long_double failures are
known problems and also fail elsewhere (e.g., today's git QEMU), but
the get_nfs_be_1_group, get_nfs_be_1_user, get_nfs_le_1_both, and
lfs_renamerace_dirs failures are unexpected.

Running ps after the tests have finished, it shows a number of
rump_server processes and some NFS related processes still running:

gunvor# ps -glaxww
UID   PID PPID    CPU PRI NI    VSZ  RSS WCHAN   STAT TTY      TIME COMMAND
  0     0    0      0 125  0      0 2380 uvm     DKl  ?     0:08.11 [system]
  0     1    0  10503  83  0  12864 1296 wait    Is   ?     0:00.04 init 
  0    96    1      0  85  0  23268 1544 kqueue  Is   ?     0:00.03 /usr/libexec/rpc.rquotad 
  0   158    1      0  85  0  23832 1860 kqueue  Ss   ?     0:00.07 /usr/sbin/syslogd -s 
  0   294    1      0  85  0  12880 1004 kqueue  Is   ?     0:00.00 /usr/sbin/powerd 
  0   318    1  13685  43  0 178832 7588 parked  Isl  ?     0:07.46 rump_server -lrumpvfs -lrumpdev -lrumpnet -lrumpnet_net -lrumpnet_netinet -lrumpnet_local -lrumpnet_shmif -lrumpdev_disk -lrumpfs_ffs -lrumpfs_nfs -lrumpfs_nfsserver -d key unix://test 
  0   323    1      0  85  0  57304 2856 select  Is   ?     0:00.01 /usr/sbin/sshd 
  0   383    1      0  85  0  10828 1284 nanoslp Ss   ?     0:00.03 /usr/sbin/cron 
  0   419    1  16894  43  0 121020 6520 parked  Isl  ?     0:07.25 rump_server -lrumpvfs -lrumpnet -lrumpnet_net -lrumpnet_netinet -lrumpnet_shmif -lrumpfs_nfs unix://clientsock 
  0   505    1      0  85  0  49088 3852 kqueue  Is   ?     0:00.06 /usr/libexec/postfix/master 
  0   518    1  10373  43  0  35472 1576 parked  Isl  ?     0:00.01 nfsd -tu 
  0   519    1      0  85  0  14980 1080 kqueue  Is   ?     0:00.00 /usr/sbin/inetd -l 
  0   534  323      0  85  0  76380 5748 select  Ss   ?     0:01.10 sshd: root@pts/0 (sshd)
 12   548  505      0  85  0  49208 3772 kqueue  I    ?     0:00.02 qmgr -l -t unix -u 
 12   581  505      0  85  0  49148 3772 kqueue  I    ?     0:00.02 pickup -l -t fifo -u 
  0   588    1 156264  38  0 178832 7584 parked  Isl  ?     0:08.52 rump_server -lrumpvfs -lrumpdev -lrumpnet -lrumpnet_net -lrumpnet_netinet -lrumpnet_local -lrumpnet_shmif -lrumpdev_disk -lrumpfs_ffs -lrumpfs_nfs -lrumpfs_nfsserver -d key unix://test 
  0   721    1  12865  43  0 178832 7584 parked  Isl  ?     0:07.89 rump_server -lrumpvfs -lrumpdev -lrumpnet -lrumpnet_net -lrumpnet_netinet -lrumpnet_local -lrumpnet_shmif -lrumpdev_disk -lrumpfs_ffs -lrumpfs_nfs -lrumpfs_nfsserver -d key unix://test 
  0   782    1  23453  42  0 116924 6504 parked  Isl  ?     0:08.25 rump_server -lrumpvfs -lrumpnet -lrumpnet_net -lrumpnet_netinet -lrumpnet_shmif -lrumpfs_nfs unix://clientsock 
  0   813    1      0  85  0  23340 1552 kqueue  Is   ?     0:00.04 rpcbind 
  0  1360    1      0  85  0  23356 1664 kqueue  Is   ?     0:00.03 mountd /rump/etc/exports 
  0 26263    1      0  85  0  23356 1664 kqueue  Is   ?     0:00.03 mountd /rump/etc/exports 
  0 26504    1      0  85  0  23268 1544 kqueue  Is   ?     0:00.03 /usr/libexec/rpc.rquotad 
  0 26754    1  56990  38  0  35472 1576 parked  Isl  ?     0:00.00 nfsd -tu 
  0 26792    1      0  85  0  23268 1544 kqueue  Is   ?     0:00.03 /usr/libexec/rpc.rquotad 
  0 27042    1      0  85  0  23340 1560 kqueue  Is   ?     0:00.04 rpcbind 
  0 27243    1      0  85  0  23356 1664 kqueue  Is   ?     0:00.03 mountd /rump/etc/exports 
  0 28061    1      0  85  0  23340 1556 kqueue  Is   ?     0:00.03 rpcbind 
  0 28308    1 102582  38  0 116924 6504 parked  Isl  ?     0:07.50 rump_server -lrumpvfs -lrumpnet -lrumpnet_net -lrumpnet_netinet -lrumpnet_shmif -lrumpfs_nfs unix://clientsock 
  0 29519    1  12065  42  0  35472 1576 parked  Isl  ?     0:00.01 nfsd -tu 
  0   562  534      0  85  0  13116 1796 wait    Ss   ttyp0 0:00.06 -sh 
  0  3677  562      0  43  0  12912 1140 -       O+   ttyp0 0:00.00 ps -glaxww 
  0   538    1      0  85  0  12888 1288 ttyraw  Is+  ttyE0 0:00.00 /usr/libexec/getty Pc console 
  0   444    1      0  85  0  12888 1288 ttyraw  Is+  ttyE1 0:00.00 /usr/libexec/getty Pc ttyE1 
  0   531    1      0  85  0  12888 1288 ttyraw  Is+  ttyE2 0:00.00 /usr/libexec/getty Pc ttyE2 
  0   558    1      0  85  0  12888 1288 ttyraw  Is+  ttyE3 0:00.00 /usr/libexec/getty Pc ttyE3 

This is a newly installed machine that is not configured as an NFS
server or client, so any NFS related processes above are leftovers
from the test, not deliberately started.

I don't know if this is a bug in NetBSD or one in KVM.

>How-To-Repeat:

>Fix:

>Release-Note:

>Audit-Trail:
From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: gson@gson.org (Andreas Gustafsson)
Subject: Re: misc/48282: Running ATF tests on 6.1.2 in KVM VPS leaves processes running
Date: Mon, 7 Oct 2013 14:55:25 +0300

 I did some more debugging of this.  First of all, I tracked
 down the problem to the fs/nfs/t_rquotad test script.

 Instead of running the full test suite, the problem can be triggered
 manually by running:

   cd /usr/tests/fs/nfs/
   atf-run t_rquotad

 This will yield a bunch of output including

   tc-so:Executing command [ rump_server -lrumpvfs -lrumpnet -lrumpnet_net -lrumpnet_netinet -lrumpnet_shmif -lrumpfs_nfs unix://clientsock ]
   tc-so:Executing command [ rump.ifconfig shmif0 create ]
   tc-so:Executing command [ rump.ifconfig shmif0 linkstr shmbus ]
   tc-so:Executing command [ rump.ifconfig shmif0 inet 10.1.1.100 ]
   tc-so:Executing command [ mkdir /rump/mnt ]

 (at this point, there is a one-minute pause)

   tc-end: 1381146179.891394, get_nfs_be_1_user, failed, Test case timed out after 60 seconds

 Although this output may look like the "mkdir" command is timing out,
 the timeout actually happens in the following command,

   mount_nfs 10.1.1.1:/export /rump/mnt

 which ATF fails to log (that's PR 48285).

 If I add a "sleep 1" between the "mkdir /rump/mnt" line and the
 "mount_nfs" line, the test passes.

 I still don't know what causes this, but perhaps the above contains
 enough clues for someone to figure it out.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: gson@gson.org (Andreas Gustafsson)
Subject: Re: misc/48282: Running ATF tests on 6.1.2 in KVM VPS leaves processes running
Date: Wed, 9 Oct 2013 16:48:51 +0300

 I have now determined the cause of the problem.  It's a race
 condition: nfsd calls daemon() early in its main(), and by the time it
 gets around to registering its RPCs, the test has already proceeded to
 execute mount_nfs, which fails because the RPCs are not yet
 registered.

 Other daemons in NetBSD also suffer from the same general problem of
 going into the background before they are ready to provide service.
 Not that the problem is in any way limited to NetBSD; see for example
 this article, which I mostly agree with:

    http://welz.org.za/notes/on-starting-daemons.html

 It would be good if we could come up with a standard solution to this
 problem rather than solving it separately and differently in each
 daemon.  Once have a concrete proposal, I will post it on
 tech-userlevel.
 -- 
 Andreas Gustafsson, gson@gson.org

From: "Andreas Gustafsson" <gson@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/48282 CVS commit: src/tests/fs/nfs
Date: Thu, 13 Mar 2014 12:45:15 +0000

 Module Name:	src
 Committed By:	gson
 Date:		Thu Mar 13 12:45:14 UTC 2014

 Modified Files:
 	src/tests/fs/nfs: t_rquotad.sh

 Log Message:
 Add -lrumpnet_netinet6 to the rump_server invocation, to stop
 rpc.rquotad from printing the error message "rpc.rquotad:
 _svc_tli_create: could not open connection for udp6: Address family
 not supported by protocol family".  The error message currently ends
 up in /dev/null because it is printed to stderr after calling daemon()
 with noclose=0, but if the daemonization is moved to a later point to
 fix the race condition of PR misc/48282, it will actually appear on
 stderr and break the test, which is expecting stderr to be empty.


 To generate a diff of this commit:
 cvs rdiff -u -r1.3 -r1.4 src/tests/fs/nfs/t_rquotad.sh

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Andreas Gustafsson" <gson@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/48282 CVS commit: src/libexec/rpc.rquotad
Date: Tue, 18 Mar 2014 11:00:20 +0000

 Module Name:	src
 Committed By:	gson
 Date:		Tue Mar 18 11:00:20 UTC 2014

 Modified Files:
 	src/libexec/rpc.rquotad: rquotad.c

 Log Message:
 Delay daemonization until RPC services have been registered.
 Fixes one of the race conditions of PR misc/48282.


 To generate a diff of this commit:
 cvs rdiff -u -r1.32 -r1.33 src/libexec/rpc.rquotad/rquotad.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Andreas Gustafsson" <gson@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/48282 CVS commit: src/usr.sbin/nfsd
Date: Sat, 29 Mar 2014 13:50:53 +0000

 Module Name:	src
 Committed By:	gson
 Date:		Sat Mar 29 13:50:53 UTC 2014

 Modified Files:
 	src/usr.sbin/nfsd: nfsd.c

 Log Message:
 When daemonizing, delay the parent exit until the daemon is
 ready to provide service.  Fixes one of the race conditions
 of PR misc/48282.


 To generate a diff of this commit:
 cvs rdiff -u -r1.61 -r1.62 src/usr.sbin/nfsd/nfsd.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Sat, 29 Mar 2014 14:01:04 +0000
State-Changed-Why:
Races in both rpc.rquotad and nfsd are now fixed.


From: "Andreas Gustafsson" <gson@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/48282 CVS commit: src/usr.sbin/mountd
Date: Fri, 4 Apr 2014 12:45:59 +0000

 Module Name:	src
 Committed By:	gson
 Date:		Fri Apr  4 12:45:59 UTC 2014

 Modified Files:
 	src/usr.sbin/mountd: mountd.c

 Log Message:
 When daemonizing, delay the parent exit until the daemon is ready to
 provide service.  Fixes a race that is a third possible cause of
 PR misc/48282, in addition to the ones in rpc.rquotad and nfsd that
 were fixed earlier.


 To generate a diff of this commit:
 cvs rdiff -u -r1.125 -r1.126 src/usr.sbin/mountd/mountd.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org, christos@NetBSD.org
Cc: 
Subject: Re: misc/48282: Running ATF tests on 6.1.2 in KVM VPS leaves processes running
Date: Wed, 26 Aug 2020 17:11:08 +0300

 The old race condition in nfsd reported in PR 48282 is back; when nfsd
 daemonizes, the parent process can once again exit before the child
 process is ready to provide service, because of the line

   (void)write(detach_msg_pipe[1], "", 1);

 added in 2016 by Christos' commit of nfsd.c 1.67, with the message

   PR/51437: Hauke Fath: write a byte to the socket so that nfsd -u does not die.

 The added line makes no sense to me, and I don't see how it could
 possibly have fixed PR/51437 as claimed (but the commit of 1.68 the
 next day probably did).

 Since removing the line might be considered "reverting another
 developer's commit" under section 11 of the NetBSD Commit Guidelines,
 I'm hereby asking Christos to remove it.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: misc/48282: Running ATF tests on 6.1.2 in KVM VPS leaves processes running
Date: Sat, 19 Sep 2020 13:41:47 +0300

 The race condition in nfsd is fixed once again with christos' commit
 of src/usr.sbin/nfsd/nfsd.c 1.73, so the rquotad tests should no
 longer randomly fail.

 And even if they do fail, they should no longer leave rump processes
 running since my commit of src/tests/fs/nfs/t_rquotad.sh 1.8.
 -- 
 Andreas Gustafsson, gson@gson.org

>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.