NetBSD Problem Report #53928
From gson@gson.org Wed Jan 30 08:14:21 2019
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id BC79F7A18D
for <gnats-bugs@gnats.NetBSD.org>; Wed, 30 Jan 2019 08:14:21 +0000 (UTC)
Message-Id: <20190130081416.5967D98ACAD@guava.gson.org>
Date: Wed, 30 Jan 2019 10:14:16 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: modules/t_builtin:disable test case randomly fails
X-Send-Pr-Version: 3.95
>Number: 53928
>Category: kern
>Synopsis: modules/t_builtin:disable test case randomly fails
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Jan 30 08:15:00 +0000 2019
>Closed-Date: Sat Feb 23 13:07:37 +0000 2019
>Last-Modified: Sat Feb 23 13:07:37 +0000 2019
>Originator: Andreas Gustafsson
>Release: NetBSD-current
>Organization:
>Environment:
System: NetBSD
Architecture: amd64
Machine: amd64
>Description:
The "disable" test case of the modules/t_builtin test program randomly
fails with the error message "rump_sys_modctl(MODCTL_UNLOAD, kernfs):
Device busy".
For example, in the i386 tests on the TNF testbed, 78 runs out of 227
have failed so far this year. Log output from the most recent failure
is at:
http://releng.netbsd.org/b5reports/i386/2019/2019.01.29.12.01.20/test.html#modules_t_builtin_disable
The test did not fail in 2015 or 2016, and failed for the first
time with source date 2017.05.08.06.39.23.
It is also failing on real amd64 hardware, for example:
http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2019/2019.01.30.01.20.47/test.html#modules_t_builtin_disable
>How-To-Repeat:
Run the ATF tests repeatedly.
>Fix:
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: kern-bug-people->pgoyette
Responsible-Changed-By: pgoyette@NetBSD.org
Responsible-Changed-When: Wed, 30 Jan 2019 09:55:27 +0000
Responsible-Changed-Why:
Take (I'm looking into it)
From: Paul Goyette <paul@whooppee.com>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/53928: modules/t_builtin:disable test case randomly fails
Date: Wed, 30 Jan 2019 20:18:07 +0800 (PST)
I rebuilt a local librump with kern.module.verbose enabled and ran the
test case with RUMP_VERBOSE=1 defined. This resulted in the following:
...
[ 1.0000090] root file system type: rumpfs
[ 1.0000090] kern.module.path=/stand/amd64/8.99.32/modules
[ 1.0200090] DEBUG: module: unload requested for 'kernfs' (TRUE)
[ 1.0200090] DEBUG: module: cannot unload module `kernfs' error=16
The particular "cannot unload module" error is generated when there
is an error returned from the module's MODULE_CMD_FINI routine.
So, the problem here would appear to be vfs_detach() returning EBUSY
when trying to unload the module.
Since this is not strictly speaking a module problem, I'm going to
reassign this PR back to kern-bug-people. Someone who knows about
vfs_detach() can determine why it sometimes fails.
+------------------+--------------------------+----------------------------+
| Paul Goyette | PGP Key fingerprint: | E-mail addresses: |
| (Retired) | FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
+------------------+--------------------------+----------------------------+
Responsible-Changed-From-To: pgoyette->kern-bug-people
Responsible-Changed-By: pgoyette@NetBSD.org
Responsible-Changed-When: Wed, 30 Jan 2019 12:21:57 +0000
Responsible-Changed-Why:
Not module related, need someone familiar with cfs to look into it
From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: "gnats-bugs@netbsd.org" <gnats-bugs@NetBSD.org>
Cc:
Subject: Re: kern/53928: modules/t_builtin:disable test case randomly fails
Date: Wed, 30 Jan 2019 14:43:48 +0100
--Apple-Mail=_35FBCF58-F33F-407C-A784-C756827411A6
Content-Transfer-Encoding: 7bit
Content-Type: text/plain;
charset=us-ascii
> On 30. Jan 2019, at 13:20, Paul Goyette <paul@whooppee.com> wrote:
<snip>
> So, the problem here would appear to be vfs_detach() returning EBUSY
> when trying to unload the module.
This is most likely fstrans. Destroying a "struct mount" and
thus releasing its mnt_op vector is done by the last thread
running fstrans_enter() and seeing a mount with IMNT_GONE.
It may help to "sysctl -w kern.maxvnodes=2", which fails
but may trigger threads holding a reference to the mount.
--
J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig
--Apple-Mail=_35FBCF58-F33F-407C-A784-C756827411A6
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename=signature.asc
Content-Type: application/pgp-signature;
name=signature.asc
Content-Description: Message signed with OpenPGP
-----BEGIN PGP SIGNATURE-----
iQEzBAEBCAAdFiEE2BL3ha7Xao4WUZVYKoaVJdNr+uEFAlxRqhQACgkQKoaVJdNr
+uGgcwf/dyFMXo/BvpMUxdJc0fi4cPwnXdE7NPq+eRrcm/gWkDj5XhKIIr6GcuXu
uXzgQgQEmG7onT83amadsdR37NiMvKkPGnA540cwdQNLoU5Mik3oxgJKIblU6oE9
zyGdFquGLNBxwJDhOujmkJLIO2hv44pYgbXOMRAX9ImfjUOdaWVBfMVZv30YjRul
pXD7hU77+U/2TGFHWXmZR3oX1F3ealJBNNuUC8GzYhVPaZcSjhLoRDrG0dsnyKfk
KGag4afUFT0HlRk0ms77pAOsAezpxGpLyVS7zdoAn+l4XhaRMp7UF2JLaQYLGZma
SsDR6MykSimE7V2Vvo8Tq/8TTTcnvA==
=Ykih
-----END PGP SIGNATURE-----
--Apple-Mail=_35FBCF58-F33F-407C-A784-C756827411A6--
From: Andreas Gustafsson <gson@gson.org>
To: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/53928: modules/t_builtin:disable test case randomly fails
Date: Wed, 30 Jan 2019 18:28:07 +0200
J. Hannken-Illjes wrote:
> This is most likely fstrans.
In that case, perhaps the problem started with these commits, which
were made the day before the first recorded test failure on the
i386 testbed:
2017.05.07.08.21.08 hannken src/sys/kern/vfs_mount.c 1.59
2017.05.07.08.21.57 hannken src/sys/miscfs/genfs/layer_vnops.c 1.64
2017.05.07.08.22.40 hannken src/sys/fs/union/union_vnops.c 1.68
2017.05.07.08.23.28 hannken src/sys/kern/vfs_trans.c 1.44
2017.05.07.08.24.20 hannken src/sys/kern/vfs_mount.c 1.60
2017.05.07.08.24.20 hannken src/sys/kern/vfs_trans.c 1.45
2017.05.07.08.25.54 hannken src/sys/kern/vfs_syscalls.c 1.514
2017.05.07.08.25.54 hannken src/sys/miscfs/genfs/genfs_vfsops.c 1.6
2017.05.07.08.26.58 hannken src/sys/kern/vfs_mount.c 1.61
2017.05.07.08.26.58 hannken src/sys/kern/vfs_subr.c 1.464
2017.05.07.08.26.58 hannken src/sys/kern/vfs_syscalls.c 1.515
> Destroying a "struct mount" and
> thus releasing its mnt_op vector is done by the last thread
> running fstrans_enter() and seeing a mount with IMNT_GONE.
>
> It may help to "sysctl -w kern.maxvnodes=2", which fails
> but may trigger threads holding a reference to the mount.
I'm still confused as to which part of the system you think the
bug lies in, and what would be the correct fix, because surely
running "sysctl -w kern.maxvnodes=2" isn't it.
--
Andreas Gustafsson, gson@gson.org
From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/53928: modules/t_builtin:disable test case randomly fails
Date: Thu, 31 Jan 2019 13:02:05 +0700
Date: Wed, 30 Jan 2019 16:30:01 +0000 (UTC)
From: Andreas Gustafsson <gson@gson.org>
Message-ID: <20190130163001.6B03B7A1F7@mollari.NetBSD.org>
| I'm still confused as to which part of the system you think the
| bug lies in, and what would be the correct fix, because surely
| running "sysctl -w kern.maxvnodes=2" isn't it.
I have been waiting for someone more knowledgeable to answer
this, before throwing in my random guess ... but from what I
have read here, I suspect that the module unload routine (or
something that it calls) when doing the unload by a specific
request (rather than just the "can I unload this now" autounload
way) needs to stop and wait until all references to the filesystem
(including background kernel threads that only run periodically)
have finished with it.
Whether there is enough mechanism, on all sides, for this to be
done properly or not, I have no idea.
I assume the code has already a method to check that the filesystem
isn't in active use, so if it gets past that, and can lock the filesys
so it cannot be returned to use again (by being mounted while we're
waiting) all it should need to do is delay for a short while if it
gets the EBUSY in question, and try again.
I suspect this belongs inside the vfs interface somewhere, as this
isn't the kind of thing a generic module interface ought to know
about ... but it is possible that the same strategy might be needed
for some other part of the system (device drivers, security modules, ...)
so it might be worth having a specific error that could be returned
to the generic module code to indicate that it should try again soon
(EAGAIN ?) have the VFS code do that when the only reason it is still
busy is because it is waiting on the last cleanup actions to occur,
and then have the generic module code detect that error, wait, and
retry (a small number of times, with gradually longer waits, up to
say a max of a second total) before giving up and returning an error
to the user level.
kre
From: "Juergen Hannken-Illjes" <hannken@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/53928 CVS commit: src
Date: Wed, 20 Feb 2019 10:08:38 +0000
Module Name: src
Committed By: hannken
Date: Wed Feb 20 10:08:38 UTC 2019
Modified Files:
src/external/cddl/osnet/dist/uts/common/fs/zfs: zfs_ctldir.c
src/sys/kern: vfs_mount.c vfs_trans.c
Log Message:
Move fstrans_unmount() to vfs_rele(), just before it would free the mount.
Don't take a mount reference for fstrans as it gets notified about the release.
Defer the final free of the mount to fstrans_mount_dtor() when fstrans
has released all references to this mount. Prevents the mount's memory
to be reused as a new mount before fstrans released all references.
Address PR kern/53928 modules/t_builtin:disable test case randomly fails.
To generate a diff of this commit:
cvs rdiff -u -r1.5 -r1.6 \
src/external/cddl/osnet/dist/uts/common/fs/zfs/zfs_ctldir.c
cvs rdiff -u -r1.69 -r1.70 src/sys/kern/vfs_mount.c
cvs rdiff -u -r1.52 -r1.53 src/sys/kern/vfs_trans.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Sat, 23 Feb 2019 13:07:37 +0000
State-Changed-Why:
The test has now passed in 17 consecutive runs. I'm calling it fixed. Thanks hannken!
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.43 2018/01/16 07:36:43 maya Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2017
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.