NetBSD Problem Report #53928

From gson@gson.org  Wed Jan 30 08:14:21 2019
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id BC79F7A18D
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 30 Jan 2019 08:14:21 +0000 (UTC)
Message-Id: <20190130081416.5967D98ACAD@guava.gson.org>
Date: Wed, 30 Jan 2019 10:14:16 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: modules/t_builtin:disable test case randomly fails
X-Send-Pr-Version: 3.95

>Number:         53928
>Category:       kern
>Synopsis:       modules/t_builtin:disable test case randomly fails
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Jan 30 08:15:00 +0000 2019
>Closed-Date:    Sat Feb 23 13:07:37 +0000 2019
>Last-Modified:  Sat Feb 23 13:07:37 +0000 2019
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current
>Organization:

>Environment:
System: NetBSD
Architecture: amd64
Machine: amd64
>Description:

The "disable" test case of the modules/t_builtin test program randomly
fails with the error message "rump_sys_modctl(MODCTL_UNLOAD, kernfs):
Device busy".

For example, in the i386 tests on the TNF testbed, 78 runs out of 227
have failed so far this year.  Log output from the most recent failure
is at:

  http://releng.netbsd.org/b5reports/i386/2019/2019.01.29.12.01.20/test.html#modules_t_builtin_disable

The test did not fail in 2015 or 2016, and failed for the first
time with source date 2017.05.08.06.39.23.

It is also failing on real amd64 hardware, for example:

  http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2019/2019.01.30.01.20.47/test.html#modules_t_builtin_disable

>How-To-Repeat:

Run the ATF tests repeatedly.

>Fix:

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: kern-bug-people->pgoyette
Responsible-Changed-By: pgoyette@NetBSD.org
Responsible-Changed-When: Wed, 30 Jan 2019 09:55:27 +0000
Responsible-Changed-Why:
Take (I'm looking into it)


From: Paul Goyette <paul@whooppee.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/53928: modules/t_builtin:disable test case randomly fails
Date: Wed, 30 Jan 2019 20:18:07 +0800 (PST)

 I rebuilt a local librump with kern.module.verbose enabled and ran the
 test case with RUMP_VERBOSE=1 defined.  This resulted in the following:

 ...
 [   1.0000090] root file system type: rumpfs
 [   1.0000090] kern.module.path=/stand/amd64/8.99.32/modules
 [   1.0200090] DEBUG: module: unload requested for 'kernfs' (TRUE)
 [   1.0200090] DEBUG: module: cannot unload module `kernfs' error=16

 The particular "cannot unload module" error is generated when there
 is an error returned from the module's MODULE_CMD_FINI routine.

 So, the problem here would appear to be vfs_detach() returning EBUSY
 when trying to unload the module.

 Since this is not strictly speaking a module problem, I'm going to
 reassign this PR back to kern-bug-people.  Someone who knows about
 vfs_detach() can determine why it sometimes fails.


 +------------------+--------------------------+----------------------------+
 | Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:          |
 | (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com   |
 | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
 +------------------+--------------------------+----------------------------+

Responsible-Changed-From-To: pgoyette->kern-bug-people
Responsible-Changed-By: pgoyette@NetBSD.org
Responsible-Changed-When: Wed, 30 Jan 2019 12:21:57 +0000
Responsible-Changed-Why:
Not module related, need someone familiar with cfs to look into it


From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: "gnats-bugs@netbsd.org" <gnats-bugs@NetBSD.org>
Cc: 
Subject: Re: kern/53928: modules/t_builtin:disable test case randomly fails
Date: Wed, 30 Jan 2019 14:43:48 +0100

 --Apple-Mail=_35FBCF58-F33F-407C-A784-C756827411A6
 Content-Transfer-Encoding: 7bit
 Content-Type: text/plain;
 	charset=us-ascii

 > On 30. Jan 2019, at 13:20, Paul Goyette <paul@whooppee.com> wrote:
 <snip>
 > So, the problem here would appear to be vfs_detach() returning EBUSY
 > when trying to unload the module.

 This is most likely fstrans.  Destroying a "struct mount" and
 thus releasing its mnt_op vector is done by the last thread
 running fstrans_enter() and seeing a mount with IMNT_GONE.

 It may help to "sysctl -w kern.maxvnodes=2", which fails
 but may trigger threads holding a reference to the mount.

 --
 J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig

 --Apple-Mail=_35FBCF58-F33F-407C-A784-C756827411A6
 Content-Transfer-Encoding: 7bit
 Content-Disposition: attachment;
 	filename=signature.asc
 Content-Type: application/pgp-signature;
 	name=signature.asc
 Content-Description: Message signed with OpenPGP

 -----BEGIN PGP SIGNATURE-----

 iQEzBAEBCAAdFiEE2BL3ha7Xao4WUZVYKoaVJdNr+uEFAlxRqhQACgkQKoaVJdNr
 +uGgcwf/dyFMXo/BvpMUxdJc0fi4cPwnXdE7NPq+eRrcm/gWkDj5XhKIIr6GcuXu
 uXzgQgQEmG7onT83amadsdR37NiMvKkPGnA540cwdQNLoU5Mik3oxgJKIblU6oE9
 zyGdFquGLNBxwJDhOujmkJLIO2hv44pYgbXOMRAX9ImfjUOdaWVBfMVZv30YjRul
 pXD7hU77+U/2TGFHWXmZR3oX1F3ealJBNNuUC8GzYhVPaZcSjhLoRDrG0dsnyKfk
 KGag4afUFT0HlRk0ms77pAOsAezpxGpLyVS7zdoAn+l4XhaRMp7UF2JLaQYLGZma
 SsDR6MykSimE7V2Vvo8Tq/8TTTcnvA==
 =Ykih
 -----END PGP SIGNATURE-----

 --Apple-Mail=_35FBCF58-F33F-407C-A784-C756827411A6--

From: Andreas Gustafsson <gson@gson.org>
To: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/53928: modules/t_builtin:disable test case randomly fails
Date: Wed, 30 Jan 2019 18:28:07 +0200

 J. Hannken-Illjes wrote:
 >  This is most likely fstrans.

 In that case, perhaps the problem started with these commits, which
 were made the day before the first recorded test failure on the
 i386 testbed:

   2017.05.07.08.21.08 hannken src/sys/kern/vfs_mount.c 1.59
   2017.05.07.08.21.57 hannken src/sys/miscfs/genfs/layer_vnops.c 1.64
   2017.05.07.08.22.40 hannken src/sys/fs/union/union_vnops.c 1.68
   2017.05.07.08.23.28 hannken src/sys/kern/vfs_trans.c 1.44
   2017.05.07.08.24.20 hannken src/sys/kern/vfs_mount.c 1.60
   2017.05.07.08.24.20 hannken src/sys/kern/vfs_trans.c 1.45
   2017.05.07.08.25.54 hannken src/sys/kern/vfs_syscalls.c 1.514
   2017.05.07.08.25.54 hannken src/sys/miscfs/genfs/genfs_vfsops.c 1.6
   2017.05.07.08.26.58 hannken src/sys/kern/vfs_mount.c 1.61
   2017.05.07.08.26.58 hannken src/sys/kern/vfs_subr.c 1.464
   2017.05.07.08.26.58 hannken src/sys/kern/vfs_syscalls.c 1.515

 >  Destroying a "struct mount" and
 >  thus releasing its mnt_op vector is done by the last thread
 >  running fstrans_enter() and seeing a mount with IMNT_GONE.
 >
 >  It may help to "sysctl -w kern.maxvnodes=2", which fails
 >  but may trigger threads holding a reference to the mount.

 I'm still confused as to which part of the system you think the
 bug lies in, and what would be the correct fix, because surely
 running "sysctl -w kern.maxvnodes=2" isn't it.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/53928: modules/t_builtin:disable test case randomly fails
Date: Thu, 31 Jan 2019 13:02:05 +0700

     Date:        Wed, 30 Jan 2019 16:30:01 +0000 (UTC)
     From:        Andreas Gustafsson <gson@gson.org>
     Message-ID:  <20190130163001.6B03B7A1F7@mollari.NetBSD.org>

   |  I'm still confused as to which part of the system you think the
   |  bug lies in, and what would be the correct fix, because surely
   |  running "sysctl -w kern.maxvnodes=2" isn't it.

 I have been waiting for someone more knowledgeable to answer
 this, before throwing in my random guess ... but from what I
 have read here, I suspect that the module unload routine (or
 something that it calls) when doing the unload by a specific
 request (rather than just the "can I unload this now" autounload
 way) needs to stop and wait until all references to the filesystem
 (including background kernel threads that only run periodically)
 have finished with it.

 Whether there is enough mechanism, on all sides, for this to be
 done properly or not, I have no idea.

 I assume the code has already a method to check that the filesystem
 isn't in active use, so if it gets past that, and can lock the filesys
 so it cannot be returned to use again (by being mounted while we're
 waiting) all it should need to do is delay for a short while if it
 gets the EBUSY in question, and try again.

 I suspect this belongs inside the vfs interface somewhere, as this
 isn't the kind of thing a generic module interface ought to know
 about ... but it is possible that the same strategy might be needed
 for some other part of the system (device drivers, security modules, ...)
 so it might be worth having a specific error that could be returned
 to the generic module code to indicate that it should try again soon
 (EAGAIN ?) have the VFS code do that when the only reason it is still
 busy is because it is waiting on the last cleanup actions to occur,
 and then have the generic module code detect that error, wait, and
 retry (a small number of times, with gradually longer waits, up to
 say a max of a second total) before giving up and returning an error
 to the user level.

 kre

From: "Juergen Hannken-Illjes" <hannken@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/53928 CVS commit: src
Date: Wed, 20 Feb 2019 10:08:38 +0000

 Module Name:	src
 Committed By:	hannken
 Date:		Wed Feb 20 10:08:38 UTC 2019

 Modified Files:
 	src/external/cddl/osnet/dist/uts/common/fs/zfs: zfs_ctldir.c
 	src/sys/kern: vfs_mount.c vfs_trans.c

 Log Message:
 Move fstrans_unmount() to vfs_rele(), just before it would free the mount.
 Don't take a mount reference for fstrans as it gets notified about the release.

 Defer the final free of the mount to fstrans_mount_dtor() when fstrans
 has released all references to this mount.  Prevents the mount's memory
 to be reused as a new mount before fstrans released all references.

 Address PR kern/53928 modules/t_builtin:disable test case randomly fails.


 To generate a diff of this commit:
 cvs rdiff -u -r1.5 -r1.6 \
     src/external/cddl/osnet/dist/uts/common/fs/zfs/zfs_ctldir.c
 cvs rdiff -u -r1.69 -r1.70 src/sys/kern/vfs_mount.c
 cvs rdiff -u -r1.52 -r1.53 src/sys/kern/vfs_trans.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Sat, 23 Feb 2019 13:07:37 +0000
State-Changed-Why:
The test has now passed in 17 consecutive runs.  I'm calling it fixed.  Thanks hannken!


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.43 2018/01/16 07:36:43 maya Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2017 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.