NetBSD Problem Report #52364
From old4@o2.pl Mon Jul 3 22:50:07 2017
Return-Path: <old4@o2.pl>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id C70C87A212
for <gnats-bugs@gnats.NetBSD.org>; Mon, 3 Jul 2017 22:50:07 +0000 (UTC)
Message-Id: <20170703210316.GA16562@Bell.tvk.wroc.pl>
Date: Mon, 3 Jul 2017 23:03:16 +0200
From: old4@o2.pl
Reply-To: old4@o2.pl
To: gnats-bugs@NetBSD.org
Subject: System hangs not much before showing the login prompt; changes in
VFS suspected
X-Send-Pr-Version: 3.95
>Number: 52364
>Category: kern
>Synopsis: System hangs not much before showing the login prompt; changes in VFS suspected
>Confidential: no
>Severity: serious
>Priority: low
>Responsible: kern-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Jul 03 22:55:00 +0000 2017
>Closed-Date: Sun Jun 27 18:30:41 +0000 2021
>Last-Modified: Sun Jun 27 18:30:41 +0000 2021
>Originator: Mateusz Poszwa
>Release: NetBSD 8.99.1 2017/06/27 12:21
>Organization:
Mateusz Poszwa
>Environment:
System: NetBSD Bell 7.99.71 NetBSD 7.99.71 (GENERIC) #2: Sun Jul 2 17:59:05 CEST 2017 beb@Bell:/home/beb/sysbuild/amd64/obj/home/beb/sysbuild/src/sys/arch/amd64/compile/GENERIC amd64
Architecture: x86_64
Machine: amd64
>Description:
Note: Times specified in CEST (UTC +2).
Initially, the system booting proceeds as
usual. Messages from device detection and init
scripts are visible (dhcp and xdm among others).
When the procedure comes close to the point in which it
would normally show a login prompt (either xdm, or, with
xdm turned off, the text-based one) the system hangs:
messages stop appearing and the prompt is not visible.
After some time, the disk spins down, probably due to
power-saving capabilities. If xdm is turned on, the
system manages to switch to an empty VT before hanging.
In some versions closer to the last working one, it
was possible to switch VTs (all missing the login
prompt). In newer versions it is no longer possible.
Pressing Ctr-Alt-Esc brings up ddb on the first VT, but
as my keyboard is connected through a USB port (and a
PS/2–USB adapter) I cannot issue commands (I’m sorry).
>How-To-Repeat:
Try to boot NetBSD-current from 2017/05/17 18:00 source
(7.99.71 at the time) or later. I tried several times
with 7.99.73, .74, .75, and some versions of 8.99.1 up to
2017/06/27. The earlier version and date comes from bisecting
the problem with precision up to two kernel-specific commits.
A version before
http://mail-index.netbsd.org/source-changes/2017/05/17/msg084531.html
worked fine. A version after
http://mail-index.netbsd.org/source-changes/2017/05/17/msg084532.html
did not.
I deleted my build directory before realising these two
changes were not a single commit. I can check the last
remaining version (between the two changes) on request.
Since I do not know of anyone facing a similar problem,
it is highly probable that it is specific to my
hardware or configuration. Fortunately, it seems
to be reproducible every time I try. Any help with
supplying the relevant details will be appreciated.
>Fix:
Boot into single user mode.
# fsck -fy
# mount -rw /
Replace the kernel with a version from 2017/05/17 12:00 source.
# shutdown -r now
Note that no changes to the userspace programs are necessary.
8.99.1 userspace programs seem to work fine with 7.99.71 kernel.
>Release-Note:
>Audit-Trail:
From: coypu@sdf.org
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/52364: System hangs not much before showing the login
prompt; changes in VFS suspected
Date: Tue, 4 Jul 2017 06:30:34 +0000
Hi,
The error case of vfs_suspend means we keep on holding vfs_suspend_lock.
vrevoke_suspend_next will convey the failure by returning NULL, but I
think vrevoke might not be checking for that.
unionfs seems to be the only one not using genfs_suspendctl, so possibly
the odd one and the reason no one else is seeing this.
./unionfs/unionfs_vfsops.c: .vfs_suspendctl = (void *)eopnotsupp,
From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/52364: System hangs not much before showing the login
prompt; changes in VFS suspected
Date: Tue, 4 Jul 2017 10:28:17 +0200
--Apple-Mail=_0CDE0FE9-5A71-4849-B082-21827C920AA6
Content-Transfer-Encoding: 7bit
Content-Type: text/plain;
charset=us-ascii
First, this has nothing to do with fs/unionfs, it is dead code and
we use fs/union.
As you are not able to enter commands at the ddb prompt please apply
the attached diff to sys/kern/vfs_vnode.c Rev 1.88. It will now
print diagnostics for every call to vrevoke() like this:
revoke: vnode 0xfffffe8438244e70 flags 0x30<MPSAFE,LOCKSWORK>
<...>
Fstrans locks by lwp:
Fstrans state by mount:
revoke 0xfffffe8438244e70 done
The printout starting at the last "revoke: vnode..." is interesting.
Maybe putting "ddb.commandonenter=trace" into /etc/sysctl.conf would
help too ...
--
J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)
--Apple-Mail=_0CDE0FE9-5A71-4849-B082-21827C920AA6
Content-Disposition: attachment;
filename=vfs_vnode.c.diff
Content-Type: application/octet-stream;
x-unix-mode=0644;
name="vfs_vnode.c.diff"
Content-Transfer-Encoding: 7bit
Index: vfs_vnode.c
===================================================================
RCS file: /cvsroot/src/sys/kern/vfs_vnode.c,v
retrieving revision 1.88
diff -p -u -2 -r1.88 vfs_vnode.c
--- vfs_vnode.c 17 May 2017 12:46:14 -0000 1.88
+++ vfs_vnode.c 4 Jul 2017 08:18:36 -0000
@@ -956,7 +956,11 @@ vrevoke(vnode_t *vp)
enum vtype type;
dev_t dev;
+ extern void fstrans_dump(int);
KASSERT(vp->v_usecount > 0);
+ vprint("revoke", vp);
+ fstrans_dump(0);
+
mp = vp->v_mount;
error = vfs_suspend(mp, 0);
@@ -982,5 +986,7 @@ vrevoke(vnode_t *vp)
if (mp)
vfs_resume(mp);
- mp = vp->v_mount;
+ vprint("revoke2", vq);
+ fstrans_dump(0);
+ mp = vq->v_mount;
error = vfs_suspend(mp, 0);
KASSERT(error == 0 || error == EOPNOTSUPP);
@@ -993,4 +999,5 @@ vrevoke(vnode_t *vp)
if (mp)
vfs_resume(mp);
+ printf("revoke %p done\n", vp);
}
--Apple-Mail=_0CDE0FE9-5A71-4849-B082-21827C920AA6--
From: Mateusz Poszwa <old4@o2.pl>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/52364: System hangs not much before showing the login
prompt; changes in VFS suspected
Date: Tue, 4 Jul 2017 21:35:38 +0200
--pf9I7BMVVzbSWLtt
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
I issued
# sysctl -w ddb.commandonenter=trace
in single-user mode and resumed the boot. I’m
afraid the output (attached) is not very helpful.
When running the patched kernel, there were two diagnostic printouts.
One happened just after outputting the modules path and ended
with a ‘done’ line. I didn’t capture it before the VT switch.
The second one seems to be more interesting.
You’ll find the transcription attached.
Was the change from vp->v_mount to vq->v_mount
intentional? I’m asking just in case.
Thank you for your prompt response.
--pf9I7BMVVzbSWLtt
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=dump
Starting smartd.
Starting saslauthd.
Starting nasd.
Network Audio system Release 1.9.4
Starting estd.
/etc/rc: WARNING: $dirmngr is not set properly - see rc.conf(5).
Starting dbus.
Starting cron.
The following components reported failures:
/etc/rc.d/wpa_suplicant
See /var/run/rc.log for more information.
Tue Jul 4 18:00:02 CEST 2017
usl_detachtimeout
usl_detachtimeout: not detaching
fatal breakpoint trap in supervisor mode
trap type 1 code 0 rip 0xffffffff802249f5 cs 0x8 rflags 0x202 cr2 0x6f833f7bd000 ilevel 0x6 rsp 0xffffe40099ba3dc0
curlwp 0xffffe4015fb37860 pid 0.5 lowest kstack 0xffffe40099ba02c0
Stopped in pid 0.5 (system) at netbsd:breakpoint+0x5: leave
breakpoint() at netbsd:breakpoint+0x5
wskbd_translate() at netbsd:wskbd_translate+0xbb4
wskbd_input() at netbsd:wskbd_input+0x5b
ukbd_decode() at netbsd:ukbd_decode+0x253
callout_softclock() at netbsd:callout_softclock+0x41c
softint_dispatch() at netbsd:softint_dispatch+0xd3
DDB lost frame for netbsd:Xsoftintr+0x4f
--- interrupt ---
0:
db{0}>
--pf9I7BMVVzbSWLtt
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=dump-revoke
Starting nasd.
Network Audio System Release 1.9.4
Starting estd.
/etc/rc: WARNING: $dirmngr is not set properly - see rc.conf(5)
Starting dbus.
Starting cron.
The following components reported failures:
/etc/rc.d/wpa_suplicant
See /var/run/rc.log for more information.
Tue Jul 4 19:31:02 CEST 2017
revoke: vnode 0xfffffe8107f0d7e0 flags 0x34<ISTTY,MPSAFE,LOCKSWORK>
tag VT_UFS(1) type VCHR(4) mount 0xfffffe810807e008 typedata 0xfffffe815fde0f58
usecount 3 rwitecount 2 holdcount 0
size 0 writesize 0 numoutput 0
data 0xfffffe81080abd00 lock 0xfffffe8107f0d910
state Active key(0xfffffe810807e008 8) 07 4f 1d 00 00 00 00 00
lrulisthd 0xffffffff815ad400
tag VT_UFS, ino 1920775, on dev 0, 0 flags 0x2010, nlink 1
mode 020622, owner 0, group 4, size 0
Fstrans locks by lwp:
772.1 (/) shared 2 cow 0
748.1 (/) shared 2 cow 0
352.1 (/) shared 2 cow 0
432.1 (/) shared 2 cow 0
434.1 (/) shared 2 cow 0
941.1 (/) shared 2 cow 0
361.1 (/) shared 2 cow 0
819.1 (/) shared 2 cow 0
838.1 (/) shared 2 cow 0
481.1 (/) shared 2 cow 0
418.1 (/) shared 2 cow 0
757.1 (/) shared 2 cow 0
826.1 (/) shared 2 cow 0
426.1 (/) shared 2 cow 0
867.1 (/) shared 2 cow 0
725.1 (/) shared 2 cow 0
805.1 (/) shared 2 cow 0
791.1 (/) shared 2 cow 0
764.1 (/) shared 1 cow 0
363.1 (/) shared 1 cow 0
659.1 (/) shared 1 cow 0
783.1 (/) shared 1 cow 0
Fstrans state by mount:
usl_detachtimeout
usl_detachtimeout: not detaching
fatal breakpoint trap in supervisor mode
trap type 1 code 0 rip 0xffffffff802157d5 cs 0x8 rflags 0x202 cr2 0x7b8d92746000 ilevel 0x6 rsp 0xfffffe8099ba8db0
curlwp 0xfffffe815fb37860 pid 0.5 lowest kstack 0xfffffe8099ba02c0
Stopped in pid 0.5 (system) at netbsd:breakpoint+0x5: leave
db{0}>
--pf9I7BMVVzbSWLtt--
From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/52364: System hangs not much before showing the login
prompt; changes in VFS suspected
Date: Wed, 5 Jul 2017 10:32:28 +0200
--Apple-Mail=_EDB23F79-44D8-42E6-A6FC-C7718857FE90
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
charset=utf-8
> Was the change from vp->v_mount to vq->v_mount
> intentional? I=E2=80=99m asking just in case.
Yes -- it was a typo fixed in -current.
> Fstrans locks by lwp:
> 772.1 (/) shared 2 cow 0
> 748.1 (/) shared 2 cow 0
> 352.1 (/) shared 2 cow 0
...
Ok, vfs_suspend() fails to suspend activity on "/".
Please apply the attached diff too. You will get another dump of the
fstrans state, then a panic and a backtrace from ddb if vfs_suspend()
takes more than five seconds.
--
J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)
--Apple-Mail=_EDB23F79-44D8-42E6-A6FC-C7718857FE90
Content-Disposition: attachment;
filename=vfs_trans.c.diff
Content-Type: application/octet-stream;
x-unix-mode=0644;
name="vfs_trans.c.diff"
Content-Transfer-Encoding: 7bit
Index: vfs_trans.c
===================================================================
RCS file: /cvsroot/src/sys/kern/vfs_trans.c,v
retrieving revision 1.45
diff -p -u -2 -r1.45 vfs_trans.c
--- vfs_trans.c 7 May 2017 08:24:20 -0000 1.45
+++ vfs_trans.c 5 Jul 2017 08:27:03 -0000
@@ -45,4 +45,5 @@ __KERNEL_RCSID(0, "$NetBSD: vfs_trans.c,
#include <sys/atomic.h>
#include <sys/buf.h>
+#include <sys/kernel.h>
#include <sys/kmem.h>
#include <sys/mount.h>
@@ -492,5 +493,14 @@ fstrans_setstate(struct mount *mp, enum
error = 0;
while (! state_change_done(mp)) {
- error = cv_wait_sig(&fstrans_count_cv, &fstrans_lock);
+ error = cv_timedwait_sig(&fstrans_count_cv, &fstrans_lock,
+ mstohz(5000));
+ if (error == EWOULDBLOCK) {
+ extern void fstrans_dump(int);
+ extern char db_cmd_on_enter[];
+
+ fstrans_dump(0);
+ strcpy(db_cmd_on_enter, "trace");
+ panic("%s timed out after 5 seconds", __func__);
+ }
if (error) {
new_state = fmi->fmi_state = FSTRANS_NORMAL;
--Apple-Mail=_EDB23F79-44D8-42E6-A6FC-C7718857FE90--
From: Mateusz Poszwa <old4@o2.pl>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/52364: System hangs not much before showing the login
prompt; changes in VFS suspected
Date: Wed, 5 Jul 2017 20:50:46 +0200
--0OAP2g/MAC+5xKAE
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
At first I collected two core dumps, but I figured out
that I had to set ddb.onpanic=1 in /ets/sysctl.conf
Some of the output went off the screen.
I attach what I managed to capture.
I can also host the mentioned core dumps if that helps.
Thank you.
--0OAP2g/MAC+5xKAE
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=dump
usecount 3 writecount 2 holdcount 0
size 0 writesize 0 numoutput 0
data 0xfffffe815edbbd00 lock 0xfffffe8107f0d910
state ACTIVE key(0xfffffe815edc0008 8) 07 4f 1d 00 00 00 00 00
lrulisthd 0xffffffff815ad400
tag VT_UFS, ino 1920775 on dev 0, 0 flags 0x2010, nlink 1
mode 020622, owner 0, group 4, size 0
Fstrans locks by lwp:
481.1 (/) shared 1 cow 0
959.1 (/) shared 1 cow 0
893.1 (/) shared 1 cow 0
498.1 (/) shared 1 cow 0
920.1 (/) shared 1 cow 0
813.1 (/) shared 1 cow 0
848.1 (/) shared 1 cow 0
846.1 (/) shared 1 cow 0
766.1 (/) shared 2 cow 0
380.1 (/) shared 1 cow 0
436.1 (/) shared 1 cow 0
809.1 (/) shared 1 cow 0
662.1 (/) shared 1 cow 0
900.1 (/) shared 2 cow 0
835.1 (/) shared 1 cow 0
437.1 (/) shared 2 cow 0
743.1 (/) shared 2 cow 0
906.1 (/) shared 2 cow 0
1007.1 (/) shared 2 cow 0
881.1 (/) shared 1 cow 0
923.1 (/) shared 1 cow 0
435.1 (/) shared 1 cow 0
442.1 (/) shared 1 cow 0
921.1 (/) shared 1 cow 0
887.1 (/) shared 2 cow 0
891.1 (/) shared 2 cow 0
108.1 (/) shared 1 cow 0
587.1 (/) shared 1 cow 0
682.1 (/) shared 1 cow 0
552.1 (/) shared 1 cow 0
Fstrans state by mount:
Fstrans locks by lwp:
108.1 (/) shared 1 cow 0
587.1 (/) shared 1 cow 0
682.1 (/) shared 1 cow 0
552.1 (/) shared 1 cow 0
Fstrans state by mount:
/ state suspended
panic: fstrans_setstate timed out after 5 seconds
fatal breakpoint trap in supervisor mode
trap type 1 code 0 rip 0xffffffff802157d5 cs 0x8 rflags 0x246 cr2 0x7d88eb568270 ilevel 0 rsp 0xfffffe8040017bc0
curlwp 0xfffffe815edcbd20 pid 12.1 lowest ksstack 0xfffffe80400142c0
Stopped in pid 12.1 (sh) at netbsd:breakpoint+0x5: leave
breakpiont() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x1cd
panic() at netbsd:panic+0x43
fstrans_setstate() at netbsd:fstrans_setstate+0x1c9
genfs_suspendctl() at netbsd:genfs_suspendctl+0x37
VFS_SUSPENDCTL() at netbsd:VFS_SUSPENDCTL+0x2d
vfs_suspend() at netbsd:vfs_suspend+0x55
vrevoke() at netbsd:vrevoke0x46
genfs_revoke() at netbsd:genfs_revoke+0x13
VOP_REVOKE() at netbsd:VOP_REVOKE+0x42
exit1() at netbsd:exit1+0x2d0
sys_exit() at netbsd:sys_exit+0x47
syscall() at netbsd:syscall+0xd8
--- syscall (number 1) ---
7d6381b1bc1a:
db{2}>
--0OAP2g/MAC+5xKAE--
From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/52364: System hangs not much before showing the login
prompt; changes in VFS suspected
Date: Tue, 11 Jul 2017 10:41:44 +0200
The suspension gets blocked for a long time or forever by saslauthd
setting advisory locks (VOP_ADVLOCK -> lf_advlock -> lf_setlock -> BLOCK).
As we hold fstrans for the VOP there is no chance to suspend the fs.
This diff (against May, 17. sources) should help, please try it out:
Index: vnode_if.c
===================================================================
RCS file: /cvsroot/src/sys/kern/vnode_if.c,v
retrieving revision 1.103
diff -p -u -2 -r1.103 vnode_if.c
--- vnode_if.c 26 Apr 2017 03:04:24 -0000 1.103
+++ vnode_if.c 11 Jul 2017 08:35:59 -0000
@@ -1405,5 +1405,4 @@ VOP_ADVLOCK(struct vnode *vp,
bool mpsafe;
struct vop_advlock_args a;
- struct mount *mp = vp->v_mount;
a.a_desc = VDESC(vop_advlock);
a.a_vp = vp;
@@ -1414,7 +1413,5 @@ VOP_ADVLOCK(struct vnode *vp,
mpsafe = (vp->v_vflag & VV_MPSAFE);
if (!mpsafe) { KERNEL_LOCK(1, curlwp); }
- fstrans_start(mp, FSTRANS_SHARED);
error = (VCALL(vp, VOFFSET(vop_advlock), &a));
- fstrans_done(mp);
if (!mpsafe) { KERNEL_UNLOCK_ONE(curlwp); }
return error;
From: Mateusz Poszwa <old4@o2.pl>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/52364: System hangs not much before showing the login
prompt; changes in VFS suspected
Date: Tue, 11 Jul 2017 20:16:30 +0200
The patch helped indeed. I haven’t put much load on the
system yet, but so far I haven’t noticed anything worrying.
Thank you for dealing with my problem so quickly.
From: "Juergen Hannken-Illjes" <hannken@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/52364 CVS commit: src/sys/kern
Date: Wed, 12 Jul 2017 09:31:07 +0000
Module Name: src
Committed By: hannken
Date: Wed Jul 12 09:31:07 UTC 2017
Modified Files:
src/sys/kern: vnode_if.src
Log Message:
As VOP_ADVLOCK() may block indefinitely we cannot take fstrans here.
Fixes PR kern/52364: System hangs not much before showing the login prompt
To generate a diff of this commit:
cvs rdiff -u -r1.76 -r1.77 src/sys/kern/vnode_if.src
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Soren Jacobsen" <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/52364 CVS commit: [netbsd-8] src/sys/kern
Date: Tue, 25 Jul 2017 01:29:23 +0000
Module Name: src
Committed By: snj
Date: Tue Jul 25 01:29:23 UTC 2017
Modified Files:
src/sys/kern [netbsd-8]: vnode_if.src
Log Message:
Pull up following revision(s) (requested by hannken in ticket #130):
sys/kern/vnode_if.src: revision 1.77
As VOP_ADVLOCK() may block indefinitely we cannot take fstrans here.
Fixes PR kern/52364: System hangs not much before showing the login prompt
To generate a diff of this commit:
cvs rdiff -u -r1.75.2.1 -r1.75.2.2 src/sys/kern/vnode_if.src
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->closed
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sun, 27 Jun 2021 18:30:41 +0000
State-Changed-Why:
patch was committed in 2017
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.