NetBSD Problem Report #52783
From o.vd.linden@quicknet.nl Sun Dec 3 20:27:45 2017
Return-Path: <o.vd.linden@quicknet.nl>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id CF7B37A1CB
for <gnats-bugs@gnats.NetBSD.org>; Sun, 3 Dec 2017 20:27:45 +0000 (UTC)
Message-Id: <20171203190748.GA556@sheep>
Date: Sun, 3 Dec 2017 20:07:48 +0100
From: Onno van der Linden <o.vd.linden@quicknet.nl>
To: gnats-bugs@netbsd.org
Subject: parallel fsck hangs during boot of 8.99.[5678]
>Number: 52783
>Category: kern
>Synopsis: parallel fsck hangs during boot of 8.99.[5678]
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: jdolecek
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sun Dec 03 20:30:00 +0000 2017
>Closed-Date: Sat Aug 11 11:19:10 +0000 2018
>Last-Modified: Sat Aug 11 11:19:10 +0000 2018
>Originator: Onno van der Linden
>Release: NetBSD 8.99.8
>Organization:
>Environment:
System: NetBSD sheep 8.99.8 NetBSD 8.99.8 (SHEEPKMS) #0: Sun Dec 3 15:49:57 CET 2017 onno@sheep:/usr/src/sys/arch/i386/compile/SHEEPKMS i386
Architecture: i386
Machine: i386
>Description:
fsck called from /etc/rc.d/fsck with default (parallel) options frequently hangs
My /etc/fstab:
/dev/wd0a / ffs rw 1 1
/dev/wd0b none swap sw 0 0
/dev/wd0e /home ffs rw,log 1 2
/dev/wd0f /usr ffs rw,log 1 2
/dev/wd0g /usr/pkg ffs rw,log 1 2
/dev/wd0h /usr/src ffs rw,log 1 2
/dev/wd0i /usr/xsrc ffs rw,log 1 2
/dev/wd0j /cd ffs rw,log 1 2
/dev/wd0k /cd3 ffs rw,log 1 2
/dev/wd1a /var ffs rw,log 1 2
/dev/wd1b /tmp ffs rw,async 1 2
/dev/wd1e /usr/obj ffs rw,async 1 2
/dev/wd1f /usr/pkgsrc ffs rw,log 1 2
/dev/wd1g /cd2 ffs rw,log 1 2
/dev/wd1h /cd4 ffs rw,log 1 2
NAME=dk0 /cd6 ffs rw,log 1 2
NAME=dk1 /cd7 ffs rw,log 1 2
NAME=dk2 /cd8 ffs rw,log 1 2
NAME=dk3 /cd9 ffs rw,log 1 2
/dev/cd0a /cdrom cd9660 ro,noauto,nodev,nosuid 0 0
kernfs /kern kernfs rw 0 0
procfs /proc procfs rw,linux
tmpfs /usr/pkg/emul/linux/dev/shm tmpfs rw,-m1777
tmpfs /var/shm tmpfs rw,-m1777,-sram%25
On a 8.99.8 kernel in DDB I get (copying from notes):
58 fsck biolock getblk->bbusy->cv_timed_wait->sleepq_block
57 fsck_ffs tstile do_sys_sync->mutex_vector_enter->turnstile_block->sleepq_block
54 fsck_ffs biolock same stack as 58
50 fsck wait
46 sh wait
17 sh tstile genfs_lock->rw_vector_enter->turnstile_block->sleepq_block
0.60 ioflush biolock
On a 8.99.5 kernel (sources up to Nov 2 2017 0:00) compiled with lock_debug
1 fsck_ffs process hangs in biowait with stack (copying from notes again)
wdopen->dk_open->dk_getdisklabel->readdisklabel->scan_mbr->biowait->cv_wait->sleepq_block
Note that this stack trace is somewhat identical to that of thread 59 in PR 46462.
And when doing a call cpu_reboot in DDB with this kernel I get
panic LOCKDEBUG: mutex_vector_enter,510 acquiring sleep lock from interrupt context
I can easily reproduce it with slightly older kernels from November. It took
a couple of attempts with 8.99.8 and with different stack traces compared to
earlier kernels.
I'm also wondering if any of this is related to PR's 46462 and 52769.
>How-To-Repeat:
(Re)boot my system
>Fix:
Workaround is to specify fsck_flags="-l 0" in /etc/rc.conf
>Release-Note:
>Audit-Trail:
From: Onno van der Linden <o.vd.linden@quicknet.nl>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/52783: parallel fsck hangs during boot of 8.99.[5678]
Date: Mon, 4 Dec 2017 17:15:42 +0100
jdolecek@ asked me to recompile with
dev/ata/wd.c rev. 1.433 and dev/ata/wdvar.h rev. 1.44
(versions from before the dksubrification of wd(4))
and boot from that kernel.
Tried 5 cold boots and 5 warm boots. No fsck hangs whatsoever.
Maybe I got lucky but I doubt it.
Maybe kern/52720 and kern/52769 are related to this.
kern/46462 is related wrt dksubr and "same" stacktrace
that I've seen
Onno
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/52783: parallel fsck hangs during boot of 8.99.[5678]
Date: Wed, 6 Dec 2017 07:41:27 +0000
On Mon, Dec 04, 2017 at 05:30:01PM +0000, Onno van der Linden wrote:
> Tried 5 cold boots and 5 warm boots. No fsck hangs whatsoever.
> Maybe I got lucky but I doubt it.
> Maybe kern/52720 and kern/52769 are related to this.
> kern/46462 is related wrt dksubr and "same" stacktrace
> that I've seen
It looks a lot like 52769...
--
David A. Holland
dholland@netbsd.org
From: Onno van der Linden <o.vd.linden@quicknet.nl>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52783: parallel fsck hangs during boot of 8.99.[5678]
Date: Sat, 14 Jul 2018 20:24:36 +0200
Some additional debug info.
My system has 3 drives
wd0 (partitions wd0[abefghijk])
wd1 (partitions wd1[abefgh])
dk (partitions dk[0123])
In my fstab wd0a has passno 1, the rest has 2.
Recent kernel (8.99.21 sources from July 13th 2018)
Parallel fsck starts, does
rdk0
rwd1a
rwd0e
rdk1
and hangs.
Into ddb and ps says
PID FLAGS LWP NAME WAIT
58 0 c3213d20 fsck biolock
57 0 c31a9020 fsck_ffs biolock
54 0 c31a92c0 fsck_ffs tstile
50 0 c31a9560 fsck wait
46 80 c31a9aa0 sh wait
17 0 c31a9d40 sh tstile
16 0 sh wait
11 0 sh pipe_rd
10 0 sh wait
9 0 sh wait
8 0 sh wait
2 0 sh wait
1 init
tr /a c3213d20 tr /a c31a9020
pid 58 pid 57
sleepq_block sleepq_block
cv_timedwait cv_timedwait
bbusy bbusy
getblk getblk
bio_doread bio_doread
bread bread
ffs_initvnode ffs_update.part.3
ffs_loadvnode ffs_full_fsync
vcache_get VOP_FSYNC
ufs_lookup ffs_sync
VOP_LOOOKUP VFS_SYNC
lookup_once do_sys_sync
namei_tryemulroot sys_sync
namei syscall
namei_at_simple_user
do_sys_pstatvfs
sys_statvfs1
syscall
tr /a c31a92c0 tr /a c31a9560
pid 54 pid 50
sleepq_block sleepq_blcok
turnstile_block cv_wait
mutex_vector_enter fork1
do_sys_sync sys___vfork14
sys_sync syscall
syscall
8.99.21 kernel (same sources as earlier) compiled with
LOCKDEBUG. And after 5 reboot attempts .......
Parallel fsck starts, does
dk0
wd0e
wd1a
wd0f
dk1
wd1b
wd0g
dk2
wd1e
dk3
and hangs.
Into ddb and ps says
PID FLAGS LWP NAME WAIT
69 0 c31fa800 sh tstile
59 0 c31fa020 fsck_ffs biolock
61 4 c31fa2c0 fsck_ffs physio
50 0 c31aa560 fsck wait
46 80 sh wait
17 0 sh tstile
16 0 sh wait
11 0 sh pipe_rd
10 0 sh wait
9 0 sh wait
8 0 sh wait
2 0 sh wait
1 init
tr /a c31fa2c0 tr /a c31fa020
pid 61 pid 59
sleepq_block sleepq_block
cv_wait cv_timedwait
physio bbusy
wdread getblk
cdev_read bio_doread
spec_read bread
VOP_READ ffs_update.part.3
vn_read ffs_full_fsync
dofileread VOP_FSYNC
sys_pread ffs_sync
syscall VFS_SYNC
do_sys_sync
sys_sync
syscall
Onno
Responsible-Changed-From-To: kern-bug-people->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Sat, 14 Jul 2018 18:54:55 +0000
Responsible-Changed-Why:
I'm investigating this.
From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/52783 CVS commit: src/sys/dev/ata
Date: Mon, 6 Aug 2018 20:07:05 +0000
Module Name: src
Committed By: jdolecek
Date: Mon Aug 6 20:07:05 UTC 2018
Modified Files:
src/sys/dev/ata: ata_subr.c atavar.h wd.c
Log Message:
add wddebug() which dumps some status for attached disks; indended for
debugging of PR kern/52783
To generate a diff of this commit:
cvs rdiff -u -r1.4 -r1.5 src/sys/dev/ata/ata_subr.c
cvs rdiff -u -r1.97 -r1.98 src/sys/dev/ata/atavar.h
cvs rdiff -u -r1.439 -r1.440 src/sys/dev/ata/wd.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Fri, 10 Aug 2018 22:44:18 +0000
State-Changed-Why:
Fix was committed. Can you confirm it works for you?
From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/52783 CVS commit: src/sys/dev/ata
Date: Fri, 10 Aug 2018 22:43:22 +0000
Module Name: src
Committed By: jdolecek
Date: Fri Aug 10 22:43:22 UTC 2018
Modified Files:
src/sys/dev/ata: ata_subr.c atavar.h wd.c
Log Message:
fix race in wd_lastclose() on systems with two ide disks on same
channel, which happened when one disk had pending I/O while the other
disk executed the final disk flush - need to restart bufq processing
once xfer is freed in this case
it could happen e.g. on boot when system executes fsck on different
partitions on the two drives in parallell and hence open and closes
the disk devices repeatedly
add KASSERT() for empty bufq on wd_lastclose(), and fix similar issue
also on suspend/standby path
this was introduced by the NCQ merge and not dksubr - before the merge
each drive had their own xfer, so they could not block each other
fixes PR kern/52783 by Onno van der Linden; many thanks for extensive
help with tracking this down
To generate a diff of this commit:
cvs rdiff -u -r1.5 -r1.6 src/sys/dev/ata/ata_subr.c
cvs rdiff -u -r1.98 -r1.99 src/sys/dev/ata/atavar.h
cvs rdiff -u -r1.440 -r1.441 src/sys/dev/ata/wd.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Onno van der Linden <o.vd.linden@quicknet.nl>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/52783 (parallel fsck hangs during boot of 8.99.[5678])
Date: Sat, 11 Aug 2018 11:49:56 +0200
On Fri, Aug 10, 2018 at 10:44:18PM +0000, jdolecek@NetBSD.org wrote:
> Synopsis: parallel fsck hangs during boot of 8.99.[5678]
>
> State-Changed-From-To: open->feedback
> State-Changed-By: jdolecek@NetBSD.org
> State-Changed-When: Fri, 10 Aug 2018 22:44:18 +0000
> State-Changed-Why:
> Fix was committed. Can you confirm it works for you?
Rebooted a kernel containing this fix 10 times and works fine for me.
PR can be close as far as I'm concerned.
Thanks,
Onno
State-Changed-From-To: feedback->closed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Sat, 11 Aug 2018 11:19:10 +0000
State-Changed-Why:
Confirmed fixed. No pullup since this is post netbsd-8 change. Thanks again
for help tracking this down.
>Unformatted:
2017.12.03.00.00
(Contact us)
$NetBSD: query-full-pr,v 1.43 2018/01/16 07:36:43 maya Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2017
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.