NetBSD Problem Report #52783

From o.vd.linden@quicknet.nl  Sun Dec  3 20:27:45 2017
Return-Path: <o.vd.linden@quicknet.nl>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id CF7B37A1CB
	for <gnats-bugs@gnats.NetBSD.org>; Sun,  3 Dec 2017 20:27:45 +0000 (UTC)
Message-Id: <20171203190748.GA556@sheep>
Date: Sun, 3 Dec 2017 20:07:48 +0100
From: Onno van der Linden <o.vd.linden@quicknet.nl>
To: gnats-bugs@netbsd.org
Subject: parallel fsck hangs during boot of 8.99.[5678]

>Number:         52783
>Category:       kern
>Synopsis:       parallel fsck hangs during boot of 8.99.[5678]
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    jdolecek
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Dec 03 20:30:00 +0000 2017
>Closed-Date:    Sat Aug 11 11:19:10 +0000 2018
>Last-Modified:  Sat Aug 11 11:19:10 +0000 2018
>Originator:     Onno van der Linden
>Release:        NetBSD 8.99.8
>Organization:

>Environment:
System: NetBSD sheep 8.99.8 NetBSD 8.99.8 (SHEEPKMS) #0: Sun Dec 3 15:49:57 CET 2017 onno@sheep:/usr/src/sys/arch/i386/compile/SHEEPKMS i386
Architecture: i386
Machine: i386
>Description:
fsck called from /etc/rc.d/fsck with default (parallel) options frequently hangs
My /etc/fstab:
/dev/wd0a	/	ffs	rw	1 1
/dev/wd0b	none	swap	sw	0 0
/dev/wd0e	/home	ffs	rw,log	1 2
/dev/wd0f	/usr	ffs	rw,log	1 2
/dev/wd0g	/usr/pkg	ffs	rw,log	1 2
/dev/wd0h	/usr/src	ffs	rw,log	1 2
/dev/wd0i	/usr/xsrc	ffs	rw,log	1 2
/dev/wd0j	/cd	ffs	rw,log	1 2
/dev/wd0k	/cd3	ffs	rw,log	1 2
/dev/wd1a	/var	ffs	rw,log	1 2
/dev/wd1b	/tmp	ffs	rw,async	1 2
/dev/wd1e	/usr/obj	ffs	rw,async	1 2
/dev/wd1f	/usr/pkgsrc	ffs	rw,log	1 2
/dev/wd1g	/cd2	ffs	rw,log	1 2
/dev/wd1h	/cd4	ffs	rw,log	1 2
NAME=dk0	/cd6	ffs	rw,log	1 2
NAME=dk1	/cd7	ffs	rw,log	1 2
NAME=dk2	/cd8	ffs	rw,log	1 2
NAME=dk3	/cd9	ffs	rw,log	1 2
/dev/cd0a       /cdrom  cd9660  ro,noauto,nodev,nosuid  0 0
kernfs		/kern	kernfs	rw	0 0
procfs		/proc	procfs	rw,linux
tmpfs		/usr/pkg/emul/linux/dev/shm tmpfs rw,-m1777
tmpfs	/var/shm	tmpfs	rw,-m1777,-sram%25

On a 8.99.8 kernel in DDB I get (copying from notes):
58 fsck biolock getblk->bbusy->cv_timed_wait->sleepq_block
57 fsck_ffs tstile do_sys_sync->mutex_vector_enter->turnstile_block->sleepq_block
54 fsck_ffs biolock same stack as 58
50 fsck wait
46 sh wait
17 sh tstile genfs_lock->rw_vector_enter->turnstile_block->sleepq_block
0.60 ioflush biolock 

On a 8.99.5 kernel (sources up to Nov 2 2017 0:00) compiled with lock_debug
1 fsck_ffs process hangs in biowait with stack (copying from notes again)
wdopen->dk_open->dk_getdisklabel->readdisklabel->scan_mbr->biowait->cv_wait->sleepq_block

Note that this stack trace is somewhat identical to that of thread 59 in PR 46462.
And when doing a call cpu_reboot in DDB with this kernel I get
panic LOCKDEBUG: mutex_vector_enter,510 acquiring sleep lock from interrupt context

I can easily reproduce it with slightly older kernels from November. It took
a couple of attempts with 8.99.8 and with different stack traces compared to
earlier kernels.

I'm also wondering if any of this is related to PR's 46462 and 52769.


>How-To-Repeat:
(Re)boot my system
>Fix:
Workaround is to specify fsck_flags="-l 0" in /etc/rc.conf

>Release-Note:

>Audit-Trail:
From: Onno van der Linden <o.vd.linden@quicknet.nl>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/52783: parallel fsck hangs during boot of 8.99.[5678]
Date: Mon, 4 Dec 2017 17:15:42 +0100

 jdolecek@ asked me to recompile with
 dev/ata/wd.c rev. 1.433 and dev/ata/wdvar.h rev. 1.44
 (versions from before the dksubrification of wd(4))
 and boot from that kernel.

 Tried 5 cold boots and 5 warm boots. No fsck hangs whatsoever.
 Maybe I got lucky but I doubt it.
 Maybe kern/52720 and kern/52769 are related to this.
 kern/46462 is related wrt dksubr and "same" stacktrace
 that I've seen

 Onno

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/52783: parallel fsck hangs during boot of 8.99.[5678]
Date: Wed, 6 Dec 2017 07:41:27 +0000

 On Mon, Dec 04, 2017 at 05:30:01PM +0000, Onno van der Linden wrote:
  >  Tried 5 cold boots and 5 warm boots. No fsck hangs whatsoever.
  >  Maybe I got lucky but I doubt it.
  >  Maybe kern/52720 and kern/52769 are related to this.
  >  kern/46462 is related wrt dksubr and "same" stacktrace
  >  that I've seen

 It looks a lot like 52769...

 -- 
 David A. Holland
 dholland@netbsd.org

From: Onno van der Linden <o.vd.linden@quicknet.nl>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52783: parallel fsck hangs during boot of 8.99.[5678]
Date: Sat, 14 Jul 2018 20:24:36 +0200

 Some additional debug info.

 My system has 3 drives
 wd0 (partitions wd0[abefghijk])
 wd1 (partitions wd1[abefgh])
 dk (partitions dk[0123])

 In my fstab wd0a has passno 1, the rest has 2.

 Recent kernel (8.99.21 sources from July 13th 2018)

 Parallel fsck starts, does
 rdk0
 rwd1a
 rwd0e
 rdk1

 and hangs.
 Into ddb and ps says

 PID FLAGS LWP       NAME     WAIT
  58   0   c3213d20  fsck     biolock
  57   0   c31a9020  fsck_ffs biolock
  54   0   c31a92c0  fsck_ffs tstile
  50   0   c31a9560  fsck     wait
  46  80   c31a9aa0  sh       wait
  17   0   c31a9d40  sh       tstile
  16   0             sh       wait
  11   0             sh       pipe_rd
  10   0             sh       wait
   9   0             sh       wait
   8   0             sh       wait
   2   0             sh       wait
   1                 init

 tr /a c3213d20         tr /a c31a9020
 pid 58                 pid 57
 sleepq_block           sleepq_block
 cv_timedwait           cv_timedwait
 bbusy                  bbusy
 getblk                 getblk
 bio_doread             bio_doread
 bread                  bread
 ffs_initvnode          ffs_update.part.3
 ffs_loadvnode          ffs_full_fsync
 vcache_get             VOP_FSYNC
 ufs_lookup             ffs_sync
 VOP_LOOOKUP            VFS_SYNC
 lookup_once            do_sys_sync
 namei_tryemulroot      sys_sync
 namei                  syscall
 namei_at_simple_user
 do_sys_pstatvfs
 sys_statvfs1
 syscall

 tr /a c31a92c0         tr /a c31a9560
 pid 54                 pid 50
 sleepq_block           sleepq_blcok
 turnstile_block        cv_wait
 mutex_vector_enter     fork1
 do_sys_sync            sys___vfork14
 sys_sync               syscall
 syscall


 8.99.21 kernel (same sources as earlier) compiled with
 LOCKDEBUG. And after 5 reboot attempts .......

 Parallel fsck starts, does
 dk0
 wd0e
 wd1a
 wd0f
 dk1
 wd1b
 wd0g
 dk2
 wd1e
 dk3

 and hangs.
 Into ddb and ps says

 PID FLAGS LWP       NAME     WAIT
  69   0   c31fa800  sh       tstile
  59   0   c31fa020  fsck_ffs biolock
  61   4   c31fa2c0  fsck_ffs physio
  50   0   c31aa560  fsck     wait
  46  80             sh       wait
  17   0             sh       tstile
  16   0             sh       wait
  11   0             sh       pipe_rd
  10   0             sh       wait
   9   0             sh       wait
   8   0             sh       wait
   2   0             sh       wait
   1                 init

 tr /a c31fa2c0         tr /a c31fa020
 pid 61                 pid 59
 sleepq_block           sleepq_block
 cv_wait                cv_timedwait
 physio                 bbusy
 wdread                 getblk
 cdev_read              bio_doread
 spec_read              bread
 VOP_READ               ffs_update.part.3
 vn_read                ffs_full_fsync
 dofileread             VOP_FSYNC
 sys_pread              ffs_sync
 syscall                VFS_SYNC
                        do_sys_sync
                        sys_sync
                        syscall


 Onno

Responsible-Changed-From-To: kern-bug-people->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Sat, 14 Jul 2018 18:54:55 +0000
Responsible-Changed-Why:
I'm investigating this.


From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/52783 CVS commit: src/sys/dev/ata
Date: Mon, 6 Aug 2018 20:07:05 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Mon Aug  6 20:07:05 UTC 2018

 Modified Files:
 	src/sys/dev/ata: ata_subr.c atavar.h wd.c

 Log Message:
 add wddebug() which dumps some status for attached disks; indended for
 debugging of PR kern/52783


 To generate a diff of this commit:
 cvs rdiff -u -r1.4 -r1.5 src/sys/dev/ata/ata_subr.c
 cvs rdiff -u -r1.97 -r1.98 src/sys/dev/ata/atavar.h
 cvs rdiff -u -r1.439 -r1.440 src/sys/dev/ata/wd.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Fri, 10 Aug 2018 22:44:18 +0000
State-Changed-Why:
Fix was committed. Can you confirm it works for you?


From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/52783 CVS commit: src/sys/dev/ata
Date: Fri, 10 Aug 2018 22:43:22 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Fri Aug 10 22:43:22 UTC 2018

 Modified Files:
 	src/sys/dev/ata: ata_subr.c atavar.h wd.c

 Log Message:
 fix race in wd_lastclose() on systems with two ide disks on same
 channel, which happened when one disk had pending I/O while the other
 disk executed the final disk flush - need to restart bufq processing
 once xfer is freed in this case

 it could happen e.g. on boot when system executes fsck on different
 partitions on the two drives in parallell and hence open and closes
 the disk devices repeatedly

 add KASSERT() for empty bufq on wd_lastclose(), and fix similar issue
 also on suspend/standby path

 this was introduced by the NCQ merge and not dksubr - before the merge
 each drive had their own xfer, so they could not block each other

 fixes PR kern/52783 by Onno van der Linden; many thanks for extensive
 help with tracking this down


 To generate a diff of this commit:
 cvs rdiff -u -r1.5 -r1.6 src/sys/dev/ata/ata_subr.c
 cvs rdiff -u -r1.98 -r1.99 src/sys/dev/ata/atavar.h
 cvs rdiff -u -r1.440 -r1.441 src/sys/dev/ata/wd.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Onno van der Linden <o.vd.linden@quicknet.nl>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/52783 (parallel fsck hangs during boot of 8.99.[5678])
Date: Sat, 11 Aug 2018 11:49:56 +0200

 On Fri, Aug 10, 2018 at 10:44:18PM +0000, jdolecek@NetBSD.org wrote:
 > Synopsis: parallel fsck hangs during boot of 8.99.[5678]
 > 
 > State-Changed-From-To: open->feedback
 > State-Changed-By: jdolecek@NetBSD.org
 > State-Changed-When: Fri, 10 Aug 2018 22:44:18 +0000
 > State-Changed-Why:
 > Fix was committed. Can you confirm it works for you?

 Rebooted a kernel containing this fix 10 times and works fine for me.
 PR can be close as far as I'm concerned.

 Thanks,

 Onno

State-Changed-From-To: feedback->closed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Sat, 11 Aug 2018 11:19:10 +0000
State-Changed-Why:
Confirmed fixed. No pullup since this is post netbsd-8 change. Thanks again
for help tracking this down.


>Unformatted:
 2017.12.03.00.00
Home
PR Database Search
(Contact us) $NetBSD: query-full-pr,v 1.43 2018/01/16 07:36:43 maya Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2017 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.