NetBSD Problem Report #48411

From htodd@mara.i8u.org  Wed Nov 27 06:05:17 2013
Return-Path: <htodd@mara.i8u.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id BCCFFA61B2
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 27 Nov 2013 06:05:17 +0000 (UTC)
Message-Id: <20131127060516.42B38C29E99@mara.i8u.org>
Date: Tue, 26 Nov 2013 22:05:16 -0800 (PST)
From: htodd@twofifty.com
Reply-To: htodd@twofifty.com
To: gnats-bugs@NetBSD.org
Subject: repeatable SMP crashes in amd64-current
X-Send-Pr-Version: 3.95

>Number:         48411
>Category:       kern
>Synopsis:       repeatable SMP crashes in amd64-current
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    hannken
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Nov 27 06:10:00 +0000 2013
>Closed-Date:    Sat Dec 21 18:38:18 +0000 2013
>Last-Modified:  Sat Dec 21 18:38:18 +0000 2013
>Originator:     H. Todd Fujinaka
>Release:        NetBSD 6.99.28
>Organization:
None
>Environment:


System: NetBSD mara.i8u.org 6.99.28 NetBSD 6.99.28 (MARA) #747: Tue Nov 26 16:06:33 PST 2013 htodd@mara.i8u.org:/usr/obj/amd64/sys/arch/amd64/compile/MARA amd64
Architecture: x86_64
Machine: amd64
>Description:
parallel builds on amd64-current lock up the system with lots of processes stuck in tstile. Backtraces at http://www.i8u.org/~Htodd/crash/

>How-To-Repeat:
install amd64-current, reboot, try to build amd64-current again

>Fix:
boot without SMP


>Release-Note:

>Audit-Trail:
From: Hisashi T Fujinaka <htodd@twofifty.com>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
        hannken@netbsd.org
Subject: Re: kern/48411: repeatable SMP crashes in amd64-current
Date: Wed, 27 Nov 2013 17:01:54 -0800 (PST)

 rmind suggested I check
 http://mail-index.netbsd.org/source-changes/2013/11/23/msg049417.html
 where hannken changed src/sys/kern/vfs_vnode.c src/sys/sys/vnode.h

  		 cvs rdiff -u -r1.25 -r1.26 src/sys/kern/vfs_vnode.c
  		 cvs rdiff -u -r1.240 -r1.241 src/sys/sys/vnode.h

 Reverting this change fixes my hangs so far. Before I couldn't complete
 a build of netbsd, now I can build netbsd and pkgsrc simultaneously.

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/48411: repeatable SMP crashes in amd64-current
Date: Thu, 28 Nov 2013 05:10:04 +0000

 On Thu, Nov 28, 2013 at 01:05:00AM +0000, Hisashi T Fujinaka wrote:
  >  rmind suggested I check
  >  http://mail-index.netbsd.org/source-changes/2013/11/23/msg049417.html
  >  where hannken changed src/sys/kern/vfs_vnode.c src/sys/sys/vnode.h
  >  
  >   		 cvs rdiff -u -r1.25 -r1.26 src/sys/kern/vfs_vnode.c
  >   		 cvs rdiff -u -r1.240 -r1.241 src/sys/sys/vnode.h
  >  
  >  Reverting this change fixes my hangs so far. Before I couldn't complete
  >  a build of netbsd, now I can build netbsd and pkgsrc simultaneously.

 alas... that patch really did look like a step forward.

 Did you ever get a clear trace of the deadlock?

 -- 
 David A. Holland
 dholland@netbsd.org

Responsible-Changed-From-To: kern-bug-people->hannken
Responsible-Changed-By: dholland@NetBSD.org
Responsible-Changed-When: Thu, 28 Nov 2013 05:36:06 +0000
Responsible-Changed-Why:
so you don't miss anything
(if you don't want to be "responsible", give it to me, but I watch all of
gnats anyway)


From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/48411: repeatable SMP crashes in amd64-current
Date: Thu, 28 Nov 2013 05:42:24 +0000

 On Thu, Nov 28, 2013 at 05:15:00AM +0000, David Holland wrote:
  >  On Thu, Nov 28, 2013 at 01:05:00AM +0000, Hisashi T Fujinaka wrote:
  >   >  rmind suggested I check
  >   >  http://mail-index.netbsd.org/source-changes/2013/11/23/msg049417.html
  >   >  where hannken changed src/sys/kern/vfs_vnode.c src/sys/sys/vnode.h
  >   >  
  >   >   		 cvs rdiff -u -r1.25 -r1.26 src/sys/kern/vfs_vnode.c
  >   >   		 cvs rdiff -u -r1.240 -r1.241 src/sys/sys/vnode.h
  >   >  
  >   >  Reverting this change fixes my hangs so far. Before I couldn't complete
  >   >  a build of netbsd, now I can build netbsd and pkgsrc simultaneously.
  >  
  >  alas... that patch really did look like a step forward.
  >  
  >  Did you ever get a clear trace of the deadlock?

 after some discussion in chat and looking at the url cited in the PR,
 it looks like (1) layerfs is definitely involved, and (2) deep
 suspicion attaches to layer_node_find().

 see http://www.i8u.org/~htodd/crash/6b00.jpg and note that in the
 initial ps output, the lwp that isn't tstiling is waiting on "layerfs".

 -- 
 David A. Holland
 dholland@netbsd.org

From: Hisashi T Fujinaka <htodd@twofifty.com>
To: gnats-bugs@NetBSD.org
Cc: hannken@NetBSD.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/48411: repeatable SMP crashes in amd64-current
Date: Wed, 27 Nov 2013 22:45:17 -0800 (PST)

 And there's a typo in the first link:

 http://www.i8u.org/~htodd/crash/

From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/48411: repeatable SMP crashes in amd64-current
Date: Thu, 28 Nov 2013 17:22:15 +0100

 --Apple-Mail=_05C7A77A-2DE2-44B8-A1D6-83D7F30CD1E5
 Content-Transfer-Encoding: 7bit
 Content-Type: text/plain;
 	charset=us-ascii

 This deadlock comes from one thread being in layer_node_find(), holding
 the vnode locked and trying to vget(..., LK_NOWAIT) it.  Another thread
 (usually the vrele thread) is in vrelel() trying to get the vnode lock.

 Please try the attached diff where vrelel() marks the vnode as changing
 after it has acquired the vnode lock but before it runs VOP_INACTIVE().

 Entering layer_node_find() with a locked vnode has to go -- unfortunately
 this means VOP_LOOKUP() has to return an unlocked vnode.

 --
 J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)


 --Apple-Mail=_05C7A77A-2DE2-44B8-A1D6-83D7F30CD1E5
 Content-Disposition: attachment;
 	filename=vfs_vnode.c.diff
 Content-Type: application/octet-stream;
 	name="vfs_vnode.c.diff"
 Content-Transfer-Encoding: 7bit

 Index: vfs_vnode.c
 ===================================================================
 RCS file: /cvsroot/src/sys/kern/vfs_vnode.c,v
 retrieving revision 1.26
 diff -p -u -2 -r1.26 vfs_vnode.c
 --- vfs_vnode.c	23 Nov 2013 13:46:22 -0000	1.26
 +++ vfs_vnode.c	28 Nov 2013 16:03:24 -0000
 @@ -607,8 +607,4 @@ vrelel(vnode_t *vp, int flags)

  	KASSERT((vp->v_iflag & VI_XLOCK) == 0);
 -	if ((flags & VRELEL_CHANGING_SET) == 0) {
 -		KASSERT((vp->v_iflag & VI_CHANGING) == 0);
 -		vp->v_iflag |= VI_CHANGING;
 -	}

  #ifdef DIAGNOSTIC
 @@ -655,11 +651,12 @@ vrelel(vnode_t *vp, int flags)
  			if (__predict_false(vtryrele(vp))) {
  				VOP_UNLOCK(vp);
 -				KASSERT((vp->v_iflag & VI_CHANGING) != 0);
 -				vp->v_iflag &= ~VI_CHANGING;
 -				cv_broadcast(&vp->v_cv);
 +				if ((flags & VRELEL_CHANGING_SET) != 0) {
 +					KASSERT((vp->v_iflag & VI_CHANGING) != 0);
 +					vp->v_iflag &= ~VI_CHANGING;
 +					cv_broadcast(&vp->v_cv);
 +				}
  				mutex_exit(vp->v_interlock);
  				return;
  			}
 -			mutex_exit(vp->v_interlock);
  			defer = false;
  		} else if ((vp->v_iflag & VI_LAYER) != 0) {
 @@ -676,8 +673,8 @@ vrelel(vnode_t *vp, int flags)
  			if (error != 0) {
  				defer = true;
 -				mutex_enter(vp->v_interlock);
  			} else {
  				defer = false;
  			}
 +			mutex_enter(vp->v_interlock);
  		}

 @@ -688,6 +685,9 @@ vrelel(vnode_t *vp, int flags)
  			 */
  			KASSERT(mutex_owned(vp->v_interlock));
 -			KASSERT((vp->v_iflag & VI_CHANGING) != 0);
 -			vp->v_iflag &= ~VI_CHANGING;
 +			if ((flags & VRELEL_CHANGING_SET) != 0) {
 +				KASSERT((vp->v_iflag & VI_CHANGING) != 0);
 +				vp->v_iflag &= ~VI_CHANGING;
 +				cv_broadcast(&vp->v_cv);
 +			}
  			mutex_enter(&vrele_lock);
  			TAILQ_INSERT_TAIL(&vrele_list, vp, v_freelist);
 @@ -695,9 +695,14 @@ vrelel(vnode_t *vp, int flags)
  				cv_signal(&vrele_cv); 
  			mutex_exit(&vrele_lock);
 -			cv_broadcast(&vp->v_cv);
  			mutex_exit(vp->v_interlock);
  			return;
  		}

 +		if ((flags & VRELEL_CHANGING_SET) == 0) {
 +			KASSERT((vp->v_iflag & VI_CHANGING) == 0);
 +			vp->v_iflag |= VI_CHANGING;
 +		}
 +		mutex_exit(vp->v_interlock);
 +
  		/*
  		 * The vnode can gain another reference while being
 @@ -740,4 +745,9 @@ vrelel(vnode_t *vp, int flags)
  		}
  		KASSERT(vp->v_usecount > 0);
 +	} else { /* vnode was already clean */
 +		if ((flags & VRELEL_CHANGING_SET) == 0) {
 +			KASSERT((vp->v_iflag & VI_CHANGING) == 0);
 +			vp->v_iflag |= VI_CHANGING;
 +		}
  	}


 --Apple-Mail=_05C7A77A-2DE2-44B8-A1D6-83D7F30CD1E5--
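
The deadlock described in the message above can be pictured as a two-party wait-for cycle. The following is a minimal sketch in plain C, not NetBSD kernel code: the names `actor`, `blocks_each_other` and `would_deadlock` are hypothetical, and the model assumes (from the description above) that vget() cannot succeed while the vnode is marked as changing. It only shows why moving the "changing" mark to after the lock acquisition removes the cycle.

```c
/*
 * Toy wait-for model of the two threads in this PR.
 * Hypothetical names throughout; this is an illustration, not kernel code.
 */
#include <assert.h>
#include <stdbool.h>

enum resource { RES_NONE, RES_VNODE_LOCK, RES_CHANGING_MARK };

struct actor {
	enum resource holds;     /* what this thread currently owns */
	enum resource waits_for; /* what it needs before it can go on */
};

/* Two actors deadlock when each waits for what the other one holds. */
static bool
blocks_each_other(struct actor a, struct actor b)
{
	return a.waits_for != RES_NONE && a.waits_for == b.holds &&
	    b.waits_for != RES_NONE && b.waits_for == a.holds;
}

/*
 * mark_before_lock == true:  the releasing thread sets the "changing"
 *                            mark first, then asks for the vnode lock.
 * mark_before_lock == false: it asks for the vnode lock first and only
 *                            marks the vnode once the lock is held.
 */
static bool
would_deadlock(bool mark_before_lock)
{
	/* Models layer_node_find(): owns the vnode lock and needs the
	 * vnode to be free of the "changing" mark (assumption). */
	struct actor finder = { RES_VNODE_LOCK, RES_CHANGING_MARK };

	/* Models vrelel(): always needs the vnode lock; whether it already
	 * owns the mark depends on the ordering being modelled. */
	struct actor releaser = {
		mark_before_lock ? RES_CHANGING_MARK : RES_NONE,
		RES_VNODE_LOCK
	};

	/* Sanity check: the two threads never own the same resource. */
	assert(finder.holds != releaser.holds);

	return blocks_each_other(finder, releaser);
}
```

In this model `would_deadlock(true)` reports a cycle (each side holds what the other needs), while `would_deadlock(false)` does not: the releasing thread simply waits for the lock without holding anything the other thread depends on.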

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/48411: repeatable SMP crashes in amd64-current
Date: Thu, 28 Nov 2013 16:42:58 +0000

 On Thu, Nov 28, 2013 at 04:25:01PM +0000, J. Hannken-Illjes wrote:
  >  Entering layer_node_find() with a locked vnode has to go -- unfortunately
  >  this means VOP_LOOKUP() has to return an unlocked vnode.

 That isn't unfortunate; it just takes some work to get organized.

 -- 
 David A. Holland
 dholland@netbsd.org

From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/48411: repeatable SMP crashes in amd64-current
Date: Thu, 28 Nov 2013 18:02:05 +0100

 > On Thu, Nov 28, 2013 at 04:25:01PM +0000, J. Hannken-Illjes wrote:
 >> Entering layer_node_find() with a locked vnode has to go -- unfortunately
 >> this means VOP_LOOKUP() has to return an unlocked vnode.
 > 
 > That isn't unfortunate; it just takes some work to get organized.

 So to make the list complete: lookup, create, mknod, mkdir, symlink and
 bmap return locked vnodes and would need changing.

 --
 J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)

From: Hisashi T Fujinaka <htodd@twofifty.com>
To: gnats-bugs@NetBSD.org
Cc: hannken@NetBSD.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/48411: repeatable SMP crashes in amd64-current
Date: Thu, 28 Nov 2013 09:07:45 -0800 (PST)

 Testing now, so far, so good.

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/48411: repeatable SMP crashes in amd64-current
Date: Thu, 28 Nov 2013 17:51:03 +0000

 On Thu, Nov 28, 2013 at 05:05:01PM +0000, J. Hannken-Illjes wrote:
  >>> Entering layer_node_find() with a locked vnode has to go -- unfortunately
  >>> this means VOP_LOOKUP() has to return an unlocked vnode.
  >> 
  >> That isn't unfortunate; it just takes some work to get organized.
  >  
  >  So to make the list complete: lookup, create, mknod, mkdir, symlink and
  >  bmap return locked vnodes and would need changing.

 Except for bmap the rest of those should not cause any (extra)
 problems. bmap I'm not sure about; it's a mess pertaining to device
 vnodes.

 -- 
 David A. Holland
 dholland@netbsd.org

From: Hisashi T Fujinaka <htodd@twofifty.com>
To: gnats-bugs@NetBSD.org
Cc: hannken@NetBSD.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/48411: repeatable SMP crashes in amd64-current
Date: Thu, 28 Nov 2013 11:34:11 -0800 (PST)

 Things have been building for several hours now, so I think that patch
 would fix my hang.

From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/48411: repeatable SMP crashes in amd64-current
Date: Thu, 28 Nov 2013 21:40:12 +0100

 On Nov 28, 2013, at 6:55 PM, David Holland <dholland-bugs@netbsd.org> wrote:
 >
 > On Thu, Nov 28, 2013 at 05:05:01PM +0000, J. Hannken-Illjes wrote:
 >>>> Entering layer_node_find() with a locked vnode has to go -- unfortunately
 >>>> this means VOP_LOOKUP() has to return an unlocked vnode.
 >>>
 >>> That isn't unfortunate; it just takes some work to get organized.
 >>
 >> So to make the list complete: lookup, create, mknod, mkdir, symlink and
 >> bmap return locked vnodes and would need changing.
 >
 > Except for bmap the rest of those should not cause any (extra)
 > problems. bmap I'm not sure about; it's a mess pertaining to device
 > vnodes.

 As bmap returns the device we are mounted on and this device usually
 gets used only by strategy I would not expect big problems here.

 --
 J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)

From: "Juergen Hannken-Illjes" <hannken@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/48411 CVS commit: src/sys/kern
Date: Fri, 29 Nov 2013 14:58:55 +0000

 Module Name:	src
 Committed By:	hannken
 Date:		Fri Nov 29 14:58:55 UTC 2013

 Modified Files:
 	src/sys/kern: vfs_vnode.c

 Log Message:
 Change vrelel() to mark the vnode as changing after it has acquired
 the vnode lock but before it calls VOP_INACTIVE().

 Should fix the race between layer_node_find() trying to vget(, LK_NOWAIT)
 a locked vnode when vrelel() marked it as changing and wants its lock.

 PR kern/48411 (repeatable SMP crashes in amd64-current)


 To generate a diff of this commit:
 cvs rdiff -u -r1.26 -r1.27 src/sys/kern/vfs_vnode.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->feedback
State-Changed-By: hannken@NetBSD.org
State-Changed-When: Fri, 29 Nov 2013 15:02:00 +0000
State-Changed-Why:
Committed a fix -- please confirm.


From: "Christos Zoulas" <christos@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/48411 CVS commit: src/sys
Date: Sat, 30 Nov 2013 19:59:34 -0500

 Module Name:	src
 Committed By:	christos
 Date:		Sun Dec  1 00:59:34 UTC 2013

 Modified Files:
 	src/sys/kern: vfs_vnode.c
 	src/sys/sys: vnode.h

 Log Message:
 Revert recent vnode changes per PR/48411, I still have deadlocks with
 build -j 20 on an 8 cpu machine.


 To generate a diff of this commit:
 cvs rdiff -u -r1.27 -r1.28 src/sys/kern/vfs_vnode.c
 cvs rdiff -u -r1.241 -r1.242 src/sys/sys/vnode.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc: Christos Zoulas <christos@netbsd.org>
Subject: Re: PR/48411 CVS commit: src/sys
Date: Sun, 1 Dec 2013 10:05:14 +0100

 On Dec 1, 2013, at 2:00 AM, "Christos Zoulas" <christos@netbsd.org> wrote:

 > Revert recent vnode changes per PR/48411, I still have deadlocks with
 > build -j 20 on an 8 cpu machine.

 Christos,

 do you have a core dump or back traces?

 --
 J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)

From: christos@zoulas.com (Christos Zoulas)
To: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>, gnats-bugs@NetBSD.org
Cc: 
Subject: Re: PR/48411 CVS commit: src/sys
Date: Sun, 1 Dec 2013 08:45:55 -0500

 On Dec 1, 10:05am, hannken@eis.cs.tu-bs.de ("J. Hannken-Illjes") wrote:
 -- Subject: Re: PR/48411 CVS commit: src/sys

 | On Dec 1, 2013, at 2:00 AM, "Christos Zoulas" <christos@netbsd.org> wrote:
 | 
 | > Revert recent vnode changes per PR/48411, I still have deadlocks with
 | > build -j 20 on an 8 cpu machine.
 | 
 | Christos,
 | 
 | do you have a core dump or back traces?

 No, but I will put them back and try a couple of full builds...
 It could be that they were affected by the pcu lossage.

 christos

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: PR/48411 CVS commit: src/sys
Date: Sat, 21 Dec 2013 18:28:38 +0000

 On Sun, Dec 01, 2013 at 01:50:01PM +0000, Christos Zoulas wrote:
  >  | > Revert recent vnode changes per PR/48411, I still have deadlocks with
  >  | > build -j 20 on an 8 cpu machine.
  >  | 
  >  | Christos,
  >  | 
  >  | do you have a core dump or back traces?
  >  
  >  No, but I will put them back and try a couple of full builds...
  >  It could be that they were affected by the pcu lossage.

 For the record, these changes were put back as all the trouble was
 caused by the pcu stuff.

 I think this PR can be closed at this point -- htodd?

 -- 
 David A. Holland
 dholland@netbsd.org

From: Hisashi T Fujinaka <htodd@twofifty.com>
To: gnats-bugs@NetBSD.org
Cc: hannken@NetBSD.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: PR/48411 CVS commit: src/sys
Date: Sat, 21 Dec 2013 10:32:09 -0800 (PST)

 On Sat, 21 Dec 2013, David Holland wrote:

 > I think this PR can be closed at this point -- htodd?

 Yes, please. I was leaving it open in case someone else was using it for
 tracking. I think that's not the case.

State-Changed-From-To: feedback->closed
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sat, 21 Dec 2013 18:38:18 +0000
State-Changed-Why:
Problem is fixed.
(I had been using it to remind me to write the previous email, but that was
all.)


>Unformatted:
