NetBSD Problem Report #46879

From Wolfgang.Stukenbrock@nagler-company.com  Thu Aug 30 16:08:03 2012
Return-Path: <Wolfgang.Stukenbrock@nagler-company.com>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	by www.NetBSD.org (Postfix) with ESMTP id A104863B86D
	for <gnats-bugs@gnats.NetBSD.org>; Thu, 30 Aug 2012 16:08:03 +0000 (UTC)
Message-Id: <20120830160754.7B3A61E80A9@test-s0.nagler-company.com>
Date: Thu, 30 Aug 2012 18:07:54 +0200 (CEST)
From: Wolfgang.Stukenbrock@nagler-company.com
Reply-To: Wolfgang.Stukenbrock@nagler-company.com
To: gnats-bugs@gnats.NetBSD.org
Subject: panic in reboot 5_1_STABLE in dkwedge - dead-lock detected
X-Send-Pr-Version: 3.95

>Number:         46879
>Category:       kern
>Synopsis:       panic in reboot 5_1_STABLE in dkwedge - dead-lock detected
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Aug 30 16:10:00 +0000 2012
>Closed-Date:    Fri Nov 25 21:06:33 +0000 2016
>Last-Modified:  Fri Nov 25 21:06:33 +0000 2016
>Originator:     Wolfgang Stukenbrock
>Release:        NetBSD 5.1.STABLE
>Organization:
Dr. Nagler & Company GmbH
>Environment:


System: NetBSD test-s0 4.0 NetBSD 4.0 (NSW-WS) #0: Tue Aug 17 17:28:09 CEST 2010 wgstuken@test-s0:/usr/src/sys/arch/amd64/compile/NSW-WS amd64
Architecture: x86_64
Machine: amd64
>Description:
	affected system: amd64 (Xeon 5xxx CPU -> 12 Cpu's found by system)

	The problem does not occure, if the filesystem is unmounted by hand.
	It only happens (till now) if the filesystem is still mounted when the system reboots.
	The panic is reproducable.

	Filesystem setup - one partition with entire disk all the time:
	(remark: dk0, dk1, dk2 and dk3 are other wedges on other disks)
	sd0 - BSD-label - sd0e (type RAID)
	sd1 - BSD-label - sd1e (type RAID)
	sd2 - BSD-label - sd2e (type RAID)
	sd3 - BSD-label - sd3e (type RAID)
	raid4 stripe (0) of sd0e and sd1e - gpt-label - dk4 (type raidframe)
	raid5 stripe (0) of sd2e and sd3e - gpt-label - dk5 (type raidframe)
	raid3 stripe (0) of dk4 and dk5 - gpt-label - dk6 (type ffs)

	When /dev/dk6 is mounted when the system reboots the panic below happens.

	remark: the panic will not occure with layered raidframe devices build with BSD-labels only.
		This runs stable since 4.0 on our server - currently with 5.1-release (and the patch
		from PR39784 to support autoconfiguration).
	remark: This is a test-setup to perform some tests with gpt-labels, raidframe and large disks.
		So don't blame about the less usefull raid0 on raid0 setup here.
		The main goal is to setup a stripe of mirrors for data security with raidframe.
		I'm just working on the reintegration of my autoconfigure-Raidframe-Code
		(see PR39784 - still not integrated) (see PR45179 for this general topic too) in 5_1_STABLE.
		(It looks good now - setup seems to work and I will update PR39784 in the next days.)
		At this point I've stumbled over the problem reported here.


	The following output is found after executing the "reboot" command as root:

System shutdown time has arrived

About to run shutdown hooks...
Stopping cron.
Stopping inetd.
Waiting for PIDS: 451.
Removing block-type swap devices
swapctl: removing /dev/raid0b as swap device
Thu Aug 30 17:28:35 CEST 2012

Done running shutdown hooks.
Aug 30 17:28:40 wst-test syslogd: Exiting on signal 15
syncing disks... done
unmounting file systems...Mutex error: mutex_vector_enter: locking against myself

lock address : 0xffff80002fd8a570
current cpu  :                  6
current lwp  : 0xffff8000881c0000
owner field  : 0xffff8000881c0000 wait/spin:                0/0

panic: lock error
fatal breakpoint trap in supervisor mode
trap type 1 code 0 rip ffffffff804b1045 cs 8 rflags 246 cr2  7f7ffd985290 cpl 0 rsp ffff8000882ec280
Stopped in pid 563.1 (reboot) at        netbsd:breakpoint+0x5:  leave
db{6}>

	The following traceback is found:
db{6}> trace
breakpoint() at netbsd:breakpoint+0x5
panic() at netbsd:panic+0x24d
lockdebug_abort() at netbsd:lockdebug_abort+0x42
mutex_vector_enter() at netbsd:mutex_vector_enter+0x208
dkwedge_del() at netbsd:dkwedge_del+0x181
dkwedge_delall() at netbsd:dkwedge_delall+0x65
raidclose() at netbsd:raidclose+0x133
bdev_close() at netbsd:bdev_close+0x89
spec_close() at netbsd:spec_close+0x231
VOP_CLOSE() at netbsd:VOP_CLOSE+0x62
vn_close() at netbsd:vn_close+0x51
dkclose() at netbsd:dkclose+0xcb
bdev_close() at netbsd:bdev_close+0x89
spec_close() at netbsd:spec_close+0x231
VOP_CLOSE() at netbsd:VOP_CLOSE+0x62
rf_close_component() at netbsd:rf_close_component+0x59
rf_UnconfigureVnodes() at netbsd:rf_UnconfigureVnodes+0x42
rf_Shutdown() at netbsd:rf_Shutdown+0x174
raidclose() at netbsd:raidclose+0xf7
bdev_close() at netbsd:bdev_close+0x89
spec_close() at netbsd:spec_close+0x231
VOP_CLOSE() at netbsd:VOP_CLOSE+0x62
vn_close() at netbsd:vn_close+0x51
dkclose() at netbsd:dkclose+0xcb
bdev_close() at netbsd:bdev_close+0x89
spec_close() at netbsd:spec_close+0x231
VOP_CLOSE() at netbsd:VOP_CLOSE+0x62
ffs_unmount() at netbsd:ffs_unmount+0x11e
VFS_UNMOUNT() at netbsd:VFS_UNMOUNT+0x2e
dounmount() at netbsd:dounmount+0xd5
vfs_unmountall() at netbsd:vfs_unmountall+0x55
cpu_reboot() at netbsd:cpu_reboot+0x100
sys_reboot() at netbsd:sys_reboot+0x5f
syscall() at netbsd:syscall+0xa0
db{6}> 

>How-To-Repeat:
	Setup an layered raidframe device with wedges as described, mount it and reboot the system.
>Fix:
	Not know at the moment - sorry.
	It looks like a problem related to the recursive call for dkclose(), raidclose() ...
	If anybody can give me some assistence in DDB-debugging I can deliver more information.
	remark: the root is on a raid (1) device - and the system fails to dump to it ...

	I think it should be easy to reproduce this setup on any system with a free disk available
	for testing.  I beleave it will also happen if all "sd"-partitions will be on the same disk.

>Release-Note:

>Audit-Trail:
From: Wolfgang Stukenbrock <wolfgang.stukenbrock@nagler-company.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/46879: panic in reboot 5_1_STABLE in dkwedge - dead-lock detected
Date: Fri, 31 Aug 2012 12:36:21 +0200

 Hi again,


 I've had an additional look at it ...

 The problem seems to be in src/dev/dkwedges/dk.c

 I do not understand the whole semantic of the locks (dk_openlock and 
 kd_rawlock), but I've recognised the following:

 in dk_close() the dk_openlock is entered on the wedge
 and if a close of for the parent is needed the dk_rawlock is allocated 
 for the parent - no dk_openlock on parrent aquired.
 The same is done in dkwedge_del.
 In both functions vn_close() for the dk_rawvp is called with tzhe 
 dk_openlock held on the wedge.

 Here the dkclose() on the mounted wedge will call raidclose() of the 
 underlying raid-device.
 If doing_shutdown is set, the raid-device gets destroyed and 
 dkwedge_delall() is called for the raid-device.
 dkwedge_delall now copies the information of the first wedge - if any - 
 into a local buffer and calls dkwedge_del in order to destroy it.
 This will enter the the dk_openlock on that wedge again, but we still 
 hold it from dkclose() before.

 This would mean, that the panic is not related to a layerd raid-device, 
 as I expected before - it will happen for every raiddevice with a wedge 
 on it.

 I've tested this - BSD-label on sd0 and sd1,
 a raiddevice (stripe in this case) of sd0 and sd1,
 a GPT-label with one wedge on this raiddevice.
 If the filesystem on the wedge is mounted when a reboot occures I get 
 the following panic (and trace) in DDB:

 syncing disks... done
 unmounting file systems...Mutex error: mutex_vector_enter: locking 
 against myself

 lock address : 0xffff80002f9ffb70
 current cpu  :                 10
 current lwp  : 0xffff800087b7b000
 owner field  : 0xffff800087b7b000 wait/spin:                0/0

 panic: lock error
 fatal breakpoint trap in supervisor mode
 trap type 1 code 0 rip ffffffff804b1045 cs 8 rflags 246 cr2 
 7f7ffd620030 cpl 0 rsp ffff800088062530
 Stopped in pid 594.1 (reboot) at        netbsd:breakpoint+0x5:  leave
 db{10}> trace
 breakpoint() at netbsd:breakpoint+0x5
 panic() at netbsd:panic+0x24d
 lockdebug_abort() at netbsd:lockdebug_abort+0x42
 mutex_vector_enter() at netbsd:mutex_vector_enter+0x208
 dkwedge_del() at netbsd:dkwedge_del+0x181
 dkwedge_delall() at netbsd:dkwedge_delall+0x65
 raidclose() at netbsd:raidclose+0x133
 bdev_close() at netbsd:bdev_close+0x89
 spec_close() at netbsd:spec_close+0x231
 VOP_CLOSE() at netbsd:VOP_CLOSE+0x62
 vn_close() at netbsd:vn_close+0x51
 dkclose() at netbsd:dkclose+0xcb
 bdev_close() at netbsd:bdev_close+0x89
 spec_close() at netbsd:spec_close+0x231
 VOP_CLOSE() at netbsd:VOP_CLOSE+0x62
 ffs_unmount() at netbsd:ffs_unmount+0x11e
 VFS_UNMOUNT() at netbsd:VFS_UNMOUNT+0x2e
 dounmount() at netbsd:dounmount+0xd5
 vfs_unmountall() at netbsd:vfs_unmountall+0x55
 cpu_reboot() at netbsd:cpu_reboot+0x100
 sys_reboot() at netbsd:sys_reboot+0x5f
 syscall() at netbsd:syscall+0xa0
 db{10}>

 Yes - my analyses above seems to be correct.
 I would tend to increase the level from serious to critical, because a 
 wedge on a raidframe-device is not usable!

 I'm not sure if it would be the correct sollution to release the 
 dk_openlock mutex prior calling vn_close() in both functions mentioned 
 above. I'm not shure if it would be possible to set dk_rawvp to NULL 
 prior calling vpn_close() on it - may be stored in a temp-variable.

 If the answer to thees questions is yes, than that would be the sollution.
 Can someone with more knowledge on the mutex order and the semantic of 
 dk_rawvp have a look on this topic.
 Thanks in advance

 gnats-admin@NetBSD.org wrote:

 > Thank you very much for your problem report.
 > It has the internal identification `kern/46879'.
 > The individual assigned to look at your
 > report is: kern-bug-people. 
 > 
 > 
 >>Category:       kern
 >>Responsible:    kern-bug-people
 >>Synopsis:       panic in reboot 5_1_STABLE in dkwedge - dead-lock detected
 >>Arrival-Date:   Thu Aug 30 16:10:00 +0000 2012
 >>
 > 
 > 



From: Wolfgang Stukenbrock <wolfgang.stukenbrock@nagler-company.com>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/46879: panic in reboot 5_1_STABLE in dkwedge - dead-lock detected
Date: Fri, 31 Aug 2012 15:04:16 +0200

 This is a multi-part message in MIME format.
 --------------060607080304090807080703
 Content-Type: text/plain; charset=ISO-8859-1; format=flowed
 Content-Transfer-Encoding: 8bit

 Hi again,

 the attached patch for /usr/src/sys/dev/dkwedge/dk.c seems to fix the 
 problem.
 No panic occure anymore in reboot.

 But someone with detailed knowledge of the mutex order and the reference 
 management of the wedges-stuff should have a look at it prior 
 integration into the source tree.

 best regards

 W. Stukenbrock

 Wolfgang Stukenbrock wrote:

 > The following reply was made to PR kern/46879; it has been noted by GNATS.
 > 
 > From: Wolfgang Stukenbrock <wolfgang.stukenbrock@nagler-company.com>
 > To: gnats-bugs@NetBSD.org
 > Cc: 
 > Subject: Re: kern/46879: panic in reboot 5_1_STABLE in dkwedge - dead-lock detected
 > Date: Fri, 31 Aug 2012 12:36:21 +0200
 > 
 >  Hi again,
 >  
 >  
 >  I've had an additional look at it ...
 >  
 >  The problem seems to be in src/dev/dkwedges/dk.c
 >  
 >  I do not understand the whole semantic of the locks (dk_openlock and 
 >  kd_rawlock), but I've recognised the following:
 >  
 >  in dk_close() the dk_openlock is entered on the wedge
 >  and if a close of for the parent is needed the dk_rawlock is allocated 
 >  for the parent - no dk_openlock on parrent aquired.
 >  The same is done in dkwedge_del.
 >  In both functions vn_close() for the dk_rawvp is called with tzhe 
 >  dk_openlock held on the wedge.
 >  
 >  Here the dkclose() on the mounted wedge will call raidclose() of the 
 >  underlying raid-device.
 >  If doing_shutdown is set, the raid-device gets destroyed and 
 >  dkwedge_delall() is called for the raid-device.
 >  dkwedge_delall now copies the information of the first wedge - if any - 
 >  into a local buffer and calls dkwedge_del in order to destroy it.
 >  This will enter the the dk_openlock on that wedge again, but we still 
 >  hold it from dkclose() before.
 >  
 >  This would mean, that the panic is not related to a layerd raid-device, 
 >  as I expected before - it will happen for every raiddevice with a wedge 
 >  on it.
 >  
 >  I've tested this - BSD-label on sd0 and sd1,
 >  a raiddevice (stripe in this case) of sd0 and sd1,
 >  a GPT-label with one wedge on this raiddevice.
 >  If the filesystem on the wedge is mounted when a reboot occures I get 
 >  the following panic (and trace) in DDB:
 >  
 >  syncing disks... done
 >  unmounting file systems...Mutex error: mutex_vector_enter: locking 
 >  against myself
 >  
 >  lock address : 0xffff80002f9ffb70
 >  current cpu  :                 10
 >  current lwp  : 0xffff800087b7b000
 >  owner field  : 0xffff800087b7b000 wait/spin:                0/0
 >  
 >  panic: lock error
 >  fatal breakpoint trap in supervisor mode
 >  trap type 1 code 0 rip ffffffff804b1045 cs 8 rflags 246 cr2 
 >  7f7ffd620030 cpl 0 rsp ffff800088062530
 >  Stopped in pid 594.1 (reboot) at        netbsd:breakpoint+0x5:  leave
 >  db{10}> trace
 >  breakpoint() at netbsd:breakpoint+0x5
 >  panic() at netbsd:panic+0x24d
 >  lockdebug_abort() at netbsd:lockdebug_abort+0x42
 >  mutex_vector_enter() at netbsd:mutex_vector_enter+0x208
 >  dkwedge_del() at netbsd:dkwedge_del+0x181
 >  dkwedge_delall() at netbsd:dkwedge_delall+0x65
 >  raidclose() at netbsd:raidclose+0x133
 >  bdev_close() at netbsd:bdev_close+0x89
 >  spec_close() at netbsd:spec_close+0x231
 >  VOP_CLOSE() at netbsd:VOP_CLOSE+0x62
 >  vn_close() at netbsd:vn_close+0x51
 >  dkclose() at netbsd:dkclose+0xcb
 >  bdev_close() at netbsd:bdev_close+0x89
 >  spec_close() at netbsd:spec_close+0x231
 >  VOP_CLOSE() at netbsd:VOP_CLOSE+0x62
 >  ffs_unmount() at netbsd:ffs_unmount+0x11e
 >  VFS_UNMOUNT() at netbsd:VFS_UNMOUNT+0x2e
 >  dounmount() at netbsd:dounmount+0xd5
 >  vfs_unmountall() at netbsd:vfs_unmountall+0x55
 >  cpu_reboot() at netbsd:cpu_reboot+0x100
 >  sys_reboot() at netbsd:sys_reboot+0x5f
 >  syscall() at netbsd:syscall+0xa0
 >  db{10}>
 >  
 >  Yes - my analyses above seems to be correct.
 >  I would tend to increase the level from serious to critical, because a 
 >  wedge on a raidframe-device is not usable!
 >  
 >  I'm not sure if it would be the correct sollution to release the 
 >  dk_openlock mutex prior calling vn_close() in both functions mentioned 
 >  above. I'm not shure if it would be possible to set dk_rawvp to NULL 
 >  prior calling vpn_close() on it - may be stored in a temp-variable.
 >  
 >  If the answer to thees questions is yes, than that would be the sollution.
 >  Can someone with more knowledge on the mutex order and the semantic of 
 >  dk_rawvp have a look on this topic.
 >  Thanks in advance
 >  
 >  gnats-admin@NetBSD.org wrote:
 >  
 >  > Thank you very much for your problem report.
 >  > It has the internal identification `kern/46879'.
 >  > The individual assigned to look at your
 >  > report is: kern-bug-people. 
 >  > 
 >  > 
 >  >>Category:       kern
 >  >>Responsible:    kern-bug-people
 >  >>Synopsis:       panic in reboot 5_1_STABLE in dkwedge - dead-lock detected
 >  >>Arrival-Date:   Thu Aug 30 16:10:00 +0000 2012
 >  >>
 >  > 
 >  > 
 >  
 >  
 >  
 > 
 > 


 -- 


 Dr. Nagler & Company GmbH
 Hauptstraße 9
 92253 Schnaittenbach

 Tel. +49 9622/71 97-42
 Fax +49 9622/71 97-50

 Wolfgang.Stukenbrock@nagler-company.com
 http://www.nagler-company.com


 Hauptsitz: Schnaittenbach
 Handelregister: Amberg HRB
 Gerichtsstand: Amberg
 Steuernummer: 201/118/51825
 USt.-ID-Nummer: DE 273143997
 Geschäftsführer: Dr. Martin Nagler, Dr. Dr. Karl-Kuno Kunze


 --------------060607080304090807080703
 Content-Type: text/plain;
  name="dk.c-patch"
 Content-Transfer-Encoding: 7bit
 Content-Disposition: inline;
  filename="dk.c-patch"

 --- dk.c	2012/08/31 10:57:31	1.1
 +++ dk.c	2012/08/31 11:02:46
 @@ -432,6 +432,7 @@
  dkwedge_del(struct dkwedge_info *dkw)
  {
  	struct dkwedge_softc *sc = NULL;
 +	struct vnode *tmp_vp = NULL;
  	u_int unit;
  	int bmaj, cmaj, s;

 @@ -480,15 +481,15 @@
  		mutex_enter(&sc->sc_parent->dk_rawlock);
  		if (sc->sc_parent->dk_rawopens-- == 1) {
  			KASSERT(sc->sc_parent->dk_rawvp != NULL);
 -			mutex_exit(&sc->sc_parent->dk_rawlock);
 -			(void) vn_close(sc->sc_parent->dk_rawvp, FREAD | FWRITE,
 -			    NOCRED);
 +			tmp_vp = sc->sc_parent->dk_rawvp;
  			sc->sc_parent->dk_rawvp = NULL;
 -		} else
 -			mutex_exit(&sc->sc_parent->dk_rawlock);
 +		}
 +		mutex_exit(&sc->sc_parent->dk_rawlock);
  		sc->sc_dk.dk_openmask = 0;
  	}
  	mutex_exit(&sc->sc_dk.dk_openlock);
 +	if (tmp_vp != NULL)
 +		(void) vn_close(tmp_vp, FREAD | FWRITE, NOCRED);

  	/* Announce our departure. */
  	aprint_normal("%s at %s (%s) deleted\n", device_xname(sc->sc_dev),
 @@ -964,7 +965,7 @@
  dkclose(dev_t dev, int flags, int fmt, struct lwp *l)
  {
  	struct dkwedge_softc *sc = dkwedge_lookup(dev);
 -	int error = 0;
 +	struct vnode *tmp_vp = NULL;

  	KASSERT(sc->sc_dk.dk_openmask != 0);

 @@ -981,17 +982,17 @@
  		mutex_enter(&sc->sc_parent->dk_rawlock);
  		if (sc->sc_parent->dk_rawopens-- == 1) {
  			KASSERT(sc->sc_parent->dk_rawvp != NULL);
 -			mutex_exit(&sc->sc_parent->dk_rawlock);
 -			error = vn_close(sc->sc_parent->dk_rawvp,
 -			    FREAD | FWRITE, NOCRED);
 +			tmp_vp = sc->sc_parent->dk_rawvp = NULL;
  			sc->sc_parent->dk_rawvp = NULL;
 -		} else
 -			mutex_exit(&sc->sc_parent->dk_rawlock);
 +		}
 +		mutex_exit(&sc->sc_parent->dk_rawlock);
  	}

  	mutex_exit(&sc->sc_dk.dk_openlock);

 -	return (error);
 +	if (tmp_vp != NULL)
 +		return vn_close(tmp_vp, FREAD | FWRITE, NOCRED);
 +	return 0;
  }

  /*

 --------------060607080304090807080703--

From: Wolfgang Stukenbrock <wolfgang.stukenbrock@nagler-company.com>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/46879: panic in reboot 5_1_STABLE in dkwedge - dead-lock detected
Date: Wed, 21 Aug 2013 17:30:28 +0200

 Hi,

 this PR is still open.

 I'm upgrading to 6.1 and (at least in that sw-tree) this problem has 
 been addressed and fixed in a completely other way.

 I've tested the version from 6.1-Release some minutes ago and no panic 
 occurs.

 So, if the fix added in 6.1 will be in 6.0 and 5.x too, I think this PR 
 can be closed.
 (And I think I remember that I've seen this fix in the 5.x tree too, but 
 I'm not shure anymore.)

 best regards

 W. Stukenbrock




State-Changed-From-To: open->closed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Fri, 25 Nov 2016 21:06:33 +0000
State-Changed-Why:
Submitted confirmed problem is fixed (differently) in 6.1.


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2014 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.