NetBSD Problem Report #46734
From martin@duskware.de Mon Jul 23 13:29:14 2012
Return-Path: <martin@duskware.de>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
by www.NetBSD.org (Postfix) with ESMTP id 69B5863B85F
for <gnats-bugs@gnats.NetBSD.org>; Mon, 23 Jul 2012 13:29:14 +0000 (UTC)
Message-Id: <20120723132914.69B5863B85F@www.NetBSD.org>
Date: Mon, 23 Jul 2012 13:29:14 +0000 (UTC)
From: martin@NetBSD.org
Reply-To: martin@NetBSD.org
To: gnats-bugs@gnats.NetBSD.org
Subject: pool crash on module unload
X-Send-Pr-Version: 3.95
>Number: 46734
>Category: kern
>Synopsis: pool crash on module unload
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Jul 23 13:30:00 +0000 2012
>Last-Modified: Fri Aug 17 15:30:03 +0000 2012
>Originator: Martin Husemann
>Release: NetBSD 6.99.10
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD thirdstage.duskware.de 6.99.10 NetBSD 6.99.10 (MODULAR) #72: Mon Jul 23 14:33:10 CEST 2012 martin@night-porter.duskware.de:/usr/src/sys/arch/sparc64/compile/MODULAR sparc64
Architecture: sparc64
Machine: sparc64
>Description:
Running a full ATF test run on a -current sparc64 SMP machine crashes with:
panic: pool_destroy: pool busy: still out: 1
Stopped in pid 0.8 (system) at netbsd:cpu_Debugger+0x4: nop
db{0}> bt
panic(17bfcb0, 1, 675e000, 8000000000000000, 8936cb0, 8953a30) at netbsd:panic+0x24
pool_destroy(73c3200, 0, 2, 6, 1009bb0, 1c14000) at netbsd:pool_destroy+0x320
pool_cache_destroy(18ade40, 73c3200, 0, 0, 0, 0) at netbsd:pool_cache_destroy+0x10
puffs_vfsop_done(1c4b800, 0, 2a837, 67445a0, 0, 175c198) at puffs:puffs_vfsop_done+0x4
vfs_detach(1c4b660, 0, 675e000, 186e9d0, 1794318, 1c14000) at netbsd:vfs_detach+0x80
module_do_unload(18a4000, 0, 675e000, 6, 8, 1c14000) at netbsd:module_do_unload+0x74
module_thread(675e000, 675e000, 0, 0, 1796ca8, 1c4a990) at netbsd:module_thread+0xd0
lwp_trampoline(f005eab8, 113800, 113c00, 1118c0, 111d40, 111838) at netbsd:lwp_trampoline+0x8
>How-To-Repeat:
cd /usr/tests && atf-run|atf-report
>Fix:
n/a
>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/46734: pool not drained
Date: Tue, 24 Jul 2012 10:01:58 +0200
Runing the tests in /usr/tests/sbin/resize_ffs is enough to trigger this
problem (at least on sparc64 SMP, might be 8k page size and recent poll
changes uncovered an old bug).
Given this backtrace:
pool_destroy(73c3200, 0, 1, 6, 175c310, 17985e8) at netbsd:pool_destroy+0x320
pool_cache_destroy(18ade40, 73c3200, 0, 175c2d8, 175c310, 1797610) at netbsd:poo
l_cache_destroy+0x10
puffs_vfsop_done(1c4b800, 0, 4f72, 6744e10, 0, 8840dc08) at puffs:puffs_vfsop_do
ne+0x4
vfs_detach(1c4b660, 0, 675e000, 186e9d0, 1794458, 1c14000) at netbsd:vfs_detach+
0x80
module_do_unload(18a4000, 0, 675e000, 6, 0, 1c14000) at netbsd:module_do_unload+
0x74
I did:
db{1}> sh pool 73c3200
POOL puffprkl: size 112, align 16, ioff 0, roflags 0x00000040
alloc 0x1875d10
minitems 0, minpages 0, maxpages 4294967295, npages 16
itemsperpage 72, nitems 1151, nout 1, hardlimit 4294967295
nget 1135, nfail 0, nput 1134
npagealloc 16, npagefree 0, hiwat 16, nidle 15
Which code is supposed to drain this pool before module unload?
Martin
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org, manu@NetBSD.org
Cc:
Subject: Re: kern/46734: pool not drained
Date: Tue, 24 Jul 2012 11:51:01 +0200
Reverting to the revisions prior to this commit fixes the problem (and also
makes all resize_fs tests work as expected again):
Test cases for known bugs:
t_shrink:shrink_24M_16M_v2_4096: PR bin/44205: atf-check failed; see
the output of the test for details
t_shrink_swapped:shrink_24M_16M_v2_16384: PR bin/44205: atf-check
failed; see the output of the
test for details
Summary for 4 test programs:
14 passed test cases.
0 failed test cases.
2 expected failed test cases.
0 skipped test cases.
Martin
Module Name: src
Committed By: manu
Date: Sat Jul 21 05:17:11 UTC 2012
Modified Files:
src/lib/libpuffs: dispatcher.c puffs.h
src/sys/fs/puffs: puffs_msgif.c puffs_msgif.h puffs_sys.h
puffs_vfsops.c puffs_vnops.c
Log Message:
- Improve PUFFS_KFLAG_CACHE_FS_TTL by reclaiming older inactive nodes.
..
..
We fix this by introducing a lookup count in kernel and userland. On
reclaim, the kernel sends the count, which enable userland to detect
situation where it initiated a lookup that is not completed in kernel.
In such a situation, the reclaim must be ignored, as the node is about
to be looked up again.
To generate a diff of this commit:
cvs rdiff -u -r1.41 -r1.42 src/lib/libpuffs/dispatcher.c
cvs rdiff -u -r1.122 -r1.123 src/lib/libpuffs/puffs.h
cvs rdiff -u -r1.89 -r1.90 src/sys/fs/puffs/puffs_msgif.c
cvs rdiff -u -r1.78 -r1.79 src/sys/fs/puffs/puffs_msgif.h
cvs rdiff -u -r1.79 -r1.80 src/sys/fs/puffs/puffs_sys.h
cvs rdiff -u -r1.101 -r1.102 src/sys/fs/puffs/puffs_vfsops.c
cvs rdiff -u -r1.166 -r1.167 src/sys/fs/puffs/puffs_vnops.c
From: Emmanuel Dreyfus <manu@netbsd.org>
To: Martin Husemann <martin@duskware.de>
Cc: gnats-bugs@NetBSD.org, manu@NetBSD.org
Subject: Re: kern/46734: pool not drained
Date: Tue, 24 Jul 2012 13:40:44 +0000
> puffs_vfsop_done(1c4b800, 0, 4f72, 6744e10, 0, 8840dc08)
> at puffs:puffs_vfsop_done+0x4
> vfs_detach(1c4b660, 0, 675e000, 186e9d0, 1794458, 1c14000)
> at netbsd:vfs_detach+0x80
I see where is the bug but I do not know how to fix it. I will ask
on tech-kern
--
Emmanuel Dreyfus
manu@netbsd.org
From: Emmanuel Dreyfus <manu@netbsd.org>
To: Emmanuel Dreyfus <manu@netbsd.org>
Cc: Martin Husemann <martin@duskware.de>, gnats-bugs@NetBSD.org
Subject: Re: kern/46734: pool not drained
Date: Tue, 24 Jul 2012 16:31:47 +0000
--zhXaljGHf11kAtnf
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
On Tue, Jul 24, 2012 at 01:40:44PM +0000, Emmanuel Dreyfus wrote:
> I see where is the bug but I do not know how to fix it. I will ask
> on tech-kern
It was not what I thought at once, I think I have a fix.
Please test the attached patch.
--
Emmanuel Dreyfus
manu@netbsd.org
--zhXaljGHf11kAtnf
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="umountdrain.patch"
Index: sys/fs/puffs/puffs_msgif.c
===================================================================
RCS file: /cvsroot/src/sys/fs/puffs/puffs_msgif.c,v
retrieving revision 1.91
diff -U4 -r1.91 puffs_msgif.c
--- sys/fs/puffs/puffs_msgif.c 22 Jul 2012 17:40:46 -0000 1.91
+++ sys/fs/puffs/puffs_msgif.c 24 Jul 2012 16:27:40 -0000
@@ -845,10 +845,10 @@
/*
* Node expiry. We come here after an inactive on an unexpired node.
* The expiry has been queued and is done in sop thread.
*/
-static bool
-puffsop_expire(struct puffs_mount *pmp, puffs_cookie_t cookie)
+static void
+puffsop_expire(struct puffs_mount *pmp, puffs_cookie_t cookie, int force)
{
struct vnode *vp;
KASSERT(PUFFS_USE_FS_TTL(pmp));
@@ -856,15 +856,21 @@
/*
* If it still exists and has no reference,
* vrele should cause it to be reclaimed.
* Otherwise, we have nothing to do.
+ *
+ * On unmount, we mark the node with PNODE_NOREFS
+ * to make sure it will go away even if it was
+ * reused recently and have a grace time again.
*/
if (puffs_cookie2vnode(pmp, cookie, 0, 0, &vp) == 0) {
+ if (force)
+ VPTOPP(vp)->pn_stat |= PNODE_NOREFS;
VPTOPP(vp)->pn_stat &= ~PNODE_SOPEXP;
vrele(vp);
}
- return false;
+ return;
}
static void
puffsop_flush(struct puffs_mount *pmp, struct puffs_flush *pf)
@@ -1057,13 +1063,11 @@
mutex_enter(&pmp->pmp_sopmtx);
for (keeprunning = true; keeprunning; ) {
/*
- * We have a higher priority queue for flush and umount
- * and a lower priority queue for reclaims. Request on
- * slower queue are not honoured before clock reaches
- * psopr_at. This code assumes that requests are ordered
- * by psopr_at in queues.
+ * We have a fast queue for flush and umount, and a node
+ * queue for delayes node reclaims. Requests on node queue * are not honoured before clock reaches psopr_at. This
+ * code assumes that requests are ordered by psopr_at.
*/
do {
psopr = TAILQ_FIRST(&pmp->pmp_sopfastreqs);
if (psopr != NULL) {
@@ -1071,11 +1075,11 @@
psopr, psopr_entries);
break;
}
- psopr = TAILQ_FIRST(&pmp->pmp_sopslowreqs);
+ psopr = TAILQ_FIRST(&pmp->pmp_sopnodereqs);
if ((psopr != NULL) && TIMED_OUT(psopr->psopr_at)) {
- TAILQ_REMOVE(&pmp->pmp_sopslowreqs,
+ TAILQ_REMOVE(&pmp->pmp_sopnodereqs,
psopr, psopr_entries);
break;
}
@@ -1091,9 +1095,9 @@
case PUFFS_SOPREQ_FLUSH:
puffsop_flush(pmp, &psopr->psopr_pf);
break;
case PUFFS_SOPREQ_EXPIRE:
- puffsop_expire(pmp, psopr->psopr_ck);
+ puffsop_expire(pmp, psopr->psopr_ck, 0);
break;
case PUFFS_SOPREQ_UNMOUNT:
puffs_msg_sendresp(pmp, &psopr->psopr_preq, 0);
@@ -1122,12 +1126,15 @@
kmem_free(psopr, sizeof(*psopr));
mutex_enter(&pmp->pmp_sopmtx);
}
- while ((psopr = TAILQ_FIRST(&pmp->pmp_sopslowreqs)) != NULL) {
- TAILQ_REMOVE(&pmp->pmp_sopslowreqs, psopr, psopr_entries);
+ while ((psopr = TAILQ_FIRST(&pmp->pmp_sopnodereqs)) != NULL) {
+ TAILQ_REMOVE(&pmp->pmp_sopnodereqs, psopr, psopr_entries);
mutex_exit(&pmp->pmp_sopmtx);
- puffs_msg_sendresp(pmp, &psopr->psopr_preq, ENXIO);
+
+ KASSERT(psopr->psopr_sopreq == PUFFS_SOPREQ_EXPIRE);
+ puffsop_expire(pmp, psopr->psopr_ck, 1);
+
kmem_free(psopr, sizeof(*psopr));
mutex_enter(&pmp->pmp_sopmtx);
}
Index: sys/fs/puffs/puffs_sys.h
===================================================================
RCS file: /cvsroot/src/sys/fs/puffs/puffs_sys.h,v
retrieving revision 1.80
diff -U4 -r1.80 puffs_sys.h
--- sys/fs/puffs/puffs_sys.h 21 Jul 2012 05:17:10 -0000 1.80
+++ sys/fs/puffs/puffs_sys.h 24 Jul 2012 16:27:40 -0000
@@ -175,9 +175,9 @@
kmutex_t pmp_sopmtx;
kcondvar_t pmp_sopcv;
int pmp_sopthrcount;
TAILQ_HEAD(, puffs_sopreq) pmp_sopfastreqs;
- TAILQ_HEAD(, puffs_sopreq) pmp_sopslowreqs;
+ TAILQ_HEAD(, puffs_sopreq) pmp_sopnodereqs;
bool pmp_docompat;
};
#define PUFFSTAT_BEFOREINIT 0
Index: sys/fs/puffs/puffs_vfsops.c
===================================================================
RCS file: /cvsroot/src/sys/fs/puffs/puffs_vfsops.c,v
retrieving revision 1.103
diff -U4 -r1.103 puffs_vfsops.c
--- sys/fs/puffs/puffs_vfsops.c 22 Jul 2012 17:40:46 -0000 1.103
+++ sys/fs/puffs/puffs_vfsops.c 24 Jul 2012 16:27:40 -0000
@@ -307,9 +307,9 @@
cv_init(&pmp->pmp_sopcv, "puffsop");
TAILQ_INIT(&pmp->pmp_msg_touser);
TAILQ_INIT(&pmp->pmp_msg_replywait);
TAILQ_INIT(&pmp->pmp_sopfastreqs);
- TAILQ_INIT(&pmp->pmp_sopslowreqs);
+ TAILQ_INIT(&pmp->pmp_sopnodereqs);
if ((error = kthread_create(PRI_NONE, KTHREAD_MPSAFE, NULL,
puffs_sop_thread, pmp, NULL, "puffsop")) != 0)
goto out;
Index: sys/fs/puffs/puffs_vnops.c
===================================================================
RCS file: /cvsroot/src/sys/fs/puffs/puffs_vnops.c,v
retrieving revision 1.170
diff -U4 -r1.170 puffs_vnops.c
--- sys/fs/puffs/puffs_vnops.c 23 Jul 2012 19:06:10 -0000 1.170
+++ sys/fs/puffs/puffs_vnops.c 24 Jul 2012 16:27:40 -0000
@@ -1278,9 +1278,9 @@
*/
if (pmp->pmp_sopthrcount == 0) {
kmem_free(psopr, sizeof(*psopr));
} else {
- TAILQ_INSERT_TAIL(&pmp->pmp_sopslowreqs,
+ TAILQ_INSERT_TAIL(&pmp->pmp_sopnodereqs,
psopr, psopr_entries);
pnode->pn_stat |= PNODE_SOPEXP;
}
--zhXaljGHf11kAtnf--
From: Martin Husemann <martin@duskware.de>
To: Emmanuel Dreyfus <manu@netbsd.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46734: pool not drained
Date: Tue, 24 Jul 2012 23:34:16 +0200
Sorry, still causes the same panic.
Martin
From: Emmanuel Dreyfus <manu@netbsd.org>
To: Martin Husemann <martin@duskware.de>
Cc: Emmanuel Dreyfus <manu@netbsd.org>, gnats-bugs@NetBSD.org
Subject: Re: kern/46734: pool not drained
Date: Wed, 25 Jul 2012 07:51:19 +0000
On Tue, Jul 24, 2012 at 11:34:16PM +0200, Martin Husemann wrote:
> Sorry, still causes the same panic.
I cannot reproduce the crash, either with or without my patch. The
resize_ffs test passes fine.
It would be interesting to have three printfs in
src/sys/fs/puffs/puffs_vfs.c:puffs_vfsop_done()
so that we know what pool_destroy causes the problem.
--
Emmanuel Dreyfus
manu@netbsd.org
From: Martin Husemann <martin@duskware.de>
To: Emmanuel Dreyfus <manu@netbsd.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46734: pool not drained
Date: Wed, 25 Jul 2012 09:59:49 +0200
The crash does not happen every time (but most of them, and apparently
always w/o your patch).
And of course you need a kernel that does not have the puffs code build
in (so on i386/amd64 a custom kernel is needed).
Martin
From: Martin Husemann <martin@duskware.de>
To: Emmanuel Dreyfus <manu@netbsd.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46734: pool not drained
Date: Wed, 25 Jul 2012 10:04:41 +0200
Forgot to add: from the "show pool" output the pool in question is:
void
puffs_msgif_destroy(void)
{
pool_cache_destroy(parkpc);
}
Martin
From: Emmanuel Dreyfus <manu@netbsd.org>
To: Martin Husemann <martin@duskware.de>
Cc: Emmanuel Dreyfus <manu@netbsd.org>, gnats-bugs@NetBSD.org
Subject: Re: kern/46734: pool not drained
Date: Wed, 25 Jul 2012 08:48:39 +0000
On Wed, Jul 25, 2012 at 10:04:41AM +0200, Martin Husemann wrote:
> Forgot to add: from the "show pool" output the pool in question is:
> pool_cache_destroy(parkpc);
Yes, I came to the same conclusion. It means we still have a PUFFS
reply being awaited. Mouting with PUFFS_FLAG_OPDUMP may tell us
something, but it also may hide the problem, since it lookup timing-dependent.
--
Emmanuel Dreyfus
manu@netbsd.org
From: Emmanuel Dreyfus <manu@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/46734: pool not drained
Date: Fri, 17 Aug 2012 15:28:45 +0000
Lots of talks and tests occured behind the scene with Martin. here is a
summary.
The kernel crashed because of an assertion when freeing a pool while some
objects were still allocated. This happenned because we had pending parks
the were not freed at the time we freed the pool. And we proceeded with
unmounting while having pending parks because the unmount system call
was interrupted by a SIGCHLD thatcaused another unmount by closing
/dev/puffs.
This means we have a race condition in the kernel, but it seems it has
always been there. reproducing it was hard. However, we spoted the change
that made the race condition likely to occur:
libpuffs maintain a lookup count in struct puffs_node in order to detect
trace conditions between lookup and reclaim. Doing that, it assumes the
cookies always map to struct puffs_node. This was not the case for
libp2k, which mapped cookies to struct p2k_node. libpuffs therefore
corrupted struct libp2k_node.
That bug has been ifxed in libp2k, and the fix has been pulled up to
netbsd-6. The kernel race condition on unmount remains but we were
not able to reproduce it anymore.
--
Emmanuel Dreyfus
manu@netbsd.org
(Contact us)
$NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.