NetBSD Problem Report #42661
From he@smistad.uninett.no Fri Jan 22 16:22:51 2010
Return-Path: <he@smistad.uninett.no>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by www.NetBSD.org (Postfix) with ESMTP id 54B1A63C54F
for <gnats-bugs@gnats.NetBSD.org>; Fri, 22 Jan 2010 16:22:51 +0000 (UTC)
Message-Id: <20100122162249.3799F3D0A8@smistad.uninett.no>
Date: Fri, 22 Jan 2010 17:22:49 +0100 (CET)
From: he@nordu.net
Reply-To: he@nordu.net
To: gnats-bugs@gnats.NetBSD.org
Subject: Linux-emulated Veritas NetBackup fails to work in 5.0
X-Send-Pr-Version: 3.95
>Number: 42661
>Category: kern
>Synopsis: Linux-emulated Veritas NetBackup fails to work in 5.0
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: kern-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Jan 22 16:25:00 +0000 2010
>Closed-Date: Wed Mar 31 12:11:33 +0000 2010
>Last-Modified: Sat Jun 12 18:40:01 +0000 2010
>Originator: Havard Eidnes
>Release: NetBSD 5.0.1_PATCH
>Organization:
NORDUnet AS
>Environment:
System:
Architecture: i386
Machine: i386
>Description:
Well, the basic problem is that Veritas NetBackup (which is
only available in binary form, and we use the Linux version)
fails to work in NetBSD 5.0. It works fine in 4.0.
Because we run a Linux binary (and the Linux emulation looks up
absolute paths under /emul/linux first), we need to take special
steps to ensure that the entire native /usr gets backed up, so
that the backup of /usr/lib contains the NetBSD libraries rather
than the Linux-emulation libraries in /emul/linux/usr/lib.
So, since we want all the file systems that should be backed
up to appear under a common root, we need to re-mount the
relevant file systems there by some method.
We have tried two methods:
1) null mounts
2) NFS mounts
With null mounts in 4.0, we found that after a few days of
uptime all kernel memory was consumed and, if I recall
correctly, the system would basically seize up, so that manual
intervention via DDB was required to bring it back to life.
We therefore looked at alternatives, and ended up with NFS
mounts.
We have re-tried the null mounts, but the unidentified memory
leak still appears to be present in 5.0, so that is not a
usable method.
The NFS mount method has worked well in 4.0, but is giving us
problems in 5.0. After some debugging, we have found that one
of the two "bpbkar" processes ends up in "uvn_fp2" wait, most
probably while holding a lock, and fails to make any progress
beyond that point. New bpbkar processes (the backup server
starts new ones on a schedule) end up in "tstile" state, and
so do "df" processes, whether native or Linux-emulated.
Our most recent attempt at rebooting also got stuck in tstile
while unmounting one of the file systems, and here is some
selected output from the console log:
Jan 22 16:29:10 mail-server shutdown: reboot by he: new kernel
Jan 22 16:29:24 mail-server syslogd: Exiting on signal 15
syncing disks... 1 done
unmounting file systems...
unmounting /usr/pkg/emul/linux/netbackup/home (localhost:/home)...[halt sent]
fatal breakpoint trap in supervisor mode
trap type 1 code 0 eip c05b2ecc cs 8 eflags 202 cr2 bb906538 ilevel 8
Stopped in pid 0.2 (system) at netbsd:breakpoint+0x4: popl %ebp
db{0}: ps
  PID   LID S CPU     FLAGS STRUCT LWP * NAME   WAIT
20756     1 3   1         4     e80c0d40 reboot tstile
 6695     1 3   2   9020004     e4807580 bpbkar tstile
 3952     1 3   2   9020004     e46bd280 bpbkar tstile
 3081     1 3   2   9020004     e9408d20 df     tstile
 2519     1 3   1   9020004     d89250c0 df     tstile
 3006     1 3   1   9020004     d8898ca0 df     tstile
17026     1 3   2   9020004     e4807800 bpbkar nfsrcv
 5421     1 3   1   9020004     d89a07a0 bpbkar uvn_fp2
    1     1 3   2   8020084     ce3bc840 init   wait
    0    73 3   0       204     e94080a0 ktrace ktrwait
         72 3   0       204     e9d5eae0 ktrace ktrwait
         68 3   1       204     d4744300 nfsio  netio
         67 3   2       204     d4744580 nfsio  nfsrcv
         66 3   1       204     d4744800 nfsio  nfsrcv
         65 3   0       204     d4744a80 nfsio  nfsrcv
(why did it suddenly start indenting the ps listing at that
point?!?)
db{0}: trace/t 0t5421
trace: pid 5421 lid 1 at 0xd89c43cc
sleepq_block(0,0,c0aaba51,c0b27c80,0,c150a9ac,9,c2580910,da4a13a0,0) at netbsd:sleepq_block+0xeb
mtsleep(c2580910,204,c0aaba51,0,da4a13a0,da4a13a0,10,6,0,0) at netbsd:mtsleep+0x12d
uvn_findpage(d89c45ac,0,d89c44ac,c05343fa,0,0,2,0,994000,d89c45cc) at netbsd:uvn_findpage+0x92
uvn_findpages(da4a13a0,24e60000,2,d89c45ec,d89c45ac,0,994000,20,2,0) at netbsd:uvn_findpages+0x73
genfs_getpages(d89c46b0,0,0,0,0,24ed0000,0,0,2,d89c465c) at netbsd:genfs_getpages+0x743
nfs_getpages(d89c46b0,4,24e62000,2,0,10000,24ee0000,c089d600,da4a13a0,24e60000) at netbsd:nfs_getpages+0xbb
VOP_GETPAGES(da4a13a0,24e60000,2,d89c4750,d89c47c8,0,1,0,1802,0) at netbsd:VOP_GETPAGES+0x65
uvn_get(da4a13a0,24e60000,2,d89c4750,d89c47c8,0,1,0,1802,d89a07a0) at netbsd:uvn_get+0x117
ubc_fault(d89c48e0,d3981000,d89c48a0,1,0,1,42,246,8,c0bc8d04) at netbsd:ubc_fault+0x170
uvm_fault_internal(c0bc21c0,d3981000,1,0,c262cfca,c0000,0,c05a6cfa,6,6) at netbsd:uvm_fault_internal+0x3a9
trap() at netbsd:trap+0x797
--- trap (number 6) ---
copyout(d87906c0,d3981000,8249438,2000,d87906c0,0,d3981000,24e60000,2,d3981000) at netbsd:copyout+0x33
uiomove(d3981000,2000,d89c4c8c,d89c4adc,0,101,deaddead,0,1829b58,0) at netbsd:uiomove+0x62
ubc_uiomove(da4a13a0,d89c4c8c,10000,0,101,7c356d21,d89c4b2c,c085d206,da4945c0,da4a1440) at netbsd:ubc_uiomove+0xeb
nfs_bioread(da4a13a0,d89c4c8c,0,ce3a6f00,0,da4a13a0,d89c4c2c,c053d6f4,d89c4c14,da4a13a0) at netbsd:nfs_bioread+0x312
nfs_read(d89c4c14,da4a13a0,c089d3c0,da4a13a0,1,20001,d89c4c2c,c0534d58,c089ce80,da4a13a0) at netbsd:nfs_read+0x43
VOP_READ(da4a13a0,d89c4c8c,0,ce3a6f00,d40a1040,0,9c4c6c,16,10000,8249438) at netbsd:VOP_READ+0x44
vn_read(d8c4d940,d8c4d940,d89c4c8c,ce3a6f00,1,0,0,0,d89a07a0,d89c4d48) at netbsd:vn_read+0x93
dofileread(9,d8c4d940,8249438,10000,d8c4d940,1,d89c4d28,d89c4d48,d89c4d48,d89a07a0) at netbsd:dofileread+0x75
sys_read(d89a07a0,d89c4d10,d89c4d28,9c4d20,96,10,c0b4a744,9,8249438,10000) at netbsd:sys_read+0x6f
linux_syscall(d89c4d48,2b,2b,2b,2b,610,8259338,bfbeec08,9,10000) at netbsd:linux_syscall+0x9b
db{0}:
Now, inspection shows that the 5th argument to mtsleep is the
interlock mutex passed to it, and that its address can be
given to "show lock" in DDB:
db{0}: show lock 0xda4a13a0
lock address : 0x00000000da4a13a0 type : sleep/adaptive
initialized : 0x00000000c052b9c6
shared holds : 0 exclusive: 0
shares wanted: 0 exclusive: 0
current cpu : 0 last held: 1
current lwp : 0x00000000ce3a7c80 last held: 000000000000000000
last locked : 0x00000000c03d3f4c unlocked : 0x00000000c03d403b
owner field : 000000000000000000 wait/spin: 0/0
Turnstile chain at 0xc150ba80.
=> No active turnstile for this lock.
db{0}:
The "last locked" and "unlocked" values are:
db{0}: x/i 0x00000000c03d3f4c
netbsd:nfs_sync+0x7c: cmpl $0x3,0xc(%ebp)
db{0}: x/i 0x00000000c03d403b
netbsd:nfs_sync+0x16b: jmp netbsd:nfs_sync+0x44
db{0}:
Now, the way I read the "show lock" output, this lock is
currently not held, while the "bpbkar" process is still
waiting on it. That may be the reason that process is not
making any progress.
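For reference, mtsleep(9) is declared as mtsleep(wchan, priority,
wmesg, timo, mtx): the LWP waits on wchan and releases mtx while
it is asleep, re-acquiring it before returning. An unowned
interlock is therefore what one would expect to see for an LWP
blocked in mtsleep; in uvn_findpage the wait channel is presumably
the busy page itself. The user-space sketch below models this with
POSIX threads, assuming pthread_cond_wait(3) is a fair analogue of
mtsleep; the busy-page flag and all names are hypothetical, chosen
only to mirror the "uvn_fp2" wait in the trace.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical user-space model of mtsleep(wchan, ..., mtx):
 * pthread_cond_wait() drops `interlock` while the thread sleeps and
 * re-acquires it before returning, as mtsleep() does with mtx. */
static pthread_mutex_t interlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t page_cv = PTHREAD_COND_INITIALIZER; /* "wchan" */
static int page_busy;     /* stand-in for a busy page */
static int waiter_asleep; /* set by the waiter just before it sleeps */

static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&interlock);
    while (page_busy) {
        waiter_asleep = 1;
        /* The interlock is released for the duration of this wait. */
        pthread_cond_wait(&page_cv, &interlock);
    }
    assert(!page_busy); /* the loop only exits once the page is free */
    waiter_asleep = 0;
    pthread_mutex_unlock(&interlock);
    return NULL;
}

/* Returns 1 if the interlock could be taken with trylock while the
 * waiter was asleep (i.e. it showed as "not held") and the waiter
 * then completed after being woken; -1 if the thread was not created. */
static int demo(void)
{
    pthread_t t;

    page_busy = 1;
    waiter_asleep = 0;
    if (pthread_create(&t, NULL, waiter, NULL) != 0)
        return -1;

    /* Poll until the interlock is free while the waiter sleeps. */
    for (;;) {
        if (pthread_mutex_trylock(&interlock) == 0) {
            if (waiter_asleep)
                break; /* keep holding the interlock */
            pthread_mutex_unlock(&interlock);
        }
        sched_yield();
    }
    page_busy = 0;                    /* like un-busying the page */
    pthread_cond_broadcast(&page_cv); /* like wakeup(wchan) */
    pthread_mutex_unlock(&interlock);
    pthread_join(t, NULL);
    return waiter_asleep == 0;
}
```

Calling demo() returns 1: the main thread takes the interlock with
pthread_mutex_trylock() while the waiter is still asleep, and the
waiter finishes once the flag is cleared and the condition variable
is broadcast.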
As to the root cause of this problem, I have no idea, and
would appreciate further input to help narrow it down.
>How-To-Repeat:
Try to use Linux-emulated Veritas NetBackup together with NFS
mounted file systems to be backed up, and watch it get stuck.
>Fix:
Sorry, no idea; requesting help with digging further.
>Release-Note:
>Audit-Trail:
From: Havard Eidnes <he@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/42661 CVS commit: src/sys/compat
Date: Wed, 3 Mar 2010 08:20:39 +0000
Module Name: src
Committed By: he
Date: Wed Mar 3 08:20:39 UTC 2010
Modified Files:
src/sys/compat/common: vfs_syscalls_30.c
src/sys/compat/ibcs2: ibcs2_misc.c
src/sys/compat/irix: irix_dirent.c
src/sys/compat/linux/common: linux_file64.c linux_misc.c
src/sys/compat/linux32/common: linux32_dirent.c
src/sys/compat/osf1: osf1_file.c
src/sys/compat/sunos: sunos_misc.c
src/sys/compat/sunos32: sunos32_misc.c
src/sys/compat/svr4: svr4_misc.c
src/sys/compat/svr4_32: svr4_32_misc.c
Log Message:
When implementing "read directory", when there are too many empty entries
in a row, and we need to try to read the next block, and have passed a
non-NULL cookie pointer to VOP_READDIR, ensure that we free the cookie
buffer before re-doing VOP_READDIR, so that we don't leak memory.
This fix is similar to nfs_serv.c revisions 1.115 + 1.124.
This should fix the long-standing problem observed by e.g. using Linux-
emulated programs to take backup of servers, which is one of the problems
which were reported in PR#42661.
Thanks to pooka@ for the hints for traversing the VOP* layer.
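The leak described in this log message can be reproduced in
miniature. The sketch below is a user-space model, not the code
from linux_file64.c or the other files listed: fake_readdir() is a
hypothetical stand-in for VOP_READDIR called with a non-NULL cookie
pointer (it hands back a freshly allocated cookie buffer on every
call), and compat_readdir() is a hypothetical compat-layer loop that
skips blocks containing only empty entries. The free_cookies() call
before "goto again" is the step the fix adds; a counter of
outstanding buffers shows that nothing is left allocated afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical directory: ENTRIES_PER_BLOCK inode numbers per
 * "block"; an inode number of 0 marks an empty (deleted) entry. */
#define ENTRIES_PER_BLOCK 4
static const unsigned long dir_inodes[] = {
    0,  0, 0,  0,   /* block 0: only empty entries */
    0,  0, 0,  0,   /* block 1: only empty entries */
    12, 0, 57, 0,   /* block 2: two live entries */
};
#define NBLOCKS (sizeof dir_inodes / sizeof dir_inodes[0] / ENTRIES_PER_BLOCK)

static int live_cookie_buffers; /* cookie buffers allocated, not freed */

/* Stand-in for VOP_READDIR with a non-NULL cookie pointer: returns one
 * block of entries plus a newly allocated cookie array (one offset per
 * entry) that the caller owns and must free. */
static int fake_readdir(size_t block, const unsigned long **ents,
                        long **cookies, int *ncookies)
{
    *ents = NULL;
    *cookies = NULL;
    *ncookies = 0;
    if (block >= NBLOCKS)
        return 0; /* end of directory */
    *cookies = malloc(ENTRIES_PER_BLOCK * sizeof **cookies);
    if (*cookies == NULL)
        return -1;
    live_cookie_buffers++;
    *ents = &dir_inodes[block * ENTRIES_PER_BLOCK];
    for (int i = 0; i < ENTRIES_PER_BLOCK; i++)
        (*cookies)[i] = (long)(block * ENTRIES_PER_BLOCK + i + 1);
    *ncookies = ENTRIES_PER_BLOCK;
    return 0;
}

static void free_cookies(long *cookies)
{
    if (cookies == NULL)
        return;
    assert(live_cookie_buffers > 0);
    free(cookies);
    live_cookie_buffers--;
}

/* Compat-style "read directory": if a block holds only empty entries,
 * move on and read the next one. Returns the number of live entries,
 * 0 at end of directory, -1 on allocation failure. */
static int compat_readdir(size_t *block)
{
    const unsigned long *ents;
    long *cookies;
    int ncookies, live;

again:
    if (fake_readdir(*block, &ents, &cookies, &ncookies) != 0)
        return -1;
    if (ncookies == 0)
        return 0;
    live = 0;
    for (int i = 0; i < ncookies; i++)
        if (ents[i] != 0)
            live++; /* real code would translate and copy out here */
    (*block)++;
    if (live == 0) {
        /* The fix: release this round's cookie buffer before retrying.
         * Without it, every all-empty block leaks one buffer. */
        free_cookies(cookies);
        goto again;
    }
    free_cookies(cookies);
    return live;
}
```

Without the free_cookies() call before "goto again", the first call
in this model would return with live_cookie_buffers at 2, one leaked
buffer per all-empty block skipped.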
To generate a diff of this commit:
cvs rdiff -u -r1.30 -r1.31 src/sys/compat/common/vfs_syscalls_30.c
cvs rdiff -u -r1.109 -r1.110 src/sys/compat/ibcs2/ibcs2_misc.c
cvs rdiff -u -r1.23 -r1.24 src/sys/compat/irix/irix_dirent.c
cvs rdiff -u -r1.49 -r1.50 src/sys/compat/linux/common/linux_file64.c
cvs rdiff -u -r1.213 -r1.214 src/sys/compat/linux/common/linux_misc.c
cvs rdiff -u -r1.9 -r1.10 src/sys/compat/linux32/common/linux32_dirent.c
cvs rdiff -u -r1.37 -r1.38 src/sys/compat/osf1/osf1_file.c
cvs rdiff -u -r1.165 -r1.166 src/sys/compat/sunos/sunos_misc.c
cvs rdiff -u -r1.68 -r1.69 src/sys/compat/sunos32/sunos32_misc.c
cvs rdiff -u -r1.148 -r1.149 src/sys/compat/svr4/svr4_misc.c
cvs rdiff -u -r1.67 -r1.68 src/sys/compat/svr4_32/svr4_32_misc.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Soren Jacobsen <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/42661 CVS commit: [netbsd-5] src/sys/compat
Date: Wed, 17 Mar 2010 02:59:53 +0000
Module Name: src
Committed By: snj
Date: Wed Mar 17 02:59:53 UTC 2010
Modified Files:
src/sys/compat/common [netbsd-5]: vfs_syscalls_30.c
src/sys/compat/ibcs2 [netbsd-5]: ibcs2_misc.c
src/sys/compat/irix [netbsd-5]: irix_dirent.c
src/sys/compat/linux/common [netbsd-5]: linux_file64.c linux_misc.c
src/sys/compat/linux32/common [netbsd-5]: linux32_dirent.c
src/sys/compat/sunos [netbsd-5]: sunos_misc.c
src/sys/compat/sunos32 [netbsd-5]: sunos32_misc.c
src/sys/compat/svr4 [netbsd-5]: svr4_misc.c
src/sys/compat/svr4_32 [netbsd-5]: svr4_32_misc.c
Log Message:
Pull up following revision(s) (requested by he in ticket #1323):
sys/compat/common/vfs_syscalls_30.c: revision 1.31
sys/compat/ibcs2/ibcs2_misc.c: revision 1.110
sys/compat/irix/irix_dirent.c: revision 1.24
sys/compat/linux/common/linux_file64.c: revision 1.50
sys/compat/linux/common/linux_misc.c: revision 1.214
sys/compat/linux32/common/linux32_dirent.c: revision 1.10
sys/compat/sunos/sunos_misc.c: revision 1.166
sys/compat/sunos32/sunos32_misc.c: revision 1.69
sys/compat/svr4/svr4_misc.c: revision 1.149
sys/compat/svr4_32/svr4_32_misc.c: revision 1.68
When implementing "read directory", when there are too many empty entries
in a row, and we need to try to read the next block, and have passed a
non-NULL cookie pointer to VOP_READDIR, ensure that we free the cookie
buffer before re-doing VOP_READDIR, so that we don't leak memory.
This fix is similar to nfs_serv.c revisions 1.115 + 1.124.
This should fix the long-standing problem observed by e.g. using Linux-
emulated programs to take backup of servers, which is one of the problems
which were reported in PR#42661.
Thanks to pooka@ for the hints for traversing the VOP* layer.
To generate a diff of this commit:
cvs rdiff -u -r1.28 -r1.28.6.1 src/sys/compat/common/vfs_syscalls_30.c
cvs rdiff -u -r1.104 -r1.104.6.1 src/sys/compat/ibcs2/ibcs2_misc.c
cvs rdiff -u -r1.23 -r1.23.10.1 src/sys/compat/irix/irix_dirent.c
cvs rdiff -u -r1.48 -r1.48.6.1 src/sys/compat/linux/common/linux_file64.c
cvs rdiff -u -r1.201 -r1.201.6.1 src/sys/compat/linux/common/linux_misc.c
cvs rdiff -u -r1.6 -r1.6.4.1 src/sys/compat/linux32/common/linux32_dirent.c
cvs rdiff -u -r1.161 -r1.161.4.1 src/sys/compat/sunos/sunos_misc.c
cvs rdiff -u -r1.62 -r1.62.4.1 src/sys/compat/sunos32/sunos32_misc.c
cvs rdiff -u -r1.144 -r1.144.6.1 src/sys/compat/svr4/svr4_misc.c
cvs rdiff -u -r1.63 -r1.63.6.1 src/sys/compat/svr4_32/svr4_32_misc.c
State-Changed-From-To: open->closed
State-Changed-By: he@NetBSD.org
State-Changed-When: Wed, 31 Mar 2010 12:11:33 +0000
State-Changed-Why:
This bug was fixed, and the fix was pulled up to the netbsd-5
branch, as recorded here.
From: Havard Eidnes <he@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/42661 CVS commit: src
Date: Mon, 5 Apr 2010 07:16:13 +0000
Module Name: src
Committed By: he
Date: Mon Apr 5 07:16:13 UTC 2010
Modified Files:
src/sys/kern: kern_malloc.c
src/sys/sys: mallocvar.h param.h
src/usr.bin/vmstat: vmstat.c
Log Message:
Extend struct malloc_type to count the number of active allocations
per size, and make vmstat report this information under the "Memory
statistics by type" display, which is only printed when the kernel
has been compiled with KMEMSTATS defined, like this:
Memory statistics by type                      Type  Kern
    Type InUse MemUse HighUse  Limit Requests Limit Limit Size(s)
   wapbl    15  4192K   4192K 78644K   376426     0     0 32:0,256:3,512:6,131072:1,262144:2,524288:3
Since struct malloc_type is user-visible and is changed, bump kernel
revision to 5.99.26.
While it is true that malloc(9) is in general on the path of slowly
being replaced by kmem(9) (kmem_alloc/kmem_free), there remain
many uses of malloc/free, and this could aid in finding any
leaks. (It helped find the leak fixed in PR#42661.)
This was discussed with and somewhat hesitantly OKed by rmind@.
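The per-size counters described above can be pictured with a small
model. This is a hypothetical sketch, not the kern_malloc.c change:
it assumes power-of-two size buckets from 16 bytes up to 1 MiB,
keeps an active-allocation count and a request count per bucket, and
prints "size:active" pairs for every bucket that has ever been used.
That last rule is an assumption that would account for entries such
as "32:0" in the sample output, but it is not taken from the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical bucket layout: 2^MIN_SHIFT .. 2^MAX_SHIFT bytes. */
#define MIN_SHIFT 4
#define MAX_SHIFT 20
#define NBUCKETS (MAX_SHIFT - MIN_SHIFT + 1)

struct size_stats {
    long active[NBUCKETS];   /* allocations currently outstanding */
    long requests[NBUCKETS]; /* allocations ever made */
};

static void stats_init(struct size_stats *st)
{
    memset(st, 0, sizeof *st);
}

/* Index of the smallest bucket that fits `size` (clamped to the last). */
static int bucket_index(size_t size)
{
    int idx = 0;
    size_t bsize = (size_t)1 << MIN_SHIFT;

    while (bsize < size && idx < NBUCKETS - 1) {
        bsize <<= 1;
        idx++;
    }
    return idx;
}

static void stats_alloc(struct size_stats *st, size_t size)
{
    int i = bucket_index(size);
    st->active[i]++;
    st->requests[i]++;
}

static void stats_free(struct size_stats *st, size_t size)
{
    int i = bucket_index(size);
    assert(st->active[i] > 0); /* catches a free without a matching alloc */
    st->active[i]--;
}

/* Total outstanding allocations across all buckets ("InUse"). */
static long stats_inuse(const struct size_stats *st)
{
    long total = 0;
    for (int i = 0; i < NBUCKETS; i++)
        total += st->active[i];
    return total;
}

/* Writes comma-separated "size:active" pairs for every bucket with at
 * least one request. Returns the string length, or -1 if it won't fit. */
static int stats_format(const struct size_stats *st, char *buf, size_t len)
{
    int off = 0;

    if (len == 0)
        return -1;
    buf[0] = '\0';
    for (int i = 0; i < NBUCKETS; i++) {
        if (st->requests[i] == 0)
            continue;
        int n = snprintf(buf + off, len - (size_t)off, "%s%zu:%ld",
                         off ? "," : "", (size_t)1 << (i + MIN_SHIFT),
                         st->active[i]);
        if (n < 0 || (size_t)off + (size_t)n >= len)
            return -1;
        off += n;
    }
    return off;
}
```

Feeding it one 32-byte allocation that is later freed, followed by
3 allocations of 256 bytes, 6 of 512, 1 of 131072, 2 of 262144 and 3
of 524288, yields the same Size(s) string as the wapbl line above,
and stats_inuse() returns 15, matching its InUse column.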
To generate a diff of this commit:
cvs rdiff -u -r1.128 -r1.129 src/sys/kern/kern_malloc.c
cvs rdiff -u -r1.7 -r1.8 src/sys/sys/mallocvar.h
cvs rdiff -u -r1.360 -r1.361 src/sys/sys/param.h
cvs rdiff -u -r1.166 -r1.167 src/usr.bin/vmstat/vmstat.c
From: Jeff Rizzo <riz@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/42661 CVS commit: [netbsd-4] src/sys/compat
Date: Sat, 12 Jun 2010 18:38:02 +0000
Module Name: src
Committed By: riz
Date: Sat Jun 12 18:38:02 UTC 2010
Modified Files:
src/sys/compat/common [netbsd-4]: vfs_syscalls_30.c
src/sys/compat/ibcs2 [netbsd-4]: ibcs2_misc.c
src/sys/compat/irix [netbsd-4]: irix_dirent.c
src/sys/compat/linux/common [netbsd-4]: linux_file64.c linux_misc.c
src/sys/compat/sunos [netbsd-4]: sunos_misc.c
src/sys/compat/sunos32 [netbsd-4]: sunos32_misc.c
src/sys/compat/svr4 [netbsd-4]: svr4_misc.c
src/sys/compat/svr4_32 [netbsd-4]: svr4_32_misc.c
Log Message:
Pull up following revision(s) (requested by he in ticket #1387):
sys/compat/svr4/svr4_misc.c: revision 1.149
sys/compat/linux/common/linux_misc.c: revision 1.214
sys/compat/common/vfs_syscalls_30.c: revision 1.31
sys/compat/sunos/sunos_misc.c: revision 1.166
sys/compat/linux/common/linux_file64.c: revision 1.50
sys/compat/svr4_32/svr4_32_misc.c: revision 1.68
sys/compat/ibcs2/ibcs2_misc.c: revision 1.110
sys/compat/linux32/common/linux32_dirent.c: revision 1.10
sys/compat/sunos32/sunos32_misc.c: revision 1.69
sys/compat/irix/irix_dirent.c: revision 1.24
sys/compat/osf1/osf1_file.c: revision 1.38
When implementing "read directory", when there are too many empty entries
in a row, and we need to try to read the next block, and have passed a
non-NULL cookie pointer to VOP_READDIR, ensure that we free the cookie
buffer before re-doing VOP_READDIR, so that we don't leak memory.
This fix is similar to nfs_serv.c revisions 1.115 + 1.124.
This should fix the long-standing problem observed by e.g. using Linux-
emulated programs to take backup of servers, which is one of the problems
which were reported in PR#42661.
Thanks to pooka@ for the hints for traversing the VOP* layer.
To generate a diff of this commit:
cvs rdiff -u -r1.18 -r1.18.2.1 src/sys/compat/common/vfs_syscalls_30.c
cvs rdiff -u -r1.81 -r1.81.2.1 src/sys/compat/ibcs2/ibcs2_misc.c
cvs rdiff -u -r1.16 -r1.16.18.1 src/sys/compat/irix/irix_dirent.c
cvs rdiff -u -r1.34 -r1.34.8.1 src/sys/compat/linux/common/linux_file64.c
cvs rdiff -u -r1.165.2.2 -r1.165.2.3 src/sys/compat/linux/common/linux_misc.c
cvs rdiff -u -r1.143 -r1.143.2.1 src/sys/compat/sunos/sunos_misc.c
cvs rdiff -u -r1.42 -r1.42.2.1 src/sys/compat/sunos32/sunos32_misc.c
cvs rdiff -u -r1.121 -r1.121.2.1 src/sys/compat/svr4/svr4_misc.c
cvs rdiff -u -r1.39 -r1.39.2.1 src/sys/compat/svr4_32/svr4_32_misc.c
>Unformatted:
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.