NetBSD Problem Report #42661

From he@smistad.uninett.no  Fri Jan 22 16:22:51 2010
Return-Path: <he@smistad.uninett.no>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by www.NetBSD.org (Postfix) with ESMTP id 54B1A63C54F
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 22 Jan 2010 16:22:51 +0000 (UTC)
Message-Id: <20100122162249.3799F3D0A8@smistad.uninett.no>
Date: Fri, 22 Jan 2010 17:22:49 +0100 (CET)
From: he@nordu.net
Reply-To: he@nordu.net
To: gnats-bugs@gnats.NetBSD.org
Subject: Linux-emulated Veritas NetBackup fails to work in 5.0
X-Send-Pr-Version: 3.95

>Number:         42661
>Category:       kern
>Synopsis:       Linux-emulated Veritas NetBackup fails to work in 5.0
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Jan 22 16:25:00 +0000 2010
>Closed-Date:    Wed Mar 31 12:11:33 +0000 2010
>Last-Modified:  Sat Jun 12 18:40:01 +0000 2010
>Originator:     Havard Eidnes
>Release:        NetBSD 5.0.1_PATCH
>Organization:
	NORDUnet AS
>Environment:


System: 
Architecture: i386
Machine: i386
>Description:
	Well, the basic problem is that Veritas NetBackup (which is
	only available in binary form, and we use the Linux version)
	fails to work in NetBSD 5.0.  It works fine in 4.0.

	Because we run a Linux binary, we need to take special steps
	to ensure that the entire /usr gets backed up, such that the
	backup of /usr/lib ends up with the NetBSD libraries and not
	the Linux-emulation libraries in /emul/linux/usr/lib instead.

	So...  Since we want to have all the file systems we should
	back up under a common root, we need to re-mount the relevant
	file systems somewhere, using some method.

	We have tried two methods:

	1) null mounts
	2) NFS mounts

	With null mounts in 4.0, we encountered a problem that after a
	few days of run-time, all kernel memory was consumed, and if
	my recollection is correct, it would basically seize up, so
	that manual intervention via DDB was required to bring it back
	to life.  We therefore looked at alternatives, and ended up
	with NFS mounts.

	We have re-tried the null mounts, but the unidentified memory
	leak problems appear to still be there in 5.0, so that's not a
	usable method.

	The NFS mount method has worked well in 4.0, but is giving us
	problems in 5.0.  After some debugging, we have found that one
	of the two "bpbkar" processes ends up in "uvn_fp2" wait, most
	probably while holding a lock, and fails to make any progress
	beyond that point.  New bpbkar processes (the backup server
	starts new ones on a schedule) then get stuck in "tstile"
	state, as do "df" processes, whether native or
	Linux-emulated.

	Our most recent attempt at rebooting also got stuck in tstile
	while unmounting one of the file systems, and here is some
	selected output from the console log:

Jan 22 16:29:10 mail-server shutdown: reboot by he: new kernel 
Jan 22 16:29:24 mail-server syslogd: Exiting on signal 15
syncing disks... 1 done
unmounting file systems...
unmounting /usr/pkg/emul/linux/netbackup/home (localhost:/home)...[halt sent]
fatal breakpoint trap in supervisor mode
trap type 1 code 0 eip c05b2ecc cs 8 eflags 202 cr2 bb906538 ilevel 8
Stopped in pid 0.2 (system) at  netbsd:breakpoint+0x4:  popl    %ebp
db{0}: ps      
PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
20756    1 3   1         4           e80c0d40             reboot tstile
6695     1 3   2   9020004           e4807580             bpbkar tstile
3952     1 3   2   9020004           e46bd280             bpbkar tstile
3081     1 3   2   9020004           e9408d20                 df tstile
2519     1 3   1   9020004           d89250c0                 df tstile
3006     1 3   1   9020004           d8898ca0                 df tstile
17026    1 3   2   9020004           e4807800             bpbkar nfsrcv
5421     1 3   1   9020004           d89a07a0             bpbkar uvn_fp2
1        1 3   2   8020084           ce3bc840               init wait
0       73 3   0       204           e94080a0             ktrace ktrwait
              72 3   0       204           e9d5eae0             ktrace ktrwait
              68 3   1       204           d4744300              nfsio netio
              67 3   2       204           d4744580              nfsio nfsrcv
              66 3   1       204           d4744800              nfsio nfsrcv
              65 3   0       204           d4744a80              nfsio nfsrcv

	(why did it suddenly start indenting the ps listing at that
	point?!?)

db{0}: trace/t 0t5421
trace: pid 5421 lid 1 at 0xd89c43cc
sleepq_block(0,0,c0aaba51,c0b27c80,0,c150a9ac,9,c2580910,da4a13a0,0) at netbsd:sleepq_block+0xeb
mtsleep(c2580910,204,c0aaba51,0,da4a13a0,da4a13a0,10,6,0,0) at netbsd:mtsleep+0x12d
uvn_findpage(d89c45ac,0,d89c44ac,c05343fa,0,0,2,0,994000,d89c45cc) at netbsd:uvn_findpage+0x92
uvn_findpages(da4a13a0,24e60000,2,d89c45ec,d89c45ac,0,994000,20,2,0) at netbsd:uvn_findpages+0x73
genfs_getpages(d89c46b0,0,0,0,0,24ed0000,0,0,2,d89c465c) at netbsd:genfs_getpages+0x743
nfs_getpages(d89c46b0,4,24e62000,2,0,10000,24ee0000,c089d600,da4a13a0,24e60000) at netbsd:nfs_getpages+0xbb
VOP_GETPAGES(da4a13a0,24e60000,2,d89c4750,d89c47c8,0,1,0,1802,0) at netbsd:VOP_GETPAGES+0x65
uvn_get(da4a13a0,24e60000,2,d89c4750,d89c47c8,0,1,0,1802,d89a07a0) at netbsd:uvn_get+0x117
ubc_fault(d89c48e0,d3981000,d89c48a0,1,0,1,42,246,8,c0bc8d04) at netbsd:ubc_fault+0x170
uvm_fault_internal(c0bc21c0,d3981000,1,0,c262cfca,c0000,0,c05a6cfa,6,6) at netbsd:uvm_fault_internal+0x3a9
trap() at netbsd:trap+0x797
--- trap (number 6) ---
copyout(d87906c0,d3981000,8249438,2000,d87906c0,0,d3981000,24e60000,2,d3981000) at netbsd:copyout+0x33
uiomove(d3981000,2000,d89c4c8c,d89c4adc,0,101,deaddead,0,1829b58,0) at netbsd:uiomove+0x62
ubc_uiomove(da4a13a0,d89c4c8c,10000,0,101,7c356d21,d89c4b2c,c085d206,da4945c0,da4a1440) at netbsd:ubc_uiomove+0xeb
nfs_bioread(da4a13a0,d89c4c8c,0,ce3a6f00,0,da4a13a0,d89c4c2c,c053d6f4,d89c4c14,da4a13a0) at netbsd:nfs_bioread+0x312
nfs_read(d89c4c14,da4a13a0,c089d3c0,da4a13a0,1,20001,d89c4c2c,c0534d58,c089ce80,da4a13a0) at netbsd:nfs_read+0x43
VOP_READ(da4a13a0,d89c4c8c,0,ce3a6f00,d40a1040,0,9c4c6c,16,10000,8249438) at netbsd:VOP_READ+0x44
vn_read(d8c4d940,d8c4d940,d89c4c8c,ce3a6f00,1,0,0,0,d89a07a0,d89c4d48) at netbsd:vn_read+0x93
dofileread(9,d8c4d940,8249438,10000,d8c4d940,1,d89c4d28,d89c4d48,d89c4d48,d89a07a0) at netbsd:dofileread+0x75
sys_read(d89a07a0,d89c4d10,d89c4d28,9c4d20,96,10,c0b4a744,9,8249438,10000) at netbsd:sys_read+0x6f
linux_syscall(d89c4d48,2b,2b,2b,2b,610,8259338,bfbeec08,9,10000) at netbsd:linux_syscall+0x9b
db{0}: 

	Now, inspection shows that the 5th argument to mtsleep is the
	mutex it sleeps on, and that it's usable with "show lock" in
	DDB:

db{0}: show lock 0xda4a13a0
lock address : 0x00000000da4a13a0 type     :     sleep/adaptive
initialized  : 0x00000000c052b9c6
shared holds :                  0 exclusive:                  0
shares wanted:                  0 exclusive:                  0
current cpu  :                  0 last held:                  1
current lwp  : 0x00000000ce3a7c80 last held: 000000000000000000
last locked  : 0x00000000c03d3f4c unlocked : 0x00000000c03d403b
owner field  : 000000000000000000 wait/spin:                0/0

Turnstile chain at 0xc150ba80.
=> No active turnstile for this lock.
db{0}: 

	The "last locked" and "unlocked" values are:

db{0}: x/i 0x00000000c03d3f4c
netbsd:nfs_sync+0x7c:   cmpl    $0x3,0xc(%ebp)
db{0}: x/i 0x00000000c03d403b
netbsd:nfs_sync+0x16b:  jmp     netbsd:nfs_sync+0x44
db{0}:

	Now, the way I read the "show lock" output, this lock is
	currently not held, while the "bpbkar" process is still
	waiting on it.  That may be the reason that process is not
	making any progress.
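
	A note on reading this output, offered as a hedged sketch and
	not as a diagnosis: as far as mtsleep(9) is documented, the
	mutex passed as its last argument is an interlock which is
	released while the LWP sleeps and re-taken on wakeup.  If that
	applies here, seeing the mutex unheld may simply be the normal
	sleeping state, with the process waiting on the wait channel
	given as the first argument.  The userland C analogue below
	uses pthread_cond_wait(), which has the same
	release-while-asleep behaviour.  The names interlock,
	page_chan, page_busy and waiter_parked are made up for
	illustration and do not come from the kernel source.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>
#include <stddef.h>

static pthread_mutex_t interlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t page_chan = PTHREAD_COND_INITIALIZER;
static bool page_busy;		/* condition the waiter sleeps on */
static bool waiter_parked;	/* set by the waiter just before it sleeps */

/* Plays the role of the sleeping process: waits until page_busy clears. */
static void *
waiter(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&interlock);
	waiter_parked = true;
	while (page_busy) {
		/* Releases interlock while asleep, re-takes it on wakeup. */
		pthread_cond_wait(&page_chan, &interlock);
	}
	pthread_mutex_unlock(&interlock);
	return NULL;
}

/*
 * Returns true once the caller has managed to take the interlock while
 * the waiter was asleep on the condition variable.
 */
bool
interlock_free_while_waiter_sleeps(void)
{
	pthread_t t;
	bool took_lock_during_sleep = false;
	int rc;

	page_busy = true;
	waiter_parked = false;
	rc = pthread_create(&t, NULL, waiter, NULL);
	assert(rc == 0);
	(void)rc;

	while (!took_lock_during_sleep) {
		pthread_mutex_lock(&interlock);
		if (waiter_parked) {
			/*
			 * We hold the mutex and the waiter has not been
			 * released yet, so it must be inside
			 * pthread_cond_wait() with the mutex dropped.
			 */
			took_lock_during_sleep = true;
			page_busy = false;
			pthread_cond_signal(&page_chan);
		}
		pthread_mutex_unlock(&interlock);
		sched_yield();
	}
	pthread_join(t, NULL);
	return took_lock_during_sleep;
}
```

	Calling interlock_free_while_waiter_sleeps() returns true: the
	mutex can be taken by someone else while the waiter is
	blocked.  An unowned lock in "show lock" would be consistent
	with that, in which case the interesting question becomes who
	should have woken the sleeper up.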

	As to the root cause of this problem, I have no idea, and
	would like further input to help narrow it down.



>How-To-Repeat:
	Try to use Linux-emulated Veritas NetBackup to back up
	NFS-mounted file systems, and watch it get stuck.

>Fix:
	Sorry, no idea -- request help for digging further.

>Release-Note:

>Audit-Trail:
From: Havard Eidnes <he@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/42661 CVS commit: src/sys/compat
Date: Wed, 3 Mar 2010 08:20:39 +0000

 Module Name:	src
 Committed By:	he
 Date:		Wed Mar  3 08:20:39 UTC 2010

 Modified Files:
 	src/sys/compat/common: vfs_syscalls_30.c
 	src/sys/compat/ibcs2: ibcs2_misc.c
 	src/sys/compat/irix: irix_dirent.c
 	src/sys/compat/linux/common: linux_file64.c linux_misc.c
 	src/sys/compat/linux32/common: linux32_dirent.c
 	src/sys/compat/osf1: osf1_file.c
 	src/sys/compat/sunos: sunos_misc.c
 	src/sys/compat/sunos32: sunos32_misc.c
 	src/sys/compat/svr4: svr4_misc.c
 	src/sys/compat/svr4_32: svr4_32_misc.c

 Log Message:
 When implementing "read directory", when there are too many empty entries
 in a row, and we need to try to read the next block, and have passed a
 non-NULL cookie pointer to VOP_READDIR, ensure that we free the cookie
 buffer before re-doing VOP_READDIR, so that we don't leak memory.
 This fix is similar to nfs_serv.c revisions 1.115 + 1.124.

 This should fix the long-standing problem observed by e.g. using Linux-
 emulated programs to take backup of servers, which is one of the problems
 which were reported in PR#42661.

 Thanks to pooka@ for the hints for traversing the VOP* layer.
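
 The retry logic described in this log message can be illustrated with
 a small userland sketch.  It is not the committed code:
 fake_readdir() is a hypothetical stand-in for VOP_READDIR which
 always allocates a fresh cookie buffer, and
 count_leaked_cookie_bufs() and its parameters are invented for the
 example.  The sketch only assumes what the log message states, namely
 that a block of empty entries causes a retry and that each read hands
 back a newly allocated cookie buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Times a still-allocated cookie buffer pointer was overwritten. */
static int overwritten;

/*
 * Hypothetical stand-in for VOP_READDIR: allocates a new cookie buffer
 * on every call and reports whether the block held a non-empty entry.
 */
static bool
fake_readdir(int block, int empty_blocks, long **cookies)
{
	if (*cookies != NULL) {
		/* In a kernel this buffer would now be unreachable. */
		overwritten++;
		free(*cookies);	/* freed here only to keep the demo clean */
	}
	*cookies = malloc(8 * sizeof(**cookies));
	assert(*cookies != NULL);
	return block >= empty_blocks;
}

/*
 * Reads blocks until one contains entries.  Returns how many cookie
 * buffers would have been leaked along the way.
 */
int
count_leaked_cookie_bufs(int empty_blocks, bool free_before_retry)
{
	long *cookies = NULL;
	int block = 0;

	overwritten = 0;
	while (!fake_readdir(block++, empty_blocks, &cookies)) {
		/* The whole block was empty: try the next one. */
		if (free_before_retry) {
			free(cookies);	/* the step the fix adds */
			cookies = NULL;
		}
	}
	free(cookies);
	return overwritten;
}
```

 With three empty blocks in a row, count_leaked_cookie_bufs(3, false)
 reports three lost buffers, while count_leaked_cookie_bufs(3, true)
 reports none.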


 To generate a diff of this commit:
 cvs rdiff -u -r1.30 -r1.31 src/sys/compat/common/vfs_syscalls_30.c
 cvs rdiff -u -r1.109 -r1.110 src/sys/compat/ibcs2/ibcs2_misc.c
 cvs rdiff -u -r1.23 -r1.24 src/sys/compat/irix/irix_dirent.c
 cvs rdiff -u -r1.49 -r1.50 src/sys/compat/linux/common/linux_file64.c
 cvs rdiff -u -r1.213 -r1.214 src/sys/compat/linux/common/linux_misc.c
 cvs rdiff -u -r1.9 -r1.10 src/sys/compat/linux32/common/linux32_dirent.c
 cvs rdiff -u -r1.37 -r1.38 src/sys/compat/osf1/osf1_file.c
 cvs rdiff -u -r1.165 -r1.166 src/sys/compat/sunos/sunos_misc.c
 cvs rdiff -u -r1.68 -r1.69 src/sys/compat/sunos32/sunos32_misc.c
 cvs rdiff -u -r1.148 -r1.149 src/sys/compat/svr4/svr4_misc.c
 cvs rdiff -u -r1.67 -r1.68 src/sys/compat/svr4_32/svr4_32_misc.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Soren Jacobsen <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/42661 CVS commit: [netbsd-5] src/sys/compat
Date: Wed, 17 Mar 2010 02:59:53 +0000

 Module Name:	src
 Committed By:	snj
 Date:		Wed Mar 17 02:59:53 UTC 2010

 Modified Files:
 	src/sys/compat/common [netbsd-5]: vfs_syscalls_30.c
 	src/sys/compat/ibcs2 [netbsd-5]: ibcs2_misc.c
 	src/sys/compat/irix [netbsd-5]: irix_dirent.c
 	src/sys/compat/linux/common [netbsd-5]: linux_file64.c linux_misc.c
 	src/sys/compat/linux32/common [netbsd-5]: linux32_dirent.c
 	src/sys/compat/sunos [netbsd-5]: sunos_misc.c
 	src/sys/compat/sunos32 [netbsd-5]: sunos32_misc.c
 	src/sys/compat/svr4 [netbsd-5]: svr4_misc.c
 	src/sys/compat/svr4_32 [netbsd-5]: svr4_32_misc.c

 Log Message:
 Pull up following revision(s) (requested by he in ticket #1323):
 	sys/compat/common/vfs_syscalls_30.c: revision 1.31
 	sys/compat/ibcs2/ibcs2_misc.c: revision 1.110
 	sys/compat/irix/irix_dirent.c: revision 1.24
 	sys/compat/linux/common/linux_file64.c: revision 1.50
 	sys/compat/linux/common/linux_misc.c: revision 1.214
 	sys/compat/linux32/common/linux32_dirent.c: revision 1.10
 	sys/compat/sunos/sunos_misc.c: revision 1.166
 	sys/compat/sunos32/sunos32_misc.c: revision 1.69
 	sys/compat/svr4/svr4_misc.c: revision 1.149
 	sys/compat/svr4_32/svr4_32_misc.c: revision 1.68
 When implementing "read directory", when there are too many empty entries
 in a row, and we need to try to read the next block, and have passed a
 non-NULL cookie pointer to VOP_READDIR, ensure that we free the cookie
 buffer before re-doing VOP_READDIR, so that we don't leak memory.
 This fix is similar to nfs_serv.c revisions 1.115 + 1.124.
 This should fix the long-standing problem observed by e.g. using Linux-
 emulated programs to take backup of servers, which is one of the problems
 which were reported in PR#42661.
 Thanks to pooka@ for the hints for traversing the VOP* layer.


 To generate a diff of this commit:
 cvs rdiff -u -r1.28 -r1.28.6.1 src/sys/compat/common/vfs_syscalls_30.c
 cvs rdiff -u -r1.104 -r1.104.6.1 src/sys/compat/ibcs2/ibcs2_misc.c
 cvs rdiff -u -r1.23 -r1.23.10.1 src/sys/compat/irix/irix_dirent.c
 cvs rdiff -u -r1.48 -r1.48.6.1 src/sys/compat/linux/common/linux_file64.c
 cvs rdiff -u -r1.201 -r1.201.6.1 src/sys/compat/linux/common/linux_misc.c
 cvs rdiff -u -r1.6 -r1.6.4.1 src/sys/compat/linux32/common/linux32_dirent.c
 cvs rdiff -u -r1.161 -r1.161.4.1 src/sys/compat/sunos/sunos_misc.c
 cvs rdiff -u -r1.62 -r1.62.4.1 src/sys/compat/sunos32/sunos32_misc.c
 cvs rdiff -u -r1.144 -r1.144.6.1 src/sys/compat/svr4/svr4_misc.c
 cvs rdiff -u -r1.63 -r1.63.6.1 src/sys/compat/svr4_32/svr4_32_misc.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->closed
State-Changed-By: he@NetBSD.org
State-Changed-When: Wed, 31 Mar 2010 12:11:33 +0000
State-Changed-Why:
This bug was fixed, and the fix was pulled up to the netbsd-5
branch, as recorded here.


From: Havard Eidnes <he@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/42661 CVS commit: src
Date: Mon, 5 Apr 2010 07:16:13 +0000

 Module Name:	src
 Committed By:	he
 Date:		Mon Apr  5 07:16:13 UTC 2010

 Modified Files:
 	src/sys/kern: kern_malloc.c
 	src/sys/sys: mallocvar.h param.h
 	src/usr.bin/vmstat: vmstat.c

 Log Message:
 Extend struct malloc_type to count the number of active allocations
 per size, and make vmstat report this information under the "Memory
 statistics by type" display, which is only printed when the kernel
 has been compiled with KMEMSTATS defined, like this:

 Memory statistics by type                                Type  Kern
            Type InUse  MemUse HighUse   Limit   Requests Limit Limit Size(s)
           wapbl    15   4192K   4192K  78644K     376426     0     0 32:0,256:3,512:6,131072:1,262144:2,524288:3

 Since struct malloc_type is user-visible and is changed, bump kernel
 revision to 5.99.26.

 While it is true that malloc(9) is in general on the path of slowly
 being replaced by kmem(9) (kmem_alloc/kmem_free), there remains a
 lot of points of usage of malloc/free, and this could aid in finding
 any leaks.  (It helped finding the leak fixed in PR#42661.)

 This was discussed with and somewhat hesitantly OKed by rmind@
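
 As a rough illustration of the per-size counting described in this
 log message, here is a self-contained userland sketch.  It does not
 reproduce kern_malloc.c: struct type_stats, the stats_*() functions,
 the 16-byte minimum bucket and the bucket count are all assumptions
 made for the example.  It only mirrors the idea of keeping one
 active-allocation counter per power-of-two size and printing them in
 the "size:count" form shown in the vmstat output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MINBUCKET 4	/* smallest bucket is 1 << 4 = 16 bytes (assumed) */
#define NBUCKETS  20	/* largest bucket is 1 << 23 bytes (assumed) */

struct type_stats {
	unsigned long used;		/* bitmask of buckets ever used */
	unsigned active[NBUCKETS];	/* live allocations per bucket */
};

/* Index of the smallest power-of-two bucket that fits size. */
static int
bucket_of(size_t size)
{
	size_t cap = (size_t)1 << MINBUCKET;
	int b = 0;

	while (cap < size) {
		cap <<= 1;
		b++;
	}
	return b;
}

void
stats_init(struct type_stats *ts)
{
	memset(ts, 0, sizeof(*ts));
}

void
stats_alloc(struct type_stats *ts, size_t size)
{
	int b = bucket_of(size);

	assert(b < NBUCKETS);
	ts->used |= 1UL << b;
	ts->active[b]++;
}

void
stats_free(struct type_stats *ts, size_t size)
{
	int b = bucket_of(size);

	assert(b < NBUCKETS && ts->active[b] > 0);
	ts->active[b]--;
}

/* Writes "size:count,size:count,..." for every bucket ever used. */
size_t
stats_format(const struct type_stats *ts, char *buf, size_t len)
{
	size_t off = 0;

	assert(len > 0);
	buf[0] = '\0';
	for (int b = 0; b < NBUCKETS; b++) {
		if ((ts->used & (1UL << b)) == 0)
			continue;
		int n = snprintf(buf + off, len - off, "%s%lu:%u",
		    off ? "," : "", 1UL << (b + MINBUCKET), ts->active[b]);
		if (n < 0 || (size_t)n >= len - off)
			break;	/* stop on error or truncation */
		off += (size_t)n;
	}
	return off;
}
```

 A size whose allocations have all been freed still shows up with a
 count of zero, as "32:0" does in the sample output, so a size whose
 count keeps growing across runs stands out as a leak candidate.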


 To generate a diff of this commit:
 cvs rdiff -u -r1.128 -r1.129 src/sys/kern/kern_malloc.c
 cvs rdiff -u -r1.7 -r1.8 src/sys/sys/mallocvar.h
 cvs rdiff -u -r1.360 -r1.361 src/sys/sys/param.h
 cvs rdiff -u -r1.166 -r1.167 src/usr.bin/vmstat/vmstat.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Jeff Rizzo <riz@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/42661 CVS commit: [netbsd-4] src/sys/compat
Date: Sat, 12 Jun 2010 18:38:02 +0000

 Module Name:	src
 Committed By:	riz
 Date:		Sat Jun 12 18:38:02 UTC 2010

 Modified Files:
 	src/sys/compat/common [netbsd-4]: vfs_syscalls_30.c
 	src/sys/compat/ibcs2 [netbsd-4]: ibcs2_misc.c
 	src/sys/compat/irix [netbsd-4]: irix_dirent.c
 	src/sys/compat/linux/common [netbsd-4]: linux_file64.c linux_misc.c
 	src/sys/compat/sunos [netbsd-4]: sunos_misc.c
 	src/sys/compat/sunos32 [netbsd-4]: sunos32_misc.c
 	src/sys/compat/svr4 [netbsd-4]: svr4_misc.c
 	src/sys/compat/svr4_32 [netbsd-4]: svr4_32_misc.c

 Log Message:
 Pull up following revision(s) (requested by he in ticket #1387):
 	sys/compat/svr4/svr4_misc.c: revision 1.149
 	sys/compat/linux/common/linux_misc.c: revision 1.214
 	sys/compat/common/vfs_syscalls_30.c: revision 1.31
 	sys/compat/sunos/sunos_misc.c: revision 1.166
 	sys/compat/linux/common/linux_file64.c: revision 1.50
 	sys/compat/svr4_32/svr4_32_misc.c: revision 1.68
 	sys/compat/ibcs2/ibcs2_misc.c: revision 1.110
 	sys/compat/linux32/common/linux32_dirent.c: revision 1.10
 	sys/compat/sunos32/sunos32_misc.c: revision 1.69
 	sys/compat/irix/irix_dirent.c: revision 1.24
 	sys/compat/osf1/osf1_file.c: revision 1.38
 When implementing "read directory", when there are too many empty entries
 in a row, and we need to try to read the next block, and have passed a
 non-NULL cookie pointer to VOP_READDIR, ensure that we free the cookie
 buffer before re-doing VOP_READDIR, so that we don't leak memory.
 This fix is similar to nfs_serv.c revisions 1.115 + 1.124.
 This should fix the long-standing problem observed by e.g. using Linux-
 emulated programs to take backup of servers, which is one of the problems
 which were reported in PR#42661.
 Thanks to pooka@ for the hints for traversing the VOP* layer.


 To generate a diff of this commit:
 cvs rdiff -u -r1.18 -r1.18.2.1 src/sys/compat/common/vfs_syscalls_30.c
 cvs rdiff -u -r1.81 -r1.81.2.1 src/sys/compat/ibcs2/ibcs2_misc.c
 cvs rdiff -u -r1.16 -r1.16.18.1 src/sys/compat/irix/irix_dirent.c
 cvs rdiff -u -r1.34 -r1.34.8.1 src/sys/compat/linux/common/linux_file64.c
 cvs rdiff -u -r1.165.2.2 -r1.165.2.3 src/sys/compat/linux/common/linux_misc.c
 cvs rdiff -u -r1.143 -r1.143.2.1 src/sys/compat/sunos/sunos_misc.c
 cvs rdiff -u -r1.42 -r1.42.2.1 src/sys/compat/sunos32/sunos32_misc.c
 cvs rdiff -u -r1.121 -r1.121.2.1 src/sys/compat/svr4/svr4_misc.c
 cvs rdiff -u -r1.39 -r1.39.2.1 src/sys/compat/svr4_32/svr4_32_misc.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

>Unformatted:
