NetBSD Problem Report #37236
From tron@zhadum.org.uk Sat Oct 27 16:59:39 2007
Return-Path: <tron@zhadum.org.uk>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by narn.NetBSD.org (Postfix) with ESMTP id 9228A63BA35
for <gnats-bugs@gnats.NetBSD.org>; Sat, 27 Oct 2007 16:59:39 +0000 (UTC)
Message-Id: <200710271659.l9RGxZqq001130@colwyn.zhadum.org.uk>
Date: Sat, 27 Oct 2007 17:59:36 +0100 (BST)
From: tron@zhadum.org.uk
Reply-To: tron@zhadum.org.uk
To: gnats-bugs@NetBSD.org
Subject: Mac OS X NFS client frequently crashes rpc.lockd(8) on NetBSD
X-Send-Pr-Version: 3.95
>Number: 37236
>Category: bin
>Synopsis: Mac OS X NFS client frequently crashes rpc.lockd(8) on NetBSD
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: tron
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Oct 27 17:00:00 +0000 2007
>Closed-Date: Sat Nov 10 16:38:05 +0000 2007
>Last-Modified: Mon Nov 19 21:15:07 +0000 2007
>Originator: Matthias Scheler
>Release: NetBSD 4.0_RC3
>Organization:
Matthias Scheler http://zhadum.org.uk/
>Environment:
System: NetBSD colwyn.zhadum.org.uk 4.0_RC3 NetBSD 4.0_RC3 (COLWYN) #0: Mon Oct 22 08:33:43 BST 2007 tron@colwyn.zhadum.org.uk:/src/sys/compile/COLWYN i386
Architecture: i386
Machine: i386
>Description:
My desktop is a Mac OS X (PowerPC, 10.4.10) machine. My NetBSD server provides
accounts via LDAP and home directories via NFS to the Mac. After I moved the
"Library" directory of my personal account from the Mac's local hard disk
back to the NFS server (which now has more disk space) the Mac started
complaining about NFS locking problems frequently. The symptoms are the
same ones John D. Baker described over a year ago:
http://mail-index.netbsd.org/port-sparc/2006/05/21/0001.html
Running "/etc/rc.d/nfslocking restart" on the NetBSD system fixes
the problem for a few hours. The problem is caused by "rpc.lockd"
crashing. I've managed to get a crash dump. Here is the stack trace:
#0 0xbbb825b8 in strcmp () from /usr/lib/libc.so.12
#1 0x0804c8ef in unlock ()
#2 0x0804b4d8 in nlm4_unlock_msg_4_svc ()
#3 0x0804995d in nlm_prog_4 ()
#4 0xbbb3ef48 in svc_getreq_common () from /usr/lib/libc.so.12
#5 0xbbb3f04f in svc_getreqset () from /usr/lib/libc.so.12
#6 0xbbae368b in svc_run () from /usr/lib/libc.so.12
#7 0x0804a434 in main ()
Here is the register dump:
eax 0x8051320 134550304
ecx 0x0 0
edx 0x6602024c 1711407692
ebx 0x66020248 1711407688
esp 0xbfbfe794 0xbfbfe794
ebp 0xbfbfe7e8 0xbfbfe7e8
esi 0x805131c 134550300
edi 0xbfbfe86c -1077942164
eip 0xbbb825b8 0xbbb825b8 <strcmp+48>
eflags 0x10212 [ AF IF RF ]
cs 0x17 23
ss 0x1f 31
ds 0x1f 31
es 0x1f 31
fs 0x1f 31
gs 0x1f 31
If I understand the assembler code of strcmp() correctly, %ebx should
point to a valid memory address but apparently doesn't. Looking at the
code of unlock() ...
if (strcmp(fl->client_name, lck->caller_name) ||
fhcmp(&filehandle, &fl->filehandle) != 0 ||
... I would guess that "fl->client_name" hasn't been initialized properly.
The lalloc() function could be causing this:
static struct file_lock *
lalloc(void)
{
struct file_lock *fl;
fl = malloc(sizeof(*fl));
if (fl != NULL) {
fl->addr = NULL;
fl->client.oh.n_bytes = NULL;
fl->client_cookie.n_bytes = NULL;
fl->filehandle.fhdata = NULL;
}
return fl;
}
Why was this function written at all? "calloc(1, sizeof(struct file_lock))"
would IMHO do the job much better.
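A minimal sketch of the calloc()-based allocation suggested above. The
structure below is a simplified, hypothetical stand-in for the real
"struct file_lock" in lockd_lock.c: only the member names mentioned in this
report come from the source, and their types (including the size of
"client_name") are assumptions. The "_sketch" names are illustrative.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the netobj type used by the NLM code. */
struct netobj_sketch {
	unsigned int n_len;
	char *n_bytes;
};

/* Hypothetical, simplified stand-in for the real struct file_lock. */
struct file_lock_sketch {
	void *addr;
	struct netobj_sketch oh;
	struct netobj_sketch client_cookie;
	char client_name[128];	/* assumed type and size */
	char *fhdata;
};

/*
 * Allocate a zero-filled lock. Unlike the original lalloc(), which only
 * cleared four pointers, this leaves no member uninitialised.
 */
static struct file_lock_sketch *
lalloc_sketch(void)
{
	return calloc(1, sizeof(struct file_lock_sketch));
}
```

calloc() returns memory with all bits zero, which reads back as NULL
pointers and empty strings on the platforms NetBSD runs on.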
>How-To-Repeat:
Run Firefox 2.0.0.x under Mac OS X with an NFS mounted home directory.
>Fix:
Not known.
>Release-Note:
>Audit-Trail:
From: Christos Zoulas <christos@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: PR/37236 CVS commit: src/usr.sbin/rpc.lockd
Date: Sat, 27 Oct 2007 18:41:55 +0000 (UTC)
Module Name: src
Committed By: christos
Date: Sat Oct 27 18:41:55 UTC 2007
Modified Files:
src/usr.sbin/rpc.lockd: lockd_lock.c
Log Message:
PR/37236: Matthias Scheler: Mac OS X NFS client frequently crashes rpc.lockd(8)
on NetBSD. Use calloc to allocate the lock as suggested in the PR.
To generate a diff of this commit:
cvs rdiff -r1.26 -r1.27 src/usr.sbin/rpc.lockd/lockd_lock.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Matthias Scheler <tron@zhadum.org.uk>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: PR/37236 CVS commit: src/usr.sbin/rpc.lockd
Date: Sat, 27 Oct 2007 21:45:25 +0100
On Sat, Oct 27, 2007 at 06:45:02PM +0000, Christos Zoulas wrote:
> From: Christos Zoulas <christos@netbsd.org>
> To: gnats-bugs@NetBSD.org
> Cc:
> Subject: PR/37236 CVS commit: src/usr.sbin/rpc.lockd
> Date: Sat, 27 Oct 2007 18:41:55 +0000 (UTC)
>
> Module Name: src
> Committed By: christos
> Date: Sat Oct 27 18:41:55 UTC 2007
>
> Modified Files:
> src/usr.sbin/rpc.lockd: lockd_lock.c
>
> Log Message:
> PR/37236: Matthias Scheler: Mac OS X NFS client frequently crashes rpc.lockd(8)
> on NetBSD. Use calloc to allocate the lock as suggested in the PR.
>
>
> To generate a diff of this commit:
> cvs rdiff -r1.26 -r1.27 src/usr.sbin/rpc.lockd/lockd_lock.c
Thanks for the commit. But Martin Husemann guessed correctly that this
doesn't fix the problem. I tried a binary with that change and it
crashed again. But I have a better core dump this time (the line
numbers refer to a "netbsd-4" branch source with lalloc() removed):
#0 0xbbb825b8 in strcmp () from /usr/lib/libc.so.12
#1 0x0804c91f in unlock (lck=0xbfbfe0d4, flags=2) at lockd_lock.c:386
#2 0x0804b508 in nlm4_unlock_msg_4_svc (arg=0xbfbfe0cc, rqstp=0xbfbfe188)
at lock_proc.c:1044
#3 0x0804998d in nlm_prog_4 (rqstp=0xbfbfe188, transp=0x8063080)
at nlm_prot_svc.c:469
#4 0xbbb3ef48 in svc_getreq_common () from /usr/lib/libc.so.12
#5 0xbbb3f04f in svc_getreqset () from /usr/lib/libc.so.12
#6 0xbbae368b in svc_run () from /usr/lib/libc.so.12
#7 0x0804a464 in main (argc=Cannot access memory at address 0x0
Looking at unlock() reveals why it crashed:
(gdb) up
#1 0x0804c91f in unlock (lck=0xbfbfe0d4, flags=2) at lockd_lock.c:386
386 if (strcmp(fl->client_name, lck->caller_name) ||
(gdb) print *fl
Cannot access memory at address 0x202
(gdb) print fl
$1 = (struct file_lock *) 0x202
(gdb) print lck
$2 = (nlm4_lock *) 0xbfbfe0d4
(gdb) print *lck
$3 = {caller_name = 0x8051320 "excalibur.zhadum.org.uk", fh = {n_len = 28,
n_bytes = 0x8051340 ""}, oh = {n_len = 8, n_bytes = 0x8062210 ""},
svid = 251, l_offset = 0, l_len = 0}
It seems that the list of locks got corrupted:
(gdb) print lcklst_head
$1 = {lh_first = 0x60030210}
(gdb) print *(struct file_lock *)0x60030210
Cannot access memory at address 0x60030210
Kind regards
--
Matthias Scheler http://zhadum.org.uk/
From: Matthias Scheler <tron@zhadum.org.uk>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: PR/37236 CVS commit: src/usr.sbin/rpc.lockd
Date: Wed, 31 Oct 2007 21:55:26 +0000
On Sat, Oct 27, 2007 at 09:45:25PM +0100, Matthias Scheler wrote:
> It seems that the list of locks got corrupted:
>
> (gdb) print lcklst_head
> $1 = {lh_first = 0x60030210}
> (gdb) print *(struct file_lock *)0x60030210
> Cannot access memory at address 0x60030210
I had another crash with an unmodified version of "rpc.lockd":
(gdb) where
#0 0xbbb825b8 in strcmp () from /usr/lib/libc.so.12
#1 0x0804c8ef in unlock (lck=0xbfbfe834, flags=2) at lockd_lock.c:387
#2 0x0804b4d8 in nlm4_unlock_msg_4_svc (arg=0xbfbfe82c, rqstp=0xbfbfe8e8)
at lock_proc.c:1044
#3 0x0804995d in nlm_prog_4 (rqstp=0xbfbfe8e8, transp=0x8063080)
at nlm_prot_svc.c:469
#4 0xbbb3ef48 in svc_getreq_common () from /usr/lib/libc.so.12
#5 0xbbb3f04f in svc_getreqset () from /usr/lib/libc.so.12
#6 0xbbae368b in svc_run () from /usr/lib/libc.so.12
#7 0x0804a434 in main (argc=Cannot access memory at address 0x0
) at lockd.c:211
(gdb) up
#1 0x0804c8ef in unlock (lck=0xbfbfe834, flags=2) at lockd_lock.c:387
387 if (strcmp(fl->client_name, lck->caller_name) ||
(gdb) print fl
$1 = (struct file_lock *) 0x0
This time the loop in unlock() was executed although fl is NULL. The
crash looks like a race between sigchild_handler() and one of the
dispatch procedures. That should however not happen because of the
calls to siglock() and sigunlock().
Kind regards
--
Matthias Scheler http://zhadum.org.uk/
From: Matthias Scheler <tron@zhadum.org.uk>
To: NetBSD GNATS <gnats-bugs@NetBSD.org>
Cc:
Subject: Re: PR/37236 CVS commit: src/usr.sbin/rpc.lockd
Date: Sun, 4 Nov 2007 19:52:50 +0000
On Thu, Nov 01, 2007 at 03:00:14PM +0000, Matthias Scheler wrote:
> This time the loop in unlock() was executed although fl is NULL. The
> crash looks like a race between sigchild_handler() and one of the
> dispatch procedures. That should however not happen because of the
> calls to siglock() and sigunlock().
I've changed lalloc() to add redzones before and after each "struct file_lock"
and added various checks, e.g. in lfree() and in each LIST_FOREACH loop, that
verify the redzones. "rpc.lockd" crashed again with this stack trace:
#0 0x0804bc39 in get_alloc (fl=0x8b030210) at lockd_lock.c:470
470 assert(memcmp(fla->redzone_head, redzone_head_pattern,
(gdb) where
#0 0x0804bc39 in get_alloc (fl=0x8b030210) at lockd_lock.c:470
#1 0x0804cb82 in unlock (lck=0xbfbfe0e4, flags=2) at lockd_lock.c:413
#2 0x0804b518 in nlm4_unlock_msg_4_svc (arg=0xbfbfe0dc, rqstp=0xbfbfe198)
at lock_proc.c:1044
#3 0x0804999d in nlm_prog_4 (rqstp=0xbfbfe198, transp=0x8063080)
at nlm_prot_svc.c:469
#4 0xbbb3ef48 in svc_getreq_common () from /usr/lib/libc.so.12
#5 0xbbb3f04f in svc_getreqset () from /usr/lib/libc.so.12
#6 0xbbae368b in svc_run () from /usr/lib/libc.so.12
#7 0x0804a474 in main (argc=Cannot access memory at address 0x20
) at lockd.c:211
The reason is heap corruption:
(gdb) print lcklst_head
$2 = {lh_first = 0x8b030210}
(gdb) print *(struct file_lock *)0x8b030210
Cannot access memory at address 0x8b030210
(gdb) print hostlst_head
$3 = {lh_first = 0x76b5bb51}
(gdb) print *(struct host *)0x76b5bb51
Cannot access memory at address 0x76b5bb51
This rules out the theory about a race condition. These pointers are
completely invalid.
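The redzone instrumentation described above can be sketched roughly as
follows. This is a reconstruction under assumptions, not the debugging
patch itself: the payload structure, the redzone size, the fill patterns
and all "_sketch" names are hypothetical. Only the idea of a get_alloc()
helper and a "redzone_head" member is taken from the stack trace.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define REDZONE_SIZE	16	/* assumed size */
#define HEAD_PATTERN	0xA5	/* assumed fill byte */
#define TAIL_PATTERN	0x5A	/* assumed fill byte */

/* Hypothetical payload standing in for struct file_lock. */
struct payload_sketch {
	int fd;
};

/* Allocation wrapper: guard bytes before and after the payload. */
struct alloc_sketch {
	unsigned char redzone_head[REDZONE_SIZE];
	struct payload_sketch fl;
	unsigned char redzone_tail[REDZONE_SIZE];
};

/* Allocate a payload surrounded by freshly filled redzones. */
static struct payload_sketch *
lalloc_redzone_sketch(void)
{
	struct alloc_sketch *fla = calloc(1, sizeof(*fla));

	if (fla == NULL)
		return NULL;
	memset(fla->redzone_head, HEAD_PATTERN, sizeof(fla->redzone_head));
	memset(fla->redzone_tail, TAIL_PATTERN, sizeof(fla->redzone_tail));
	return &fla->fl;
}

/* Recover the wrapper from a payload pointer. */
static struct alloc_sketch *
get_alloc_sketch(struct payload_sketch *fl)
{
	return (struct alloc_sketch *)
	    ((char *)fl - offsetof(struct alloc_sketch, fl));
}

/* Return 1 if both redzones still hold their patterns, 0 otherwise. */
static int
redzones_intact_sketch(struct payload_sketch *fl)
{
	const struct alloc_sketch *fla = get_alloc_sketch(fl);
	size_t i;

	for (i = 0; i < REDZONE_SIZE; i++) {
		if (fla->redzone_head[i] != HEAD_PATTERN ||
		    fla->redzone_tail[i] != TAIL_PATTERN)
			return 0;
	}
	return 1;
}

/* Free the whole wrapper, not just the payload. */
static void
lfree_redzone_sketch(struct payload_sketch *fl)
{
	free(get_alloc_sketch(fl));
}
```

A check like this detects writes that run off either end of an
allocation. It cannot help when the list head itself is overwritten,
which is what the crash above shows: the check faults while
dereferencing the bogus pointer.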
Kind regards
--
Matthias Scheler http://zhadum.org.uk/
From: Matthias Scheler <tron@zhadum.org.uk>
To: NetBSD GNATS <gnats-bugs@NetBSD.org>
Cc:
Subject: Re: PR/37236 CVS commit: src/usr.sbin/rpc.lockd
Date: Sun, 4 Nov 2007 19:54:37 +0000
On Sun, Nov 04, 2007 at 07:52:50PM +0000, Matthias Scheler wrote:
> The reason is heap corruption:
>
> (gdb) print lcklst_head
> $2 = {lh_first = 0x8b030210}
> (gdb) print *(struct file_lock *)0x8b030210
> Cannot access memory at address 0x8b030210
> (gdb) print hostlst_head
> $3 = {lh_first = 0x76b5bb51}
> (gdb) print *(struct host *)0x76b5bb51
> Cannot access memory at address 0x76b5bb51
>
> This rules out the theory about a race condition. These pointers are
> completely invalid.
I hope that I found the problem. From "lock_proc.c":
static struct sockaddr_storage clnt_cache_addr[CLIENT_CACHE_SIZE];
[...]
/* Success - update the cache entry */
clnt_cache_ptr[clnt_cache_next_to_use] = client;
memcpy(&clnt_cache_addr[clnt_cache_next_to_use], host_addr,
host_addr->sa_len);
clnt_cache_time[clnt_cache_next_to_use] = time_now.tv_sec;
if (++clnt_cache_next_to_use > CLIENT_CACHE_SIZE)
clnt_cache_next_to_use = 0;
[...]
rpc.lockd(8) will happily overflow the clnt_cache_addr array by one.
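To spell out the off-by-one: with the ">" comparison the index may reach
CLIENT_CACHE_SIZE itself, which is one past the last valid element, before
it wraps. Below is a minimal sketch of the wrap-around logic with the
comparison changed to ">=". Whether revision 1.8 of lock_proc.c uses
exactly this form is an assumption; the value of the constant and the
"_sketch" function names are illustrative only.

```c
#define CLIENT_CACHE_SIZE 64	/* assumed value, for illustration only */

/* Original logic: lets the index reach CLIENT_CACHE_SIZE before wrapping. */
static int
advance_buggy_sketch(int next)
{
	if (++next > CLIENT_CACHE_SIZE)
		next = 0;
	return next;
}

/* Corrected logic: the index stays within 0 .. CLIENT_CACHE_SIZE - 1. */
static int
advance_fixed_sketch(int next)
{
	if (++next >= CLIENT_CACHE_SIZE)
		next = 0;
	return next;
}
```

Starting from the last valid index, the buggy variant returns
CLIENT_CACHE_SIZE, so the next memcpy() writes a whole
"struct sockaddr_storage" behind the end of the array. The fixed variant
returns 0.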
Kind regards
--
Matthias Scheler http://zhadum.org.uk/
From: Matthias Scheler <tron@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: PR/37236 CVS commit: src/usr.sbin/rpc.lockd
Date: Sun, 4 Nov 2007 19:59:55 +0000 (UTC)
Module Name: src
Committed By: tron
Date: Sun Nov 4 19:59:54 UTC 2007
Modified Files:
src/usr.sbin/rpc.lockd: lock_proc.c
Log Message:
Fix off-by-one error accessing "clnt_cache_addr" array which causes heap
corruption. This will hopefully fix PR bin/37236.
To generate a diff of this commit:
cvs rdiff -r1.7 -r1.8 src/usr.sbin/rpc.lockd/lock_proc.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
Responsible-Changed-From-To: bin-bug-people->tron
Responsible-Changed-By: tron@netbsd.org
Responsible-Changed-When: Sun, 04 Nov 2007 20:04:08 +0000
Responsible-Changed-Why:
I'll handle this PR.
State-Changed-From-To: open->analyzed
State-Changed-By: tron@netbsd.org
State-Changed-When: Sun, 04 Nov 2007 20:04:08 +0000
State-Changed-Why:
The bug is hopefully well understood now:
1.) My latest crashes indicate heap corruption of "lcklst_head".
2.) "clnt_cache_addr" is exactly in front of "lcklst_head".
3.) There was a boundary violation of "clnt_cache_addr" in "lock_proc.c"
at line 240.
From: Juan Romero Pardines <xtraeme@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: PR/37236 CVS commit: [netbsd-4] src/usr.sbin/rpc.lockd
Date: Sun, 4 Nov 2007 21:27:13 +0000 (UTC)
Module Name: src
Committed By: xtraeme
Date: Sun Nov 4 21:27:13 UTC 2007
Modified Files:
src/usr.sbin/rpc.lockd [netbsd-4]: lock_proc.c
Log Message:
Pull up following revision(s) (requested by tron in ticket #976):
usr.sbin/rpc.lockd/lock_proc.c: revision 1.8
Fix off-by-one error accessing "clnt_cache_addr" array which causes heap
corruption. This will hopefully fix PR bin/37236.
To generate a diff of this commit:
cvs rdiff -r1.7 -r1.7.18.1 src/usr.sbin/rpc.lockd/lock_proc.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: analyzed->closed
State-Changed-By: tron@netbsd.org
State-Changed-When: Sat, 10 Nov 2007 16:38:05 +0000
State-Changed-Why:
rpc.lockd(8) hasn't crashed in almost a week. It looks like the problem is fixed.
From: Manuel Bouyer <bouyer@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: PR/37236 CVS commit: [netbsd-2-0] src/usr.sbin/rpc.lockd
Date: Mon, 19 Nov 2007 20:33:42 +0000 (UTC)
Module Name: src
Committed By: bouyer
Date: Mon Nov 19 20:33:42 UTC 2007
Modified Files:
src/usr.sbin/rpc.lockd [netbsd-2-0]: lock_proc.c
Log Message:
Pull up following revision(s) (requested by tron in ticket #11390):
usr.sbin/rpc.lockd/lock_proc.c: revision 1.8
Fix off-by-one error accessing "clnt_cache_addr" array which causes heap
corruption. This will hopefully fix PR bin/37236.
To generate a diff of this commit:
cvs rdiff -r1.7 -r1.7.4.1 src/usr.sbin/rpc.lockd/lock_proc.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Manuel Bouyer <bouyer@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: PR/37236 CVS commit: [netbsd-2-1] src/usr.sbin/rpc.lockd
Date: Mon, 19 Nov 2007 20:34:43 +0000 (UTC)
Module Name: src
Committed By: bouyer
Date: Mon Nov 19 20:34:42 UTC 2007
Modified Files:
src/usr.sbin/rpc.lockd [netbsd-2-1]: lock_proc.c
Log Message:
Pull up following revision(s) (requested by tron in ticket #11390):
usr.sbin/rpc.lockd/lock_proc.c: revision 1.8
Fix off-by-one error accessing "clnt_cache_addr" array which causes heap
corruption. This will hopefully fix PR bin/37236.
To generate a diff of this commit:
cvs rdiff -r1.7 -r1.7.10.1 src/usr.sbin/rpc.lockd/lock_proc.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Manuel Bouyer <bouyer@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: PR/37236 CVS commit: [netbsd-2] src/usr.sbin/rpc.lockd
Date: Mon, 19 Nov 2007 20:35:42 +0000 (UTC)
Module Name: src
Committed By: bouyer
Date: Mon Nov 19 20:35:42 UTC 2007
Modified Files:
src/usr.sbin/rpc.lockd [netbsd-2]: lock_proc.c
Log Message:
Pull up following revision(s) (requested by tron in ticket #11390):
usr.sbin/rpc.lockd/lock_proc.c: revision 1.8
Fix off-by-one error accessing "clnt_cache_addr" array which causes heap
corruption. This will hopefully fix PR bin/37236.
To generate a diff of this commit:
cvs rdiff -r1.7 -r1.7.6.1 src/usr.sbin/rpc.lockd/lock_proc.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Manuel Bouyer <bouyer@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: PR/37236 CVS commit: [netbsd-3-0] src/usr.sbin/rpc.lockd
Date: Mon, 19 Nov 2007 21:09:38 +0000 (UTC)
Module Name: src
Committed By: bouyer
Date: Mon Nov 19 21:09:38 UTC 2007
Modified Files:
src/usr.sbin/rpc.lockd [netbsd-3-0]: lock_proc.c
Log Message:
Pull up following revision(s) (requested by tron in ticket #1873):
usr.sbin/rpc.lockd/lock_proc.c: revision 1.8
Fix off-by-one error accessing "clnt_cache_addr" array which causes heap
corruption. This will hopefully fix PR bin/37236.
To generate a diff of this commit:
cvs rdiff -r1.7 -r1.7.12.1 src/usr.sbin/rpc.lockd/lock_proc.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Manuel Bouyer <bouyer@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: PR/37236 CVS commit: [netbsd-3-1] src/usr.sbin/rpc.lockd
Date: Mon, 19 Nov 2007 21:10:45 +0000 (UTC)
Module Name: src
Committed By: bouyer
Date: Mon Nov 19 21:10:45 UTC 2007
Modified Files:
src/usr.sbin/rpc.lockd [netbsd-3-1]: lock_proc.c
Log Message:
Pull up following revision(s) (requested by tron in ticket #1873):
usr.sbin/rpc.lockd/lock_proc.c: revision 1.8
Fix off-by-one error accessing "clnt_cache_addr" array which causes heap
corruption. This will hopefully fix PR bin/37236.
To generate a diff of this commit:
cvs rdiff -r1.7 -r1.7.16.1 src/usr.sbin/rpc.lockd/lock_proc.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Manuel Bouyer <bouyer@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: PR/37236 CVS commit: [netbsd-3] src/usr.sbin/rpc.lockd
Date: Mon, 19 Nov 2007 21:11:52 +0000 (UTC)
Module Name: src
Committed By: bouyer
Date: Mon Nov 19 21:11:52 UTC 2007
Modified Files:
src/usr.sbin/rpc.lockd [netbsd-3]: lock_proc.c
Log Message:
Pull up following revision(s) (requested by tron in ticket #1873):
usr.sbin/rpc.lockd/lock_proc.c: revision 1.8
Fix off-by-one error accessing "clnt_cache_addr" array which causes heap
corruption. This will hopefully fix PR bin/37236.
To generate a diff of this commit:
cvs rdiff -r1.7 -r1.7.8.1 src/usr.sbin/rpc.lockd/lock_proc.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
>Unformatted:
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.