NetBSD Problem Report #32318

From bouyer@chassiron.antioche.eu.org  Fri Dec 16 18:56:26 2005
Return-Path: <bouyer@chassiron.antioche.eu.org>
Received: from chassiron.antioche.eu.org (bouyer.net1.nerim.net [62.212.96.44])
	by narn.netbsd.org (Postfix) with ESMTP id 87F2163B88D
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 16 Dec 2005 18:56:25 +0000 (UTC)
Message-Id: <200512161856.jBGIuLnr006510@chassiron.antioche.eu.org>
Date: Fri, 16 Dec 2005 19:56:22 +0100 (MET)
From: bouyer@antioche.eu.org (Manuel Bouyer)
Reply-To: bouyer@antioche.eu.org
To: gnats-bugs@netbsd.org
Subject: NFS client or server hang
X-Send-Pr-Version: 3.95

>Number:         32318
>Category:       kern
>Synopsis:       NFS client or server hang
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    yamt
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Dec 16 19:00:01 +0000 2005
>Closed-Date:    Sat Feb 11 13:04:09 +0000 2012
>Last-Modified:  Sun Feb 12 13:50:21 +0000 2012
>Originator:     Manuel Bouyer
>Release:        NetBSD 3.0_RC3
>Organization:
>Environment:
System: NetBSD chassiron.antioche.eu.org 3.0_RC3 NetBSD 3.0_RC3 (CHASSIRON) #0: Sat Nov 26 15:11:16 CET 2005 bouyer@pop.lip6.fr:/local/pop1/bouyer/tmp/sparc/obj/local/pop1/bouyer/netbsd-3/src/sys/arch/sparc/compile/CHASSIRON sparc
Architecture: sparc
Machine: sparc
>Description:
	Setup: I get mail from various POP3 servers via fetchmail and
	deliver it to local folders (mbox format) via procmail; the folders
	are stored on an NFS server.
	fetchmail/procmail run on an x86 box (Celeron 500) running a
	months-old current:
NetBSD rochebonne.antioche.eu.org 3.99.7 NetBSD 3.99.7 (ROCHEBONNE) #1: Tue Aug  9 23:54:57 CEST 2005  bouyer@pop.lip6.fr:/local/pop1/bouyer/tmp/i386/obj/local/pop1/bouyer/current/src/sys/arch/i386/compile/ROCHEBONNE i386
	The NFS server is a sparc IPX (40MHz sparcv7).

	Problem: from time to time, the processes accessing the files on
	the NFS server hang. This usually happens when the client makes
	2 concurrent accesses to a mailbox (e.g. reading a mailbox
	with mutt while procmail tries to deliver a mail to this mailbox).
	I've also seen this before the 3.0 branch was cut, with the NFS server
	running 2.0 or 2.1. I never noticed it when the server was running
	1.6.2 (it started happening when the server got upgraded).
	Running /etc/rc.d/nfsd restart on the server unwedges the processes
	on the client box.

	Today I managed to reproduce this with a tcpdump running.
	The full trace is at:
	ftp://chassiron.antioche.eu.org/pub/private/nfs.hang.gz
	(the hang begins at 19:19:35, I ran the /etc/rc.d/nfsd restart at
	19:23:03).
	When the processes are stuck, the only traffic between
	the client and server is:
19:19:35.106216 IP rochebonne.antioche.eu.org.82 > chassiron.localhost.nfs: 40 null
19:19:35.108362 IP chassiron.localhost.nfs > rochebonne.antioche.eu.org.82: reply ok 24 null

	Before that the server sent a stream of
19:19:24.927421 IP chassiron.localhost.nfs > rochebonne.antioche.eu.org.1098072401: reply ERR 1460
	I'm not sure if it's normal or not (is this an error, or a normal
	reply to a read ?)
	It also looks like the client opened a second TCP connection at
	19:19:26.792149, maybe for the concurrent accesses ?

	To me it looks like this request:
19:19:26.845210 IP rochebonne.antioche.eu.org.809670347 > chassiron.localhost.nfs: 148 lookup fh 25,15/13347 "_bX.uUwoDB.rochebonne.antioch"
	got no reply and this is what caused the hang. After the nfsd restart,
	the same request was sent 2 times, the second one got the reply
	"no such file or directory"

	Now I don't know if this is a client- or server-side issue. The
	server seems to lose requests, but is the client supposed to
	retry with NFS over TCP ?
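
	One way to look for requests that never got a reply is to pair
	calls and replies by transaction id. In tcpdump's NFS output the
	number after the client host name is the RPC xid, not a port. The
	following is only a minimal sketch under that assumption; the names
	ex_classify, ex_track and struct ex_pending are hypothetical, and
	it assumes host names without ':' and text lines shaped like the
	ones quoted in this report.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EX_MAX_PENDING 1024

enum ex_dir { EX_DIR_OTHER = 0, EX_DIR_CALL, EX_DIR_REPLY };

/* Calls seen so far that have not been answered (hypothetical type). */
struct ex_pending {
	uint32_t xid[EX_MAX_PENDING];
	size_t n;
};

/* Return 1 if the len-byte token ends with ".nfs". */
static int
ex_ends_with_nfs(const char *s, size_t len)
{
	return len >= 4 && memcmp(s + len - 4, ".nfs", 4) == 0;
}

/* Parse a trailing ".<digits>" as the xid; return 0 on success. */
static int
ex_trailing_xid(const char *s, size_t len, uint32_t *xid)
{
	size_t i = len, j;
	uint32_t v = 0;

	while (i > 0 && s[i - 1] >= '0' && s[i - 1] <= '9')
		i--;
	if (i == len || i == 0 || s[i - 1] != '.')
		return -1;
	for (j = i; j < len; j++)
		v = v * 10 + (uint32_t)(s[j] - '0');
	*xid = v;
	return 0;
}

/* Classify one line of tcpdump text output and extract its xid. */
enum ex_dir
ex_classify(const char *line, uint32_t *xid)
{
	const char *gt = strstr(line, " > ");
	const char *src, *dst, *colon;

	if (gt == NULL)
		return EX_DIR_OTHER;
	for (src = gt; src > line && src[-1] != ' '; src--)
		continue;
	dst = gt + 3;
	if ((colon = strchr(dst, ':')) == NULL)
		return EX_DIR_OTHER;
	if (ex_ends_with_nfs(dst, (size_t)(colon - dst)) &&
	    ex_trailing_xid(src, (size_t)(gt - src), xid) == 0)
		return EX_DIR_CALL;
	if (ex_ends_with_nfs(src, (size_t)(gt - src)) &&
	    ex_trailing_xid(dst, (size_t)(colon - dst), xid) == 0)
		return EX_DIR_REPLY;
	return EX_DIR_OTHER;
}

/* Record a call, or drop the matching pending call when a reply is seen. */
void
ex_track(struct ex_pending *p, const char *line)
{
	uint32_t xid;
	size_t i;

	assert(p != NULL && line != NULL);
	switch (ex_classify(line, &xid)) {
	case EX_DIR_CALL:
		for (i = 0; i < p->n; i++)
			if (p->xid[i] == xid)
				return;		/* retransmitted call */
		if (p->n < EX_MAX_PENDING)
			p->xid[p->n++] = xid;
		break;
	case EX_DIR_REPLY:
		for (i = 0; i < p->n; i++)
			if (p->xid[i] == xid) {
				p->xid[i] = p->xid[--p->n];
				break;
			}
		break;
	default:
		break;
	}
}
```

	Feeding every line of the decoded trace to ex_track() leaves the
	unanswered xids in the table. Lines that tcpdump could not decode
	as an RPC header carry a meaningless number in the xid position;
	they simply match nothing.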

>How-To-Repeat:
	Try concurrent accesses to the same file or directory against
	a slow NFS server ?
>Fix:
	yes, please

>Release-Note:

>Audit-Trail:
From: christos@zoulas.com (Christos Zoulas)
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc: 
Subject: Re: kern/32318: NFS client or server hang
Date: Fri, 16 Dec 2005 14:28:49 -0500

 On Dec 16,  7:00pm, bouyer@antioche.eu.org (Manuel Bouyer) wrote:
 -- Subject: kern/32318: NFS client or server hang

 | 	Before that the server sent a stream of
 | 19:19:24.927421 IP chassiron.localhost.nfs > rochebonne.antioche.eu.org.1098072401: reply ERR 1460
 | 	I'm not sure if it's normal or not (is this an error, or a normal
 | 	reply to a read ?)
 | 	It also looks like the client opened a second TCP connection at
 | 	19:19:26.792149, maybe for the concurrent accesses ?

 Well, let's find out what the error means... Here's a patch to parse the
 rpc message rejection code. I have not tested it, but it should be close.
 If it works, please commit it. We can then find out what kind of error
 you are getting.

 christos

 Index: print-nfs.c
 ===================================================================
 RCS file: /cvsroot/src/dist/tcpdump/print-nfs.c,v
 retrieving revision 1.13
 diff -u -u -r1.13 print-nfs.c
 --- print-nfs.c	27 Sep 2004 23:04:24 -0000	1.13
 +++ print-nfs.c	16 Dec 2005 19:27:26 -0000
 @@ -1007,6 +1007,67 @@
  	len = EXTRACT_32BITS(&dp[1]);
  	if (len >= length)
  		return (NULL);
 +
 +	if (EXTRACT_32BITS(&rp->rm_reply.rp_stat) != MSG_ACCEPTED) {
 +		enum reject_stat rstat;
 +		rpcvers_t rlow;
 +		rpcvers_t rhigh;
 +		enum auth_stat rwhy;
 +
 +		rstat = EXTRACT_32BITS(dp);
 +		switch (rstat) {
 +		case RPC_MISMATCH:
 +			dp += sizeof(u_int32_t);
 +			rlow = EXTRACT_32BITS(dp);
 +			dp += sizeof(u_int32_t);
 +			rhigh = EXTRACT_32BITS(dp);
 +			printf("RPC Version mismatch (%d-%d)\n",
 +			    (int)rlow, (int)rhigh);
 +			break;
 +		case AUTH_ERROR:
 +			dp += sizeof(u_int32_t);
 +			rwhy = EXTRACT_32BITS(dp);
 +			printf("Auth ");
 +			switch (rwhy) {
 +			case AUTH_OK:
 +				printf("OK\n");
 +				break;
 +			case AUTH_BADCRED:
 +				printf("Bogus Credentials (seal broken)\n");
 +				break;
 +			case AUTH_REJECTEDCRED:
 +				printf("Rejected Credentials (client should "
 +				    "begin new session)\n");
 +				break;
 +			case AUTH_BADVERF:
 +				printf("Bogus Verifier (seal broken)\n");
 +				break;
 +			case AUTH_REJECTEDVERF:
 +				printf("Verifier expired or was replayed\n");
 +				break;
 +			case AUTH_TOOWEAK:
 +				printf("Credentials are too weak\n");
 +				break;
 +			case AUTH_INVALIDRESP:
 +				printf("Bogus response verifier\n");
 +				break;
 +			case AUTH_FAILED:
 +				printf("Unknown failure\n");
 +				break;
 +			default:
 +				printf("Invalid failure code %d\n",
 +				    (int)rwhy);
 +				break;
 +			}
 +			break;
 +		default:
 +			printf("Unknown reason for rejecting rpc message %d\n",
 +			    (int)rstat);
 +			break;
 +		}
 +		return NULL;
 +	}
 +
  	/*
  	 * skip past the ar_verf credentials.
  	 */
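
 For reference, the header layout that the patch above tries to decode
 can be sketched as a standalone parser. This is only an illustration,
 assuming the ONC RPC wire format of RFC 5531 with TCP record marking;
 it is not tcpdump code, and ex_parse_reply, struct ex_reply and the
 EX_* constants are hypothetical names. It is only meaningful on a
 buffer that starts at a record boundary; a TCP segment that continues
 an earlier, larger reply does not begin with an RPC header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values from the ONC RPC specification (RFC 5531). */
enum { EX_RPC_CALL = 0, EX_RPC_REPLY = 1 };		/* msg_type */
enum { EX_MSG_ACCEPTED = 0, EX_MSG_DENIED = 1 };	/* reply_stat */
enum { EX_RPC_MISMATCH = 0, EX_AUTH_ERROR = 1 };	/* reject_stat */

struct ex_reply {
	uint32_t frag_len;	/* fragment length from the record mark */
	int	 last_frag;	/* high bit of the record mark */
	uint32_t xid;
	uint32_t reply_stat;
	uint32_t detail;	/* reject_stat if denied, else accept_stat */
	uint32_t aux[2];	/* low/high versions, or auth_stat in aux[0] */
};

/* Read a 32-bit big-endian (XDR) integer. */
static uint32_t
ex_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	    ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Return 0 on success, -1 if buf is truncated or is not an RPC reply. */
int
ex_parse_reply(const unsigned char *buf, size_t len, struct ex_reply *out)
{
	size_t off = 16;	/* record mark + xid + msg_type + reply_stat */
	uint32_t mark, verf_len;

	assert(buf != NULL && out != NULL);
	if (len < off)
		return -1;
	mark = ex_be32(buf);
	out->last_frag = (mark & 0x80000000u) != 0;
	out->frag_len = mark & 0x7fffffffu;
	out->xid = ex_be32(buf + 4);
	if (ex_be32(buf + 8) != EX_RPC_REPLY)
		return -1;
	out->reply_stat = ex_be32(buf + 12);
	out->aux[0] = out->aux[1] = 0;

	if (out->reply_stat == EX_MSG_DENIED) {
		if (len < off + 4)
			return -1;
		out->detail = ex_be32(buf + off);
		off += 4;
		if (out->detail == EX_RPC_MISMATCH) {
			if (len < off + 8)
				return -1;
			out->aux[0] = ex_be32(buf + off);
			out->aux[1] = ex_be32(buf + off + 4);
		} else if (out->detail == EX_AUTH_ERROR) {
			if (len < off + 4)
				return -1;
			out->aux[0] = ex_be32(buf + off);
		} else
			return -1;
		return 0;
	}
	if (out->reply_stat != EX_MSG_ACCEPTED)
		return -1;

	/* Skip the verifier: flavor, length, body padded to 4 bytes. */
	if (len < off + 8)
		return -1;
	verf_len = ex_be32(buf + off + 4);
	if (verf_len > 400)	/* opaque_auth body is at most 400 bytes */
		return -1;
	off += 8 + ((verf_len + 3u) & ~3u);
	if (len < off + 4)
		return -1;
	out->detail = ex_be32(buf + off);	/* accept_stat */
	return 0;
}
```

 A 24-byte accepted reply with an empty verifier, like the "reply ok 24
 null" lines in the report, parses with detail == 0 (SUCCESS).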

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org
Subject: Re: kern/32318: NFS client or server hang
Date: Fri, 16 Dec 2005 21:00:38 +0100

 On Fri, Dec 16, 2005 at 07:30:02PM +0000, Christos Zoulas wrote:
 > The following reply was made to PR kern/32318; it has been noted by GNATS.
 > 
 > From: christos@zoulas.com (Christos Zoulas)
 > To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
 > 	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
 > Cc: 
 > Subject: Re: kern/32318: NFS client or server hang
 > Date: Fri, 16 Dec 2005 14:28:49 -0500
 > 
 >  On Dec 16,  7:00pm, bouyer@antioche.eu.org (Manuel Bouyer) wrote:
 >  -- Subject: kern/32318: NFS client or server hang
 >  
 >  | 	Before that the server sent a stream of
 >  | 19:19:24.927421 IP chassiron.localhost.nfs > rochebonne.antioche.eu.org.1098072401: reply ERR 1460
 >  | 	I'm not sure if it's normal or not (is this an error, or a normal
 >  | 	reply to a read ?)
 >  | 	It also looks like the client opened a second TCP connection at
 >  | 	19:19:26.792149, maybe for the concurrent accesses ?
 >  
 >  Well, let's find out what the error means... Here's a patch to parse the
 >  rpc message rejection code. I have not tested it, but it should be close.
 >  If it works, please commit it. We can then find out what kind of error
 >  you are getting.

 Sorry, it doesn't seem to work. I still get the same messages:
 20:58:31.772492 IP chassiron.localhost.nfs > rochebonne.antioche.eu.org.1768910394: reply ERR 1460

 Note that I do tcpdump -w, then tcpdump -r. Would this make a difference
 for your patch ?

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: christos@zoulas.com (Christos Zoulas)
To: Manuel Bouyer <bouyer@antioche.eu.org>, gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org
Subject: Re: kern/32318: NFS client or server hang
Date: Fri, 16 Dec 2005 15:02:56 -0500

 On Dec 16,  9:00pm, bouyer@antioche.eu.org (Manuel Bouyer) wrote:
 -- Subject: Re: kern/32318: NFS client or server hang

 | Note that I do tcpdump -w, then tcpdump -r. Would this make a difference
 | for your patch ?
 | 

 Can you put your tcpdump file somewhere where I can ftp it from?
 Then I can debug my crappy code :-)

 Thanks,

 christos

From: YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, bouyer@antioche.eu.org
Subject: Re: kern/32318: NFS client or server hang
Date: Tue, 20 Jun 2006 17:09:36 +0900

 >  | Note that I do tcpdump -w, then tcpdump -r. Would this make a difference
 >  | for your patch ?
 >  | 
 >  
 >  Can you put your tcpdump file somewhere where I can ftp it from?
 >  Then I can debug my crappy code :-)
 >  
 >  Thanks,
 >  
 >  christos

 i'd like to see the raw dump, too.  bouyer, do you still have it?

 YAMAMOTO Takashi

State-Changed-From-To: open->feedback
State-Changed-By: yamt@netbsd.org
State-Changed-When: Wed, 02 May 2007 15:09:54 +0000
State-Changed-Why:
waiting for feedback.


From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, netbsd-bugs@NetBSD.org,
	gnats-admin@NetBSD.org, yamt@NetBSD.org
Subject: Re: kern/32318 (NFS client or server hang)
Date: Wed, 2 May 2007 17:18:16 +0200

 On Wed, May 02, 2007 at 03:09:55PM +0000, yamt@NetBSD.org wrote:
 > Synopsis: NFS client or server hang
 > 
 > State-Changed-From-To: open->feedback
 > State-Changed-By: yamt@netbsd.org
 > State-Changed-When: Wed, 02 May 2007 15:09:54 +0000
 > State-Changed-Why:
 > waiting for feedback.

 It still happens with a 4.99.16 kernel on the server.

 -- 
 Manuel Bouyer, LIP6, Universite Paris VI.           Manuel.Bouyer@lip6.fr
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
To: bouyer@antioche.eu.org
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org,
	netbsd-bugs@NetBSD.org, gnats-admin@NetBSD.org, yamt@NetBSD.org
Subject: Re: kern/32318 (NFS client or server hang)
Date: Thu,  3 May 2007 00:25:50 +0900 (JST)

 > It still happens with a 4.99.16 kernel on the server.

 how about raw dump?

 YAMAMOTO Takashi

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org,
	netbsd-bugs@NetBSD.org, gnats-admin@NetBSD.org, yamt@NetBSD.org
Subject: Re: kern/32318 (NFS client or server hang)
Date: Wed, 2 May 2007 17:28:33 +0200

 On Thu, May 03, 2007 at 12:25:50AM +0900, YAMAMOTO Takashi wrote:
 > > It still happens with a 4.99.16 kernel on the server.
 > 
 > how about raw dump?

 You mean, a tcpdump of the traffic when it happens ?

 -- 
 Manuel Bouyer, LIP6, Universite Paris VI.           Manuel.Bouyer@lip6.fr
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
To: bouyer@antioche.eu.org
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org,
	netbsd-bugs@NetBSD.org, gnats-admin@NetBSD.org, yamt@NetBSD.org
Subject: Re: kern/32318 (NFS client or server hang)
Date: Thu,  3 May 2007 00:36:45 +0900 (JST)

 > On Thu, May 03, 2007 at 12:25:50AM +0900, YAMAMOTO Takashi wrote:
 > > > It still happens with a 4.99.16 kernel on the server.
 > > 
 > > how about raw dump?
 > 
 > You mean, a tcpdump of the traffic when it happens ?

 yes.

 YAMAMOTO Takashi

Responsible-Changed-From-To: kern-bug-people->yamt
Responsible-Changed-By: yamt@netbsd.org
Responsible-Changed-When: Tue, 08 May 2007 08:31:15 +0000
Responsible-Changed-Why:
mine.


State-Changed-From-To: feedback->open
State-Changed-By: yamt@netbsd.org
State-Changed-When: Tue, 08 May 2007 08:31:15 +0000
State-Changed-Why:
feedback provided (privately)


State-Changed-From-To: open->feedback
State-Changed-By: tron@NetBSD.org
State-Changed-When: Sat, 11 Feb 2012 12:40:36 +0000
State-Changed-Why:
Can you still reproduce this problem? This PR could be a duplicate
of PR kern/45093 which was fixed a while ago.


From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: yamt@NetBSD.org, netbsd-bugs@NetBSD.org, gnats-admin@NetBSD.org,
        tron@NetBSD.org
Subject: Re: kern/32318 (NFS client or server hang)
Date: Sat, 11 Feb 2012 13:50:30 +0100

 On Sat, Feb 11, 2012 at 12:40:37PM +0000, tron@NetBSD.org wrote:
 > Synopsis: NFS client or server hang
 > 
 > State-Changed-From-To: open->feedback
 > State-Changed-By: tron@NetBSD.org
 > State-Changed-When: Sat, 11 Feb 2012 12:40:36 +0000
 > State-Changed-Why:
 > Can you still reproduce this problem? This PR could be a duplicate
 > of PR kern/45093 which was fixed a while ago.

 it's not a duplicate of kern/45093: kern/45093 is a real kernel
 deadlock requiring a hard reboot, while this one only required
 restarting nfsd.

 Anyway, I've not seen this problem for a long time, I think we
 can consider it as fixed.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

State-Changed-From-To: feedback->closed
State-Changed-By: tron@NetBSD.org
State-Changed-When: Sat, 11 Feb 2012 13:04:09 +0000
State-Changed-Why:
The originator thinks that this bug has been fixed in the meantime.


From: D'Arcy Cain <darcy@NetBSD.org>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org, yamt@NetBSD.org, netbsd-bugs@NetBSD.org, 
 gnats-admin@NetBSD.org, tron@NetBSD.org
Subject: Re: kern/32318 (NFS client or server hang)
Date: Sun, 12 Feb 2012 08:40:38 -0500

 On 12-02-11 07:50 AM, Manuel Bouyer wrote:
 > it's not a duplicate of kern/45093: kern/45093 is a real kernel

 Is it possible that kern/45609 is related to kern/45093?

 -- 
 D'Arcy J.M. Cain <darcy@NetBSD.org>
 http://www.NetBSD.org/ IM:darcy@Vex.Net

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: "D'Arcy Cain" <darcy@NetBSD.org>
Cc: gnats-bugs@NetBSD.org, yamt@NetBSD.org, netbsd-bugs@NetBSD.org,
        gnats-admin@NetBSD.org, tron@NetBSD.org
Subject: Re: kern/32318 (NFS client or server hang)
Date: Sun, 12 Feb 2012 14:48:56 +0100

 On Sun, Feb 12, 2012 at 08:40:38AM -0500, D'Arcy Cain wrote:
 > On 12-02-11 07:50 AM, Manuel Bouyer wrote:
 > >it's not a duplicate of kern/45093: kern/45093 is a real kernel
 > 
 > Is it possible that kern/45609 is related to kern/45093?

 unlikely. 45093 is a deadlock, not memory corruption or a crash.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

>Unformatted:
