NetBSD Problem Report #46221

From tron@zhadum.org.uk  Sun Mar 18 15:54:22 2012
Return-Path: <tron@zhadum.org.uk>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	by www.NetBSD.org (Postfix) with ESMTP id D01E563B946
	for <gnats-bugs@gnats.NetBSD.org>; Sun, 18 Mar 2012 15:54:22 +0000 (UTC)
Message-Id: <20120318155429.358C5A3C551@mail.zhadum.org.uk>
Date: Sun, 18 Mar 2012 15:54:29 +0000 (GMT)
From: tron@zhadum.org.uk
Reply-To: tron@zhadum.org.uk
To: gnats-bugs@gnats.NetBSD.org
Subject: Kernel panic in NFS server code
X-Send-Pr-Version: 3.95

>Number:         46221
>Category:       kern
>Synopsis:       Kernel panic in NFS server code
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    hannken
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Mar 18 15:55:00 +0000 2012
>Closed-Date:    Thu Apr 12 17:25:06 +0000 2012
>Last-Modified:  Thu Apr 12 17:25:06 +0000 2012
>Originator:     Matthias Scheler
>Release:        NetBSD 6.0_BETA 2012-03-16 sources with fix for PR kern/45677
>Organization:
Matthias Scheler                                  http://zhadum.org.uk/
>Environment:
System: NetBSD colwyn.zhadum.org.uk 6.0_BETA NetBSD 6.0_BETA (COLWYN.64) #0: Fri Mar 16 13:40:20 GMT 2012 tron@colwyn.zhadum.org.uk:/src/sys/compile/COLWYN.64 amd64
Architecture: x86_64
Machine: amd64
>Description:
My NetBSD/amd64 6.0_BETA server has crashed three times during access by
a Mac OS X NFS client. It rebooted twice. But in one case it got stuck
and I could copy this traceback from the console:

printf_nolog() at netbsd:printf_nolog
wddump() at netbsd:wddump+0x23d
raiddump() at netbsd:raiddump+0x227
dumpsys_seg() at netbsd:dumpsys_seg+0xb9
dump_seg_iter() at netbsd:dump_seg_iter+0xfb
dodumpsys() at netbsd:dodumpsys+0x273
dumpsys() at netbsd:dumpsys+0x1d
vpanic() at netbsd:vpanic+0x1dd
printf_nolog() at netbsd:printf_nolog
startlwp() at netbsd:startlwp
alltraps() at netbsd:alltraps+0xa2
ffs_fhtovp() at netbsd:ffs_fhtovp+0x55
VFS_FHTOVP() at netbsd:VFS_FHTOVP+0x1c
nfsrv_fhtovp() at netbsd:nfsrv_fhtovp+0x9a
nfsrv_read() at netbsd:nfsrv_read+0x245
nfssvc_nfsd() at netbsd:nfssvc_nfsd+0x1ce
sys_nfssvc() at netbsd:sys_nfssvc+0x22d 
syscall() at netbsd:syscall+0xc4
cpu1: End traceback...
sd0(umass0:0:0:0): generic HBA error
sd0: cache synchronization failed
fatal page fault in supervisor mode
trap type 6 code 10 rip 0 cs 8 rflags 10286 cr2 0 cpl 4 rsp fffffe810e95dc48

There might be typos in the above. The original is here:

http://zhadum.org.uk/~tron/panic.png

>How-To-Repeat:
1.) Use a NetBSD 6.0_BETA machine as an NFS server for a Mac OS X Snow Leopard
    client.
2.) Login as a user with an NFS mounted home directory.
3.) Start Apple's "Mail" application.

>Fix:
Not known. It didn't happen under NetBSD 5.1_STABLE.

>Release-Note:

>Audit-Trail:
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Sun, 18 Mar 2012 17:06:09 +0100

 Could you build a kernel with makeoptions    DEBUG="-g", reboot it
 and try to reproduce the problem ? It would be interesting to see where
 ffs_fhtovp+0x55 points to in sources.
 Also, if you don't get a crash dump, please set ddb.onpanic=1 so that
 we get the kernel panic message, the instruction that faulted, and you can
 dump registers content (it looks like it's a trap, so 'sh registers' will
 print something interesting).

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Sun, 18 Mar 2012 16:15:36 +0000

 On Sun, Mar 18, 2012 at 03:55:00PM +0000, tron@zhadum.org.uk wrote:
  > vpanic() at netbsd:vpanic+0x1dd
  > printf_nolog() at netbsd:printf_nolog
  > startlwp() at netbsd:startlwp
  > alltraps() at netbsd:alltraps+0xa2
  > ffs_fhtovp() at netbsd:ffs_fhtovp+0x55

 That chain is a little odd. I suppose some of it is probably stale
 values on the stack that ddb is interpreting as part of the stack
 trace. However, it would be nice to know what the panic message was...

 -- 
 David A. Holland
 dholland@netbsd.org

From: Matthias Scheler <tron@zhadum.org.uk>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Sun, 18 Mar 2012 17:35:48 +0000

 On Sun, Mar 18, 2012 at 04:20:06PM +0000, David Holland wrote:
 >  On Sun, Mar 18, 2012 at 03:55:00PM +0000, tron@zhadum.org.uk wrote:
 >   > vpanic() at netbsd:vpanic+0x1dd
 >   > printf_nolog() at netbsd:printf_nolog
 >   > startlwp() at netbsd:startlwp
 >   > alltraps() at netbsd:alltraps+0xa2
 >   > ffs_fhtovp() at netbsd:ffs_fhtovp+0x55
 >  
 >  That chain is a little odd. I suppose some of it is probably stale
 >  values on the stack that ddb is interpreting as part of the stack
 >  trace. However, it would be nice to know what the panic message was...

 I'm afraid it got lost. I have, however, reconfigured the machine to
 enter DDB on panic.

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: Matthias Scheler <tron@zhadum.org.uk>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: NetBSD GNATS <gnats-bugs@NetBSD.org>
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Sun, 18 Mar 2012 17:44:41 +0000

 >  Could you build a kernel with makeoptions    DEBUG="-g", reboot it
 >  and try to reproduce the problem ?

 I always build kernels with that option.

 > It would be interesting to see where ffs_fhtovp+0x55 points to in sources.

 If you please tell me how to find that out in gdb(?) I will happily provide
 the information. Alternatively you can find the kernel image here:

 	http://files.zhadum.org.uk/netbsd.gdb.xz

 > Also, if you don't get a crash dump, please set ddb.onpanic=1 so that
 > we get the kernel panic message, the instruction that faulted, and you can
 > dump registers content (it looks like it's a trap, so 'sh registers' will
 > print something interesting).

 I've already done that. I only forgot to do it again after Friday's
 kernel update and before the last crash.

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Matthias Scheler <tron@zhadum.org.uk>
Cc: NetBSD GNATS <gnats-bugs@NetBSD.org>
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Sun, 18 Mar 2012 19:16:28 +0100

 On Sun, Mar 18, 2012 at 05:44:41PM +0000, Matthias Scheler wrote:
 > 
 > > It would be interesting to see where ffs_fhtovp+0x55 points to in sources.
 > 
 > If you please tell me how to find that out in gdb(?) I will happily provide
 > the information. Alternatively you can find the kernel image here:
 > 
 > 	http://files.zhadum.org.uk/netbsd.gdb.xz

 (gdb) l *(ffs_fhtovp+0x55)
 0xffffffff801af684 is in ffs_fhtovp (/usr/src/sys/ufs/ffs/ffs_vfsops.c:1907).
 1902    /usr/src/sys/ufs/ffs/ffs_vfsops.c: No such file or directory.
         in /usr/src/sys/ufs/ffs/ffs_vfsops.c

 But this is at the very end of ffs_fhtovp(). I don't know if ddb is somewhat
 confused when computing the stack trace, or something else.

 (gdb) x/i ffs_fhtovp+0x55
    0xffffffff801af684 <ffs_fhtovp+85>:  leaveq 
 (gdb) disas ffs_fhtovp
 [...]
    0xffffffff801af67f <+80>:    callq  0xffffffff8042b0f3 <ufs_fhtovp>
    0xffffffff801af684 <+85>:    leaveq 
    0xffffffff801af685 <+86>:    retq   

 Either ddb skipped the call to ufs_fhtovp (and so the place where it
 crashed in ufs_fhtovp) or the stack is corrupted.
 The exact trap message from ddb and the register dump would be useful.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Sun, 18 Mar 2012 18:38:07 +0000

 On Sun, Mar 18, 2012 at 05:45:02PM +0000, Matthias Scheler wrote:
  >> It would be interesting to see where ffs_fhtovp+0x55 points to in sources.
  > 
  >  If you please tell me how to find that out in gdb(?) I will happily provide
  >  the information.

 If you have debug info you can do

    (gdb) list *(ffs_fhtovp + 0x55)
    ...

 If not, you can do "disassemble ffs_fhtovp" and work it out by
 comparing the asm to the source, although that's a pain. ffs_fhtovp is
 not large though.

 In my kernel 0x55 is right before the end and is the return point of
 the call to ufs_fhtovp; it is quite likely the same for yours.

 -- 
 David A. Holland
 dholland@netbsd.org

From: Matthias Scheler <tron@zhadum.org.uk>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: NetBSD GNATS <gnats-bugs@NetBSD.org>
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Sun, 18 Mar 2012 22:32:25 +0000

 On Sun, Mar 18, 2012 at 07:16:28PM +0100, Manuel Bouyer wrote:
 > (gdb) l *(ffs_fhtovp+0x55)
 > 0xffffffff801af684 is in ffs_fhtovp (/usr/src/sys/ufs/ffs/ffs_vfsops.c:1907).
 > 1902    /usr/src/sys/ufs/ffs/ffs_vfsops.c: No such file or directory.
 >         in /usr/src/sys/ufs/ffs/ffs_vfsops.c
 > 
 > But this is at the very end of ffs_fhtovp(). I don't know if ddb is somewhat
 > confused when computing the stack trace, or something else.
 > 
 > (gdb) x/i ffs_fhtovp+0x55
 >    0xffffffff801af684 <ffs_fhtovp+85>:  leaveq 
 > (gdb) disas ffs_fhtovp
 > [...]
 >    0xffffffff801af67f <+80>:    callq  0xffffffff8042b0f3 <ufs_fhtovp>
 >    0xffffffff801af684 <+85>:    leaveq 
 >    0xffffffff801af685 <+86>:    retq   
 > 
 > Either ddb skipped the call to ufs_fhtovp (and so the place where it
 > crashed in ufs_fhtovp) or the stack is corrupted.
 > The exact trap message from ddb and the register dump would be useful.

 I'll try to get that the next time.

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: Matthias Scheler <tron@zhadum.org.uk>
To: NetBSD GNATS <gnats-bugs@NetBSD.org>
Cc: Manuel Bouyer <bouyer@antioche.eu.org>
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Tue, 20 Mar 2012 23:40:32 +0000

 On Sun, Mar 18, 2012 at 10:35:02PM +0000, Matthias Scheler wrote:
 >  I'll try to get that the next time.

 I got a new back-trace ...

 db{1}> bt
 ufs_fhtovp() at netbsd:ufs_fhtovp+0x2e
 ffs_fhtovp() at netbsd:ffs_fhtovp+0x55
 VFS_FHTOVP() at netbsd:VFS_FHTOVP+0x1c
 dofhopen() at netbsd:dofhopen+0xda
 syscall() at netbsd:syscall+0xc4

 ... and a register dump:

 db{1}> sh registers
 ds          9a90
 es          7480
 fs          9a30
 gs          a000
 rdi         fffffe81c281bac0
 rsi         0
 rbp         fffffe810eda9a80
 rbx         fffffe810eda9a90
 rdx         0
 rcx         7
 rax         0
 r8          fffffe810e9b3000
 r9          0
 r10         1
 r11         0
 r12         fffffe810eda9bc0
 r13         3
 r14         1c
 r15         7f7ff7b4f3c0
 rip         ffffffff8042b121    ufs_fhtovp+0x2e
 cs          8
 rflags      10246
 rsp         fffffe810eda9a60
 ss          10
 netbsd:ufs_fhtovp+0x2e: cmpw    $0,c8(%rdx)

 I didn't know how to get the panic message.

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: Matthias Scheler <tron@zhadum.org.uk>
To: NetBSD GNATS <gnats-bugs@NetBSD.org>
Cc: 
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Wed, 21 Mar 2012 13:34:17 +0000

 Manuel Bouyer has looked at this crash. The kernel panicked here:

         ip = VTOI(nvp);
 -->     if (ip->i_mode == 0 || ip->i_gen != ufhp->ufid_gen) {
                 vput(nvp);
                 *vpp = NULLVP;
                 return (ESTALE);
         }

 I guess that an extra check whether "ip" is NULL would prevent the
 panic. But I'm not sure whether that is the correct fix.
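
A minimal userland sketch of the guard suggested above, for illustration only.
The names inode_model, vnode_model and check_handle are hypothetical stand-ins
and not the kernel's definitions. The sketch only shows that testing the
pointer before dereferencing it turns the NULL dereference into an ESTALE
return; it says nothing about why the inode pointer was NULL in the first
place.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct inode_model {
	uint16_t i_mode;	/* 0 models an unallocated inode */
	int32_t  i_gen;		/* generation number */
};

struct vnode_model {
	struct inode_model *v_data;	/* NULL models a cleaned vnode */
};

/* Return 0 if the file handle still matches the inode, ESTALE otherwise. */
static int
check_handle(const struct vnode_model *vp, int32_t fh_gen)
{
	const struct inode_model *ip;

	assert(vp != NULL);
	ip = vp->v_data;	/* models ip = VTOI(nvp) */

	/* The pointer test must come first so the later tests never fault. */
	if (ip == NULL || ip->i_mode == 0 || ip->i_gen != fh_gen)
		return ESTALE;
	return 0;
}
```

With v_data set to NULL, check_handle() returns ESTALE instead of faulting,
which matches the behaviour the extra check is meant to give.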

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: Matthias Scheler <tron@zhadum.org.uk>
Cc: 
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Wed, 21 Mar 2012 15:44:28 +0100

 <snip>
 > 
 > Manuel Bouyer has looked at this crash. The kernel panicked here:
 > 
 >         ip = VTOI(nvp);
 > -->     if (ip->i_mode == 0 || ip->i_gen != ufhp->ufid_gen) {
 >                 vput(nvp);
 >                 *vpp = NULLVP;
 >                 return (ESTALE);
 >         }
 > 
 > I guess that an extra check whether "ip" is NULL would prevent the
 > panic. But I'm not sure whether that is the correct fix.

 Please change this to

 	ip = VTOI(nvp);
 +	if (ip == NULL) {
 +		vprint("NULL IP", nvp);
 +		panic("NULL IP");
 +	}
 	if (...

 so on the next crash we know more about the state of the vnode/inode.

 If you are able to debug the core dump you could `print *nvp' here.

 --
 Juergen Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)

From: Matthias Scheler <tron@zhadum.org.uk>
To: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
Cc: NetBSD GNATS <gnats-bugs@NetBSD.org>
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Wed, 21 Mar 2012 15:08:27 +0000

 On Wed, Mar 21, 2012 at 03:44:28PM +0100, J. Hannken-Illjes wrote:
 > <snip>
 > > 
 > > Manuel Bouyer has looked at this crash. The kernel panicked here:
 > > 
 > >         ip = VTOI(nvp);
 > > -->     if (ip->i_mode == 0 || ip->i_gen != ufhp->ufid_gen) {
 > >                 vput(nvp);
 > >                 *vpp = NULLVP;
 > >                 return (ESTALE);
 > >         }
 > > 
 > > I guess that an extra check whether "ip" is NULL would prevent the
 > > panic. But I'm not sure whether that is the correct fix.
 > 
 > Please change this to
 > 
 > 	ip = VTOI(nvp);
 > +	if (ip == NULL) {
 > +		vprint("NULL IP", nvp);
 > +		panic("NULL IP");
 > +	}
 > 	if (...
 > 
 > so on the next crash we know more about the state of the vnode/inode.

 I've already changed it like this:

 Index: sys/ufs/ufs/ufs_vfsops.c
 ===================================================================
 RCS file: /cvsroot/src/sys/ufs/ufs/ufs_vfsops.c,v
 retrieving revision 1.50
 diff -u -r1.50 ufs_vfsops.c
 --- sys/ufs/ufs/ufs_vfsops.c	1 Feb 2012 05:34:43 -0000	1.50
 +++ sys/ufs/ufs/ufs_vfsops.c	21 Mar 2012 15:07:49 -0000
 @@ -223,7 +223,11 @@
  		return (error);
  	}
  	ip = VTOI(nvp);
 -	if (ip->i_mode == 0 || ip->i_gen != ufhp->ufid_gen) {
 +	if (ip == NULL || ip->i_mode == 0 || ip->i_gen != ufhp->ufid_gen) {
 +		if (ip == NULL) {
 +			aprint_normal("ufs_fhtovp: ip == NULL on vp %p\n",
 +			    nvp);
 +		}
  		vput(nvp);
  		*vpp = NULLVP;
  		return (ESTALE);

 I'll check whether I get some of these kernel messages.

 > If you are able to debug the core dump you could `print *nvp' here.

 Kernel core dumps have never worked for me under NetBSD/amd64.

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org,
        tron@zhadum.org.uk
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Wed, 21 Mar 2012 16:49:08 +0100

 On Wed, Mar 21, 2012 at 03:10:10PM +0000, Matthias Scheler wrote:
 >  > If you are able to debug the core dump you could `print *nvp' here.
 >  
 >  Kernel core dumps have never worked for me under NetBSD/amd64.

 But I think, once you have the vnode address, you can do a
 show vnode <addr>
 from ddb.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: Matthias Scheler <tron@zhadum.org.uk>
Cc: 
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Wed, 21 Mar 2012 16:13:03 +0100

 Please add a vprint() call so we get more details for this vnode.

 --
 Juergen Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)

 On Mar 21, 2012, at 4:08 PM, Matthias Scheler wrote:

 > I've already changed it like this:
 > 
 > Index: sys/ufs/ufs/ufs_vfsops.c
 > ===================================================================
 > RCS file: /cvsroot/src/sys/ufs/ufs/ufs_vfsops.c,v
 > retrieving revision 1.50
 > diff -u -r1.50 ufs_vfsops.c
 > --- sys/ufs/ufs/ufs_vfsops.c	1 Feb 2012 05:34:43 -0000	1.50
 > +++ sys/ufs/ufs/ufs_vfsops.c	21 Mar 2012 15:07:49 -0000
 > @@ -223,7 +223,11 @@
 > 		return (error);
 > 	}
 > 	ip = VTOI(nvp);
 > -	if (ip->i_mode == 0 || ip->i_gen != ufhp->ufid_gen) {
 > +	if (ip == NULL || ip->i_mode == 0 || ip->i_gen != ufhp->ufid_gen) {
 > +		if (ip == NULL) {
 > +			aprint_normal("ufs_fhtovp: ip == NULL on vp %p\n",
 > +			    nvp);
 > +		}
 > 		vput(nvp);
 > 		*vpp = NULLVP;
 > 		return (ESTALE);

From: Matthias Scheler <tron@zhadum.org.uk>
To: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
Cc: NetBSD GNATS <gnats-bugs@NetBSD.org>
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Wed, 21 Mar 2012 16:27:55 +0000

 On Wed, Mar 21, 2012 at 04:13:03PM +0100, J. Hannken-Illjes wrote:
 > Please add a vprint() call so we get more details for this vnode.

 Done. The patch now looks like this:

 Index: sys/ufs/ufs/ufs_vfsops.c
 ===================================================================
 RCS file: /cvsroot/src/sys/ufs/ufs/ufs_vfsops.c,v
 retrieving revision 1.50
 diff -u -r1.50 ufs_vfsops.c
 --- sys/ufs/ufs/ufs_vfsops.c	1 Feb 2012 05:34:43 -0000	1.50
 +++ sys/ufs/ufs/ufs_vfsops.c	21 Mar 2012 16:25:47 -0000
 @@ -223,7 +223,9 @@
  		return (error);
  	}
  	ip = VTOI(nvp);
 -	if (ip->i_mode == 0 || ip->i_gen != ufhp->ufid_gen) {
 +	if (ip == NULL || ip->i_mode == 0 || ip->i_gen != ufhp->ufid_gen) {
 +		if (ip == NULL)
 +			vprint("ufs_fhtovp: ip == NULL on vp %p\n", nvp);
  		vput(nvp);
  		*vpp = NULLVP;
  		return (ESTALE);

 [Hmm, the "on vp %p" stuff is bogus but won't hurt.]

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org,
        tron@zhadum.org.uk
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Wed, 21 Mar 2012 17:40:39 +0100

 On Wed, Mar 21, 2012 at 01:35:02PM +0000, Matthias Scheler wrote:
 >  Manuel Bouyer has looked at this crash. The kernel panicked here:
 >  
 >          ip = VTOI(nvp);
 >  -->     if (ip->i_mode == 0 || ip->i_gen != ufhp->ufid_gen) {
 >                  vput(nvp);
 >                  *vpp = NULLVP;
 >                  return (ESTALE);
 >          }
 >  
 >  I guess that an extra check whether "ip" is NULL would prevent the
 >  panic. But I'm not sure whether that is the correct fix.

 Looks like kern/41147. Fixed here:
 http://mail-index.netbsd.org/source-changes/2009/09/20/msg001090.html
 and then in a better way here:
 http://mail-index.netbsd.org/source-changes/2009/11/05/msg002668.html

 Most of this has since been removed from vfs_subr.c. The commit log says that
 "Now that ffs on disk inodes get freed in the reclaim routine" this is
 no longer needed.

 At first glance I can't see a problem with this. But I wonder if stacked
 filesystems could have a role in this; is the NFS-exported filesystem also
 accessed by layered FSes?

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Matthias Scheler <tron@zhadum.org.uk>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: NetBSD GNATS <gnats-bugs@NetBSD.org>
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Wed, 21 Mar 2012 17:12:28 +0000

 On Wed, Mar 21, 2012 at 05:40:39PM +0100, Manuel Bouyer wrote:
 > >  I guess that an extra check whether "ip" is NULL would prevent the
 > >  panic. But I'm not sure whether that is the correct fix.
 > 
 > Looks like kern/41147. Fixed here:
 > http://mail-index.netbsd.org/source-changes/2009/09/20/msg001090.html
 > and then in a better way here:
 > http://mail-index.netbsd.org/source-changes/2009/11/05/msg002668.html

 It seems that we need something like that in "netbsd-6".

 > At first glance I can't see a problem with this. But I wonder if stacked
 > filesystems could have a role in this; is the NFS-exported filesystem also
 > accessed by layered FSes?

 No, it isn't. But the server sometimes mounts the NFS volume from itself
 via NFS. I've a suspicion that it only happens after I ran two scripts
 on the machine. The first one creates a lot of null and local NFS mounts.
 The second one will remove them later.

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Matthias Scheler <tron@zhadum.org.uk>
Cc: NetBSD GNATS <gnats-bugs@NetBSD.org>
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Wed, 21 Mar 2012 19:07:37 +0100

 On Wed, Mar 21, 2012 at 05:12:28PM +0000, Matthias Scheler wrote:
 > On Wed, Mar 21, 2012 at 05:40:39PM +0100, Manuel Bouyer wrote:
 > > >  I guess that an extra check whether "ip" is NULL would prevent the
 > > >  panic. But I'm not sure whether that is the correct fix.
 > > 
 > > Looks like kern/41147. Fixed here:
 > > http://mail-index.netbsd.org/source-changes/2009/09/20/msg001090.html
 > > and then in a better way here:
 > > http://mail-index.netbsd.org/source-changes/2009/11/05/msg002668.html
 > 
 > It seems that we need something like that in "netbsd-6".

 The code has changed in this area; I'm not sure the same code will
 work (there is a KASSERT in cleanvnode() that, I think, would
 have fired if we had the same scenario). There is also a KASSERT()
 in vget() that we're not returning a clean vnode.

 I wonder if, while we're sleeping on vp->v_interlock, the vnode could
 have been moved to another filesystem ...
 printing the vnode when the problem happens will really help.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Matthias Scheler <tron@zhadum.org.uk>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Fri, 23 Mar 2012 00:01:16 +0000

 On Wed, Mar 21, 2012 at 06:10:07PM +0000, Manuel Bouyer wrote:
 >  The code has changed in this area; I'm not sure the same code will
 >  work (there is a KASSERT in cleanvnode() that, I think, would
 >  have fired if we had the same scenario). There is also a KASSERT()
 >  in vget() that we're not returning a clean vnode.
 >  
 >  I wonder if, while we're sleeping on vp->v_interlock, the vnode could
 >  have been moved to another filesystem ...
 >  printing the vnode when the problem happens will really help.

 The new code in ufs_fhtovp() finally triggered.
 Here is the output of vprint():

 ufs_fhtovp: ip == NULL on vp %p
 : vnode @ 0xfffffe8169463be0, flags (0x80010<MPSAFE,CLEAN>)
         tag VT_NON(0), type VREG(1), usecount 1, writecount 0, holdcount 0
         freelisthd 0x0, mount 0xfffffe8214f82000, data 0x0 lock 0xfffffe8169463cf0

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc: Matthias Scheler <tron@zhadum.org.uk>
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Fri, 23 Mar 2012 10:03:40 +0100

 Please try the attached diff that changes vn_lock() to return an invalid
 vnode only if the caller requested it by setting LK_RETRY.

 --
 Juergen Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)

 Index: vfs_vnops.c
 ===================================================================
 RCS file: /cvsroot/src/sys/kern/vfs_vnops.c,v
 retrieving revision 1.183
 diff -p -u -4 -r1.183 vfs_vnops.c
 --- vfs_vnops.c	14 Oct 2011 09:23:31 -0000	1.183
 +++ vfs_vnops.c	23 Mar 2012 08:59:24 -0000
 @@ -804,8 +804,17 @@ vn_lock(struct vnode *vp, int flags)
  			error = ENOENT;
  		} else {
  			mutex_exit(vp->v_interlock);
  			error = VOP_LOCK(vp, (flags & ~LK_RETRY));
 +			if (error == 0 && (flags & LK_RETRY) == 0) {
 +				mutex_enter(vp->v_interlock);
 +				if ((vp->v_iflag & VI_CLEAN)) {
 +					mutex_exit(vp->v_interlock);
 +					VOP_UNLOCK(vp);
 +					return ENOENT;
 +				}
 +				mutex_exit(vp->v_interlock);
 +			}
  			if (error == 0 || error == EDEADLK || error == EBUSY)
  				return (error);
  		}
  	} while (flags & LK_RETRY);
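
The intent of the diff above can be modelled in userland roughly as follows.
This is only a sketch under stated assumptions: vnode_model, model_vn_lock and
MODEL_LK_RETRY are hypothetical names, plain flags stand in for the real vnode
lock, the v_interlock handling is left out, and none of it is kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_LK_RETRY 0x1	/* hypothetical flag value, for this model only */

struct vnode_model {
	bool locked;	/* models the vnode lock taken by VOP_LOCK() */
	bool clean;	/* models VI_CLEAN being set in vp->v_iflag */
};

/*
 * Lock the vnode. Without MODEL_LK_RETRY, a vnode that turns out to be
 * cleaned is unlocked again and ENOENT is returned. With MODEL_LK_RETRY
 * the caller has asked for the vnode regardless and gets it locked.
 */
static int
model_vn_lock(struct vnode_model *vp, int flags)
{
	assert(vp != NULL && !vp->locked);

	vp->locked = true;		/* models a successful VOP_LOCK() */
	if ((flags & MODEL_LK_RETRY) == 0 && vp->clean) {
		vp->locked = false;	/* models VOP_UNLOCK() */
		return ENOENT;
	}
	return 0;
}
```

In this model a caller that does not pass MODEL_LK_RETRY never sees a cleaned
vnode as a successful lock, which is the property the diff is meant to add.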

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Matthias Scheler <tron@zhadum.org.uk>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Fri, 23 Mar 2012 10:33:49 +0100

 On Fri, Mar 23, 2012 at 12:01:16AM +0000, Matthias Scheler wrote:
 > On Wed, Mar 21, 2012 at 06:10:07PM +0000, Manuel Bouyer wrote:
 > >  The code has changed in this area; I'm not sure the same code will
 > >  work (there is a KASSERT in cleanvnode() that, I think, would
 > >  have fired if we had the same scenario). There is also a KASSERT()
 > >  in vget() that we're not returning a clean vnode.
 > >  
 > >  I wonder if, while we're sleeping on vp->v_interlock, the vnode could
 > >  have been moved to another filesystem ...
 > >  printing the vnode when the problem happens will really help.
 > 
 > The new code in ufs_fhtovp() finally triggered.
 > Here is the output of vprint():
 > 
 > ufs_fhtovp: ip == NULL on vp %p
 > : vnode @ 0xfffffe8169463be0, flags (0x80010<MPSAFE,CLEAN>)
 >         tag VT_NON(0), type VREG(1), usecount 1, writecount 0, holdcount 0
 >         freelisthd 0x0, mount 0xfffffe8214f82000, data 0x0 lock 0xfffffe8169463cf0

 So it got a clean vnode. This is bad.
 I suspect ufs_ihashget() is returning the clean vnode; otherwise it has to be
 in ffs_vget() itself.

 Can you run a kernel with the attached patch and see if one of the KASSERTs
 fires? You can probably turn the check into a printf if you want to
 avoid the panic.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

 --FL5UXtIhxfXey3p5
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: attachment; filename=diff

 Index: ffs/ffs_vfsops.c
 ===================================================================
 RCS file: /cvsroot/src/sys/ufs/ffs/ffs_vfsops.c,v
 retrieving revision 1.269
 diff -u -p -u -r1.269 ffs_vfsops.c
 --- ffs/ffs_vfsops.c	7 Oct 2011 09:35:07 -0000	1.269
 +++ ffs/ffs_vfsops.c	23 Mar 2012 09:31:56 -0000
 @@ -1867,6 +1867,7 @@ ffs_vget(struct mount *mp, ino_t ino, st
  	}							/* XXX */
  	uvm_vnp_setsize(vp, ip->i_size);
  	*vpp = vp;
 +	KASSERT((vp->v_iflag & VI_CLEAN) == 0);
  	return (0);
  }

 Index: ufs/ufs_ihash.c
 ===================================================================
 RCS file: /cvsroot/src/sys/ufs/ufs/ufs_ihash.c,v
 retrieving revision 1.31
 diff -u -p -u -r1.31 ufs_ihash.c
 --- ufs/ufs_ihash.c	12 Jun 2011 03:36:02 -0000	1.31
 +++ ufs/ufs_ihash.c	23 Mar 2012 09:31:56 -0000
 @@ -153,6 +153,7 @@ ufs_ihashget(dev_t dev, ino_t inum, int 
  				if (vget(vp, flags))
  					goto loop;
  			}
 +			KASSERT((vp->v_iflag & VI_CLEAN) == 0);
  			return (vp);
  		}
  	}

 --FL5UXtIhxfXey3p5--

From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Fri, 23 Mar 2012 10:47:36 +0100

 On Mar 23, 2012, at 10:35 AM, Manuel Bouyer wrote:

 > So it got a clean vnode. This is bad.
 > I suspect ufs_ihashget() is returning the clean vnode; otherwise it has to be
 > in ffs_vget() itself.

 I would expect ufs_fhtovp -> ffs_vget -> ufs_ihashget -> vget and then the
 window between mutex_exit(vp->v_interlock) and vn_lock(vp, flags).

 The vnode is asserted not to be VI_CLEAN before the mutex_exit() and is
 definitely VI_CLEAN after the vn_lock().

 See my previous message on changing vn_lock() to return invalid vnodes
 only when the caller requested it by setting LK_RETRY.
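
 The intended vn_lock() behaviour can be sketched as a small userspace
 model. This is only an illustration, not the kernel code: the names
 model_vnode, model_vn_lock and MODEL_LK_RETRY are hypothetical stand-ins,
 the flag value is arbitrary, and the interlock, the VI_XLOCK handling and
 the retry loop are all left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for LK_RETRY; the value is illustrative only. */
#define MODEL_LK_RETRY 0x1

struct model_vnode {
	bool clean;   /* models VI_CLEAN being set in v_iflag */
	bool locked;  /* models the vnode lock being held */
};

/*
 * Model of the patched vn_lock(): once the lock has been taken, a vnode
 * that turns out to be clean is handed back only if the caller passed
 * LK_RETRY.  Otherwise the lock is dropped again and ENOENT is returned.
 */
static int
model_vn_lock(struct model_vnode *vp, int flags)
{
	assert(!vp->locked);
	vp->locked = true;                    /* VOP_LOCK() succeeded */
	if ((flags & MODEL_LK_RETRY) == 0 && vp->clean) {
		vp->locked = false;           /* VOP_UNLOCK() */
		return ENOENT;
	}
	return 0;
}
```

 In this model a caller such as vget(), which does not pass LK_RETRY, sees
 ENOENT for a vnode that was cleaned inside the window and never receives
 it locked.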

 --
 Juergen Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)

From: Matthias Scheler <tron@zhadum.org.uk>
To: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Fri, 23 Mar 2012 14:08:56 +0000

 On Fri, Mar 23, 2012 at 10:03:40AM +0100, J. Hannken-Illjes wrote:
 > Please try the attached diff that changes vn_lock() to return an invalid
 > vnode only if the caller requested it by setting LK_RETRY.

 I'm now running a kernel with this patch and my patch applied.
 If your patch works I should never see the message from my patch again.
 But I guess it will take at least a week before we can be certain.

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: Matthias Scheler <tron@zhadum.org.uk>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Fri, 23 Mar 2012 14:11:06 +0000

 On Fri, Mar 23, 2012 at 10:33:49AM +0100, Manuel Bouyer wrote:
 > > The new code in ufs_fhtovp() finally triggered.
 > > Here is the output of vprint():
 > > 
 > > ufs_fhtovp: ip == NULL on vp %p
 > > : vnode @ 0xfffffe8169463be0, flags (0x80010<MPSAFE,CLEAN>)
 > >         tag VT_NON(0), type VREG(1), usecount 1, writecount 0, holdcount 0
 > >         freelisthd 0x0, mount 0xfffffe8214f82000, data 0x0 lock 0xfffffe8169463cf0
 > 
 > So it got a clean vnode. This is bad.
 > I suspect ufs_ihashget() is returning the clean vnode; otherwise it has to be
 > in ffs_vget() itself.
 > 
 > Can you run a kernel with the attached patch and see if one of the KASSERTs
 > fires? You can probably turn the check into a printf if you want to
 > avoid the panic.

 I'm currently testing a kernel with Juergen Hannken-Illjes's patch,
 if that is alright.

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Matthias Scheler <tron@zhadum.org.uk>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Fri, 23 Mar 2012 15:29:37 +0100

 On Fri, Mar 23, 2012 at 02:11:06PM +0000, Matthias Scheler wrote:
 > I'm currently testing a kernel with Juergen Hannken-Illjes's patch,
 > if that is alright.

 Yes, I saw his patch after replying. Making vget not return a clean
 vnode unless LK_RETRY is set is probably right.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Fri, 23 Mar 2012 15:11:02 +0000

 On Fri, Mar 23, 2012 at 09:05:04AM +0000, J. Hannken-Illjes wrote:
  >  Please try the attached diff that changes vn_lock() to return an invalid
  >  vnode only if the caller requested it by setting LK_RETRY.

 Is this really what we want? Most callers of vn_lock don't check for
 error.

 Also, it seems bogus for the vnode to be cleaned while we're holding a
 reference to it.

 -- 
 David A. Holland
 dholland@netbsd.org

From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc: David Holland <dholland-bugs@netbsd.org>
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Fri, 23 Mar 2012 17:54:14 +0100

 On Mar 23, 2012, at 4:15 PM, David Holland wrote:

 > On Fri, Mar 23, 2012 at 09:05:04AM +0000, J. Hannken-Illjes wrote:
 >> Please try the attached diff that changes vn_lock() to return an invalid
 >> vnode only if the caller requested it by setting LK_RETRY.
 > 
 > Is this really what we want? Most callers of vn_lock don't check for
 > error.

 For now it should fix the error.  Any caller of vn_lock() has to either
 check for error OR pass LK_RETRY.  I'm not aware of any call violating
 this rule.

 > Also, it seems bogus for the vnode to be cleaned while we're holding a
 > reference to it.

 Sure, there is a race somewhere and once we get it fixed the change
 should become an assertion.
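
 The caller rule stated above (either check the error or pass LK_RETRY)
 can be illustrated with a minimal userspace sketch. All names here
 (sketch_vnode, sketch_vn_lock, sketch_use_checked, sketch_use_retry,
 SKETCH_LK_RETRY) are hypothetical and the stub only mimics the return
 values; how a real LK_RETRY caller deals with a dead vnode is an
 assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_LK_RETRY 0x1   /* hypothetical flag value */

struct sketch_vnode {
	int   clean;  /* nonzero once the vnode has been cleaned */
	void *data;   /* per-filesystem data; NULL on a clean vnode */
};

/* Stub: a clean vnode is refused unless the caller passed LK_RETRY. */
static int
sketch_vn_lock(struct sketch_vnode *vp, int flags)
{
	if (vp->clean && (flags & SKETCH_LK_RETRY) == 0)
		return ENOENT;
	return 0;
}

/* Style 1: no LK_RETRY, so the error is checked before vp->data is used. */
static int
sketch_use_checked(struct sketch_vnode *vp, void **out)
{
	int error = sketch_vn_lock(vp, 0);

	if (error != 0)
		return error;
	*out = vp->data;
	return 0;
}

/* Style 2: LK_RETRY, so the caller itself has to cope with a dead vnode. */
static int
sketch_use_retry(struct sketch_vnode *vp, void **out)
{
	(void)sketch_vn_lock(vp, SKETCH_LK_RETRY);
	if (vp->clean)
		return ENOENT;  /* assumption: treat a dead vnode as gone */
	*out = vp->data;
	return 0;
}
```

 Neither style dereferences the data pointer of a clean vnode, which is
 the failure seen in ufs_fhtovp().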

 --
 Juergen Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)

From: Matthias Scheler <tron@zhadum.org.uk>
To: NetBSD GNATS <gnats-bugs@NetBSD.org>
Cc: 
Subject: Re: kern/46221: Kernel panic in NFS server code
Date: Wed, 4 Apr 2012 12:56:00 +0100

 On Fri, Mar 23, 2012 at 04:55:02PM +0000, J. Hannken-Illjes wrote:
 >  On Mar 23, 2012, at 4:15 PM, David Holland wrote:
 >  
 >  > On Fri, Mar 23, 2012 at 09:05:04AM +0000, J. Hannken-Illjes wrote:
 >  >> Please try the attached diff that changes vn_lock() to return an invalid
 >  >> vnode only if the caller requested it by setting LK_RETRY.
 >  > 
 >  > Is this really what we want? Most callers of vn_lock don't check for
 >  > error.
 >  
 >  For now it should fix the error.  Any caller of vn_lock() has to either
 >  check for error OR pass LK_RETRY.  I'm not aware of any call violating
 >  this rule.

 My NFS server has now been up for almost 12 days without problems. As my
 belt-and-braces fix in ufs_fhtovp() didn't produce any kernel messages,
 it seems that Juergen's patch does indeed prevent the problem from happening.

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: "Juergen Hannken-Illjes" <hannken@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46221 CVS commit: src/sys/kern
Date: Thu, 5 Apr 2012 07:26:37 +0000

 Module Name:	src
 Committed By:	hannken
 Date:		Thu Apr  5 07:26:37 UTC 2012

 Modified Files:
 	src/sys/kern: vfs_vnops.c

 Log Message:
 Fix vn_lock() to return an invalid (dead, clean) vnode
 only if the caller requested it by setting LK_RETRY.

 Should fix PR #46221: Kernel panic in NFS server code


 To generate a diff of this commit:
 cvs rdiff -u -r1.183 -r1.184 src/sys/kern/vfs_vnops.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

Responsible-Changed-From-To: kern-bug-people->hannken
Responsible-Changed-By: hannken@NetBSD.org
Responsible-Changed-When: Thu, 12 Apr 2012 09:52:06 +0000
Responsible-Changed-Why:
Take.


State-Changed-From-To: open->pending-pullups
State-Changed-By: hannken@NetBSD.org
State-Changed-When: Thu, 12 Apr 2012 09:52:06 +0000
State-Changed-Why:
Fixed in -current -- Pullup pending.


From: "Jeff Rizzo" <riz@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46221 CVS commit: [netbsd-6] src/sys/kern
Date: Thu, 12 Apr 2012 17:15:23 +0000

 Module Name:	src
 Committed By:	riz
 Date:		Thu Apr 12 17:15:23 UTC 2012

 Modified Files:
 	src/sys/kern [netbsd-6]: vfs_vnops.c

 Log Message:
 Pull up following revision(s) (requested by hannken in ticket #179):
 	sys/kern/vfs_vnops.c: revision 1.184
 Fix vn_lock() to return an invalid (dead, clean) vnode
 only if the caller requested it by setting LK_RETRY.
 Should fix PR #46221: Kernel panic in NFS server code


 To generate a diff of this commit:
 cvs rdiff -u -r1.183 -r1.183.8.1 src/sys/kern/vfs_vnops.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: pending-pullups->closed
State-Changed-By: hannken@NetBSD.org
State-Changed-When: Thu, 12 Apr 2012 17:25:06 +0000
State-Changed-Why:
Pulled up.


>Unformatted:
