NetBSD Problem Report #44651

From hf@spg.tu-darmstadt.de  Mon Feb 28 15:35:10 2011
Return-Path: <hf@spg.tu-darmstadt.de>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by www.NetBSD.org (Postfix) with ESMTP id 7E78A63B8A7
	for <gnats-bugs@gnats.NetBSD.org>; Mon, 28 Feb 2011 15:35:10 +0000 (UTC)
Message-Id: <201102281535.p1SFZ2QF023206@venediger.nt.e-technik.tu-darmstadt.de>
Date: Mon, 28 Feb 2011 16:35:02 +0100 (CET)
From: Hauke Fath <hf@spg.tu-darmstadt.de>
Reply-To: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@gnats.NetBSD.org
Cc: Hauke Fath <hf@spg.tu-darmstadt.de>
Subject: mountd panics nullfs on large disk
X-Send-Pr-Version: 3.95

>Number:         44651
>Category:       kern
>Synopsis:       mountd panics nullfs on large disk
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Feb 28 15:40:00 +0000 2011
>Last-Modified:  Fri Feb 22 14:30:00 +0000 2019
>Originator:     Hauke Fath
>Release:        NetBSD 5.1_STABLE
>Organization:
-- 
/~\  The ASCII Ribbon Campaign                      Hauke Fath
\ /    No HTML/RTF in email	          Institut für Nachrichtentechnik
 X     No Word docs in email	                    TU Darmstadt
/ \  Respect for open standards                Ruf +49-6151-16-3281
>Environment:


System: NetBSD venediger 5.1_STABLE NetBSD 5.1_STABLE (VENEDIGER) #0: Wed Feb 23 21:05:10 CET 2011 hf@Hochstuhl:/var/obj/netbsd-builds/5/i386/sys/arch/i386/compile/VENEDIGER i386
Architecture: i386
Machine: i386

% df                                                   
Filesystem   1K-blocks       Used      Avail %Cap Mounted on
/dev/ld0a       1938830     560488    1281402  30% /
/dev/ld0e       9694638    2134610    7075298  23% /var
tmpfs             10240        332       9908   3% /var/run
/dev/ld0f       6301678     687640    5298956  11% /usr/pkg
/dev/ld0g       6301678     606954    5379642  10% /cvsroot
/dev/ld0h    1705366888  738082700  882015844  45% /u
tmpfs            512000         36     511964   0% /tmp
kernfs                1          1          0 100% /kern
procfs                4          4          0 100% /usr/pkg/emul/linux/proc
ptyfs                 1          1          0 100% /dev/pts
% dmesg | grep amr
amr0 at pci3 dev 2 function 0: AMI RAID <MegaRAID SCSI 320-2>
amr0: interrupting at ioapic2 pin 4
amr0: firmware 1L51, BIOS G500, 64MB RAM
ld0 at amr0 unit 0: RAID 5, optimal
% fgrep null /etc/fstab
##/u/homes                      /public/homes           null    rw,hidden       0 0
/u/binaries/linux/cxoffice      /public/winbin          null    ro,hidden       0 0
/u/binaries/linux/matlabr13_1   /public/matlab-r13      null    ro,hidden       0 0
/u/binaries/linux/matlabr14     /public/matlab-r14      null    ro,hidden       0 0
/u/software/public              /public/software        null    rw,hidden       0 0
/u/software/payware             /public/payware         null    rw,hidden       0 0
/u/documents                    /public/documents       null    rw,hidden       0 0
/u/nts/research                 /public/nts-research    null    rw,hidden       0 0
/u/nts/teaching                 /public/nts-teaching    null    rw,hidden       0 0
/u/spg/research                 /public/research        null    rw,hidden       0 0
/u/spg/teaching                 /public/teaching        null    rw,hidden       0 0
/u/spg/prac                     /public/spg-prac        null    rw,hidden       0 0
/u/scratch                      /public/scratch         null    rw,hidden       0 0
/var/spool/export/usr.sparc     /public/usr.sparc       null    rw,hidden       0 0
/var/spool/export/pasterze      /public/pasterze        null    rw,hidden       0 0
%

# du -chs /u/*
3.1G    binaries
449M    documents
558G    homes
30G     local
24G     nts
621M    scratch
52G     software
36G     spg
4.0K    tmp
704G    total
#

>Description:

	Having mountd export a nullfs on a large disk leads to a
	"panic: ifree: freeing free inode".

	The machine in question is a file server (smb, afp, but mostly
	nfs) running off a RAID 5 on an amr(4) (LSI Logic MegaRAID
	320-2). It was recently upgraded from a 4-disk 900 GB RAID 5
	to a 2x4-disk 1800 GB RAID 5, striped.

	In order to be able to export parts of /u with differing
	credentials, those subtrees are null_mounted to
	/public/${share}.

	After the upgrade, the machine would panic while going
	multi-user, at different points but always with the same
	message:

[...]
Mounting all filesystems...
/cvsroot: replaying log to disk
/u: replaying log to disk
Clearing temporary files.
Turning on accounting.
Feb 28 07:35:23 venediger /netbsd: Accounting started
Starting amd.
Feb 28 07:35:23 venediger amd[223]/info:  using configuration file /etc/amd.conf
Creating a.out runtime link editor directory cache.
Starting mountd.
Starting nfsd.
Starting statd.
Starting lockd.
ifree: dev = 0x1307, ino = 55104332, fs = /u
panic: ifree: freeing free inode
Begin traceback...
uvm_fault(0xcd2fe5cc, 0, 1) -> 0xe
fatal page fault in supervisor mode
trap type 6 code 0 eip c02c06ae cs 8 eflags 10246 cr2 0 ilevel 0
panic: trap
Faulted in mid-traceback; aborting...
dumping to dev 19,1 offset 8
dump 101 100 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70 


rebooting...

	Invariably, an 'fsck -f' of /u came up without any issues. Half
	of the time, the amr(4) was messed up like

[...]
Updating motd.
Starting ntpd.
Starting timed.
/etc/rc: WARNING: $nmbd is not set properly - see rc.conf(5).
Starting sshd.
Starting sendmail.
ifree: dev = 0x1307, ino = 55104332, fs = /u
panic: ifree: freeing free inodeifree: dev = 0x1307, ino = 74028942, fs = /u

Begin traceback...
uvm_fault(0xcea051a8, 0, 1) -> 0xe
fatal page fault in supervisor mode
trap type 6 code 0 eip c02c06ae cs 8 eflags 10246 cr2 0 ilevel 0
panic: trap
Faulted in mid-traceback; aborting...
dumping to dev 19,1 offset 8
dump 105 amr0: bad status (not active; 0x0416)
amr0: bad status (not active; 0x0416)
amr0: bad status (not active; 0x0412)
amr0: bad status (not active; 0x0412)
104 amr0: bad status (not active; 0x0412)
amr0: bad status (not active; 0x0412)
103 amr0: bad status (not active; 0x0416)
amr0: bad status (not active; 0x0416)
amr0: bad status (not active; 0x0412)
amr0: bad status (not active; 0x0416)
amr0: bad status (not active; 0x0416)
102 amr0: bad status (not active; 0x0416)
amr0: bad status (not active; 0x0416)
amr0: bad status (not active; 0x0412)
101 amr0: bad status (not active; 0x0412)
100 amr0: bad status (not active; 0x0412)
99 amr0: bad status (not active; 0x0416)
amr0: bad status (not active; 0x0412)
98 97 amr0: bad status (not active; 0x0416)
amr0: bad status (not active; 0x0416)
amr0: bad status (not active; 0x0412)
amr0: bad status (not active; 0x0416)
amr0: bad status (not active; 0x0412)
96 error 35


amr0: bad status (not active; 0x040)
rebooting...

	indicating something was scribbling over kernel memory.

	Booting a -current kernel made no difference.

	Re-making and restoring /u made no difference.

	Switching off 'log' on the mount made no difference
	(other than fsck then taking 20 minutes).

	In single user, /u was mountable and accessible just fine.

	Toggling the rc.conf switches, I found that leaving mountd(8)
	out made all the difference.

	Locking out nfs clients by /etc/hosts.{allow,deny} rules
	didn't make a difference, though.

	Going through the list of null mounts in fstab, I found that
	commenting out the null_mount of /u/homes prevented the
	panic. 

	Since an export through null_mount worked for /u/homes on the
	smaller disk array, either the sheer size of the file system
	(1600 GB), or the switch of /u to FFSv2, pushes nullfs over
	the edge.

	I have an assortment of cores (~16 MB each), if it helps;
	unfortunately, the stack traces look less than helpful.
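
	For reference, the "dev = 0x1307" in the ifree message can be
	split into major and minor numbers. The sketch below mirrors
	the major()/minor() bit layout from NetBSD's <sys/types.h>;
	the helper names are made up for illustration, and the mapping
	from minor number to partition letter assumes the usual disk
	convention of minor = unit * MAXPARTITIONS + partition.

```python
def decode_dev(dev):
    """Split a NetBSD dev_t value into (major, minor).

    Bit layout as in the major()/minor() macros of <sys/types.h>:
    major is bits 8..19, minor is bits 20..31 joined with bits 0..7.
    """
    major = (dev & 0x000FFF00) >> 8
    minor = ((dev & 0xFFF00000) >> 12) | (dev & 0x000000FF)
    return major, minor


def partition_letter(minor, maxpartitions=16):
    """Partition letter for a disk minor number.

    Assumption: minor = unit * MAXPARTITIONS + partition index;
    maxpartitions=16 is an illustrative default, not taken from the report.
    """
    return chr(ord("a") + minor % maxpartitions)


if __name__ == "__main__":
    major, minor = decode_dev(0x1307)
    print(major, minor, partition_letter(minor))
```

	Under these assumptions 0x1307 decodes to major 19, minor 7,
	i.e. partition h of unit 0. That matches /u living on
	/dev/ld0h in the df output, and the same major number shows
	up in "dumping to dev 19,1".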

>How-To-Repeat:

	Set up a multi-TB disk. Null_mount a sub-tree worth half a TB
	elsewhere, and export the mount point through nfs.

>Fix:
	My workaround for now is to export /u/homes directly, instead
	of through a null_mount.

	My guess is that something wraps in nullfs for big sub-trees,
	when it shouldn't.

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: gnats-admin->kern-bug-people
Responsible-Changed-By: dholland@NetBSD.org
Responsible-Changed-When: Wed, 09 Mar 2011 09:04:30 +0000
Responsible-Changed-Why:
pick up after gnats


From: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Cc: 
Subject: Re: kern/44651 (mountd panics nullfs on large disk)
Date: Fri, 22 Feb 2019 15:28:35 +0100

 FTR, this bug just reared its ugly head again... on the same machine, 
 running netbsd-7 from November 2018, re-purposed as nfs server for 
 sources, and Radmind server.

 I fscked the three partitions, dumped them to a sata disk, doubled the 
 size of the raid 10 array, and restored. Booting, I was greeted with

 NetBSD 7.2_STABLE (MONOLITHIC) #1: Wed Nov  7 12:01:16 CET 2018

 hf@Hochstuhl:/var/obj/netbsd-builds/7/amd64/sys/arch/amd64/compile/MONOLITHIC
 total memory = 4094 MB
 avail memory = 3956 MB

 [...]

 Starting sendmail.
 Starting smmsp.
 Starting radmind.
 Starting inetd.
 Starting cron.
 Fri Feb 22 11:04:47 CET 2019
 ifree: dev = 0xa804, ino = 7935387, fs = /u
 panic: ifree: freeing free inode
 cpu0: Begin traceback...
 vpanic() at netbsd:vpanic+0x13c
 snprintf() at netbsd:snprintf
 ffs_mapsearch() at netbsd:ffs_mapsearch
 ffs_freefile() at netbsd:ffs_freefile+0xfd
 ffs_reclaim() at netbsd:ffs_reclaim+0x15a
 VOP_RECLAIM() at netbsd:VOP_RECLAIM+0x2f
 vclean() at netbsd:vclean+0xbd
 vrelel() at netbsd:vrelel+0x636
 ufs_fhtovp() at netbsd:ufs_fhtovp+0x54
 ffs_fhtovp() at netbsd:ffs_fhtovp+0x4c
 VFS_FHTOVP() at netbsd:VFS_FHTOVP+0x1c
 layerfs_fhtovp() at netbsd:layerfs_fhtovp+0x21
 VFS_FHTOVP() at netbsd:VFS_FHTOVP+0x1c
 nfsrv_fhtovp() at netbsd:nfsrv_fhtovp+0x68
 nfs_namei() at netbsd:nfs_namei+0x154
 nfsrv_lookup() at netbsd:nfsrv_lookup+0x264
 do_nfssvc() at netbsd:do_nfssvc+0x347
 syscall() at netbsd:syscall+0x9a
 --- syscall (number 155) ---
 7f7ff703cb3a:
 cpu0: End traceback...

 dumping to dev 168,2 (offset=8391975, size=1048155):
 dump area improper


 uvm_fault(0xfffffe812e5048a8, 0x0, 2) -> e
 fatal page fault in supervisor mode
 trap type 6 code 2 rip ffffffff8061e38f cs 8 rflags 10286 cr2 84 ilevel 8 rsp fffffe8040963e00
 curlwp 0xfffffe812e4c3320 pid 279.1 lowest kstack 0xfffffe80409602c0
 rebooting...


 Just like before, fsck(8) did not find anything wrong with /u, which is 
 a 1000 GB partition now, about half full. The panic was reproducible.

 /etc/fstab null mounts are:

 /u/sources/netbsd-5/src         /public/netbsd-5        null    ro,hidden
 /u/sources/netbsd-6/src         /public/netbsd-6        null    ro,hidden
 /u/sources/netbsd-7/src         /public/netbsd-7        null    ro,hidden
 /u/sources/netbsd-8/src         /public/netbsd-8        null    ro,hidden
 /u/sources/netbsd-developer/src /public/netbsd-developer null   ro,hidden
 /u/sources/pkgsrc               /public/pkgsrc          null    ro,hidden
 #
 /u/packages/distfiles           /public/pkg-distfiles   null    rw,hidden
 #
 /u/sources/freebsd-11/src       /public/freebsd-11      null    ro,hidden

 % du -sh /public/*
 2.9G    /public/freebsd-11
 4.0K    /public/netbsd-5
 2.7G    /public/netbsd-6
 3.2G    /public/netbsd-7
 3.2G    /public/netbsd-8
 4.3G    /public/netbsd-developer
 10G     /public/pkg-distfiles
 2.2G    /public/pkgsrc
 %

 After disabling the largest of the null mounts (/public/pkg-distfiles) 
 the machine booted fine.

 For lack of a better idea, I upgraded to -8, and the panic does not 
 occur there.

>Unformatted: