NetBSD Problem Report #44651
From hf@spg.tu-darmstadt.de Mon Feb 28 15:35:10 2011
Return-Path: <hf@spg.tu-darmstadt.de>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by www.NetBSD.org (Postfix) with ESMTP id 7E78A63B8A7
for <gnats-bugs@gnats.NetBSD.org>; Mon, 28 Feb 2011 15:35:10 +0000 (UTC)
Message-Id: <201102281535.p1SFZ2QF023206@venediger.nt.e-technik.tu-darmstadt.de>
Date: Mon, 28 Feb 2011 16:35:02 +0100 (CET)
From: Hauke Fath <hf@spg.tu-darmstadt.de>
Reply-To: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@gnats.NetBSD.org
Cc: Hauke Fath <hf@spg.tu-darmstadt.de>
Subject: mountd panics nullfs on large disk
X-Send-Pr-Version: 3.95
>Number: 44651
>Category: kern
>Synopsis: mountd panics nullfs on large disk
>Confidential: no
>Severity: critical
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Feb 28 15:40:00 +0000 2011
>Last-Modified: Fri Feb 22 14:30:00 +0000 2019
>Originator: Hauke Fath
>Release: NetBSD 5.1_STABLE
>Organization:
--
/~\ The ASCII Ribbon Campaign Hauke Fath
\ / No HTML/RTF in email Institut für Nachrichtentechnik
X No Word docs in email TU Darmstadt
/ \ Respect for open standards Ruf +49-6151-16-3281
>Environment:
System: NetBSD venediger 5.1_STABLE NetBSD 5.1_STABLE (VENEDIGER) #0: Wed Feb 23 21:05:10 CET 2011 hf@Hochstuhl:/var/obj/netbsd-builds/5/i386/sys/arch/i386/compile/VENEDIGER i386
Architecture: i386
Machine: i386
% df
Filesystem     1K-blocks       Used      Avail %Cap Mounted on
/dev/ld0a        1938830     560488    1281402  30% /
/dev/ld0e        9694638    2134610    7075298  23% /var
tmpfs              10240        332       9908   3% /var/run
/dev/ld0f        6301678     687640    5298956  11% /usr/pkg
/dev/ld0g        6301678     606954    5379642  10% /cvsroot
/dev/ld0h     1705366888  738082700  882015844  45% /u
tmpfs             512000         36     511964   0% /tmp
kernfs                 1          1          0 100% /kern
procfs                 4          4          0 100% /usr/pkg/emul/linux/proc
ptyfs                  1          1          0 100% /dev/pts
% dmesg | grep amr
amr0 at pci3 dev 2 function 0: AMI RAID <MegaRAID SCSI 320-2>
amr0: interrupting at ioapic2 pin 4
amr0: firmware 1L51, BIOS G500, 64MB RAM
ld0 at amr0 unit 0: RAID 5, optimal
% fgrep null /etc/fstab
##/u/homes /public/homes null rw,hidden 0 0
/u/binaries/linux/cxoffice /public/winbin null ro,hidden 0 0
/u/binaries/linux/matlabr13_1 /public/matlab-r13 null ro,hidden 0 0
/u/binaries/linux/matlabr14 /public/matlab-r14 null ro,hidden 0 0
/u/software/public /public/software null rw,hidden 0 0
/u/software/payware /public/payware null rw,hidden 0 0
/u/documents /public/documents null rw,hidden 0 0
/u/nts/research /public/nts-research null rw,hidden 0 0
/u/nts/teaching /public/nts-teaching null rw,hidden 0 0
/u/spg/research /public/research null rw,hidden 0 0
/u/spg/teaching /public/teaching null rw,hidden 0 0
/u/spg/prac /public/spg-prac null rw,hidden 0 0
/u/scratch /public/scratch null rw,hidden 0 0
/var/spool/export/usr.sparc /public/usr.sparc null rw,hidden 0 0
/var/spool/export/pasterze /public/pasterze null rw,hidden 0 0
%
# du -chs /u/*
3.1G binaries
449M documents
558G homes
30G local
24G nts
621M scratch
52G software
36G spg
4.0K tmp
704G total
#
>Description:
Having mountd export a nullfs on a large disk leads to a
"panic: ifree: freeing free inode".
The machine in question is a file server (SMB, AFP, but mostly
NFS) running off a RAID 5 on an amr(4) (LSI Logic MegaRAID
320-2). It was recently upgraded from a 4-disk 900 GB RAID 5
to 2x4-disk 1800 GB RAID 5, striped.
In order to be able to export parts of /u with differing
credentials, those subtrees are null_mounted to
/public/${share}.
After the upgrade, the machine would panic while going
multi-user, at different points but always with the same
message:
[...]
Mounting all filesystems...
/cvsroot: replaying log to disk
/u: replaying log to disk
Clearing temporary files.
Turning on accounting.
Feb 28 07:35:23 venediger /netbsd: Accounting started
Starting amd.
Feb 28 07:35:23 venediger amd[223]/info: using configuration file /etc/amd.conf
Creating a.out runtime link editor directory cache.
Starting mountd.
Starting nfsd.
Starting statd.
Starting lockd.
ifree: dev = 0x1307, ino = 55104332, fs = /u
panic: ifree: freeing free inode
Begin traceback...
uvm_fault(0xcd2fe5cc, 0, 1) -> 0xe
fatal page fault in supervisor mode
trap type 6 code 0 eip c02c06ae cs 8 eflags 10246 cr2 0 ilevel 0
panic: trap
Faulted in mid-traceback; aborting...
dumping to dev 19,1 offset 8
dump 101 100 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70
rebooting...
Invariably, an 'fsck -f' of /u found no problems. About half
of the time, the amr(4) also reported errors during the dump, like
[...]
Updating motd.
Starting ntpd.
Starting timed.
/etc/rc: WARNING: $nmbd is not set properly - see rc.conf(5).
Starting sshd.
Starting sendmail.
ifree: dev = 0x1307, ino = 55104332, fs = /u
panic: ifree: freeing free inodeifree: dev = 0x1307, ino = 74028942, fs = /u
Begin traceback...
uvm_fault(0xcea051a8, 0, 1) -> 0xe
fatal page fault in supervisor mode
trap type 6 code 0 eip c02c06ae cs 8 eflags 10246 cr2 0 ilevel 0
panic: trap
Faulted in mid-traceback; aborting...
dumping to dev 19,1 offset 8
dump 105 amr0: bad status (not active; 0x0416)
amr0: bad status (not active; 0x0416)
amr0: bad status (not active; 0x0412)
amr0: bad status (not active; 0x0412)
104 amr0: bad status (not active; 0x0412)
amr0: bad status (not active; 0x0412)
103 amr0: bad status (not active; 0x0416)
amr0: bad status (not active; 0x0416)
amr0: bad status (not active; 0x0412)
amr0: bad status (not active; 0x0416)
amr0: bad status (not active; 0x0416)
102 amr0: bad status (not active; 0x0416)
amr0: bad status (not active; 0x0416)
amr0: bad status (not active; 0x0412)
101 amr0: bad status (not active; 0x0412)
100 amr0: bad status (not active; 0x0412)
99 amr0: bad status (not active; 0x0416)
amr0: bad status (not active; 0x0412)
98 97 amr0: bad status (not active; 0x0416)
amr0: bad status (not active; 0x0416)
amr0: bad status (not active; 0x0412)
amr0: bad status (not active; 0x0416)
amr0: bad status (not active; 0x0412)
96 error 35
amr0: bad status (not active; 0x040)
rebooting...
which suggests that something was scribbling over kernel memory.
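As a cross-check, the dev value printed in the ifree message can be split into major and minor numbers. The following is only a minimal sketch: it assumes the dev_t layout of the major() and minor() macros in NetBSD's <sys/types.h> (major in bits 8-19, minor in bits 0-7 plus bits 20-31), and the function name decode_dev is made up for illustration.

```shell
#!/bin/sh
# Decode a NetBSD dev_t value into "major,minor".
# Assumption: bit layout as in the major()/minor() macros of
# NetBSD's <sys/types.h>; decode_dev is a hypothetical helper name.
decode_dev() {
    dev=$(( $1 ))
    # Major number: bits 8..19.
    major=$(( (dev & 0x000fff00) >> 8 ))
    # Minor number: bits 20..31 shifted down by 12, plus bits 0..7.
    minor=$(( ((dev & 0xfff00000) >> 12) | (dev & 0x000000ff) ))
    echo "${major},${minor}"
}

decode_dev 0x1307   # dev value from the netbsd-5 panic
decode_dev 0xa804   # dev value from the netbsd-7 panic
```

Under that assumption, 0x1307 decodes to 19,7 and 0xa804 to 168,4. The major numbers match those of the dump devices named in the panics ("dev 19,1" and "dev 168,2"), which is consistent with /u living on the same disk as the dump partition.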
Booting a -current kernel made no difference.
Re-making and restoring /u made no difference.
Switching off 'log' on the mount made no difference
(other than the fsck taking 20 min).
In single user, /u was mountable and accessible just fine.
Toggling the rc.conf switches, I found that leaving mountd(8)
out made all the difference.
Locking out nfs clients by /etc/hosts.{allow,deny} rules
didn't make a difference, though.
Going through the list of null mounts in fstab, I found that
commenting out the null_mount of /u/homes prevented the
panic.
Since an export through null_mount worked for /u/homes on the
smaller disk array, either the sheer size of the file system
(1600 GB) or the switch of /u to FFSv2 pushes nullfs over the
edge.
I have an assortment of cores (~16 MB each), if it helps;
unfortunately, the stack traces look less than helpful.
>How-To-Repeat:
Set up a multi-TB disk. Null_mount a sub-tree worth half a TB
elsewhere, and export the mount point through nfs.
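The steps above can be sketched as a small script. This is a hedged illustration, not a verified reproducer: the paths are taken from the fstab entries in this report, the client network 192.0.2.0/24 is a placeholder, and the helper names run and repro are invented. By default it only prints the commands; set DRY_RUN=0 to actually execute them as root.

```shell
#!/bin/sh
# Sketch of the reproduction steps; prints the commands unless DRY_RUN=0.
# Assumptions: /u/homes is the large sub-tree, /public/homes the null
# mount point, and 192.0.2.0/24 a placeholder client network.
DRY_RUN=${DRY_RUN:-1}
SRC=/u/homes
DST=/public/homes

# Hypothetical helper: echo the command in dry-run mode, else run it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

repro() {
    # Null-mount the sub-tree, using the same options as the fstab entries.
    run mount -t null -o rw,hidden "$SRC" "$DST"
    # Export line to add to /etc/exports by hand (not written by this script).
    echo "# /etc/exports: $DST -network 192.0.2.0 -mask 255.255.255.0"
    # Ask mountd to re-read /etc/exports.
    run /etc/rc.d/mountd reload
}

repro
```

The panic in this report only showed up once NFS requests reached the exported null mount, so a client on the placeholder network would still have to mount and access the export afterwards.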
>Fix:
My workaround for now is to export /u/homes directly, instead
of through a null_mount.
My guess is that something wraps in nullfs for big sub-trees,
when it shouldn't.
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: gnats-admin->kern-bug-people
Responsible-Changed-By: dholland@NetBSD.org
Responsible-Changed-When: Wed, 09 Mar 2011 09:04:30 +0000
Responsible-Changed-Why:
pick up after gnats
From: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Cc:
Subject: Re: kern/44651 (mountd panics nullfs on large disk)
Date: Fri, 22 Feb 2019 15:28:35 +0100
FTR, this bug just reared its ugly head again... on the same machine,
running netbsd-7 from November 2018, re-purposed as an NFS server for
sources and as a Radmind server.
I fscked the three partitions, dumped them to a sata disk, doubled the
size of the raid 10 array, and restored. Booting, I was greeted with
NetBSD 7.2_STABLE (MONOLITHIC) #1: Wed Nov 7 12:01:16 CET 2018
hf@Hochstuhl:/var/obj/netbsd-builds/7/amd64/sys/arch/amd64/compile/MONOLITHIC
total memory = 4094 MB
avail memory = 3956 MB
[...]
Starting sendmail.
Starting smmsp.
Starting radmind.
Starting inetd.
Starting cron.
Fri Feb 22 11:04:47 CET 2019
ifree: dev = 0xa804, ino = 7935387, fs = /u
panic: ifree: freeing free inode
cpu0: Begin traceback...
vpanic() at netbsd:vpanic+0x13c
snprintf() at netbsd:snprintf
ffs_mapsearch() at netbsd:ffs_mapsearch
ffs_freefile() at netbsd:ffs_freefile+0xfd
ffs_reclaim() at netbsd:ffs_reclaim+0x15a
VOP_RECLAIM() at netbsd:VOP_RECLAIM+0x2f
vclean() at netbsd:vclean+0xbd
vrelel() at netbsd:vrelel+0x636
ufs_fhtovp() at netbsd:ufs_fhtovp+0x54
ffs_fhtovp() at netbsd:ffs_fhtovp+0x4c
VFS_FHTOVP() at netbsd:VFS_FHTOVP+0x1c
layerfs_fhtovp() at netbsd:layerfs_fhtovp+0x21
VFS_FHTOVP() at netbsd:VFS_FHTOVP+0x1c
nfsrv_fhtovp() at netbsd:nfsrv_fhtovp+0x68
nfs_namei() at netbsd:nfs_namei+0x154
nfsrv_lookup() at netbsd:nfsrv_lookup+0x264
do_nfssvc() at netbsd:do_nfssvc+0x347
syscall() at netbsd:syscall+0x9a
--- syscall (number 155) ---
7f7ff703cb3a:
cpu0: End traceback...
dumping to dev 168,2 (offset=8391975, size=1048155):
dump area improper
uvm_fault(0xfffffe812e5048a8, 0x0, 2) -> e
fatal page fault in supervisor mode
trap type 6 code 2 rip ffffffff8061e38f cs 8 rflags 10286 cr2 84 ilevel 8 rsp fffffe8040963e00
curlwp 0xfffffe812e4c3320 pid 279.1 lowest kstack 0xfffffe80409602c0
rebooting...
Just like before, fsck(8) did not find anything wrong with /u, which is
a 1000 GB partition now, about half full. The panic was reproducible.
/etc/fstab null mounts are:
/u/sources/netbsd-5/src /public/netbsd-5 null ro,hidden
/u/sources/netbsd-6/src /public/netbsd-6 null ro,hidden
/u/sources/netbsd-7/src /public/netbsd-7 null ro,hidden
/u/sources/netbsd-8/src /public/netbsd-8 null ro,hidden
/u/sources/netbsd-developer/src /public/netbsd-developer null ro,hidden
/u/sources/pkgsrc /public/pkgsrc null ro,hidden
#
/u/packages/distfiles /public/pkg-distfiles null rw,hidden
#
/u/sources/freebsd-11/src /public/freebsd-11 null ro,hidden
% du -sh /public/*
2.9G /public/freebsd-11
4.0K /public/netbsd-5
2.7G /public/netbsd-6
3.2G /public/netbsd-7
3.2G /public/netbsd-8
4.3G /public/netbsd-developer
10G /public/pkg-distfiles
2.2G /public/pkgsrc
%
After disabling the largest of the null mounts (/public/pkg-distfiles)
the machine booted fine.
For lack of a better idea, I upgraded to -8, and the panic does not
occur there.
>Unformatted: