NetBSD Problem Report #46224
From petar@starling.smokva.net Mon Mar 19 02:29:38 2012
Return-Path: <petar@starling.smokva.net>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
by www.NetBSD.org (Postfix) with ESMTP id A32CC63B946
for <gnats-bugs@gnats.NetBSD.org>; Mon, 19 Mar 2012 02:29:38 +0000 (UTC)
Message-Id: <20120319022943.C33CD17830A0@starling.smokva.net>
Date: Mon, 19 Mar 2012 03:29:43 +0100 (CET)
From: Petar Bogdanovic <petar@smokva.net>
To: gnats-bugs@gnats.NetBSD.org
Subject: fatal page fault, kernfs_readdir()
X-Send-Pr-Version: 3.95
>Number: 46224
>Category: kern
>Synopsis: kernel crash: fatal page fault in kernfs_readdir()
>Confidential: no
>Severity: critical
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Mar 19 02:30:01 +0000 2012
>Last-Modified: Sun Apr 15 14:54:17 +0000 2012
>Originator: Petar Bogdanovic
>Release: NetBSD 6.0_BETA (16.03.2012)
>Organization:
>Environment:
amd64
>Description:
A fairly recent netbsd-6 kernel (date: 16.03., arch: amd64) just
crashed several times. The bug seems reproducible and does not
appear when no kernfs is involved:
$ mount
/dev/raid0a on / type ffs (log, NFS exported, local)
kernfs on /kern type kernfs (local)
$ sudo find / -name '*,v'
/etc/mtree/special.local,v
(...many more lines...)
/var/backups/boot.cfg.current,v
uvm_fault(0xfffffe8114c4dbd0, 0x0, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff804f4ceb cs 8 rflags 10297 cr2 0 cpl 0 rsp fffffe80016077a0
kernel: page fault trap, code=0
Stopped in pid 847.1 (find) at netbsd:kernfs_readdir+0x687: movq 7fb0b30e(%rip),%rdi
db{1}> bt
kernfs_readdir() at netbsd:kernfs_readdir+0x687
VOP_READDIR() at netbsd:VOP_READDIR+0x65
vn_readdir() at netbsd:vn_readdir+0xf6
sys___getdents30() at netbsd:sys___getdents30+0x76
syscall() at netbsd:syscall+0xc4
The same situation yields a slightly different result when
ddb.onpanic=0 and ends with what seems to be a complete meltdown
after the core was successfully dumped:
uvm_fault(0xfffffe811556ad40, 0x0, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff804f4ceb cs 8 rflags 10297 cr2 0 cpl 0 rsp fffffe80015b77a0
panic: trap
cpu1: Begin traceback...
printf_nolog() at netbsd:printf_nolog
startlwp() at netbsd:startlwp
alltraps() at netbsd:alltraps+0xa2
VOP_READDIR() at netbsd:VOP_READDIR+0x65
vn_readdir() at netbsd:vn_readdir+0xf6
sys___getdents30() at netbsd:sys___getdents30+0x76
syscall() at netbsd:syscall+0xc4
cpu1: End traceback...
(..dump begins, finishes..)
pmap_kenter_pa: mapping already present
pmap_kenter_pa: mapping already present
pmap_kenter_pa: mapping already present
(..many, many more identical lines..)
(..takes as long as the core dump..)
pmap_kenter_pa: mapping already present
pmap_kenter_pa: mapping already present
pmap_kenter_pa: mapping already present
succeeded
Skipping crash dump on recursive panic
panic: wdc_exec_command: polled command not done
cpu1: Begin traceback...
printf_nolog() at netbsd:printf_nolog
wdccommand() at netbsd:wdccommand
wd_flushcache() at netbsd:wd_flushcache+0xd7
wd_shutdown() at netbsd:wd_shutdown+0x3e
pmf_system_shutdown() at netbsd:pmf_system_shutdown+0x81
cpu_reboot() at netbsd:cpu_reboot+0x2c
vpanic() at netbsd:vpanic+0x1dd
printf_nolog() at netbsd:printf_nolog
startlwp() at netbsd:startlwp
alltraps() at netbsd:alltraps+0xa2
VOP_READDIR() at netbsd:VOP_READDIR+0x65
vn_readdir() at netbsd:vn_readdir+0xf6
sys___getdents30() at netbsd:sys___getdents30+0x76
syscall() at netbsd:syscall+0xc4
cpu1: End traceback...
rebooting...
>How-To-Repeat:
find /kern -ls
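Because the fault only shows up when kernfs is involved, it can help to
confirm which mount points are kernfs before running the find. The
following is a minimal sketch, not part of the original report. It
assumes mount(8) prints lines in the "dev on dir type fstype (options)"
layout shown in the description, and that mount points contain no
whitespace. The function name kernfs_mounts is made up for illustration.

```shell
# kernfs_mounts (hypothetical helper): read mount(8) output on stdin and
# print the directory of every kernfs mount, one per line.
# Assumes the "<dev> on <dir> type <fstype> (<opts>)" layout from the report.
kernfs_mounts() {
    awk '$2 == "on" && $4 == "type" && $5 == "kernfs" { print $3 }'
}

# Usage. This is expected to crash an affected kernel, so run it only on
# a test machine:
#   mount | kernfs_mounts | while read -r dir; do find "$dir" -ls; done
```

If the helper prints nothing, no kernfs is mounted and the find should
not reach kernfs_readdir() at all.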
>Fix:
none
>Release-Note:
>Audit-Trail:
From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/46224: fatal page fault, kernfs_readdir()
Date: Mon, 19 Mar 2012 08:53:57 -0600
I don't know if I'm seeing quite the same error, but I've been chasing
a similar issue the last few days... What I see is:
fatal breakpoint trap in supervisor mode
trap type 1 code 0 rip ffffffff80133415 cs e030 rflags 282 cr2 7f7ff7327080 cpl 0 rsp ffffa0005b72d9a0
Stopped in pid 396.1 (find) at netbsd:breakpoint+0x5: leave
breakpoint() at netbsd:breakpoint+0x5
pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x25
static_qc_pools() at ffffffff80661100
static_qc_pools() at ffffffff80661480
Bad frame pointer: 0xffffffff8078a7e0
ds ffff
es a14a
fs 0
gs b278
rdi 0
rsi fffffffe
rbp ffffa0005b72d9a0
rbx ffffa0005b72dad0
rdx 1000000
rcx ffffa0000456b000
rax ffffffff80d0b0c0
r8 ffffa0000456b000
r9 400
r10 2
r11 ffffa0000460308d
r12 ffffa00004603000
r13 ffffa0000739a870
r14 ffffffffffffffff
r15 ffffa00004603098
rip ffffffff80133415 breakpoint+0x5
cs e030
rflags 282
rsp ffffa0005b72d9a0
ss e02b
netbsd:breakpoint+0x5: leave
db{3}>
and I can trigger it on demand with: find -x / -name "ajsdf" -print
The kernel is a netbsd-6 XEN3_DOMU kernel on amd64, with DEBUG and
debug_freecheck turned on.
Later...
Greg Oster
From: Lars Heidieker <lars@heidieker.de>
To: gnats-bugs@NetBSD.org, oster@cs.usask.ca
Cc:
Subject: Re: kern/46224: fatal page fault, kernfs_readdir()
Date: Thu, 22 Mar 2012 17:58:39 +0100
If I haven't missed anything, debug_freecheck is broken. I hacked my way
around two problems. First, the disable logic for running out of slots
is the wrong way round. Second, once that is corrected, startup fails.
After I circumvented that with a hack, the system kept running until it
ran out of slots (which I made panic so I couldn't miss it).
The bug(s) that are out there aren't those indicated by debug_freecheck.
Lars
>Unformatted: