NetBSD Problem Report #47041
From campbell@mumble.net Fri Oct 5 14:06:49 2012
Return-Path: <campbell@mumble.net>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
by www.NetBSD.org (Postfix) with ESMTP id 28E5963B8EB
for <gnats-bugs@gnats.NetBSD.org>; Fri, 5 Oct 2012 14:06:49 +0000 (UTC)
Message-Id: <20121005140622.2DC18604AF@jupiter.mumble.net>
Date: Fri, 5 Oct 2012 14:06:22 +0000 (UTC)
From: Taylor R Campbell <campbell+netbsd@mumble.net>
Reply-To: Taylor R Campbell <campbell+netbsd@mumble.net>
To: gnats-bugs@gnats.NetBSD.org
Subject: amd64 kernel core dumps are broken
X-Send-Pr-Version: 3.95
>Number: 47041
>Category: kern
>Synopsis: amd64 kernel core dumps are broken
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: jdolecek
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Oct 05 14:10:00 +0000 2012
>Closed-Date: Thu Jan 07 15:24:29 +0000 2021
>Last-Modified: Thu Jan 07 15:24:39 +0000 2021
>Originator: Taylor R Campbell <campbell+netbsd@mumble.net>
>Release: NetBSD 6.99.11
>Organization:
>Environment:
Architecture: amd64
Machine: amd64
>Description:
After a panic in namei the other day, my system wrote a kernel
core dump which savecore had trouble reading:
Oct 3 14:40:58 manticore savecore: reboot after panic: leaf `current.ro' should be empty
Oct 3 14:40:58 manticore savecore: system went down at Tue Oct 2 18:17:25 2012
Oct 3 14:40:58 manticore savecore: writing core to ./netbsd.0.core
Oct 3 18:06:41 manticore savecore: writing kernel to ./netbsd.0
Oct 3 18:06:41 manticore savecore: kvm_read: invalid translation (invalid level 4 PDE)
Oct 3 18:06:41 manticore savecore: (null): Bad address
The `(null): Bad address' error I have been seeing for years on
i386, but the `invalid level 3/4 PDE' messages I have not seen
before. Also, the three and a half hour delay in writing the
core dump is...odd. Most of the time, savecore -- and anything
else trying to stat netbsd.0.core, such as `ls -l' -- was stuck
waiting for a vnode lock, presumably of netbsd.0.core, but I
don't know who held it (crash(8) is broken when it comes to
vnodes), and nothing seemed to be spinning.
Attempting to load the core in gdb didn't work either:
(gdb) target kvm netbsd.0.core
invalid translation (invalid level 3 PDE)
(gdb) bt
invalid translation (invalid level 3 PDE)
invalid translation (invalid level 3 PDE)
invalid translation (invalid level 3 PDE)
(gdb) bt
invalid translation (invalid level 3 PDE)
invalid translation (invalid level 3 PDE)
(gdb) bt
invalid translation (invalid level 3 PDE)
invalid translation (invalid level 3 PDE)
(gdb)
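For readers puzzling over the `invalid level 3/4 PDE' wording: on amd64 with 4-level paging and 4 KB pages, a virtual address is split into four 9-bit table indices plus a 12-bit page offset, and a translation fails at whichever level has no valid entry. The following is a toy sketch of that walk under those assumptions. It is not libkvm's code; the names `page_table_indices`, `walk`, `read_entry`, `PG_V` and `FRAME_MASK` are chosen for illustration, and large pages are ignored.

```python
PAGE_SHIFT = 12                      # 4 KB pages
INDEX_BITS = 9                       # 512 entries per table level
INDEX_MASK = (1 << INDEX_BITS) - 1
PG_V = 0x1                           # "valid" bit of an x86 page-table entry
FRAME_MASK = 0x000FFFFFFFFFF000      # physical frame bits of an entry


def page_table_indices(va):
    """Return the (L4, L3, L2, L1) table indices of a virtual address."""
    return tuple(
        (va >> (PAGE_SHIFT + INDEX_BITS * (level - 1))) & INDEX_MASK
        for level in (4, 3, 2, 1)
    )


def walk(va, read_entry, root):
    """Translate va to a physical address, or raise at the failing level.

    read_entry(table_pa, index) is a hypothetical callback returning the
    64-bit entry stored at that slot (0 if nothing is mapped there).
    """
    table = root
    for level, index in zip((4, 3, 2, 1), page_table_indices(va)):
        entry = read_entry(table, index)
        if not entry & PG_V:
            # Illustrative wording only; a real tool's message may differ.
            raise LookupError(f"invalid level {level} entry")
        table = entry & FRAME_MASK   # next-level table, or the final frame
    return table | (va & ((1 << PAGE_SHIFT) - 1))
```

If the top-level table in the dump is missing or is read from the wrong offset, every lookup fails at level 4 or 3 before reaching any data, which would be consistent with each command in the gdb session above failing the same way.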
>How-To-Repeat:
Panic and then try to savecore.
>Fix:
Yes, please!
Let me know what other diagnostics to run to learn more about
this...
>Release-Note:
>Audit-Trail:
From: Thomas Klausner <wiz@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/47041: amd64 kernel core dumps are broken
Date: Fri, 5 Oct 2012 16:47:17 +0200
This looks different than my kernel core dump problem, see
http://mail-index.netbsd.org/current-users/2012/10/03/msg021192.html
Thomas
Responsible-Changed-From-To: kern-bug-people->chs
Responsible-Changed-By: chs@NetBSD.org
Responsible-Changed-When: Sun, 18 Nov 2012 00:36:34 +0000
Responsible-Changed-Why:
I've been working on this.
State-Changed-From-To: open->feedback
State-Changed-By: chs@NetBSD.org
State-Changed-When: Sun, 18 Nov 2012 00:36:34 +0000
State-Changed-Why:
I've made a number of fixes to amd64 kernel dumps in -current
and it's all working for me now. is it working for you too?
From: Taylor R Campbell <campbell+netbsd@mumble.net>
To: gnats-bugs@NetBSD.org
Cc: chs@NetBSD.org, kern-bug-people@netbsd.org,
netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/47041 (amd64 kernel core dumps are broken)
Date: Sat, 12 Jan 2013 07:24:15 +0000
> Date: Sun, 18 Nov 2012 00:36:37 +0000 (UTC)
> From: chs@NetBSD.org
>
> I've made a number of fixes to amd64 kernel dumps in -current
> and it's all working for me now. is it working for you too?
Well, I just hit a panic on that machine...and (although the stack
trace from ddb that I caught flying by on the serial console was
enough to debug the problem) it didn't even make a kernel crash dump
because
dump Skipping crash dump on recursive panic
panic: wddump: polled command has been queued
Since I'm about to be messing with that machine's kernel and rebooting
anyway soon, I'll try and elicit a core dump with a newer kernel and
when it's not under the heavy load of a pkgsrc bulk build.
From: Taylor R Campbell <campbell+netbsd@mumble.net>
To: gnats-bugs@NetBSD.org
Cc: chs@NetBSD.org, kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org,
gnats-admin@netbsd.org
Subject: Re: kern/47041 (amd64 kernel core dumps are broken)
Date: Sat, 12 Jan 2013 08:16:05 +0000
Similar issue again, triggered by `sysctl -w kern.panic_now=1' while
the machine was idle. My wild guess is that this smells faintly like
an ahcisata MP-safety issue.
Skipping crash dump on recursive panic
panic: pool_get(ataspl): free list modified: magic=deaddeed; page 0xfffffe8874831000; item addr 0xfffffe8874831f00
cpu18: Begin traceback...
printf_nolog() at netbsd:printf_nolog
pool_cache_cpu_init1() at netbsd:pool_cache_cpu_init1
ata_get_xfer() at netbsd:ata_get_xfer+0x2d
ahci_ata_bio() at netbsd:ahci_ata_bio+0x2f
wddump() at netbsd:wddump+0x167
raiddump() at netbsd:raiddump+0x227
dumpsys_seg() at netbsd:dumpsys_seg+0xbc
dump_seg_iter() at netbsd:dump_seg_iter+0xfb
dodumpsys() at netbsd:dodumpsys+0x267
dumpsys() at netbsd:dumpsys+0x1d
vpanic() at netbsd:vpanic+0x1dd
printf_nolog() at netbsd:printf_nolog
fill_lwp() at netbsd:fill_lwp
sysctl_dispatch() at netbsd:sysctl_dispatch+0xc6
sys___sysctl() at netbsd:sys___sysctl+0xeb
syscall() at netbsd:syscall+0x94
--- syscall (number 202) ---
7f7ff74e024a:
cpu18: End traceback...
rebooting...
From: Taylor R Campbell <campbell+netbsd@mumble.net>
To: gnats-bugs@NetBSD.org
Cc: chs@NetBSD.org, kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org,
gnats-admin@netbsd.org
Subject: Re: kern/47041 (amd64 kernel core dumps are broken)
Date: Sat, 12 Jan 2013 08:56:52 +0000
It's not clear to me whether this is just because the dump is bad or
because savecore is still broken, but when I tried to run savecore
after the last panic (unless I have mixed up panics in my memory,
which is entirely possible at this hour), it started off by saying
savecore: kvm_read: invalid translation (invalid level 3 PDE)
but proceeded to start writing a core dump to disk. In the morning
(it's a 32 GB dump being written to a sync ffs) I'll see whether gdb
finds this core dump digestible.
From: Chuck Silvers <chuq@chuq.com>
To: Taylor R Campbell <campbell+netbsd@mumble.net>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/47041 (amd64 kernel core dumps are broken)
Date: Sat, 12 Jan 2013 08:27:22 -0800
On Sat, Jan 12, 2013 at 08:16:05AM +0000, Taylor R Campbell wrote:
> Similar issue again, triggered by `sysctl -w kern.panic_now=1' while
> the machine was idle. My wild guess is that this smells faintly like
> an ahcisata MP-safety issue.
>
> Skipping crash dump on recursive panic
> panic: pool_get(ataspl): free list modified: magic=deaddeed; page 0xfffffe8874831000; item addr 0xfffffe8874831f00
>
> cpu18: Begin traceback...
> printf_nolog() at netbsd:printf_nolog
> pool_cache_cpu_init1() at netbsd:pool_cache_cpu_init1
> ata_get_xfer() at netbsd:ata_get_xfer+0x2d
> ahci_ata_bio() at netbsd:ahci_ata_bio+0x2f
> wddump() at netbsd:wddump+0x167
...
this ahcisata problem is reported separately in PR 41095.
-Chuck
From: Chuck Silvers <chuq@chuq.com>
To: Taylor R Campbell <campbell+netbsd@mumble.net>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/47041 (amd64 kernel core dumps are broken)
Date: Sat, 12 Jan 2013 08:33:16 -0800
On Sat, Jan 12, 2013 at 08:56:52AM +0000, Taylor R Campbell wrote:
> It's not clear to me whether this is just because the dump is bad or
> because savecore is still broken, but when I tried to run savecore
> after the last panic (unless I have mixed up panics in my memory,
> which is entirely possible at this hour), it started off by saying
>
> savecore: kvm_read: invalid translation (invalid level 3 PDE)
>
> but proceeded to start writing a core dump to disk. In the morning
> (it's a 32 GB dump being written to a sync ffs) I'll see whether gdb
> finds this core dump digestible.
not all kvm_read errors are fatal for savecore, so your recollection
could be correct.
enabling machdep.sparse_dump=1 will help with large-memory systems.
that should be working now as well.
-Chuck
State-Changed-From-To: feedback->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Tue, 12 Apr 2016 04:49:28 +0000
State-Changed-Why:
this PR was accidentally left in feedback for the past 3.5 years
From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/47041 CVS commit: [jdolecek-ncq] src/sys/dev
Date: Fri, 16 Jun 2017 20:40:49 +0000
Module Name: src
Committed By: jdolecek
Date: Fri Jun 16 20:40:49 UTC 2017
Modified Files:
src/sys/dev/ata [jdolecek-ncq]: ata.c ata_wdc.c atavar.h wd.c
src/sys/dev/ic [jdolecek-ncq]: ahcisata_core.c mvsata.c siisata.c wdc.c
src/sys/dev/scsipi [jdolecek-ncq]: atapi_wdc.c
Log Message:
adjust reset channel and dump paths
- channel reset now always kills active transfer, even on dump path, but
now doesn't touch the queued waiting transfers; also kill_xfer hook is
always called, so that HBA can free any private xfer resources and thus
the dump request has chance to work
- kill_xfer routines now always call ata_deactivate_xfer(); added KASSERT()s
to ata_free_xfer() to expect deactivated xfer
- when called during channel reset before dump, ata_kill_active() drops
any queued waiting transfers without processing
- do not (re)queue any transfers in wddone() when dumping
- kill AT_RST_NOCMD flag
This should also hopefully fix the 'polled command has been queued' panic
as reported in:
PR kern/11811 by John Hawkinson
PR kern/47041 by Taylor R Campbell
PR kern/51979 by Martin Husemann
dump tested working with piixide(4) and ahci(4). mvsata(4) dump times out,
but otherwise tested working, will be fixed separately. siisata(4) mechanically
changed and not tested.
To generate a diff of this commit:
cvs rdiff -u -r1.132.8.8 -r1.132.8.9 src/sys/dev/ata/ata.c
cvs rdiff -u -r1.105.6.3 -r1.105.6.4 src/sys/dev/ata/ata_wdc.c
cvs rdiff -u -r1.92.8.8 -r1.92.8.9 src/sys/dev/ata/atavar.h
cvs rdiff -u -r1.428.2.15 -r1.428.2.16 src/sys/dev/ata/wd.c
cvs rdiff -u -r1.57.6.12 -r1.57.6.13 src/sys/dev/ic/ahcisata_core.c
cvs rdiff -u -r1.35.6.10 -r1.35.6.11 src/sys/dev/ic/mvsata.c
cvs rdiff -u -r1.30.4.15 -r1.30.4.16 src/sys/dev/ic/siisata.c
cvs rdiff -u -r1.283.2.4 -r1.283.2.5 src/sys/dev/ic/wdc.c
cvs rdiff -u -r1.123.4.4 -r1.123.4.5 src/sys/dev/scsipi/atapi_wdc.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Wed, 26 Jul 2017 17:21:45 +0000
State-Changed-Why:
not sure how to propose testing this, but we should probably do something
rather than just close it.
State-Changed-From-To: feedback->open
State-Changed-By: maya@NetBSD.org
State-Changed-When: Wed, 26 Jul 2017 17:35:15 +0000
State-Changed-Why:
This still occurs, and the changes to fix it are in a branch not yet merged. Testing can be done by dropping to ddb during heavy disk activity and typing 'sync'. Normal coredumps work.
Responsible-Changed-From-To: chs->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Sat, 07 Oct 2017 17:48:16 +0000
Responsible-Changed-Why:
Committed possible fix.
State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Sat, 07 Oct 2017 17:48:16 +0000
State-Changed-Why:
Possibly fixed on -current with NCQ merge. Can you retest?
From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/47041 CVS commit: [jdolecek-ncqfixes] src/sys/dev/ata
Date: Sat, 6 Oct 2018 20:27:36 +0000
Module Name: src
Committed By: jdolecek
Date: Sat Oct 6 20:27:36 UTC 2018
Modified Files:
src/sys/dev/ata [jdolecek-ncqfixes]: ata.c atavar.h wd.c
Log Message:
remove AT_RST_EMERG, do the queue reset explicitly in wd(4)
this should explicitly fix PR kern/47041 with sync during heavy
disk activity, even though it was actually already implicitly fixed by calling
ata_thread_run() for drive reset in the previous commit, since that
function already called ata_queue_reset()
drop now unused ch_reset_flags and drive_reset_flags
To generate a diff of this commit:
cvs rdiff -u -r1.141.6.13 -r1.141.6.14 src/sys/dev/ata/ata.c
cvs rdiff -u -r1.99.2.8 -r1.99.2.9 src/sys/dev/ata/atavar.h
cvs rdiff -u -r1.441.2.10 -r1.441.2.11 src/sys/dev/ata/wd.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/47041 CVS commit: [jdolecek-ncqfixes] src/sys/dev/ata
Date: Sat, 6 Oct 2018 21:19:55 +0000
Module Name: src
Committed By: jdolecek
Date: Sat Oct 6 21:19:55 UTC 2018
Modified Files:
src/sys/dev/ata [jdolecek-ncqfixes]: ata.c ata_subr.c atavar.h wd.c
Log Message:
actually, just make dump use the same queue skip as recovery, and remove the
no longer necessary ata_queue_reset() call from wd(4)
also for PR kern/47041
To generate a diff of this commit:
cvs rdiff -u -r1.141.6.14 -r1.141.6.15 src/sys/dev/ata/ata.c
cvs rdiff -u -r1.6.2.6 -r1.6.2.7 src/sys/dev/ata/ata_subr.c
cvs rdiff -u -r1.99.2.9 -r1.99.2.10 src/sys/dev/ata/atavar.h
cvs rdiff -u -r1.441.2.11 -r1.441.2.12 src/sys/dev/ata/wd.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: feedback->closeD
State-Changed-By: maya@NetBSD.org
State-Changed-When: Thu, 07 Jan 2021 15:24:29 +0000
State-Changed-Why:
Feedback timeout. Assuming fixed.
State-Changed-From-To: closeD->closed
State-Changed-By: maya@NetBSD.org
State-Changed-When: Thu, 07 Jan 2021 15:24:39 +0000
State-Changed-Why:
>Unformatted:
Copyright © 1994-2020
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.