NetBSD Problem Report #47041

From campbell@mumble.net  Fri Oct  5 14:06:49 2012
Return-Path: <campbell@mumble.net>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	by www.NetBSD.org (Postfix) with ESMTP id 28E5963B8EB
	for <gnats-bugs@gnats.NetBSD.org>; Fri,  5 Oct 2012 14:06:49 +0000 (UTC)
Message-Id: <20121005140622.2DC18604AF@jupiter.mumble.net>
Date: Fri,  5 Oct 2012 14:06:22 +0000 (UTC)
From: Taylor R Campbell <campbell+netbsd@mumble.net>
Reply-To: Taylor R Campbell <campbell+netbsd@mumble.net>
To: gnats-bugs@gnats.NetBSD.org
Subject: amd64 kernel core dumps are broken
X-Send-Pr-Version: 3.95

>Number:         47041
>Category:       kern
>Synopsis:       amd64 kernel core dumps are broken
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    jdolecek
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Oct 05 14:10:00 +0000 2012
>Closed-Date:    Thu Jan 07 15:24:29 +0000 2021
>Last-Modified:  Thu Jan 07 15:24:39 +0000 2021
>Originator:     Taylor R Campbell <campbell+netbsd@mumble.net>
>Release:        NetBSD 6.99.11
>Organization:
>Environment:
Architecture: amd64
Machine: amd64
>Description:

	After a panic in namei the other day, my system wrote a kernel
	core dump which savecore had trouble reading:

Oct  3 14:40:58 manticore savecore: reboot after panic: leaf `current.ro' should be empty
Oct  3 14:40:58 manticore savecore: system went down at Tue Oct  2 18:17:25 2012 
Oct  3 14:40:58 manticore savecore: writing core to ./netbsd.0.core
Oct  3 18:06:41 manticore savecore: writing kernel to ./netbsd.0
Oct  3 18:06:41 manticore savecore: kvm_read: invalid translation (invalid level 4 PDE)
Oct  3 18:06:41 manticore savecore: (null): Bad address

	The `(null): Bad address' error I have been seeing for years on
	i386, but the `invalid level 3/4 PDE' messages I have not seen
	before.  Also, the three and a half hour delay in writing the
	core dump is...odd.  Most of the time, savecore -- and anything
	else trying to stat netbsd.0.core, such as `ls -l' -- was stuck
	waiting for a vnode lock, presumably of netbsd.0.core, but I
	don't know who held it (crash(8) is broken when it comes to
	vnodes), and nothing seemed to be spinning.

	Attempting to load the core in gdb didn't work either:

(gdb) target kvm netbsd.0.core
invalid translation (invalid level 3 PDE)
(gdb) bt
invalid translation (invalid level 3 PDE)
invalid translation (invalid level 3 PDE)
invalid translation (invalid level 3 PDE)
(gdb) bt
invalid translation (invalid level 3 PDE)
invalid translation (invalid level 3 PDE)
(gdb) bt
invalid translation (invalid level 3 PDE)
invalid translation (invalid level 3 PDE)
(gdb) 

>How-To-Repeat:

	Panic and then try to savecore.

>Fix:

	Yes, please!

	Let me know what other diagnostics to run to learn more about
	this...

>Release-Note:

>Audit-Trail:
From: Thomas Klausner <wiz@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/47041: amd64 kernel core dumps are broken
Date: Fri, 5 Oct 2012 16:47:17 +0200

 This looks different than my kernel core dump problem, see
 http://mail-index.netbsd.org/current-users/2012/10/03/msg021192.html
  Thomas

Responsible-Changed-From-To: kern-bug-people->chs
Responsible-Changed-By: chs@NetBSD.org
Responsible-Changed-When: Sun, 18 Nov 2012 00:36:34 +0000
Responsible-Changed-Why:
I've been working on this.


State-Changed-From-To: open->feedback
State-Changed-By: chs@NetBSD.org
State-Changed-When: Sun, 18 Nov 2012 00:36:34 +0000
State-Changed-Why:
I've made a number of fixes to amd64 kernel dumps in -current
and it's all working for me now.  is it working for you too?


From: Taylor R Campbell <campbell+netbsd@mumble.net>
To: gnats-bugs@NetBSD.org
Cc: chs@NetBSD.org, kern-bug-people@netbsd.org,
	netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/47041 (amd64 kernel core dumps are broken)
Date: Sat, 12 Jan 2013 07:24:15 +0000

    Date: Sun, 18 Nov 2012 00:36:37 +0000 (UTC)
    From: chs@NetBSD.org

    I've made a number of fixes to amd64 kernel dumps in -current
    and it's all working for me now.  is it working for you too?

 Well, I just hit a panic on that machine...and (although the stack
 trace from ddb that I caught flying by on the serial console was
 enough to debug the problem) it didn't even make a kernel crash dump
 because

 dump Skipping crash dump on recursive panic
 panic: wddump: polled command has been queued

 Since I'm about to be messing with that machine's kernel and rebooting
 anyway soon, I'll try and elicit a core dump with a newer kernel and
 when it's not under the heavy load of a pkgsrc bulk build.

From: Taylor R Campbell <campbell+netbsd@mumble.net>
To: gnats-bugs@NetBSD.org
Cc: chs@NetBSD.org, kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org,
	gnats-admin@netbsd.org
Subject: Re: kern/47041 (amd64 kernel core dumps are broken)
Date: Sat, 12 Jan 2013 08:16:05 +0000

 Similar issue again, triggered by `sysctl -w kern.panic_now=1' while
 the machine was idle.  My wild guess is that this smells faintly like
 an ahcisata MP-safety issue.

 Skipping crash dump on recursive panic
 panic: pool_get(ataspl): free list modified: magic=deaddeed; page 0xfffffe8874831000; item addr 0xfffffe8874831f00

 cpu18: Begin traceback...
 printf_nolog() at netbsd:printf_nolog
 pool_cache_cpu_init1() at netbsd:pool_cache_cpu_init1
 ata_get_xfer() at netbsd:ata_get_xfer+0x2d
 ahci_ata_bio() at netbsd:ahci_ata_bio+0x2f
 wddump() at netbsd:wddump+0x167
 raiddump() at netbsd:raiddump+0x227
 dumpsys_seg() at netbsd:dumpsys_seg+0xbc
 dump_seg_iter() at netbsd:dump_seg_iter+0xfb
 dodumpsys() at netbsd:dodumpsys+0x267
 dumpsys() at netbsd:dumpsys+0x1d
 vpanic() at netbsd:vpanic+0x1dd
 printf_nolog() at netbsd:printf_nolog
 fill_lwp() at netbsd:fill_lwp
 sysctl_dispatch() at netbsd:sysctl_dispatch+0xc6
 sys___sysctl() at netbsd:sys___sysctl+0xeb
 syscall() at netbsd:syscall+0x94
 --- syscall (number 202) ---
 7f7ff74e024a:
 cpu18: End traceback...
 rebooting...

From: Taylor R Campbell <campbell+netbsd@mumble.net>
To: gnats-bugs@NetBSD.org
Cc: chs@NetBSD.org, kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org,
	gnats-admin@netbsd.org
Subject: Re: kern/47041 (amd64 kernel core dumps are broken)
Date: Sat, 12 Jan 2013 08:56:52 +0000

 It's not clear to me whether this is just because the dump is bad or
 because savecore is still broken, but when I tried to run savecore
 after the last panic (unless I have mixed up panics in my memory,
 which is entirely possible at this hour), it started off by saying

 savecore: kvm_read: invalid translation (invalid level 3 PDE)

 but proceeded to start writing a core dump to disk.  In the morning
 (it's a 32 GB dump being written to a sync ffs) I'll see whether gdb
 finds this core dump digestible.

From: Chuck Silvers <chuq@chuq.com>
To: Taylor R Campbell <campbell+netbsd@mumble.net>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
	netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/47041 (amd64 kernel core dumps are broken)
Date: Sat, 12 Jan 2013 08:27:22 -0800

 On Sat, Jan 12, 2013 at 08:16:05AM +0000, Taylor R Campbell wrote:
 > Similar issue again, triggered by `sysctl -w kern.panic_now=1' while
 > the machine was idle.  My wild guess is that this smells faintly like
 > an ahcisata MP-safety issue.
 > 
 > Skipping crash dump on recursive panic
 > panic: pool_get(ataspl): free list modified: magic=deaddeed; page 0xfffffe8874831000; item addr 0xfffffe8874831f00
 > 
 > cpu18: Begin traceback...
 > printf_nolog() at netbsd:printf_nolog
 > pool_cache_cpu_init1() at netbsd:pool_cache_cpu_init1
 > ata_get_xfer() at netbsd:ata_get_xfer+0x2d
 > ahci_ata_bio() at netbsd:ahci_ata_bio+0x2f
 > wddump() at netbsd:wddump+0x167
 ...


 this ahcisata problem is reported separately in PR 41095.

 -Chuck

From: Chuck Silvers <chuq@chuq.com>
To: Taylor R Campbell <campbell+netbsd@mumble.net>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
	netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/47041 (amd64 kernel core dumps are broken)
Date: Sat, 12 Jan 2013 08:33:16 -0800

 On Sat, Jan 12, 2013 at 08:56:52AM +0000, Taylor R Campbell wrote:
 > It's not clear to me whether this is just because the dump is bad or
 > because savecore is still broken, but when I tried to run savecore
 > after the last panic (unless I have mixed up panics in my memory,
 > which is entirely possible at this hour), it started off by saying
 > 
 > savecore: kvm_read: invalid translation (invalid level 3 PDE)
 > 
 > but proceeded to start writing a core dump to disk.  In the morning
 > (it's a 32 GB dump being written to a sync ffs) I'll see whether gdb
 > finds this core dump digestible.

 not all kvm_read errors are fatal for savecore, so your recollection
 could be correct.

 enabling machdep.sparse_dump=1 will help with large-memory systems.
 that should be working now as well.

 -Chuck

State-Changed-From-To: feedback->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Tue, 12 Apr 2016 04:49:28 +0000
State-Changed-Why:
this PR was accidentally left in feedback for the past 3.5 years


From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/47041 CVS commit: [jdolecek-ncq] src/sys/dev
Date: Fri, 16 Jun 2017 20:40:49 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Fri Jun 16 20:40:49 UTC 2017

 Modified Files:
 	src/sys/dev/ata [jdolecek-ncq]: ata.c ata_wdc.c atavar.h wd.c
 	src/sys/dev/ic [jdolecek-ncq]: ahcisata_core.c mvsata.c siisata.c wdc.c
 	src/sys/dev/scsipi [jdolecek-ncq]: atapi_wdc.c

 Log Message:
 adjust reset channel and dump paths
 - channel reset now always kills active transfer, even on dump path, but
   now doesn't touch the queued waiting transfers; also kill_xfer hook is
   always called, so that HBA can free any private xfer resources and thus
   the dump request has a chance to work
 - kill_xfer routines now always call ata_deactivate_xfer(); added KASSERT()s
   to ata_free_xfer() to expect deactivated xfer
 - when called during channel reset before dump, ata_kill_active() drops
   any queued waiting transfers without processing
 - do not (re)queue any transfers in wddone() when dumping
 - kill AT_RST_NOCMD flag

 This should also hopefully fix the 'polled command has been queued' panic
 as reported in:
 PR kern/11811 by John Hawkinson
 PR kern/47041 by Taylor R Campbell
 PR kern/51979 by Martin Husemann

 dump tested working with piixide(4) and ahci(4). mvsata(4) dump times out,
 but otherwise tested working, will be fixed separately. siisata(4) mechanically
 changed and not tested.


 To generate a diff of this commit:
 cvs rdiff -u -r1.132.8.8 -r1.132.8.9 src/sys/dev/ata/ata.c
 cvs rdiff -u -r1.105.6.3 -r1.105.6.4 src/sys/dev/ata/ata_wdc.c
 cvs rdiff -u -r1.92.8.8 -r1.92.8.9 src/sys/dev/ata/atavar.h
 cvs rdiff -u -r1.428.2.15 -r1.428.2.16 src/sys/dev/ata/wd.c
 cvs rdiff -u -r1.57.6.12 -r1.57.6.13 src/sys/dev/ic/ahcisata_core.c
 cvs rdiff -u -r1.35.6.10 -r1.35.6.11 src/sys/dev/ic/mvsata.c
 cvs rdiff -u -r1.30.4.15 -r1.30.4.16 src/sys/dev/ic/siisata.c
 cvs rdiff -u -r1.283.2.4 -r1.283.2.5 src/sys/dev/ic/wdc.c
 cvs rdiff -u -r1.123.4.4 -r1.123.4.5 src/sys/dev/scsipi/atapi_wdc.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Wed, 26 Jul 2017 17:21:45 +0000
State-Changed-Why:
not sure how to propose testing this, but we should probably do something
rather than just close it.


State-Changed-From-To: feedback->open
State-Changed-By: maya@NetBSD.org
State-Changed-When: Wed, 26 Jul 2017 17:35:15 +0000
State-Changed-Why:
This still occurs, and the changes to fix it are in a branch not yet merged.
Testing can be done by dropping to ddb during heavy disk activity and typing
'sync'. Normal coredumps work.


Responsible-Changed-From-To: chs->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Sat, 07 Oct 2017 17:48:16 +0000
Responsible-Changed-Why:
Committed possible fix.


State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Sat, 07 Oct 2017 17:48:16 +0000
State-Changed-Why:
Possibly fixed on -current with NCQ merge. Can you retest?


From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/47041 CVS commit: [jdolecek-ncqfixes] src/sys/dev/ata
Date: Sat, 6 Oct 2018 20:27:36 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Sat Oct  6 20:27:36 UTC 2018

 Modified Files:
 	src/sys/dev/ata [jdolecek-ncqfixes]: ata.c atavar.h wd.c

 Log Message:
 remove AT_RST_EMERG, do the queue reset explicitly in wd(4)

 this should explicitly fix PR kern/47041 with sync during heavy
 disk activity, even though it was actually already implicitly fixed by calling
 ata_thread_run() for drive reset in previous commit already, since the
 function already called ata_queue_reset()

 drop now unused ch_reset_flags and drive_reset_flags


 To generate a diff of this commit:
 cvs rdiff -u -r1.141.6.13 -r1.141.6.14 src/sys/dev/ata/ata.c
 cvs rdiff -u -r1.99.2.8 -r1.99.2.9 src/sys/dev/ata/atavar.h
 cvs rdiff -u -r1.441.2.10 -r1.441.2.11 src/sys/dev/ata/wd.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/47041 CVS commit: [jdolecek-ncqfixes] src/sys/dev/ata
Date: Sat, 6 Oct 2018 21:19:55 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Sat Oct  6 21:19:55 UTC 2018

 Modified Files:
 	src/sys/dev/ata [jdolecek-ncqfixes]: ata.c ata_subr.c atavar.h wd.c

 Log Message:
 actually, just make dump use the same queue skip as recovery, and remove the
 no longer necessary ata_queue_reset() call from wd(4)

 also for PR kern/47041


 To generate a diff of this commit:
 cvs rdiff -u -r1.141.6.14 -r1.141.6.15 src/sys/dev/ata/ata.c
 cvs rdiff -u -r1.6.2.6 -r1.6.2.7 src/sys/dev/ata/ata_subr.c
 cvs rdiff -u -r1.99.2.9 -r1.99.2.10 src/sys/dev/ata/atavar.h
 cvs rdiff -u -r1.441.2.11 -r1.441.2.12 src/sys/dev/ata/wd.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: feedback->closeD
State-Changed-By: maya@NetBSD.org
State-Changed-When: Thu, 07 Jan 2021 15:24:29 +0000
State-Changed-Why:
Feedback timeout. Assuming fixed.


State-Changed-From-To: closeD->closed
State-Changed-By: maya@NetBSD.org
State-Changed-When: Thu, 07 Jan 2021 15:24:39 +0000
State-Changed-Why:


>Unformatted:
