NetBSD Problem Report #43274
From mrg@eterna.com.au Fri May 7 05:20:13 2010
Return-Path: <mrg@eterna.com.au>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by www.NetBSD.org (Postfix) with ESMTP id 2C22C63B8FE
for <gnats-bugs@gnats.NetBSD.org>; Fri, 7 May 2010 05:20:13 +0000 (UTC)
Message-Id: <20100507052009.8D7C437560@splode.eterna.com.au>
Date: Fri, 7 May 2010 15:20:09 +1000 (EST)
From: mrg@eterna.com.au
Reply-To: mrg@eterna.com.au
To: gnats-bugs@gnats.NetBSD.org
Subject: re(4) crash on ultra10 - uncorrectable DMA error
X-Send-Pr-Version: 3.95
>Number: 43274
>Category: kern
>Synopsis: re(4) crash on ultra10 - uncorrectable DMA error
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: analyzed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri May 07 05:25:00 +0000 2010
>Closed-Date:
>Last-Modified: Sun Feb 26 06:26:13 +0000 2012
>Originator: matthew green
>Release: NetBSD 5.99.24
>Organization:
people's front against (bozotic) www (softwar foundation)
>Environment:
System: NetBSD main-protagonist.eterna23.net 5.99.24 NetBSD 5.99.24 (_main_) #39: Wed Mar 17 09:09:42 PDT 2010 mrg@space-bird.eterna23.net:/var/obj/sparc64/usr/src/sys/arch/sparc64/compile/_main_ sparc64
Architecture: sparc64
Machine: sparc64
card looks like:
re0 at pci2 dev 1 function 0: RealTek 8169/8110 Gigabit Ethernet (rev. 0x10)
re0: interrupting at ivec 10
re0: Ethernet address 00:0f:b5:42:7c:8f
re0: using 512 tx descriptors
rgephy0 at re0 phy 7: RTL8169S/8110S/8211 1000BASE-T media interface, rev. 0
rgephy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
>Description:
ultra10 crashed earlier today with this on the console:
login: psycho0: uncorrectable DMA error AFAR 11b8450 AFSR 0x410000ff40800000<BLK,P_DTE,P_DRD>
psycho0: IOVA c0114000 IOTTE 3fc84012
Stopped in pid 0.3 (system) at netbsd:cpu_Debugger+0x4: nop
db{0}> bt
sparc_interrupt(ffffffffffffffe0, 20, 1000000, 6, 4, 3aa6840) at netbsd:sparc_interrupt+0x1e8
_bus_dmamap_unload(1819140, 2f36000, 0, 5ea, 8, 7fffffffffffffff) at netbsd:_bus_dmamap_unload+0x74
iommu_dvmamap_unload(2df5880, 2f36000, 6000, 5ea, 8, 0) at netbsd:iommu_dvmamap_unload+0x28
re_txeof(c57a000, c, c17364c, 3fc84000, 0, 5ea) at netbsd:re_txeof+0x108
re_intr(c57a000, 42d2e70, 5ea, 0, 5, 401) at netbsd:re_intr+0x134
intr_biglock_wrapper(2df4a00, 0, e0017ed0, 10, 114b0e0, c173668) at netbsd:intr_biglock_wrapper+0x10
sparc_interrupt(0, 42d2e70, 1f4, 0, 2, 0) at netbsd:sparc_interrupt+0x1e8
ifq_enqueue(c57a008, 0, 2, 2, c1739a2, 1000000) at netbsd:ifq_enqueue+0xa8
ether_output(0, 42d2e70, 3c19a20, 3a97650, 2810, 3aa6840) at netbsd:ether_output+0x6bc
ip_output(14, 0, 3c19a20, c57a008, 3c08a00, 4326810) at netbsd:ip_output+0xfa4
ip_forward(42d86a0, 1, c4dac08, 0, c4dac08, ac101837) at netbsd:ip_forward+0x158
ip_input(5dc, 0, 0, c050e00, 114b0e0, c053b70) at netbsd:ip_input+0xb84
ipintr(1879c00, 0, c053740, 6, 34, de) at netbsd:ipintr+0x34
softint_thread(c02e230, c053740, 0, c050e00, 1296780, c052bf0) at netbsd:softint_thread+0x64
lwp_trampoline(f0067458, fffa9cf8, 111800, 110728, fffa9df8, 1) at netbsd:lwp_trampoline+0x8
db{0}> c
unfortunately it did not dump core or do anything after this 'c',
so the only data i really have is this stack trace.
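For reading the `IOTTE 3fc84012` value in the console output above, here is a minimal sketch of a decoder. The `IOTTE_V` name appears in the iommu.c code quoted later in this PR; the bit positions used here (valid bit 63, cacheable 0x10, writable 0x02) and the `SKETCH_PA_MASK` field mask are assumptions about the sun4u IOMMU TTE layout, not values copied from the NetBSD headers. The function names are hypothetical.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed sun4u IOMMU TTE bits (not copied from iommureg.h). */
#define IOTTE_V        UINT64_C(0x8000000000000000) /* entry is valid */
#define IOTTE_C        UINT64_C(0x0000000000000010) /* cacheable */
#define IOTTE_W        UINT64_C(0x0000000000000002) /* writable */
/* Hypothetical mask for the physical page address field (8 KB pages). */
#define SKETCH_PA_MASK UINT64_C(0x000001ffffffe000)

/* True when the valid bit is set in the TTE. */
bool iotte_valid(uint64_t tte)
{
    return (tte & IOTTE_V) != 0;
}

/* Physical page address held in the TTE, under the assumed mask. */
uint64_t iotte_pa(uint64_t tte)
{
    return tte & SKETCH_PA_MASK;
}

/* Format a one-line summary of the TTE into buf. */
int describe_iotte(uint64_t tte, char *buf, size_t len)
{
    return snprintf(buf, len,
        "pa=0x%" PRIx64 " valid=%d cacheable=%d writable=%d",
        iotte_pa(tte), iotte_valid(tte),
        (tte & IOTTE_C) != 0, (tte & IOTTE_W) != 0);
}
```

Under these assumptions, calling `describe_iotte(0x3fc84012, buf, sizeof buf)` reports physical page 0x3fc84000 with the valid bit clear.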
>How-To-Repeat:
unknown.
>Fix:
>Release-Note:
>Audit-Trail:
From: Takeshi Nakayama <tn@catvmics.ne.jp>
To: gnats-bugs@NetBSD.org, mrg@eterna.com.au
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: kern/43274: re(4) crash on ultra10 - uncorrectable DMA error
Date: Sat, 08 May 2010 06:04:40 +0900 (JST)
>>> mrg@eterna.com.au wrote
> ultra10 crashed earlier today with this on the console:
>
> login: psycho0: uncorrectable DMA error AFAR 11b8450 AFSR 0x410000ff40800000<BLK,P_DTE,P_DRD>
> psycho0: IOVA c0114000 IOTTE 3fc84012
> Stopped in pid 0.3 (system) at netbsd:cpu_Debugger+0x4: nop
> db{0}> bt
> sparc_interrupt(ffffffffffffffe0, 20, 1000000, 6, 4, 3aa6840) at netbsd:sparc_interrupt+0x1e8
> _bus_dmamap_unload(1819140, 2f36000, 0, 5ea, 8, 7fffffffffffffff) at netbsd:_bus_dmamap_unload+0x74
> iommu_dvmamap_unload(2df5880, 2f36000, 6000, 5ea, 8, 0) at netbsd:iommu_dvmamap_unload+0x28
> re_txeof(c57a000, c, c17364c, 3fc84000, 0, 5ea) at netbsd:re_txeof+0x108
> re_intr(c57a000, 42d2e70, 5ea, 0, 5, 401) at netbsd:re_intr+0x134
> intr_biglock_wrapper(2df4a00, 0, e0017ed0, 10, 114b0e0, c173668) at netbsd:intr_biglock_wrapper+0x10
> sparc_interrupt(0, 42d2e70, 1f4, 0, 2, 0) at netbsd:sparc_interrupt+0x1e8
> ifq_enqueue(c57a008, 0, 2, 2, c1739a2, 1000000) at netbsd:ifq_enqueue+0xa8
> ether_output(0, 42d2e70, 3c19a20, 3a97650, 2810, 3aa6840) at netbsd:ether_output+0x6bc
> ip_output(14, 0, 3c19a20, c57a008, 3c08a00, 4326810) at netbsd:ip_output+0xfa4
> ip_forward(42d86a0, 1, c4dac08, 0, c4dac08, ac101837) at netbsd:ip_forward+0x158
> ip_input(5dc, 0, 0, c050e00, 114b0e0, c053b70) at netbsd:ip_input+0xb84
> ipintr(1879c00, 0, c053740, 6, 34, de) at netbsd:ipintr+0x34
> softint_thread(c02e230, c053740, 0, c050e00, 1296780, c052bf0) at netbsd:softint_thread+0x64
> lwp_trampoline(f0067458, fffa9cf8, 111800, 110728, fffa9df8, 1) at netbsd:lwp_trampoline+0x8
> db{0}> c
I see a similar problem on tlp(4) on Netra X1. So please try this
workaround.
Index: sys/arch/sparc64/dev/iommu.c
===================================================================
RCS file: /cvsroot/src/sys/arch/sparc64/dev/iommu.c,v
retrieving revision 1.98
diff -u -d -r1.98 iommu.c
--- sys/arch/sparc64/dev/iommu.c 11 Mar 2010 03:54:56 -0000 1.98
+++ sys/arch/sparc64/dev/iommu.c 7 May 2010 14:07:08 -0000
@@ -358,8 +358,10 @@
* eliminating the next line, but the page is mapped
* until the next iommu_enter call.
*/
+#if 0 /* XXX */
is->is_tsb[IOTSBSLOT(va,is->is_tsbsize)] &= ~IOTTE_V;
membar_storestore();
+#endif
bus_space_write_8(is->is_bustag, is->is_iommu,
IOMMUREG(iommu_flush), va);
va += PAGE_SIZE;
As I noted in a comment in iommu.c next to this workaround, it seems
that unmapping an IOMMU page while a device is still using it causes an
uncorrectable DMA error.
I could not find any solution to the problem other than this workaround.
-- Takeshi Nakayama
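The effect of the `#if 0` in the workaround above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct sketch_iommu`, `sketch_remove()` and `sketch_device_access_ok()` are hypothetical names, the 8 KB page size and the bit-63 position of `IOTTE_V` are assumed, and the flush counter merely stands in for the `bus_space_write_8()` to the flush register. It does not reproduce the real `iommu_remove()`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IOTTE_V            UINT64_C(0x8000000000000000) /* assumed valid bit */
#define SKETCH_PAGE_SIZE   8192u                        /* assumed page size */
#define SKETCH_TSB_ENTRIES 8

/* Hypothetical stand-in for the driver's IOMMU state. */
struct sketch_iommu {
    uint64_t dvma_base;                 /* first DVMA address covered */
    uint64_t tsb[SKETCH_TSB_ENTRIES];   /* translation entries */
    unsigned flushes;                   /* stand-in for flush register writes */
};

/* Index of the TSB entry that translates va. */
size_t sketch_slot(const struct sketch_iommu *is, uint64_t va)
{
    return (size_t)((va - is->dvma_base) / SKETCH_PAGE_SIZE);
}

/*
 * Model of the unmap loop: with invalidate == true it clears the valid
 * bit before flushing (the original code); with false it only flushes
 * (the workaround), leaving the page mapped until the slot is reused.
 */
void sketch_remove(struct sketch_iommu *is, uint64_t va, size_t len,
    bool invalidate)
{
    while (len > 0) {
        size_t slot = sketch_slot(is, va);
        if (slot >= SKETCH_TSB_ENTRIES)
            break;
        if (invalidate)
            is->tsb[slot] &= ~IOTTE_V;
        is->flushes++;
        va += SKETCH_PAGE_SIZE;
        len = len > SKETCH_PAGE_SIZE ? len - SKETCH_PAGE_SIZE : 0;
    }
}

/* A late device access succeeds only if the entry is still valid. */
bool sketch_device_access_ok(const struct sketch_iommu *is, uint64_t va)
{
    size_t slot = sketch_slot(is, va);
    return slot < SKETCH_TSB_ENTRIES && (is->tsb[slot] & IOTTE_V) != 0;
}
```

In this model, a device access that arrives after `sketch_remove(..., true)` finds an invalid entry, which corresponds to the reported DMA error; after `sketch_remove(..., false)` the same access still translates, at the cost of the stale mapping described in the iommu.c comment.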
From: matthew green <mrg@eterna.com.au>
To: Takeshi Nakayama <tn@catvmics.ne.jp>
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org, gnats-bugs@NetBSD.org
Subject: re: kern/43274: re(4) crash on ultra10 - uncorrectable DMA error
Date: Sat, 08 May 2010 17:46:20 +1000
> > ultra10 crashed earlier today with this on the console:
> >
> > login: psycho0: uncorrectable DMA error AFAR 11b8450 AFSR 0x410000ff40800000<BLK,P_DTE,P_DRD>
> > psycho0: IOVA c0114000 IOTTE 3fc84012
[ .. ]
> I see a similar problem on tlp(4) on Netra X1. So please try this
> workaround.
> Index: sys/arch/sparc64/dev/iommu.c
> ===================================================================
> RCS file: /cvsroot/src/sys/arch/sparc64/dev/iommu.c,v
> retrieving revision 1.98
> diff -u -d -r1.98 iommu.c
> --- sys/arch/sparc64/dev/iommu.c 11 Mar 2010 03:54:56 -0000 1.98
> +++ sys/arch/sparc64/dev/iommu.c 7 May 2010 14:07:08 -0000
> @@ -358,8 +358,10 @@
> * eliminating the next line, but the page is mapped
> * until the next iommu_enter call.
> */
> +#if 0 /* XXX */
> is->is_tsb[IOTSBSLOT(va,is->is_tsbsize)] &= ~IOTTE_V;
> membar_storestore();
> +#endif
> bus_space_write_8(is->is_bustag, is->is_iommu,
> IOMMUREG(iommu_flush), va);
> va += PAGE_SIZE;
> As I noted in a comment in iommu.c next to this workaround, it seems
> that unmapping an IOMMU page while a device is still using it causes an
> uncorrectable DMA error.
> I could not find any solution to the problem other than this workaround.
i noticed that open solaris never removes the valid bit from the
iotte's. i think we should commit the #if 0 or just remove that
code entirely...
.mrg.
From: matthew green <mrg@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/43274 CVS commit: src/sys/arch/sparc64/dev
Date: Thu, 17 Jun 2010 06:48:46 +0000
Module Name: src
Committed By: mrg
Date: Thu Jun 17 06:48:46 UTC 2010
Modified Files:
src/sys/arch/sparc64/dev: iommu.c
Log Message:
in iommu_remove() don't invalidate the IOMMU mapping. for reasons not
yet determined, some PCI devices (at least fxp(4) and re(4)) sometimes
appear to perform DMA operations while this is happening, and we get
uncorrectable DMA errors. ideally, this "shouldn't happen", but none
of the investigation so far has revealed the problem, and my source
investigation of both opensolaris and linux shows that they perform
the invalidation when unmapping.
"handles" PR#43274 as well as other issues...
XXX: candidate for netbsd-5
To generate a diff of this commit:
cvs rdiff -u -r1.98 -r1.99 src/sys/arch/sparc64/dev/iommu.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->analyzed
State-Changed-By: mrg@NetBSD.org
State-Changed-When: Thu, 17 Jun 2010 06:53:13 +0000
State-Changed-Why:
ok, so we know that this problem happens in particular cases. we're not
sure why it fails yet, but we do have a committed workaround. uncommit the
workaround to truly solve this problem, someone, please.. :-)
>Unformatted:
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.