NetBSD Problem Report #43274

From mrg@eterna.com.au  Fri May  7 05:20:13 2010
Return-Path: <mrg@eterna.com.au>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by www.NetBSD.org (Postfix) with ESMTP id 2C22C63B8FE
	for <gnats-bugs@gnats.NetBSD.org>; Fri,  7 May 2010 05:20:13 +0000 (UTC)
Message-Id: <20100507052009.8D7C437560@splode.eterna.com.au>
Date: Fri,  7 May 2010 15:20:09 +1000 (EST)
From: mrg@eterna.com.au
Reply-To: mrg@eterna.com.au
To: gnats-bugs@gnats.NetBSD.org
Subject: re(4) crash on ultra10 - uncorrectable DMA error
X-Send-Pr-Version: 3.95

>Number:         43274
>Category:       kern
>Synopsis:       re(4) crash on ultra10 - uncorrectable DMA error
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          analyzed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri May 07 05:25:00 +0000 2010
>Closed-Date:    
>Last-Modified:  Sun Feb 26 06:26:13 +0000 2012
>Originator:     matthew green
>Release:        NetBSD 5.99.24
>Organization:
people's front against (bozotic) www (softwar foundation)
>Environment:
System: NetBSD main-protagonist.eterna23.net 5.99.24 NetBSD 5.99.24 (_main_) #39: Wed Mar 17 09:09:42 PDT 2010  mrg@space-bird.eterna23.net:/var/obj/sparc64/usr/src/sys/arch/sparc64/compile/_main_ sparc64
Architecture: sparc64
Machine: sparc64

	card looks like:

	re0 at pci2 dev 1 function 0: RealTek 8169/8110 Gigabit Ethernet (rev. 0x10)
	re0: interrupting at ivec 10
	re0: Ethernet address 00:0f:b5:42:7c:8f
	re0: using 512 tx descriptors
	rgephy0 at re0 phy 7: RTL8169S/8110S/8211 1000BASE-T media interface, rev. 0
	rgephy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto

>Description:

	ultra10 crashed earlier today with this on the console:

	login: psycho0: uncorrectable DMA error AFAR 11b8450 AFSR 0x410000ff40800000<BLK,P_DTE,P_DRD>
	psycho0: IOVA c0114000 IOTTE 3fc84012
	Stopped in pid 0.3 (system) at  netbsd:cpu_Debugger+0x4:        nop
	db{0}> bt
	sparc_interrupt(ffffffffffffffe0, 20, 1000000, 6, 4, 3aa6840) at netbsd:sparc_interrupt+0x1e8
	_bus_dmamap_unload(1819140, 2f36000, 0, 5ea, 8, 7fffffffffffffff) at netbsd:_bus_dmamap_unload+0x74
	iommu_dvmamap_unload(2df5880, 2f36000, 6000, 5ea, 8, 0) at netbsd:iommu_dvmamap_unload+0x28
	re_txeof(c57a000, c, c17364c, 3fc84000, 0, 5ea) at netbsd:re_txeof+0x108
	re_intr(c57a000, 42d2e70, 5ea, 0, 5, 401) at netbsd:re_intr+0x134
	intr_biglock_wrapper(2df4a00, 0, e0017ed0, 10, 114b0e0, c173668) at netbsd:intr_biglock_wrapper+0x10
	sparc_interrupt(0, 42d2e70, 1f4, 0, 2, 0) at netbsd:sparc_interrupt+0x1e8
	ifq_enqueue(c57a008, 0, 2, 2, c1739a2, 1000000) at netbsd:ifq_enqueue+0xa8
	ether_output(0, 42d2e70, 3c19a20, 3a97650, 2810, 3aa6840) at netbsd:ether_output+0x6bc
	ip_output(14, 0, 3c19a20, c57a008, 3c08a00, 4326810) at netbsd:ip_output+0xfa4
	ip_forward(42d86a0, 1, c4dac08, 0, c4dac08, ac101837) at netbsd:ip_forward+0x158
	ip_input(5dc, 0, 0, c050e00, 114b0e0, c053b70) at netbsd:ip_input+0xb84
	ipintr(1879c00, 0, c053740, 6, 34, de) at netbsd:ipintr+0x34
	softint_thread(c02e230, c053740, 0, c050e00, 1296780, c052bf0) at netbsd:softint_thread+0x64
	lwp_trampoline(f0067458, fffa9cf8, 111800, 110728, fffa9df8, 1) at netbsd:lwp_trampoline+0x8
	db{0}> c

	unfortunately it did not dump core or do anything after this 'c',
	so the only data i really have is this stack trace.
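
The IOVA and IOTTE values on the second console line can be picked apart by hand. Below is a minimal sketch of that decoding, assuming an IOTTE layout with the valid bit in bit 63, a cacheable flag at 0x10, a writable flag at 0x2, and 8 KB IOMMU pages. The mask values and helper names are illustrative assumptions and are not copied from the kernel source.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed IOTTE layout (illustrative only; verify against the
 * kernel's IOMMU register header before relying on these values).
 */
#define IOTTE_V      UINT64_C(0x8000000000000000) /* entry is valid */
#define IOTTE_C      UINT64_C(0x0000000000000010) /* cacheable */
#define IOTTE_W      UINT64_C(0x0000000000000002) /* writable */
#define IO_PAGE_OFF  UINT64_C(0x0000000000001fff) /* offset in 8 KB page */

/* Hypothetical helper: is the valid bit set in this entry? */
static bool
iotte_is_valid(uint64_t tte)
{
	return (tte & IOTTE_V) != 0;
}

/* Hypothetical helper: flag bits below the 8 KB page boundary. */
static uint64_t
iotte_low_flags(uint64_t tte)
{
	return tte & IO_PAGE_OFF;
}

/*
 * Hypothetical helper: the page-aligned part of the entry.
 * The high flag bits are not masked off here.
 */
static uint64_t
iotte_page(uint64_t tte)
{
	return tte & ~IO_PAGE_OFF;
}
```

With the console values, `iotte_is_valid(0x3fc84012)` is false, `iotte_low_flags()` yields 0x12 (`IOTTE_C | IOTTE_W` under the assumed layout), and `iotte_page()` yields 0x3fc84000. The IOVA c0114000 is 8 KB aligned. An entry with the valid bit clear would be consistent with the mapping having been torn down while the device was still using it, though the trace alone does not prove that.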

>How-To-Repeat:

	unknown.

>Fix:

>Release-Note:

>Audit-Trail:
From: Takeshi Nakayama <tn@catvmics.ne.jp>
To: gnats-bugs@NetBSD.org, mrg@eterna.com.au
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org
Subject: Re: kern/43274: re(4) crash on ultra10 - uncorrectable DMA error
Date: Sat, 08 May 2010 06:04:40 +0900 (JST)

 >>> mrg@eterna.com.au wrote

 > 	ultra10 crashed earlier today with this on the console:
 > 
 > 	login: psycho0: uncorrectable DMA error AFAR 11b8450 AFSR 0x410000ff40800000<BLK,P_DTE,P_DRD>
 > 	psycho0: IOVA c0114000 IOTTE 3fc84012
 > 	Stopped in pid 0.3 (system) at  netbsd:cpu_Debugger+0x4:        nop
 > 	db{0}> bt
 > 	sparc_interrupt(ffffffffffffffe0, 20, 1000000, 6, 4, 3aa6840) at netbsd:sparc_interrupt+0x1e8
 > 	_bus_dmamap_unload(1819140, 2f36000, 0, 5ea, 8, 7fffffffffffffff) at netbsd:_bus_dmamap_unload+0x74
 > 	iommu_dvmamap_unload(2df5880, 2f36000, 6000, 5ea, 8, 0) at netbsd:iommu_dvmamap_unload+0x28
 > 	re_txeof(c57a000, c, c17364c, 3fc84000, 0, 5ea) at netbsd:re_txeof+0x108
 > 	re_intr(c57a000, 42d2e70, 5ea, 0, 5, 401) at netbsd:re_intr+0x134
 > 	intr_biglock_wrapper(2df4a00, 0, e0017ed0, 10, 114b0e0, c173668) at netbsd:intr_biglock_wrapper+0x10
 > 	sparc_interrupt(0, 42d2e70, 1f4, 0, 2, 0) at netbsd:sparc_interrupt+0x1e8
 > 	ifq_enqueue(c57a008, 0, 2, 2, c1739a2, 1000000) at netbsd:ifq_enqueue+0xa8
 > 	ether_output(0, 42d2e70, 3c19a20, 3a97650, 2810, 3aa6840) at netbsd:ether_output+0x6bc
 > 	ip_output(14, 0, 3c19a20, c57a008, 3c08a00, 4326810) at netbsd:ip_output+0xfa4
 > 	ip_forward(42d86a0, 1, c4dac08, 0, c4dac08, ac101837) at netbsd:ip_forward+0x158
 > 	ip_input(5dc, 0, 0, c050e00, 114b0e0, c053b70) at netbsd:ip_input+0xb84
 > 	ipintr(1879c00, 0, c053740, 6, 34, de) at netbsd:ipintr+0x34
 > 	softint_thread(c02e230, c053740, 0, c050e00, 1296780, c052bf0) at netbsd:softint_thread+0x64
 > 	lwp_trampoline(f0067458, fffa9cf8, 111800, 110728, fffa9df8, 1) at netbsd:lwp_trampoline+0x8
 > 	db{0}> c

 I see a similar problem on tlp(4) on Netra X1.  So please try this
 workaround.


 Index: sys/arch/sparc64/dev/iommu.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/sparc64/dev/iommu.c,v
 retrieving revision 1.98
 diff -u -d -r1.98 iommu.c
 --- sys/arch/sparc64/dev/iommu.c	11 Mar 2010 03:54:56 -0000	1.98
 +++ sys/arch/sparc64/dev/iommu.c	7 May 2010 14:07:08 -0000
 @@ -358,8 +358,10 @@
  		 * eliminating the next line, but the page is mapped
  		 * until the next iommu_enter call.
  		 */
 +#if 0 /* XXX */
  		is->is_tsb[IOTSBSLOT(va,is->is_tsbsize)] &= ~IOTTE_V;
  		membar_storestore();
 +#endif
  		bus_space_write_8(is->is_bustag, is->is_iommu,
  			IOMMUREG(iommu_flush), va);
  		va += PAGE_SIZE;


 As I noted in a comment in iommu.c near this workaround, it seems
 that unmapping an IOMMU page which is in use by a device causes an
 uncorrectable DMA error.

 I could not find any fix for the problem other than this workaround.
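
 The effect of the `#if 0` can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct toy_iommu`, `toy_remove()`, `toy_dma_ok()` and the `TOY_*` constants are hypothetical names invented for illustration, the valid bit is assumed to be bit 63, and the real iommu_remove() also flushes the IOMMU, which the model leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IOTTE_V         UINT64_C(0x8000000000000000) /* assumed valid bit */
#define TOY_PAGE_SIZE   UINT64_C(8192)               /* assumed 8 KB pages */
#define TOY_TSB_ENTRIES 8u

/* Toy stand-in for the IOMMU state: a tiny TSB and its DVMA base. */
struct toy_iommu {
	uint64_t tsb[TOY_TSB_ENTRIES];
	uint64_t dvma_base;
};

/* Map a DVMA address to its TSB slot; callers must stay in range. */
static size_t
toy_slot(const struct toy_iommu *is, uint64_t va)
{
	size_t slot = (size_t)((va - is->dvma_base) / TOY_PAGE_SIZE);

	assert(slot < TOY_TSB_ENTRIES);
	return slot;
}

/*
 * Model of the unmap loop.  With invalidate == true the valid bit is
 * cleared, as the code did before the workaround; with false the
 * entry stays mapped until it is next overwritten.
 */
static void
toy_remove(struct toy_iommu *is, uint64_t va, uint64_t len, bool invalidate)
{
	for (uint64_t end = va + len; va < end; va += TOY_PAGE_SIZE) {
		if (invalidate)
			is->tsb[toy_slot(is, va)] &= ~IOTTE_V;
		/* the real code writes the IOMMU flush register here */
	}
}

/* Toy device access: translation succeeds only for a valid entry. */
static bool
toy_dma_ok(const struct toy_iommu *is, uint64_t va)
{
	return (is->tsb[toy_slot(is, va)] & IOTTE_V) != 0;
}
```

 In this model a device access that arrives after `toy_remove(..., true)` fails translation, while after `toy_remove(..., false)` it still succeeds. The workaround trades strict unmapping for tolerance of such late DMA.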

 -- Takeshi Nakayama

From: matthew green <mrg@eterna.com.au>
To: Takeshi Nakayama <tn@catvmics.ne.jp>
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org, gnats-bugs@NetBSD.org
Subject: re: kern/43274: re(4) crash on ultra10 - uncorrectable DMA error
Date: Sat, 08 May 2010 17:46:20 +1000


    > 	ultra10 crashed earlier today with this on the console:
    > 
    > 	login: psycho0: uncorrectable DMA error AFAR 11b8450 AFSR 0x410000ff40800000<BLK,P_DTE,P_DRD>
    > 	psycho0: IOVA c0114000 IOTTE 3fc84012
    [ .. ]

    I see a similar problem on tlp(4) on Netra X1.  So please try this
    workaround.


    Index: sys/arch/sparc64/dev/iommu.c
    ===================================================================
    RCS file: /cvsroot/src/sys/arch/sparc64/dev/iommu.c,v
    retrieving revision 1.98
    diff -u -d -r1.98 iommu.c
    --- sys/arch/sparc64/dev/iommu.c	11 Mar 2010 03:54:56 -0000	1.98
    +++ sys/arch/sparc64/dev/iommu.c	7 May 2010 14:07:08 -0000
    @@ -358,8 +358,10 @@
     		 * eliminating the next line, but the page is mapped
     		 * until the next iommu_enter call.
     		 */
    +#if 0 /* XXX */
     		is->is_tsb[IOTSBSLOT(va,is->is_tsbsize)] &= ~IOTTE_V;
     		membar_storestore();
    +#endif
     		bus_space_write_8(is->is_bustag, is->is_iommu,
     			IOMMUREG(iommu_flush), va);
     		va += PAGE_SIZE;


    As I noted in a comment in iommu.c near this workaround, it seems
    that unmapping an IOMMU page which is in use by a device causes an
    uncorrectable DMA error.

    I could not find any fix for the problem other than this workaround.

 i noticed that open solaris never removes the valid bit from the
 iotte's.  i think we should commit the #if 0 or just remove that
 code entirely...


 .mrg.

From: matthew green <mrg@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/43274 CVS commit: src/sys/arch/sparc64/dev
Date: Thu, 17 Jun 2010 06:48:46 +0000

 Module Name:	src
 Committed By:	mrg
 Date:		Thu Jun 17 06:48:46 UTC 2010

 Modified Files:
 	src/sys/arch/sparc64/dev: iommu.c

 Log Message:
 in iommu_remove() don't invalidate the IOMMU mapping.  for reasons not
 yet determined, some PCI devices (at least fxp(4) and re(4)) sometimes
 appear to perform DMA operations while this is happening, and we get
 uncorrectable DMA errors.  ideally, this "shouldn't happen", but none
 of the investigation so far has revealed the problem, and my source
 investigation of both opensolaris and linux shows that they perform
 the invalidation when unmapping.

 "handles" PR#43274 as well as other issues...

 XXX: candidate for netbsd-5


 To generate a diff of this commit:
 cvs rdiff -u -r1.98 -r1.99 src/sys/arch/sparc64/dev/iommu.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->analyzed
State-Changed-By: mrg@NetBSD.org
State-Changed-When: Thu, 17 Jun 2010 06:53:13 +0000
State-Changed-Why:
ok, so we know that this problem happens in particular cases but we're not
sure why it fails yet, but we do have a committed workaround.  uncommit the
workaround to truly solve this problem, someone, please.. :-)


>Unformatted:
