NetBSD Problem Report #4691

Received: (qmail 21295 invoked from network); 15 Dec 1997 17:48:24 -0000
Message-Id: <199712151747.MAA00600@sometimes.weird.com>
Date: Mon, 15 Dec 1997 12:47:11 -0500 (EST)
From: woods@sometimes.weird.com
Reply-To: woods@planix.com
To: gnats-bugs@gnats.netbsd.org
Subject: sun3 ECC error reporting works, but error is not cleared and system loops forever
X-Send-Pr-Version: 3.95

>Number:         4691
>Category:       port-sun3
>Synopsis:       sun3 ECC error reporting works, but error is not cleared and system loops forever
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    port-sun3-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Dec 15 09:50:02 +0000 1997
>Closed-Date:    
>Last-Modified:  Mon Dec 28 17:28:54 +0000 1998
>Originator:     Greg A. Woods
>Release:        NetBSD-current 1997/12/14
>Organization:
							Greg A. Woods

+1 416 443-1734      VE3TCP      <gwoods@acm.org>      <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>
>Environment:

System: NetBSD sometimes 1.3_ALPHA NetBSD 1.3_ALPHA (MOUSETRAP) #3: Wed Dec 3 12:38:27 EST 1997 woods@sometimes:/var/usr.src/sys/arch/sun3/compile/MOUSETRAP sun3

>Description:

ECC memory, and memory error detection in general, is critical for me.
I had already seen the CE lamp lit on one of the boards in my test
machine, so I enabled ECC correctable error interrupts to make sure
that the system logs recorded these events as well.

Today I found the machine stuck in a tight loop, reporting the same
error forever and not responding to the keyboard or any other I/O:

	Memory error on CPU cycle!
	 ctx=4, vaddr=0xe5f8007, paddr=0x1254000
	 csr=d1<IPEND,IENA,CE_ENA,CE>
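The csr value in this message can be decoded by hand.  The following is
a minimal sketch, not the kernel's own code: the bit values are inferred
from the log line itself (0xd1 has exactly four bits set, and four flag
names are printed), and the CSR_* names and csr_decode() are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Bit values inferred from "csr=d1<IPEND,IENA,CE_ENA,CE>".
 * These are assumptions, not copied from the kernel headers.
 */
#define CSR_IPEND   0x80	/* presumably: interrupt pending */
#define CSR_IENA    0x40	/* presumably: error interrupts enabled */
#define CSR_CE_ENA  0x10	/* correctable error reporting enabled */
#define CSR_CE      0x01	/* correctable error latched */

/* Format csr as e.g. "d1<IPEND,IENA,CE_ENA,CE>" into buf; returns buf. */
static char *
csr_decode(unsigned csr, char *buf, size_t len)
{
	static const struct {
		unsigned bit;
		const char *name;
	} flags[] = {
		{ CSR_IPEND, "IPEND" },
		{ CSR_IENA, "IENA" },
		{ CSR_CE_ENA, "CE_ENA" },
		{ CSR_CE, "CE" },
	};
	size_t i, used;
	int first = 1;

	assert(buf != NULL && len > 0);
	snprintf(buf, len, "%x<", csr);
	for (i = 0; i < sizeof(flags) / sizeof(flags[0]); i++) {
		if ((csr & flags[i].bit) == 0)
			continue;
		/* Append the flag name, comma-separated after the first. */
		used = strlen(buf);
		snprintf(buf + used, len - used, "%s%s",
		    first ? "" : ",", flags[i].name);
		first = 0;
	}
	used = strlen(buf);
	snprintf(buf + used, len - used, ">");
	return buf;
}
```

Under these assumed bit values, the reported csr shows that the
correctable error is still latched and an interrupt is still pending
each time the message is printed.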

Unfortunately I have the PROM set to reset the system on a watchdog
reset.  (With the other setting, does a watchdog reset drop into DDB,
or straight to the PROM?  If the latter, is there a way to get to DDB
from the PROM?)

>How-To-Repeat:

Find a memory board that has occasional correctable errors and install it.

Apply the following patch to /usr/src/sys/arch/sun3/dev/memerr.c and
build a new kernel:

11:53 [1233] # diff memerr.c-1.8 memerr.c
165c165
<               mer->me_csr = ME_CSR_IENA; /* | ME_ECC_CE_ENA */
---
>               mer->me_csr = ME_CSR_IENA | ME_ECC_CE_ENA;

Boot the new kernel and wait for the memory error to occur.

Observe that the system is locked in an interrupt loop reporting the
correctable error and that the CE LED is lit on the board.

Hit the watchdog reset button to break the loop and either drop to PROM or
reset the system.

>Fix:

Note that the code in memerr.c that claims to clear the error does not
actually do so:

recover:
        /* Clear the error by writing the address register. */
        me->me_vaddr = 0;
        return (1);

My guess is that the ECC boards need to be told directly that the error
has been handled and that the interrupt should be disabled.  I've looked
at the code for sun4/200 support of ECC memory sent to me by Chuck
Cranor, but it looks like a lot more work is needed to make that code
fit the sun3 framework.  Perhaps the sun3 framework should instead be
reworked to match the sun4/200 framework.  For example, Chuck's code
does special things to probe the memory boards during boot to determine
their size, starting address, etc.
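
As an illustration of the guess above only, here is a toy model in
plain C; it is not a tested fix and does not touch real hardware.  The
struct fake_memerr layout, the CSR_* bit values (inferred from the csr
dump in the description), and the assumption that the interrupt stays
asserted while IENA, CE_ENA and CE are all set are all hypothetical.
It contrasts the current recovery step with a possible stopgap that
masks correctable error reporting after the first report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit values, inferred from "csr=d1<IPEND,IENA,CE_ENA,CE>". */
#define CSR_IENA    0x40
#define CSR_CE_ENA  0x10
#define CSR_CE      0x01

/* Toy stand-in for the register block; NOT the real hardware layout. */
struct fake_memerr {
	uint8_t  me_csr;
	uint32_t me_vaddr;
};

/* Toy model: interrupt asserted while CE is latched and both enables set. */
static int
fake_irq_asserted(const struct fake_memerr *me)
{
	const uint8_t need = CSR_IENA | CSR_CE_ENA | CSR_CE;

	assert(me != NULL);
	return (me->me_csr & need) == need;
}

/* Current recovery step: only the address register is written. */
static void
recover_current(struct fake_memerr *me)
{
	me->me_vaddr = 0;
}

/* Hypothetical stopgap: also stop reporting correctable errors. */
static void
recover_mask_ce(struct fake_memerr *me)
{
	me->me_vaddr = 0;
	me->me_csr &= (uint8_t)~CSR_CE_ENA;
}

/* Run the handler until the model's interrupt drops, or limit is hit. */
static int
service_until_quiet(struct fake_memerr *me,
    void (*recover)(struct fake_memerr *), int limit)
{
	int n = 0;

	while (fake_irq_asserted(me) && n < limit) {
		recover(me);
		n++;
	}
	return n;
}
```

In this model recover_current() never quiets the interrupt, which
matches the observed loop, while recover_mask_ce() quiets it after one
pass at the cost of missing later correctable errors.  Whether the real
boards behave this way, and how the latched CE bit is properly cleared,
would have to be checked against the hardware documentation.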
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: gnats-admin->port-sun3-maintainer 
Responsible-Changed-By: fair 
Responsible-Changed-When: Mon Dec 28 09:28:51 PST 1998 
Responsible-Changed-Why:  
This PR is the responsibility of the portmaster, 
not the GNATS database administrator. 
>Unformatted:
