NetBSD Problem Report #8216

Received: (qmail 26866 invoked from network); 17 Aug 1999 17:54:49 -0000
Message-Id: <199908171754.TAA29916@natteravn.runit.sintef.no>
Date: Tue, 17 Aug 1999 19:54:15 +0200 (MEST)
From: jarle@runit.sintef.no
Reply-To: jarle@runit.sintef.no
To: gnats-bugs@gnats.netbsd.org
Subject: panic: machine check in XentArith
X-Send-Pr-Version: 3.95

>Number:         8216
>Category:       port-alpha
>Synopsis:       panic: machine check in XentArith
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    port-alpha-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Aug 17 11:05:01 +0000 1999
>Closed-Date:    
>Last-Modified:  Fri Oct 26 06:49:54 +0000 2001
>Originator:     Jarle Greipsland
>Release:        NetBSD-current 1999-08-14
>Organization:

>Environment:

System: NetBSD honey 1.4J NetBSD 1.4J (HONEY) #3: Tue Aug 17 17:37:07 CEST 1999     jarle@honey:/usr/src/sys/arch/alpha/compile/HONEY alpha

>Description:

On my PC164 system I tried running the 'crashme' program.  Well, it did
indeed crash the system.  I compiled the crashme utility from the current
package sources; the version seems to be 2.4.  I then started it as

% crashme +2000 666 100 1:00:00

It ran for a while, and then the system panicked:

unexpected machine check:

    mces    = 0x1
    vector  = 0x670
    param   = 0xfffffc0000006060
    pc      = 0xfffffc0000300498
    ra      = 0x120001ac8
    curproc = 0xfffffc0004e265d8
        pid = 543, comm = crashme

panic: machine check
Stopped in crashme at   Debugger+0x4:   ret     zero,(ra)
db> trace
Debugger() at Debugger+0x4
panic() at panic+0xe4
machine_check() at machine_check+0x1fc
interrupt() at interrupt+0x134
XentInt() at XentInt+0x1c
--- interrupt (from ipl 0) ---
XentArith() at XentArith
--- arithmetic trap ---
*ABS*() at 0
*ABS*() at 0
--- root of call graph ---

Clearly not the appropriate behavior.  I know that there have been (and
perhaps still are) some problems with the kernel Debugger, so I don't know
whether the above trace tells the whole truth.
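
For reference, the fields of the machine check dump above can be decoded
roughly as follows.  This is only a minimal sketch: the vector numbers, the
MCES bit and the K0SEG/K1SEG bases are quoted from memory of the Alpha
PALcode conventions and should be treated as assumptions to be checked
against sys/arch/alpha/include/alpha_cpu.h.  The helper names are made up
for illustration and are not kernel functions.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed constants (from memory; verify against alpha_cpu.h). */
#define VEC_SYS_CORRECTABLE   0x620u  /* system correctable error */
#define VEC_PROC_CORRECTABLE  0x630u  /* processor correctable error */
#define VEC_SYS_MCHECK        0x660u  /* system machine check */
#define VEC_PROC_MCHECK       0x670u  /* processor machine check */
#define MCES_MIP              0x01u   /* machine check in progress */
#define K0SEG_BASE            0xfffffc0000000000ULL
#define K1SEG_BASE            0xfffffe0000000000ULL

/* Hypothetical helper: map an interrupt vector to a readable name. */
static const char *vector_name(uint64_t vector)
{
    switch (vector) {
    case VEC_SYS_CORRECTABLE:  return "system correctable error";
    case VEC_PROC_CORRECTABLE: return "processor correctable error";
    case VEC_SYS_MCHECK:       return "system machine check";
    case VEC_PROC_MCHECK:      return "processor machine check";
    default:                   return "unknown";
    }
}

/* Hypothetical helper: true if addr lies in K0SEG or K1SEG. */
static int is_kernel_address(uint64_t addr)
{
    return addr >= K0SEG_BASE;
}

/* Print a one-line interpretation of each field of the dump. */
static void describe_mcheck(uint64_t mces, uint64_t vector,
                            uint64_t pc, uint64_t ra)
{
    printf("vector 0x%" PRIx64 ": %s\n", vector, vector_name(vector));
    printf("mces   0x%" PRIx64 ": machine check %sin progress\n",
           mces, (mces & MCES_MIP) ? "" : "not ");
    printf("pc     0x%" PRIx64 ": %s address (%s)\n", pc,
           is_kernel_address(pc) ? "kernel" : "non-kernel",
           (pc >= K0SEG_BASE && pc < K1SEG_BASE) ? "K0SEG" : "other");
    printf("ra     0x%" PRIx64 ": %s address\n", ra,
           is_kernel_address(ra) ? "kernel" : "non-kernel");
}
```

Under those assumptions, describe_mcheck(0x1, 0x670, 0xfffffc0000300498,
0x120001ac8) reports a processor machine check taken with pc in the kernel
while ra still holds a user-space address.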

Also, the system has a slightly bad memory module, so it will print out
warnings about 'processor correctable errors' during periods with high
activity.  So far, though, the system has never crashed during normal
operations.  But I guess there is always the chance that the two
abnormalities might be related.

The crashme panics seem quite reproducible, so if anyone requires more
information I could surely repeat the exercise.


>How-To-Repeat:
Run crashme with the parameter set given in its man page, as shown above.


>Fix:

No idea.
						-jarle
-- 
"If it makes goo on the windshield, we'll call it a bug."
				-- Larry Wall
>Release-Note:
>Audit-Trail:
State-Changed-From-To: open->feedback 
State-Changed-By: thorpej 
State-Changed-When: Fri May 4 10:30:52 PDT 2001 
State-Changed-Why:  
Ross committed FP completion code to -current, which may very 
well have addressed this problem.  Can you confirm? 

From: Ross Harvey <ross@ghs.com>
To: jarle@runit.sintef.no, thorpej@netbsd.org
Cc: gnats-bugs@gnats.netbsd.org
Subject: Re: port-alpha/8216
Date: Mon, 7 May 2001 11:48:25 -0700 (PDT)

 I am 99% certain that this is due to your bad ram, but I will run crashme
 anyway.  What did your crashme command line look like?

 It would, BTW, help a lot if you could fix the ram and then try to reproduce
 the test case. Also, we should keep gnats-bugs@netbsd.org on the cc list;
 such replies don't go to the mailing list but are filed with the pr, at
 least, when we have 

 Here is the explanation for your results, i.e., why you see the FP message
 and, more important, why it appears to crash inside the kernel.

 The FP messages are printed when the kernel receives an arithmetic trap
 but finds no legal instruction as the cause. This is certain to happen from
 time to time with crashme. The kernel limits how many of these messages it
 prints, so that intentionally generated sequences cannot become a major
 disruption. No harm is done, so this behavior should simply be considered
 "as designed".
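
 One way to get this kind of limiting is to report each kind of event at
 most once, using a bitmask, with an overall cap on top. The sketch below
 shows that technique only; it is not claimed to be the code in the tree.
 The function name, the cap of 1000 and the message layout are assumptions
 made for illustration. (The second field of the "FP event" lines quoted
 later in this PR, 8 and then a, would be consistent with such a mask, but
 that is an inference.)

```c
#include <stdint.h>
#include <stdio.h>

#define MAX_REPORTS 1000      /* hypothetical overall cap on messages */

static uint64_t reported;     /* bit n is set once kind n has been reported */
static int total;             /* number of messages printed so far */

/*
 * Print a diagnostic for event `kind` caused by instruction word `insn`,
 * but only the first time that kind is seen.
 * Returns 1 if a message was printed, 0 if it was suppressed.
 */
static int report_event_once(unsigned kind, uint32_t insn)
{
    uint64_t bit;

    if (kind >= 64)
        return 0;             /* does not fit in the mask */
    bit = (uint64_t)1 << kind;
    if (reported & bit)
        return 0;             /* this kind was already reported */
    if (total >= MAX_REPORTS)
        return 0;             /* overall cap reached */
    reported |= bit;
    total++;
    printf("FP instruction %x\n", (unsigned)insn);
    printf("FP event %u/%llx\n", kind, (unsigned long long)reported);
    return 1;
}
```

 With this scheme a program that generates the same bad sequence in a loop
 produces one message, not one message per trap.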

 Since the cpu can only take one interrupt or exception at a time, all such
 events are given a priority order. This order is not directly visible to
 the kernel, but is mainly an issue between the cpu and the palcode.  It is
 carefully arranged so that any simultaneous combination of normal events
 can be recovered. One consequence of this is that arithmetic traps have a
 very high priority for trapping into the palcode. This allows the outermost
 kernel frame to directly reference the innermost user frame and lets the
 kernel run its 8,000 lines of code to do the optional IEEE FP fixup when
 requested.

 However, FP operations run in the "background"; the CPU continues executing
 new ops even when previous FP ops are still in progress. It does this by
 putting a "scoreboard hold" on FP output registers, and this leads to a
 fun conflict if the cpu hits another event while an FP op that will
 eventually produce a trap is still outstanding.  Entering palcode itself acts
 like a trap barrier, and because the other traps are either faults (precise
 PC available) or aborts (no recovery possible, so who cares) it can do the
 FP one first. Information on the other event is either lost (but it doesn't
 matter, because it will happen again when returning from the FP event) or
 delivered later.

 In this case, "later".
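
 The ordering described above can be pictured with a toy model. This is
 only a sketch of one reading of the explanation, not of real palcode: the
 type and function names are hypothetical, and the mapping (a fault is
 dropped because it recurs, a machine check is held and delivered later) is
 an assumption drawn from the two cases named in the text.

```c
#include <stdio.h>

/* Kinds of event in the toy model. */
enum event_kind { EV_NONE, EV_ARITH, EV_FAULT, EV_MCHECK };

/* Outcome when another event arrives while a trapping FP op is pending. */
struct delivery {
    enum event_kind first;     /* event delivered first */
    enum event_kind deferred;  /* event delivered later, if any */
    int dropped;               /* 1 if the other event is lost for now */
};

/*
 * Decide what happens to `other` when it is raised while an FP operation
 * that will eventually trap is still outstanding.  The arithmetic trap
 * always goes first in this model.
 */
static struct delivery resolve(enum event_kind other)
{
    struct delivery d = { EV_ARITH, EV_NONE, 0 };

    switch (other) {
    case EV_FAULT:
        /* Lost, but it happens again when the instruction is retried. */
        d.dropped = 1;
        break;
    case EV_MCHECK:
        /* Held back and delivered once the trap handler is entered. */
        d.deferred = EV_MCHECK;
        break;
    default:
        break;
    }
    return d;
}

/* Print the outcome for one kind of competing event. */
static void show(const char *name, enum event_kind other)
{
    struct delivery d = resolve(other);

    printf("%s: arith first, %s\n", name,
           d.dropped ? "other event dropped (will recur)" :
           d.deferred != EV_NONE ? "other event delivered later" :
           "nothing else pending");
}
```

 In this model show("machine check", EV_MCHECK) takes the "delivered later"
 path, which would match a trace where the machine check interrupt arrives
 at the very first instruction of XentArith.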

 > From: Jarle Greipsland <jarle@runit.sintef.no>
 >
 > thorpej  writes:
 > > Synopsis: panic: machine check in XentArith
 > > State-Changed-From-To: open->feedback
 > > State-Changed-By: thorpej
 > > State-Changed-When: Fri May 4 10:30:52 PDT 2001
 > > State-Changed-Why: 
 > > Ross committed FP completion code to -current, which may very
 > > well have addressed this problem.  Can you confirm?
 >
 > The problem is still there.  Stack trace below.  However, this time I
 > got a couple of console messages well ahead (10-20 minutes) of the
 > crash itself (so I don't know whether they relate to the cause of the
 > crash):
 >
 > FP instruction 5c26b723
 > FP event 3/8/800000
 > Please report this to port-alpha-maintainer@netbsd.org
 >
 > FP instruction 5741d17d
 > FP event 1/a/a00000
 > Please report this to port-alpha-maintainer@netbsd.org
 >
 > The crash messages themselves:
 >
 > unexpected machine check:
 >
 >     mces    = 0x1
 >     vector  = 0x670
 >     param   = 0xfffffc0000006060
 >     pc      = 0xfffffc0000300370
 >     ra      = 0x120001c08
 >     curproc = 0xfffffc00010a3c78
 >         pid = 5635, comm = crashme
 >
 > panic: machine check
 > Stopped in pid 5635 (crashme) at        cpu_Debugger+0x4:       ret     zero,(ra)
 > db> trace
 > cpu_Debugger() at cpu_Debugger+0x4
 > panic() at panic+0xfc
 > machine_check() at machine_check+0x1d8
 > interrupt() at interrupt+0x180
 > XentInt() at XentInt+0x1c
 > --- interrupt (from ipl 0) ---
 > XentArith() at XentArith
 > --- arithmetic trap ---
 > *ABS*() at          0
 > *ABS*() at          0
 > --- root of call graph ---
 > db> show reg
 > v0                 0x7
 > t0          0xfffffc00005fe768  db_onpanic
 > t1                 0x1
 > t2              0x1fcd  rn+0x1fad
 > t3          0xfffffc0000580c63  fmt.96+0x1068
 > t4          0xfffffc0007f14000  end+0x78a99a8
 > t5                 0x1
 > t6                   0
 > t7                   0
 > s0          0xfffffe0007167e38
 > s1                 0x8  rettmp
 > s2                 0x1
 > s3               0x100  rn+0xe0
 > s4          0xfffffe0007167ec8
 > s5               0x670  rn+0x650
 > s6          0xfffffc00005f6e0a  ncpuinit+0x322
 > a0                 0x7
 > a1                 0x8  rettmp
 > a2                 0x5
 > a3                 0x8  rettmp
 > a4                 0x3
 > a5                 0x8  rettmp
 > t8                0x1f  framesz+0xf
 > t9          0xfffffc0000580c62  fmt.96+0x1067
 > t10                0x1
 > t11                  0
 > ra          0xfffffc0000369e9c  panic+0xfc
 > t12         0xfffffc000055e260  cpu_Debugger
 > at                   0
 > gp          0xfffffc000060f8e8  mountroot+0x8008
 > sp          0xfffffe0007167dc8
 > pc          0xfffffc000055e264  cpu_Debugger+0x4
 > ps                 0x7
 > ai                   0
 > pv          0xfffffc000055e260  cpu_Debugger
 > cpu_Debugger+0x4:       ret     zero,(ra)
 >
 > 					-jarle
 >

From: Jarle Greipsland <jarle@runit.sintef.no>
To: ross@ghs.com
Cc: thorpej@netbsd.org, gnats-bugs@gnats.netbsd.org
Subject: Re: port-alpha/8216
Date: Tue, 08 May 2001 08:53:28 +0200

 Ross Harvey writes:
 > I am 99% certain that this is due to your bad ram

 Oops, my bad.  I forgot to mention that the memory situation finally
 got so bad that I had to remove 128MB of RAM and run the system with
 128MB in the 128 bit wide memory bus configuration.  Since then I have
 not seen any "processor correctable error"s.  I thus believe that
 there is no longer any problem with bad memory.  (Actually I suspect
 that the problem lies with one of the now unused memory controllers,
 and not with the memory modules themselves).

 > but I will run crashme anyway.  What did your crashme command line
 > look like?

 % crashme +2000 666 100 1:00:00

 > It would, BTW, help a lot if you could fix the ram and then try to
 > reproduce the test case.

 See above; I believe that the ram is no longer a source of problems.
 And yes, the problem is still there.

 > Here is the explanation for your results, i.e., why you see the FP
 > message and, more important, why it appears to crash inside the
 > kernel.
 [ ... ]
 Thank you very much for the thorough explanation.

 					-jarle
 -- 
 If an undetectable error occurs, the processor continues as if no error
 had occurred.			-- IBM S/360 Principles of Operation

From: Jarle Greipsland <jarle@runit.sintef.no>
To: ross@ghs.com
Cc: thorpej@netbsd.org, gnats-bugs@gnats.netbsd.org
Subject: Re: port-alpha/8216
Date: Tue, 08 May 2001 12:04:27 +0200

 Jarle Greipsland writes:
 >> but I will run crashme anyway.  What did your crashme command line
 >> look like?

 > % crashme +2000 666 100 1:00:00

 >> It would, BTW, help a lot if you could fix the ram and then try to
 >> reproduce the test case.

 > See above; I believe that the ram is no longer a source of problems.
 > And yes, the problem is still there.

 Just to make this clear: The problem that is still there is that the
 system panics while running crashme.  I have not observed any memory
 problems for more than a year.
 					-jarle
 -- 
 "strcmp(3): When you want to just relax and cruise down memory lane."
 				-- hpeyerl@novatel.ca
State-Changed-From-To: feedback->open 
State-Changed-By: chs 
State-Changed-When: Thu Oct 25 23:49:30 PDT 2001 
State-Changed-Why:  
feedback given, problem still there. 
>Unformatted: