NetBSD Problem Report #8216
Received: (qmail 26866 invoked from network); 17 Aug 1999 17:54:49 -0000
Message-Id: <199908171754.TAA29916@natteravn.runit.sintef.no>
Date: Tue, 17 Aug 1999 19:54:15 +0200 (MEST)
From: jarle@runit.sintef.no
Reply-To: jarle@runit.sintef.no
To: gnats-bugs@gnats.netbsd.org
Subject: panic: machine check in XentArith
X-Send-Pr-Version: 3.95
>Number: 8216
>Category: port-alpha
>Synopsis: panic: machine check in XentArith
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: port-alpha-maintainer
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Aug 17 11:05:01 +0000 1999
>Closed-Date:
>Last-Modified: Fri Oct 26 06:49:54 +0000 2001
>Originator: Jarle Greipsland
>Release: NetBSD-current 1999-08-14
>Organization:
>Environment:
System: NetBSD honey 1.4J NetBSD 1.4J (HONEY) #3: Tue Aug 17 17:37:07 CEST 1999 jarle@honey:/usr/src/sys/arch/alpha/compile/HONEY alpha
>Description:
On my PC164 system I tried running the 'crashme' program. Well, it did
indeed crash the system. I compiled the crashme utility from the current
package sources; the version seems to be 2.4. I then started it as
% crashme +2000 666 100 1:00:00
It ran for a while, and then the system panicked:
unexpected machine check:
mces = 0x1
vector = 0x670
param = 0xfffffc0000006060
pc = 0xfffffc0000300498
ra = 0x120001ac8
curproc = 0xfffffc0004e265d8
pid = 543, comm = crashme
panic: machine check
Stopped in crashme at Debugger+0x4: ret zero,(ra)
db> trace
Debugger() at Debugger+0x4
panic() at panic+0xe4
machine_check() at machine_check+0x1fc
interrupt() at interrupt+0x134
XentInt() at XentInt+0x1c
--- interrupt (from ipl 0) ---
XentArith() at XentArith
--- arithmetic trap ---
*ABS*() at 0
*ABS*() at 0
--- root of call graph ---
Clearly not the appropriate behavior. I know that there have(?) been some
problems with the kernel Debugger, so I don't know whether the above trace
tells the whole truth.
Also, the system has a slightly bad memory module, so it prints
warnings about 'processor correctable errors' during periods of high
activity. So far, though, the system has never crashed during normal
operation. But I guess there is always the chance that the two
abnormalities are related.
The crashme panics seem quite reproducible, so if anyone requires more
information I can repeat the exercise.
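For readers decoding the dump above, the sketch below maps the reported
vector to a name. The table values are an assumption, recalled from
NetBSD's sys/arch/alpha/include/alpha_cpu.h, and should be checked against
the source tree; describe_vector is a hypothetical helper, not a kernel
function.

```python
# Interrupt vector codes as recalled from NetBSD's alpha_cpu.h (assumption:
# verify against the source tree before relying on them).
MCHECK_VECTORS = {
    0x620: "system correctable error",
    0x630: "processor correctable error",
    0x660: "system machine check",
    0x670: "processor machine check",
}


def describe_vector(vector):
    """Return a readable name for a machine-check vector (hypothetical helper)."""
    return MCHECK_VECTORS.get(vector, "unknown vector 0x%x" % vector)


print(describe_vector(0x670))
```

Under that assumption, the vector 0x670 in the panic is a processor machine
check, whereas the 'processor correctable errors' warnings would arrive on
vector 0x630.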
>How-To-Repeat:
Run crashme with the manpage parameter set.
>Fix:
No idea.
-jarle
--
"If it makes goo on the windshield, we'll call it a bug."
-- Larry Wall
>Release-Note:
>Audit-Trail:
State-Changed-From-To: open->feedback
State-Changed-By: thorpej
State-Changed-When: Fri May 4 10:30:52 PDT 2001
State-Changed-Why:
Ross committed FP completion code to -current, which may very
well have addressed this problem. Can you confirm?
From: Ross Harvey <ross@ghs.com>
To: jarle@runit.sintef.no, thorpej@netbsd.org
Cc: gnats-bugs@gnats.netbsd.org
Subject: Re: port-alpha/8216
Date: Mon, 7 May 2001 11:48:25 -0700 (PDT)
I am 99% certain that this is due to your bad ram, but I will run crashme
anyway. What did your crashme command line look like?
It would, BTW, help a lot if you could fix the ram and then try to reproduce
the test case. Also, we should keep gnats-bugs@netbsd.org on the cc list;
such replies don't go to the mailing list but are filed with the pr, at
least, when we have
Here is the explanation for your results, i.e., why you see the FP message
and, more important, why it appears to crash inside the kernel.
The FP messages are printed when the kernel receives an arithmetic trap
but finds no legal instruction as the cause. This is certain to happen from
time to time with crashme. The kernel limits how many of these messages it
prints so that intentionally generated sequences cannot become a major
disruption.
No harm is done, so this behavior should simply be considered "as designed".
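As an illustration of that kind of limiting, the sketch below prints at
most a fixed number of reports and silently counts the rest. The class
name, the limit of 2, and the policy itself are assumptions made for
illustration; this is not the kernel's code.

```python
class LimitedReporter:
    """Print at most `limit` reports; count the rest without printing."""

    def __init__(self, limit):
        self.limit = limit
        self.printed = 0
        self.suppressed = 0

    def report(self, insn):
        """Print the instruction word if under the limit; return True if printed."""
        if self.printed < self.limit:
            self.printed += 1
            print("FP instruction %08x" % insn)
            return True
        self.suppressed += 1
        return False


reporter = LimitedReporter(limit=2)
for insn in (0x5C26B723, 0x5741D17D, 0x5C26B723):
    reporter.report(insn)
```

With a cap like this, a program that deliberately generates such traps in a
loop can produce only a bounded amount of console output.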
Since the cpu can only take one interrupt or exception at a time, all such
events are given a priority order. This order is not directly visible to
the kernel, but is mainly an issue between the cpu and the palcode. It is
carefully arranged so that any simultaneous combination of normal events
can be recovered. One consequence of this is that arithmetic traps have a
very high priority for trapping into the palcode. This allows the outermost
kernel frame to directly reference the innermost user frame and lets the
kernel run its 8,000 lines of code to do the optional IEEE FP fixup when
requested.
However, FP operations run in the "background"; the CPU continues executing
new ops even when previous FP ops are still in progress. It does this by
putting a "scoreboard hold" on FP output registers, and this leads to a
fun conflict if, while an FP op that will eventually produce a trap is
outstanding, the cpu hits another event. Entering palcode itself acts
like a trap barrier, and because the other traps are either faults (precise
PC available) or aborts (no recovery possible, so who cares) it can do the
FP one first. Information on the other event is either lost (but it doesn't
matter, because it will happen again when returning from the FP event) or
delivered later.
In this case, "later".
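The ordering described above can be pictured with a toy model. The event
names and the simple two-level ordering below are illustrative assumptions,
not the real Alpha or palcode priority table; the point is only that a
pending arithmetic trap is delivered first and the other event is delivered
afterwards.

```python
# Toy model only: event names and the two-level ordering are illustrative
# assumptions, not the real Alpha/PALcode priority table.
ARITH_TRAP = "arithmetic trap"
MACHINE_CHECK = "machine check"


def delivery_order(pending):
    """Deliver pending arithmetic traps first, then the rest in arrival order."""
    first = [e for e in pending if e == ARITH_TRAP]
    later = [e for e in pending if e != ARITH_TRAP]
    return first + later


def simulate(pending):
    """Return the nested frames, outermost first, after all events are delivered."""
    frames = ["user code (crashme)"]
    for event in delivery_order(pending):
        frames.append(event)
    return frames


print(simulate([MACHINE_CHECK, ARITH_TRAP]))
```

In this model the machine check ends up stacked on top of the arithmetic
trap frame, which is the same order the reported trace shows: an interrupt
frame sitting directly above XentArith.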
> From: Jarle Greipsland <jarle@runit.sintef.no>
>
> thorpej writes:
> > Synopsis: panic: machine check in XentArith
> > State-Changed-From-To: open->feedback
> > State-Changed-By: thorpej
> > State-Changed-When: Fri May 4 10:30:52 PDT 2001
> > State-Changed-Why:
> > Ross committed FP completion code to -current, which may very
> > well have addressed this problem. Can you confirm?
>
> The problem is still there. Stack trace below. However, this time I
> got a couple of console messages well ahead (10-20 minutes) of the
> crash itself (so I don't know whether they relate to the cause of the
> crash):
>
> FP instruction 5c26b723
> FP event 3/8/800000
> Please report this to port-alpha-maintainer@netbsd.org
>
> FP instruction 5741d17d
> FP event 1/a/a00000
> Please report this to port-alpha-maintainer@netbsd.org
>
> The crash messages themselves:
>
> unexpected machine check:
>
> mces = 0x1
> vector = 0x670
> param = 0xfffffc0000006060
> pc = 0xfffffc0000300370
> ra = 0x120001c08
> curproc = 0xfffffc00010a3c78
> pid = 5635, comm = crashme
>
> panic: machine check
> Stopped in pid 5635 (crashme) at cpu_Debugger+0x4: ret zero,(ra)
> db> trace
> cpu_Debugger() at cpu_Debugger+0x4
> panic() at panic+0xfc
> machine_check() at machine_check+0x1d8
> interrupt() at interrupt+0x180
> XentInt() at XentInt+0x1c
> --- interrupt (from ipl 0) ---
> XentArith() at XentArith
> --- arithmetic trap ---
> *ABS*() at 0
> *ABS*() at 0
> --- root of call graph ---
> db> show reg
> v0 0x7
> t0 0xfffffc00005fe768 db_onpanic
> t1 0x1
> t2 0x1fcd rn+0x1fad
> t3 0xfffffc0000580c63 fmt.96+0x1068
> t4 0xfffffc0007f14000 end+0x78a99a8
> t5 0x1
> t6 0
> t7 0
> s0 0xfffffe0007167e38
> s1 0x8 rettmp
> s2 0x1
> s3 0x100 rn+0xe0
> s4 0xfffffe0007167ec8
> s5 0x670 rn+0x650
> s6 0xfffffc00005f6e0a ncpuinit+0x322
> a0 0x7
> a1 0x8 rettmp
> a2 0x5
> a3 0x8 rettmp
> a4 0x3
> a5 0x8 rettmp
> t8 0x1f framesz+0xf
> t9 0xfffffc0000580c62 fmt.96+0x1067
> t10 0x1
> t11 0
> ra 0xfffffc0000369e9c panic+0xfc
> t12 0xfffffc000055e260 cpu_Debugger
> at 0
> gp 0xfffffc000060f8e8 mountroot+0x8008
> sp 0xfffffe0007167dc8
> pc 0xfffffc000055e264 cpu_Debugger+0x4
> ps 0x7
> ai 0
> pv 0xfffffc000055e260 cpu_Debugger
> cpu_Debugger+0x4: ret zero,(ra)
>
> -jarle
>
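The instruction words in the console messages quoted above can be split
into fields with the sketch below. It assumes the Alpha floating-point
operate format (opcode in bits 31:26, Fa in 25:21, Fb in 20:16, function in
15:5, Fc in 4:0); decode_fp_operate is a hypothetical helper. Since crashme
executes random data, the words need not be valid instructions.

```python
def decode_fp_operate(word):
    """Split a 32-bit Alpha instruction word into FP operate-format fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "fa": (word >> 21) & 0x1F,
        "fb": (word >> 16) & 0x1F,
        "function": (word >> 5) & 0x7FF,
        "fc": word & 0x1F,
    }


for word in (0x5C26B723, 0x5741D17D):
    fields = decode_fp_operate(word)
    print("%08x" % word, {name: hex(value) for name, value in fields.items()})
```

Both words carry opcodes in the floating-point operate range (0x17 and
0x15), which fits the kernel treating them as FP instructions even though
it could not match them to a legal operation.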
From: Jarle Greipsland <jarle@runit.sintef.no>
To: ross@ghs.com
Cc: thorpej@netbsd.org, gnats-bugs@gnats.netbsd.org
Subject: Re: port-alpha/8216
Date: Tue, 08 May 2001 08:53:28 +0200
Ross Harvey writes:
> I am 99% certain that this is due to your bad ram
Oops, my bad. I forgot to mention that the memory situation finally
got so bad that I had to remove 128MB of RAM and run the system with
128MB in the 128 bit wide memory bus configuration. Since then I have
not seen any "processor correctable error"s. I thus believe that
there is no longer any problem with bad memory. (Actually I suspect
that the problem lies with one of the now unused memory controllers,
and not with the memory modules themselves).
> but I will run crashme anyway. What did your crashme command line
> look like?
% crashme +2000 666 100 1:00:00
> It would, BTW, help a lot if you could fix the ram and then try to
> reproduce the test case.
See above; I believe that the ram is no longer a source of problems.
And yes, the problem is still there.
> Here is the explanation for your results, i.e., why you see the FP
> message and, more important, why it appears to crash inside the
> kernel.
[ ... ]
Thank you very much for the thorough explanation.
-jarle
--
If an undetectable error occurs, the processor continues as if no error
had occurred. -- IBM S/360 Principles of Operation
From: Jarle Greipsland <jarle@runit.sintef.no>
To: ross@ghs.com
Cc: thorpej@netbsd.org, gnats-bugs@gnats.netbsd.org
Subject: Re: port-alpha/8216
Date: Tue, 08 May 2001 12:04:27 +0200
Jarle Greipsland writes:
>> but I will run crashme anyway. What did your crashme command line
>> look like?
> % crashme +2000 666 100 1:00:00
>> It would, BTW, help a lot if you could fix the ram and then try to
>> reproduce the test case.
> See above; I believe that the ram is no longer a source of problems.
> And yes, the problem is still there.
Just to make this clear: The problem that is still there is that the
system panics while running crashme. I have not observed any memory
problems for more than a year.
-jarle
--
"strcmp(3): When you want to just relax and cruise down memory lane."
-- hpeyerl@novatel.ca
State-Changed-From-To: feedback->open
State-Changed-By: chs
State-Changed-When: Thu Oct 25 23:49:30 PDT 2001
State-Changed-Why:
feedback given, problem still there.
>Unformatted: