NetBSD Problem Report #6799

Received: (qmail 22702 invoked from network); 13 Jan 1999 00:27:17 -0000
Message-Id: <199901122358.PAA00476@jules.nas.nasa.gov>
Date: Tue, 12 Jan 1999 15:58:09 -0800 (PST)
From: Matthew Jacob <mjacob@nas.nasa.gov>
Reply-To: mjacob@netbsd.org
To: gnats-bugs@gnats.netbsd.org
Subject: unexpected reboot with 'kernel stack not valid'
X-Send-Pr-Version: 3.95

>Number:         6799
>Category:       port-alpha
>Synopsis:       under light load Alpha 8200 panics with 'kernel stack not valid'
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    port-alpha-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Jan 12 16:35:01 +0000 1999
>Closed-Date:    
>Last-Modified:  Sat Nov 12 16:56:24 +0000 2005
>Originator:     
>Release:        source cvs updated as of Jan-12-1999, 1000PST.
>Organization:
	NASA Ames Research Center
>Environment:

System: NetBSD jules.nas.nasa.gov 1.3I NetBSD 1.3I (JULES) #0: Tue Jan 12 10:36:40 PST 1999 mjacob@mathom.nas.nasa.gov:/space/NetBSD-current/src/sys/arch/alpha/compile/JULES alpha


>Description:

This has been around for a while but it's time to get serious about it.
Under moderately light load (building src), the kernel reboots with:

halted CPU 8

halt code = 2
kernel stack not valid halt
PC = fffffc00004eadac

For this kernel, this PC is 0xc bytes past the start of uvm_fault
(fffffc00004eada0 T uvm_fault).
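
A minimal sketch of how such a PC can be matched to a symbol, assuming
"nm -n" style input lines (16 lowercase hex digits, type, name) like the
one quoted above. The function name nearest_sym is made up for this
illustration and is not part of NetBSD. It also assumes the PC and the
symbol share their upper 32 bits, which holds for the kernel text
addresses in this report.

```sh
#!/bin/sh
# nearest_sym ADDR [FILE...]   (hypothetical helper, for illustration only)
# Reads "address type name" lines from FILE or stdin and prints
# "name+0xoffset" for the highest symbol at or below ADDR.
nearest_sym() {
    addr=$1
    shift
    awk -v addr="$addr" '
        # Convert a lowercase hex string of at most 8 digits to a number.
        function hex(s,    i, n) {
            n = 0
            for (i = 1; i <= length(s); i++)
                n = n * 16 + index("0123456789abcdef", substr(s, i, 1)) - 1
            return n
        }
        BEGIN { addr = addr "" }   # force string comparisons below
        # Equal-length lowercase hex strings sort like the numbers they encode.
        length($1) == 16 && ($1 "") <= addr && ($1 "") >= best {
            best = $1 ""
            name = $3
        }
        END {
            # Fail if nothing matched or the upper 32 bits differ.
            if (name == "" || substr(best, 1, 8) != substr(addr, 1, 8))
                exit 1
            printf "%s+0x%x\n", name, hex(substr(addr, 9)) - hex(substr(best, 9))
        }
    ' "$@"
}
```

Fed the output of "nm -n" on the running kernel image, with the PC from
the halt message as ADDR, this should print the enclosing symbol and the
offset into it.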


The config file (JULES) is actually just a copy of the ALPHA config file.

This particular machine has 2GB primary memory and ~2GB swap. I tried
to repeat the build with mfs on /tmp not mounted, but got the same
panic.


>How-To-Repeat:

Run a build on the NetBSD /usr/src with the following script:

#!/bin/sh
#
# A Script to do a nightly build of the NetBSD source tree (without
# trashing the running system)
#

set -a
# The location of your source tree.
BSDSRCDIR=${BSDSRCDIR-/usr/src}
# The location of the object files produced by the build.
BSDOBJDIR=${BSDOBJDIR-/usr/obj}
# For the initial build, which doesn't include those crypto
# files which may not be exported from the US and Canada.
#EXPORTABLE_SYSTEM=1
# These two aren't really necessary; they just make life
# easier if/when you rebuild later.
BUILD=1
UPDATE=1
# The following variables must be set in the environment;
# /etc/mk.conf will not do!
# Where the installed files go.
DESTDIR=${DESTDIR-/proto}
# Where the .tgz files built for the release go.
RELEASEDIR=${RELEASEDIR-/release}
#
# Set PATH and LD_LIBRARY_PATH to use the built items when possible.
# Strictly speaking this should all be done twice.
#
DD=${DESTDIR}
PATH=${DD}/sbin:${DD}/usr/sbin:${DD}/usr/local/bin:${DD}/bin
PATH=$PATH:${DD}/usr/bin:${DD}/usr/local/sbin:/sbin:/usr/sbin
PATH=$PATH:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin
LD_LIBRARY_PATH=${DD}/usr/lib:${LD_LIBRARY_PATH}
cd ${BSDSRCDIR} && make obj && make build
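
The script above takes each setting from the environment and falls back
to a default through the ${VAR-default} form. A small sketch of how that
expansion behaves, for anyone adapting the script; the path /work/src is
a made-up example and not taken from the report.

```sh
#!/bin/sh
# Unset variable: the default after "-" is used.
unset BSDSRCDIR
unset_result=${BSDSRCDIR-/usr/src}

# Set but empty: "-" keeps the empty value, while ":-" substitutes the default.
BSDSRCDIR=
empty_dash=${BSDSRCDIR-/usr/src}
empty_colon=${BSDSRCDIR:-/usr/src}

# Set to a real value: the caller's setting wins.
BSDSRCDIR=/work/src    # hypothetical path
set_result=${BSDSRCDIR-/usr/src}

printf 'unset=%s empty-dash=[%s] empty-colon=%s set=%s\n' \
    "$unset_result" "$empty_dash" "$empty_colon" "$set_result"
```

So exporting BSDSRCDIR, BSDOBJDIR, DESTDIR or RELEASEDIR before running
the script overrides the corresponding default, and an empty value is
passed through as-is rather than replaced.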

>Fix:

>Release-Note:
>Audit-Trail:

From: Matthew Jacob <mjacob@feral.com>
To: port-alpha@netbsd.org
Cc: gnats-bugs@netbsd.org
Subject: port-alpha/6799
Date: Wed, 3 Feb 1999 11:02:38 -0800 (PST)

 Possibly related to this, but possibly not, here's a double fault
 into the prom. The PC in question is prom_enter. It makes me wonder
 whether the switch to UVM interacts badly with re-entering the prom
 code, which we have to do because we haven't finished the console uart
 support (zsc at gbus...).

 CPU 8 halted
   halt code = 6
   double error halt
   PC = fffffc00005aa7f0				<<<<<	prom_enter
 Haltcode 6 Double Machine Check
   03022636  WATCH$: 02-03-99 02:38:54
   00006302
   00000200  Frame Size
   00000000  Flag bits
   00000118  CPU Area Offset
   000001a0  System Area Offset
   0000fff0  MCheck Reason Mask
   00000001  MCheck Frame Rev
 EV5 IPRs:
   exc_addr:  fffffc00 005aa7f0  exc_sum:     00000000 00000000
   exc_mask:  00000000 00000000  isr:         00000000 00100000
   icsr:      00000061 60000000  icpe_stat:   00000000 00002000
   dcpe_stat: 00000000 00000000  va:          fffffe00 07533fb8
   mm_stat:   00000000 00016e91  sc_addr:     ffffff00 0001d28f
   sc_stat:   00000000 00000000  bc_tag_addr: ffffff80 00cfcfff
   ei_addr:   ffffff00 0011575f  ei_stat:     fffffff0 04ffffff
   fill_syn:  00000000 00009000  ld_lock:     ffffff00 01783a7f
   pal_base:  00000000 00018000  sys_ipr1:    00000000 00510008
 TLEP CSRs:
         tldev: 51008011       tlber: 00800490
         tlcnr: 00000140       tlvid: 00000098
        tlesr0: 00400303      tlesr1: 00400c0c
        tlesr2: 00406060      tlesr3: 00409090
      tlepaerr: 00040200    tlepderr: 00000000
      tlepmerr: 00000000       tlvmg: 00000000
   tlintrmask0: 000001ff  tlintrsum0: 00000802
   tlintrmask1: 00000000  tlintrsum1: 00000000
 TLSB Node 4
   TLDEV     51008011  TLBER     00800490
   TLESR0    00400303  TLESR1    00400c0c
   TLESR2    00406060  TLESR3    00409090
   TLEPAERR  00040200  TLEPDERR  00000000
   TLEPMERR  00000000  TLEPWERR0 deadbeef
   TLEPWERR1 deadbeef  TLEPWERR2 deadbeef
   TLEPWERR3 deadbeef
 TLSB Node 5
   TLDEV     00005000  TLBER     00100000
   TLESR0    00000303  TLESR1    00000c0c
   TLESR2    00006060  TLESR3    00009090
   TLFADR0   00115740  TLFADR1   07850000
   TLVID     00000080  TLMIR     80000001
   MCR       00000234  MER       00000001
 TLSB Node 8
   TLDEV     00002020  TLBER     00000000
   TLESR0    00000000  TLESR1    00000000
   TLESR2    00000000  TLESR3    00000000
   ICCNSE    00000000  ICCWTR    00000000
   IDPNSE0   00000006  IDPNSE1   00000006
   IDPNSE2   00000000  IDPNSE3   00000000
 IOP Node 8 Hose 0
   PCIERR0 00000000 PCIERR1 00000000 
 IOP Node 8 Hose 1
   PCIERR0 00004001 PCIERR1 00020000 PCIERR2 00000000 
 CPU 8 has 2 Halt Data Log entries



From: Matthew Jacob <mjacob@nas.nasa.gov>
To: gnats-bugs@netbsd.org
Cc:
Subject: port-alpha/6799
Date: Wed, 25 Aug 1999 16:40:56 -0700 (PDT)

 Update on this: it is still happening quite regularly with Release 1.4,
 but I've gathered hwrpb info in case that helps. It's been trivial
 to reproduce for 9 months now, so the holdup is not a lack of information.

 Panic message was:

 CPU 8 halted
   halt code = 2
   kernel stack not valid halt
   PC = fffffc000039528c (lockmgr + 0x12)

 >>>show hwrpb
 HWRPB is at 2000
 00002000 hwrpb
 	  0  00000000 00002000 Physical address of base of HWRPB
 	  8  00000042 50525748 Identifying string 'HWRPB'
 	 16  00000000 00000009 HWRPB version number
 	 24  00000000 00002FA8 HWPRB size
 	 32  00000000 00000008 ID of primary processor
 	 40  00000000 00002000 System page size in bytes
 	 48  00000000 00000022 Physical address size in bits
 	 56  00000000 0000007F Maximum ASN value
 	 64  49343030 3237494E System serial number
 	 72  00000000 00004944
 	 80  00000000 0000000C Alpha system type
 	 88  00000000 00001065 system subtype
 	 96  00000000 00000000 System revision
 	104  00000000 00400000 Interval clock interrupt frequency
 	112  00000000 1A153F00 Cycle Counter frequency
 	120  FFFFFFFC 00000000 Virtual page table base
 	128  00000000 00000000 Reserved for architecture use, SBZ
 	136  00000000 00000140 Offset to Translation Buffer Hint Block
 	144  00000000 00000010 Number of processor supported
 	152  00000000 00000280 Size of Per-CPU Slots in bytes
 	160  00000000 00000180 Offset to Per-CPU Slots
 	168  00000000 00000002 Number of CTBs in CTB table
 	176  00000000 00000160 Size of largest CTB in CTB table
 	184  00000000 00002980 Offset to Console Terminal Block
 	192  00000000 00002C40 Offset to Console Routine Block
 	200  00000000 00002CA0 Offset to Memory Data Descriptors
 	208  00000000 00000000 Offset to Configuration Data Table
 	216  00000000 00139AC0 Offset to FRU Table
 	224  00000000 00000000 Starting VA of SAVE_TERM routine
 	232  00000000 00000000 Procedure Value of SAVE_TERM routine
 	240  FFFFFC00 00300FB4 Starting VA of RESTORE_TERM routine
 	248  00000000 00000001 Procedure Value of RESTORE_TERM routine
 	256  00000000 00000000 VA of restart routine
 	264  00000000 00000000 Restart procedure value
 	272  00000000 00000000 Reserved to System Software
 	280  00000000 00000000 Reserved to Hardware
 	288  49342C6E 9D23DD2C Checksum of HWRPB
 	296  00000000 00000000 RX Ready bitmask
 	304  00000000 00000000 TX Ready bitmask
 	312  00000000 00002EE8 Offset to DSRDB

 00003580 slot at index 8
 	FFFFFC00 005FFEF8 KSP
 	00000000 00000000 ESP
 	00000000 00000589 SSP
 	00000000 00000000 USP
 	00000000 00000000 PTBR
 	00000000 00000000 ASN
 	00000000 00000000 ASTEN_SR
 	00000000 00000000 FEN
 	00000000 00000000 CC
 	00000000 00000000 SCRATCH [0]
 	00000000 00000000 SCRATCH [1]
 	00000000 00000000 SCRATCH [2]
 	00000000 00000000 SCRATCH [3]
 	00000000 00000000 SCRATCH [4]
 	00000000 00000000 SCRATCH [5]
 	000001EE 00000000 SCRATCH [6]
 	0 Boot in progress
 	1 Restart capable
 	1 Processor available
 	1 Processor present
 	0 Operator halted
 	1 Context valid
 	1 Palcode valid
 	1 Palcode memory valid
 	1 Palcode loaded
 	0 Reserved MBZ
 	0 Halt requested	
 	0 Reserved MBZ
 	0 Reserved MBZ
 	00000000 00000000 PAL_MEM_LEN 		
 	00000000 00000000 PAL_SCR_LEN 		
 	00000000 00000000 PAL_MEM_ADR 		
 	00000000 00000000 PAL_SCR_ADR 		
 	00100005 00020116 PAL_REV 		
 	00000002 00000007 CPU_TYPE 		
 	00000000 00000007 CPU_VAR 		
 	00000000 00000000 CPU_REV 		
 	30353133 32375941 SERIAL_NUM		
 	00000000 00003432 SERIAL_NUM		
 	00000000 00008AB8 PAL_LOGOUT 		
 	00000000 00000690 PAL_LOGOUT_LEN 	
 	00000000 70474000 HALT_PCBB 		
 	FFFFFC00 0039528C HALT_PC 		
 	00000000 000004F0 HALT_PS 		
 	FFFFFC00 005CC0B0 HALT_ARGLIST 		
 	FFFFFC00 00493368 HALT_RETURN 		
 	FFFFFC00 00395280 HALT_VALUE 		
 	00000000 00000002 HALTCODE 		
 	00000000 00000000 RSVD_SW 		
 	00000000 RXLEN			
 	00000000 TXLEN			

 00003800 slot at index 9
 	00000000 00000000 KSP
 	00000000 00000000 ESP
 	00000000 00000000 SSP
 	00000000 00000000 USP
 	00000000 00000000 PTBR
 	00000000 00000000 ASN
 	00000000 00000000 ASTEN_SR
 	00000000 00000000 FEN
 	00000000 00000000 CC
 	00000000 00000000 SCRATCH [0]
 	00000000 00000000 SCRATCH [1]
 	00000000 00000000 SCRATCH [2]
 	00000000 00000000 SCRATCH [3]
 	00000000 00000000 SCRATCH [4]
 	00000000 00000000 SCRATCH [5]
 	000001CC 00000000 SCRATCH [6]
 	0 Boot in progress
 	0 Restart capable
 	1 Processor available
 	1 Processor present
 	0 Operator halted
 	0 Context valid
 	1 Palcode valid
 	1 Palcode memory valid
 	1 Palcode loaded
 	0 Reserved MBZ
 	0 Halt requested	
 	0 Reserved MBZ
 	0 Reserved MBZ
 	00000000 00000000 PAL_MEM_LEN 		
 	00000000 00000000 PAL_SCR_LEN 		
 	00000000 00000000 PAL_MEM_ADR 		
 	00000000 00000000 PAL_SCR_ADR 		
 	0010000A 00010114 PAL_REV 		
 	00000002 00000007 CPU_TYPE 		
 	00000000 00000003 CPU_VAR 		
 	00000000 00000000 CPU_REV 		
 	30353133 32375941 SERIAL_NUM		
 	00000000 00003432 SERIAL_NUM		
 	00000000 00009148 PAL_LOGOUT 		
 	00000000 00000690 PAL_LOGOUT_LEN 	
 	00000000 00003800 HALT_PCBB 		
 	00000000 00000000 HALT_PC 		
 	00000000 00001F00 HALT_PS 		
 	00000000 00000000 HALT_ARGLIST 		
 	00000000 00000000 HALT_RETURN 		
 	00000000 00000000 HALT_VALUE 		
 	00000000 00000000 HALTCODE 		
 	00000000 00000000 RSVD_SW 		
 	00000007 RXLEN			
 	00000013 TXLEN			

 00004980	console terminal block
 	00000000 00000002 TYPE
 	00000000 00000000 ID
 	00000000 00000000 RSVD
 	00000000 00000060 DEV_DEP_LEN
 	00000000 F4000000 CSR
 	00000000 000006C0 TX_SCB_OFFSET
 	00000000 00000680 RX_SCB_OFFSET
 	00000000 00002580 BAUD
 	00000000 00000000 PUTS_STATUS
 	00000000 00000000 GETC_STATUS

 00004C40	console routine block
 	00000000 10064210 VDISPATCH
 	00000000 00066210 PDISPATCH
 	00000000 10064220 VFIXUP
 	00000000 00066220 PFIXUP
 	00000000 00000002 ENTRIES
 	00000000 00000153 PAGES

 	00000000 10000000 V_ADDRESS
 	00000000 00002000 P_ADDRESS
 	00000000 000000FF PAGE_COUNT

 	00000000 101FE000 V_ADDRESS
 	00000000 7FF58000 P_ADDRESS
 	00000000 00000054 PAGE_COUNT

 00004CA0	memory descriptor
 	00000000 90214F62 CHECKSUM
 	00000000 00000000 IMP_DATA_PA
 	00000000 00000003 CLUSTER_COUNT

 	00000000 00000000 START_PFN
 	00000000 00000100 PFN_COUNT
 	00000000 00000000 TEST_COUNT
 	00000000 00000000 BITMAP_VA
 	00000000 00000000 BITMAP_PA
 	00000000 00000000 BITMAP_CHKSUM
 	00000000 00000001 USAGE
 	                0 bad page(s)

 	00000000 00000100 START_PFN
 	00000000 0003FEAC PFN_COUNT
 	00000000 0003FEAC TEST_COUNT
 	00000000 101FE000 BITMAP_VA
 	00000000 7FF58000 BITMAP_PA
 	FFFFFFFF FFFFF005 BITMAP_CHKSUM
 	00000000 00000000 USAGE
 	                0 bad page(s)

 	00000000 0003FFAC START_PFN
 	00000000 00000054 PFN_COUNT
 	00000000 00000000 TEST_COUNT
 	00000000 00000000 BITMAP_VA
 	00000000 00000000 BITMAP_PA
 	00000000 00000000 BITMAP_CHKSUM
 	00000000 00000001 USAGE
 	                0 bad page(s)

 00004EE8	Dynamic System Recognition Data block
 	00000000 00000619 SMM
 	00000000 00000018 Offset to LURT
 	00000000 00000068 Offset to Name Count
 	00000000 00000009 LURT Count
 	00000000 00000834 LURT Column 1
 	FFFFFFFF FFFFFFFF LURT Column 2
 	FFFFFFFF FFFFFFFF LURT Column 3
 	FFFFFFFF FFFFFFFF LURT Column 4
 	FFFFFFFF FFFFFFFF LURT Column 5
 	FFFFFFFF FFFFFFFF LURT Column 6
 	FFFFFFFF FFFFFFFF LURT Column 7
 	00000000 0000047E LURT Column 8
 	00000000 0000047E LURT Column 9
 	00000000 00000016 Name Count
 	Platform Name = AlphaServer 8200 5/440
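
As a cross-check, the three memory descriptor clusters in the dump above
can be added up and compared with the 2GB of primary memory mentioned in
the description. This is only a sketch of the arithmetic; the variable
names are made up, and the values are copied from the "System page size
in bytes" and PFN_COUNT lines.

```sh
#!/bin/sh
page_size=$((0x2000))                   # System page size in bytes
pfn_total=$((0x100 + 0x3FEAC + 0x54))   # PFN_COUNT of the three clusters
mem_bytes=$((pfn_total * page_size))
mem_mib=$((mem_bytes / 1048576))
echo "$pfn_total pages, $mem_mib MiB"
```

The clusters sum to 0x40000 pages of 8KB each, which is exactly 2GB, so
the console's view of memory agrees with the configuration stated in the
report.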

State-Changed-From-To: open->feedback 
State-Changed-By: fair 
State-Changed-When: Thu Jan 17 23:44:31 PST 2002 
State-Changed-Why:  
Here we are, a tad over two years later - does this problem still occur
in NetBSD 1.5.2 or -current? 
State-Changed-From-To: feedback->open 
State-Changed-By: mjacob 
State-Changed-When: Fri Jan 18 09:41:07 PST 2002 
State-Changed-Why:  
The problem will likely continue to exist until we get a working
zs driver for the console. The presumption as to why this occurs
is that the constant callbacks into the PROM for the serial console
hit some problem because we've done a lot to change mappings, etc.
>Unformatted:
