NetBSD Problem Report #10016
Received: (qmail 8405 invoked from network); 30 Apr 2000 03:05:07 -0000
Message-Id: <200004300302.XAA00525@zorkmid.mit.edu>
Date: Sat, 29 Apr 2000 23:02:06 -0400 (EDT)
From: John Hawkinson <jhawk@mit.edu>
Reply-To: jhawk@mit.edu
To: gnats-bugs@gnats.netbsd.org
Subject: ddb can get stuck in infinite page-faults
X-Send-Pr-Version: 3.95
>Number: 10016
>Category: kern
>Synopsis: ddb can get stuck in infinite page-faults
>Confidential: no
>Severity: serious
>Priority: low
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sun Apr 30 03:06:00 +0000 2000
>Closed-Date:
>Last-Modified: Thu Oct 20 11:02:20 +0000 2005
>Originator: John Hawkinson
>Release: 29 April 2000
>Organization:
MIT
>Environment:
System:
NetBSD zorkmid.mit.edu -current
NetBSD 1.4X (ZORKMID) #8: Sat Apr 29 22:31:17 EDT 2000
M jhawk@zorkmid.mit.edu:/usr/local/current-src/sys/arch/i386/compile/ZORKMID
>Description:
ddb can get stuck spewing seemingly infinite uvm_fault()s at
the user. A BREAK doesn't break out of this. Power cycling seems to
be the only way out.
This is poor. It should be possible to debug kernels across a
serial link with ddb without having remote power cycling gear. I'd
hate to try to debug some problem that was hitting a remote production
server and find that I had hung that machine and it required someone
to go visit it in person.
>How-To-Repeat:
In this case, attempt to debug a problem with pnpbios code
rebooting on a Sony VAIO PCG-Z505HE after the lpt probe.
>> NetBSD/i386 BIOS Boot, Revision 2.6
>> (jhawk@zorkmid.mit.edu, Tue Apr 25 14:38:24 EDT 2000)
>> Memory: 638/64448 k
Use hd1a:netbsd to boot sd0 when wd0 is also installed
Press return to boot now, any other key for boot menu
booting wd0a:netbsd - starting in 0
type "?" or "help" for help.
> boot /cnetbsd -ds
booting wd0a:/cnetbsd (howto 0x42)
3428352+249856+294132+[184836+229274]=0x42ee96
[ netbsd ELF symbol table not valid ]
[ preserving 414112 bytes of netbsd a.out symbol table ]
Stopped in at _cpu_Debugger+0x4: leave
db> b lpt_pnpbios_attach
db> c
Copyright (c) 1996, 1997, 1998, 1999, 2000
    The NetBSD Foundation, Inc. All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
    The Regents of the University of California. All rights reserved.
NetBSD 1.4X (ZORKMID) #8: Sat Apr 29 22:31:17 EDT 2000
    jhawk@zorkmid.mit.edu:/usr/local/current-src/sys/arch/i386/compile/ZORKMID
cpu0: family 6 model 8 step 1
cpu0: Intel Pentium III (E) (686-class)
total memory = 65088 KB
avail memory = 55840 KB
using 839 buffers containing 3356 KB of memory
BIOS32 rev. 0 found at 0xfd880
mainbus0 (root)
pnpbios0 at mainbus0: nodes 17, max len 210
com3 at pnpbios0 index 14 (PNP0501)
com3: io 3f8-3ff, irq 4
com3: ns16550a, working fifo
com3: console
lpt3 at pnpbios0 index 18 (PNP0401)Breakpoint in swapper at _lpt_pnpbios_attach: pushl %ebp
db> until
After 13 instructions (0 loads, 0 stores),
Stopped in swapper at _lpt_pnpbios_attach+0x1a: call _pnpbios_io_map
db> next
After 210 instructions (0 loads, 0 stores),
Stopped in swapper at _pnpbios_io_map+0x47: ret
db> n
lpt3: io 378-37f 778-77f, irq 7, dma 3
After 26223 instructions (0 loads, 0 stores),
Stopped in swapper at _lpt_pnpbios_attach+0x68: ret
db> n
After 26 instructions (0 loads, 0 stores),
Stopped in swapper at _config_attach+0x31e: ret
db> n
After 7 instructions (0 loads, 0 stores),
Stopped in swapper at _config_found_sm+0x4c: ret
db> u
After 9 instructions (0 loads, 0 stores),
Stopped in swapper at _pnpbios_attachnode+0x2ac: ret
db> u
After 14 instructions (0 loads, 0 stores),
Stopped in swapper at _pnpbios_attach+0x1e5: call _pnpbios_getnode
db> u
After 20 instructions (0 loads, 0 stores),
Stopped in swapper at _pnpbios_getnode+0x65: call _pnpbioscall
db> u
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
kernel: page fault trap, code=0
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
kernel: page fault trap, code=0
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
kernel: page fault trap, code=0
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
kernel: page fault trap, code=0
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
kernel: page fault trap, code=0
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
kernel: page fault trap, code=0
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
kernel: page fault trap, code=0
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
kernel: page fault trap, code=0
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
kernel: page fault trap, code=0
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
kernel: page fault trap, code=0
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
kernel: page fault trap, code=0
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
kernel: page fault trap, code=0
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
kernel: page fault trap, code=0
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
kernel: page fault trap, code=0
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
kernel: page fault trap, code=0
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
kernel: page fault trap, code=0
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
kernel: page fault trap, code=0
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
kernel: page fault trap, code=0
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
kernel: page fault trap, code=0
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
kernel: page fault trap, code=0
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
kernel: page fault trap, code=0
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
kernel: page fault trap, code=0
uvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
--db_more-
Note the debugger --db_more- prompt.
Kind of ironic, I think.
>Fix:
Fix uvm_fault()s not to loop like that?
Perhaps have db_more allow breaking out with
a 'q' keypress?
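The second suggestion can be sketched in user space. This is a toy model only: pager_emit() and its parameters are hypothetical names, it counts lines rather than printing them, and it is not taken from the ddb sources. It assumes the pager pauses after each full page and reads one key.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of a "--db_more--" style pager; hypothetical, not the ddb code.
 * Counts lines instead of printing them. After every page_len lines it
 * consumes one key from keys[]: 'q' (or running out of keys) abandons the
 * remaining output, any other key shows the next page.
 * Returns the number of lines emitted.
 */
size_t
pager_emit(size_t nlines, size_t page_len, const char *keys)
{
	size_t emitted = 0;

	assert(keys != NULL);
	if (page_len == 0)
		return 0;

	while (emitted < nlines) {
		emitted++;
		/* Pause only at a page boundary with output still pending. */
		if (emitted % page_len == 0 && emitted < nlines) {
			if (*keys == '\0' || *keys == 'q')
				break;
			keys++;
		}
	}
	return emitted;
}
```

With a 24-line page, pager_emit(100, 24, "q") stops after the first page instead of emitting all 100 lines, which is the escape hatch the suggestion asks for.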
>Release-Note:
>Audit-Trail:
From: John Hawkinson <jhawk@mit.edu>
To: gnats-bugs@gnats.netbsd.org
Cc: netbsd-bugs@netbsd.org, Charles Hannum <root@ihack.net>
Subject: Re: kern/10016: ddb can get stuck in infinite uvm_fault()ing
Date: Thu, 4 May 2000 15:06:59 -0400
OK, I feel like I'm missing something with respect to how this is supposed
to work.
uvm_fault() is faulting inside db_read_bytes(), because the
code segment is invalid:
_db_read_bytes+0x10: movb 0(%ecx),%al
db> t
_db_read_bytes(0,4,c0548cf8,c04dab64,c0548d38) at _db_read_bytes+0x10
_db_get_value(0,4,0,5,0) at _db_get_value+0x18
_db_stop_at_pc(c04dab64,c0548d38) at _db_stop_at_pc+0x1c2
_db_trap(5,0,1,ffffffff,c0548e30) at _db_trap+0x39
_kdb_trap(5,0,c0548dac) at _kdb_trap+0xb4
_trap() at _trap+0x168
--- trap (number 5) ---
(null)(ff2) at 0
_pnpbios_getnode(1,c0548e30,c06b9600,d2) at _pnpbios_getnode+0x6a
_pnpbios_attach(c06bdfc0,c06afe80,c0548ed4,c06afe80,c0548ed4) at _pnpbios_attach+0x223
_config_attach(c06bdfc0,c0458fb8,c0548ed4,c031d7b8,c06bdfc0) at _config_attach+0x30a
_config_found_sm(c06bdfc0,c0548ed4,c031d7b8,0) at _config_found_sm+0x29
_mainbus_attach(0,c06bdfc0,0,c06bdfc0,0) at _mainbus_attach+0x41
_config_attach(0,c0458164,0,0,c04b9694) at _config_attach+0x30a
_config_rootfound(c02f86c9,0) at _config_rootfound+0x37
_cpu_configure(bfeff000,c0548fa8,c01a202b,c0546010,546000) at _cpu_configure+0x1e
_configure(c0546010,546000,54d000,90,500007ff) at _configure+0x5a
_main(0,0,0,0,0) at _main+0x317
db> show reg
es 0x160010
ds 0x10
edi 0x5
esi 0x4
ebp 0xc0548cdc _end+0x6bdbc
ebx 0xc0548cf9 _end+0x6bdd9
edx 0x2
ecx 0x1
eax 0x3
eip 0xc02fbb14 _db_read_bytes+0x10
cs 0x8
eflags 0x10006
esp 0xc0548cd8 _end+0x6bdb8
ss 0xc04d0010 _playstats+0x6270
So one bug appears to be that ddb needs to do some checking before
blindly assuming it can deal here. 32/16 etc./ugh. Ignoring that
for the moment...
A page fault is incurred, handled by the i386 trap(), which calls
uvm_fault() which fails, and so we print the message, and jump to the
"we_re_toast:" label. Said label special-cases DDB and KGDB (for
breakpoint handling?) and in the ddb case, invokes kdb_trap() and then
returns, rather than panic()ing.
[ aside: I find it ironic that if you've built with DDB enabled
you don't see the trap data that you'd otherwise get printf()d out
before the panic("trap") [i.e. arch/i386/i386/trap.c ll 301-307],
but if you haven't built with DEBUG you cannot enable the display
of this upon entry to trap() via setting "trapdebug". So if you
build without DDB and without DEBUG, you get it, but if you build
with DDB and without DEBUG, you don't, but if you build with DEBUG
and set trapdebug, you do get it. It seems odd, since one presumes
the normal case for people trying to debug things is they have
kernels built with DDB (for occasional debugging) but not with
DEBUG set.
I don't really understand why the "if (trapdebug)" is
conditionalized on DEBUG -- it does not seem like one cmpl upon
every trap() invocation is particularly expensive, so I'd be
curious if there's objection to removing the #ifdef DEBUG (ll
262).
]
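The option matrix in the aside can be written down as a small truth function. This is only a restatement of the behaviour described above, not code from trap.c; struct kernel_build and trap_data_shown() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the relevant build and runtime settings. */
struct kernel_build {
	bool ddb;        /* built with options DDB */
	bool debug;      /* built with options DEBUG */
	bool trapdebug;  /* runtime flag; only honoured when debug is set */
};

/* Does the operator get to see the trap data for a fatal kernel trap? */
bool
trap_data_shown(const struct kernel_build *kb)
{
	assert(kb != NULL);

	/* Printed on entry to trap(), but only inside #ifdef DEBUG. */
	bool on_entry = kb->debug && kb->trapdebug;

	/* Printed just before panic("trap"), a path DDB kernels skip. */
	bool before_panic = !kb->ddb;

	return on_entry || before_panic;
}
```

The odd case is a DDB kernel without DEBUG: setting trapdebug has no effect there, so the function returns false no matter what the flag says.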
In any event, kdb_trap() (in db_interface.c) prints the trap type
information and then calls the MI db_trap(). db_trap() calls
db_stop_at_pc() to see if there's a ddb breakpoint in effect (or an
"until", etc.), and if not, calls db_restart_at_pc().
For the case of a pagefault, this causes an infinite loop since
the pagefault didn't happen on a breakpoint, and db_restart_at_pc() just
results in trying to re-execute the instruction which triggers the
pagefault. There's no opportunity for user interaction.
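The sequence described above can be modelled in a few lines of user-space C. This is a deliberately simplified sketch, not kernel code: simulate_fault_loop(), struct sim_result and the stop_on_pagefault switch are hypothetical, and max_faults exists only so the model terminates.

```c
#include <assert.h>
#include <stdbool.h>

struct sim_result {
	int  faults;        /* times the faulting instruction ran */
	bool reached_user;  /* whether a command prompt was ever reached */
};

/*
 * Model of the flow: the instruction faults, the debugger is entered,
 * and unless something decides to stop, the instruction is restarted.
 */
struct sim_result
simulate_fault_loop(bool stop_on_pagefault, int max_faults)
{
	struct sim_result r = { 0, false };

	assert(max_faults >= 0);
	while (r.faults < max_faults) {
		r.faults++;	/* the instruction page-faults */

		/* db_stop_at_pc() analogue: no breakpoint at this pc. */
		bool stop = false;

		/* Hypothetical extra check: treat a page fault as a stop. */
		if (stop_on_pagefault)
			stop = true;

		if (stop) {
			r.reached_user = true;	/* command loop would run */
			break;
		}
		/* db_restart_at_pc() analogue: retry the same instruction. */
	}
	return r;
}
```

With stop_on_pagefault false the model burns through every allowed iteration without ever reaching a prompt; with it true, the first fault hands control to the user.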
I'm unclear on whether I'm misunderstanding the sequence of events
or if they're just that broken :-(.
The following workaround seems to be helpful:
Index: db_trap.c
===================================================================
RCS file: /cvsroot/syssrc/sys/ddb/db_trap.c,v
retrieving revision 1.14
diff -c -3 -2 -p -r1.14 db_trap.c
*** db_trap.c 1999/04/12 20:38:21 1.14
--- db_trap.c 2000/05/04 18:54:23
***************
*** 47,82 ****
--- 47,85 ----
  void
  db_trap(type, code)
      int type, code;
  {
      boolean_t bkpt;
      boolean_t watchpt;
      bkpt = IS_BREAKPOINT_TRAP(type, code);
      watchpt = IS_WATCHPOINT_TRAP(type, code);
      if (db_stop_at_pc(DDB_REGS, &bkpt)) {
          if (db_inst_count) {
              db_printf("After %d instructions (%d loads, %d stores),\n",
                  db_inst_count, db_load_count, db_store_count);
          }
          if (curproc != NULL) {
              if (bkpt)
                  db_printf("Breakpoint in %s at\t", curproc->p_comm);
              else if (watchpt)
                  db_printf("Watchpoint in %s at\t", curproc->p_comm);
              else
                  db_printf("Stopped in %s at\t", curproc->p_comm);
          } else if (bkpt)
              db_printf("Breakpoint at\t");
          else if (watchpt)
              db_printf("Watchpoint at\t");
          else
              db_printf("Stopped at\t");
          db_dot = PC_REGS(DDB_REGS);
          db_print_loc_and_inst(db_dot);
          db_command_loop();
+     } else if (type==T_PAGEFLT) {
+         db_printf("pagefault stop\t");
+         db_command_loop();
      }
      db_restart_at_pc(DDB_REGS, watchpt);
  }
It certainly prevents the endless looping of uvm_fault()s
and allowed me to get the above stack trace.
I guess it's probably not legit to use "T_PAGEFLT" in MI code.
I suppose I could define IS_PAGEFAULT_TRAP() in the MD db_machdep.h
and then #ifdef the db_trap() code on IS_PAGEFAULT_TRAP().
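A minimal sketch of that MD/MI split, compiled as ordinary user-space C so it can be tried in isolation. Everything here is an assumption for illustration: SKETCH_T_PAGEFLT is a stand-in value rather than a real trap constant, and struct trap_info and should_enter_command_loop() are hypothetical names, not part of ddb.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Machine-dependent part (would live in the port's db_machdep.h).
 * SKETCH_T_PAGEFLT is an illustrative stand-in; a real port would use
 * its own trap constant.
 */
#define SKETCH_T_PAGEFLT 6
#define IS_PAGEFAULT_TRAP(type, code) \
	((void)(code), (type) == SKETCH_T_PAGEFLT)

/* Hypothetical bundle of the two values db_trap() receives. */
struct trap_info {
	int type;
	int code;
};

/*
 * Machine-independent part: decide whether to enter the command loop.
 * Ports that do not define IS_PAGEFAULT_TRAP keep the old behaviour.
 */
bool
should_enter_command_loop(bool stopped_at_pc, const struct trap_info *ti)
{
	assert(ti != NULL);
	if (stopped_at_pc)
		return true;
#ifdef IS_PAGEFAULT_TRAP
	if (IS_PAGEFAULT_TRAP(ti->type, ti->code))
		return true;
#endif
	return false;
}
```

Keeping the trap number behind a macro means the MI file never names T_PAGEFLT directly, and a port without the macro compiles unchanged.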
But I'm not convinced that the code is correct in principle (i.e.
ignoring the prev. paragraph's issue). I suppose it's certainly better
than the status quo.
Feedback would be appreciated.
--jhawk
From: John Hawkinson <jhawk@mit.edu>
To: gnats-bugs@gnats.netbsd.org
Cc: netbsd-bugs@netbsd.org, Charles Hannum <root@ihack.net>
Subject: Re: kern/10016: ddb can get stuck in infinite uvm_fault()ing
Date: Thu, 4 May 2000 18:05:37 -0400
| uvm_fault() is faulting inside db_read_bytes(), because the
| code segment is invalid:
|
| _db_read_bytes+0x10: movb 0(%ecx),%al
| db> t
| _db_read_bytes(0,4,c0548cf8,c04dab64,c0548d38) at _db_read_bytes+0x10
| _db_get_value(0,4,0,5,0) at _db_get_value+0x18
| _db_stop_at_pc(c04dab64,c0548d38) at _db_stop_at_pc+0x1c2
| _db_trap(5,0,1,ffffffff,c0548e30) at _db_trap+0x39
| _kdb_trap(5,0,c0548dac) at _kdb_trap+0xb4
| _trap() at _trap+0x168
| --- trap (number 5) ---
| (null)(ff2) at 0
| _pnpbios_getnode(1,c0548e30,c06b9600,d2) at _pnpbios_getnode+0x6a
This occurs on single-stepping (or until-ing) into pnpbioscall() and
off into biosland.
It seems that db_stop_at_pc() decides that the PC (eip) is zero, and
calls db_get_value(pc,...) [db_run.c, lines 173-183]:
    if (db_run_mode == STEP_CALLT) {
        db_expr_t ins = db_get_value(pc, sizeof(int), FALSE);

        /* continue until call or return */

        if (!inst_call(ins) &&
            !inst_return(ins) &&
            !inst_trap_return(ins)) {
            return (FALSE); /* continue */
        }
    }
pc is initialized from the passed-in registers (line 83):
    pc = PC_REGS(regs);
which is presumably written by the trap handler.
I don't know what to do. db_stop_at_pc() can be made to return in the pc==0
case and that's fine, but db_restart_at_pc() doesn't seem to be quite
so simple, so I'm not sure what to do with that case.
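One way to picture the "return early when the pc cannot be read" idea is a bounds-checked read that refuses addresses outside a known-good range. This is a user-space sketch under stated assumptions: struct readable_range and checked_read() are hypothetical, and a real debugger would have to consult the kernel's own mapping information rather than a single range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical description of one address range known to be readable. */
struct readable_range {
	uintptr_t start;
	size_t len;
	const unsigned char *backing;	/* stands in for the mapped memory */
};

/*
 * Copy size bytes at addr into out only if [addr, addr+size) lies inside
 * the range. Returns false instead of touching an unmapped address.
 */
bool
checked_read(const struct readable_range *r, uintptr_t addr, size_t size,
    void *out)
{
	assert(r != NULL && out != NULL);
	if (addr < r->start)
		return false;
	uintptr_t off = addr - r->start;
	if (off > r->len || size > r->len - off)
		return false;
	memcpy(out, r->backing + off, size);
	return true;
}
```

A caller that gets false back for pc == 0 can decline to decode the instruction, which covers the db_stop_at_pc() half of the problem; what db_restart_at_pc() should then do is left open, as in the text.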
On some level, one expects this stuff to break if you start single-stepping
into the bios, but it would be nice if it were a bit more robust.
I guess it's a bug in the trap handler that is writing the pc?
--jhawk
>Unformatted: