NetBSD Problem Report #10016

Received: (qmail 8405 invoked from network); 30 Apr 2000 03:05:07 -0000
Message-Id: <200004300302.XAA00525@zorkmid.mit.edu>
Date: Sat, 29 Apr 2000 23:02:06 -0400 (EDT)
From: John Hawkinson <jhawk@mit.edu>
Reply-To: jhawk@mit.edu
To: gnats-bugs@gnats.netbsd.org
Subject: ddb can get stuck in infinite page-faults
X-Send-Pr-Version: 3.95

>Number:         10016
>Category:       kern
>Synopsis:       ddb can get stuck in infinite page-faults
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Apr 30 03:06:00 +0000 2000
>Closed-Date:    
>Last-Modified:  Thu Oct 20 11:02:20 +0000 2005
>Originator:     John Hawkinson
>Release:        29 April 2000
>Organization:
MIT
>Environment:

System:
NetBSD zorkmid.mit.edu -current
NetBSD 1.4X (ZORKMID) #8: Sat Apr 29 22:31:17 EDT 2000
M    jhawk@zorkmid.mit.edu:/usr/local/current-src/sys/arch/i386/compile/ZORKMID

>Description:
	ddb can get stuck spewing seemingly infinite uvm_fault()s at
the user. A BREAK doesn't break out of this. Power cycling seems to
be the only way out.

	This is poor. It should be possible to debug kernels across a
serial link with ddb without having remote power cycling gear. I'd
hate to try to debug some problem that was hitting a remote production
server and find that I had hung that machine and it required someone
to go visit it in person.

>How-To-Repeat:
	In this case, attempt to debug a problem with pnpbios code
rebooting on a Sony VAIO PCG-Z505HE after the lpt probe.

 > NetBSD/i386 BIOS Boot, Revision 2.6
 >> (jhawk@zorkmid.mit.edu, Tue Apr 25 14:38:24 EDT 2000)
 >> Memory: 638/64448 k
 Use hd1a:netbsd to boot sd0 when wd0 is also installed
 Press return to boot now, any other key for boot menu
 booting wd0a:netbsd - starting in 0
 type "?" or "help" for help.
 > boot /cnetbsd -ds
 booting wd0a:/cnetbsd (howto 0x42)
 3428352+249856+294132+[184836+229274]=0x42ee96
 [ netbsd ELF symbol table not valid ]
 ^M[ preserving 414112 bytes of netbsd a.out symbol table ]
 ^MStopped in  at  _cpu_Debugger+0x4:      leave
 ^Mdb> b lpt_pnpbios_attach
 ^Mdb> c
 ^MCopyright (c) 1996, 1997, 1998, 1999, 2000
 ^M    The NetBSD Foundation, Inc.  All rights reserved.
 ^MCopyright (c) 1982, 1986, 1989, 1991, 1993
 ^M    The Regents of the University of California.  All rights reserved.

 ^MNetBSD 1.4X (ZORKMID) #8: Sat Apr 29 22:31:17 EDT 2000
 ^M    jhawk@zorkmid.mit.edu:/usr/local/current-src/sys/arch/i386/compile/ZORKMID
 ^Mcpu0: family 6 model 8 step 1
 ^Mcpu0: Intel Pentium III (E) (686-class)
 ^Mtotal memory = 65088 KB
 ^Mavail memory = 55840 KB
 ^Musing 839 buffers containing 3356 KB of memory
 ^MBIOS32 rev. 0 found at 0xfd880
 ^Mmainbus0 (root)
 ^Mpnpbios0 at mainbus0: nodes 17, max len 210
 ^Mcom3 at pnpbios0 index 14 (PNP0501)
 ^Mcom3: io 3f8-3ff, irq 4
 ^Mcom3: ns16550a, working fifo
 ^Mcom3: console
 ^Mlpt3 at pnpbios0 index 18 (PNP0401)Breakpoint in swapper at        _lpt_pnpbio
 s_attach:    pushl   %ebp
 ^Mdb> until
 ^MAfter 13 instructions (0 loads, 0 stores),
 ^MStopped in swapper at   _lpt_pnpbios_attach+0x1a:       call    _pnpbios_io_ma
 p
 ^Mdb> next
 ^MAfter 210 instructions (0 loads, 0 stores),
 ^MStopped in swapper at   _pnpbios_io_map+0x47:   ret
 ^Mdb> n

 ^Mlpt3: io 378-37f 778-77f, irq 7, dma 3
 ^MAfter 26223 instructions (0 loads, 0 stores),
 ^MStopped in swapper at   _lpt_pnpbios_attach+0x68:       ret
 ^Mdb> n
 ^MAfter 26 instructions (0 loads, 0 stores),
 ^MStopped in swapper at   _config_attach+0x31e:   ret
 ^Mdb> n
 ^MAfter 7 instructions (0 loads, 0 stores),
 ^MStopped in swapper at   _config_found_sm+0x4c:  ret
 ^Mdb> u
 ^MAfter 9 instructions (0 loads, 0 stores),
 ^MStopped in swapper at   _pnpbios_attachnode+0x2ac:      ret
 ^Mdb> u
 ^MAfter 14 instructions (0 loads, 0 stores),
 ^MStopped in swapper at   _pnpbios_attach+0x1e5:  call    _pnpbios_getnode
 ^Mdb> u
 ^MAfter 20 instructions (0 loads, 0 stores),
 ^MStopped in swapper at   _pnpbios_getnode+0x65:  call    _pnpbioscall
 ^Mdb> u
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^Mkernel: page fault trap, code=0
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^Mkernel: page fault trap, code=0
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^Mkernel: page fault trap, code=0
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^Mkernel: page fault trap, code=0
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^Mkernel: page fault trap, code=0
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^Mkernel: page fault trap, code=0
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^Mkernel: page fault trap, code=0
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^Mkernel: page fault trap, code=0
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^Mkernel: page fault trap, code=0
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^Mkernel: page fault trap, code=0
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^Mkernel: page fault trap, code=0
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^Mkernel: page fault trap, code=0
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^Mkernel: page fault trap, code=0
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^Mkernel: page fault trap, code=0
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^Mkernel: page fault trap, code=0
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^Mkernel: page fault trap, code=0
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^Mkernel: page fault trap, code=0
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^Mkernel: page fault trap, code=0
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^Mkernel: page fault trap, code=0
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^Mkernel: page fault trap, code=0
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^Mkernel: page fault trap, code=0
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^Mkernel: page fault trap, code=0
 ^Muvm_fault(0xc04a561c, 0x0, 0, 1) -> 1
 ^M--db_more-

 Note the debugger --db_more- prompt.
 Kind of ironic, I think.

>Fix:
	Fix uvm_fault()s not to loop like that?
	Perhaps have db_more allow breaking out with
	a 'q' keypress?
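
	As a rough illustration of the 'q' idea, here is a minimal sketch
	of a pager that stops emitting output once 'q' is pressed. All
	names (sketch_pager, page_len, keys) are hypothetical and none of
	this is taken from ddb's db_more code. Whether quitting the pager
	would also break the fault loop shown above is not established
	here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Emits up to nlines lines, pausing after every page_len lines to consume
 * one key from 'keys' (a stand-in for console input). A 'q' abandons the
 * remaining output; any other key shows the next page. Running out of keys
 * is treated the same as 'q'.
 * Returns the number of lines that were emitted.
 */
size_t sketch_pager(size_t nlines, size_t page_len, const char *keys)
{
    size_t emitted = 0;

    assert(page_len > 0);
    assert(keys != NULL);
    while (emitted < nlines) {
        emitted++;                          /* one line of output */
        if (emitted % page_len == 0 && emitted < nlines) {
            char key = *keys;               /* the "more" prompt goes here */
            if (key == '\0' || key == 'q')
                break;                      /* give up on the rest */
            keys++;
        }
    }
    return emitted;
}
```

	With a 24-line page, sketch_pager(100, 24, "q") stops after the
	first page instead of paging through all 100 lines.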
>Release-Note:
>Audit-Trail:

From: John Hawkinson <jhawk@mit.edu>
To: gnats-bugs@gnats.netbsd.org
Cc: netbsd-bugs@netbsd.org, Charles Hannum <root@ihack.net>
Subject: Re: kern/10016: ddb can get stuck in infinite uvm_fault()ing
Date: Thu, 4 May 2000 15:06:59 -0400

 OK, I feel like I'm missing something with respect to how this is supposed
 to work.

 uvm_fault() is faulting inside db_read_bytes(), because the
 code segment is invalid:

 _db_read_bytes+0x10:    movb    0(%ecx),%al
 db> t
 _db_read_bytes(0,4,c0548cf8,c04dab64,c0548d38) at _db_read_bytes+0x10
 _db_get_value(0,4,0,5,0) at _db_get_value+0x18
 _db_stop_at_pc(c04dab64,c0548d38) at _db_stop_at_pc+0x1c2
 _db_trap(5,0,1,ffffffff,c0548e30) at _db_trap+0x39
 _kdb_trap(5,0,c0548dac) at _kdb_trap+0xb4
 _trap() at _trap+0x168
 --- trap (number 5) ---
 (null)(ff2) at 0
 _pnpbios_getnode(1,c0548e30,c06b9600,d2) at _pnpbios_getnode+0x6a
 _pnpbios_attach(c06bdfc0,c06afe80,c0548ed4,c06afe80,c0548ed4) at _pnpbios_attach
 +0x223
 _config_attach(c06bdfc0,c0458fb8,c0548ed4,c031d7b8,c06bdfc0) at _config_attach+0
 x30a
 _config_found_sm(c06bdfc0,c0548ed4,c031d7b8,0) at _config_found_sm+0x29
 _mainbus_attach(0,c06bdfc0,0,c06bdfc0,0) at _mainbus_attach+0x41
 _config_attach(0,c0458164,0,0,c04b9694) at _config_attach+0x30a
 _config_rootfound(c02f86c9,0) at _config_rootfound+0x37
 _cpu_configure(bfeff000,c0548fa8,c01a202b,c0546010,546000) at _cpu_configure+0x1
 e
 _configure(c0546010,546000,54d000,90,500007ff) at _configure+0x5a
 _main(0,0,0,0,0) at _main+0x317
 db> show reg
 es            0x160010
 ds                0x10
 edi                0x5
 esi                0x4
 ebp         0xc0548cdc  _end+0x6bdbc
 ebx         0xc0548cf9  _end+0x6bdd9
 edx                0x2
 ecx                0x1
 eax                0x3
 eip         0xc02fbb14  _db_read_bytes+0x10
 cs                 0x8
 eflags         0x10006
 esp         0xc0548cd8  _end+0x6bdb8
 ss          0xc04d0010  _playstats+0x6270


 So one bug appears to be that ddb needs to do some checking before
 blindly assuming it can deal here. 32/16 etc./ugh. Ignoring that
 for the moment...

 A page fault is incurred, handled by the i386 trap(), which calls
 uvm_fault() which fails, and so we print the message, and loop to the
 "we_re_toast:" label. Said label special-cases DDB and KGDB (for
 breakpoint handling?) and in the ddb case, invokes kdb_trap() and then
 returns, rather than panic()ing.

 [      aside: I find  it ironic that if you've  built with DDB enabled
     you don't see the trap data that you'd otherwise get printf()d out
     before the panic("trap") [i.e.  arch/i386/i386/trap.c ll 301-307],
     but if you haven't built  with DEBUG you cannot enable the display
     of this  upon entry to trap()  via setting "trapdebug".  So if you
     build without DDB and without DEBUG,  you get it, but if you build
     with DDB and without debug, you don't, but if you build with DEBUG
     and set trapdebug, you do get it. It seems odd, since one presumes
     the normal  case for  people trying to  debug things is  they have
     kernels  built with DDB  (for occasional  debugging) but  not with
     DEBUG set.

        I don't   really  understand   why  the  "if   (trapdebug)"  is
     conditionalized on  DEBUG -- it does  not seem like  one cmpl upon
     every  trap()  invocation is  particularly  expensive,  so I'd  be
     curious  if there's  objection to  removing the  #ifdef  DEBUG (ll
     262).
 ]

 In any event, kdb_trap() (in db_interface.c) prints the trap type
 information and then calls the MI db_trap(). db_trap() calls
 db_stop_at_pc() to see if there's a ddb breakpoint in effect (or an
 "until", etc.), and if not, calls db_restart_at_pc().

 For the case of a pagefault, this causes an infinite loop since
 the pagefault didn't happen on a breakpoint, and db_restart_at_pc() just
 results in trying to re-execute the instruction which triggers the
 pagefault. There's no opportunity for user interaction.
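
 The cycle described above can be modelled in a few lines. This is a toy
 sketch, not kernel code: the enum, model_stop_at_pc() and
 model_trap_cycle() are hypothetical names, and the loop is bounded by
 max_traps only so that the model terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical trap kinds for the model; not the kernel's T_* constants. */
enum model_trap { MODEL_BREAKPOINT, MODEL_PAGEFAULT };

/* Stand-in for db_stop_at_pc(): true only when a breakpoint was hit. */
static bool model_stop_at_pc(enum model_trap t)
{
    return t == MODEL_BREAKPOINT;
}

/*
 * Replays the trap/restart cycle up to max_traps times and returns how many
 * traps were taken. *prompted is set to true if the debugger prompt was
 * reached, and stays false if every trap simply restarted the instruction.
 */
int model_trap_cycle(enum model_trap t, int max_traps, bool *prompted)
{
    int traps = 0;

    assert(prompted != NULL);
    *prompted = false;
    while (traps < max_traps) {
        traps++;                      /* instruction runs and traps */
        if (model_stop_at_pc(t)) {
            *prompted = true;         /* db_command_loop() would run here */
            break;
        }
        /* db_restart_at_pc(): resume at the same pc, so it traps again */
    }
    return traps;
}
```

 For MODEL_PAGEFAULT the model uses up its whole budget without ever
 reaching the prompt, which matches the endless uvm_fault() output.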

 I'm unclear on whether I'm misunderstanding the sequence of events
 or if they're just that broken :-(.

 The following workaround seems to be helpful:

 Index: db_trap.c
 ===================================================================
 RCS file: /cvsroot/syssrc/sys/ddb/db_trap.c,v
 retrieving revision 1.14
 diff -c -3 -2 -p -r1.14 db_trap.c
 *** db_trap.c	1999/04/12 20:38:21	1.14
 --- db_trap.c	2000/05/04 18:54:23
 ***************
 *** 47,82 ****
 --- 47,85 ----
   void
   db_trap(type, code)
   	int	type, code;
   {
   	boolean_t	bkpt;
   	boolean_t	watchpt;

   	bkpt = IS_BREAKPOINT_TRAP(type, code);
   	watchpt = IS_WATCHPOINT_TRAP(type, code);

   	if (db_stop_at_pc(DDB_REGS, &bkpt)) {
   	    if (db_inst_count) {
   		db_printf("After %d instructions (%d loads, %d stores),\n",
   			  db_inst_count, db_load_count, db_store_count);
   	    }
   	    if (curproc != NULL) {
   		if (bkpt)
   		    db_printf("Breakpoint in %s at\t", curproc->p_comm);
   		else if (watchpt)
   		    db_printf("Watchpoint in %s at\t", curproc->p_comm);
   		else
   		    db_printf("Stopped in %s at\t", curproc->p_comm);
   	    } else if (bkpt)
   		db_printf("Breakpoint at\t");
   	    else if (watchpt)
   		db_printf("Watchpoint at\t");
   	    else
   		db_printf("Stopped at\t");
   	    db_dot = PC_REGS(DDB_REGS);
   	    db_print_loc_and_inst(db_dot);

   	    db_command_loop();
 + 	} else if (type==T_PAGEFLT) {
 + 	  db_printf("pagefault stop\t");
 + 	  db_command_loop();
   	}

   	db_restart_at_pc(DDB_REGS, watchpt);
   }

 it certainly prevents the endless looping of uvm_fault()s
 and allowed me to get the above stack trace.

 I guess it's probably not legit to use "T_PAGEFLT" in MI code.
 I suppose I could define IS_PAGEFAULT_TRAP() in the MD db_machdep.h
 and then #ifdef the db_trap() code on IS_PAGEFAULT_TRAP().
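
 A minimal sketch of that arrangement, assuming the macro takes the same
 (type, code) arguments as IS_BREAKPOINT_TRAP(). SKETCH_T_PAGEFLT and
 sketch_should_prompt() are hypothetical names, and the numeric value is
 a placeholder rather than something copied from a port's trap.h.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * MD side: this part would live in a port's db_machdep.h. A port that
 * cannot tell page faults apart would simply not define the macro.
 */
#define SKETCH_T_PAGEFLT 6   /* placeholder value for this sketch */
#define IS_PAGEFAULT_TRAP(type, code) ((type) == SKETCH_T_PAGEFLT)

/*
 * MI side: decides whether to enter the command loop. 'stopped' stands in
 * for the result of db_stop_at_pc().
 */
bool sketch_should_prompt(bool stopped, int type, int code)
{
    (void)code;                /* unused by this sketch's macro */
    if (stopped)
        return true;
#ifdef IS_PAGEFAULT_TRAP
    if (IS_PAGEFAULT_TRAP(type, code))
        return true;           /* page fault outside a breakpoint: stop */
#else
    (void)type;
#endif
    return false;
}
```

 The MI code then never names T_PAGEFLT directly; ports without the
 macro keep the existing behaviour.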

 But I'm not convinced that the code is correct in principle (i.e.
 ignoring the prev. paragraph's issue). I suppose it's certainly better
 than the status quo.

 Feedback would be appreciated.

 --jhawk

From: John Hawkinson <jhawk@mit.edu>
To: gnats-bugs@gnats.netbsd.org
Cc: netbsd-bugs@netbsd.org, Charles Hannum <root@ihack.net>
Subject: Re: kern/10016: ddb can get stuck in infinite uvm_fault()ing
Date: Thu, 4 May 2000 18:05:37 -0400

 | uvm_fault() is faulting inside db_read_bytes(), because the
 | code segment is invalid:
 | 
 | _db_read_bytes+0x10:    movb    0(%ecx),%al
 | db> t
 | _db_read_bytes(0,4,c0548cf8,c04dab64,c0548d38) at _db_read_bytes+0x10
 | _db_get_value(0,4,0,5,0) at _db_get_value+0x18
 | _db_stop_at_pc(c04dab64,c0548d38) at _db_stop_at_pc+0x1c2
 | _db_trap(5,0,1,ffffffff,c0548e30) at _db_trap+0x39
 | _kdb_trap(5,0,c0548dac) at _kdb_trap+0xb4
 | _trap() at _trap+0x168
 | --- trap (number 5) ---
 | (null)(ff2) at 0
 | _pnpbios_getnode(1,c0548e30,c06b9600,d2) at _pnpbios_getnode+0x6a

 This occurs on single-stepping (or until-ing) into pnpbioscall() and
 off into biosland.

 It seems that db_stop_at_pc() decides that the PC (eip) is zero, and
 calls db_get_value(pc,...) [db_run.c]:

     if (db_run_mode == STEP_CALLT) {
         db_expr_t ins = db_get_value(pc, sizeof(int), FALSE);

         /* continue until call or return */

         if (!inst_call(ins) &&
             !inst_return(ins) &&
             !inst_trap_return(ins)) {
             return (FALSE); /* continue */
         }
     }

 pc is initialized from the passed in registers:

     pc = PC_REGS(regs);

 which is presumably written by the trap handler.

 I don't know what to do. db_stop_at_pc() can be made to return in the pc==0
 case and that's fine, but db_restart_at_pc() doesn't seem to be quite
 so simple, so I'm not sure what to do with that case.
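
 A sketch of the pc==0 guard for the "until call or return" check, using
 a toy byte array in place of kernel memory. sketch_stop_at_pc() and its
 parameters are hypothetical, and choosing to stop (rather than continue)
 when the pc is unusable is an assumption of this sketch. The two opcodes
 tested are the x86 near RET (0xc3) and CALL rel32 (0xe8); the real
 inst_call()/inst_return() tests are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns true if stepping should stop at pc. The instruction byte is only
 * fetched when pc is non-zero and inside the toy memory, so the read that
 * faulted in the trace above is never attempted.
 */
bool sketch_stop_at_pc(const uint8_t *mem, size_t len, size_t pc)
{
    assert(mem != NULL);
    if (pc == 0 || pc >= len)
        return true;           /* nothing sane to read: hand over to user */

    uint8_t op = mem[pc];      /* stand-in for db_get_value(pc, ...) */
    return op == 0xc3 || op == 0xe8;
}
```

 This covers only the db_stop_at_pc() half; it says nothing about what
 db_restart_at_pc() should do with such a pc.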

 On some level, one expects this stuff to break if you start single-stepping
 into the bios, but it would be nice if it were a bit more robust.

 I guess it's a bug in the trap handler that is writing the pc?

 --jhawk
>Unformatted:
