NetBSD Problem Report #52560

From gson@gson.org  Tue Sep 19 17:49:45 2017
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id BA3B87A21F
	for <gnats-bugs@gnats.NetBSD.org>; Tue, 19 Sep 2017 17:49:45 +0000 (UTC)
Message-Id: <20170919174939.A2678989281@guava.gson.org>
Date: Tue, 19 Sep 2017 20:49:39 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: gdb kernel backtrace fails to show function where trap occurred
X-Send-Pr-Version: 3.95

>Number:         52560
>Category:       port-i386
>Synopsis:       gdb kernel backtrace fails to show function where trap occurred
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    port-i386-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Sep 19 17:50:00 +0000 2017
>Closed-Date:    
>Last-Modified:  Sat Feb 24 13:15:00 +0000 2018
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current, source date 2017.09.19.02.44.14
>Organization:

>Environment:
System: NetBSD
Architecture: i386
Machine: i386
>Description:

When an i386 kernel has crashed due to an errant pointer and dumped
core, and you examine the core file using gdb, the backtrace correctly
displays every function in the call stack *except* the one where the
error actually occurred.

For example, if I modify sys_reboot() so that it deliberately
dereferences an invalid pointer, and then invoke the reboot syscall,
at the time of the crash the console will correctly display a
backtrace that includes sys_reboot():

  panic: trap
  cpu0: Begin traceback...
  vpanic(c109de38,c8f7cd78,c8f7cd78,c8f7ce04,c011f7d3,c109de38,c8f7ce10,c8f7ce10,2,e) at netbsd:vpanic+0x1bb
  vpanic(c109de38,c8f7ce10,c8f7ce10,2,e,c8f7ce10,c8f7cdb4,c1f5a1e4,c8f7a000,c0bc739b) at netbsd:vpanic
  trap() at netbsd:trap+0x27a
  --- trap (number 6) ---
  sys_reboot(c2086540,c8f7cf74,c8f7cf6c,ffff0ff0,c8f7cf3c,c01675de,c16b0e20,c2086540,c8f7cf74,c8f7cf6c) at netbsd:sys_reboot+0xa9
  sy_call(c16b0e20,c2086540,c8f7cf74,c8f7cf6c,c016773d,86540,c2086540,c8f7cf9c,c0167885,c16b0e20) at c016750e
  sy_invoke(c16b0e20,c2086540,c8f7cf74,c8f7cf6c,d0,0,c2086540,c1f5a1e4,d0,c16b0e20) at netbsd:sy_invoke+0xbb
  syscall() at netbsd:syscall+0xd7
  --- syscall (number 208) ---
  bab5a397:
  cpu0: End traceback...

but when later examining the crash dump with gdb, it displays an
incorrect address and a "??" in place of sys_reboot():

  #4  0xc011f7d3 in trap (frame=0xc8f7ce10)
      at /usr/src/sys/arch/i386/i386/trap.c:324
  #5  0xc011400f in alltraps ()
  #6  0xc8f7ce10 in ?? ()
  #7  0xc016750e in sy_call (sy=0xc16b0e20 <sysent+4160>, l=0xc2086540, 
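
Note that the address gdb prints for frame #6, 0xc8f7ce10, is the same
value as the frame argument of trap() in frame #4.  One possible reading
(an assumption, not verified against the gdb sources) is that gdb
unwinds through alltraps as if it were an ordinary function and never
consults the trap frame, where the faulting pc is saved.  The following
is a minimal sketch in C of that difference, using a simulated stack;
the layout, the addresses, and all names in it are made up for
illustration and are not NetBSD or gdb code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated stack, indexed by word; a higher index is an older frame. */
enum { NWORDS = 32 };
static uint32_t mem[NWORDS];

/* Made-up code addresses standing in for the kernel functions. */
enum {
	PC_SYSCALL = 0x100, PC_SY_CALL = 0x200, PC_SYS_REBOOT = 0x300,
	PC_ALLTRAPS = 0x400, PC_TRAP = 0x500
};

/* Hypothetical trap frame layout: saved frame pointer, then saved pc. */
enum { TF_EBP = 0, TF_EIP = 1 };

struct regs { uint32_t pc, fp; };

/* Lay out the frames as they might look while trap() is running. */
void
build_stack(void)
{
	mem[28] = 0;  mem[29] = 0;		/* syscall: end of chain */
	mem[24] = 28; mem[25] = PC_SYSCALL;	/* sy_call's frame */
	mem[20] = 24; mem[21] = PC_SY_CALL;	/* sys_reboot's frame */
	mem[12 + TF_EBP] = 20;			/* trap frame: fp at fault */
	mem[12 + TF_EIP] = PC_SYS_REBOOT;	/* trap frame: pc at fault */
	mem[8] = 20;				/* trap(): saved fp */
	mem[9] = PC_ALLTRAPS;			/* trap(): return address */
	mem[10] = 12;				/* trap(): trap frame argument */
}

/*
 * One unwind step.  An ordinary frame has mem[fp] = caller's fp and
 * mem[fp + 1] = return address.
 */
static struct regs
step(struct regs r, int trap_aware)
{
	struct regs out;

	assert(r.fp + 2 < (uint32_t)NWORDS);
	if (trap_aware && mem[r.fp + 1] == PC_ALLTRAPS) {
		/* Returning into alltraps: resume from the trap frame. */
		uint32_t tf = mem[r.fp + 2];

		assert(tf + TF_EIP < (uint32_t)NWORDS);
		out.pc = mem[tf + TF_EIP];
		out.fp = mem[tf + TF_EBP];
	} else {
		/* Plain frame-pointer chain. */
		out.pc = mem[r.fp + 1];
		out.fp = mem[r.fp];
	}
	return out;
}

/* Walk outward from trap(), storing each pc; returns the frame count. */
size_t
unwind(int trap_aware, uint32_t *pcs, size_t max)
{
	struct regs r = { PC_TRAP, 8 };
	size_t n = 0;

	while (r.pc != 0 && n < max) {
		pcs[n++] = r.pc;
		r = step(r, trap_aware);
	}
	return n;
}
```

After build_stack(), unwind() with trap_aware set to 0 yields trap,
alltraps, sy_call, syscall, so the faulting function is lost; with
trap_aware set to 1 it yields trap, sys_reboot, sy_call, syscall.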

I am marking this bug as critical because it is making a large class
of kernel bugs much harder to fix.

I first noticed this problem while filing PR 52553.  Since the
procedure for reproducing that issue requires specific hardware
(athn), below is an alternative procedure that does not.

>How-To-Repeat:

Apply the following patch:

Index: src/sys/kern/kern_xxx.c
===================================================================
RCS file: /bracket/repo/src/sys/kern/kern_xxx.c,v
retrieving revision 1.73
diff -u -r1.73 kern_xxx.c
--- src/sys/kern/kern_xxx.c	29 Oct 2015 00:27:08 -0000	1.73
+++ src/sys/kern/kern_xxx.c	19 Sep 2017 13:57:19 -0000
@@ -67,6 +67,10 @@
 	    0, NULL, NULL, NULL)) != 0)
 		return (error);

+	/* Abuse AB_DEBUG for testing trap handling */
+	if ((SCARG(uap, opt) & AB_DEBUG))
+		*((char *)1) = 0;
+
 	/*
 	 * Only use the boot string if RB_STRING is set.
 	 */

Build an i386 release with build.sh -V MKDEBUG=yes -V COPTS=-g.
(Or just a kernel; I built a full release because that's what
I have fully automated).

Install it, boot it, log in as root, and issue the command "reboot -x".
This will cause a trap in sys_reboot(), a core dump, and a reboot.

Log in as root again and issue the commands

  cd /var/crash
  gunzip *.gz
  gdb /netbsd
  (gdb) target kvm netbsd.0.core
  (gdb) bt

Notice how sys_reboot() does not appear in the backtrace.

>Fix:

>Release-Note:

>Audit-Trail:
From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-i386/52560: gdb kernel backtrace fails to show function where trap occurred
Date: Wed, 20 Sep 2017 16:24:34 +0300

 Here are a couple more pieces of information from running some
 additional tests:

 1. The problem is not new; a system built from sources dated
 2012.06.09.22.49.18 fails the same way.

 2. The amd64 port is not affected.
 -- 
 Andreas Gustafsson, gson@gson.org

From: "Maxime Villard" <maxv@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/52560 CVS commit: src/sys/arch
Date: Fri, 23 Feb 2018 09:00:56 +0000

 Module Name:	src
 Committed By:	maxv
 Date:		Fri Feb 23 09:00:56 UTC 2018

 Modified Files:
 	src/sys/arch/amd64/conf: Makefile.amd64
 	src/sys/arch/i386/conf: Makefile.i386

 Log Message:
 Add -fno-shrink-wrap, to force GCC to push the frames at the very beginning
 of the functions. Otherwise DDB is unable to display a correct stack trace
 if a fault occurred in a function before the frame was pushed.

 Discussed on tech-kern@, flag suggested by Krister Walfridsson. Should fix
 PR/52560.
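
For context, GCC's shrink-wrapping (-fshrink-wrap, which
-fno-shrink-wrap disables) lets the compiler emit a function's prologue
only on the paths that need it rather than at function entry.  Below is
a rough C illustration of a function shape where this can matter.  The
function is hypothetical and not from the NetBSD tree, and whether GCC
actually defers the prologue for it depends on the GCC version, target,
and options:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example; not taken from the NetBSD sources. */
int
sum_if_enabled(const int *enabled, const int *vals, size_t n)
{
	/*
	 * Early-return path: needs no locals and no calls.  With
	 * shrink-wrapping, the compiler may place the prologue after
	 * this test, so a fault on the load through "enabled" could
	 * occur before the frame has been pushed.
	 */
	if (*enabled == 0)
		return 0;

	/* Main path: the part that actually needs the frame. */
	assert(vals != NULL || n == 0);
	int total = 0;
	for (size_t i = 0; i < n; i++)
		total += vals[i];
	return total;
}
```

If the load through enabled faults before the prologue has run, the
frame pointer still belongs to the caller, and a frame-pointer walk
that starts there can skip the caller's entry in the trace.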


 To generate a diff of this commit:
 cvs rdiff -u -r1.64 -r1.65 src/sys/arch/amd64/conf/Makefile.amd64
 cvs rdiff -u -r1.187 -r1.188 src/sys/arch/i386/conf/Makefile.i386

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->feedback
State-Changed-By: maxv@NetBSD.org
State-Changed-When: Fri, 23 Feb 2018 09:16:51 +0000
State-Changed-Why:
Can you re-test with the fix I committed?


State-Changed-From-To: feedback->open
State-Changed-By: maxv@NetBSD.org
State-Changed-When: Fri, 23 Feb 2018 18:12:18 +0000
State-Changed-Why:
back to open


From: Andreas Gustafsson <gson@gson.org>
To: maxv@NetBSD.org
Cc: gnats-bugs@NetBSD.org
Subject: Re: port-i386/52560 (gdb kernel backtrace fails to show function where trap occurred)
Date: Sat, 24 Feb 2018 11:18:39 +0200

 maxv@NetBSD.org wrote:
 > Can you re-test with the fix I committed?

 I see that the change was reverted, but since I had already started a
 test run of source date 2018.02.23.09.00.56, I might as well report
 the results.  The change you committed did not fix the problem; the
 function where the trap occurred was still not identified:

 (gdb) bt
 #0  maybe_dump (howto=260) at /usr/src/sys/arch/i386/i386/machdep.c:746
 #1  0xc011c48d in cpu_reboot (howto=260, bootstr=0x0)
     at /usr/src/sys/arch/i386/i386/machdep.c:765
 #2  0xc0c02a5a in vpanic (fmt=0xc10b0678 "trap", 
     ap=0xceae8d78 "\020\216\256\316\020\216\256\316\002")
     at /usr/src/sys/kern/subr_prf.c:342
 #3  0xc0c0288c in panic (fmt=0xc10b0678 "trap")
     at /usr/src/sys/kern/subr_prf.c:258
 #4  0xc011fdd8 in trap (frame=0xceae8e10)
     at /usr/src/sys/arch/i386/i386/trap.c:323
 #5  0xc0114492 in alltraps ()
 #6  0xceae8e10 in ?? ()
 #7  0xc0168313 in sy_call (sy=0xc16cda40 <sysent+4160>, l=0xc22cd540, 
     uap=0xceae8f74, rval=0xceae8f6c) at /usr/src/sys/sys/syscallvar.h:65
 #8  0xc01683e3 in sy_invoke (sy=0xc16cda40 <sysent+4160>, l=0xc22cd540, 
     uap=0xceae8f74, rval=0xceae8f6c, code=208)
     at /usr/src/sys/sys/syscallvar.h:94
 #9  0xc016868a in syscall (frame=0xceae8fa8)
     at /usr/src/sys/arch/x86/x86/syscall.c:140
 #10 0xc01006a9 in Xsyscall ()
 (gdb)

 In the commit message, you wrote:
 > Otherwise DDB is unable to display a correct stack trace
 > if a fault occurred in a function before the frame was pushed.

 DDB _is_ able to display a correct stack trace, or at least it was at
 the time when the PR was filed, as shown at the beginning of the PR.
 The problem only affects gdb.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Maxime Villard <max@m00nbsd.net>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: port-i386/52560 (gdb kernel backtrace fails to show function
 where trap occurred)
Date: Sat, 24 Feb 2018 11:43:30 +0100

 On 24/02/2018 at 10:18, Andreas Gustafsson wrote:
 > maxv@NetBSD.org wrote:
 >> Can you re-test with the fix I committed?
 > 
 > I see that the change was reverted, but since I had already started a
 > test run of source date 2018.02.23.09.00.56, I might as well report
 > the results.  The change you committed did not fix the problem; the
 > function where the trap occurred was still not identified:
 > 
 > (gdb) bt
 > #0  maybe_dump (howto=260) at /usr/src/sys/arch/i386/i386/machdep.c:746
 > #1  0xc011c48d in cpu_reboot (howto=260, bootstr=0x0)
 >      at /usr/src/sys/arch/i386/i386/machdep.c:765
 > #2  0xc0c02a5a in vpanic (fmt=0xc10b0678 "trap",
 >      ap=0xceae8d78 "\020\216\256\316\020\216\256\316\002")
 >      at /usr/src/sys/kern/subr_prf.c:342
 > #3  0xc0c0288c in panic (fmt=0xc10b0678 "trap")
 >      at /usr/src/sys/kern/subr_prf.c:258
 > #4  0xc011fdd8 in trap (frame=0xceae8e10)
 >      at /usr/src/sys/arch/i386/i386/trap.c:323
 > #5  0xc0114492 in alltraps ()
 > #6  0xceae8e10 in ?? ()
 > #7  0xc0168313 in sy_call (sy=0xc16cda40 <sysent+4160>, l=0xc22cd540,
 >      uap=0xceae8f74, rval=0xceae8f6c) at /usr/src/sys/sys/syscallvar.h:65
 > #8  0xc01683e3 in sy_invoke (sy=0xc16cda40 <sysent+4160>, l=0xc22cd540,
 >      uap=0xceae8f74, rval=0xceae8f6c, code=208)
 >      at /usr/src/sys/sys/syscallvar.h:94
 > #9  0xc016868a in syscall (frame=0xceae8fa8)
 >      at /usr/src/sys/arch/x86/x86/syscall.c:140
 > #10 0xc01006a9 in Xsyscall ()
 > (gdb)
 > 
 > In the commit message, you wrote:
 >> Otherwise DDB is unable to display a correct stack trace
 >> if a fault occurred in a function before the frame was pushed.
 > 
 > DDB _is_ able to display a correct stack trace, or at least it was at
 > the time when the PR was filed, as shown at the beginning of the PR.
 > The problem only affects gdb.

 In fact your problem is that the _current_ function does not get displayed in
 GDB, and my patch fixed the fact that there were cases where you couldn't get
 the _previous_ function in DDB. I still put the PR in feedback, because I
 wanted to know whether somehow it would fix the issue you were getting.

 Maxime

>Unformatted:
