NetBSD Problem Report #40594

From www@NetBSD.org  Mon Feb  9 20:13:11 2009
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id D1E3463B99D
	for <gnats-bugs@gnats.netbsd.org>; Mon,  9 Feb 2009 20:13:11 +0000 (UTC)
Message-Id: <20090209201311.938D263B896@narn.NetBSD.org>
Date: Mon,  9 Feb 2009 20:13:11 +0000 (UTC)
From: pooka@iki.fi
Reply-To: pooka@iki.fi
To: gnats-bugs@NetBSD.org
Subject: gdb does not work on 5.0 RC2
X-Send-Pr-Version: www-1.0

>Number:         40594
>Category:       kern
>Synopsis:       gdb does not work on 5.0 RC2
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Feb 09 20:15:00 +0000 2009
>Closed-Date:    Sun Mar 18 21:36:21 +0000 2012
>Last-Modified:  Sun Mar 18 21:36:21 +0000 2012
>Originator:     Antti Kantee
>Release:        5.0_RC2
>Organization:
>Environment:
>Description:
Somewhere between late 5.0_BETA and 5.0_RC (1 and 2) gdb stopped working.
Notably, my gdb is from Nov 2007.
>How-To-Repeat:
pain-rustique:1:~> gdb /bin/ls
GNU gdb 6.5
[snip]

(gdb) run
Starting program: /bin/ls
*hang*



>Fix:
It seems that executing ls ends up in the "pause" wchan.  It is coming
from __sigsuspend14.  gdb, on the other hand, is doing wait4.
So I guess technically the program executed from gdb is hanging, not gdb.

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: kern-bug-people->ad
Responsible-Changed-By: ad@NetBSD.org
Responsible-Changed-When: Thu, 12 Feb 2009 14:35:35 +0000
Responsible-Changed-Why:
take


From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/40594: gdb does not work on 5.0 RC2
Date: Sun, 22 Feb 2009 21:41:22 +0000

 On Mon, Feb 09, 2009 at 08:15:01PM +0000, pooka@iki.fi wrote:
  > Somewhere between late 5.0_BETA and 5.0_RC (1 and 2) gdb stopped working.
  > Notably, my gdb is from Nov 2007.
  > >How-To-Repeat:
  > pain-rustique:1:~> gdb /bin/ls
  > GNU gdb 6.5
  > [snip]
  > 
  > (gdb) run
  > Starting program: /bin/ls
  > *hang*
  > 
  > >Fix:
  > It seems that executing ls ends up in the "pause" wchan.  It is coming
  > from __sigsuspend14.  gdb, on the other hand, is doing wait4.
  > So I guess technically the program executed from gdb is hanging, not gdb.

 The issue appears to be provoked by the shell spawned by gdb to start
 the inferior process; depending on what you have in your shell startup
 files the hang may or may not occur. In my case the problem seems to
 be tickled by

    setenv _UNAME `uname -s |& tr A-Z a-z`

 What seems to be happening is that the shell forks and the forked
 child shell runs a subprocess. When the child shell exits, wait
 notifies gdb instead of the parent shell. As a result the child shell
 hangs around as a zombie, the parent shell (if *csh) blocks in
 sigsuspend waiting for a SIGCHLD it's not going to get, and gdb blocks
 in wait assuming something else is going to happen.
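
 For readers unfamiliar with the handshake the parent shell depends on,
 the following is a minimal sketch of it, written in Python rather than
 taken from tcsh or the kernel. The function name run_child_and_reap is
 made up for this illustration, and signal.sigwait is used as a stand-in
 for the sigsuspend call seen in the trace; a no-op handler is installed
 on the assumption that some systems discard SIGCHLD otherwise.

```python
import os
import signal


def run_child_and_reap(exit_code=7):
    """Fork a child, sleep until SIGCHLD arrives, then reap the child.

    Hypothetical helper for illustration only; it mimics the pattern
    "block SIGCHLD, fork, wait for the signal, then call wait".
    """
    # Install a no-op handler so the signal is not discarded by default.
    old_handler = signal.signal(signal.SIGCHLD, lambda signum, frame: None)
    # Block SIGCHLD so it stays pending until we explicitly wait for it.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(exit_code)  # child: exit immediately
        # Parent: this is the step that never returns if the
        # exit notification goes to some other process.
        signum = signal.sigwait({signal.SIGCHLD})
        reaped_pid, status = os.waitpid(pid, 0)
        return (signum == signal.SIGCHLD,
                reaped_pid == pid,
                os.WEXITSTATUS(status))
    finally:
        # Restore the caller's signal mask and handler.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
        signal.signal(
            signal.SIGCHLD,
            old_handler if old_handler is not None else signal.SIG_DFL,
        )
```

 On a working system the parent wakes up as soon as the child exits. If
 the SIGCHLD never arrived, the sigwait call would block indefinitely,
 which corresponds to the state the parent shell ends up in here.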

 In this run, process 14062 is gdb, 24361 is the parent shell (spawned
 by gdb), and 26479 is the child shell.

  24361      1 tcsh     CALL  read(8,0xbfbfad50,0x1000)
  14164      1 tr       CALL  exit(0)
  26479      1 tcsh     RET   __sigsuspend14 -1 errno 4 Interrupted system call
  26479      1 tcsh     PSIG  SIGCHLD caught handler=0x808463c mask=(2,20): code=CLD_EXITED child pid=14164, uid=32170,  status=0, utime=0, stime=0)
  26479      1 tcsh     CALL  setcontext(0xbfbf65b4)
  26479      1 tcsh     RET   write JUSTRETURN
  26479      1 tcsh     CALL  __wait450(0xffffffff,0xbfbf68a4,1,0xbfbf6854)
  26479      1 tcsh     RET   __wait450 14164/0x3754
  26479      1 tcsh     CALL  __wait450(0xffffffff,0xbfbf68a4,1,0xbfbf6854)
  26479      1 tcsh     RET   __wait450 -1 errno 10 No child processes

 So far, so good. The parent shell is sitting in read to collect the
 results from the backquotes; the child picks up the exit status of tr.

  26479      1 tcsh     CALL  __sigprocmask14(3,0xbfbf6900,0)
  26479      1 tcsh     RET   __sigprocmask14 0
  26479      1 tcsh     CALL  __sigprocmask14(0,0,0x80a4738)
  26479      1 tcsh     RET   __sigprocmask14 0
  26479      1 tcsh     CALL  exit(0)

 Now the child shell exits.

  14062      1 gdb      RET   __wait450 24361/0x5f29
  14062      1 gdb      CALL  ptrace(PT_GETREGS,0x5f29,0xbfbfe2ec,0)
  14062      1 gdb      RET   ptrace 0
  14062      1 gdb      CALL  ptrace(PT_CONTINUE,0x5f29,1,0x14)
  14062      1 gdb      RET   ptrace 0

 Now gdb picks up a wait result for the *parent* shell, which has not
 exited or done anything else that should cause this. This is
 apparently the exit notification for the child shell, messed up
 somehow.

 gdb apparently shrugs and tells the parent shell to continue.

  24361      1 tcsh     RET   read -1 errno 4 Interrupted system call
  24361      1 tcsh     CALL  read(8,0xbfbfad50,0x1000)
  24361      1 tcsh     GIO   fd 8 read 0 bytes
        ""
  24361      1 tcsh     RET   read 0
  24361      1 tcsh     CALL  close(8)
  24361      1 tcsh     RET   close 0

 The parent shell now drops out of read and closes its pipe...

  24361      1 tcsh     CALL  __sigprocmask14(1,0xbfbf6ca0,0xbfbf6cb0)
  24361      1 tcsh     RET   __sigprocmask14 0
  24361      1 tcsh     CALL  __sigsuspend14(0xbfbf6c90)

 ...and waits for a SIGCHLD from the child shell that it is never going
 to receive, because that exit result was misdirected above, or
 something.

  14062      1 gdb      CALL  __wait450(0xffffffff,0xbfbfe558,0,0)
  14062      1 gdb      RET   __wait450 RESTART

 and now gdb goes to sleep waiting for something to happen, which of
 course nothing will. This is where it hangs; the next thing in the
 trace is manual intervention via SIGKILL.

 I'm not sure if the child shell is being traced or not (one would
 expect that it would be, though) so it's not clear if what's happening
 is that the wrong process is being awakened from wait, that wait is
 reporting on the wrong process, or even just that the wrong pid is
 being returned, but it's pretty clear that wait is stuffed somehow.

 Unfortunately, find_stopped_child() is a maze of special cases and
 it's not clear what's going on inside it.

 -- 
 David A. Holland
 dholland@netbsd.org

From: Antti Kantee <pooka@iki.fi>
To: gnats-bugs@NetBSD.org
Cc: ad@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/40594: gdb does not work on 5.0 RC2
Date: Mon, 23 Feb 2009 18:10:46 +0200

 On Sun Feb 22 2009 at 21:45:02 +0000, David Holland wrote:
 >  The issue appears to be provoked by the shell spawned by gdb to start
 >  the inferior process; depending on what you have in your shell startup
 >  files the hang may or may not occur. In my case the problem seems to
 >  be tickled by

 Oh man, that's evil!  I tracked it down to a bunch of stuff in .aliases:
 	if (`tty` == "/dev/console")

 Knowing this workaround, I can run programs in gdb again.  Thanks!!

State-Changed-From-To: open->feedback
State-Changed-By: riz@NetBSD.org
State-Changed-When: Wed, 16 Jun 2010 21:52:05 +0000
State-Changed-Why:
Given the workaround and recent GDB changes in -current, can
this be closed?


State-Changed-From-To: feedback->analyzed
State-Changed-By: pooka@NetBSD.org
State-Changed-When: Thu, 17 Jun 2010 11:42:10 +0300
State-Changed-Why:
Unfortunately the thread support fixes do not address this issue,
so let's keep the PR open.


From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: ad@NetBSD.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org,
	pooka@NetBSD.org, pooka@iki.fi
Subject: Re: kern/40594 (gdb does not work on 5.0 RC2)
Date: Sat, 19 Jun 2010 03:10:30 +0000

 On Thu, Jun 17, 2010 at 08:42:11AM +0000, pooka@NetBSD.org wrote:
  > State-Changed-From-To: feedback->analyzed
  > State-Changed-By: pooka@NetBSD.org
  > State-Changed-When: Thu, 17 Jun 2010 11:42:10 +0300
  > State-Changed-Why:
  > Unfortunately the thread support fixes do not address this issue,
  > so let's keep the PR open.

 It is a bug somewhere in the ptrace-related "logic" in wait, as best I
 can tell so far.

 -- 
 David A. Holland
 dholland@netbsd.org

From: Andrew Smallshaw <andrews@sdf.lonestar.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/40594 (gdb does not work on 5.0 RC2)
Date: Mon, 5 Jul 2010 22:04:58 +0100

 I'm posting this since I think I've just tripped on this problem
 myself on a fairly clean 5.0.2 system.  Interestingly that was
 using the stock ksh and not tcsh but again the problem can be pinned
 down to startup files, in this case my .kshrc and the lines:

 case `whoami` in 
     root)   PS1='# ' ;;
     *)      PS1='$ ' ;;
 esac

 That indicates it is a wider problem with shell/gdb interaction
 not just tcsh.  However, I tried setting $SHELL to /bin/sh and
 (after installing it) ksh93.  Both worked correctly.  Since I had
 been meaning to swap to ksh93 anyway (this is a fairly new install)
 that pretty much fixes it as far as I am concerned, but I write
 this in case it is any help as a workaround for anyone else having
 the problem.

 -- 
 Andrew Smallshaw
 andrews@sdf.lonestar.org

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/40594 (gdb does not work on 5.0 RC2)
Date: Mon, 5 Jul 2010 21:17:31 +0000

 On Mon, Jul 05, 2010 at 09:10:05PM +0000, Andrew Smallshaw wrote:
  >  That indicates it is a wider problem with shell/gdb interaction
  >  not just tcsh.

 Yes, it's a kernel issue that has something to do with forking
 subprocesses from the shell that gdb uses to start the target
 program.

 -- 
 David A. Holland
 dholland@netbsd.org

Responsible-Changed-From-To: ad->jmcneill
Responsible-Changed-By: jmcneill@NetBSD.org
Responsible-Changed-When: Mon, 29 Aug 2011 18:02:43 +0000
Responsible-Changed-Why:
take


State-Changed-From-To: analyzed->feedback
State-Changed-By: jmcneill@NetBSD.org
State-Changed-When: Mon, 29 Aug 2011 18:02:43 +0000
State-Changed-Why:
I can't trigger the issue but I may have fixed this with the following commit:
  http://mail-index.netbsd.org/source-changes/2011/08/29/msg026588.html
Can somebody see if the problem is still present in HEAD?


Responsible-Changed-From-To: jmcneill->kern-bug-people
Responsible-Changed-By: jmcneill@NetBSD.org
Responsible-Changed-When: Mon, 29 Aug 2011 21:04:57 +0000
Responsible-Changed-Why:
No luck, I tried.


State-Changed-From-To: feedback->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Wed, 31 Aug 2011 08:49:24 +0000
State-Changed-Why:
problem remains, we seem to be moving forward on it though


From: "Christos Zoulas" <christos@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/40594 CVS commit: src/sys/kern
Date: Wed, 31 Aug 2011 12:09:56 -0400

 Module Name:	src
 Committed By:	christos
 Date:		Wed Aug 31 16:09:56 UTC 2011

 Modified Files:
 	src/sys/kern: kern_sleepq.c

 Log Message:
 PR/40594: Antti Kantee: Don't call issignal() here to determine what errno
 to set for the interrupted syscall, because issignal() will consume the signal
 and it will not be delivered to the process afterwards. Instead call
 sigispending(), which now returns the first pending signal and does not
 consume it.
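
 The difference between a consuming and a non-consuming check can be
 sketched with a toy model. This is not the kernel code: the class
 PendingSignals and the functions take_first, peek_first and
 interrupted_sleep are hypothetical names invented for this example, and
 the real issignal() and sigispending() deal with masks, ptrace stops
 and per-LWP state that are ignored here.

```python
import signal


class PendingSignals:
    """Toy stand-in for a process's pending-signal set (not kernel code)."""

    def __init__(self):
        self._pending = []

    def post(self, signo):
        """Mark a signal as pending."""
        self._pending.append(signo)

    def take_first(self):
        """Consuming check: return the first pending signal and remove it."""
        return self._pending.pop(0) if self._pending else 0

    def peek_first(self):
        """Non-consuming check: return the first pending signal, keep it."""
        return self._pending[0] if self._pending else 0

    def deliverable(self):
        """Signals that are still queued for delivery."""
        return list(self._pending)


def interrupted_sleep(pending, consuming):
    """Decide whether a sleep was interrupted and report what is left.

    Returns (interrupted, signals_still_deliverable).
    """
    sig = pending.take_first() if consuming else pending.peek_first()
    return sig != 0, pending.deliverable()


# Consuming check: the sleep is interrupted, but the signal is gone.
old_style = PendingSignals()
old_style.post(signal.SIGCHLD)
old_result = interrupted_sleep(old_style, consuming=True)

# Non-consuming check: the sleep is interrupted and the signal remains.
new_style = PendingSignals()
new_style.post(signal.SIGCHLD)
new_result = interrupted_sleep(new_style, consuming=False)
```

 In both cases the sleep is reported as interrupted, but only the
 non-consuming variant leaves the SIGCHLD queued so that it can still be
 delivered to the process afterwards.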


 To generate a diff of this commit:
 cvs rdiff -u -r1.41 -r1.42 src/sys/kern/kern_sleepq.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->closed
State-Changed-By: christos@NetBSD.org
State-Changed-When: Sun, 04 Sep 2011 10:48:13 -0400
State-Changed-Why:
fixed, thanks; needs pullups


From: Matthew Mondor <mm_lists@pulsar-zone.net>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/40594 (gdb does not work on 5.0 RC2)
Date: Sun, 18 Sep 2011 20:22:56 -0400

 On Sun,  4 Sep 2011 14:48:14 +0000 (UTC)
 christos@NetBSD.org wrote:

 > Synopsis: gdb does not work on 5.0 RC2
 > 
 > State-Changed-From-To: open->closed
 > State-Changed-By: christos@NetBSD.org
 > State-Changed-When: Sun, 04 Sep 2011 10:48:13 -0400
 > State-Changed-Why:
 > fixed, thanks; needs pullups

 Is there a pullup # for this for my notes? (I've been silently tracking
 this PR as well)

 Thanks,
 -- 
 Matt

State-Changed-From-To: closed->pending-pullups
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sat, 05 Nov 2011 13:07:38 +0000
State-Changed-Why:
pullup-5 #1668


From: "Manuel Bouyer" <bouyer@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/40594 CVS commit: [netbsd-5] src/sys
Date: Sat, 4 Feb 2012 16:58:00 +0000

 Module Name:	src
 Committed By:	bouyer
 Date:		Sat Feb  4 16:58:00 UTC 2012

 Modified Files:
 	src/sys/arch/amd64/amd64 [netbsd-5]: syscall.c
 	src/sys/arch/i386/i386 [netbsd-5]: syscall.c trap.c
 	src/sys/kern [netbsd-5]: kern_sig.c kern_sleepq.c kern_subr.c
 	    sys_process.c
 	src/sys/secmodel/bsd44 [netbsd-5]: secmodel_bsd44_suser.c
 	src/sys/sys [netbsd-5]: proc.h ptrace.h

 Log Message:
 Apply patch, requested by jmcneill in ticket #1668:
 	sys/arch/amd64/amd64/syscall.c			patch
 	sys/arch/i386/i386/syscall.c			patch
 	sys/arch/i386/i386/trap.c			patch
 	sys/kern/kern_sig.c				patch
 	sys/kern/kern_sleepq.c				patch
 	sys/kern/kern_subr.c				patch
 	sys/kern/sys_process.c				patch
 	sys/secmodel/bsd44/secmodel_bsd44_suser.c	patch
 	sys/sys/proc.h					patch
 	sys/sys/ptrace.h				patch

 arch/i386/i386/machdep.c, arch/amd64/amd64/machdep.c (from
 arch/x86/x86/machdep.c) by christos:
 Remove code that was used to avoid register spills. setcontext(2) can change
 the registers, so re-fetching will produce the wrong result for trace_exit().
 arch/i386/i386/trap.c by reinoud:
 Fix the illegal instruction return address. It was using the value of the
 CPU's %cr2 register, but that's not valid:

 CR2 Contains a value called Page Fault Linear Address (PFLA). When a page
 fault occurs, the address the program attempted to access is stored in the CR2
 register.

 And this is thus NOT the illegal instruction address!
 kern/kern_sig.c by christos:
 PR kern/45327: Jared McNeill: ptrace: siginfo doesn't work with traced processes
 When saving the signal in p->p_xstat, clear it from the pending mask, but
 don't remove it from the siginfo queue, so that next time the debugger
 delivers it, the original information is found.
 When posting a signal from the debugger l->l_sigpendset is not set, so we
 use the process pending signal and add it back to the process pending set.
 Split sigget into sigget() and siggetinfo(). When a signal comes from the
 debugger (l->l_sigpendset == NULL), using siggetinfo() try to fetch the
 siginfo information from l->l_sigpend and then from p->p_sigpend if it
 was not found. This allows us to pass siginfo information for traps from
 the debugger.
 don't delete signal from the debugger.
 kern/kern_sleepq.c by christos:
 PR kern/40594: Antti Kantee: Don't call issignal() here to determine what errno
 to set for the interrupted syscall, because issignal() will consume the signal
 and it will not be delivered to the process afterwards. Instead call
 sigispending(), which now returns the first pending signal and does not
 consume it.
 We need to process SA_STOP signals immediately, and not deliver them to
 the process. Instead of re-structuring the code to do that, call issignal()
 like before in that case. (tail -F /file^Zfg should not get interrupted).
 kern/kern_subr.c by jmcneill, christos:
 PR kern/45312: ptrace: PT_SETREGS can't alter system calls

 Add a new PT_SYSCALLEMU request that cancels the current syscall, for
 use with PT_SYSCALL.
 For PT_SYSCALLEMU, no need to stop again on syscall exit.
 ifdef unused variable with -UPTRACE

 kern/sys_process.c, sys/proc.h, sys/ptrace.h, secmodel/bsd44/secmodel_bsd44_suser.c by jmcneill, christos:
 PR kern/43681: PT_SYSCALL appears to be broken

 sys_ptrace: For PT_CONTINUE/PT_SYSCALL/PT_DETACH, modify the p_trace_enabled
 flag of the target process, not the calling process.
 Process the signal now, otherwise calling issignal() and ignoring
 the return will lose the signal if it came from the debugger
 (issignal() clears p->p_xstat)
 PR kern/45312: ptrace: PT_SETREGS can't alter system calls

 Add a new PT_SYSCALLEMU request that cancels the current syscall, for
 use with PT_SYSCALL.
 PR kern/45330: ptrace: signals can alter syscall return values

 process_stoptrace: defer signal processing to userret, ok christos@


 To generate a diff of this commit:
 cvs rdiff -u -r1.44 -r1.44.4.1 src/sys/arch/amd64/amd64/syscall.c
 cvs rdiff -u -r1.57 -r1.57.4.1 src/sys/arch/i386/i386/syscall.c
 cvs rdiff -u -r1.241.4.3 -r1.241.4.4 src/sys/arch/i386/i386/trap.c
 cvs rdiff -u -r1.289.4.6 -r1.289.4.7 src/sys/kern/kern_sig.c
 cvs rdiff -u -r1.35 -r1.35.4.1 src/sys/kern/kern_sleepq.c
 cvs rdiff -u -r1.192.4.1 -r1.192.4.2 src/sys/kern/kern_subr.c
 cvs rdiff -u -r1.143.4.1 -r1.143.4.2 src/sys/kern/sys_process.c
 cvs rdiff -u -r1.59 -r1.59.4.1 src/sys/secmodel/bsd44/secmodel_bsd44_suser.c
 cvs rdiff -u -r1.282 -r1.282.4.1 src/sys/sys/proc.h
 cvs rdiff -u -r1.40 -r1.40.20.1 src/sys/sys/ptrace.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: pending-pullups->closed
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sun, 18 Mar 2012 21:36:21 +0000
State-Changed-Why:
This problem, along with a pile of other ptrace problems, was fixed last
summer and the netbsd-5 pullups have now been applied.


>Unformatted:

Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.