NetBSD Problem Report #40594
From www@NetBSD.org Mon Feb 9 20:13:11 2009
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by narn.NetBSD.org (Postfix) with ESMTP id D1E3463B99D
for <gnats-bugs@gnats.netbsd.org>; Mon, 9 Feb 2009 20:13:11 +0000 (UTC)
Message-Id: <20090209201311.938D263B896@narn.NetBSD.org>
Date: Mon, 9 Feb 2009 20:13:11 +0000 (UTC)
From: pooka@iki.fi
Reply-To: pooka@iki.fi
To: gnats-bugs@NetBSD.org
Subject: gdb does not work on 5.0 RC2
X-Send-Pr-Version: www-1.0
>Number: 40594
>Category: kern
>Synopsis: gdb does not work on 5.0 RC2
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: kern-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Feb 09 20:15:00 +0000 2009
>Closed-Date: Sun Mar 18 21:36:21 +0000 2012
>Last-Modified: Sun Mar 18 21:36:21 +0000 2012
>Originator: Antti Kantee
>Release: 5.0_RC2
>Organization:
>Environment:
>Description:
Somewhere between late 5.0_BETA and 5.0_RC (1 and 2) gdb stopped working.
Notably, my gdb is from Nov 2007.
>How-To-Repeat:
pain-rustique:1:~> gdb /bin/ls
GNU gdb 6.5
[snip]
(gdb) run
Starting program: /bin/ls
*hang*
>Fix:
It seems that executing ls ends up in the "pause" wchan. It is coming
from __sigsuspend14. gdb, on the other hand, is doing wait4.
So I guess technically the program executed from gdb is hanging, not gdb.
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: kern-bug-people->ad
Responsible-Changed-By: ad@NetBSD.org
Responsible-Changed-When: Thu, 12 Feb 2009 14:35:35 +0000
Responsible-Changed-Why:
take
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/40594: gdb does not work on 5.0 RC2
Date: Sun, 22 Feb 2009 21:41:22 +0000
On Mon, Feb 09, 2009 at 08:15:01PM +0000, pooka@iki.fi wrote:
> Somewhere between late 5.0_BETA and 5.0_RC (1 and 2) gdb stopped working.
> Notably, my gdb is from Nov 2007.
> >How-To-Repeat:
> pain-rustique:1:~> gdb /bin/ls
> GNU gdb 6.5
> [snip]
>
> (gdb) run
> Starting program: /bin/ls
> *hang*
>
> >Fix:
> It seems that executing ls ends up the "pause" wchan. It is coming
> from __sigsuspend14. gdb, on the other hand, is doing wait4.
> So I guess technically the program executed from is hanging, not gdb.
The issue appears to be provoked by the shell spawned by gdb to start
the inferior process; depending on what you have in your shell startup
files the hang may or may not occur. In my case the problem seems to
be tickled by
setenv _UNAME `uname -s |& tr A-Z a-z`
What seems to be happening is that the shell forks and then the fork
runs a subprocess, and when the child shell exits, wait notifies gdb
instead of the parent shell, so the child shell hangs around as a
zombie, the parent shell (if *csh) blocks in sigsuspend waiting for a
SIGCHLD it's not going to get, and gdb blocks in wait assuming
something else is going to happen.
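The parent shell's side of this can be sketched in user space. The
following Python sketch is an illustration only and is not tcsh's code:
it blocks SIGCHLD, forks a child, and waits for the SIGCHLD notification
before reaping. It uses sigtimedwait() with a timeout where the trace
shows an unbounded sigsuspend(), so a lost notification shows up as a
None return instead of a hang. The function name, the timeout and the
no-op handler are assumptions made for the example, and sigtimedwait()
is not available on every platform.

```python
import os
import signal


def wait_for_child_via_sigchld(exit_code=0, timeout=5.0):
    """Fork a child and wait for its SIGCHLD before reaping it.

    Returns the child's exit status, or None if no SIGCHLD arrived within
    the timeout (the state the parent shell is stuck in, except that
    sigsuspend() has no timeout).
    """
    # A no-op handler keeps SIGCHLD from being discarded as "ignored".
    old_handler = signal.signal(signal.SIGCHLD, lambda signum, frame: None)
    # Block SIGCHLD so it stays pending until we explicitly wait for it.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(exit_code)  # child: exit immediately
        info = signal.sigtimedwait({signal.SIGCHLD}, timeout)
        if info is None:
            return None  # the notification never arrived
        _, status = os.waitpid(pid, 0)  # reap the child
        return os.WEXITSTATUS(status)
    finally:
        # Restore the caller's signal mask and handler.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
        if old_handler is not None:
            signal.signal(signal.SIGCHLD, old_handler)
```

On a system where the notification is delivered normally, calling
wait_for_child_via_sigchld(exit_code=7) returns the child's exit status.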
In this run, process 14062 is gdb, 24361 is the parent shell (spawned
by gdb), and 26479 is the child shell.
24361 1 tcsh CALL read(8,0xbfbfad50,0x1000)
14164 1 tr CALL exit(0)
26479 1 tcsh RET __sigsuspend14 -1 errno 4 Interrupted system call
26479 1 tcsh PSIG SIGCHLD caught handler=0x808463c mask=(2,20): code=CLD_EXITED child pid=14164, uid=32170, status=0, utime=0, stime=0)
26479 1 tcsh CALL setcontext(0xbfbf65b4)
26479 1 tcsh RET write JUSTRETURN
26479 1 tcsh CALL __wait450(0xffffffff,0xbfbf68a4,1,0xbfbf6854)
26479 1 tcsh RET __wait450 14164/0x3754
26479 1 tcsh CALL __wait450(0xffffffff,0xbfbf68a4,1,0xbfbf6854)
26479 1 tcsh RET __wait450 -1 errno 10 No child processes
So far, so good. The parent shell is sitting in read to collect the
results from the backquotes; the child picks up the exit status of tr.
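The two __wait450 calls above pass pid -1 and options 1 (WNOHANG): the
first returns tr's pid and the second fails with ECHILD. A minimal Python
sketch of that kind of non-blocking reap loop follows. The function name
is hypothetical and this is not the shell's actual code.

```python
import os


def reap_exited_children():
    """Collect every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            # pid -1 means "any child"; WNOHANG returns at once if none is ready.
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # ECHILD: no children left at all
        if pid == 0:
            break  # children exist, but none has exited yet
        code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else None
        reaped.append((pid, code))
    return reaped
```

The loop stops on ECHILD exactly as the second __wait450 call in the
trace does.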
26479 1 tcsh CALL __sigprocmask14(3,0xbfbf6900,0)
26479 1 tcsh RET __sigprocmask14 0
26479 1 tcsh CALL __sigprocmask14(0,0,0x80a4738)
26479 1 tcsh RET __sigprocmask14 0
26479 1 tcsh CALL exit(0)
Now the child shell exits.
14062 1 gdb RET __wait450 24361/0x5f29
14062 1 gdb CALL ptrace(PT_GETREGS,0x5f29,0xbfbfe2ec,0)
14062 1 gdb RET ptrace 0
14062 1 gdb CALL ptrace(PT_CONTINUE,0x5f29,1,0x14)
14062 1 gdb RET ptrace 0
Now gdb picks up a wait result for the *parent* shell, which has not
exited or done anything else that should cause this. This is
apparently the exit notification for the child shell, messed up
somehow.
gdb apparently shrugs and tells the parent shell to continue.
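The numbers in the gdb lines above can be decoded mechanically. The
sketch below is only a reading aid: it assumes the ptrace(2) argument
order (request, pid, addr, data), that an addr of 1 means "continue where
stopped", and that signal 20 is SIGCHLD on NetBSD. The function name is
hypothetical. If that reading is right, gdb was resuming the parent shell
with a SIGCHLD to deliver, although the trace alone does not prove it.

```python
# Signal number from NetBSD's <sys/signal.h>; hardcoded because Python's
# signal.SIGCHLD reflects the host OS and differs on e.g. Linux.
NETBSD_SIGCHLD = 20


def describe_pt_continue(pid, addr, data):
    """Decode the pid, addr and data arguments of a PT_CONTINUE call."""
    return {
        "pid": pid,                         # process being resumed
        "resume_where_stopped": addr == 1,  # addr 1: continue at the stop point
        "signal": data,                     # signal to deliver on resume, 0 for none
    }


# Values copied from the trace: ptrace(PT_CONTINUE,0x5f29,1,0x14)
decoded = describe_pt_continue(0x5F29, 1, 0x14)
```

Here decoded["pid"] is 24361, the parent shell named in the wait result,
and decoded["signal"] is 20.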
24361 1 tcsh RET read -1 errno 4 Interrupted system call
24361 1 tcsh CALL read(8,0xbfbfad50,0x1000)
24361 1 tcsh GIO fd 8 read 0 bytes
""
24361 1 tcsh RET read 0
24361 1 tcsh CALL close(8)
24361 1 tcsh RET close 0
The parent shell now drops out of read and closes its pipe...
24361 1 tcsh CALL __sigprocmask14(1,0xbfbf6ca0,0xbfbf6cb0)
24361 1 tcsh RET __sigprocmask14 0
24361 1 tcsh CALL __sigsuspend14(0xbfbf6c90)
...and waits for a SIGCHLD from the child shell that it is never going
to receive, because that exit result was misdirected above, or
something.
14062 1 gdb CALL __wait450(0xffffffff,0xbfbfe558,0,0)
14062 1 gdb RET __wait450 RESTART
and now gdb goes to sleep waiting for something to happen, which of
course nothing will. This is where it hangs; the next thing in the
trace is manual intervention via SIGKILL.
I'm not sure if the child shell is being traced or not (one would
expect that it would be, though) so it's not clear if what's happening
is that the wrong process is being awakened from wait, that wait is
reporting on the wrong process, or even just that the wrong pid is
being returned, but it's pretty clear that wait is stuffed somehow.
Unfortunately, find_stopped_child() is a maze of special cases and
it's not clear what's going on inside it.
--
David A. Holland
dholland@netbsd.org
From: Antti Kantee <pooka@iki.fi>
To: gnats-bugs@NetBSD.org
Cc: ad@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/40594: gdb does not work on 5.0 RC2
Date: Mon, 23 Feb 2009 18:10:46 +0200
On Sun Feb 22 2009 at 21:45:02 +0000, David Holland wrote:
> The issue appears to be provoked by the shell spawned by gdb to start
> the inferior process; depending on what you have in your shell startup
> files the hang may or may not occur. In my case the problem seems to
> be tickled by
Oh man, that's evil! I tracked it down to a bunch of stuff in .aliases:
if (`tty` == "/dev/console")
Knowing this workaround, I can run programs in gdb again. Thanks!!
State-Changed-From-To: open->feedback
State-Changed-By: riz@NetBSD.org
State-Changed-When: Wed, 16 Jun 2010 21:52:05 +0000
State-Changed-Why:
Given the workaround and recent GDB changes in -current, can
this be closed?
State-Changed-From-To: feedback->analyzed
State-Changed-By: pooka@NetBSD.org
State-Changed-When: Thu, 17 Jun 2010 11:42:10 +0300
State-Changed-Why:
Unfortunately the thread support fixes do not address this issue,
so let's keep the PR open.
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: ad@NetBSD.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org,
pooka@NetBSD.org, pooka@iki.fi
Subject: Re: kern/40594 (gdb does not work on 5.0 RC2)
Date: Sat, 19 Jun 2010 03:10:30 +0000
On Thu, Jun 17, 2010 at 08:42:11AM +0000, pooka@NetBSD.org wrote:
> State-Changed-From-To: feedback->analyzed
> State-Changed-By: pooka@NetBSD.org
> State-Changed-When: Thu, 17 Jun 2010 11:42:10 +0300
> State-Changed-Why:
> Unfortunately the thread support fixes do not address this issue,
> so let's keep the PR open.
It is a bug somewhere in the ptrace-related "logic" in wait, as best I
can tell so far.
--
David A. Holland
dholland@netbsd.org
From: Andrew Smallshaw <andrews@sdf.lonestar.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/40594 (gdb does not work on 5.0 RC2)
Date: Mon, 5 Jul 2010 22:04:58 +0100
I'm posting this since I think I've just tripped on this problem
myself on a fairly clean 5.0.2 system. Interestingly that was
using the stock ksh and not tcsh but again the problem can be pinned
down to startup files, in this case my .kshrc and the lines:
case `whoami` in
root) PS1='# ' ;;
*) PS1='$ ' ;;
esac
That indicates it is a wider problem with shell/gdb interaction
not just tcsh. However, I tried setting $SHELL to /bin/sh and
(after installing it) ksh93. Both worked correctly. Since I had
been meaning to swap to ksh93 anyway (this is a fairly new install)
that pretty much fixes it as far as I am concerned, but I write
this in case it is any help as a workaround for anyone else having
the problem.
--
Andrew Smallshaw
andrews@sdf.lonestar.org
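The $SHELL workaround described above can be scripted. This is a minimal
Python sketch with a hypothetical helper name: it only builds the command
line and an environment with SHELL overridden, and leaves running gdb to
the caller.

```python
import os


def gdb_launch_with_shell(program, shell="/bin/sh", base_env=None):
    """Return (argv, env) for starting gdb with $SHELL overridden."""
    env = dict(os.environ if base_env is None else base_env)
    env["SHELL"] = shell  # gdb starts the inferior through $SHELL
    return ["gdb", program], env


# Possible use (not executed here):
#   import subprocess
#   argv, env = gdb_launch_with_shell("/bin/ls")
#   subprocess.run(argv, env=env)
```

Overriding the variable for a single invocation leaves the login shell
and its startup files untouched.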
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/40594 (gdb does not work on 5.0 RC2)
Date: Mon, 5 Jul 2010 21:17:31 +0000
On Mon, Jul 05, 2010 at 09:10:05PM +0000, Andrew Smallshaw wrote:
> That indicates it is a wider problem with shell/gdb interaction
> not just tcsh.
Yes, it's a kernel issue that has something to do with forking
subprocesses from the shell that gdb uses to start the target
program.
--
David A. Holland
dholland@netbsd.org
Responsible-Changed-From-To: ad->jmcneill
Responsible-Changed-By: jmcneill@NetBSD.org
Responsible-Changed-When: Mon, 29 Aug 2011 18:02:43 +0000
Responsible-Changed-Why:
take
State-Changed-From-To: analyzed->feedback
State-Changed-By: jmcneill@NetBSD.org
State-Changed-When: Mon, 29 Aug 2011 18:02:43 +0000
State-Changed-Why:
I can't trigger the issue but I may have fixed this with the following commit:
http://mail-index.netbsd.org/source-changes/2011/08/29/msg026588.html
Can somebody see if the problem is still present in HEAD?
Responsible-Changed-From-To: jmcneill->kern-bug-people
Responsible-Changed-By: jmcneill@NetBSD.org
Responsible-Changed-When: Mon, 29 Aug 2011 21:04:57 +0000
Responsible-Changed-Why:
No luck, I tried.
State-Changed-From-To: feedback->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Wed, 31 Aug 2011 08:49:24 +0000
State-Changed-Why:
problem remains, we seem to be moving forward on it though
From: "Christos Zoulas" <christos@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/40594 CVS commit: src/sys/kern
Date: Wed, 31 Aug 2011 12:09:56 -0400
Module Name: src
Committed By: christos
Date: Wed Aug 31 16:09:56 UTC 2011
Modified Files:
src/sys/kern: kern_sleepq.c
Log Message:
PR/40594: Antti Kantee: Don't call issignal() here to determine what errno
to set for the interrupted syscall, because issignal() will consume the signal
and it will not be delivered to the process afterwards. Instead call
sigispending(), which now returns the first pending signal and does not
consume it.
To generate a diff of this commit:
cvs rdiff -u -r1.41 -r1.42 src/sys/kern/kern_sleepq.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
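The distinction the log message draws, between looking at a pending
signal and consuming it, has a user-space analogue. The Python sketch
below is an analogy only and not the kernel code: sigpending() stands in
for a non-consuming check and sigtimedwait() for a consuming one. The
function name and the choice of SIGUSR1 are assumptions made for the
example.

```python
import signal
import threading


def peek_then_consume(signum=signal.SIGUSR1):
    """Show that peeking leaves a signal pending while consuming removes it."""
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signum})
    try:
        # Send the signal to this thread; it stays pending because it is blocked.
        signal.pthread_kill(threading.get_ident(), signum)
        seen_by_peek = signum in signal.sigpending()
        still_pending = signum in signal.sigpending()   # peeking changed nothing
        info = signal.sigtimedwait({signum}, 0)         # consumes the signal
        consumed = info is not None and info.si_signo == signum
        pending_after = signum in signal.sigpending()   # now gone
        return seen_by_peek, still_pending, consumed, pending_after
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

Once the signal has been consumed nothing is left to deliver later, which
is the failure mode the commit describes for issignal().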
State-Changed-From-To: open->closed
State-Changed-By: christos@NetBSD.org
State-Changed-When: Sun, 04 Sep 2011 10:48:13 -0400
State-Changed-Why:
fixed, thanks; needs pullups
From: Matthew Mondor <mm_lists@pulsar-zone.net>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/40594 (gdb does not work on 5.0 RC2)
Date: Sun, 18 Sep 2011 20:22:56 -0400
On Sun, 4 Sep 2011 14:48:14 +0000 (UTC)
christos@NetBSD.org wrote:
> Synopsis: gdb does not work on 5.0 RC2
>
> State-Changed-From-To: open->closed
> State-Changed-By: christos@NetBSD.org
> State-Changed-When: Sun, 04 Sep 2011 10:48:13 -0400
> State-Changed-Why:
> fixed, thanks; needs pullups
Is there a pullup # for this for my notes? (I've been silently tracking
this PR as well)
Thanks,
--
Matt
State-Changed-From-To: closed->pending-pullups
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sat, 05 Nov 2011 13:07:38 +0000
State-Changed-Why:
pullup-5 #1668
From: "Manuel Bouyer" <bouyer@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/40594 CVS commit: [netbsd-5] src/sys
Date: Sat, 4 Feb 2012 16:58:00 +0000
Module Name: src
Committed By: bouyer
Date: Sat Feb 4 16:58:00 UTC 2012
Modified Files:
src/sys/arch/amd64/amd64 [netbsd-5]: syscall.c
src/sys/arch/i386/i386 [netbsd-5]: syscall.c trap.c
src/sys/kern [netbsd-5]: kern_sig.c kern_sleepq.c kern_subr.c
sys_process.c
src/sys/secmodel/bsd44 [netbsd-5]: secmodel_bsd44_suser.c
src/sys/sys [netbsd-5]: proc.h ptrace.h
Log Message:
Apply patch, requested by jmcneill in ticket #1668:
sys/arch/amd64/amd64/syscall.c patch
sys/arch/i386/i386/syscall.c patch
sys/arch/i386/i386/trap.c patch
sys/kern/kern_sig.c patch
sys/kern/kern_sleepq.c patch
sys/kern/kern_subr.c patch
sys/kern/sys_process.c patch
sys/secmodel/bsd44/secmodel_bsd44_suser.c patch
sys/sys/proc.h patch
sys/sys/ptrace.h patch
arch/i386/i386/machdep.c, arch/amd64/amd64/machdep.c (from
arch/x86/x86/machdep.c) by christos:
Remove code that was used to avoid register spills. setcontext(2) can change
the registers, so re-fetching will produce the wrong result for trace_exit().
arch/i386/i386/trap.c by reinoud:
Fix the illegal instruction return address. It was using the value of the
cpu's %cr2 register but that's not valid:
CR2 Contains a value called Page Fault Linear Address (PFLA). When a page
fault occurs, the address the program attempted to access is stored in the CR2
register.
And this is thus NOT the illegal instruction address!
kern/kern_sig.c by christos:
PR kern/45327: Jared McNeill: ptrace: siginfo doesn't work with traced processes
When saving the signal in p->p_xstat, clear it from the pending mask, but
don't remove it from the siginfo queue, so that next time the debugger
delivers it, the original information is found.
When posting a signal from the debugger l->l_sigpendset is not set, so we
use the process pending signal and add it back to the process pending set.
Split sigget into sigget() and siggetinfo(). When a signal comes from the
debugger (l->l_sigpendset == NULL), using siggetinfo() try to fetch the
siginfo information from l->l_sigpend and then from p->p_sigpend if it
was not found. This allows us to pass siginfo information for traps from
the debugger.
don't delete signal from the debugger.
kern/kern_sleepq.c by christos:
PR kern/40594: Antti Kantee: Don't call issignal() here to determine what errno
to set for the interrupted syscall, because issignal() will consume the signal
and it will not be delivered to the process afterwards. Instead call
sigispending(), which now returns the first pending signal and does not
consume it.
We need to process SA_STOP signals immediately, and not deliver them to
the process. Instead of re-structuring the code to do that, call issignal()
like before in that case. (tail -F /file^Zfg should not get interrupted).
kern/kern_subr.c by jmcneill, christos:
PR kern/45312: ptrace: PT_SETREGS can't alter system calls
Add a new PT_SYSCALLEMU request that cancels the current syscall, for
use with PT_SYSCALL.
For PT_SYSCALLEMU, no need to stop again on syscall exit.
ifdef unused variable with -UPTRACE
kern/sys_process.c, sys/proc.h, sys/ptrace.h, secmodel/bsd44/secmodel_bsd44_suser.c by jmcneill, christos:
PR kern/43681: PT_SYSCALL appears to be broken
sys_ptrace: For PT_CONTINUE/PT_SYSCALL/PT_DETACH, modify the p_trace_enabled
flag of the target process, not the calling process.
Process the signal now, otherwise calling issignal() and ignoring
the return will lose the signal if it came from the debugger
(issignal() clears p->p_xstat)
PR kern/45312: ptrace: PT_SETREGS can't alter system calls
Add a new PT_SYSCALLEMU request that cancels the current syscall, for
use with PT_SYSCALL.
PR kern/45330: ptrace: signals can alter syscall return values
process_stoptrace: defer signal processing to userret, ok christos@
To generate a diff of this commit:
cvs rdiff -u -r1.44 -r1.44.4.1 src/sys/arch/amd64/amd64/syscall.c
cvs rdiff -u -r1.57 -r1.57.4.1 src/sys/arch/i386/i386/syscall.c
cvs rdiff -u -r1.241.4.3 -r1.241.4.4 src/sys/arch/i386/i386/trap.c
cvs rdiff -u -r1.289.4.6 -r1.289.4.7 src/sys/kern/kern_sig.c
cvs rdiff -u -r1.35 -r1.35.4.1 src/sys/kern/kern_sleepq.c
cvs rdiff -u -r1.192.4.1 -r1.192.4.2 src/sys/kern/kern_subr.c
cvs rdiff -u -r1.143.4.1 -r1.143.4.2 src/sys/kern/sys_process.c
cvs rdiff -u -r1.59 -r1.59.4.1 src/sys/secmodel/bsd44/secmodel_bsd44_suser.c
cvs rdiff -u -r1.282 -r1.282.4.1 src/sys/sys/proc.h
cvs rdiff -u -r1.40 -r1.40.20.1 src/sys/sys/ptrace.h
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: pending-pullups->closed
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sun, 18 Mar 2012 21:36:21 +0000
State-Changed-Why:
This problem, along with a pile of other ptrace problems, was fixed last
summer and the netbsd-5 pullups have now been applied.
>Unformatted: