NetBSD Problem Report #36183

From ad@hairylemon.org  Fri Apr 20 21:37:13 2007
Return-Path: <ad@hairylemon.org>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id E76CB63B964
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 20 Apr 2007 21:37:13 +0000 (UTC)
Message-Id: <E1Hf0n2-00015p-Mh@ns0.hairylemon.org>
Date: Fri, 20 Apr 2007 22:37:12 +0100
From: ad@netbsd.org
Sender: Andrew Doran <ad@hairylemon.org>
Reply-To: ad@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: problem with ptrace and multithreaded processes
X-Send-Pr-Version: 3.95

>Number:         36183
>Category:       kern
>Synopsis:       problem with ptrace and multithreaded processes
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    ad
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Apr 20 21:40:00 +0000 2007
>Closed-Date:    Sat Apr 04 10:29:46 +0000 2009
>Last-Modified:  Sat Apr 04 10:29:46 +0000 2009
>Originator:     Andrew Doran
>Release:        NetBSD 4.99.17
>Organization:
The NetBSD Project
>Environment:
N/A
>Description:
Using ptrace, a process can be made to stop for various events. The debugger
can inject a signal to be handled by the process when resuming. There are three
problems with this currently:

1. Removal and handling of the injected signal is not atomic and this can
   cause a kernel panic if two threads try to handle it.

2. The thread elected to handle the signal may not be able to take it.

3. There is no documented policy around which thread should take the signal.
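
As an illustration of issue 1 only, here is a minimal userland sketch of
what "atomic removal" means: the injected signal is fetched and cleared in
a single step, so at most one caller ever sees it. The names
injected_signo, inject_signal() and claim_injected_signal() are
hypothetical, and C11 atomics stand in for whatever locking the kernel
actually uses; this is not kernel code.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the signal number left by the debugger. */
atomic_int injected_signo;

/* Debugger side: leave a signal for the traced process to pick up. */
void
inject_signal(int signo)
{
	assert(signo > 0);
	atomic_store(&injected_signo, signo);
}

/*
 * Thread side: fetch and clear the injected signal in one atomic step.
 * The single caller that wins gets the signal number; all others get 0.
 */
int
claim_injected_signal(void)
{
	return atomic_exchange(&injected_signo, 0);
}
```

If the fetch and the clear were two separate steps, two threads could both
read the same non-zero value before either cleared it, which is the kind of
double handling described in item 1.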
>How-To-Repeat:
Code inspection.
>Fix:
Address the 3 issues above.

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: kern-bug-people->ad
Responsible-Changed-By: ad@netbsd.org
Responsible-Changed-When: Fri, 20 Apr 2007 21:41:28 +0000
Responsible-Changed-Why:
I'm looking into it.


From: Nick Hudson <nick.hudson@gmx.co.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/36183: problem with ptrace and multithreaded processes
Date: Sat, 15 Nov 2008 12:23:15 +0000

 --Boundary-00=_z8rHJdryVpSZsGt
 Content-Type: text/plain;
   charset="us-ascii"
 Content-Transfer-Encoding: 7bit
 Content-Disposition: inline

 Here's a workaround patch from ad

 --Boundary-00=_z8rHJdryVpSZsGt
 Content-Type: text/x-diff;
   charset="us-ascii";
   name="pr36183.diff"
 Content-Transfer-Encoding: 7bit
 Content-Disposition: attachment;
 	filename="pr36183.diff"

 Index: sys/kern/kern_sig.c
 ===================================================================
 RCS file: /cvsroot/src/sys/kern/kern_sig.c,v
 retrieving revision 1.289
 diff -u -p -u -r1.289 kern_sig.c
 --- sys/kern/kern_sig.c	24 Oct 2008 18:07:36 -0000	1.289
 +++ sys/kern/kern_sig.c	15 Nov 2008 10:57:48 -0000
 @@ -1653,9 +1653,9 @@ sigswitch(bool ppsig, int ppmask, int si
  	 */
  	KERNEL_UNLOCK_ALL(l, &biglocks);
  	if (p->p_stat == SSTOP || (p->p_sflag & PS_STOPPING) != 0) {
 +		KASSERT(l->l_stat == LSONPROC);
  		p->p_nrlwps--;
  		lwp_lock(l);
 -		KASSERT(l->l_stat == LSONPROC || l->l_stat == LSSLEEP);
  		l->l_stat = LSSTOP;
  		lwp_unlock(l);
  	}
 @@ -1684,19 +1684,25 @@ sigchecktrace(sigpend_t **spp)
  	 * If we are no longer being traced, or the parent didn't
  	 * give us a signal, look for more signals.
  	 */
 -	if ((p->p_slflag & PSL_TRACED) == 0 || p->p_xstat == 0)
 +	if ((p->p_slflag & PSL_TRACED) == 0 ||
 +	    p->p_xstat == 0 || p->p_xlwp != l)
  		return 0;

 -	/* If there's a pending SIGKILL, process it immediately. */
 -	if (sigismember(&p->p_sigpend.sp_set, SIGKILL))
 +	/*
 +	 * If there's a pending SIGKILL or the process is on the way
 +	 * out, process immediately.
 +	 */
 +	if (sigismember(&p->p_sigpend.sp_set, SIGKILL) ||
 +	    (l->l_flag & (LW_WCORE | LW_WEXIT)) != 0)
  		return 0;

  	/*
 -	 * If the new signal is being masked, look for other signals.
 -	 * `p->p_sigctx.ps_siglist |= mask' is done in setrunnable().
 +	 * sigaddset() is done in setrunnable().
  	 */
  	signo = p->p_xstat;
 +	*spp = &l->l_sigpend;
  	p->p_xstat = 0;
 +	p->p_xlwp = NULL;
  	if ((sigprop[signo] & SA_TOLWP) != 0)
  		*spp = &l->l_sigpend;
  	else
 @@ -1956,11 +1962,24 @@ postsig(int signo)
  	}

  	/*
 +	 * If the process is exiting or dumping core (possible after the
 +	 * unlock above), then bail out now.  Both of these conditions
 +	 * will be visible with only the proc mutex held.  Note that the
 +	 * call to lwp_userret() is recursive, but we will not come back
 +	 * this way.
 +	 */
 +	if ((l->l_flag & (LW_WEXIT | LW_WCORE)) != 0) {
 +		lwp_userret(l);
 +		panic("postsig userret");
 +		/* NOTREACHED */
 +	}
 +
 +	/*
  	 * If we get here, the signal must be caught.
  	 */
  #ifdef DIAGNOSTIC
  	if (action == SIG_IGN || sigismember(&l->l_sigmask, signo))
 -		panic("postsig action");
 +		panic("postsig: action");
  #endif

  	kpsendsig(l, &ksi, returnmask);
 @@ -2261,6 +2280,7 @@ proc_unstop(struct proc *p)

  	p->p_stat = SACTIVE;
  	p->p_sflag &= ~PS_STOPPING;
 +	p->p_xlwp = NULL;
  	sig = p->p_xstat;

  	if (!p->p_waited)
 @@ -2276,15 +2296,21 @@ proc_unstop(struct proc *p)
  			setrunnable(l);
  			continue;
  		}
 -		if (sig && (l->l_flag & LW_SINTR) != 0) {
 -		        setrunnable(l);
 +		if (sig && (l->l_flag & LW_SINTR) != 0 &&
 +		    !sigismember(&l->l_sigmask, sig)) {
  		        sig = 0;
 +		        setrunnable(l);
  		} else {
  			l->l_stat = LSSLEEP;
  			p->p_nrlwps++;
  			lwp_unlock(l);
  		}
  	}
 +
 +	if (p->p_xlwp == NULL) {
 +		/* No LWP available to take the signal. */
 +		p->p_xstat = 0;
 +	}
  }

  static int
 Index: sys/kern/kern_synch.c
 ===================================================================
 RCS file: /cvsroot/src/sys/kern/kern_synch.c,v
 retrieving revision 1.255
 diff -u -p -u -r1.255 kern_synch.c
 --- sys/kern/kern_synch.c	15 Nov 2008 10:54:32 -0000	1.255
 +++ sys/kern/kern_synch.c	15 Nov 2008 10:57:49 -0000
 @@ -925,7 +925,6 @@ setrunnable(struct lwp *l)
  {
  	struct proc *p = l->l_proc;
  	struct cpu_info *ci;
 -	sigset_t *ss;

  	KASSERT((l->l_flag & LW_IDLE) == 0);
  	KASSERT(mutex_owned(p->p_lock));
 @@ -938,13 +937,12 @@ setrunnable(struct lwp *l)
  		 * If we're being traced (possibly because someone attached us
  		 * while we were stopped), check for a signal from the debugger.
  		 */
 -		if ((p->p_slflag & PSL_TRACED) != 0 && p->p_xstat != 0) {
 -			if ((sigprop[p->p_xstat] & SA_TOLWP) != 0)
 -				ss = &l->l_sigpend.sp_set;
 -			else
 -				ss = &p->p_sigpend.sp_set;
 -			sigaddset(ss, p->p_xstat);
 +		if ((p->p_slflag & PSL_TRACED) != 0 && p->p_xstat != 0 &&
 +		    p->p_xlwp == NULL &&
 +		    !sigismember(&l->l_sigmask, p->p_xstat)) {
 +			sigaddset(&l->l_sigpend.sp_set, p->p_xstat);
  			signotify(l);
 +			p->p_xlwp = l;
  		}
  		p->p_nrlwps++;
  		break;
 Index: sys/sys/proc.h
 ===================================================================
 RCS file: /cvsroot/src/sys/sys/proc.h,v
 retrieving revision 1.282
 diff -u -p -u -r1.282 proc.h
 --- sys/sys/proc.h	22 Oct 2008 11:14:33 -0000	1.282
 +++ sys/sys/proc.h	15 Nov 2008 10:57:49 -0000
 @@ -290,6 +290,7 @@ struct proc {
  	LIST_HEAD(, lwp) p_sigwaiters;	/* p: LWPs waiting for signals */
  	sigstore_t	p_sigstore;	/* p: process-wide signal state */
  	sigpend_t	p_sigpend;	/* p: pending signals */
 +	struct lwp	*p_xlwp;	/* s: LWP to take sig from debugger */
  	struct lcproc	*p_lwpctl;	/* p, a: _lwp_ctl() information */
  	pid_t		p_ppid;		/* :: cached parent pid */


 --Boundary-00=_z8rHJdryVpSZsGt--
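
The election policy that the patch above appears to implement with p_xlwp
can be modelled in userland roughly as follows. This is a simplified
sketch, not the kernel code: struct model_lwp, struct model_proc and the
model_*() functions are hypothetical stand-ins, signal sets are reduced to
bit masks, and the LW_SINTR, SIGKILL and exit/core-dump checks are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for struct lwp and struct proc. */
struct model_lwp {
	unsigned int sigmask;	/* bit n set: signal n is blocked */
	unsigned int sigpend;	/* bit n set: signal n is pending */
};

struct model_proc {
	int xstat;			/* signal left by the debugger, 0 if none */
	struct model_lwp *xlwp;		/* LWP elected to take it, or NULL */
};

/* Elect l if nobody is elected yet and l does not block the signal. */
void
model_setrunnable(struct model_proc *p, struct model_lwp *l)
{
	assert(p->xstat >= 0 && p->xstat < 32);
	if (p->xstat != 0 && p->xlwp == NULL &&
	    (l->sigmask & (1u << p->xstat)) == 0) {
		l->sigpend |= 1u << p->xstat;
		p->xlwp = l;
	}
}

/* Resume every LWP; drop the signal if none of them could take it. */
void
model_unstop(struct model_proc *p, struct model_lwp *lwps, size_t n)
{
	p->xlwp = NULL;
	for (size_t i = 0; i < n; i++)
		model_setrunnable(p, &lwps[i]);
	if (p->xlwp == NULL)
		p->xstat = 0;
}

/* Only the elected LWP consumes the signal; every other LWP sees 0. */
int
model_checktrace(struct model_proc *p, struct model_lwp *l)
{
	int signo;

	if (p->xstat == 0 || p->xlwp != l)
		return 0;
	signo = p->xstat;
	p->xstat = 0;
	p->xlwp = NULL;
	return signo;
}
```

In this model the first LWP that does not mask the signal is elected, an
LWP that was not elected never consumes it, and the signal is discarded
when every LWP masks it.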

From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/36183 CVS commit: src/sys/kern
Date: Sat, 13 Dec 2008 20:43:38 +0000 (UTC)

 Module Name:	src
 Committed By:	ad
 Date:		Sat Dec 13 20:43:38 UTC 2008

 Modified Files:
 	src/sys/kern: kern_sig.c kern_synch.c

 Log Message:
 PR kern/36183 problem with ptrace and multithreaded processes

 Fix the famous "gdb + threads = panic" problem.
 Also, fix another revivesa merge botch.


 To generate a diff of this commit:
 cvs rdiff -r1.292 -r1.293 src/sys/kern/kern_sig.c
 cvs rdiff -r1.255 -r1.256 src/sys/kern/kern_synch.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->analyzed
State-Changed-By: ad@NetBSD.org
State-Changed-When: Sat, 13 Dec 2008 20:51:49 +0000
State-Changed-Why:
The first two items are fixed.


Responsible-Changed-From-To: ad->kern-bug-people
Responsible-Changed-By: ad@NetBSD.org
Responsible-Changed-When: Sat, 13 Dec 2008 20:52:51 +0000
Responsible-Changed-Why:
The consensus is that we should replace ptrace with a Solaris-like procfs interface
that is capable of handling threads. I have no plans to do this.


From: Thor Lancelot Simon <tls@rek.tjls.com>
To: ad@netbsd.org
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/36183 -- not quite fixed yet?
Date: Tue, 16 Dec 2008 22:54:13 -0500

 On Tue, Dec 16, 2008 at 10:29:38PM -0500, Thor Lancelot Simon wrote:
 > 
 > Then I tried the buggy program below.  The result, run under gdb on
 > a 4-CPU VMware Fusion VM (on an 8-core host) with the i386 port was
 > a panic: kernel diagnostic assertion l->l_stat != LSZOMB (kern_lwp.c
 > line 1352).

 Correction: I believe I hit ^T to trigger the panic with the program
 running under gdb.

 Thor

From: Thor Lancelot Simon <tls@netbsd.org>
To: ad@netbsd.org
Cc: gnats-bugs@netbsd.org
Subject: kern/36183 -- not quite fixed yet?
Date: Tue, 16 Dec 2008 22:29:38 -0500

 I applied your kern_sig/kern_synch changes for PR 36183 (as well as the
 immediately previous SIGKILL change to kern_sig.c) to my local netbsd-5
 tree (upon cursory examination, it looked like this should work).
 Initial results with a non-buggy pthread program were promising,
 though gdb still doesn't seem to know which threads are active, so one
 can't switch threads.  However:

 1) Sending SIGINFO (via ^T) drops the traced executable into gdb.  This
    isn't the old behavior for non-threaded executables, is it?  I can't
    remember.

 2) Setting breakpoints at certain points in the program seems to cause
    the program to *exit* with SIGTRAP.

 3) Exiting gdb doesn't seem to properly kill off the traced executable
    if it has been stopped at a breakpoint.

 Then I tried the buggy program below.  The result, run under gdb on
 a 4-CPU VMware Fusion VM (on an 8-core host) with the i386 port was
 a panic: kernel diagnostic assertion l->l_stat != LSZOMB (kern_lwp.c
 line 1352).

 I compiled the program on 4.99.72 as that's what I had handy; I doubt
 that made much of a difference, but just so you know.

 #include <pthread.h>

 static void * t_worker(void *arg)
 {
 int splodeme[10 * 1024 * 1024];
 }

 int main(int argc, char **argv) {
 	pthread_t t;

 	pthread_create(&t, NULL, t_worker, NULL);

 	while(1) { };
 }
Responsible-Changed-From-To: kern-bug-people->ad
Responsible-Changed-By: ad@NetBSD.org
Responsible-Changed-When: Sat, 20 Dec 2008 22:37:17 +0000
Responsible-Changed-Why:
look again


From: Soren Jacobsen <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/36183 CVS commit: [netbsd-5] src/sys/kern
Date: Mon,  2 Feb 2009 18:28:23 +0000 (UTC)

 Module Name:	src
 Committed By:	snj
 Date:		Mon Feb  2 18:28:23 UTC 2009

 Modified Files:
 	src/sys/kern [netbsd-5]: kern_sig.c kern_synch.c

 Log Message:
 Pull up following revision(s) (requested by ad in ticket #353):
 	sys/kern/kern_sig.c: revision 1.293
 	sys/kern/kern_synch.c: revision 1.256
 PR kern/36183 problem with ptrace and multithreaded processes
 Fix the famous "gdb + threads = panic" problem.
 Also, fix another revivesa merge botch.


 To generate a diff of this commit:
 cvs rdiff -r1.289 -r1.289.4.1 src/sys/kern/kern_sig.c
 cvs rdiff -r1.254 -r1.254.2.1 src/sys/kern/kern_synch.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/36183 CVS commit: src/sys
Date: Wed,  4 Feb 2009 21:17:39 +0000 (UTC)

 Module Name:	src
 Committed By:	ad
 Date:		Wed Feb  4 21:17:39 UTC 2009

 Modified Files:
 	src/sys/kern: kern_lwp.c sys_process.c
 	src/sys/sys: lwp.h

 Log Message:
 PR kern/36183 problem with ptrace and multithreaded processes

 Fix the crashy test case that Thor provided.


 To generate a diff of this commit:
 cvs rdiff -r1.126 -r1.127 src/sys/kern/kern_lwp.c
 cvs rdiff -r1.145 -r1.146 src/sys/kern/sys_process.c
 cvs rdiff -r1.116 -r1.117 src/sys/sys/lwp.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Christoph Egger <Christoph_Egger@gmx.de>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: Re: kern/36183: problem with ptrace and multithreaded processes
Date: Thu, 5 Feb 2009 11:31:57 +0100

 Use a multithreaded program in gdb. In my example I use qemu
 from svn tree.

 # gdb qemu-system-x86_64
 (gdb) set args guestimage.img -vnc :0
 (gdb) run
 Program received signal SIGUSR2, User defined signal 2.
 0x00007f7ffd1710da in ___lwp_park50 () from /usr/lib/libc.so.12
 (gdb) quit
 The program is running.  Exit anyway? (y or n) y

 gdb never exits. You have to kill gdb from another shell
 to get back to the command line.

 sorry, pid 699 was killed: orphaned traced process
 Killed 
 # 

From: Soren Jacobsen <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/36183 CVS commit: [netbsd-5] src/sys
Date: Fri,  6 Feb 2009 01:54:09 +0000 (UTC)

 Module Name:	src
 Committed By:	snj
 Date:		Fri Feb  6 01:54:09 UTC 2009

 Modified Files:
 	src/sys/kern [netbsd-5]: kern_lwp.c sys_process.c
 	src/sys/sys [netbsd-5]: lwp.h

 Log Message:
 Pull up following revision(s) (requested by ad in ticket #414):
 	sys/kern/kern_lwp.c: revision 1.127
 	sys/kern/sys_process.c: revision 1.146
 	sys/sys/lwp.h: revision 1.117
 PR kern/36183 problem with ptrace and multithreaded processes
 Fix the crashy test case that Thor provided.


 To generate a diff of this commit:
 cvs rdiff -r1.126 -r1.126.2.1 src/sys/kern/kern_lwp.c
 cvs rdiff -r1.143 -r1.143.4.1 src/sys/kern/sys_process.c
 cvs rdiff -r1.114 -r1.114.4.1 src/sys/sys/lwp.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: analyzed->closed
State-Changed-By: ad@NetBSD.org
State-Changed-When: Sat, 04 Apr 2009 10:29:46 +0000
State-Changed-Why:
replaced by 40594


>Unformatted:

Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.