NetBSD Problem Report #53202

From gson@gson.org  Sun Apr 22 17:10:20 2018
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 3E1917A1CF
	for <gnats-bugs@gnats.NetBSD.org>; Sun, 22 Apr 2018 17:10:20 +0000 (UTC)
Message-Id: <20180422171013.5A09C989378@guava.gson.org>
Date: Sun, 22 Apr 2018 20:10:13 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: Kernel hangs running t_ptrace_wait:resume1 test
X-Send-Pr-Version: 3.95

>Number:         53202
>Category:       kern
>Synopsis:       Kernel hangs running t_ptrace_wait:resume1 test
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Apr 22 17:15:00 +0000 2018
>Closed-Date:    Mon May 14 19:19:39 +0000 2018
>Last-Modified:  Mon May 14 19:19:39 +0000 2018
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current, source date >= 2017.12.02.22.51.22
>Organization:

>Environment:
System: NetBSD
Architecture: i386
Machine: i386
>Description:

Since the commit of kern_lwp.c 1.191, executing the resume1 test case
of the t_ptrace_wait test can cause the operating system to hang.  The
problem was first reported in

  http://mail-index.netbsd.org/current-users/2017/12/04/msg032841.html

but at that time it was not yet clear whether just the test framework
was hanging, or the kernel; it has now become clear that it is in fact
the kernel.

The problem is also mentioned in passing in comments to PR 51995, but
since the subject of that PR is "ptrace(2) PT_RESUME is not reliable"
rather than the more serious issue of the kernel hanging, I'm filing
this separate PR about the latter issue.

The problem occurs on multiple architectures under both qemu (i386,
amd64, sparc) and gxemul (hpcmips).  I have also reproduced it on
physical amd64 hardware after disabling all but one CPU using cpuctl.

I currently have a hung i386 system attached to a remote kernel
debugger under qemu using the procedure documented at

  https://wiki.netbsd.org/kernel_debugging_with_qemu/

The backtrace varies as the kernel appears to be in a loop
involving multiple functions, but here's a typical one:

(gdb) bt
#0  sleepq_remove (sq=0xc20d8b98, l=0xc20e3aa0)
    at /usr/src/sys/kern/kern_sleepq.c:137
#1  0xc0bd5e0d in sleepq_unsleep (l=0xc20e3aa0, cleanup=true)
    at /usr/src/sys/kern/kern_sleepq.c:347
#2  0xc0b97b90 in cv_unsleep (l=0xc20e3aa0, cleanup=true)
    at /usr/src/sys/kern/kern_condvar.c:227
#3  0xc0bb37bb in lwp_unsleep (l=0xc20e3aa0, cleanup=true)
    at /usr/src/sys/kern/kern_lwp.c:1526
#4  0xc0bd5ad6 in sleepq_block (timo=0, catch_p=true)
    at /usr/src/sys/kern/kern_sleepq.c:259
#5  0xc0b97c9d in cv_wait_sig (cv=0xc20d8b98, mtx=0xc2674800)
    at /usr/src/sys/kern/kern_condvar.c:272
#6  0xc0bb18a6 in lwp_wait (l=0xc20e3aa0, lid=0, departed=0x0, exiting=true)
    at /usr/src/sys/kern/kern_lwp.c:648
#7  0xc0ba810b in exit_lwps (l=0xc20e3aa0) at /usr/src/sys/kern/kern_exit.c:636
#8  0xc0ba7478 in exit1 (l=0xc20e3aa0, exitcode=0, signo=1)
    at /usr/src/sys/kern/kern_exit.c:223
#9  0xc0bd4d70 in sigexit (l=0xc20e3aa0, signo=1)
    at /usr/src/sys/kern/kern_sig.c:2106
#10 0xc0bd4554 in postsig (signo=1) at /usr/src/sys/kern/kern_sig.c:1904
#11 0xc0bb3880 in lwp_userret (l=0xc20e3aa0)
    at /usr/src/sys/kern/kern_lwp.c:1562
#12 0xc0169046 in mi_userret (l=0xc20e3aa0) at /usr/src/sys/sys/userret.h:94
#13 0xc01690c1 in userret (l=0xc20e3aa0) at ./machine/userret.h:80
#14 0xc01692af in syscall (frame=0xc9398fa8)
    at /usr/src/sys/arch/x86/x86/syscall.c:168
#15 0xc01006a9 in Xsyscall ()
(gdb)

If I place a breakpoint on line 637 of kern_exit.c, it gets hit
repeatedly, but a breakpoint on line 641 is never hit:

(gdb) l
632              * behind us or there may even be new LWPs created.  Therefore, a
633              * full retry is required on error.
634              */
635             while (p->p_nlwps > 1) {
636                     if (lwp_wait(l, 0, NULL, true)) {
637                             goto retry;
638                     }
639             }
640     
641             KERNEL_LOCK(nlocks, l);
(gdb) 

The lwp_wait() call is returning -3 = ERESTART, which originates in
sleepq_sigtoerror():

#0  sleepq_sigtoerror (l=0xc20e3aa0, sig=9)
    at /usr/src/sys/kern/kern_sleepq.c:400
#1  0xc0bd5bd5 in sleepq_block (timo=0, catch_p=true)
    at /usr/src/sys/kern/kern_sleepq.c:293
#2  0xc0b97c9d in cv_wait_sig (cv=0xc20d8b98, mtx=0xc2674800)
    at /usr/src/sys/kern/kern_condvar.c:272
#3  0xc0bb18a6 in lwp_wait (l=0xc20e3aa0, lid=0, departed=0x0, exiting=true)
    at /usr/src/sys/kern/kern_lwp.c:648
#4  0xc0ba810b in exit_lwps (l=0xc20e3aa0) at /usr/src/sys/kern/kern_exit.c:636
#5  0xc0ba7478 in exit1 (l=0xc20e3aa0, exitcode=0, signo=1)
    at /usr/src/sys/kern/kern_exit.c:223

If there are other gdb commands I can run to help debug this, let me
know.

See also PR 52892, "Tests hang on MIPS", for another problem that
appeared with the same commit.

>How-To-Repeat:

Because the test case that triggers the bug has been disabled, and
another change has been committed that also keeps the test case from
triggering the bug, you need to apply the following two patches to
reproduce the issue in -current as of source date 2018.04.19.21.21.44:

--- src/tests/lib/libc/sys/t_ptrace_wait.c.orig	2018-04-15 03:19:23.000000000 +0300
+++ src/tests/lib/libc/sys/t_ptrace_wait.c	2018-04-17 10:26:17.000000000 +0300
@@ -6444,7 +6444,6 @@
 	atf_tc_expect_fail("PR kern/51995");

 	// Hangs with qemu
-	ATF_REQUIRE(0 && "In order to get reliable failure, abort");

 	SYSCALL_REQUIRE(msg_open(&fds) == 0);

Index: src/tests/lib/libc/sys/msg.h
diff -c src/tests/lib/libc/sys/msg.h:1.2 src/tests/lib/libc/sys/msg.h:1.1
*** src/tests/lib/libc/sys/msg.h:1.2	Tue Mar 13 16:45:36 2018
--- src/tests/lib/libc/sys/msg.h	Mon Apr  3 00:44:00 2017
***************
*** 70,76 ****
  	CLOSEFD(fds->cfd[1]);
  	CLOSEFD(fds->pfd[0]);

! //	printf("Send %s\n", info);
  	rv = write(fds->pfd[1], msg, len);
  	if (rv != (ssize_t)len)
  		return 1;
--- 70,76 ----
  	CLOSEFD(fds->cfd[1]);
  	CLOSEFD(fds->pfd[0]);

! 	printf("Send %s\n", info);
  	rv = write(fds->pfd[1], msg, len);
  	if (rv != (ssize_t)len)
  		return 1;
***************
*** 88,94 ****
  	CLOSEFD(fds->pfd[1]);
  	CLOSEFD(fds->cfd[0]);

! //	printf("Send %s\n", info);
  	rv = write(fds->cfd[1], msg, len);
  	if (rv != (ssize_t)len)
  		return 1;
--- 88,94 ----
  	CLOSEFD(fds->pfd[1]);
  	CLOSEFD(fds->cfd[0]);

! 	printf("Send %s\n", info);
  	rv = write(fds->cfd[1], msg, len);
  	if (rv != (ssize_t)len)
  		return 1;
***************
*** 106,112 ****
  	CLOSEFD(fds->pfd[1]);
  	CLOSEFD(fds->cfd[0]);

! //	printf("Wait %s\n", info);
  	rv = read(fds->pfd[0], msg, len);
  	if (rv != (ssize_t)len)
  		return 1;
--- 106,112 ----
  	CLOSEFD(fds->pfd[1]);
  	CLOSEFD(fds->cfd[0]);

! 	printf("Wait %s\n", info);
  	rv = read(fds->pfd[0], msg, len);
  	if (rv != (ssize_t)len)
  		return 1;
***************
*** 124,130 ****
  	CLOSEFD(fds->cfd[1]);
  	CLOSEFD(fds->pfd[0]);

! //	printf("Wait %s\n", info);
  	rv = read(fds->cfd[0], msg, len);
  	if (rv != (ssize_t)len)
  		return 1;
--- 124,130 ----
  	CLOSEFD(fds->cfd[1]);
  	CLOSEFD(fds->pfd[0]);

! 	printf("Wait %s\n", info);
  	rv = read(fds->cfd[0], msg, len);
  	if (rv != (ssize_t)len)
  		return 1;

Then build an i386 release, boot it in qemu, and run the commands

# cd /usr/tests/lib/libc/sys/
# atf-run ./t_ptrace_wait4 >log 2>&1 &

Within a few minutes, the system becomes unresponsive.  You may still
get a new shell prompt if you hit enter, but attempting to run "ls"
will hang.

>Fix:

Since the problem appeared with kern_lwp.c 1.191, the obvious fix
would be to revert that commit (and its pullup).

>Release-Note:

>Audit-Trail:
From: Andreas Gustafsson <gson@gson.org>
To: christos@NetBSD.org, gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/53202: Kernel hangs running t_ptrace_wait:resume1 test
Date: Mon, 23 Apr 2018 18:16:25 +0300

 My analysis of this bug is that the kernel ends up in a loop where the
 current lwp is waiting for another lwp to exit, but the loop never
 yields the CPU by calling mi_switch(), so the other lwp never gets a
 chance to run and exit, unless we happen to be running on a
 multiprocessor.

 Before kern_lwp.c 1.191, this worked because sleepq_block() was called
 with catch_p=false, so the "early" flag in sleepq_block() was
 false and sleepq_block() called mi_switch() in the "else" clause of
 "if (early) ...".  Now catch_p=true, "early" is true, and mi_switch()
 is never called.

 Below is a gdb transcript from single stepping around the entire loop,
 beginning and ending at "goto retry".

 Breakpoint 1, exit_lwps (l=0xc1f9b020) at /usr/src/sys/kern/kern_exit.c:637
 637                             goto retry;
 (gdb) n
 610             KASSERT(mutex_owned(p->p_lock));
 (gdb) 
 616             LIST_FOREACH(l2, &p->p_lwps, l_sibling) {
 (gdb) 
 617                     if (l2 == l)
 (gdb) 
 618                             continue;
 (gdb) 
 616             LIST_FOREACH(l2, &p->p_lwps, l_sibling) {
 (gdb) 
 617                     if (l2 == l)
 (gdb) 
 619                     lwp_lock(l2);
 (gdb) n
 620                     l2->l_flag |= LW_WEXIT;
 (gdb) n
 621                     if ((l2->l_stat == LSSLEEP && (l2->l_flag & LW_SINTR)) ||
 (gdb) n
 622                         l2->l_stat == LSSUSPENDED || l2->l_stat == LSSTOP) {
 (gdb) print /x l2->l_stat
 $1 = 0x2
 (gdb) print /x l2->l_flag
 $2 = 0x1100000
 (gdb) n
 621                     if ((l2->l_stat == LSSLEEP && (l2->l_flag & LW_SINTR)) ||
 (gdb) 
 622                         l2->l_stat == LSSUSPENDED || l2->l_stat == LSSTOP) {
 (gdb) 
 627                     lwp_unlock(l2);
 (gdb) 
 616             LIST_FOREACH(l2, &p->p_lwps, l_sibling) {
 (gdb) 
 635             while (p->p_nlwps > 1) {
 (gdb) 
 636                     if (lwp_wait(l, 0, NULL, true)) {
 (gdb) s
 lwp_wait (l=0xc1f9b020, lid=0, departed=0x0, exiting=true)
     at /usr/src/sys/kern/kern_lwp.c:531
 531             const lwpid_t curlid = l->l_lid;
 (gdb) n
 532             proc_t *p = l->l_proc;
 (gdb) 
 536             KASSERT(mutex_owned(p->p_lock));
 (gdb) 
 538             p->p_nlwpwait++;
 (gdb) n
 539             l->l_waitingfor = lid;
 (gdb) print p->p_nlwpwait
 $3 = 1
 (gdb) n
 550                     if ((p->p_sflag & PS_WCORE) != 0) {
 (gdb) n
 560                     while ((l2 = p->p_zomblwp) != NULL) {
 (gdb) 
 571                     nfound = 0;
 (gdb) 
 572                     error = 0;
 (gdb) 
 573                     LIST_FOREACH(l2, &p->p_lwps, l_sibling) {
 (gdb) 
 583                             if (l2->l_lid == lid && l2->l_waitingfor == curlid) {
 (gdb) 
 587                             if (l2 == l)
 (gdb) 
 588                                     continue;
 (gdb) 
 573                     LIST_FOREACH(l2, &p->p_lwps, l_sibling) {
 (gdb) 
 583                             if (l2->l_lid == lid && l2->l_waitingfor == curlid) {
 (gdb) 
 587                             if (l2 == l)
 (gdb) 
 589                             if ((l2->l_prflag & LPR_DETACHED) != 0) {
 (gdb) 
 593                             if (lid != 0) {
 (gdb) 
 602                             } else if (l2->l_waiter != 0) {
 (gdb) 
 612                             nfound++;
 (gdb) 
 615                             if (l2->l_stat != LSZOMB)
 (gdb) 
 616                                     continue;
 (gdb) 
 573                     LIST_FOREACH(l2, &p->p_lwps, l_sibling) {
 (gdb) 
 635                     if (error != 0)
 (gdb) 
 637                     if (nfound == 0) {
 (gdb) 
 646                     if (exiting) {
 (gdb) 
 647                             KASSERT(p->p_nlwps > 1);
 (gdb) 
 648                             error = cv_wait_sig(&p->p_lwpcv, p->p_lock);
 (gdb) s
 cv_wait_sig (cv=0xc20d6d80, mtx=0xc2604180)
     at /usr/src/sys/kern/kern_condvar.c:266
 266             lwp_t *l = curlwp;
 (gdb) n
 269             KASSERT(mutex_owned(mtx));
 (gdb) 
 271             cv_enter(cv, mtx, l);
 (gdb) 
 272             error = sleepq_block(0, true);
 (gdb) s
 sleepq_block (timo=0, catch_p=true) at /usr/src/sys/kern/kern_sleepq.c:235
 235             int error = 0, sig;
 (gdb) n
 237             lwp_t *l = curlwp;
 (gdb) 
 238             bool early = false;
 (gdb) 
 239             int biglocks = l->l_biglocks;
 (gdb) 
 241             ktrcsw(1, 0);
 (gdb) 
 247             if (catch_p) {
 (gdb) 
 248                     l->l_flag |= LW_SINTR;
 (gdb) 
 249                     if ((l->l_flag & (LW_CANCELLED|LW_WEXIT|LW_WCORE)) != 0) {
 (gdb) 
 253                     } else if ((l->l_flag & LW_PENDSIG) != 0 && sigispending(l, 0))
 (gdb) 
 254                             early = true;
 (gdb) 
 257             if (early) {
 (gdb) 
 259                     lwp_unsleep(l, true);
 (gdb) s
 lwp_unsleep (l=0xc1f9b020, cleanup=true) at /usr/src/sys/kern/kern_lwp.c:1525
 1525            KASSERT(mutex_owned(l->l_mutex));
 (gdb) n
 1526            (*l->l_syncobj->sobj_unsleep)(l, cleanup);
 (gdb) s
 1527    }
 (gdb) s
 sleepq_block (timo=0, catch_p=true) at /usr/src/sys/kern/kern_sleepq.c:277
 277             if (catch_p && error == 0) {
 (gdb) print l->l_syncobj->sobj_unsleep
 $4 = (void (*)(struct lwp *, _Bool)) 0xc0bdb63f <sched_unsleep>
 (gdb) s
 278                     p = l->l_proc;
 (gdb) s
 279                     if ((l->l_flag & (LW_CANCELLED | LW_WEXIT | LW_WCORE)) != 0)
 (gdb) s
 281                     else if ((l->l_flag & LW_PENDSIG) != 0) {
 (gdb) s
 289                             mutex_enter(p->p_lock);
 (gdb) n
 290                             if (((sig = sigispending(l, 0)) != 0 &&
 (gdb) n
 291                                 (sigprop[sig] & SA_STOP) == 0) ||
 (gdb) n
 290                             if (((sig = sigispending(l, 0)) != 0 &&
 (gdb) n
 293                                     error = sleepq_sigtoerror(l, sig);
 (gdb) s
 sleepq_sigtoerror (l=0xc1f9b020, sig=9) at /usr/src/sys/kern/kern_sleepq.c:387
 387             struct proc *p = l->l_proc;
 (gdb) n
 390             KASSERT(mutex_owned(p->p_lock));
 (gdb) 
 395             if ((SIGACTION(p, sig).sa_flags & SA_RESTART) == 0)
 (gdb) 
 398                     error = ERESTART;
 (gdb) 
 400             return error;
 (gdb) 
 401     }
 (gdb) 
 sleepq_block (timo=0, catch_p=true) at /usr/src/sys/kern/kern_sleepq.c:294
 294                             mutex_exit(p->p_lock);
 (gdb) 
 298             ktrcsw(0, 0);
 (gdb) n
 299             if (__predict_false(biglocks != 0)) {
 (gdb) n
 302             return error;
 (gdb) 
 303     }
 (gdb) 
 cv_wait_sig (cv=0xc20d6d80, mtx=0xc2604180)
     at /usr/src/sys/kern/kern_condvar.c:273
 273             return cv_exit(cv, mtx, l, error);
 (gdb) n
 274     }
 (gdb) n
 lwp_wait (l=0xc1f9b020, lid=0, departed=0x0, exiting=true)
     at /usr/src/sys/kern/kern_lwp.c:649
 649                             if (error == 0)
 (gdb) n
 651                             break;
 (gdb) 
 685             if (lid != 0) {
 (gdb) 
 694             p->p_nlwpwait--;
 (gdb) 
 695             l->l_waitingfor = 0;
 (gdb) 
 696             cv_broadcast(&p->p_lwpcv);
 (gdb) n
 698             return error;
 (gdb) 
 699     }
 (gdb) 

 Breakpoint 1, exit_lwps (l=0xc1f9b020) at /usr/src/sys/kern/kern_exit.c:637
 637                             goto retry;
 (gdb) 

From: "Christos Zoulas" <christos@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/53202 CVS commit: src/sys/kern
Date: Mon, 23 Apr 2018 11:51:00 -0400

 Module Name:	src
 Committed By:	christos
 Date:		Mon Apr 23 15:51:00 UTC 2018

 Modified Files:
 	src/sys/kern: kern_lwp.c

 Log Message:
 PR/kern/53202: Kernel hangs running t_ptrace_wait:resume1 test, revert
 previous.


 To generate a diff of this commit:
 cvs rdiff -u -r1.191 -r1.192 src/sys/kern/kern_lwp.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Taylor R Campbell <riastradh@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/53202: Kernel hangs running t_ptrace_wait:resume1 test
Date: Mon, 23 Apr 2018 19:28:41 +0000

 Does inserting a call to yield() after lwp_unsleep in sleepq_block
 change anything, if we restore the use of cv_wait_sig in lwp_wait?

From: Andreas Gustafsson <gson@gson.org>
To: Taylor R Campbell <riastradh@NetBSD.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/53202: Kernel hangs running t_ptrace_wait:resume1 test
Date: Wed, 25 Apr 2018 10:32:39 +0300

 Taylor R Campbell wrote:
 >  Does inserting a call to yield() after lwp_unsleep in sleepq_block
 >  change anything, 

 I have now run a test with the following patch, plus the two patches
 given in the original PR submission:

 Index: src/sys/kern/kern_sleepq.c
 ===================================================================
 RCS file: /bracket/repo/src/sys/kern/kern_sleepq.c,v
 retrieving revision 1.51
 diff -u -r1.51 kern_sleepq.c
 --- src/sys/kern/kern_sleepq.c	3 Jul 2016 14:24:58 -0000	1.51
 +++ src/sys/kern/kern_sleepq.c	24 Apr 2018 17:39:03 -0000
 @@ -257,6 +257,7 @@
  	if (early) {
  		/* lwp_unsleep() will release the lock */
  		lwp_unsleep(l, true);
 +                yield();
  	} else {
  		if (timo) {
  			callout_schedule(&l->l_timeout_ch, timo);

 > if we restore the use of cv_wait_sig in lwp_wait?

 Instead of applying a fourth patch to -current to restore the use of
 cv_wait_sig, I applied the above three patches to sources from CVS
 date 2018.04.15.00.19.23, when cv_wait_sig was still being used.

 I then built and tested an i386 debug build on my own testbed.
 The resume1 tests failed with timeouts, but the test suite now ran to
 completion without hanging, and the other tests that failed are ones
 I have also seen failing in other test runs around the same source
 date:

   kernel/t_timeleft:timeleft__lwp_park
   lib/libc/sys/t_ptrace_wait:resume1
   lib/libc/sys/t_ptrace_wait3:resume1
   lib/libc/sys/t_ptrace_wait4:resume1
   lib/libc/sys/t_ptrace_wait6:resume1
   lib/libc/sys/t_ptrace_waitid:resume1
   lib/libc/sys/t_ptrace_waitpid:resume1
   lib/librumphijack/t_tcpip:nfs_autoload
   usr.bin/cc/t_asan_poison:poison
   usr.bin/cc/t_asan_poison:poison_pic
   usr.bin/cc/t_asan_poison:poison_profile
   usr.bin/c++/t_asan_poison:poison
   usr.bin/c++/t_asan_poison:poison_pic
   usr.bin/c++/t_asan_poison:poison_profile

 -- 
 Andreas Gustafsson, gson@gson.org

From: Andreas Gustafsson <gson@gson.org>
To: Taylor R Campbell <riastradh@NetBSD.org>
Cc: gnats-bugs@NetBSD.org, christos@NetBSD.org
Subject: Re: kern/53202: Kernel hangs running t_ptrace_wait:resume1 test
Date: Thu, 26 Apr 2018 19:11:33 +0300

 Taylor,

 Earlier, you asked:
 >  Does inserting a call to yield() after lwp_unsleep in sleepq_block
 >  change anything, if we restore the use of cv_wait_sig in lwp_wait?

 Since this seems to be working, do you think it should be committed?
 -- 
 Andreas Gustafsson, gson@gson.org

From: Taylor R Campbell <campbell@mumble.net>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@NetBSD.org, christos@NetBSD.org, rmind@NetBSD.org
Subject: Re: kern/53202: Kernel hangs running t_ptrace_wait:resume1 test
Date: Thu, 26 Apr 2018 16:54:01 +0000

 > Date: Thu, 26 Apr 2018 19:11:33 +0300
 > From: Andreas Gustafsson <gson@gson.org>
 > 
 > Earlier, you asked:
 > >  Does inserting a call to yield() after lwp_unsleep in sleepq_block
 > >  change anything, if we restore the use of cv_wait_sig in lwp_wait?
 > 
 > Since this seems to be working, do you think it should be committed?

 Maybe.  My understanding of the constraints inside the sleepq and
 scheduler logic is limited.

 It may be more prudent to find why the change to cv_wait_sig worked
 around whatever the root cause of the problem with Go was, but I don't
 have any brilliant ideas about that.

 In particular, it smells like there is a missing wakeup in the lwp
 exit logic, perhaps owing to an obscure case of a signal delivery that
 happens to cause cv_wait_sig to return early.  But exactly where the
 wakeup needs to happen is unclear.  The condition that the relevant
 cv_wait is waiting for in the `exiting' case of lwp_wait is not
 obvious -- there's a whole string of things whose change might trigger
 it.

From: Andreas Gustafsson <gson@gson.org>
To: christos@NetBSD.org
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/53202: Kernel hangs running t_ptrace_wait:resume1 test
Date: Fri, 27 Apr 2018 15:55:19 +0300

 Taylor R Campbell wrote:
 >  It may be more prudent to find why the change to cv_wait_sig worked
 >  around whatever the root cause of the problem with Go was, but I don't
 >  have any brilliant ideas about that.
 >  
 >  In particular, it smells like there is a missing wakeup in the lwp
 >  exit logic, perhaps owing to an obscure case of a signal delivery that
 >  happens to cause cv_wait_sig to return early.  But exactly where the
 >  wakeup needs to happen is unclear.  The condition that the relevant
 >  cv_wait is waiting for in the `exiting' case of lwp_wait is not
 >  obvious -- there's a whole string of things whose change might trigger
 >  it.

 Christos - is there a PR for the problem with Go that was the
 motivation for kern_lwp.c 1.191?
 -- 
 Andreas Gustafsson, gson@gson.org

State-Changed-From-To: open->needs-pullups
State-Changed-By: gson@NetBSD.org
State-Changed-When: Wed, 02 May 2018 13:49:01 +0000
State-Changed-Why:
kern_lwp.c 1.192 works and should be pulled up


State-Changed-From-To: needs-pullups->pending-pullups
State-Changed-By: gson@NetBSD.org
State-Changed-When: Fri, 04 May 2018 07:15:13 +0000
State-Changed-Why:
Pullup request submitted.


From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/53202 CVS commit: [netbsd-8] src/sys/kern
Date: Mon, 14 May 2018 19:11:22 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Mon May 14 19:11:21 UTC 2018

 Modified Files:
 	src/sys/kern [netbsd-8]: kern_lwp.c

 Log Message:
 Pull up following revision(s) (requested by gson in ticket #805):

 	sys/kern/kern_lwp.c: revision 1.192

 PR/kern/53202: Kernel hangs running t_ptrace_wait:resume1 test, revert
 previous.


 To generate a diff of this commit:
 cvs rdiff -u -r1.189.2.1 -r1.189.2.2 src/sys/kern/kern_lwp.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: pending-pullups->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Mon, 14 May 2018 19:19:39 +0000
State-Changed-Why:
Pullup done.


>Unformatted:
