NetBSD Problem Report #52892

From gson@gson.org  Wed Jan  3 15:45:52 2018
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 15D5B7A19A
	for <gnats-bugs@gnats.NetBSD.org>; Wed,  3 Jan 2018 15:45:52 +0000 (UTC)
Message-Id: <20180103154542.EFF759892B5@guava.gson.org>
Date: Wed,  3 Jan 2018 17:45:42 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: Tests hang on MIPS
X-Send-Pr-Version: 3.95

>Number:         52892
>Category:       port-mips
>Synopsis:       Tests hang on MIPS
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    port-mips-maintainer
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Jan 03 15:50:00 +0000 2018
>Closed-Date:    Fri Apr 27 12:17:46 +0000 2018
>Last-Modified:  Fri Apr 27 12:17:46 +0000 2018
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current, source date >= December 2017
>Organization:

>Environment:
System: NetBSD
Architecture: mips64el
Machine: pmax
>Description:

Neither the pmax nor the hpcmips test runs on babylon5.netbsd.org
managed to run to completion a single time during December 2017.
Although there have been several other issues causing the tests to
hang on other platforms, as discussed on current-users, there also
appears to be a separate issue that specifically affects mips and
appeared in late November or early December.

The hanging test case varies, but so far all of them seem to be
networking related in some way:

babylon5.netbsd.org$ cd /bracket/pmax/results/2017
babylon5.netbsd.org$ zgrep ': Traceback' 2017.12*/test.log.gz
2017.12.04.14.50.33/test.log.gz:    pktinfo_send_ifindex: Traceback (most recent call last):
2017.12.07.03.25.51/test.log.gz:    pktinfo_send_multicast: Traceback (most recent call last):
2017.12.07.10.22.04/test.log.gz:    pktinfo_send_ifindex: Traceback (most recent call last):
2017.12.09.16.00.19/test.log.gz:    read_random: Traceback (most recent call last):
2017.12.28.09.47.52/test.log.gz:    floodping2: Traceback (most recent call last):

babylon5.netbsd.org$ cd /bracket/hpcmips/results/2017
babylon5.netbsd.org$ zgrep ': Traceback' 2017.12*/test.log.gz
2017.12.04.14.50.33/test.log.gz:    pktinfo_send_ifindex: Traceback (most recent call last):
2017.12.09.16.00.19/test.log.gz:    nfs_create_many: Traceback (most recent call last):
2017.12.12.08.27.32/test.log.gz:    pktinfo_send_multicast: Traceback (most recent call last):
2017.12.14.06.29.15/test.log.gz:    pktinfo_send_ifindex: Traceback (most recent call last):
2017.12.22.02.36.46/test.log.gz:    nfs_create_many: Traceback (most recent call last):
2017.12.23.12.50.55/test.log.gz:    pktinfo_send_ifindex: Traceback (most recent call last):
2017.12.24.01.22.16/test.log.gz:    pktinfo_send_multicast: Traceback (most recent call last):
2017.12.25.06.39.00/test.log.gz:    floodping2: Traceback (most recent call last):
2017.12.26.08.30.58/test.log.gz:    pktinfo_send_ifindex: Traceback (most recent call last):
2017.12.28.09.47.52/test.log.gz:    pktinfo_send_ifindex: Traceback (most recent call last):
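
To see at a glance which test cases hang most often, the zgrep output
above can be tallied per test name.  The following is only a sketch:
the helper name tally_hangs is made up for illustration and is not
part of the b5 test setup, and it assumes the log lines keep the
"<dir>/test.log.gz:    <test>: Traceback" shape shown above.

```sh
# tally_hangs (hypothetical helper): read zgrep-style lines of the form
#   <dir>/test.log.gz:    <test>: Traceback (most recent call last):
# on stdin and print "<count> <test>" per test case, most frequent first.
tally_hangs() {
    # Split on a colon followed by optional spaces, so that field 2
    # is the test case name.
    awk -F': *' '/: Traceback/ { n[$2]++ }
                 END { for (t in n) print n[t], t }' |
        sort -k1,1nr -k2,2
}
```

It could be used from one of the results directories as:

  zgrep ': Traceback' 2017.12*/test.log.gz | tally_hangs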

>How-To-Repeat:

>Fix:

>Release-Note:

>Audit-Trail:
From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: christos@NetBSD.org
Subject: Re: port-mips/52892: Tests hang on MIPS
Date: Thu, 19 Apr 2018 15:40:41 +0300

 I have now looked into this some more, and have the following
 observations:

 1. The tests are still hanging every time on b5:

   http://releng.netbsd.org/b5reports/hpcmips/
   http://releng.netbsd.org/b5reports/pmax/

 2. I can reliably reproduce the hang on hpcmips under gxemul
 by running

   cd /usr/tests/net/icmp
   while true; do atf-run t_ping|atf-report; done

 On b5, this usually hangs within an hour, but on my own test machine,
 it took a day or two to hang.

 3. It's the kernel that's hanging, not just ATF.  For example, if I
 run the above test script in the background with output redirected to
 a log file, and "tail -f" the log file on the console, I'm unable to
 kill the tail process using the interrupt character after the test has
 hung.

 4. By bisection, I found that the hpcmips tests started hanging on b5
 at the time of the commit

   2017.12.02.22.51.22 christos src/sys/kern/kern_lwp.c 1.191

 Since this caused the tests to hang earlier in the test run, in the
 lib/libc/sys/t_ptrace_wait3:resume1 test rather than in the network
 related tests where it is now hanging, it was not immediately clear
 whether that commit also triggered the present issue.  However...

 5. If I revert that commit, the tests do run to completion on b5.
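
The reproduction loop from item 2, combined with the log redirection
described in item 3, could be wrapped roughly as follows.  This is a
sketch under assumptions, not the script actually used: the name
run_until_hang, its optional iteration limit, and the log path in the
usage lines are hypothetical.  Because the kernel itself hangs, the
loop cannot detect the hang; the last timestamped line in the log
merely shows how far it got.

```sh
# run_until_hang CMD LOG [MAX] (hypothetical helper):
# run CMD repeatedly, appending a timestamped, numbered line to LOG
# before each run, followed by the output of CMD.
# MAX limits the number of iterations; 0 or unset means loop forever.
run_until_hang() {
    cmd=$1 log=$2 max=${3:-0}
    i=0
    while [ "$max" -eq 0 ] || [ "$i" -lt "$max" ]; do
        i=$((i + 1))
        # Written before the run, so it survives a hang during the run.
        printf '%s iteration %d\n' \
            "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$i" >> "$log"
        sh -c "$cmd" >> "$log" 2>&1
    done
}
```

Possible usage, with a made-up log location:

  cd /usr/tests/net/icmp
  run_until_hang 'atf-run t_ping | atf-report' /var/tmp/t_ping.log &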
 -- 
 Andreas Gustafsson, gson@gson.org

From: christos@zoulas.com (Christos Zoulas)
To: Andreas Gustafsson <gson@gson.org>, gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-mips/52892: Tests hang on MIPS
Date: Thu, 19 Apr 2018 08:44:46 -0400

 On Apr 19,  3:40pm, gson@gson.org (Andreas Gustafsson) wrote:
 -- Subject: Re: port-mips/52892: Tests hang on MIPS

 | I have now looked into this some more, and have the following
 | observations:
 | 
 | 1. The tests are still hanging every time on b5:
 | 
 |   http://releng.netbsd.org/b5reports/hpcmips/
 |   http://releng.netbsd.org/b5reports/pmax/
 | 
 | 2. I can reliably reproduce the hang on hpcmips under gxemul
 | by running
 | 
 |   cd /usr/tests/net/icmp
 |   while true; do atf-run t_ping|atf-report; done
 | 
 | On b5, this usually hangs within an hour, but on my own test machine,
 | it took a day or two to hang.
 | 
 | 3. It's the kernel that's hanging, not just ATF.  For example, if I
 | run the above test script in the background with output redirected to
 | a log file, and "tail -f" the log file on the console, I'm unable to
 | kill the tail process using the interrupt character after the test has
 | hung.
 | 
 | 4. By bisection, I found that the hpcmips tests started hanging on b5
 | at the time of the commit
 | 
 |   2017.12.02.22.51.22 christos src/sys/kern/kern_lwp.c 1.191
 | 
 | Since this caused the tests to hang earlier in the test run, in the
 | lib/libc/sys/t_ptrace_wait3:resume1 test rather than in the network
 | related tests where it is now hanging, it was not immediately clear
 | whether that commit also triggered the present issue.  However...
 | 
 | 5. If I revert that commit, the tests do run to completion on b5.

 That's strange, because the change was put there to prevent a hang (in go).
 I.e. sleep interruptively instead of sleep with signals blocked.
 There must be some other race that's causing it. Can you try to always
 set errno to EAGAIN after cv_wait_sig() returns?

 christos

From: Andreas Gustafsson <gson@gson.org>
To: christos@zoulas.com (Christos Zoulas)
Cc: gnats-bugs@NetBSD.org
Subject: Re: port-mips/52892: Tests hang on MIPS
Date: Thu, 19 Apr 2018 16:40:11 +0300

 Christos Zoulas wrote:
 > That's strange, because the change was put there to prevent a hang (in go).
 > I.e. sleep interruptively instead of sleep with signals blocked.
 > There must be some other race that's causing it. Can you try to always
 > set errno to EAGAIN after cv_wait_sig() returns?

 Yes, but I also have a faster and more portable (not MIPS specific)
 way of reproducing the problem, so I'd rather report that as a
 separate PR first and use that method for the test.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Kamil Rytarowski <n54@gmx.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-mips/52892: Tests hang on MIPS
Date: Thu, 19 Apr 2018 15:43:09 +0200

 On 19.04.2018 14:50, Christos Zoulas wrote:
 > The following reply was made to PR port-mips/52892; it has been noted by GNATS.
 >
 > From: christos@zoulas.com (Christos Zoulas)
 > To: Andreas Gustafsson <gson@gson.org>, gnats-bugs@NetBSD.org
 > Cc:
 > Subject: Re: port-mips/52892: Tests hang on MIPS
 > Date: Thu, 19 Apr 2018 08:44:46 -0400
 >
 >  On Apr 19,  3:40pm, gson@gson.org (Andreas Gustafsson) wrote:

 >  | 4. By bisection, I found that the hpcmips tests started hanging on b5
 >  | at the time of the commit
 >  |
 >  |   2017.12.02.22.51.22 christos src/sys/kern/kern_lwp.c 1.191
 >  |
 >  | Since this caused the tests to hang earlier in the test run, in the
 >  | lib/libc/sys/t_ptrace_wait3:resume1 test rather than in the network
 >  | related tests where it is now hanging, it was not immediately clear
 >  | whether that commit also triggered the present issue.  However...
 >  |
 >  | 5. If I revert that commit, the tests do run to completion on b5.
 >
 >  That's strange, because the change was put there to prevent a hang (in go).
 >  I.e. sleep interruptively instead of sleep with signals blocked.
 >  There must be some other race that's causing it. Can you try to always
 >  set errno to EAGAIN after cv_wait_sig() returns?
 >
 >  christos
 >
 >

 Regarding resume1 - please disable it. It's already done this way on
 -current and -8.

 I plan to work on threads once I am done with forks/vforks and signals.



From: Andreas Gustafsson <gson@gson.org>
To: Kamil Rytarowski <n54@gmx.com>
Cc: gnats-bugs@NetBSD.org, christos@NetBSD.org
Subject: Re: port-mips/52892: Tests hang on MIPS
Date: Thu, 19 Apr 2018 18:09:34 +0300

 Kamil Rytarowski wrote:
 >  Regarding resume1 - please disable it. It's already done this way on
 >  -current and -8.

 There's no third place to disable it, is there?  I was referring to
 test results of historical versions of -current from when it had not
 yet been disabled.
 -- 
 Andreas Gustafsson, gson@gson.org

From: christos@zoulas.com (Christos Zoulas)
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: port-mips/52892: Tests hang on MIPS
Date: Thu, 19 Apr 2018 13:38:48 -0400

 On Apr 19,  4:40pm, gson@gson.org (Andreas Gustafsson) wrote:
 -- Subject: Re: port-mips/52892: Tests hang on MIPS

 | Christos Zoulas wrote:
 | > That's strange, because the change was put there to prevent a hang (in go).
 | > I.e. sleep interruptively instead of sleep with signals blocked.
 | > There must be some other race that's causing it. Can you try to always
 | > set errno to EAGAIN after cv_wait_sig() returns?
 | 
 | Yes, but I also have a faster and more portable (not MIPS specific)
 | way of reproducing the problem, so I'd rather report that as a
 | separate PR first and use that method for the test.

 Great, looking forward to it!

 christos

From: Andreas Gustafsson <gson@gson.org>
To: christos@zoulas.com (Christos Zoulas)
Cc: gnats-bugs@NetBSD.org
Subject: Re: port-mips/52892: Tests hang on MIPS
Date: Sun, 22 Apr 2018 17:29:13 +0300

 Christos Zoulas wrote:
 > Can you try to always set errno to EAGAIN after cv_wait_sig()
 > returns?

 My other method of reproducing the hang took a bit longer than
 expected, so in the meantime, I ended up running this test as
 requested after all.  I applied the following patch to
 2018.04.20.11.31.54 sources:

 --- src/sys/kern/kern_lwp.c.orig	2017-12-02 22:51:22.000000000 +0000
 +++ src/sys/kern/kern_lwp.c	2018-04-22 14:19:14.000000000 +0000
 @@ -646,8 +646,7 @@
  		if (exiting) {
  			KASSERT(p->p_nlwps > 1);
  			error = cv_wait_sig(&p->p_lwpcv, p->p_lock);
 -			if (error == 0)
 -				error = EAGAIN;
 +			error = EAGAIN;
  			break;
  		}

 With this patch, the hang still occurs, so it looks like the problem
 is with the part of kern_lwp.c 1.191 that replaces cv_wait() by
 cv_wait_sig(), not the part that changes the errno returned.
 -- 
 Andreas Gustafsson, gson@gson.org

From: christos@zoulas.com (Christos Zoulas)
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: port-mips/52892: Tests hang on MIPS
Date: Sun, 22 Apr 2018 12:13:01 -0400

 On Apr 22,  5:29pm, gson@gson.org (Andreas Gustafsson) wrote:
 -- Subject: Re: port-mips/52892: Tests hang on MIPS

 | Christos Zoulas wrote:
 | > Can you try to always set errno to EAGAIN after cv_wait_sig()
 | > returns?
 | 
 | My other method of reproducing the hang took a bit longer than
 | expected, so in the meantime, I ended up running this test as
 | requested after all.  I applied the following patch to
 | 2018.04.20.11.31.54 sources:
 | 
 | --- src/sys/kern/kern_lwp.c.orig	2017-12-02 22:51:22.000000000 +0000
 | +++ src/sys/kern/kern_lwp.c	2018-04-22 14:19:14.000000000 +0000
 | @@ -646,8 +646,7 @@
 |  		if (exiting) {
 |  			KASSERT(p->p_nlwps > 1);
 |  			error = cv_wait_sig(&p->p_lwpcv, p->p_lock);
 | -			if (error == 0)
 | -				error = EAGAIN;
 | +			error = EAGAIN;
 |  			break;
 |  		}
 |  
 | With this patch, the hang still occurs, so it looks like the problem
 | is with the part of kern_lwp.c 1.191 that replaces cv_wait() by
 | cv_wait_sig(), not the part that changes the errno returned.

 How about if we make cv_wait_sig look like cv_wait?
 But that might break go again..

 christos

 Index: kern_condvar.c
 ===================================================================
 RCS file: /cvsroot/src/sys/kern/kern_condvar.c,v
 retrieving revision 1.41
 diff -u -r1.41 kern_condvar.c
 --- kern_condvar.c	30 Jan 2018 07:52:22 -0000	1.41
 +++ kern_condvar.c	22 Apr 2018 16:12:19 -0000
 @@ -268,9 +268,10 @@

  	KASSERT(mutex_owned(mtx));

 -	cv_enter(cv, mtx, l);
 +	CV_LOCKDEBUG_HANDOFF(l, cv);
  	error = sleepq_block(0, true);
 -	return cv_exit(cv, mtx, l, error);
 +	mutex_enter(mtx);
 +	return error;
  }

  /*

From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@NetBSD.org
Cc: port-mips-maintainer@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org, gson@gson.org (Andreas Gustafsson)
Subject: re: port-mips/52892: Tests hang on MIPS
Date: Mon, 23 Apr 2018 02:57:10 +1000

 >  Index: kern_condvar.c
 >  ===================================================================
 >  RCS file: /cvsroot/src/sys/kern/kern_condvar.c,v
 >  retrieving revision 1.41
 >  diff -u -r1.41 kern_condvar.c
 >  --- kern_condvar.c	30 Jan 2018 07:52:22 -0000	1.41
 >  +++ kern_condvar.c	22 Apr 2018 16:12:19 -0000
 >  @@ -268,9 +268,10 @@
 >   
 >   	KASSERT(mutex_owned(mtx));
 >   
 >  -	cv_enter(cv, mtx, l);
 >  +	CV_LOCKDEBUG_HANDOFF(l, cv);
 >   	error = sleepq_block(0, true);
 >  -	return cv_exit(cv, mtx, l, error);
 >  +	mutex_enter(mtx);
 >  +	return error;
 >   }

 this can't be right.  you've removed cv_enter() entirely,
 and the cv_exit() only means we don't potentially cv_signal()
 something.  perhaps this is proc_lock being held issue, ie,
 is that the only place it could hang here?


 .mrg.

From: Andreas Gustafsson <gson@gson.org>
To: christos@zoulas.com (Christos Zoulas)
Cc: gnats-bugs@NetBSD.org
Subject: Re: port-mips/52892: Tests hang on MIPS
Date: Sun, 22 Apr 2018 20:17:06 +0300

 Christos Zoulas wrote:
 > | Yes, but I also have a faster and more portable (not MIPS specific)
 > | way of reproducing the problem, so I'd rather report that as a
 > | separate PR first and use that method for the test.
 > 
 > Great, looking forward to it!

 kern/53202
 -- 
 Andreas Gustafsson, gson@gson.org

From: christos@zoulas.com (Christos Zoulas)
To: matthew green <mrg@eterna.com.au>, gnats-bugs@NetBSD.org
Cc: port-mips-maintainer@netbsd.org, gnats-admin@netbsd.org, 
	netbsd-bugs@netbsd.org, gson@gson.org (Andreas Gustafsson)
Subject: re: port-mips/52892: Tests hang on MIPS
Date: Sun, 22 Apr 2018 14:38:21 -0400

 On Apr 23,  2:57am, mrg@eterna.com.au (matthew green) wrote:
 -- Subject: re: port-mips/52892: Tests hang on MIPS

 | >  Index: kern_condvar.c
 | >  ===================================================================
 | >  RCS file: /cvsroot/src/sys/kern/kern_condvar.c,v
 | >  retrieving revision 1.41
 | >  diff -u -r1.41 kern_condvar.c
 | >  --- kern_condvar.c	30 Jan 2018 07:52:22 -0000	1.41
 | >  +++ kern_condvar.c	22 Apr 2018 16:12:19 -0000
 | >  @@ -268,9 +268,10 @@
 | >   
 | >   	KASSERT(mutex_owned(mtx));
 | >   
 | >  -	cv_enter(cv, mtx, l);
 | >  +	CV_LOCKDEBUG_HANDOFF(l, cv);
 | >   	error = sleepq_block(0, true);
 | >  -	return cv_exit(cv, mtx, l, error);
 | >  +	mutex_enter(mtx);
 | >  +	return error;
 | >   }
 | 
 | this can't be right.  you've removed cv_enter() entirely,
 | and the cv_exit() only means we don't potentially cv_signal()
 | something.  perhaps this is proc_lock being held issue, ie,
 | is that the only place it could hang here?

 Then cv_wait() is wrong too, because it is the same code; except sleepq_block
 is called with false.

 christos

From: coypu@sdf.org
To: gnats-bugs@netbsd.org
Cc: 
Subject: re: port-mips/52892: Tests hang on MIPS
Date: Sun, 22 Apr 2018 19:28:51 +0000

 Hi,

 This also touches kern/53053: non-MULTIPROCESSOR hangs building lang/go
 (it hangs on the same place).

From: matthew green <mrg@eterna.com.au>
To: christos@zoulas.com (Christos Zoulas)
Cc: port-mips-maintainer@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org, gson@gson.org (Andreas Gustafsson),
    gnats-bugs@NetBSD.org
Subject: re: port-mips/52892: Tests hang on MIPS
Date: Mon, 23 Apr 2018 07:11:40 +1000

 > | >  Index: kern_condvar.c
 > | >  ===================================================================
 > | >  RCS file: /cvsroot/src/sys/kern/kern_condvar.c,v
 > | >  retrieving revision 1.41
 > | >  diff -u -r1.41 kern_condvar.c
 > | >  --- kern_condvar.c	30 Jan 2018 07:52:22 -0000	1.41
 > | >  +++ kern_condvar.c	22 Apr 2018 16:12:19 -0000
 > | >  @@ -268,9 +268,10 @@
 > | >
 > | >   	KASSERT(mutex_owned(mtx));
 > | >
 > | >  -	cv_enter(cv, mtx, l);
 > | >  +	CV_LOCKDEBUG_HANDOFF(l, cv);
 > | >   	error = sleepq_block(0, true);
 > | >  -	return cv_exit(cv, mtx, l, error);
 > | >  +	mutex_enter(mtx);
 > | >  +	return error;
 > | >   }
 > |
 > | this can't be right.  you've removed cv_enter() entirely,
 > | and the cv_exit() only means we don't potentially cv_signal()
 > | something.  perhaps this is proc_lock being held issue, ie,
 > | is that the only place it could hang here?
 >
 > Then cv_wait() is wrong too, because it is the same code; except sleepq_block
 > is called with false.

 what cv_wait() are you looking at?  mine shows:

 235 void
 236 cv_wait(kcondvar_t *cv, kmutex_t *mtx)
 237 {
 238         lwp_t *l = curlwp;
 239
 240         KASSERT(mutex_owned(mtx));
 241
 242         cv_enter(cv, mtx, l);
 243
 244         /*
 245          * We can't use cv_exit() here since the cv might be destroyed before
 246          * this thread gets a chance to run.  Instead, hand off the lockdebug
 247          * responsibility to the thread that wakes us up.
 248          */
 249
 250         CV_LOCKDEBUG_HANDOFF(l, cv);
 251         (void)sleepq_block(0, false);
 252         mutex_enter(mtx);
 253 }

 note L242.

 did you actually test this change? :)


 .mrg.

From: christos@zoulas.com (Christos Zoulas)
To: matthew green <mrg@eterna.com.au>
Cc: port-mips-maintainer@netbsd.org, gnats-admin@netbsd.org, 
	netbsd-bugs@netbsd.org, gson@gson.org (Andreas Gustafsson), 
	gnats-bugs@NetBSD.org
Subject: re: port-mips/52892: Tests hang on MIPS
Date: Sun, 22 Apr 2018 17:32:37 -0400

 On Apr 23,  7:11am, mrg@eterna.com.au (matthew green) wrote:
 -- Subject: re: port-mips/52892: Tests hang on MIPS

 | > | >  Index: kern_condvar.c
 | > | >  ===================================================================
 | > | >  RCS file: /cvsroot/src/sys/kern/kern_condvar.c,v
 | > | >  retrieving revision 1.41
 | > | >  diff -u -r1.41 kern_condvar.c
 | > | >  --- kern_condvar.c	30 Jan 2018 07:52:22 -0000	1.41
 | > | >  +++ kern_condvar.c	22 Apr 2018 16:12:19 -0000
 | > | >  @@ -268,9 +268,10 @@
 | > | >
 | > | >   	KASSERT(mutex_owned(mtx));
 | > | >
 | > | >  -	cv_enter(cv, mtx, l);
 | > | >  +	CV_LOCKDEBUG_HANDOFF(l, cv);
 | > | >   	error = sleepq_block(0, true);
 | > | >  -	return cv_exit(cv, mtx, l, error);
 | > | >  +	mutex_enter(mtx);
 | > | >  +	return error;
 | > | >   }
 | > |
 | > | this can't be right.  you've removed cv_enter() entirely,
 | > | and the cv_exit() only means we don't potentially cv_signal()
 | > | something.  perhaps this is proc_lock being held issue, ie,
 | > | is that the only place it could hang here?
 | >
 | > Then cv_wait() is wrong too, because it is the same code; except sleepq_block
 | > is called with false.
 | 
 | what cv_wait() are you looking at?  mine shows:
 | 
 | 235 void
 | 236 cv_wait(kcondvar_t *cv, kmutex_t *mtx)
 | 237 {
 | 238         lwp_t *l = curlwp;
 | 239
 | 240         KASSERT(mutex_owned(mtx));
 | 241
 | 242         cv_enter(cv, mtx, l);
 | 243
 | 244         /*
 | 245          * We can't use cv_exit() here since the cv might be destroyed before
 | 246          * this thread gets a chance to run.  Instead, hand off the lockdebug
 | 247          * responsibility to the thread that wakes us up.
 | 248          */
 | 249
 | 250         CV_LOCKDEBUG_HANDOFF(l, cv);
 | 251         (void)sleepq_block(0, false);
 | 252         mutex_enter(mtx);
 | 253 }
 | 
 | note L242.
 | 
 | did you actually test this change? :)

 No, it needs the cv_enter...


 christos

State-Changed-From-To: open->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Fri, 27 Apr 2018 12:17:46 +0000
State-Changed-Why:
hpcmips tests no longer hang with src/sys/kern/kern_lwp.c 1.192.
See PR 53202 for analysis.


>Unformatted:

Copyright © 1994-2017 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.