NetBSD Problem Report #55948

From gson@gson.org  Sat Jan 23 10:27:14 2021
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id DBE391A923C
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 23 Jan 2021 10:27:14 +0000 (UTC)
Message-Id: <20210123102710.9C39F253EDE@guava.gson.org>
Date: Sat, 23 Jan 2021 12:27:10 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: threadpool_job_cancelthrash test randomly fails
X-Send-Pr-Version: 3.95

>Number:         55948
>Category:       kern
>Synopsis:       threadpool_job_cancelthrash test randomly fails
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Jan 23 10:30:00 +0000 2021
>Closed-Date:    Thu Jan 28 18:49:24 +0000 2021
>Last-Modified:  Thu Jan 28 18:49:24 +0000 2021
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current, source date >= 2021.01.16.23.51.51
>Organization:
>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:

The threadpool_job_cancelthrash test case of the rump/rumpkern/t_threadpool
test program now randomly fails on my testbed running on real amd64 hardware.

Log output from one such failure is at:

  https://www.gson.org/netbsd/bugs/build/amd64-baremetal/2021/2021.01.17.21.56.20/test.html#rump_rumpkern_t_threadpool_threadpool_job_cancelthrash

It does not fail on the qemu/nvmm based TNF testbed.

I ran an automated bisection, running the test repeatedly to be sure
to catch the failure even though it happens only randomly, and this
identified the following as the commit where the problem started:

  2021.01.16.23.51.50 chs src/sys/arch/arm/arm/psci.c 1.5
  2021.01.16.23.51.50 chs src/sys/conf/files 1.1278
  2021.01.16.23.51.51 chs src/sys/lib/libkern/arch/hppa/bcopy.S 1.16
  2021.01.16.23.51.51 chs src/sys/lib/libkern/libkern.h 1.141
  2021.01.16.23.51.51 chs src/sys/sys/cdefs.h 1.156
  2021.01.16.23.51.51 chs src/sys/sys/queue.h 1.76

>How-To-Repeat:

cd /usr/tests/rump
for i in $(seq 100)
do
    if
        atf-run rumpkern >log
        ! egrep '^tc-end:.* threadpool_job_cancelthrash, passed' log
    then
        echo FAIL
        break
    fi
done

>Fix:

>Release-Note:

>Audit-Trail:
From: Taylor R Campbell <riastradh@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc: netbsd-bugs@NetBSD.org, gson@gson.org, chs@NetBSD.org
Subject: Re: kern/55948: threadpool_job_cancelthrash test randomly fails
Date: Sat, 23 Jan 2021 15:41:42 +0000

 This is a multi-part message in MIME format.
 --=_AOwXQ/EBfoe4E3bC1VjlNXMqYcZsjEER

 Try the attached patch?

 --=_AOwXQ/EBfoe4E3bC1VjlNXMqYcZsjEER
 Content-Type: text/plain; charset="ISO-8859-1"; name="threadpool"
 Content-Transfer-Encoding: quoted-printable
 Content-Disposition: attachment; filename="threadpool.patch"

 diff -r 859d0e5ac5a2 sys/kern/kern_threadpool.c
 --- a/sys/kern/kern_threadpool.c	Sun Jan 17 22:02:46 2021 +0000
 +++ b/sys/kern/kern_threadpool.c	Sat Jan 23 15:40:39 2021 +0000
 @@ -1041,7 +1041,7 @@ threadpool_dispatcher_thread(void *arg)
 =20
  		/* There are idle threads, so try giving one a job.  */
  		struct threadpool_job *const job =3D TAILQ_FIRST(&pool->tp_jobs);
 -		TAILQ_REMOVE(&pool->tp_jobs, job, job_entry);
 +
  		/*
  		 * Take an extra reference on the job temporarily so that
  		 * it won't disappear on us while we have both locks dropped.
 @@ -1053,6 +1053,7 @@ threadpool_dispatcher_thread(void *arg)
  		/* If the job was cancelled, we'll no longer be its thread.  */
  		if (__predict_true(job->job_thread =3D=3D dispatcher)) {
  			mutex_spin_enter(&pool->tp_lock);
 +			TAILQ_REMOVE(&pool->tp_jobs, job, job_entry);
  			if (__predict_false(
  				    TAILQ_EMPTY(&pool->tp_idle_threads))) {
  				/*

 --=_AOwXQ/EBfoe4E3bC1VjlNXMqYcZsjEER--

From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55948 CVS commit: src/sys/kern
Date: Sat, 23 Jan 2021 16:33:49 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Sat Jan 23 16:33:49 UTC 2021

 Modified Files:
 	src/sys/kern: kern_threadpool.c

 Log Message:
 threadpool(9): Fix synchronization between cancel and dispatch.

 - threadpool_cancel_job_async tried to prevent
   threadpool_dispatcher_thread from taking the job by setting
   job->job_thread = NULL and then removing the job from the queue.

 - But threadpool_cancel_job_async didn't notice job->job_thread is
   null until after it also removes the job from the queue =>
   double-remove, *boom*.

 The solution is to teach threadpool_dispatcher_thread to wait until
 it has acquired the job lock to test whether job->job_thread is still
 valid before it decides to remove the job from the queue.

 Fixes PR kern/55948.

 XXX pullup-9


 To generate a diff of this commit:
 cvs rdiff -u -r1.22 -r1.23 src/sys/kern/kern_threadpool.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->feedback
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Sat, 23 Jan 2021 16:36:35 +0000
State-Changed-Why:
fix committed


From: Andreas Gustafsson <gson@gson.org>
To: "Taylor R Campbell" <riastradh@netbsd.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: PR/55948 CVS commit: src/sys/kern
Date: Sun, 24 Jan 2021 12:14:27 +0200

 Taylor R Campbell wrote:
 >  threadpool(9): Fix synchronization between cancel and dispatch.
 [...]
 >  cvs rdiff -u -r1.22 -r1.23 src/sys/kern/kern_threadpool.c

 Confirmed fixed - thank you!  I was now able to run the test 100 times
 in a row without failures, and running the full test suite showed no
 new failures in other tests.
 -- 
 Andreas Gustafsson, gson@gson.org

State-Changed-From-To: feedback->needs-pullups
State-Changed-By: gson@NetBSD.org
State-Changed-When: Sun, 24 Jan 2021 10:22:42 +0000
State-Changed-Why:
Confirmed fixed, has a comment saying "XXX pullup-9".


From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55948 CVS commit: [netbsd-9] src/sys/kern
Date: Mon, 25 Jan 2021 14:12:50 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Mon Jan 25 14:12:50 UTC 2021

 Modified Files:
 	src/sys/kern [netbsd-9]: kern_threadpool.c

 Log Message:
 Pull up following revision(s) (requested by riastradh in ticket #1187):

 	sys/kern/kern_threadpool.c: revision 1.23

 threadpool(9): Fix synchronization between cancel and dispatch.
 - threadpool_cancel_job_async tried to prevent
   threadpool_dispatcher_thread from taking the job by setting
   job->job_thread = NULL and then removing the job from the queue.
 - But threadpool_cancel_job_async didn't notice job->job_thread is
   null until after it also removes the job from the queue =>
   double-remove, *boom*.

 The solution is to teach threadpool_dispatcher_thread to wait until
 it has acquired the job lock to test whether job->job_thread is still
 valid before it decides to remove the job from the queue.

 Fixes PR kern/55948.

 XXX pullup-9


 To generate a diff of this commit:
 cvs rdiff -u -r1.15 -r1.15.6.1 src/sys/kern/kern_threadpool.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: needs-pullups->closed
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Thu, 28 Jan 2021 18:49:24 +0000
State-Changed-Why:
fixed and pulled up


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.