NetBSD Problem Report #55948
From gson@gson.org Sat Jan 23 10:27:14 2021
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id DBE391A923C
for <gnats-bugs@gnats.NetBSD.org>; Sat, 23 Jan 2021 10:27:14 +0000 (UTC)
Message-Id: <20210123102710.9C39F253EDE@guava.gson.org>
Date: Sat, 23 Jan 2021 12:27:10 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: threadpool_job_cancelthrash test randomly fails
X-Send-Pr-Version: 3.95
>Number: 55948
>Category: kern
>Synopsis: threadpool_job_cancelthrash test randomly fails
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: kern-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Jan 23 10:30:00 +0000 2021
>Closed-Date: Thu Jan 28 18:49:24 +0000 2021
>Last-Modified: Thu Jan 28 18:49:24 +0000 2021
>Originator: Andreas Gustafsson
>Release: NetBSD-current, source date >= 2021.01.16.23.51.51
>Organization:
>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:
The threadpool_job_cancelthrash test case of the rump/rumpkern/t_threadpool
test program now randomly fails on my testbed running on real amd64 hardware.
Log output from one such failure is at:
https://www.gson.org/netbsd/bugs/build/amd64-baremetal/2021/2021.01.17.21.56.20/test.html#rump_rumpkern_t_threadpool_threadpool_job_cancelthrash
It does not fail on the qemu/nvmm based TNF testbed.
I ran an automated bisection, running the test repeatedly to be sure
to catch the failure even though it happens only randomly, and this
identified the following as the commit where the problem started:
2021.01.16.23.51.50 chs src/sys/arch/arm/arm/psci.c 1.5
2021.01.16.23.51.50 chs src/sys/conf/files 1.1278
2021.01.16.23.51.51 chs src/sys/lib/libkern/arch/hppa/bcopy.S 1.16
2021.01.16.23.51.51 chs src/sys/lib/libkern/libkern.h 1.141
2021.01.16.23.51.51 chs src/sys/sys/cdefs.h 1.156
2021.01.16.23.51.51 chs src/sys/sys/queue.h 1.76
>How-To-Repeat:
cd /usr/tests/rump
for i in $(seq 100)
do
if
atf-run rumpkern >log
! egrep '^tc-end:.* threadpool_job_cancelthrash, passed' log
then
echo FAIL
break
fi
done
>Fix:
>Release-Note:
>Audit-Trail:
From: Taylor R Campbell <riastradh@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc: netbsd-bugs@NetBSD.org, gson@gson.org, chs@NetBSD.org
Subject: Re: kern/55948: threadpool_job_cancelthrash test randomly fails
Date: Sat, 23 Jan 2021 15:41:42 +0000
This is a multi-part message in MIME format.
--=_AOwXQ/EBfoe4E3bC1VjlNXMqYcZsjEER
Try the attached patch?
--=_AOwXQ/EBfoe4E3bC1VjlNXMqYcZsjEER
Content-Type: text/plain; charset="ISO-8859-1"; name="threadpool"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename="threadpool.patch"
diff -r 859d0e5ac5a2 sys/kern/kern_threadpool.c
--- a/sys/kern/kern_threadpool.c Sun Jan 17 22:02:46 2021 +0000
+++ b/sys/kern/kern_threadpool.c Sat Jan 23 15:40:39 2021 +0000
@@ -1041,7 +1041,7 @@ threadpool_dispatcher_thread(void *arg)
=20
/* There are idle threads, so try giving one a job. */
struct threadpool_job *const job =3D TAILQ_FIRST(&pool->tp_jobs);
- TAILQ_REMOVE(&pool->tp_jobs, job, job_entry);
+
/*
* Take an extra reference on the job temporarily so that
* it won't disappear on us while we have both locks dropped.
@@ -1053,6 +1053,7 @@ threadpool_dispatcher_thread(void *arg)
/* If the job was cancelled, we'll no longer be its thread. */
if (__predict_true(job->job_thread =3D=3D dispatcher)) {
mutex_spin_enter(&pool->tp_lock);
+ TAILQ_REMOVE(&pool->tp_jobs, job, job_entry);
if (__predict_false(
TAILQ_EMPTY(&pool->tp_idle_threads))) {
/*
--=_AOwXQ/EBfoe4E3bC1VjlNXMqYcZsjEER--
From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/55948 CVS commit: src/sys/kern
Date: Sat, 23 Jan 2021 16:33:49 +0000
Module Name: src
Committed By: riastradh
Date: Sat Jan 23 16:33:49 UTC 2021
Modified Files:
src/sys/kern: kern_threadpool.c
Log Message:
threadpool(9): Fix synchronization between cancel and dispatch.
- threadpool_cancel_job_async tried to prevent
threadpool_dispatcher_thread from taking the job by setting
job->job_thread = NULL and then removing the job from the queue.
- But threadpool_cancel_job_async didn't notice job->job_thread is
null until after it also removes the job from the queue =>
double-remove, *boom*.
The solution is to teach threadpool_dispatcher_thread to wait until
it has acquired the job lock to test whether job->job_thread is still
valid before it decides to remove the job from the queue.
Fixes PR kern/55948.
XXX pullup-9
To generate a diff of this commit:
cvs rdiff -u -r1.22 -r1.23 src/sys/kern/kern_threadpool.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->feedback
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Sat, 23 Jan 2021 16:36:35 +0000
State-Changed-Why:
fix committed
From: Andreas Gustafsson <gson@gson.org>
To: "Taylor R Campbell" <riastradh@netbsd.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: PR/55948 CVS commit: src/sys/kern
Date: Sun, 24 Jan 2021 12:14:27 +0200
Taylor R Campbell wrote:
> threadpool(9): Fix synchronization between cancel and dispatch.
[...]
> cvs rdiff -u -r1.22 -r1.23 src/sys/kern/kern_threadpool.c
Confirmed fixed - thank you! I was now able to run the test 100 times
in a row without failures, and running the full test suite showed no
new failures in other tests.
--
Andreas Gustafsson, gson@gson.org
State-Changed-From-To: feedback->needs-pullups
State-Changed-By: gson@NetBSD.org
State-Changed-When: Sun, 24 Jan 2021 10:22:42 +0000
State-Changed-Why:
Confirmed fixed, has a comment saying "XXX pullup-9".
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/55948 CVS commit: [netbsd-9] src/sys/kern
Date: Mon, 25 Jan 2021 14:12:50 +0000
Module Name: src
Committed By: martin
Date: Mon Jan 25 14:12:50 UTC 2021
Modified Files:
src/sys/kern [netbsd-9]: kern_threadpool.c
Log Message:
Pull up following revision(s) (requested by riastradh in ticket #1187):
sys/kern/kern_threadpool.c: revision 1.23
threadpool(9): Fix synchronization between cancel and dispatch.
- threadpool_cancel_job_async tried to prevent
threadpool_dispatcher_thread from taking the job by setting
job->job_thread = NULL and then removing the job from the queue.
- But threadpool_cancel_job_async didn't notice job->job_thread is
null until after it also removes the job from the queue =>
double-remove, *boom*.
The solution is to teach threadpool_dispatcher_thread to wait until
it has acquired the job lock to test whether job->job_thread is still
valid before it decides to remove the job from the queue.
Fixes PR kern/55948.
XXX pullup-9
To generate a diff of this commit:
cvs rdiff -u -r1.15 -r1.15.6.1 src/sys/kern/kern_threadpool.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: needs-pullups->closed
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Thu, 28 Jan 2021 18:49:24 +0000
State-Changed-Why:
fixed and pulled up
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.