NetBSD Problem Report #56414
From Manuel.Bouyer@lip6.fr Tue Sep 21 14:59:58 2021
Return-Path: <Manuel.Bouyer@lip6.fr>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 102F21A921F
for <gnats-bugs@gnats.NetBSD.org>; Tue, 21 Sep 2021 14:59:58 +0000 (UTC)
Message-Id: <20210921145943.44E676F98@armandeche.soc.lip6.fr>
Date: Tue, 21 Sep 2021 16:59:43 +0200 (MEST)
From: Manuel.Bouyer@lip6.fr
Reply-To: Manuel.Bouyer@lip6.fr
To: gnats-bugs@NetBSD.org
Subject: cmake hang on kqueue (condvar waiter list issue) ?
X-Send-Pr-Version: 3.95
>Number: 56414
>Category: kern
>Synopsis: cmake hang on kqueue
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: kern-bug-people
>State: feedback
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Sep 21 15:00:00 +0000 2021
>Closed-Date:
>Last-Modified: Sun Jul 10 02:06:26 +0000 2022
>Originator: Manuel Bouyer
>Release: NetBSD 9.99.77
>Organization:
>Environment:
System: NetBSD amd64-nb9.netbsd.org 9.99.77 NetBSD 9.99.77 (amd64-nb9-PVH) #3: Sun Jan 10 08:56:13 UTC 2021 spz@franklin.NetBSD.org:/home/netbsd/current/amd64/obj/sys/arch/amd64/compile/amd64-nb9-PVH amd64
Architecture: x86_64
Machine: amd64
>Description:
In a full pbulk build, I see cmake processes hanging on kqueue,
stalling the build. Usually, a kill -STOP/kill -CONT on the
cmake process wakes it up, and the package build completes without
issue. This occurs several times in a pbulk run.
This has been discussed several times on NetBSD's lists,
as in
http://mail-index.netbsd.org/tech-kern/2021/01/12/msg027056.html
https://mail-index.netbsd.org/current-users/2021/04/06/msg040695.html
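The kill -STOP/kill -CONT workaround described above can also be driven
from a small C helper, for example from a watchdog that monitors the
bulk build. This is only a minimal sketch: the function name nudge and
the rejection of pid <= 0 are assumptions made for illustration and are
not part of the original report. It uses nothing beyond kill(2) with
SIGSTOP and SIGCONT.

```c
#include <sys/types.h>

#include <errno.h>
#include <signal.h>

/*
 * Hypothetical helper: send SIGSTOP followed by SIGCONT to pid,
 * mirroring "kill -STOP pid; kill -CONT pid".
 *
 * Returns 0 on success, -1 with errno set on failure.  pid <= 0 is
 * rejected (EINVAL) because kill(2) would treat it as a process
 * group or broadcast target rather than a single process.
 */
int
nudge(pid_t pid)
{

	if (pid <= 0) {
		errno = EINVAL;
		return -1;
	}
	if (kill(pid, SIGSTOP) == -1)
		return -1;
	if (kill(pid, SIGCONT) == -1)
		return -1;
	return 0;
}
```

A watchdog could call nudge(pid) on a cmake process that has made no
progress for some time. This only automates the workaround; it does not
explain why the stop/continue cycle unwedges the process.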
>How-To-Repeat:
Run and monitor a full pbulk build on a -current host.
>Fix:
>Release-Note:
>Audit-Trail:
From: Joerg Sonnenberger <joerg@bec.de>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/56414: cmake hang on kqueue (condvar waiter list issue) ?
Date: Wed, 22 Sep 2021 01:31:52 +0200
This is not a kernel issue, but a corruption of the wait list in
userland.
Joerg
From: Manuel Bouyer <manuel.bouyer@lip6.fr>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/56414: cmake hang on kqueue (condvar waiter list issue) ?
Date: Wed, 22 Sep 2021 06:33:09 +0200
> This is not a kernel issue, but a corruption of the wait list in
> userland.
Ha, interesting. The userland is 9.0_STABLE/amd64 from
Jun 11 22:49:34 UTC 2020. Any idea if a fix is available in netbsd-9 ?
--
Manuel Bouyer, LIP6, Universite Paris VI. Manuel.Bouyer@lip6.fr
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Joerg Sonnenberger <joerg@bec.de>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org, Manuel.Bouyer@lip6.fr
Subject: Re: kern/56414: cmake hang on kqueue (condvar waiter list issue) ?
Date: Wed, 22 Sep 2021 15:45:45 +0200
> > This is not a kernel issue, but a corruption of the wait list in
> > userland.
>
> Ha, interesting. The userland is 9.0_STABLE/amd64 from
> Jun 11 22:49:34 UTC 2020. Any idea if a fix is available in netbsd-9 ?
There is none.
Joerg
From: Taylor R Campbell <riastradh@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc: Manuel.Bouyer@lip6.fr, joerg@bec.de, wiz@NetBSD.org, mlelstv@NetBSD.org
Subject: Re: kern/56414: cmake hang on kqueue (condvar waiter list issue) ?
Date: Sat, 9 Apr 2022 20:35:10 +0000
This is a multi-part message in MIME format.
--=_kfI//mdIC0kx2NGNjppVUyLgYbxCgVBW
Does this still manifest on HEAD? Do you have a machine running a
current kernel where you've seen the cmake hang? Can you run this
program on it and see if it fails?
Adjust N to be the (even) number of CPUs you have. It'll check for
progress every second and raise SIGABRT if not -- you can run it under
gdb to catch the SIGABRT or examine the core dump later if you like.
Took about an hour to wedge the first time on my 12-core/24-thread
Sandy Bridge system (but that's running netbsd-9 and everything
changed with pthread_cond.c in HEAD so it's not as useful as I hoped).
--=_kfI//mdIC0kx2NGNjppVUyLgYbxCgVBW
Content-Type: text/plain; charset="ISO-8859-1"; name="nbcv"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename="nbcv.c"
#include <sys/param.h>

#include <err.h>
#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

enum { N = 12 };

pthread_mutex_t mutex[N];
pthread_cond_t cond[N];
pthread_t sleeper[N/2];
pthread_t waker[N/2];

struct {
	volatile unsigned v;
} __aligned(COHERENCY_UNIT) ticker[N/2];

static void
lock(unsigned i)
{
	int error;

	error = pthread_mutex_lock(&mutex[i]);
	if (error)
		errc(1, error, "pthread_mutex_lock");
}

static void
unlock(unsigned i)
{
	int error;

	error = pthread_mutex_unlock(&mutex[i]);
	if (error)
		errc(1, error, "pthread_mutex_unlock");
}

static void
wait(unsigned i)
{
	int error;

	error = pthread_cond_wait(&cond[i], &mutex[i]);
	if (error)
		errc(1, error, "pthread_cond_wait");
}

static void
wake_one(unsigned i)
{
	int error;

	error = pthread_cond_signal(&cond[i]);
	if (error)
		errc(1, error, "pthread_cond_signal");
}

static void __unused
wake_all(unsigned i)
{
	int error;

	error = pthread_cond_broadcast(&cond[i]);
	if (error)
		errc(1, error, "pthread_cond_broadcast");
}

static void *
start_sleeper(void *cookie)
{
	unsigned t = (unsigned)(uintptr_t)cookie;
	unsigned i;

	for (i = 0;; i++, i %= N) {
		lock(i);
		wait(i);
		unlock(i);
		ticker[t].v++;
	}
	__unreachable();
}

static void *
start_waker(void *cookie)
{
	unsigned i;

	(void)cookie;
	for (i = 0;; i++, i %= N) {
		lock(i);
		wake_one(i);
		unlock(i);
	}
	__unreachable();
}

int
main(void)
{
	uint64_t c = 0;
	unsigned tickercache[N/2] = {0};
	unsigned i, tmp;
	int error;

	for (i = 0; i < N; i++) {
		error = pthread_mutex_init(&mutex[i], NULL);
		if (error)
			errc(1, error, "pthread_mutex_init");
		error = pthread_cond_init(&cond[i], NULL);
		if (error)
			errc(1, error, "pthread_cond_init");
	}

	for (i = 0; i < N/2; i++) {
		error = pthread_create(&sleeper[i], NULL, &start_sleeper,
		    (void *)(uintptr_t)i);
		if (error)
			errc(1, error, "pthread_create sleeper");
		error = pthread_create(&waker[i], NULL, &start_waker,
		    NULL);
		if (error)
			errc(1, error, "pthread_create waker");
	}

	setlinebuf(stdout);
	for (;;) {
		sleep(1);
		c = 0;
		for (i = 0; i < N/2; i++) {
			if ((tmp = ticker[i].v) != tickercache[i]) {
				c += (tmp - tickercache[i]);
				tickercache[i] = tmp;
			} else {
				printf("thread %u wedged\n", i);
				raise(SIGABRT);
			}
		}
		printf("%"PRIu64" wakeups\n", c);
	}

	return 0;
}
--=_kfI//mdIC0kx2NGNjppVUyLgYbxCgVBW--
From: Thomas Klausner <wiz@NetBSD.org>
To: NetBSD bugtracking <gnats-bugs@NetBSD.org>
Cc:
Subject: Re: kern/56414: cmake hang on kqueue (condvar waiter list issue) ?
Date: Wed, 13 Apr 2022 01:42:30 +0200
On Sat, Apr 09, 2022 at 08:35:10PM +0000, Taylor R Campbell wrote:
> Does this still manifest on HEAD? Do you have a machine running a
> current kernel where you've seen the cmake hang? Can you run this
> program on it and see if it fails?
>
> Adjust N to be the (even) number of CPUs you have. It'll check for
> progress every second and raise SIGABRT if not -- you can run it under
> gdb to catch the SIGABRT or examine the core dump later if you like.
>
> Took about an hour to wedge the first time on my 12-core/24-thread
> Sandy Bridge system (but that's running netbsd-9 and everything
> changed with pthread_cond.c in HEAD so it's not as useful as I hoped).
I had this running for days in gdb on a machine where I see
guile(often)/cmake(rare)/cargo(rare) hangs, but it didn't fail.
A bulk build with guile30 had a guile hang during the same time.
That was on 9.99.96/amd64 from March 29.
Thomas
From: Taylor R Campbell <riastradh@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/56414: cmake hang on kqueue (condvar waiter list issue) ?
Date: Thu, 14 Apr 2022 14:40:00 +0000
This is a multi-part message in MIME format.
--=_fvz4bw6Sgb7EvIjKFB6Nr1oMeWUc8Daj
I reduced the guile/boehm-gc hang, at least, to the following
reproducer -- it reliably hangs for me after a few hundred runs, with
libpthread from netbsd-9 or HEAD and a netbsd-9 kernel, but others
have reported hangs on recent all-HEAD components. Condvars are not
involved; unclear if it's the same issue as cmake.
--=_fvz4bw6Sgb7EvIjKFB6Nr1oMeWUc8Daj
Content-Type: text/plain; charset="ISO-8859-1"; name="malloctest1"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename="malloctest1.c"
#include <err.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void *
start(void *cookie)
{

	return malloc(12);
}

int
main(void)
{
	pthread_t t[24];
	unsigned i;
	int error;

	for (i = 0; i < __arraycount(t); i++) {
		error = pthread_create(&t[i], NULL, &start, NULL);
		if (error)
			errc(1, error, "pthread_create");
	}
	for (i = 0; i < __arraycount(t); i++) {
		error = pthread_join(t[i], NULL);
		if (error)
			errc(1, error, "pthread_join");
	}

	return 0;
}
--=_fvz4bw6Sgb7EvIjKFB6Nr1oMeWUc8Daj--
From: Joerg Sonnenberger <joerg@bec.de>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org, Manuel.Bouyer@lip6.fr
Subject: Re: kern/56414: cmake hang on kqueue (condvar waiter list issue) ?
Date: Sun, 17 Apr 2022 20:49:33 +0200
Am Thu, Apr 14, 2022 at 02:45:02PM +0000 schrieb Taylor R Campbell:
> I reduced the guile/boehm-gc hang, at least, to the following
> reproducer -- it reliably hangs for me after a few hundred runs, with
> libpthread from netbsd-9 or HEAD and a netbsd-9 kernel, but others
> have reported hangs on recent all-HEAD components. Condvars are not
> involved; unclear if it's the same issue as cmake.
Do you have a backtrace of a hanging instance?
Joerg
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/56414: cmake hang on kqueue (condvar waiter list issue) ?
Date: Sun, 10 Jul 2022 01:56:07 +0000
Not sent to gnats (use gnats-bugs@)
------
From: Thomas Klausner <wiz@NetBSD.org>
To: gnats@NetBSD.org
Cc: Joerg Sonnenberger <joerg@bec.de>
Subject: Re: kern/56414: cmake hang on kqueue (condvar waiter list issue) ?
Date: Sun, 17 Apr 2022 20:52:38 +0200
Just ran this on current.
First 1000 runs were fine, run 1344 hung:
(gdb) thread apply all bt
Thread 2 (LWP 20616 of process 20616 ""):
#0 0x00007e6a0be478ea in _lwp_wait () from /usr/lib/libc.so.12
#1 0x00007e6a0c60d21b in pthread_join (thread=0x7e6a0c827c00, valptr=0x0) at /disk/6/archive/foreign/src/lib/libpthread/pthread.c:684
#2 0x0000000000400b3e in main ()
Thread 1 (LWP 15449 of process 20616 ""):
#0 0x00007e6a0beb255a in ___lwp_park60 () from /usr/lib/libc.so.12
#1 0x00007e6a0c609f4b in pthread__mutex_lock_slow (ptm=0x7e6a0c40c100 <je_arenas_lock+64>, ts=ts@entry=0x0) at /disk/6/archive/foreign/src/lib/libpthread/pthread_mutex.c:366
#2 0x00007e6a0c60a1e9 in pthread_mutex_lock (ptm=ptm@entry=0x7e6a0c40c100 <je_arenas_lock+64>) at /disk/6/archive/foreign/src/lib/libpthread/pthread_mutex.c:218
#3 0x00007e6a0bedf710 in malloc_mutex_lock_final (mutex=0x7e6a0c40c0c0 <je_arenas_lock>) at /disk/6/archive/foreign/src/external/bsd/jemalloc/lib/../include/jemalloc/internal/mutex.h:153
#4 je_malloc_mutex_lock_slow (mutex=mutex@entry=0x7e6a0c40c0c0 <je_arenas_lock>) at /disk/6/archive/foreign/src/external/bsd/jemalloc/lib/../dist/src/mutex.c:84
#5 0x00007e6a0bf1ae97 in malloc_mutex_lock (mutex=0x7e6a0c40c0c0 <je_arenas_lock>, tsdn=0x7e6a0c5f9040) at /disk/6/archive/foreign/src/external/bsd/jemalloc/lib/../include/jemalloc/internal/mutex.h:217
#6 je_arena_choose_hard (tsd=tsd@entry=0x7e6a0c5f9040, internal=internal@entry=false) at /disk/6/archive/foreign/src/external/bsd/jemalloc/lib/../dist/src/jemalloc.c:553
#7 0x00007e6a0bebf4be in arena_choose_impl (arena=0x0, internal=false, tsd=0x7e6a0c5f9040) at /disk/6/archive/foreign/src/external/bsd/jemalloc/lib/../include/jemalloc/internal/jemalloc_internal_inlines_b.h:22
#8 arena_choose_impl (arena=0x0, internal=false, tsd=0x7e6a0c5f9040) at /disk/6/archive/foreign/src/external/bsd/jemalloc/lib/../include/jemalloc/internal/jemalloc_internal_inlines_b.h:8
#9 arena_choose (arena=0x0, tsd=0x7e6a0c5f9040) at /disk/6/archive/foreign/src/external/bsd/jemalloc/lib/../include/jemalloc/internal/jemalloc_internal_inlines_b.h:63
#10 je_tsd_tcache_data_init (tsd=tsd@entry=0x7e6a0c5f9040) at /disk/6/archive/foreign/src/external/bsd/jemalloc/lib/../dist/src/tcache.c:421
#11 0x00007e6a0bebf725 in je_tsd_tcache_enabled_data_init (tsd=tsd@entry=0x7e6a0c5f9040) at /disk/6/archive/foreign/src/external/bsd/jemalloc/lib/../dist/src/tcache.c:350
#12 0x00007e6a0bebbf19 in tsd_data_init (tsd=0x7e6a0c5f9040) at /disk/6/archive/foreign/src/external/bsd/jemalloc/lib/../dist/src/tsd.c:87
#13 je_tsd_fetch_slow (tsd=tsd@entry=0x7e6a0c5f9040, minimal=minimal@entry=false) at /disk/6/archive/foreign/src/external/bsd/jemalloc/lib/../dist/src/tsd.c:147
#14 0x00007e6a0bf1b280 in tsd_fetch_impl (minimal=false, init=true) at /disk/6/archive/foreign/src/external/bsd/jemalloc/lib/../include/jemalloc/internal/tsd.h:265
#15 tsd_fetch () at /disk/6/archive/foreign/src/external/bsd/jemalloc/lib/../include/jemalloc/internal/tsd.h:291
#16 imalloc (dopts=<synthetic pointer>, sopts=<synthetic pointer>) at /disk/6/archive/foreign/src/external/bsd/jemalloc/lib/../dist/src/jemalloc.c:2036
#17 malloc (size=12) at /disk/6/archive/foreign/src/external/bsd/jemalloc/lib/../dist/src/jemalloc.c:2075
#18 0x0000000000400ab0 in start ()
#19 0x00007e6a0c60c87f in pthread__create_tramp (cookie=0x7e6a0c827c00) at /disk/6/archive/foreign/src/lib/libpthread/pthread.c:564
#20 0x00007e6a0be9b780 in ?? () from /usr/lib/libc.so.12
#21 0x0000000000000000 in ?? ()
Thomas
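Counting runs until the first hang, as reported above, can be automated
with a small driver. The following is a sketch under stated assumptions,
not the harness anyone in this PR actually used: the names
run_with_timeout and first_hang, the 10 ms polling interval, and the
choice to kill a run that exceeds the deadline are all hypothetical.
To inspect a hung instance with gdb instead, one would skip the kill
and attach to the child.

```c
#include <sys/types.h>
#include <sys/wait.h>

#include <errno.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv once.  Returns 0 if the child
 * terminated within roughly timeout_s seconds (a failed exec also
 * counts as terminated), 1 if it was still running at the deadline
 * (treated as a hang; the child is then killed and reaped), and -1
 * if fork or waitpid failed.
 */
int
run_with_timeout(char *const argv[], unsigned timeout_s)
{
	const struct timespec tick = { 0, 10 * 1000 * 1000 };	/* 10 ms */
	unsigned long ticks = (unsigned long)timeout_s * 100;
	int status;
	pid_t pid, r;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		execvp(argv[0], argv);
		_exit(127);		/* exec failed */
	}

	/* Poll for termination until the deadline. */
	for (;;) {
		r = waitpid(pid, &status, WNOHANG);
		if (r == pid)
			return 0;
		if (r == -1 && errno != EINTR)
			return -1;
		if (ticks-- == 0)
			break;
		(void)nanosleep(&tick, NULL);
	}

	/* Deadline passed: assume a hang, then kill and reap. */
	(void)kill(pid, SIGKILL);
	while (waitpid(pid, &status, 0) == -1 && errno == EINTR)
		continue;
	return 1;
}

/*
 * Hypothetical helper: repeat the command up to max_runs times.
 * Returns the 1-based number of the first run that hung, 0 if no
 * run hung, and -1 on error.
 */
int
first_hang(char *const argv[], unsigned timeout_s, int max_runs)
{
	int i, r;

	for (i = 1; i <= max_runs; i++) {
		r = run_with_timeout(argv, timeout_s);
		if (r == -1)
			return -1;
		if (r == 1)
			return i;
	}
	return 0;
}
```

For the reproducer, argv would name the compiled malloctest1 binary and
max_runs would be a few thousand, since hangs were reported only after
hundreds of runs or more.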
From: Patrick Welche <prlw1@talktalk.net>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/56414: cmake hang on kqueue (condvar waiter list issue) ?
Date: Thu, 28 Apr 2022 22:24:35 +0100
On Sun, Apr 17, 2022 at 06:50:02PM +0000, Joerg Sonnenberger wrote:
> Do you have a backtrace of a hanging instance?
Not sure whether you mean the reproducer or cmake. cmake on a 24 Apr 2022
NetBSD-9.99.96/amd64 just hung in
(gdb) thread apply all bt
Thread 2 (LWP 2135 of process 2135 ""):
#0 0x00007f7ff42478ea in _lwp_wait () from /usr/lib/libc.so.12
#1 0x00007f7ff560d21b in pthread_join (thread=0x7f7ff7bf4c00, valptr=0x0) at /usr/src/lib/libpthread/pthread.c:684
#2 0x00007f7ff52aef35 in __gthread_join (__value_ptr=0x0, __threadid=<optimized out>) at /usr/export/amd64/usr/include/g++/bits/gthr-posix.h:672
#3 std::thread::join (this=0x7f7ff7ea7af0) at /usr/src/external/gpl3/gcc/dist/libstdc++-v3/src/c++11/thread.cc:110
#4 0x0000000000302666 in cmWorkerPoolWorker::~cmWorkerPoolWorker() ()
#5 0x0000000000302804 in cmWorkerPoolInternal::UVSlotEnd(uv_async_s*) ()
#6 0x00007f7ff640d39c in uv.async_io.part () from /usr/pkg/lib/libuv.so.1
#7 0x00007f7ff641d2ce in uv.io_poll () from /usr/pkg/lib/libuv.so.1
#8 0x00007f7ff640db8f in uv_run () from /usr/pkg/lib/libuv.so.1
#9 0x0000000000303551 in cmWorkerPoolInternal::Process() ()
#10 0x0000000000303864 in cmWorkerPool::Process(void*) ()
#11 0x00000000002b4f86 in (anonymous namespace)::cmQtAutoMocUicT::Process() ()
#12 0x00000000002a7813 in cmQtAutoMocUic(std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >) ()
#13 0x000000000023c4eb in cmcmd::ExecuteCMakeCommand(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::unique_ptr<cmConsoleBuf, std::default_delete<cmConsoleBuf> >) ()
#14 0x00000000006f35cb in main ()
Thread 1 (LWP 25517 of process 2135 ""):
#0 0x00007f7ff42b255a in ___lwp_park60 () from /usr/lib/libc.so.12
#1 0x00007f7ff5609f4b in pthread__mutex_lock_slow (ptm=0x7f7ff7eac3e8, ts=0x0) at /usr/src/lib/libpthread/pthread_mutex.c:366
#2 0x00000000003041bc in cmWorkerPoolInternal::Work(unsigned int) ()
#3 0x00007f7ff52aeedc in std::execute_native_thread_routine (__p=0x7f7ff7eaa830) at /usr/src/external/gpl3/gcc/dist/libstdc++-v3/src/c++11/thread.cc:80
#4 0x00007f7ff560c87f in pthread__create_tramp (cookie=0x7f7ff7bf4c00) at /usr/src/lib/libpthread/pthread.c:564
#5 0x00007f7ff429b780 in ?? () from /usr/lib/libc.so.12
#6 0x0000000000000000 in ?? ()
From: Thomas Klausner <wiz@NetBSD.org>
To: NetBSD bugtracking <gnats-bugs@NetBSD.org>
Cc:
Subject: Re: kern/56414: cmake hang on kqueue (condvar waiter list issue) ?
Date: Sun, 15 May 2022 21:29:36 +0200
There were two different problems wrt the cmake hangs. But with chuq's
patch from
https://mail-index.netbsd.org/current-users/2022/05/01/msg042271.html
i.e.
http://www.netbsd.org/~chs/diff.pthread-park-stuck.1
and libuv-1.44.1nb1, i.e. the patch from
https://mail-index.netbsd.org/pkgsrc-changes/2022/05/15/msg254562.html
I see no cmake hangs any longer.
Thomas
State-Changed-From-To: open->feedback
State-Changed-By: martin@NetBSD.org
State-Changed-When: Fri, 03 Jun 2022 18:25:19 +0000
State-Changed-Why:
This should have been fixed with r1.181 of lib/libpthread/pthread.c.
Do you still see it?
From: Manuel Bouyer <manuel.bouyer@lip6.fr>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org,
martin@NetBSD.org
Subject: Re: kern/56414 (cmake hang on kqueue)
Date: Mon, 6 Jun 2022 11:39:07 +0200
On Fri, Jun 03, 2022 at 06:25:20PM +0000, martin@NetBSD.org wrote:
> Synopsis: cmake hang on kqueue
>
> State-Changed-From-To: open->feedback
> State-Changed-By: martin@NetBSD.org
> State-Changed-When: Fri, 03 Jun 2022 18:25:19 +0000
> State-Changed-Why:
> This should have been fixed with r1.181 of lib/libpthread/pthread.c.
> Do you still see it?
I see it with a netbsd-9 userland, so if it has not been pulled up, it's
likely not fixed.
--
Manuel Bouyer, LIP6, Sorbonne Université. Manuel.Bouyer@lip6.fr
NetBSD: 26 ans d'experience feront toujours la difference
--
>Unformatted:
Copyright © 1994-2020
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.