NetBSD Problem Report #57628
From martin@aprisoft.de Mon Sep 25 09:39:26 2023
Return-Path: <martin@aprisoft.de>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id D0FAE1A9238
for <gnats-bugs@gnats.NetBSD.org>; Mon, 25 Sep 2023 09:39:26 +0000 (UTC)
Message-Id: <20230925093917.407CF5CC7A2@emmas.aprisoft.de>
Date: Mon, 25 Sep 2023 11:39:17 +0200 (CEST)
From: martin@NetBSD.org
Reply-To: martin@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: multithread programs may deadlock in ld.elf_so (sparc on sparc64)
X-Send-Pr-Version: 3.95
>Number: 57628
>Category: lib
>Synopsis: multithread programs may deadlock in ld.elf_so (sparc on sparc64)
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: lib-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Sep 25 09:40:01 +0000 2023
>Closed-Date: Tue Oct 03 09:49:04 +0000 2023
>Last-Modified: Tue Oct 03 09:50:01 +0000 2023
>Originator: Martin Husemann
>Release: NetBSD 10.99.9
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD nelly.aprisoft.de 10.99.9 NetBSD 10.99.9 (NELLY) #75: Sun Sep 24 09:30:35 CEST 2023 martin@seven-days-to-the-wolves.aprisoft.de:/work/src/sys/arch/sparc64/compile/NELLY sparc
Architecture: sparc
Machine: sparc
>Description:
Running sparc userland ATF tests on sparc64 shows a few tests deadlocking
like this:
[~] root@nelly # gdb -p 15928
GNU gdb (GDB) 13.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "sparc--netbsdelf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 15928
Reading symbols from /usr/bin/rump_server...
Reading symbols from /usr/libdata/debug//usr/bin/rump_server.debug...
[New LWP 15537 of process 15928]
[New LWP 19607 of process 15928]
[New LWP 21178 of process 15928]
[New LWP 18138 of process 15928]
[New LWP 18206 of process 15928]
[New LWP 17453 of process 15928]
[New LWP 17889 of process 15928]
[New LWP 9653 of process 15928]
[New LWP 17218 of process 15928]
[New LWP 18547 of process 15928]
[New LWP 14775 of process 15928]
[New LWP 3924 of process 15928]
[New LWP 3116 of process 15928]
[New LWP 19701 of process 15928]
[New LWP 15928 of process 15928]
Reading symbols from /usr/lib/librumpkern_sysproxy.so.0...
Reading symbols from /usr/libdata/debug//usr/lib/librumpkern_sysproxy.so.0.0.debug...
Reading symbols from /usr/lib/librump.so.0...
Reading symbols from /usr/libdata/debug//usr/lib/librump.so.0.0.debug...
Reading symbols from /usr/lib/librumpvfs_nofifofs.so.0...
Reading symbols from /usr/libdata/debug//usr/lib/librumpvfs_nofifofs.so.0.0.debug...
Reading symbols from /usr/lib/librumpvfs.so.0...
Reading symbols from /usr/libdata/debug//usr/lib/librumpvfs.so.0.0.debug...
Reading symbols from /usr/lib/librumpuser.so.0...
Reading symbols from /usr/libdata/debug//usr/lib/librumpuser.so.0.1.debug...
Reading symbols from /usr/lib/libpthread.so.1...
Reading symbols from /usr/libdata/debug//usr/lib/libpthread.so.1.4.debug...
Reading symbols from /usr/lib/libsparc_v8.so.0...
Reading symbols from /usr/libdata/debug//usr/lib/libsparc_v8.so.0.0.debug...
(No debugging symbols found in /usr/libdata/debug//usr/lib/libsparc_v8.so.0.0.debug)
Reading symbols from /usr/lib/libc.so.12...
Reading symbols from /usr/libdata/debug//usr/lib/libc.so.12.221.debug...
Reading symbols from /usr/lib/libgcc_s.so.1...
Reading symbols from /usr/libdata/debug//usr/lib/libgcc_s.so.1.0.debug...
Reading symbols from /usr/lib/librumpkern_simplehook_tester.so...
Reading symbols from /usr/libdata/debug//usr/lib/librumpkern_simplehook_tester.so.0.0.debug...
Reading symbols from /usr/libexec/ld.elf_so...
Reading symbols from /usr/libdata/debug//usr/libexec/ld.elf_so.debug...
[Switching to LWP 19391 of process 15928]
_rtld_shared_enter () at /work/src/libexec/ld.elf_so/rtld.c:1716
1716 /work/src/libexec/ld.elf_so/rtld.c: No such file or directory.
(gdb) bt
#0 _rtld_shared_enter () at /work/src/libexec/ld.elf_so/rtld.c:1716
#1 0x2ad113b4 in _rtld_bind (obj=0x3eda4800, reloff=6504)
at /work/src/libexec/ld.elf_so/arch/sparc/mdreloc.c:420
#2 0x2ad10f10 in _rtld_bind_start () from /usr/libexec/ld.elf_so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) info thread
Id Target Id Frame
* 1 LWP 19391 of process 15928 "entbutler" _rtld_shared_enter ()
at /work/src/libexec/ld.elf_so/rtld.c:1716
2 LWP 15537 of process 15928 "xcall/1" pthread_cond_timedwait (
cond=0x3edcaba0, mutex=0x3edc7c80, abstime=0x0)
at /work/src/lib/libpthread/pthread_cond.c:172
3 LWP 19607 of process 15928 "sipbnc" pthread_cond_timedwait (
cond=0x3edca940, mutex=0x3edc7900, abstime=0x0)
at /work/src/lib/libpthread/pthread_cond.c:172
4 LWP 21178 of process 15928 "rumpclk1" pthread_cond_timedwait (
cond=0x3edca960, mutex=0x3edc7980, abstime=0x0)
at /work/src/lib/libpthread/pthread_cond.c:172
5 LWP 18138 of process 15928 "xcall/0" pthread_cond_timedwait (
cond=0x3edca940, mutex=0x3edc7900, abstime=0x0)
at /work/src/lib/libpthread/pthread_cond.c:172
6 LWP 18206 of process 15928 "rsi1/0" pthread_cond_timedwait (
cond=0x3edcab00, mutex=0x3edc7980, abstime=0x0)
at /work/src/lib/libpthread/pthread_cond.c:172
7 LWP 17453 of process 15928 "rsi0/0" pthread_cond_timedwait (
cond=0x3edca940, mutex=0x3edc7900, abstime=0x0)
at /work/src/lib/libpthread/pthread_cond.c:172
8 LWP 17889 of process 15928 "rsi1/3" pthread_cond_timedwait (
cond=0x3edcab60, mutex=0x3edc7980, abstime=0x0)
at /work/src/lib/libpthread/pthread_cond.c:172
9 LWP 9653 of process 15928 "rsi0/3" pthread_cond_timedwait (
cond=0x3edca940, mutex=0x3edc7900, abstime=0x0)
at /work/src/lib/libpthread/pthread_cond.c:172
10 LWP 17218 of process 15928 "rsi1/2" pthread_cond_timedwait (
cond=0x3edcab40, mutex=0x3edc7980, abstime=0x0)
at /work/src/lib/libpthread/pthread_cond.c:172
11 LWP 18547 of process 15928 "rsi0/2" pthread_cond_timedwait (
cond=0x3edca940, mutex=0x3edc7900, abstime=0x0)
at /work/src/lib/libpthread/pthread_cond.c:172
12 LWP 14775 of process 15928 "sipbnc" pthread_cond_timedwait (
cond=0x3edcab80, mutex=0x3edc7f80, abstime=0x0)
at /work/src/lib/libpthread/pthread_cond.c:172
13 LWP 3924 of process 15928 "rumpclk0" pthread_cond_timedwait (
cond=0x3edca940, mutex=0x3edc7900, abstime=0x0)
at /work/src/lib/libpthread/pthread_cond.c:172
14 LWP 3116 of process 15928 "rsi1/1" pthread_cond_timedwait (
cond=0x3edcab20, mutex=0x3edc7980, abstime=0x0)
at /work/src/lib/libpthread/pthread_cond.c:172
15 LWP 19701 of process 15928 "rsi0/1" pthread_cond_timedwait (
cond=0x3edca940, mutex=0x3edc7900, abstime=0x0)
at /work/src/lib/libpthread/pthread_cond.c:172
16 LWP 15928 of process 15928 "" _rtld_shared_enter ()
at /work/src/libexec/ld.elf_so/rtld.c:1716
(gdb) thread 16
[Switching to thread 16 (LWP 15928 of process 15928)]
#0 _rtld_shared_enter () at /work/src/libexec/ld.elf_so/rtld.c:1716
1716 in /work/src/libexec/ld.elf_so/rtld.c
(gdb) bt
#0 _rtld_shared_enter () at /work/src/libexec/ld.elf_so/rtld.c:1716
#1 0x2ad113b4 in _rtld_bind (obj=0x3eda4800, reloff=7044)
at /work/src/libexec/ld.elf_so/arch/sparc/mdreloc.c:420
#2 0x2ad10f10 in _rtld_bind_start () from /usr/libexec/ld.elf_so
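Both threads that are not waiting on a condition variable sit in
_rtld_shared_enter(), reached from _rtld_bind() during lazy symbol binding.
What follows is a minimal sketch of this kind of shared-lock entry, assuming
a single lock word that is updated with compare-and-swap. All names
(sketch_lock, sketch_shared_enter, ...) are hypothetical and this is not the
code from rtld.c. The point is only that such a loop relies on the CAS being
truly atomic: if it is not, a reader increment or decrement can get lost and
the word may never return to a state that lets waiters proceed.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical lock word: low bits count readers, top bit marks a writer. */
#define SKETCH_EXCLUSIVE 0x80000000u

static _Atomic unsigned int sketch_lock;

/* Number of threads currently holding the lock in shared mode. */
static unsigned int
sketch_readers(void)
{
	return atomic_load(&sketch_lock) & ~SKETCH_EXCLUSIVE;
}

/* Take the lock in shared mode; spins while a writer holds it. */
static void
sketch_shared_enter(void)
{
	unsigned int cur;

	for (;;) {
		cur = atomic_load(&sketch_lock);
		if (cur & SKETCH_EXCLUSIVE)
			continue;	/* writer active: retry */
		/* Publish cur + 1 only if nobody changed the word meanwhile. */
		if (atomic_compare_exchange_weak(&sketch_lock, &cur, cur + 1))
			return;
	}
}

/* Drop one shared reference. */
static void
sketch_shared_exit(void)
{
	assert(sketch_readers() > 0);
	atomic_fetch_sub(&sketch_lock, 1);
}
```

Two calls to sketch_shared_enter() followed by one sketch_shared_exit()
leave a reader count of one. The sketch uses C11 atomics for portability
and leaves out the sleeping and wakeup a real lock would need.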
>How-To-Repeat:
Run the ATF tests with a 32-bit (sparc) userland on a sparc64 kernel.
>Fix:
n/a
>Release-Note:
>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: lib/57628: multithread programs may deadlock in ld.elf_so (sparc
on sparc64)
Date: Tue, 26 Sep 2023 20:29:45 +0200
This may be caused by missing initialization of the atomic operation
support code, as there are two copies of it in the running process:
> gdb /usr/tests/dev/sysmon/t_swwdog
GNU gdb (GDB) 13.2
Reading symbols from /usr/tests/dev/sysmon/t_swwdog...
Reading symbols from /usr/libdata/debug//usr/tests/dev/sysmon/t_swwdog.debug...
(gdb) break __libc_atomic_init
Function "__libc_atomic_init" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (__libc_atomic_init) pending.
(gdb) run disarm
Starting program: /usr/tests/dev/sysmon/t_swwdog disarm
Breakpoint 1.2, __libc_atomic_init ()
at /work/src/lib/libc/../../common/lib/libc/atomic/atomic_init_testset.c:281
281 /work/src/lib/libc/../../common/lib/libc/atomic/atomic_init_testset.c: No such file or directory.
(gdb) info break
Num Type Disp Enb Address What
1 breakpoint keep y <MULTIPLE>
breakpoint already hit 1 time
1.1 y 0x25ddded4 in __libc_atomic_init
at /work/src/lib/libc/../../common/lib/libc/atomic/atomic_init_testset.c:281
1.2 y 0x3ec301d4 in __libc_atomic_init
at /work/src/lib/libc/../../common/lib/libc/atomic/atomic_init_testset.c:281
(gdb) info dll
From To Syms Read Shared Object Library
0x25dd0ec8 0x25dde050 Yes /usr/libexec/ld.elf_so
0x3e784b0c 0x3e78e058 Yes /usr/lib/librumpdev_sysmon.so.0
0x3e7b2454 0x3e7b4e00 Yes /usr/lib/librumpdev.so.0
0x3e7e90c8 0x3e83ab64 Yes /usr/lib/librumpvfs.so.0
0x3e870640 0x3e8707b8 Yes /usr/lib/librumpvfs_nofifofs.so.0
0x3e8c5298 0x3e974b98 Yes /usr/lib/librump.so.0
0x3e9e2f38 0x3e9e91e8 Yes /usr/lib/librumpuser.so.0
0x3ea15600 0x3ea1ea20 Yes /usr/lib/libpthread.so.1
0x3ea44b24 0x3ea4ea20 Yes /usr/lib/libatf-c.so.0
0x3ea70378 0x3ea70580 Yes (*) /usr/lib/libsparc_v8.so.0
0x3eacce10 0x3ec30404 Yes /usr/lib/libc.so.12
0x3ecb195c 0x3ecba99c Yes /usr/lib/libgcc_s.so.1
... so there is one version of __libc_atomic_init in ld.elf_so itself, and
one in libc. The libc copy gets its __libc_atomic_init function called,
but the ld.elf_so copy apparently never does.
Which probably means that ld.elf_so is trying to use RAS (restartable
atomic sequences) to emulate the atomic operations on a multiprocessor
machine, which is doomed.
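To illustrate the suspected failure mode, here is a rough sketch of an
atomic CAS that is dispatched through a function pointer and only becomes
multiprocessor-safe once an init function has run. It assumes the default
points at a uniprocessor-only variant. All names (sketch_cas_up,
sketch_cas_mp, sketch_atomic_init) are made up for this example, and the
real code in common/lib/libc/atomic may well differ in detail.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef uint32_t (*sketch_cas_fn)(volatile uint32_t *, uint32_t, uint32_t);

/*
 * Uniprocessor variant: a plain load/compare/store. This is only atomic
 * if the kernel restarts the sequence after a preemption (the idea behind
 * RAS); it offers no protection against a second CPU.
 */
static uint32_t
sketch_cas_up(volatile uint32_t *ptr, uint32_t old, uint32_t new_val)
{
	uint32_t ret = *ptr;

	if (ret == old)
		*ptr = new_val;
	return ret;
}

/* Multiprocessor variant: serialize every CAS with a spin lock. */
static atomic_flag sketch_cas_lock = ATOMIC_FLAG_INIT;

static uint32_t
sketch_cas_mp(volatile uint32_t *ptr, uint32_t old, uint32_t new_val)
{
	uint32_t ret;

	while (atomic_flag_test_and_set_explicit(&sketch_cas_lock,
	    memory_order_acquire))
		continue;
	ret = *ptr;
	if (ret == old)
		*ptr = new_val;
	atomic_flag_clear_explicit(&sketch_cas_lock, memory_order_release);
	return ret;
}

/* Assumed default until the init function has run. */
static sketch_cas_fn sketch_cas = sketch_cas_up;

/* Hypothetical init: pick the MP-safe variant on multiprocessor machines. */
static void
sketch_atomic_init(int ncpu)
{
	assert(ncpu >= 1);
	if (ncpu > 1)
		sketch_cas = sketch_cas_mp;
}

/* Nonzero once the MP-safe variant has been selected. */
static int
sketch_cas_is_mp(void)
{
	return sketch_cas == sketch_cas_mp;
}
```

If sketch_atomic_init() is never called for a given copy of this code, that
copy keeps using the uniprocessor variant no matter how many CPUs the
machine has. That would match the symptom seen above.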
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: lib/57628: multithread programs may deadlock in ld.elf_so (sparc
on sparc64)
Date: Wed, 27 Sep 2023 20:05:11 +0200
The attached patch seems to fix the issue for me, without any obvious bad
side effects (ld.elf_so size is about the same, initial tests work fine).
The same gdb test run from the previous rump example now shows both copies
of __libc_atomic_init() being called - the ld.elf_so internal one very
early (before any userland threads could have been created) and the second
one when libc constructors are run.
All our ld.elf_so builds seem to have the hidden function, but it is a
no-op on most architectures (where hardware-supported atomic ops are good
enough), so no #ifdef magic seems to be needed.
Is it this simple? Am I overlooking something?
Will run complete tests now...
Martin
Index: rtld.c
===================================================================
RCS file: /cvsroot/src/libexec/ld.elf_so/rtld.c,v
retrieving revision 1.215
diff -u -p -r1.215 rtld.c
--- rtld.c 30 Jul 2023 09:20:14 -0000 1.215
+++ rtld.c 27 Sep 2023 17:52:04 -0000
@@ -70,6 +70,13 @@ __RCSID("$NetBSD: rtld.c,v 1.215 2023/07
 #endif
 
 /*
+ * Hidden function from common/lib/libc/atomic - nop on machines
+ * with enough atomic ops. Need to explicitly call it early.
+ * libc has the same symbol and will initialize itself, but not our copy.
+ */
+void __libc_atomic_init(void);
+
+/*
  * Function declarations.
  */
 static void _rtld_init(caddr_t, caddr_t, const char *);
@@ -404,6 +411,8 @@ _rtld_init(caddr_t mapbase, caddr_t relo
 	ehdr = (Elf_Ehdr *)mapbase;
 	_rtld_objself.phdr = (Elf_Phdr *)((char *)mapbase + ehdr->e_phoff);
 	_rtld_objself.phsize = ehdr->e_phnum * sizeof(_rtld_objself.phdr[0]);
+
+	__libc_atomic_init();
 }
 
 /*
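The idea behind the patch can be modelled in a few lines: each linked-in
copy of the atomic support code has its own private state, so initializing
one copy does nothing for the other. This is only a toy model with made-up
names (sketch_atomic_copy, sketch_startup_before, sketch_startup_after). It
does not reflect how ld.elf_so or libc are actually structured.

```c
#include <assert.h>
#include <stdbool.h>

/* One instance per linked-in copy of the atomic support code. */
struct sketch_atomic_copy {
	const char *owner;
	bool initialized;	/* false until this copy's init has run */
};

static struct sketch_atomic_copy sketch_rtld_copy = { "ld.elf_so", false };
static struct sketch_atomic_copy sketch_libc_copy = { "libc", false };

/* Stand-in for one copy's hidden init function. */
static void
sketch_copy_init(struct sketch_atomic_copy *copy, int ncpu)
{
	assert(ncpu >= 1);
	copy->initialized = true;
}

/* Before the patch: only the libc copy gets initialized. */
static void
sketch_startup_before(int ncpu)
{
	sketch_copy_init(&sketch_libc_copy, ncpu);
}

/* After the patch: the ld.elf_so copy is initialized first, then libc's. */
static void
sketch_startup_after(int ncpu)
{
	sketch_copy_init(&sketch_rtld_copy, ncpu);
	sketch_copy_init(&sketch_libc_copy, ncpu);
}
```

After sketch_startup_before() only the libc copy is marked initialized,
while sketch_startup_after() covers both. This mirrors the two breakpoint
hits described above.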
State-Changed-From-To: open->closed
State-Changed-By: martin@NetBSD.org
State-Changed-When: Tue, 03 Oct 2023 09:49:04 +0000
State-Changed-Why:
Fixed
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/57628 CVS commit: src/libexec/ld.elf_so
Date: Tue, 3 Oct 2023 09:48:19 +0000
Module Name: src
Committed By: martin
Date: Tue Oct 3 09:48:19 UTC 2023
Modified Files:
src/libexec/ld.elf_so: rtld.c
Log Message:
PR 57628: at the end of _rtld_init() explicitly initialize the ld.elf_so
local copy of the atomic access support functions for machines that do not
implement all required ops in hardware (like 32bit sparc).
XXX would be better to figure out a way to share this copy with libc
(thereby using half as many RAS sections). But even if we would share it,
we have to init it early enough for ld.elf_so internal uses.
To generate a diff of this commit:
cvs rdiff -u -r1.215 -r1.216 src/libexec/ld.elf_so/rtld.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
>Unformatted: