NetBSD Problem Report #57628

From martin@aprisoft.de  Mon Sep 25 09:39:26 2023
Return-Path: <martin@aprisoft.de>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id D0FAE1A9238
	for <gnats-bugs@gnats.NetBSD.org>; Mon, 25 Sep 2023 09:39:26 +0000 (UTC)
Message-Id: <20230925093917.407CF5CC7A2@emmas.aprisoft.de>
Date: Mon, 25 Sep 2023 11:39:17 +0200 (CEST)
From: martin@NetBSD.org
Reply-To: martin@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: multithread programs may deadlock in ld.elf_so (sparc on sparc64)
X-Send-Pr-Version: 3.95

>Number:         57628
>Category:       lib
>Synopsis:       multithread programs may deadlock in ld.elf_so (sparc on sparc64)
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    lib-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Sep 25 09:40:01 +0000 2023
>Closed-Date:    Tue Oct 03 09:49:04 +0000 2023
>Last-Modified:  Tue Oct 03 09:50:01 +0000 2023
>Originator:     Martin Husemann
>Release:        NetBSD 10.99.9
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD nelly.aprisoft.de 10.99.9 NetBSD 10.99.9 (NELLY) #75: Sun Sep 24 09:30:35 CEST 2023 martin@seven-days-to-the-wolves.aprisoft.de:/work/src/sys/arch/sparc64/compile/NELLY sparc
Architecture: sparc
Machine: sparc
>Description:

Running sparc userland ATF tests on sparc64 shows a few tests deadlocking
like this:

[~] root@nelly # gdb -p 15928
GNU gdb (GDB) 13.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "sparc--netbsdelf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 15928
Reading symbols from /usr/bin/rump_server...
Reading symbols from /usr/libdata/debug//usr/bin/rump_server.debug...
[New LWP 15537 of process 15928]
[New LWP 19607 of process 15928]
[New LWP 21178 of process 15928]
[New LWP 18138 of process 15928]
[New LWP 18206 of process 15928]
[New LWP 17453 of process 15928]
[New LWP 17889 of process 15928]
[New LWP 9653 of process 15928]
[New LWP 17218 of process 15928]
[New LWP 18547 of process 15928]
[New LWP 14775 of process 15928]
[New LWP 3924 of process 15928]
[New LWP 3116 of process 15928]
[New LWP 19701 of process 15928]
[New LWP 15928 of process 15928]
Reading symbols from /usr/lib/librumpkern_sysproxy.so.0...
Reading symbols from /usr/libdata/debug//usr/lib/librumpkern_sysproxy.so.0.0.debug...
Reading symbols from /usr/lib/librump.so.0...
Reading symbols from /usr/libdata/debug//usr/lib/librump.so.0.0.debug...
Reading symbols from /usr/lib/librumpvfs_nofifofs.so.0...
Reading symbols from /usr/libdata/debug//usr/lib/librumpvfs_nofifofs.so.0.0.debug...
Reading symbols from /usr/lib/librumpvfs.so.0...
Reading symbols from /usr/libdata/debug//usr/lib/librumpvfs.so.0.0.debug...
Reading symbols from /usr/lib/librumpuser.so.0...
Reading symbols from /usr/libdata/debug//usr/lib/librumpuser.so.0.1.debug...
Reading symbols from /usr/lib/libpthread.so.1...
Reading symbols from /usr/libdata/debug//usr/lib/libpthread.so.1.4.debug...
Reading symbols from /usr/lib/libsparc_v8.so.0...
Reading symbols from /usr/libdata/debug//usr/lib/libsparc_v8.so.0.0.debug...
(No debugging symbols found in /usr/libdata/debug//usr/lib/libsparc_v8.so.0.0.debug)
Reading symbols from /usr/lib/libc.so.12...
Reading symbols from /usr/libdata/debug//usr/lib/libc.so.12.221.debug...
Reading symbols from /usr/lib/libgcc_s.so.1...
Reading symbols from /usr/libdata/debug//usr/lib/libgcc_s.so.1.0.debug...
Reading symbols from /usr/lib/librumpkern_simplehook_tester.so...
Reading symbols from /usr/libdata/debug//usr/lib/librumpkern_simplehook_tester.so.0.0.debug...
Reading symbols from /usr/libexec/ld.elf_so...
Reading symbols from /usr/libdata/debug//usr/libexec/ld.elf_so.debug...
[Switching to LWP 19391 of process 15928]
_rtld_shared_enter () at /work/src/libexec/ld.elf_so/rtld.c:1716
1716    /work/src/libexec/ld.elf_so/rtld.c: No such file or directory.
(gdb) bt
#0  _rtld_shared_enter () at /work/src/libexec/ld.elf_so/rtld.c:1716
#1  0x2ad113b4 in _rtld_bind (obj=0x3eda4800, reloff=6504)
    at /work/src/libexec/ld.elf_so/arch/sparc/mdreloc.c:420
#2  0x2ad10f10 in _rtld_bind_start () from /usr/libexec/ld.elf_so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) info thread
  Id   Target Id                              Frame 
* 1    LWP 19391 of process 15928 "entbutler" _rtld_shared_enter ()
    at /work/src/libexec/ld.elf_so/rtld.c:1716
  2    LWP 15537 of process 15928 "xcall/1"   pthread_cond_timedwait (
    cond=0x3edcaba0, mutex=0x3edc7c80, abstime=0x0)
    at /work/src/lib/libpthread/pthread_cond.c:172
  3    LWP 19607 of process 15928 "sipbnc"    pthread_cond_timedwait (
    cond=0x3edca940, mutex=0x3edc7900, abstime=0x0)
    at /work/src/lib/libpthread/pthread_cond.c:172
  4    LWP 21178 of process 15928 "rumpclk1"  pthread_cond_timedwait (
    cond=0x3edca960, mutex=0x3edc7980, abstime=0x0)
    at /work/src/lib/libpthread/pthread_cond.c:172
  5    LWP 18138 of process 15928 "xcall/0"   pthread_cond_timedwait (
    cond=0x3edca940, mutex=0x3edc7900, abstime=0x0)
    at /work/src/lib/libpthread/pthread_cond.c:172
  6    LWP 18206 of process 15928 "rsi1/0"    pthread_cond_timedwait (
    cond=0x3edcab00, mutex=0x3edc7980, abstime=0x0)
    at /work/src/lib/libpthread/pthread_cond.c:172
  7    LWP 17453 of process 15928 "rsi0/0"    pthread_cond_timedwait (
    cond=0x3edca940, mutex=0x3edc7900, abstime=0x0)
    at /work/src/lib/libpthread/pthread_cond.c:172
  8    LWP 17889 of process 15928 "rsi1/3"    pthread_cond_timedwait (
    cond=0x3edcab60, mutex=0x3edc7980, abstime=0x0)
    at /work/src/lib/libpthread/pthread_cond.c:172
  9    LWP 9653 of process 15928 "rsi0/3"     pthread_cond_timedwait (
    cond=0x3edca940, mutex=0x3edc7900, abstime=0x0)
    at /work/src/lib/libpthread/pthread_cond.c:172
  10   LWP 17218 of process 15928 "rsi1/2"    pthread_cond_timedwait (
    cond=0x3edcab40, mutex=0x3edc7980, abstime=0x0)
    at /work/src/lib/libpthread/pthread_cond.c:172
  11   LWP 18547 of process 15928 "rsi0/2"    pthread_cond_timedwait (
    cond=0x3edca940, mutex=0x3edc7900, abstime=0x0)
    at /work/src/lib/libpthread/pthread_cond.c:172
  12   LWP 14775 of process 15928 "sipbnc"    pthread_cond_timedwait (
    cond=0x3edcab80, mutex=0x3edc7f80, abstime=0x0)
    at /work/src/lib/libpthread/pthread_cond.c:172
  13   LWP 3924 of process 15928 "rumpclk0"   pthread_cond_timedwait (
    cond=0x3edca940, mutex=0x3edc7900, abstime=0x0)
    at /work/src/lib/libpthread/pthread_cond.c:172
  14   LWP 3116 of process 15928 "rsi1/1"     pthread_cond_timedwait (
    cond=0x3edcab20, mutex=0x3edc7980, abstime=0x0)
    at /work/src/lib/libpthread/pthread_cond.c:172
  15   LWP 19701 of process 15928 "rsi0/1"    pthread_cond_timedwait (
    cond=0x3edca940, mutex=0x3edc7900, abstime=0x0)
    at /work/src/lib/libpthread/pthread_cond.c:172
  16   LWP 15928 of process 15928 ""          _rtld_shared_enter ()
    at /work/src/libexec/ld.elf_so/rtld.c:1716
(gdb) thread 16
[Switching to thread 16 (LWP 15928 of process 15928)]
#0  _rtld_shared_enter () at /work/src/libexec/ld.elf_so/rtld.c:1716
1716    in /work/src/libexec/ld.elf_so/rtld.c
(gdb) bt
#0  _rtld_shared_enter () at /work/src/libexec/ld.elf_so/rtld.c:1716
#1  0x2ad113b4 in _rtld_bind (obj=0x3eda4800, reloff=7044)
    at /work/src/libexec/ld.elf_so/arch/sparc/mdreloc.c:420
#2  0x2ad10f10 in _rtld_bind_start () from /usr/libexec/ld.elf_so


>How-To-Repeat:

Run the ATF tests with a 32-bit sparc userland on a sparc64 kernel.

>Fix:
n/a

>Release-Note:

>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: lib/57628: multithread programs may deadlock in ld.elf_so (sparc
 on sparc64)
Date: Tue, 26 Sep 2023 20:29:45 +0200

 This may be due to missing initialization of the atomic operation support
 code, as there are two copies of it in the running process:

  > gdb /usr/tests/dev/sysmon/t_swwdog 
 Reading symbols from /usr/tests/dev/sysmon/t_swwdog...
 Reading symbols from /usr/libdata/debug//usr/tests/dev/sysmon/t_swwdog.debug...
 (gdb) break __libc_atomic_init
 Function "__libc_atomic_init" not defined.
 Make breakpoint pending on future shared library load? (y or [n]) y
 Breakpoint 1 (__libc_atomic_init) pending.
 (gdb) run disarm
 Starting program: /usr/tests/dev/sysmon/t_swwdog disarm

 Breakpoint 1.2, __libc_atomic_init ()
     at /work/src/lib/libc/../../common/lib/libc/atomic/atomic_init_testset.c:281
 281     /work/src/lib/libc/../../common/lib/libc/atomic/atomic_init_testset.c: No such file or directory.
 (gdb) info break
 Num     Type           Disp Enb Address    What
 1       breakpoint     keep y   <MULTIPLE> 
         breakpoint already hit 1 time
 1.1                         y   0x25ddded4 in __libc_atomic_init 
                                            at /work/src/lib/libc/../../common/lib/libc/atomic/atomic_init_testset.c:281
 1.2                         y   0x3ec301d4 in __libc_atomic_init 
                                            at /work/src/lib/libc/../../common/lib/libc/atomic/atomic_init_testset.c:281
 (gdb) info dll
 From        To          Syms Read   Shared Object Library
 0x25dd0ec8  0x25dde050  Yes         /usr/libexec/ld.elf_so
 0x3e784b0c  0x3e78e058  Yes         /usr/lib/librumpdev_sysmon.so.0
 0x3e7b2454  0x3e7b4e00  Yes         /usr/lib/librumpdev.so.0
 0x3e7e90c8  0x3e83ab64  Yes         /usr/lib/librumpvfs.so.0
 0x3e870640  0x3e8707b8  Yes         /usr/lib/librumpvfs_nofifofs.so.0
 0x3e8c5298  0x3e974b98  Yes         /usr/lib/librump.so.0
 0x3e9e2f38  0x3e9e91e8  Yes         /usr/lib/librumpuser.so.0
 0x3ea15600  0x3ea1ea20  Yes         /usr/lib/libpthread.so.1
 0x3ea44b24  0x3ea4ea20  Yes         /usr/lib/libatf-c.so.0
 0x3ea70378  0x3ea70580  Yes (*)     /usr/lib/libsparc_v8.so.0
 0x3eacce10  0x3ec30404  Yes         /usr/lib/libc.so.12
 0x3ecb195c  0x3ecba99c  Yes         /usr/lib/libgcc_s.so.1


 ... so there is one version of __libc_atomic_init in ld.elf_so itself, and
 one in libc. The libc copy gets its __libc_atomic_init function called,
 but the ld.elf_so copy apparently never does.

 This probably means that ld.elf_so is trying to use RAS (restartable atomic
 sequences) to emulate the atomic operations on a multiprocessor machine,
 which is doomed.
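
 To illustrate why a RAS-style emulation cannot work with more than one
 CPU, here is a minimal, deterministic sketch. It is not the NetBSD code:
 all names (cas_step, cas_load, cas_store, ras_race_winners) are
 hypothetical. It assumes the emulated compare-and-swap is a plain
 load/compare/store sequence that the kernel only restarts when the thread
 is preempted on its own CPU, so a sequence running at the same time on a
 second CPU goes unnoticed. The sketch replays that interleaving by hand
 instead of using real threads.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one in-flight, non-atomic compare-and-swap. */
struct cas_step {
	uint32_t seen;		/* value observed by the load phase */
};

/* Phase 1: load the current lock value. */
static void
cas_load(struct cas_step *s, const volatile uint32_t *lock)
{
	s->seen = *lock;
}

/* Phase 2: store newval if the earlier load saw the expected value. */
static bool
cas_store(const struct cas_step *s, volatile uint32_t *lock,
    uint32_t expected, uint32_t newval)
{
	if (s->seen != expected)
		return false;
	*lock = newval;
	return true;
}

/*
 * Let two simulated CPUs (a and b) try to take the same lock and return
 * how many of them believe they succeeded.
 * interleaved == true:  load(a), load(b), store(a), store(b)
 * interleaved == false: load(a), store(a), load(b), store(b)
 */
int
ras_race_winners(bool interleaved)
{
	volatile uint32_t lock = 0;	/* 0 = free, 1 = held */
	struct cas_step a, b;
	int winners = 0;

	cas_load(&a, &lock);
	if (interleaved)
		cas_load(&b, &lock);	/* b reads before a has stored */
	winners += cas_store(&a, &lock, 0, 1);
	if (!interleaved)
		cas_load(&b, &lock);	/* b reads after a has stored */
	winners += cas_store(&b, &lock, 0, 1);
	return winners;
}
```

 With the interleaved order both simulated CPUs see the lock as free and
 both "acquire" it (two winners); with the serialized order only the first
 one does. A lock built on such a primitive is no longer a lock, which is
 consistent with threads getting stuck in _rtld_shared_enter().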

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: lib/57628: multithread programs may deadlock in ld.elf_so (sparc
 on sparc64)
Date: Wed, 27 Sep 2023 20:05:11 +0200

 The attached patch seems to fix the issue for me, without any obvious bad
 side effects (ld.elf_so size is about the same, initial tests work fine).

 The same gdb test run from the previous rump example now shows both copies
 of __libc_atomic_init() being called - the ld.elf_so internal one very
 early (before any userland threads could have been created) and the second
 one when libc constructors are run.

 All our ld.elf_so builds seem to have the hidden function, but it is a
 no-op on most architectures (where the hardware-supported atomic ops are
 good enough), so no ifdef magic seems to be needed.
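
 The "two copies, only one initialized" situation can be sketched as
 follows. This is an illustration under assumptions, not the code in
 common/lib/libc/atomic: the names (demo_atomic_state, demo_cas_up,
 demo_cas_mp, demo_atomic_init, demo_is_mp_safe) are hypothetical, the CPU
 count is passed in as a parameter, and the default is assumed to be the
 uniprocessor variant, as the behaviour seen above suggests. C11 atomics
 stand in for whatever multiprocessor-safe fallback the real code uses.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef uint32_t (*demo_cas_fn)(_Atomic uint32_t *, uint32_t, uint32_t);

/* Uniprocessor-only stand-in for a RAS sequence: load, compare, store. */
static uint32_t
demo_cas_up(_Atomic uint32_t *p, uint32_t expected, uint32_t newval)
{
	uint32_t old = atomic_load_explicit(p, memory_order_relaxed);

	if (old == expected)
		atomic_store_explicit(p, newval, memory_order_relaxed);
	return old;
}

/* Multiprocessor-safe variant; returns the value found in *p. */
static uint32_t
demo_cas_mp(_Atomic uint32_t *p, uint32_t expected, uint32_t newval)
{
	uint32_t old = expected;

	/* On failure, old is updated to the value actually found. */
	atomic_compare_exchange_strong(p, &old, newval);
	return old;
}

/* Per-copy state: each linked-in copy carries its own function pointer. */
struct demo_atomic_state {
	demo_cas_fn cas;
};

/* Assumed default: the uniprocessor variant until init has run. */
#define DEMO_ATOMIC_STATE_INIT { demo_cas_up }

/* Select the variant for this copy only; other copies are untouched. */
void
demo_atomic_init(struct demo_atomic_state *st, int ncpu)
{
	st->cas = (ncpu > 1) ? demo_cas_mp : demo_cas_up;
}

/* Report whether this copy dispatches to the MP-safe variant. */
int
demo_is_mp_safe(const struct demo_atomic_state *st)
{
	return st->cas == demo_cas_mp;
}
```

 In this model, initializing the state that stands for the libc copy leaves
 the state that stands for the ld.elf_so copy on its default, so each copy
 needs its own init call. That is the role of the explicit
 __libc_atomic_init() call in the patch.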

 Is it this simple? Am I overlooking something?
 Will run complete tests now...

 Martin

 Index: rtld.c
 ===================================================================
 RCS file: /cvsroot/src/libexec/ld.elf_so/rtld.c,v
 retrieving revision 1.215
 diff -u -p -r1.215 rtld.c
 --- rtld.c	30 Jul 2023 09:20:14 -0000	1.215
 +++ rtld.c	27 Sep 2023 17:52:04 -0000
 @@ -70,6 +70,13 @@ __RCSID("$NetBSD: rtld.c,v 1.215 2023/07
  #endif

  /*
 + * Hidden function from common/lib/libc/atomic - nop on machines
 + * with enough atomic ops. Need to explicitly call it early.
 + * libc has the same symbol and will initialize itself, but not our copy.
 + */
 +void __libc_atomic_init(void);
 +
 +/*
   * Function declarations.
   */
  static void     _rtld_init(caddr_t, caddr_t, const char *);
 @@ -404,6 +411,8 @@ _rtld_init(caddr_t mapbase, caddr_t relo
  	ehdr = (Elf_Ehdr *)mapbase;
  	_rtld_objself.phdr = (Elf_Phdr *)((char *)mapbase + ehdr->e_phoff);
  	_rtld_objself.phsize = ehdr->e_phnum * sizeof(_rtld_objself.phdr[0]);
 +
 +	__libc_atomic_init();
  }

  /*

State-Changed-From-To: open->closed
State-Changed-By: martin@NetBSD.org
State-Changed-When: Tue, 03 Oct 2023 09:49:04 +0000
State-Changed-Why:
Fixed


From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57628 CVS commit: src/libexec/ld.elf_so
Date: Tue, 3 Oct 2023 09:48:19 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Tue Oct  3 09:48:19 UTC 2023

 Modified Files:
 	src/libexec/ld.elf_so: rtld.c

 Log Message:
 PR 57628: at the end of _rtld_init() explicitly initialize the ld.elf_so
 local copy of the atomic access support functions for machines that do not
 implement all required ops in hardware (like 32bit sparc).

 XXX would be better to figure out a way to share this copy with libc
 (thereby using half as many RAS sections). But even if we would share it,
 we have to init it early enough for ld.elf_so internal uses.


 To generate a diff of this commit:
 cvs rdiff -u -r1.215 -r1.216 src/libexec/ld.elf_so/rtld.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

>Unformatted:
