NetBSD Problem Report #54192

From bjjl@chaos.lorenz.place  Fri May 10 18:07:21 2019
Return-Path: <bjjl@chaos.lorenz.place>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id CD1817A1C8
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 10 May 2019 18:07:21 +0000 (UTC)
Message-Id: <20190510180718.8AAC0410823@chaos.lorenz.place>
Date: Fri, 10 May 2019 20:07:18 +0200 (CEST)
From: ben@pocket.services
To: gnats-bugs@NetBSD.org
Subject: lang/rust build error
X-Send-Pr-Version: 3.95

>Number:         54192
>Category:       toolchain
>Synopsis:       lang/rust build error (-current of May 10th)
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    toolchain-manager
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri May 10 18:10:00 +0000 2019
>Closed-Date:    Fri May 22 23:10:16 +0000 2020
>Last-Modified:  Fri May 22 23:10:16 +0000 2020
>Originator:     Benjamin Lorenz
>Release:        NetBSD 8.99.38
>Organization:

>Environment:


System: NetBSD chaos.lorenz.place 8.99.38 NetBSD 8.99.38 (GENERIC) #1: Mon May 6 14:33:41 CEST 2019 bjjl@chaos.lorenz.place:/home/bjjl/8.99/obj/sys/arch/amd64/compile/GENERIC amd64
Architecture: x86_64
Machine: amd64
>Description:
   Compiling proc-macro2 v0.4.24
   Compiling unicode-xid v0.1.0
   Compiling serde v1.0.82
dead lock detected
error: Could not compile `unicode-xid`.
warning: build failed, waiting for other jobs to finish...                                               
error: build failed
Traceback (most recent call last):
  File "./x.py", line 11, in <module>
    bootstrap.main()
  File "/home/bjjl/8.99/pkgsrc/lang/rust/work/rustc-1.34.1-src/src/bootstrap/bootstrap.py", line 845, in main
    bootstrap(help_triggered)
  File "/home/bjjl/8.99/pkgsrc/lang/rust/work/rustc-1.34.1-src/src/bootstrap/bootstrap.py", line 816, in bootstrap
    build.build_bootstrap()
  File "/home/bjjl/8.99/pkgsrc/lang/rust/work/rustc-1.34.1-src/src/bootstrap/bootstrap.py", line 652, in build_bootstrap
    run(args, env=env, verbose=self.verbose)
  File "/home/bjjl/8.99/pkgsrc/lang/rust/work/rustc-1.34.1-src/src/bootstrap/bootstrap.py", line 141, in run
    raise RuntimeError(err)
RuntimeError: failed to run: /home/bjjl/8.99/pkgsrc/lang/rust/work/rust-bootstrap/bin/cargo build --manifest-path /home/bjjl/8.99/pkgsrc/lang/rust/work/rustc-1.34.1-src/src/bootstrap/Cargo.toml --frozen         
*** Error code 1

Stop.
make[1]: stopped in /home/bjjl/8.99/pkgsrc/lang/rust
*** Error code 1


>How-To-Repeat:

>Fix:


>Release-Note:

>Audit-Trail:
From: Thomas Klausner <wiz@NetBSD.org>
To: NetBSD bugtracking <gnats-bugs@NetBSD.org>
Cc: 
Subject: Re: pkg/54192: lang/rust build error
Date: Fri, 10 May 2019 23:58:52 +0200

 ...
 >    Compiling proc-macro2 v0.4.24
 >    Compiling unicode-xid v0.1.0
 >    Compiling serde v1.0.82
 > dead lock detected
 > error: Could not compile `unicode-xid`.
 > warning: build failed, waiting for other jobs to finish...                                               
 > error: build failed

 Is this completely repeatable for you?
 I have seen similar errors before that usually go away when I retry.
  Thomas

From: Benjamin Lorenz <ben@pocket.services>
To: gnats-bugs@netbsd.org
Cc: pkg-manager@netbsd.org,
 gnats-admin@netbsd.org,
 pkgsrc-bugs@netbsd.org
Subject: Re: pkg/54192: lang/rust build error
Date: Sat, 11 May 2019 10:16:04 +0200

 > Is this completely repeatable for you?

 I tried several times and it happened each time.
 After commenting out MAKE_JOBS=4 in my /etc/mk.conf it was building fine.
 Is it possible to force parallel compilation to be switched off for a
 package?


From: Thomas Klausner <wiz@NetBSD.org>
To: NetBSD bugtracking <gnats-bugs@NetBSD.org>
Cc: 
Subject: Re: pkg/54192: lang/rust build error
Date: Sat, 11 May 2019 10:33:03 +0200

 On Sat, May 11, 2019 at 10:16:04AM +0200, Benjamin Lorenz wrote:
 > I tried several times and it happened each time.
 > After commenting out MAKE_JOBS=4 in my /etc/mk.conf it was building fine.
 > Is it possible to force parallel compilation to be switched off for a package?

 MAKE_JOBS_SAFE=no
 in the package Makefile.
  Thomas

From: Benjamin Lorenz <ben@pocket.services>
To: gnats-bugs@netbsd.org
Cc: pkg-manager@netbsd.org,
 gnats-admin@netbsd.org,
 pkgsrc-bugs@netbsd.org
Subject: Re: pkg/54192: lang/rust build error
Date: Sat, 11 May 2019 10:39:27 +0200

 > MAKE_JOBS_SAFE=no
 > in the package Makefile.

 Maybe it is worth putting this into the package for the time being
 and waiting to see whether the problem happens again for somebody?


From: coypu@sdf.org
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: pkg/54192: lang/rust build error
Date: Sun, 12 May 2019 18:46:02 +0000

 > dead lock detected

 This error is from netbsd rtld, not from rust or the package.

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: pkg/54192: lang/rust build error
Date: Mon, 13 May 2019 13:38:00 +0100

 I can't get past

 ===> Building for rust-1.34.1                                                                                  
 cd /tmp/pkgsrc/lang/rust/work/rustc-1.34.1-src  && /usr/bin/env CARGO_BUILD_JOBS=1 USETOOLS=no PTHREAD_CFLAGS=\
 ...
 PKG_CONFIG_PATH= CWRAPPERS_CONFIG_DIR=/tmp/pkgsrc/lang/rust/work/.cwrapper/config  /usr/pkg/bin/python3.7 ./x.py -v build -j 1
 running: /tmp/pkgsrc/lang/rust/work/rust-bootstrap/bin/cargo build --manifest-path /tmp/pkgsrc/lang/rust/work/rustc-1.34.1-src/src/bootstrap/Cargo.toml --frozen
 ...
    Compiling bootstrap v0.0.0 (/tmp/pkgsrc/lang/rust/work/rustc-1.34.1-src/src/bootstrap)                     
 error: Failed to delete invalidated or incompatible incremental compilation session directory contents `/tmp/pkgsrc/lang/rust/work/rustc-1.34.1-src/build/bootstrap/debug/incremental/rustdoc-1j5kv1msz9wqw/s-fc6csurjcr-1ardgdb-working/dep-graph.bin`: No such file or directory (os error 2).

 This is with pbulk runs on -current/amd64. I now tried manually from
 within the sandbox and

 # make show-var VARNAME=MAKE_JOBS

 #

 after a make clean, with the same result. (pbulk runs have been failing for a
 while; this was the first manual attempt.)

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: pkg/54192: lang/rust build error
Date: Mon, 13 May 2019 13:54:35 +0100

 Of course, now that I actually comment on a PR, a few more invocations
 of "make" in lang/rust later, it did get past.

From: Thomas Klausner <tk@giga.or.at>
To: NetBSD bugtracking <gnats-bugs@NetBSD.org>
Cc: 
Subject: Re: pkg/54192: lang/rust build error
Date: Fri, 26 Jul 2019 15:44:41 +0200

 I see the "dead lock detected" a lot when building packages using rust.

 E.g. from librsvg2:

    Compiling regex v1.0.5
 dead lock detected
 error: Could not compile `bitflags`.
 warning: build failed, waiting for other jobs to finish...
 error: build failed


 wip/spotifyd:

    Compiling try-lock v0.1.0
 dead lock detected
 error: Could not compile `arrayvec`.
 warning: build failed, waiting for other jobs to finish...
 error: build failed
 *** Error code 101


 wip/alacritty:

    Compiling color_quant v1.0.1
 dead lock detected
 error: Could not compile `libc`.
 warning: build failed, waiting for other jobs to finish...
 error: build failed
 *** Error code 101


 Repeating the build sometimes makes it work, sometimes I need more
 tries.

 What exactly is the issue here, what is deadlocking?
  Thomas

From: coypu@sdf.org
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: pkg/54192: lang/rust build error
Date: Fri, 26 Jul 2019 14:40:20 +0000

 > What exactly is the issue here, what is deadlocking?

 NetBSD's ld.so

From: Thomas Klausner <tk@giga.or.at>
To: NetBSD bugtracking <gnats-bugs@NetBSD.org>
Cc: 
Subject: Re: pkg/54192: lang/rust build error
Date: Fri, 26 Jul 2019 19:51:12 +0200

 On Fri, Jul 26, 2019 at 02:45:01PM +0000, coypu@sdf.org wrote:
 >  > What exactly is the issue here, what is deadlocking?
 >  
 >  NetBSD's ld.so

 Why is it deadlocking?
  Thomas

From: coypu@sdf.org
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: pkg/54192: lang/rust build error
Date: Sat, 27 Jul 2019 06:33:05 +0000

 I think the lock implemented by _rtld_exclusive_enter is bogus.

         for (;;) {
                 if (atomic_cas_uint(&_rtld_mutex, 0, locked_value) == 0) {
                         membar_enter();
                         break;
                 }
                 /* Didn't get the lock */

                 waiter = atomic_swap_uint(&_rtld_waiter_exclusive, self);
                 membar_sync();
                 cur = _rtld_mutex;
                 if (cur == locked_value) {
                         /* Someone locked it. Die for no reason. */
                         _rtld_error("dead lock detected");
                         _rtld_die();
                 }
                 if (cur)
                         _lwp_park(CLOCK_REALTIME, 0, NULL, 0,
                             __UNVOLATILE(&_rtld_mutex), NULL);
                 atomic_cas_uint(&_rtld_waiter_exclusive, self, 0);
                 if (waiter)
                         _lwp_unpark(waiter, __UNVOLATILE(&_rtld_mutex));
                 /* Wake up other waiters for some reason.
                    We saw the lock is still in use, so I don't understand why. */
         }


 I think the lock implemented by _rtld_shared_enter is bogus.

        for (;;) {
                 cur = _rtld_mutex;
                 /*
                  * First check if we are currently not exclusively locked.
                  */
                 if ((cur & RTLD_EXCLUSIVE_MASK) == 0) {
                         /* Yes, so increment use counter */
                         if (atomic_cas_uint(&_rtld_mutex, cur, cur + 1) != cur)
                                 continue;
                         membar_enter();
                         return;
                 }
                 /*
                  * Someone has an exclusive lock.  Puts us on the waiter list.
                  */
                 if (!self)
                         self = _lwp_self();
                 if (cur == (self | RTLD_EXCLUSIVE_MASK)) {
                         if (_rtld_mutex_may_recurse)
                                 return;
                         _rtld_error("dead lock detected");
                         _rtld_die();
                 }

 _rtld_mutex_may_recurse is false and apparently never changed.
 So our mechanism for handling an exclusive -> shared transition seems
 bogus. I'm not sure, since I did not manage to trigger it.

 Probably LD_BIND_NOW is a helpful hack.
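
The two loops quoted above share one lock word: while nobody holds it
exclusively it counts shared holders, and an exclusive holder stores its
own LWP ID with RTLD_EXCLUSIVE_MASK set. A minimal, non-blocking sketch of
that encoding follows. It is an illustration only, not the ld.elf_so code:
try_shared_enter, try_exclusive_enter and enum enter_result are hypothetical
names, the mask value is an assumption, and where the real code parks the
thread the sketch simply returns ENTER_BUSY.

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumed to mirror RTLD_EXCLUSIVE_MASK; the value is an assumption. */
#define EXCLUSIVE_MASK 0x80000000U

enum enter_result { ENTER_OK, ENTER_BUSY, ENTER_SELF_DEADLOCK };

/* Try to take the lock exclusively on behalf of LWP `self`. */
static enum enter_result
try_exclusive_enter(atomic_uint *lock, unsigned int self)
{
	unsigned int expected = 0;
	unsigned int locked_value = self | EXCLUSIVE_MASK;

	assert((self & EXCLUSIVE_MASK) == 0);
	if (atomic_compare_exchange_strong(lock, &expected, locked_value))
		return ENTER_OK;
	/* On failure `expected` holds the value actually seen. */
	if (expected == locked_value)
		return ENTER_SELF_DEADLOCK;	/* we already own it */
	return ENTER_BUSY;			/* real code would park here */
}

/* Try to take the lock shared on behalf of LWP `self`. */
static enum enter_result
try_shared_enter(atomic_uint *lock, unsigned int self)
{
	unsigned int cur = atomic_load(lock);

	assert((self & EXCLUSIVE_MASK) == 0);
	while ((cur & EXCLUSIVE_MASK) == 0) {
		/* No exclusive holder: bump the shared-holder count. */
		if (atomic_compare_exchange_weak(lock, &cur, cur + 1))
			return ENTER_OK;
		/* CAS failed; `cur` was reloaded, re-check the mask. */
	}
	if (cur == (self | EXCLUSIVE_MASK))
		return ENTER_SELF_DEADLOCK;	/* exclusive -> shared by same LWP */
	return ENTER_BUSY;			/* real code would park here */
}
```

In this model a shared enter by LWP 2 while the lock word is
(2 | EXCLUSIVE_MASK) yields ENTER_SELF_DEADLOCK, which corresponds to the
case the real code reports as "dead lock detected".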

Responsible-Changed-From-To: pkg-manager->toolchain-manager
Responsible-Changed-By: wiz@NetBSD.org
Responsible-Changed-When: Sat, 27 Jul 2019 07:04:56 +0000
Responsible-Changed-Why:
This PR is about a problem in ld.so now.


From: Joerg Sonnenberger <joerg@bec.de>
To: gnats-bugs@netbsd.org
Cc: pkg-manager@netbsd.org, gnats-admin@netbsd.org, pkgsrc-bugs@netbsd.org,
	ben@pocket.services
Subject: Re: pkg/54192: lang/rust build error
Date: Sat, 27 Jul 2019 17:58:45 +0200

 On Sat, Jul 27, 2019 at 06:35:01AM +0000, coypu@sdf.org wrote:
 >  I think the lock implemented by _rtld_shared_enter is bogus.

 You are not allowed to enter as shared again after taking an exclusive
 mutex. That's why all signals are blocked in that case.

 Joerg

From: coypu@sdf.org
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: pkg/54192: lang/rust build error
Date: Tue, 15 Oct 2019 19:12:11 +0000

 Here's a backtrace of the problem.
 (Rust 1.36, netbsd-current as of ~october 2019)

 Reading symbols from /usr/pkg/bin/cargo...
 [New process 1]
 Core was generated by `cargo'.
 Program terminated with signal SIGABRT, Aborted.
 #0  0x00007f7f8680d09a in _lwp_kill () from /usr/libexec/ld.elf_so
 (gdb) bt
 #0  0x00007f7f8680d09a in _lwp_kill () from /usr/libexec/ld.elf_so
 #1  0x00007f7f8680cf09 in abort () from /usr/libexec/ld.elf_so
 #2  0x00007f7f86801616 in _rtld_shared_enter () from /usr/libexec/ld.elf_so
 #3  0x00007f7f86800b91 in _rtld_bind () from /usr/libexec/ld.elf_so
 #4  0x00007f7f868007fd in _rtld_bind_start () from /usr/libexec/ld.elf_so
 #5  0x0000000000000206 in ?? ()
 #6  0x0000785c51a9043a in dup2 () from /usr/lib/libc.so.12
 #7  0x0000785c51b18592 in je_jemalloc_prefork () from /usr/lib/libc.so.12
 #8  0x0000785c5375c000 in ?? ()
 #9  0x000000000000009c in ?? ()
 #10 0x0000785c5260a0ee in pthread_sigmask () from /usr/lib/libpthread.so.1
 #11 0x00000000521845fd in std::sys::unix::process::process_inner::<impl std::sys::unix::process::process_common::Command>::do_exec ()
     at src/libstd/sys/unix/process/process_unix.rs:230
 #12 0x0000000052183edc in std::sys::unix::process::process_inner::<impl std::sys::unix::process::process_common::Command>::spawn ()
     at src/libstd/sys/unix/process/process_unix.rs:50
 #13 0x000000005216fe3b in std::process::Command::spawn () at src/libstd/process.rs:768
 #14 0x0000000051c3f094 in cargo::util::process_builder::ProcessBuilder::exec_with_streaming ()
 #15 0x0000000051dbda9a in <cargo::core::compiler::DefaultExecutor as cargo::core::compiler::Executor>::exec ()
 #16 0x0000000051dbad64 in <F as cargo::core::compiler::job::FnBox<A,R>>::call_box ()
 #17 0x0000000051d87c0f in <F as cargo::core::compiler::job::FnBox<A,R>>::call_box ()
 #18 0x0000000051d87c0f in <F as cargo::core::compiler::job::FnBox<A,R>>::call_box ()
 #19 0x0000000051df4f72 in crossbeam_utils::thread::ScopedThreadBuilder::spawn::{{closure}} ()
 #20 0x0000000051ae6f0b in std::sys_common::backtrace::__rust_begin_short_backtrace ()
 #21 0x000000005218e5ca in __rust_maybe_catch_panic () at src/libpanic_unwind/lib.rs:85
 #22 0x0000000051aec1b9 in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
 #23 0x000000005218818f in <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once ()
     at /cvs/pkgsrc/lang/rust/work/rustc-1.36.0-src/src/liballoc/boxed.rs:704
 #24 0x0000000052170a60 in <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once ()
     at /cvs/pkgsrc/lang/rust/work/rustc-1.36.0-src/src/liballoc/boxed.rs:704
 #25 std::sys_common::thread::start_thread () at src/libstd/sys_common/thread.rs:13
 #26 0x0000000052187ae6 in std::sys::unix::thread::Thread::new::thread_start () at src/libstd/sys/unix/thread.rs:79
 #27 0x0000785c5260c1e8 in ?? () from /usr/lib/libpthread.so.1
 #28 0x0000785c51a901d0 in ?? () from /usr/lib/libc.so.12
 Backtrace stopped: Cannot access memory at address 0x785c484d0000

 Attempting to debug it as a live process, not from a coredump, produces
 kernel panics.

From: Havard Eidnes <he@NetBSD.org>
To: gnats-bugs@netbsd.org, coypu@sdf.org
Cc: toolchain-manager@netbsd.org, netbsd-bugs@netbsd.org,
 ben@pocket.services
Subject: Re: pkg/54192: lang/rust build error
Date: Tue, 15 Oct 2019 23:20:07 +0200 (CEST)

 ----Next_Part(Tue_Oct_15_23_20_07_2019_498)--
 Content-Type: Text/Plain; charset=iso-8859-1
 Content-Transfer-Encoding: quoted-printable

 >  Here's a backtrace of the problem.
 >  (Rust 1.36, netbsd-current as of ~october 2019)
 >
 >  Reading symbols from /usr/pkg/bin/cargo...
 >  [New process 1]
 >  Core was generated by `cargo'.
 >  Program terminated with signal SIGABRT, Aborted.
 >  #0  0x00007f7f8680d09a in _lwp_kill () from /usr/libexec/ld.elf_so
 >  (gdb) bt
 >  #0  0x00007f7f8680d09a in _lwp_kill () from /usr/libexec/ld.elf_so
 >  #1  0x00007f7f8680cf09 in abort () from /usr/libexec/ld.elf_so
 >  #2  0x00007f7f86801616 in _rtld_shared_enter () from /usr/libexec/ld.elf_so
 >  #3  0x00007f7f86800b91 in _rtld_bind () from /usr/libexec/ld.elf_so
 >  #4  0x00007f7f868007fd in _rtld_bind_start () from /usr/libexec/ld.elf_so
 >  #5  0x0000000000000206 in ?? ()
 >  #6  0x0000785c51a9043a in dup2 () from /usr/lib/libc.so.12
 >  #7  0x0000785c51b18592 in je_jemalloc_prefork () from /usr/lib/libc.so.12
 >  #8  0x0000785c5375c000 in ?? ()
 >  #9  0x000000000000009c in ?? ()
 >  #10 0x0000785c5260a0ee in pthread_sigmask () from /usr/lib/libpthread.so.1
 >  #11 0x00000000521845fd in std::sys::unix::process::process_inner::<impl std::sys::unix::process::process_common::Command>::do_exec ()
 ...

 I admit that I never understood in any meaningful manner what the
 "dead lock detected" error in ld.elf_so is actually objecting to.

 I suspect that the condition starts with

 "You cannot from two different threads in a process simultaneously
 do <x>", but I've never really grasped what the list of possible
 "<x>" conditions is.

 Looking at the code in rtld.c, it appears that the exclusive lock is
 held while either loading a shared library (and its dependencies),
 via dlopen(), or while doing the book-keeping for calling the
 "_init" or "_fini" functions of any shared libraries (but not
 actually while those functions are invoked(?)).  However, it's not
 clear to me whether __HAVE_FUNCTION_DESCRIPTORS is defined or not,
 and therefore, under which circumstances either the exclusive or
 shared lock is used by e.g. do_dlsym().  And further, is any lock
 taken when a given function in a shared library is called the first
 time?  Or isn't ld.elf_so code involved in that at all?

 And what happens if some shared locks are held, but you happen to
 desire an exclusive lock?  I'm not able to tell from reading the
 code...

 Next, it's also not clear to me whether the restrictions imposed by
 the locking in ld.elf_so are ... "reasonable", i.e. whether this can
 be considered a bug in our ld.elf_so which we ought to fix, or
 whether it's rust / cargo doing something it should not do (and if
 that restriction is according to some standard or other, though
 that's probably doubtful).

 At a minimum, I'd say that the diagnostic could be better, i.e.
 ld.elf_so ought to itself be able to tell which "<x>" condition is
 violated, preferably in some terms that are more easily understood
 than simply "dead lock detected".  To that end I've drafted the
 attached diff to add some more verbosity, FWIW (compile tested on
 i386 only, I've tentatively added x86_64 _rtld_bind, other CPUs need
 similar treatment).

 Regards,

 - Håvard

 ----Next_Part(Tue_Oct_15_23_20_07_2019_498)--
 Content-Type: Text/Plain; charset=us-ascii
 Content-Transfer-Encoding: 7bit

 Index: reloc.c
 ===================================================================
 RCS file: /cvsroot/src/libexec/ld.elf_so/reloc.c,v
 retrieving revision 1.110
 diff -u -r1.110 reloc.c
 --- reloc.c	27 Apr 2017 08:37:15 -0000	1.110
 +++ reloc.c	15 Oct 2019 21:15:15 -0000
 @@ -263,7 +263,7 @@
  	_rtld_shared_exit();
  	target = _rtld_call_function_addr(obj,
  	    (Elf_Addr)obj->relocbase + def->st_value);
 -	_rtld_shared_enter();
 +	_rtld_shared_enter("_rtld_resolve_ifunc done");

  	return target;
  }
 Index: rtld.c
 ===================================================================
 RCS file: /cvsroot/src/libexec/ld.elf_so/rtld.c,v
 retrieving revision 1.183.4.2
 diff -u -r1.183.4.2 rtld.c
 --- rtld.c	29 Aug 2017 09:43:17 -0000	1.183.4.2
 +++ rtld.c	15 Oct 2019 21:15:15 -0000
 @@ -140,7 +140,7 @@
  {
  	_rtld_exclusive_exit(mask);
  	_rtld_call_function_void(obj, func);
 -	_rtld_exclusive_enter(mask);
 +	_rtld_exclusive_enter(mask, "initfini_done");
  }

  static void
 @@ -372,7 +372,7 @@

  	dbg(("rtld_exit()"));

 -	_rtld_exclusive_enter(&mask);
 +	_rtld_exclusive_enter(&mask, "rtld_exit");

  	_rtld_call_fini_functions(&mask, 1);

 @@ -735,7 +735,7 @@

  	_rtld_debug_state();	/* say hello to gdb! */

 -	_rtld_exclusive_enter(&mask);
 +	_rtld_exclusive_enter(&mask, "init functions");

  	dbg(("calling _init functions"));
  	_rtld_call_init_functions(&mask);
 @@ -942,7 +942,7 @@

  	dbg(("dlclose of %p", handle));

 -	_rtld_exclusive_enter(&mask);
 +	_rtld_exclusive_enter(&mask, "dlclose");

  	root = _rtld_dlcheck(handle);

 @@ -989,7 +989,7 @@

  	dbg(("dlopen of %s %d", name, mode));

 -	_rtld_exclusive_enter(&mask);
 +	_rtld_exclusive_enter(&mask, "dlopen");

  	flags |= (mode & RTLD_GLOBAL) ? _RTLD_GLOBAL : 0;
  	flags |= (mode & RTLD_NOLOAD) ? _RTLD_NOLOAD : 0;
 @@ -1079,10 +1079,10 @@
  #endif

  #ifdef __HAVE_FUNCTION_DESCRIPTORS
 -#define	lookup_mutex_enter()	_rtld_exclusive_enter(&mask)
 +#define	lookup_mutex_enter(why)	_rtld_exclusive_enter(&mask, why)
  #define	lookup_mutex_exit()	_rtld_exclusive_exit(&mask)
  #else
 -#define	lookup_mutex_enter()	_rtld_shared_enter()
 +#define	lookup_mutex_enter(why)	_rtld_shared_enter(why)
  #define	lookup_mutex_exit()	_rtld_shared_exit()
  #endif

 @@ -1099,7 +1099,7 @@
  	sigset_t mask;
  #endif

 -	lookup_mutex_enter();
 +	lookup_mutex_enter("do_dlsym");

  	hash = _rtld_elf_hash(name);
  	def = NULL;
 @@ -1195,7 +1195,7 @@
  		if (ELF_ST_TYPE(def->st_info) == STT_GNU_IFUNC) {
  #ifdef __HAVE_FUNCTION_DESCRIPTORS
  			lookup_mutex_exit();
 -			_rtld_shared_enter();
 +			_rtld_shared_enter("resolve_ifunc");
  #endif
  			p = (void *)_rtld_resolve_ifunc(defobj, def);
  			_rtld_shared_exit();
 @@ -1275,7 +1275,7 @@

  	dbg(("dladdr of %p", addr));

 -	lookup_mutex_enter();
 +	lookup_mutex_enter("dladdr");

  #ifdef __HAVE_FUNCTION_DESCRIPTORS
  	addr = _rtld_function_descriptor_function(addr);
 @@ -1348,7 +1348,7 @@

  	dbg(("dlinfo for %p %d", handle, req));

 -	_rtld_shared_enter();
 +	_rtld_shared_enter("dlinfo");

  	if (handle == RTLD_SELF) {
  #ifdef __powerpc__
 @@ -1397,7 +1397,7 @@

  	dbg(("dl_iterate_phdr"));

 -	_rtld_shared_enter();
 +	_rtld_shared_enter("dl_iterate_phdr");

  	for (obj = _rtld_objlist;  obj != NULL;  obj = obj->next) {
  		phdr_info.dlpi_addr = (Elf_Addr)obj->relocbase;
 @@ -1436,7 +1436,7 @@

  	dbg(("__dl_cxa_refcount of %p with %zd", addr, delta));

 -	_rtld_exclusive_enter(&mask);
 +	_rtld_exclusive_enter(&mask, "__dl_cxa_refcount");
  	obj = _rtld_obj_from_addr(addr);

  	if (obj == NULL) {
 @@ -1580,8 +1580,10 @@
  static volatile unsigned int _rtld_waiter_exclusive;
  static volatile unsigned int _rtld_waiter_shared;

 +const char *exclusive_lock_reason;
 +
  void
 -_rtld_shared_enter(void)
 +_rtld_shared_enter(const char *why)
  {
  	unsigned int cur;
  	lwpid_t waiter, self = 0;
 @@ -1608,7 +1610,10 @@
  		if (cur == (self | RTLD_EXCLUSIVE_MASK)) {
  			if (_rtld_mutex_may_recurse)
  				return;
 -			_rtld_error("dead lock detected");
 +			if (exclusive_lock_reason)
 +				_rtld_error("dead lock detected, want shared lock for %s, exclusive lock %s held", why, exclusive_lock_reason);
 +			else
 +				_rtld_error("dead lock detected, want shared lock for %s", why);
  			_rtld_die();
  		}
  		waiter = atomic_swap_uint(&_rtld_waiter_shared, self);
 @@ -1652,7 +1657,7 @@
  }

  void
 -_rtld_exclusive_enter(sigset_t *mask)
 +_rtld_exclusive_enter(sigset_t *mask, const char *why)
  {
  	lwpid_t waiter, self = _lwp_self();
  	unsigned int locked_value = (unsigned int)self | RTLD_EXCLUSIVE_MASK;
 @@ -1672,6 +1677,9 @@
  		membar_sync();
  		cur = _rtld_mutex;
  		if (cur == locked_value) {
 -			_rtld_error("dead lock detected");
 +			if (exclusive_lock_reason)
 +				_rtld_error("dead lock detected, want exclusive lock for %s, but exclusive lock %s held", why, exclusive_lock_reason);
 +			else
 +				_rtld_error("dead lock detected, want exclusive lock for %s", why);
  			_rtld_die();
  		}
 @@ -1682,6 +1691,7 @@
  		if (waiter)
  			_lwp_unpark(waiter, __UNVOLATILE(&_rtld_mutex));
  	}
 +	exclusive_lock_reason = why;
  }

  void
 @@ -1692,6 +1702,8 @@
  	membar_exit();
  	_rtld_mutex = 0;
  	membar_sync();
 +	exclusive_lock_reason = NULL;
 +
  	if ((waiter = _rtld_waiter_exclusive) != 0)
  		_lwp_unpark(waiter, __UNVOLATILE(&_rtld_mutex));

 Index: rtld.h
 ===================================================================
 RCS file: /cvsroot/src/libexec/ld.elf_so/rtld.h,v
 retrieving revision 1.126.6.3
 diff -u -r1.126.6.3 rtld.h
 --- rtld.h	29 Aug 2017 09:43:17 -0000	1.126.6.3
 +++ rtld.h	15 Oct 2019 21:15:15 -0000
 @@ -378,9 +378,9 @@
  Objlist_Entry *_rtld_objlist_find(Objlist *, const Obj_Entry *);
  void _rtld_ref_dag(Obj_Entry *);

 -void _rtld_shared_enter(void);
 +void _rtld_shared_enter(const char *);
  void _rtld_shared_exit(void);
 -void _rtld_exclusive_enter(sigset_t *);
 +void _rtld_exclusive_enter(sigset_t *, const char *);
  void _rtld_exclusive_exit(sigset_t *);

  /* expand.c */
 Index: tls.c
 ===================================================================
 RCS file: /cvsroot/src/libexec/ld.elf_so/tls.c,v
 retrieving revision 1.10.8.1
 diff -u -r1.10.8.1 tls.c
 --- tls.c	25 Jul 2017 01:36:58 -0000	1.10.8.1
 +++ tls.c	15 Oct 2019 21:15:15 -0000
 @@ -63,7 +63,7 @@
  	void **dtv, **new_dtv;
  	sigset_t mask;

 -	_rtld_exclusive_enter(&mask);
 +	_rtld_exclusive_enter(&mask, "_rtld_tls_get_addr");

  	dtv = tcb->tcb_dtv;

 @@ -157,7 +157,7 @@
  	struct tls_tcb *tcb;
  	sigset_t mask;

 -	_rtld_exclusive_enter(&mask);
 +	_rtld_exclusive_enter(&mask, "_rtld_tls_allocate");
  	tcb = _rtld_tls_allocate_locked();
  	_rtld_exclusive_exit(&mask);

 @@ -171,7 +171,7 @@
  	uint8_t *p, *p_end;
  	sigset_t mask;

 -	_rtld_exclusive_enter(&mask);
 +	_rtld_exclusive_enter(&mask, "_rtld_tls_free");

  #ifdef __HAVE_TLS_VARIANT_I
  	p = (uint8_t *)tcb;
 Index: arch/i386/mdreloc.c
 ===================================================================
 RCS file: /cvsroot/src/libexec/ld.elf_so/arch/i386/mdreloc.c,v
 retrieving revision 1.37.8.1
 diff -u -r1.37.8.1 mdreloc.c
 --- arch/i386/mdreloc.c	4 Jul 2017 12:47:59 -0000	1.37.8.1
 +++ arch/i386/mdreloc.c	15 Oct 2019 21:15:15 -0000
 @@ -260,7 +260,7 @@

  	new_value = 0;	/* XXX gcc */

 -	_rtld_shared_enter();
 +	_rtld_shared_enter("_rtld_bind");
  	err = _rtld_relocate_plt_object(obj, rel, &new_value);
  	if (err)
  		_rtld_die();
 Index: arch/x86_64/mdreloc.c
 ===================================================================
 RCS file: /cvsroot/src/libexec/ld.elf_so/arch/x86_64/mdreloc.c,v
 retrieving revision 1.41.8.1
 diff -u -r1.41.8.1 mdreloc.c
 --- arch/x86_64/mdreloc.c	4 Jul 2017 12:47:59 -0000	1.41.8.1
 +++ arch/x86_64/mdreloc.c	15 Oct 2019 21:15:15 -0000
 @@ -342,7 +342,7 @@

  	new_value = 0; /* XXX GCC4 */

 -	_rtld_shared_enter();
 +	_rtld_shared_enter("_rtld_bind");
  	error = _rtld_relocate_plt_object(obj, rela, &new_value);
  	if (error)
  		_rtld_die();

 ----Next_Part(Tue_Oct_15_23_20_07_2019_498)----

From: matthew green <mrg@eterna.com.au>
To: Havard Eidnes <he@NetBSD.org>
Cc: toolchain-manager@netbsd.org, netbsd-bugs@netbsd.org,
    ben@pocket.services, gnats-bugs@netbsd.org, coypu@sdf.org
Subject: re: pkg/54192: lang/rust build error
Date: Wed, 16 Oct 2019 10:28:08 +1100

 > actually while those functions are invoked(?)).  However, it's not
 > clear to me whether __HAVE_FUNCTION_DESCRIPTORS is defined or not,

 FWIW, __HAVE_FUNCTION_DESCRIPTORS is only used on hppa.


 .mrg.

From: Joerg Sonnenberger <joerg@bec.de>
To: gnats-bugs@netbsd.org
Cc: toolchain-manager@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, ben@pocket.services
Subject: Re: pkg/54192: lang/rust build error
Date: Wed, 16 Oct 2019 01:44:00 +0200

 --lrZ03NoBR/3+SXJZ
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: inline

 On Tue, Oct 15, 2019 at 07:15:01PM +0000, coypu@sdf.org wrote:
 >  Here's a backtrace of the problem.
 >  (Rust 1.36, netbsd-current as of ~october 2019)

 Try the attached patch. It's the only constellation that makes sense to
 me.

 Joerg

 --lrZ03NoBR/3+SXJZ
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: attachment; filename="lid-after-fork.diff"

 diff -r 7e46460ebf1d sys/kern/kern_lwp.c
 --- a/sys/kern/kern_lwp.c	Tue Oct 15 06:58:12 2019 +0000
 +++ b/sys/kern/kern_lwp.c	Wed Oct 16 00:35:25 2019 +0200
 @@ -902,6 +902,8 @@
  	if ((flags & LWP_PIDLID) != 0) {
  		lid = proc_alloc_pid(p2);
  		l2->l_pflag |= LP_PIDLID;
 +	} else if (p2->p_nlwps == 0) {
 +		lid = l1->l_lid;
  	} else {
  		lid = 0;
  	}

 --lrZ03NoBR/3+SXJZ--

From: coypu@sdf.org
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: pkg/54192: lang/rust build error
Date: Wed, 16 Oct 2019 13:32:51 +0000

 joerg's patch works pretty well.

From: Kamil Rytarowski <n54@gmx.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: pkg/54192: lang/rust build error
Date: Wed, 16 Oct 2019 15:33:31 +0200

 On 16.10.2019 01:55, Joerg Sonnenberger wrote:
 > The following reply was made to PR toolchain/54192; it has been noted by GNATS.
 >
 > From: Joerg Sonnenberger <joerg@bec.de>
 > To: gnats-bugs@netbsd.org
 > Cc: toolchain-manager@netbsd.org, gnats-admin@netbsd.org,
 > 	netbsd-bugs@netbsd.org, ben@pocket.services
 > Subject: Re: pkg/54192: lang/rust build error
 > Date: Wed, 16 Oct 2019 01:44:00 +0200
 >
 >   --lrZ03NoBR/3+SXJZ
 >   Content-Type: text/plain; charset=us-ascii
 >   Content-Disposition: inline
 >
 >   On Tue, Oct 15, 2019 at 07:15:01PM +0000, coypu@sdf.org wrote:
 >   >  Here's a backtrace of the problem.
 >   >  (Rust 1.36, netbsd-current as of ~october 2019)
 >
 >   Try the attached patch. It's the only constellation that makes sense to
 >   me.
 >
 >   Joerg
 >
 >   --lrZ03NoBR/3+SXJZ
 >   Content-Type: text/plain; charset=us-ascii
 >   Content-Disposition: attachment; filename="lid-after-fork.diff"
 >
 >   diff -r 7e46460ebf1d sys/kern/kern_lwp.c
 >   --- a/sys/kern/kern_lwp.c	Tue Oct 15 06:58:12 2019 +0000
 >   +++ b/sys/kern/kern_lwp.c	Wed Oct 16 00:35:25 2019 +0200
 >   @@ -902,6 +902,8 @@
 >    	if ((flags & LWP_PIDLID) != 0) {
 >    		lid = proc_alloc_pid(p2);
 >    		l2->l_pflag |= LP_PIDLID;
 >   +	} else if (p2->p_nlwps == 0) {
 >   +		lid = l1->l_lid;
 >    	} else {
 >    		lid = 0;
 >    	}
 >
 >   --lrZ03NoBR/3+SXJZ--
 >
 >

 I am against this patch as it breaks the current implementation-specific
 behavior where we pick lwp=1 for the first thread in a process.

 If anything depends on this lid logic, it's a bug in ld.elf_so.

 If there is a need to detect forks and distinguish threads, it's possible
 to store the pid+lid pair.
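
The pid+lid idea can be sketched in portable C. This is only an
illustration of the suggestion above, not a proposed ld.elf_so change:
struct lock_owner and all function names are hypothetical, and the lid is
passed in as a plain number because _lwp_self() is NetBSD-specific.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical owner record: which thread of which process took the lock. */
struct lock_owner {
	pid_t pid;
	unsigned int lid;
};

/* Record the calling process and the given lid as the lock owner. */
static struct lock_owner
owner_record(unsigned int lid)
{
	return (struct lock_owner){ getpid(), lid };
}

/* True only if the record was made by this lid in this very process. */
static int
owner_is_self(const struct lock_owner *o, unsigned int lid)
{
	return o->pid == getpid() && o->lid == lid;
}

/* True if the record was inherited across fork() from another process. */
static int
owner_is_stale_after_fork(const struct lock_owner *o)
{
	return o->pid != getpid();
}

/*
 * Fork and let the child inspect the inherited record.
 * Returns 1 if the child saw it as stale and not its own, 0 if not,
 * -1 on error.
 */
static int
child_sees_stale_owner(const struct lock_owner *o)
{
	int status;
	pid_t child = fork();

	if (child == -1)
		return -1;
	if (child == 0)
		_exit(owner_is_stale_after_fork(o) &&
		    !owner_is_self(o, o->lid) ? 0 : 1);
	if (waitpid(child, &status, 0) == -1)
		return -1;
	return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

With the pid in the record, a child whose only LWP happens to have the
same lid as the recorded owner no longer mistakes the inherited record
for its own.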

From: Kamil Rytarowski <n54@gmx.com>
To: gnats-bugs@netbsd.org, netbsd-bugs@netbsd.org, ben@pocket.services
Cc: 
Subject: Re: pkg/54192: lang/rust build error
Date: Wed, 16 Oct 2019 15:41:49 +0200

 On 16.10.2019 15:35, coypu@sdf.org wrote:
 > The following reply was made to PR toolchain/54192; it has been noted by GNATS.
 >
 > From: coypu@sdf.org
 > To: gnats-bugs@netbsd.org
 > Cc:
 > Subject: Re: pkg/54192: lang/rust build error
 > Date: Wed, 16 Oct 2019 13:32:51 +0000
 >
 >   joerg's patch works pretty well.
 >
 >

 It's a hack that changes the correct kernel behavior. Please fix the
 real bug in ld.elf_so.

From: Martin Husemann <martin@duskware.de>
To: Kamil Rytarowski <n54@gmx.com>
Cc: gnats-bugs@netbsd.org, netbsd-bugs@netbsd.org, ben@pocket.services
Subject: Re: pkg/54192: lang/rust build error
Date: Wed, 16 Oct 2019 15:50:51 +0200

 Since it was not obvious to me, let me try to explain what happens:

  - a multithreaded program calls fork()
  - the forked child process only has a single lwp initially, which has lid 1

 So in the parent thread _lwp_self() (aka lid) was != 1, and now in the child
 it has changed. Apparently something in ld.elf_so goes wrong later when
 using _lwp_self() == 1 with a memory state inherited that was set up for
 _lwp_self() != 1.

 Joerg's patch makes the single lwp inherit the lid from the forking parent
 thread (instead of always being 1).

 Martin
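
One way to read the explanation above against the check in
_rtld_shared_enter() quoted earlier in this PR: the message is produced
when the lock word equals (self | RTLD_EXCLUSIVE_MASK), so a child whose
only LWP has lid 1 would trip it if the inherited lock word was taken
exclusively by the parent's LWP 1. The following is a minimal model of that
reading, not the rtld code; reports_dead_lock and child_lid are
hypothetical names and the mask value is an assumption.

```c
#include <assert.h>

/* Assumed to mirror RTLD_EXCLUSIVE_MASK; the value is an assumption. */
#define EXCLUSIVE_MASK 0x80000000U

/* The comparison from _rtld_shared_enter(), isolated as a pure function. */
static int
reports_dead_lock(unsigned int lock_word, unsigned int self)
{
	assert((self & EXCLUSIVE_MASK) == 0);
	return lock_word == (self | EXCLUSIVE_MASK);
}

/*
 * LWP ID of the single thread in a freshly forked child.
 * inherit_lid == 0 models the current kernel behavior (always 1);
 * inherit_lid != 0 models the lid-after-fork.diff proposal.
 */
static unsigned int
child_lid(unsigned int forking_lid, int inherit_lid)
{
	return inherit_lid ? forking_lid : 1;
}
```

For a lock word of (1 | EXCLUSIVE_MASK) and a forking thread with lid 2,
reports_dead_lock() is false in the parent, true for child_lid(2, 0), and
false again for child_lid(2, 1).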

From: Kamil Rytarowski <n54@gmx.com>
To: "gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>
Cc: 
Subject: Re: pkg/54192: lang/rust build error
Date: Wed, 16 Oct 2019 16:05:50 +0200

 On 16.10.2019 15:56, Kamil Rytarowski wrote:
 > On 16.10.2019 15:50, Martin Husemann wrote:
 >> Since it was not obvious to me, let me try to explain what happens:
 >>
 >>   - a multithreaded program calls fork()
 >>   - the forked child process only has a single lwp initially, which has
 >> lid 1
 >>
 >> So in the parent thread _lwp_self() (aka lid) was != 1, and now in the
 >> child
 >> it has changed. Apparently something in ld.elf_so goes wrong later when
 >> using _lwp_self() == 1 with a memory state inherited that was set up for
 >> _lwp_self() != 1.
 >>
 >> Joerg's patch makes the single lwp inherit the lid from the forking parent
 >> thread (instead of always being 1).
 >>
 >> Martin
 >>
 >
 > This is the right behavior of the kernel.
 >
 > The same will happen on FreeBSD (lwpid is always unique globally), on
 > Linux (getpid is always unique) etc.
 >
 > If there is some assumption in NetBSD on copied lid, it's a bug in
 > ld.elf_so.

 Proof from FreeBSD:

 $ cat test.c
 #include <pthread.h>
 #include <string.h>
 #include <stdio.h>
 #include <unistd.h>

 #include <sys/thr.h>

 static long
 _lwp_self(void)
 {
 	long Tid;
 	thr_self(&Tid);
 	return Tid;
 }

 void *thread_main(void *arg) {
    printf("child %ld\n", _lwp_self());
    if (fork() == 0) {
       char buf[20];
       snprintf(buf, sizeof(buf), "child %ld\n", _lwp_self());
       write(1, buf, strlen(buf));
       _exit(0);
    }
    return 0;
 }

 int main(void) {
    printf("%ld\n", _lwp_self());
    pthread_t thread;
    pthread_create(&thread, NULL, thread_main, NULL);
    pthread_join(thread, NULL);
 }

 $ ./a.out
 100123
 child 100126
 child 100119

 I am strongly against changing the correct NetBSD kernel behavior here.

From: Kamil Rytarowski <n54@gmx.com>
To: 
Cc: gnats-bugs@netbsd.org
Subject: Re: pkg/54192: lang/rust build error
Date: Wed, 16 Oct 2019 15:56:47 +0200

 On 16.10.2019 15:50, Martin Husemann wrote:
 > Since it was not obvious to me, let me try to explain what happens:
 >
 >   - a multithreaded program calls fork()
 >   - the forked child process only has a single lwp initially, which has lid 1
 >
 > So in the parent thread _lwp_self() (aka lid) was != 1, and now in the child
 > it has changed. Apparently something in ld.elf_so goes wrong later when
 > using _lwp_self() == 1 with a memory state inherited that was set up for
 > _lwp_self() != 1.
 >
 > Joerg's patch makes the single lwp inherit the lid from the forking parent
 > thread (instead of always being 1).
 >
 > Martin
 >

 This is the right behavior of the kernel.

 The same will happen on FreeBSD (lwpid is always unique globally), on
 Linux (getpid is always unique) etc.

 If there is some assumption in NetBSD on copied lid, it's a bug in
 ld.elf_so.

From: Joerg Sonnenberger <joerg@bec.de>
To: gnats-bugs@netbsd.org
Cc: toolchain-manager@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, ben@pocket.services
Subject: Re: pkg/54192: lang/rust build error
Date: Wed, 16 Oct 2019 17:14:56 +0200

 On Wed, Oct 16, 2019 at 02:10:02PM +0000, Kamil Rytarowski wrote:
 >  > Joerg's patch makes the single lwp inherit the lid from the forking parent
 >  > thread (instead of always being 1).
 >  
 >  This is the right behavior of the kernel.

 You have given no justification for that.

 >  The same will happen on FreeBSD (lwpid is always unique globally), on
 >  Linux (getpid is always unique) etc.

 So our per-process LWP identifier should behave like global identifier
 from FreeBSD/Linux, because? They have little in common and the existing
 behavior is invalid even under the FreeBSD/Linux semantic.

 On a fundamental level, there are two sane models for a thread
 identifier:
 (1) It is globally unique.
 (2) It is process local and invariant.

 The current implementation in NetBSD fails both. A forked child of a
 multi-threaded program is in a special intermediary state. It still shows
 various shadows of the other threads of the parent. Semantically again,
 there are two models to implement sensibly:
 (1) fork() creates a new thread, clones the address space and drops the
     thread from the parent. In this case, the thread identifier would be
     unique before the clone.
 (2) fork() clones the address space and preserves the nature of the
     current thread. In this case, the thread identifier should be
     preserved as is.
 Again, right now we do neither. The problem exists because the child
 overlaps the thread identifier of the parent: it re-uses a
 thread identifier that should still be in use. The patch implements
 the second semantic. It makes _lwp_self() truly idempotent. Ensuring
 that it is not used at the time of fork in the parent would also be an
 option, but it is certainly more work and slower.

 Joerg
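
As an illustration of why an overlapping thread identifier matters, the following toy model (a sketch, not the ld.elf_so source) assumes a lock word that holds either a shared use count or the id of an exclusive holder with a marker bit set. EXCLUSIVE_MASK, enum enter_result and shared_enter_step() are hypothetical names chosen for this example only.

```c
#include <stdint.h>

/* Hypothetical marker: top bit set means "held exclusively by the id
 * stored in the low bits". */
#define EXCLUSIVE_MASK 0x80000000u

enum enter_result {
	ENTER_OK,		/* no exclusive holder: use count was incremented */
	ENTER_SELF_DEADLOCK,	/* lock word names the caller as exclusive holder */
	ENTER_WOULD_BLOCK	/* another id holds it: caller would have to sleep */
};

/* One step of a shared acquire on such a lock word. */
enum enter_result shared_enter_step(uint32_t *lock_word, uint32_t self_id)
{
	uint32_t cur = *lock_word;

	if ((cur & EXCLUSIVE_MASK) == 0) {
		*lock_word = cur + 1;
		return ENTER_OK;
	}
	if (cur == (self_id | EXCLUSIVE_MASK))
		return ENTER_SELF_DEADLOCK;
	return ENTER_WOULD_BLOCK;
}
```

Under this model, suppose the thread with id 1 holds the lock exclusively while the thread with id 3 forks. The child inherits a lock word of (1 | EXCLUSIVE_MASK). If the child's only thread is renumbered to 1, shared_enter_step() reports ENTER_SELF_DEADLOCK even though that thread never took the lock. If the child keeps id 3, the result is ENTER_WOULD_BLOCK, and in a child with no other thread nothing would ever release the lock.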

From: Martin Husemann <martin@duskware.de>
To: Joerg Sonnenberger <joerg@bec.de>
Cc: gnats-bugs@netbsd.org, toolchain-manager@netbsd.org,
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, ben@pocket.services
Subject: Re: pkg/54192: lang/rust build error
Date: Wed, 16 Oct 2019 17:29:39 +0200

 On Wed, Oct 16, 2019 at 05:14:56PM +0200, Joerg Sonnenberger wrote:
 > On a fundamental level, there are two sane models for a thread
 > identifier:
 > (1) It is globally unique.
 > (2) It is process local and invariant.

 Agreed so far, but the question is whether "invariant" holds across fork().

 Pragmatically we should look for the easiest solution.

 Martin


From: coypu@sdf.org
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: pkg/54192: lang/rust build error
Date: Wed, 16 Oct 2019 15:48:08 +0000

 I ran the patch and had it continuously build cbindgen. It eventually
 got stuck.

 $ for cargopid in `pgrep cargo`; do gdb -q -p $cargopid -ex "thread apply all bt" -ex q -batch; done
 [Switching to LWP 3 of process 5283]
 0x00007f7e2300c3ba in ___lwp_park60 () from /usr/libexec/ld.elf_so

 Thread 1 (LWP 3 of process 5283):
 #0  0x00007f7e2300c3ba in ___lwp_park60 () from /usr/libexec/ld.elf_so
 #1  0x00007f7e23001595 in _rtld_shared_enter () from /usr/libexec/ld.elf_so
 #2  0x00007f7e23000b91 in _rtld_bind () from /usr/libexec/ld.elf_so
 #3  0x00007f7e230007fd in _rtld_bind_start () from /usr/libexec/ld.elf_so
 #4  0x0000000000000206 in ?? ()
 #5  0x000075ac8529043a in dup2 () from /usr/lib/libc.so.12
 #6  0x000075ac85318592 in je_jemalloc_prefork () from /usr/lib/libc.so.12
 #7  0x000075ac86088400 in ?? ()
 #8  0x000000000000009c in ?? ()
 #9  0x000075ac85e0a0ee in pthread_sigmask () from /usr/lib/libpthread.so.1
 #10 0x00000001e24e8d8c in do_exec () at src/libstd/sys/unix/process/process_unix.rs:230
 #11 0x00000001e24e88c0 in spawn () at src/libstd/sys/unix/process/process_unix.rs:50
 #12 0x00000001e24da4ea in spawn () at src/libstd/process.rs:742
 #13 0x00000001e1f01d97 in cargo::util::process_builder::ProcessBuilder::exec_with_streaming ()
 #14 0x00000001e2122685 in <cargo::core::compiler::DefaultExecutor as cargo::core::compiler::Executor>::exec_and_capture_output ()
 #15 0x00000001e212019f in <F as cargo::core::compiler::job::FnBox<A,R>>::call_box ()
 #16 0x00000001e1f8946f in <F as cargo::core::compiler::job::FnBox<A,R>>::call_box ()
 #17 0x00000001e1f8946f in <F as cargo::core::compiler::job::FnBox<A,R>>::call_box ()
 #18 0x00000001e1f8953c in cargo::core::compiler::job::Job::run ()
 #19 0x00000001e1daf3bd in crossbeam_utils::thread::ScopedThreadBuilder::spawn::{{closure}} ()
 #20 0x00000001e1e283bb in std::sys_common::backtrace::__rust_begin_short_backtrace ()
 #21 0x00000001e24ea82a in __rust_maybe_catch_panic () at src/libpanic_unwind/lib.rs:87
 #22 0x00000001e1ded2a9 in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
 #23 0x00000001e24cd25f in call_once<(),FnBox<()>> () at /rustc/3c235d5600393dfe6c36eeed34042efad8d4f26e/src/liballoc/boxed.rs:702
 #24 0x00000001e24e9a50 in call_once<(),alloc::boxed::Box<FnBox<()>>> () at /rustc/3c235d5600393dfe6c36eeed34042efad8d4f26e/src/liballoc/boxed.rs:702
 #25 start_thread () at src/libstd/sys_common/thread.rs:14
 #26 thread_start () at src/libstd/sys/unix/thread.rs:80
 #27 0x000075ac85e0c1e8 in ?? () from /usr/lib/libpthread.so.1
 #28 0x000075ac852901d0 in ?? () from /usr/lib/libc.so.12
 #29 0x0000000000000000 in ?? ()
 A debugging session is active.

 	Inferior 1 [process 5283] will be detached.

 Quit anyway? (y or n) [answered Y; input not from terminal]
 [Inferior 1 (process 5283) detached]
 [New LWP 2 of process 1551]
 [New LWP 1 of process 1551]
 [Switching to LWP 3 of process 1551]
 0x000075ac85242baa in read () from /usr/lib/libc.so.12

 Thread 3 (LWP 1 of process 1551):
 #0  0x000075ac852afb2a in ___lwp_park60 () from /usr/lib/libc.so.12
 #1  0x000075ac85e0a84e in pthread_cond_timedwait () from /usr/lib/libpthread.so.1
 #2  0x00000001e24cdd34 in wait () at src/libstd/sys/unix/condvar.rs:69
 #3  wait () at src/libstd/sys_common/condvar.rs:41
 #4  wait<()> () at src/libstd/sync/condvar.rs:204
 #5  park () at src/libstd/thread/mod.rs:909
 #6  0x00000001e24db5a2 in wait () at src/libstd/sync/mpsc/blocking.rs:71
 #7  0x00000001e1db2599 in std::sync::mpsc::shared::Packet<T>::recv ()
 #8  0x00000001e1fc7d97 in cargo::core::compiler::job_queue::JobQueue::drain_the_queue ()
 #9  0x00000001e207e805 in std::panicking::try::do_call ()
 #10 0x00000001e24ea82a in __rust_maybe_catch_panic () at src/libpanic_unwind/lib.rs:87
 #11 0x00000001e1dafd25 in crossbeam_utils::thread::scope ()
 #12 0x00000001e1fc662e in cargo::core::compiler::job_queue::JobQueue::execute ()
 #13 0x00000001e2037207 in cargo::core::compiler::context::Context::compile ()
 #14 0x00000001e1e08a51 in cargo::ops::cargo_compile::compile_ws ()
 #15 0x00000001e1e04aa9 in cargo::ops::cargo_compile::compile ()
 #16 0x00000001e1d96835 in cargo::commands::build::exec ()
 #17 0x00000001e1d53550 in cargo::cli::main ()
 #18 0x00000001e1d80c10 in cargo::main ()
 #19 0x00000001e1d77203 in std::rt::lang_start::{{closure}} ()
 #20 0x00000001e24dfd53 in {{closure}} () at src/libstd/rt.rs:49
 #21 do_call<closure,i32> () at src/libstd/panicking.rs:293
 #22 0x00000001e24ea82a in __rust_maybe_catch_panic () at src/libpanic_unwind/lib.rs:87
 #23 0x00000001e24e08e0 in try<i32,closure> () at src/libstd/panicking.rs:272
 #24 catch_unwind<closure,i32> () at src/libstd/panic.rs:388
 #25 lang_start_internal () at src/libstd/rt.rs:48
 #26 0x00000001e1d83132 in main ()

 Thread 2 (LWP 2 of process 1551):
 #0  0x000075ac852afb2a in ___lwp_park60 () from /usr/lib/libc.so.12
 #1  0x000075ac85e0a84e in pthread_cond_timedwait () from /usr/lib/libpthread.so.1
 #2  0x00000001e24cdd34 in wait () at src/libstd/sys/unix/condvar.rs:69
 #3  wait () at src/libstd/sys_common/condvar.rs:41
 #4  wait<()> () at src/libstd/sync/condvar.rs:204
 #5  park () at src/libstd/thread/mod.rs:909
 #6  0x00000001e24db5a2 in wait () at src/libstd/sync/mpsc/blocking.rs:71
 #7  0x00000001e24b0a26 in std::sync::mpsc::stream::Packet<T>::recv ()
 #8  0x00000001e24ad5c9 in std::sync::mpsc::Receiver<T>::recv ()
 #9  0x00000001e24b29da in std::sys_common::backtrace::__rust_begin_short_backtrace ()
 #10 0x00000001e24b16ee in std::panicking::try::do_call ()
 #11 0x00000001e24ea82a in __rust_maybe_catch_panic () at src/libpanic_unwind/lib.rs:87
 #12 0x00000001e24b1c14 in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
 #13 0x00000001e24cd25f in call_once<(),FnBox<()>> () at /rustc/3c235d5600393dfe6c36eeed34042efad8d4f26e/src/liballoc/boxed.rs:702
 #14 0x00000001e24e9a50 in call_once<(),alloc::boxed::Box<FnBox<()>>> () at /rustc/3c235d5600393dfe6c36eeed34042efad8d4f26e/src/liballoc/boxed.rs:702
 #15 start_thread () at src/libstd/sys_common/thread.rs:14
 #16 thread_start () at src/libstd/sys/unix/thread.rs:80
 #17 0x000075ac85e0c1e8 in ?? () from /usr/lib/libpthread.so.1
 #18 0x000075ac852901d0 in ?? () from /usr/lib/libc.so.12
 #19 0x0000000000400000 in ?? ()
 #20 0x000075ac84c00000 in ?? ()
 #21 0x0000001003a0efff in ?? ()
 #22 0x000075ac84a000c0 in ?? ()
 #23 0x00000000001fff40 in ?? ()
 #24 0x0000000000000000 in ?? ()

 Thread 1 (LWP 3 of process 1551):
 #0  0x000075ac85242baa in read () from /usr/lib/libc.so.12
 #1  0x000075ac85e07f1f in read () from /usr/lib/libpthread.so.1
 #2  0x00000001e24e85ba in read () at src/libstd/sys/unix/fd.rs:49
 #3  read () at src/libstd/sys/unix/pipe.rs:60
 #4  spawn () at src/libstd/sys/unix/process/process_unix.rs:76
 #5  0x00000001e24da4ea in spawn () at src/libstd/process.rs:742
 #6  0x00000001e1f01d97 in cargo::util::process_builder::ProcessBuilder::exec_with_streaming ()
 #7  0x00000001e2122685 in <cargo::core::compiler::DefaultExecutor as cargo::core::compiler::Executor>::exec_and_capture_output ()
 #8  0x00000001e212019f in <F as cargo::core::compiler::job::FnBox<A,R>>::call_box ()
 #9  0x00000001e1f8946f in <F as cargo::core::compiler::job::FnBox<A,R>>::call_box ()
 #10 0x00000001e1f8946f in <F as cargo::core::compiler::job::FnBox<A,R>>::call_box ()
 #11 0x00000001e1f8953c in cargo::core::compiler::job::Job::run ()
 #12 0x00000001e1daf3bd in crossbeam_utils::thread::ScopedThreadBuilder::spawn::{{closure}} ()
 #13 0x00000001e1e283bb in std::sys_common::backtrace::__rust_begin_short_backtrace ()
 #14 0x00000001e24ea82a in __rust_maybe_catch_panic () at src/libpanic_unwind/lib.rs:87
 #15 0x00000001e1ded2a9 in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
 #16 0x00000001e24cd25f in call_once<(),FnBox<()>> () at /rustc/3c235d5600393dfe6c36eeed34042efad8d4f26e/src/liballoc/boxed.rs:702
 #17 0x00000001e24e9a50 in call_once<(),alloc::boxed::Box<FnBox<()>>> () at /rustc/3c235d5600393dfe6c36eeed34042efad8d4f26e/src/liballoc/boxed.rs:702
 #18 start_thread () at src/libstd/sys_common/thread.rs:14
 #19 thread_start () at src/libstd/sys/unix/thread.rs:80
 #20 0x000075ac85e0c1e8 in ?? () from /usr/lib/libpthread.so.1
 #21 0x000075ac852901d0 in ?? () from /usr/lib/libc.so.12
 #22 0x0000000000000000 in ?? ()
 A debugging session is active.

 	Inferior 1 [process 1551] will be detached.

 Quit anyway? (y or n) [answered Y; input not from terminal]
 [Inferior 1 (process 1551) detached]

From: Kamil Rytarowski <n54@gmx.com>
To: Martin Husemann <martin@duskware.de>, Joerg Sonnenberger <joerg@bec.de>
Cc: gnats-bugs@netbsd.org, toolchain-manager@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, ben@pocket.services
Subject: Re: pkg/54192: lang/rust build error
Date: Wed, 16 Oct 2019 20:38:49 +0200

 On 16.10.2019 17:25, Joerg Sonnenberger wrote:
  >
  >   So our per-process LWP identifier should behave like global identifier
  >   from FreeBSD/Linux, because? They have little in common and the existing
  >   behavior is invalid even under the FreeBSD/Linux semantic.

 This was just an example that the thread id is not preserved across forks.

 It's the same in NetBSD (though it can sometimes be preserved by accident).


 On 16.10.2019 17:29, Martin Husemann wrote:
 > On Wed, Oct 16, 2019 at 05:14:56PM +0200, Joerg Sonnenberger wrote:
 >> On a fundamental level, there are two sane models for a thread
 >> identifier:
 >> (1) It is globally unique.
 >> (2) It is process local and invariant.
 >
 > Agreed so far, but the question is whether "invariant" holds across fork().
 >

 After fork(2) we have a new process entity and the LWP ID is local to
 the new thread.

 > Pragmatically we should look for the easiest solution.
 >

 However... this behavior proposed by Joerg is still conformant with POSIX.

 "A process shall be created with a single thread. If a multi-threaded
 process calls fork(), the new process shall contain a replica of the
 calling thread and its entire address space, possibly including the
 states of mutexes and other resources. [...]
 "

 https://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html

 So, let's go with it.

 > Martin
 >
 >

From: Joerg Sonnenberger <joerg@bec.de>
To: gnats-bugs@netbsd.org
Cc: toolchain-manager@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, ben@pocket.services
Subject: Re: pkg/54192: lang/rust build error
Date: Wed, 16 Oct 2019 22:01:10 +0200

 On Wed, Oct 16, 2019 at 03:50:01PM +0000, coypu@sdf.org wrote:
 > The following reply was made to PR toolchain/54192; it has been noted by GNATS.
 > 
 > From: coypu@sdf.org
 > To: gnats-bugs@netbsd.org
 > Cc: 
 > Subject: Re: pkg/54192: lang/rust build error
 > Date: Wed, 16 Oct 2019 15:48:08 +0000
 > 
 >  I ran the patch and had it continuously build cbindgen. It eventually
 > got stuck.

 It can still race against exclusive locks in other threads and there is
 no safe way to prevent that short of intercepting fork() in rtld, which
 creates its own set of problems. But that doesn't seem to be the problem
 here.

 >  $ for cargopid in `pgrep cargo`; do gdb -q -p $cargopid -ex "thread apply all bt" -ex q -batch; done
 >  [Switching to LWP 3 of process 5283]
 >  0x00007f7e2300c3ba in ___lwp_park60 () from /usr/libexec/ld.elf_so
 >  
 >  Thread 1 (LWP 3 of process 5283):

 What's the parent and what's the child here? Because as a stack trace of
 the child this looks unholy.

 Joerg
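
For comparison, the generic POSIX way for an application to keep one of its own locks consistent across fork() is pthread_atfork(). The sketch below shows that pattern for a hypothetical application lock (state_lock); it is not something rtld does, and the message above notes that intercepting fork() inside rtld would bring its own problems. All function names here are made up for the example.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical application-owned lock that other threads may hold. */
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;

/* Take the lock before fork() so no other thread holds it at the
 * moment the address space is cloned, then release it on both sides. */
static void before_fork(void)       { pthread_mutex_lock(&state_lock); }
static void after_fork_parent(void) { pthread_mutex_unlock(&state_lock); }
static void after_fork_child(void)  { pthread_mutex_unlock(&state_lock); }

/* Register the handlers once, early. Returns 0 on success. */
int install_fork_handlers(void)
{
	return pthread_atfork(before_fork, after_fork_parent, after_fork_child);
}

/* Fork, let the child take and release the lock, and return the child's
 * exit status (0 means the lock was usable there), or -1 on error. */
int fork_and_use_lock(void)
{
	int status;
	pid_t pid = fork();

	if (pid == -1)
		return -1;
	if (pid == 0) {
		if (pthread_mutex_lock(&state_lock) != 0)
			_exit(1);
		pthread_mutex_unlock(&state_lock);
		_exit(0);
	}
	if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

The cost of this approach is that every fork() serializes against all users of the lock, which is one reason it is not free to apply to a lock taken on every lazy symbol binding.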

From: coypu@sdf.org
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: pkg/54192: lang/rust build error
Date: Wed, 16 Oct 2019 21:46:28 +0000

 Thread 1 (LWP 3 of process 5283):

 This is the parent.

From: Joerg Sonnenberger <joerg@bec.de>
To: gnats-bugs@netbsd.org
Cc: toolchain-manager@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, ben@pocket.services
Subject: Re: pkg/54192: lang/rust build error
Date: Thu, 17 Oct 2019 21:37:56 +0200

 On Wed, Oct 16, 2019 at 03:50:01PM +0000, coypu@sdf.org wrote:
 >  I ran the patch and had it continuously build cbindgen. It eventually
 >  got stuck.

 Please get debug symbols for libc at least. The stack trace is at least
 somewhat nonsensical, since jemalloc doesn't directly call dup2 at all.

 Joerg

From: Havard Eidnes <he@NetBSD.org>
To: gnats-bugs@netbsd.org, coypu@sdf.org
Cc: toolchain-manager@netbsd.org, netbsd-bugs@netbsd.org,
 ben@pocket.services
Subject: Re: pkg/54192: lang/rust build error
Date: Fri, 18 Oct 2019 11:08:05 +0200 (CEST)

 >  joerg's patch works pretty well.

 I won't weigh in on whether it is correct or not.

 I can however attest to the above statement: I rebuilt a kernel
 with the "inherit lwp-id from the fork()ing thread" change
 (isn't that what it does?), and started a build loop of
 rust, and it has now completed 18 rounds with no detected issues.

 Regards,

 - Håvard

From: Tobias Nygren <tnn@NetBSD.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: pkg/54192: lang/rust build error
Date: Fri, 18 Oct 2019 11:53:34 +0200

 On Fri, 18 Oct 2019 09:10:01 +0000 (UTC)
 Havard Eidnes <he@NetBSD.org> wrote:

 >  I won't weigh in on whether it is correct or not.

 I have no opinion either, except that we need to sort this out in all
 NetBSD release branches so a minimally intrusive initial fix would be
 desirable even if -current ends up with a different fix.

From: coypu@sdf.org
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: pkg/54192: lang/rust build error
Date: Fri, 18 Oct 2019 11:50:48 +0000

 I attempted to build the related libraries with more debug information.

 $ for cargopid in `pgrep cargo`; do gdb -q -p $cargopid -ex "thread apply all bt" -ex q -batch; done
 [Switching to LWP 3 of process 7409]
 0x00007f7f5520808a in ___lwp_park60 () from /usr/libexec/ld.elf_so

 Thread 1 (LWP 3 of process 7409):		<------------------ CHILD OF 8345
 #0  0x00007f7f5520808a in ___lwp_park60 () from /usr/libexec/ld.elf_so
 #1  0x00007f7f55201463 in _rtld_shared_enter () at rtld.c:1687
 #2  0x00007f7f55200be5 in _rtld_bind (obj=0x7f1ab8434000, reloff=<optimized out>) at /cvs/src/libexec/ld.elf_so/arch/x86_64/mdreloc.c:359
 #3  0x00007f7f5520088d in _rtld_bind_start () from /usr/libexec/ld.elf_so
 #4  0x0000000000000206 in ?? ()
 #5  0x00007f1ab669242a in dup2 () from /usr/lib/libc.so.12
 #6  0x00007f1ab69f82a0 in ?? () from /usr/lib/libc.so.12
 #7  0x0000000000000000 in ?? ()


 ------------------------------------------------------------------- PARENT BACKTRACES
 [New LWP 2 of process 8345]
 [New LWP 1 of process 8345]
 [Switching to LWP 3 of process 8345]
 0x00007f1ab6642bea in read () from /usr/lib/libc.so.12

 Thread 3 (LWP 1 of process 8345):
 #0  0x00007f1ab66b39ba in ___lwp_park60 () from /usr/lib/libc.so.12
 #1  0x00007f1ab720adfb in pthread_cond_timedwait (cond=0x7f1ab8406030, mutex=0x7f1ab8406000, abstime=abstime@entry=0x0) at pthread_cond.c:169
 #2  0x00007f1ab720ae8f in pthread_cond_wait (cond=<optimized out>, mutex=<optimized out>) at pthread_cond.c:218
 #3  0x00000000469bb3b3 in std::thread::park ()
 #4  0x00000000469b4942 in std::sync::mpsc::blocking::WaitToken::wait ()
 #5  0x00000000465110ce in std::sync::mpsc::shared::Packet<T>::recv ()
 #6  0x000000004633ed64 in std::sync::mpsc::Receiver<T>::recv ()
 #7  0x00000000465a0436 in cargo::core::compiler::job_queue::JobQueue::drain_the_queue ()
 #8  0x00000000465857a5 in std::panicking::try::do_call ()
 #9  0x00000000469d229a in __rust_maybe_catch_panic ()
 #10 0x00000000462ebfb0 in crossbeam_utils::thread::scope ()
 #11 0x000000004659e6d2 in cargo::core::compiler::job_queue::JobQueue::execute ()
 #12 0x00000000466681c9 in cargo::core::compiler::context::Context::compile ()
 #13 0x00000000463aedfb in cargo::ops::cargo_compile::compile_ws ()
 #14 0x00000000463ad927 in cargo::ops::cargo_compile::compile ()
 #15 0x0000000046276c24 in cargo::commands::build::exec ()
 #16 0x000000004628a5fc in cargo::cli::main ()
 #17 0x00000000462d42fb in cargo::main ()
 #18 0x00000000462ae2c3 in std::rt::lang_start::{{closure}} ()
 #19 0x00000000469cbf73 in std::panicking::try::do_call ()
 #20 0x00000000469d229a in __rust_maybe_catch_panic ()
 #21 0x00000000469b584b in std::rt::lang_start_internal ()
 #22 0x00000000462d6632 in main ()

 Thread 2 (LWP 2 of process 8345):
 #0  0x00007f1ab66b39ba in ___lwp_park60 () from /usr/lib/libc.so.12
 #1  0x00007f1ab720adfb in pthread_cond_timedwait (cond=0x7f1ab7998030, mutex=0x7f1ab7998000, abstime=abstime@entry=0x0) at pthread_cond.c:169
 #2  0x00007f1ab720ae8f in pthread_cond_wait (cond=<optimized out>, mutex=<optimized out>) at pthread_cond.c:218
 #3  0x00000000469bb3b3 in std::thread::park ()
 #4  0x00000000469b4942 in std::sync::mpsc::blocking::WaitToken::wait ()
 #5  0x00000000469961e9 in std::sync::mpsc::stream::Packet<T>::recv ()
 #6  0x0000000046990d59 in std::sync::mpsc::Receiver<T>::recv ()
 #7  0x000000004699564a in std::sys_common::backtrace::__rust_begin_short_backtrace ()
 #8  0x00000000469942de in std::panicking::try::do_call ()
 #9  0x00000000469d229a in __rust_maybe_catch_panic ()
 #10 0x0000000046994914 in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
 #11 0x00000000469c949f in <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once ()
 #12 0x00000000469b5110 in std::sys_common::thread::start_thread ()
 #13 0x00000000469ca9d6 in std::sys::unix::thread::Thread::new::thread_start ()
 #14 0x00007f1ab720c119 in pthread__create_tramp (cookie=0x7f1ab83a7c00) at pthread.c:593
 #15 0x00007f1ab66921c0 in ?? () from /usr/lib/libc.so.12
 #16 0x0000000000400000 in ?? ()
 #17 0x00007f1ab5c00000 in ?? ()
 #18 0x0000001003a0efff in ?? ()
 #19 0x00007f1ab5a000c0 in ?? ()
 #20 0x00000000001fff40 in ?? ()
 #21 0x0000000000000000 in ?? ()

 Thread 1 (LWP 3 of process 8345):
 #0  0x00007f1ab6642bea in read () from /usr/lib/libc.so.12
 #1  0x00007f1ab72080ef in read (d=<optimized out>, buf=<optimized out>, nbytes=<optimized out>) at pthread_cancelstub.c:485
 #2  0x00000000469c6b4d in std::sys::unix::process::process_inner::<impl std::sys::unix::process::process_common::Command>::spawn ()
 #3  0x00000000469cf92b in std::process::Command::spawn ()
 #4  0x00000000464fd6d4 in cargo::util::process_builder::ProcessBuilder::exec_with_streaming ()
 #5  0x00000000465a5eaa in <cargo::core::compiler::DefaultExecutor as cargo::core::compiler::Executor>::exec ()
 #6  0x000000004659b359 in <F as cargo::core::compiler::job::FnBox<A,R>>::call_box ()
 #7  0x000000004659d04f in <F as cargo::core::compiler::job::FnBox<A,R>>::call_box ()
 #8  0x000000004659d04f in <F as cargo::core::compiler::job::FnBox<A,R>>::call_box ()
 #9  0x00000000462eb6e0 in crossbeam_utils::thread::ScopedThreadBuilder::spawn::{{closure}} ()
 #10 0x000000004633cf9b in std::sys_common::backtrace::__rust_begin_short_backtrace ()
 #11 0x00000000469d229a in __rust_maybe_catch_panic ()
 #12 0x0000000046405769 in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
 #13 0x00000000469c949f in <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once ()
 #14 0x00000000469b5110 in std::sys_common::thread::start_thread ()
 #15 0x00000000469ca9d6 in std::sys::unix::thread::Thread::new::thread_start ()
 #16 0x00007f1ab720c119 in pthread__create_tramp (cookie=0x7f1ab83a9800) at pthread.c:593
 #17 0x00007f1ab66921c0 in ?? () from /usr/lib/libc.so.12
 #18 0x0000000000000000 in ?? ()

From: Mike Pumford <mpumford@mudcovered.org.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: pkg/54192: lang/rust build error
Date: Sun, 20 Oct 2019 16:27:21 +0100

 I have a 9.0-BETA system that reliably hits this almost 100% at the 
 moment. I tried Joerg's kernel patch but it actually didn't help.

 Instead of 'deadlock detected' what I actually got was a real deadlock 
 (at exactly the place where the deadlock would have been detected 
 without the patch).

 My gdb stack backtrace indicated that the code was stuck in 
 _rtld_shared_enter(), the same as in those already posted.

 Mike

From: Thomas Klausner <tk@giga.or.at>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: pkg/54192: lang/rust build error
Date: Sun, 20 Oct 2019 19:34:39 +0200

 On Sun, Oct 20, 2019 at 03:30:02PM +0000, Mike Pumford wrote:
 > The following reply was made to PR toolchain/54192; it has been noted by GNATS.
 > 
 > From: Mike Pumford <mpumford@mudcovered.org.uk>
 > To: gnats-bugs@netbsd.org
 > Cc: 
 > Subject: Re: pkg/54192: lang/rust build error
 > Date: Sun, 20 Oct 2019 16:27:21 +0100
 > 
 >  I have a 9.0-BETA system that reliably hits this almost 100% at the 
 >  moment. I tried Joerg's kernel patch but it actually didn't help.
 >  
 >  Instead of 'deadlock detected' what I actually got was a real deadlock 
 >  (at exactly the place where the deadlock would have been detected 
 >  without the patch).
 >  
 >  My gdb stack backtrace indicated that the code was stuck in 
 >  _rtld_shared_enter(), the same as in those already posted.

 I'm running a -current kernel with the patch, and I had to restart the
 rust build a couple of times until it worked. So perhaps there is a
 second problem with similar symptoms, or it's just not a real fix.
  Thomas

State-Changed-From-To: open->closed
State-Changed-By: maya@NetBSD.org
State-Changed-When: Fri, 22 May 2020 23:10:16 +0000
State-Changed-Why:
There is a workaround for existing NetBSD releases (forcing -j1, cargo defaults to -jNCPU).
Joerg just disabled the workaround for NetBSD-current, and tested it himself.
I don't expect this to be appropriate to pull up in its entirety.


>Unformatted:
