NetBSD Problem Report #49816

From yamt@NetBSD.org  Mon Apr  6 09:16:32 2015
Return-Path: <yamt@NetBSD.org>
Received: by mollari.NetBSD.org (Postfix, from userid 1270)
	id E68FBA654B; Mon,  6 Apr 2015 09:16:32 +0000 (UTC)
Message-Id: <20150406091632.E68FBA654B@mollari.NetBSD.org>
Date: Mon,  6 Apr 2015 09:16:32 +0000 (UTC)
From: yamt@NetBSD.org
Reply-To: yamt@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: rtld internal lock vs fork
X-Send-Pr-Version: 3.95

>Number:         49816
>Category:       lib
>Synopsis:       rtld internal lock vs fork
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    lib-bug-people
>State:          feedback
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Apr 06 09:20:00 +0000 2015
>Closed-Date:    
>Last-Modified:  Mon Jul 04 16:17:42 +0000 2022
>Originator:     YAMAMOTO Takashi
>Release:        NetBSD current
>Organization:

>Environment:


>Description:
	when a thread does fork(2), some other thread might hold _rtld_mutex.
	in that case, the child process will likely deadlock soon because
	of non-zero _rtld_mutex.  i've observed the problem with open vswitch.
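
The failure mode described above can be illustrated outside rtld. The
following is a minimal sketch in C, not code from ld.elf_so: lock_word is a
hypothetical stand-in for _rtld_mutex, and holder(), holder_ready,
release_holder and child_sees_stale_lock() are names invented for this
illustration. A second thread marks the lock word as held, the first thread
calls fork(2), and the child reports whether its copy of the word is still
non-zero. Only the forking thread exists in the child, so nothing there
would ever clear it.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for _rtld_mutex: non-zero means "held". */
static atomic_uint lock_word;
static atomic_int holder_ready;
static atomic_int release_holder;

/* Second thread: take the lock and keep it until told to let go. */
static void *
holder(void *arg)
{
	(void)arg;
	atomic_store(&lock_word, 1);
	atomic_store(&holder_ready, 1);
	while (!atomic_load(&release_holder))
		sched_yield();
	atomic_store(&lock_word, 0);
	return NULL;
}

/*
 * Returns 1 if the forked child saw the lock word still held,
 * 0 if it saw it free, and -1 on error.
 */
int
child_sees_stale_lock(void)
{
	pthread_t t;
	pid_t pid;
	int rc, status = 0, result = -1;

	atomic_store(&lock_word, 0);
	atomic_store(&holder_ready, 0);
	atomic_store(&release_holder, 0);

	rc = pthread_create(&t, NULL, holder, NULL);
	assert(rc == 0);
	while (!atomic_load(&holder_ready))
		sched_yield();

	pid = fork();
	if (pid == 0) {
		/* Child: the holder thread was not duplicated by fork(). */
		_exit(atomic_load(&lock_word) != 0 ? 1 : 0);
	}
	if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
		result = WEXITSTATUS(status);

	/* Parent: let the holder finish so the process can exit cleanly. */
	atomic_store(&release_holder, 1);
	pthread_join(t, NULL);
	return result;
}
```

A child that went on to wait for lock_word to become zero would wait
forever, which is the shape of the lockup reported here under the stated
assumptions.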
>How-To-Repeat:
	configure OVS master with --enable-shared and run "gmake -j32 check".
>Fix:


>Release-Note:

>Audit-Trail:
From: "YAMAMOTO Takashi" <yamt@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/49816 CVS commit: src/libexec/ld.elf_so
Date: Mon, 6 Apr 2015 09:34:15 +0000

 Module Name:	src
 Committed By:	yamt
 Date:		Mon Apr  6 09:34:15 UTC 2015

 Modified Files:
 	src/libexec/ld.elf_so: rtld.c

 Log Message:
 Fix membars around rtld internal mutex.

 This fixes most of the lockups i observed with Open vSwitch
 on NetBSD/amd64.  ("most of" because it still occasionally
 locks up because of other problems.  see PR/49816)


 To generate a diff of this commit:
 cvs rdiff -u -r1.176 -r1.177 src/libexec/ld.elf_so/rtld.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: lib/49816: rtld internal lock vs fork
Date: Mon, 6 Apr 2015 11:57:09 +0200

 Out of curiosity: which async-signal-safe functions is the child calling
 that involve ld.elf_so internal actions at that stage?

 Martin

From: yamt@netbsd.org (YAMAMOTO Takashi)
To: gnats-bugs@NetBSD.org
Cc: lib-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
	yamt@NetBSD.org
Subject: Re: lib/49816: rtld internal lock vs fork
Date: Mon,  6 Apr 2015 11:27:22 +0000 (UTC)

 >  Out of curiosity: which async-signal-safe functions is the child calling
 >  that involve ld.elf_so internal actions at that stage?
 >  
 >  Martin

 it seems that what's stuck in the child process is
 an ordinary _rtld_bind_start.

 YAMAMOTO Takashi

From: Joerg Sonnenberger <joerg@britannica.bec.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: PR/49816 CVS commit: src/libexec/ld.elf_so
Date: Mon, 6 Apr 2015 13:33:30 +0200

 On Mon, Apr 06, 2015 at 09:35:00AM +0000, YAMAMOTO Takashi wrote:
 >  Log Message:
 >  Fix membars around rtld internal mutex.
 >  
 >  This fixes most of the lockups i observed with Open vSwitch
 >  on NetBSD/amd64.  ("most of" because it still occasionally
 >  locks up because of other problems.  see PR/49816)

 None of those should matter on amd64? CAS has an implicit total memory
 barrier, so this seems to just add a lot of overhead for no reason.

 Joerg

From: Joerg Sonnenberger <joerg@britannica.bec.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: lib/49816: rtld internal lock vs fork
Date: Mon, 6 Apr 2015 15:50:34 +0200

 On Mon, Apr 06, 2015 at 09:20:00AM +0000, yamt@NetBSD.org wrote:
 > >Description:
 > 	when a thread does fork(2), some other thread might hold _rtld_mutex.
 > 	in that case, the child process will likely deadlock soon because
 > 	of non-zero _rtld_mutex.  i've observed the problem with open vswitch.

 Non-zero _rtld_mutex by itself should not be a problem. The problem exists
 if another thread requires the exclusive lock. I do plan to rewrite rtld
 at some point to never require an exclusive lock for symbol look up, but
 that's far from trivial. In the old world before locking, you would just
 hit race conditions or not.

 Joerg

From: yamt@netbsd.org (YAMAMOTO Takashi)
To: gnats-bugs@NetBSD.org
Cc: lib-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
	yamt@NetBSD.org
Subject: Re: PR/49816 CVS commit: src/libexec/ld.elf_so
Date: Mon,  6 Apr 2015 16:09:29 +0000 (UTC)

 > The following reply was made to PR lib/49816; it has been noted by GNATS.
 > 
 > From: Joerg Sonnenberger <joerg@britannica.bec.de>
 > To: gnats-bugs@NetBSD.org
 > Cc: 
 > Subject: Re: PR/49816 CVS commit: src/libexec/ld.elf_so
 > Date: Mon, 6 Apr 2015 13:33:30 +0200
 > 
 >  On Mon, Apr 06, 2015 at 09:35:00AM +0000, YAMAMOTO Takashi wrote:
 >  >  Log Message:
 >  >  Fix membars around rtld internal mutex.
 >  >  
 >  >  This fixes most of the lockups i observed with Open vSwitch
 >  >  on NetBSD/amd64.  ("most of" because it still occasionally
 >  >  locks up because of other problems.  see PR/49816)
 >  
 >  None of those should matter on amd64? CAS has an implicit total memory
 >  barrier, so this seems to just add a lot of overhead for no reason.

 except that:
 * _rtld_exclusive_exit doesn't use CAS
 * this code is MI

 i agree that something like PTHREAD__ATOMIC_IS_MEMBAR
 would be a nice optimization, though.

 YAMAMOTO Takashi

 >  
 >  Joerg

From: Joerg Sonnenberger <joerg@britannica.bec.de>
To: gnats-bugs@NetBSD.org
Cc: lib-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, yamt@NetBSD.org
Subject: Re: PR/49816 CVS commit: src/libexec/ld.elf_so
Date: Mon, 6 Apr 2015 18:22:21 +0200

 On Mon, Apr 06, 2015 at 04:10:02PM +0000, YAMAMOTO Takashi wrote:
 >  except that:
 >  * _rtld_exclusive_exit doesn't use CAS
 >  * this code is MI
 >  
 >  i agree that something like PTHREAD__ATOMIC_IS_MEMBAR
 >  would be a nice optimization, though.

 So which platform are you worried about that doesn't have TSO and
 doesn't have implicit membars for CAS? I'm asking because the only reason
 those changes should help your problem is if they massively penalize the
 operation.

 Joerg

From: yamt@netbsd.org (YAMAMOTO Takashi)
To: gnats-bugs@NetBSD.org
Cc: lib-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
	yamt@NetBSD.org
Subject: Re: lib/49816: rtld internal lock vs fork
Date: Mon,  6 Apr 2015 16:23:38 +0000 (UTC)

 >  > 	when a thread does fork(2), some other thread might hold _rtld_mutex.
 >  > 	in that case, the child process will likely deadlock soon because
 >  > 	of non-zero _rtld_mutex.  i've observed the problem with open vswitch.
 >  
 >  Non-zero _rtld_mutex by itself should not be a problem. The problem exists
 >  if another thread requires the exclusive lock. I do plan to rewrite rtld
 >  at some point to never require an exclusive lock for symbol look up, but
 >  that's far from trivial. In the old world before locking, you would just
 >  hit race conditions or not.

 i agree on all points.

 and, at least in my case, "another thread requires the exclusive lock"
 actually happens.

 YAMAMOTO Takashi

From: yamt@netbsd.org (YAMAMOTO Takashi)
To: gnats-bugs@NetBSD.org
Cc: lib-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
	yamt@NetBSD.org
Subject: Re: PR/49816 CVS commit: src/libexec/ld.elf_so
Date: Mon,  6 Apr 2015 16:41:01 +0000 (UTC)

 > The following reply was made to PR lib/49816; it has been noted by GNATS.
 > 
 > From: Joerg Sonnenberger <joerg@britannica.bec.de>
 > To: gnats-bugs@NetBSD.org
 > Cc: lib-bug-people@netbsd.org, gnats-admin@netbsd.org,
 > 	netbsd-bugs@netbsd.org, yamt@NetBSD.org
 > Subject: Re: PR/49816 CVS commit: src/libexec/ld.elf_so
 > Date: Mon, 6 Apr 2015 18:22:21 +0200
 > 
 >  On Mon, Apr 06, 2015 at 04:10:02PM +0000, YAMAMOTO Takashi wrote:
 >  >  except that:
 >  >  * _rtld_exclusive_exit doesn't use CAS
 >  >  * this code is MI
 >  >  
 >  >  i agree that something like PTHREAD__ATOMIC_IS_MEMBAR
 >  >  would be a nice optimization, though.
 >  
 >  So which platform are you worried about that doesn't have TSO and
 >  doesn't have implicit membars for CAS? I'm asking because the only reason
 >  those changes should help your problem is if they massively penalize the
 >  operation.

 well, can you explain why _rtld_exclusive_exit is safe
 without cas or barrier?

 YAMAMOTO Takashi

 >  
 >  Joerg

From: Joerg Sonnenberger <joerg@britannica.bec.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: PR/49816 CVS commit: src/libexec/ld.elf_so
Date: Mon, 6 Apr 2015 20:16:22 +0200

 On Mon, Apr 06, 2015 at 04:45:01PM +0000, YAMAMOTO Takashi wrote:
 > The following reply was made to PR lib/49816; it has been noted by GNATS.
 > 
 > From: yamt@netbsd.org (YAMAMOTO Takashi)
 > To: gnats-bugs@NetBSD.org
 > Cc: lib-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
 > 	yamt@NetBSD.org
 > Subject: Re: PR/49816 CVS commit: src/libexec/ld.elf_so
 > Date: Mon,  6 Apr 2015 16:41:01 +0000 (UTC)
 > 
 >  >  On Mon, Apr 06, 2015 at 04:10:02PM +0000, YAMAMOTO Takashi wrote:
 >  >  >  except that:
 >  >  >  * _rtld_exclusive_exit doesn't use CAS
 >  >  >  * this code is MI
 >  >  >  
 >  >  >  i agree that something like PTHREAD__ATOMIC_IS_MEMBAR
 >  >  >  would be a nice optimization, though.
 >  >  
 >  >  So which platform are you worried about that doesn't have TSO and
 >  >  doesn't have implicit membars for CAS? I'm asking because the only reason
 >  >  those changes should help your problem is if they massively penalize the
 >  >  operation.
 >  
 >  well, can you explain why _rtld_exclusive_exit is safe
 >  without cas or barrier?

 All sane MP platforms at least implement Total Store Ordering. So all
 unrelated stores are visible no later than the reset of the mutex.
 That's why I am surprised that it changes anything at all for you.

 Joerg

From: yamt@netbsd.org (YAMAMOTO Takashi)
To: gnats-bugs@NetBSD.org
Cc: lib-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
	yamt@NetBSD.org
Subject: Re: PR/49816 CVS commit: src/libexec/ld.elf_so
Date: Tue,  7 Apr 2015 01:51:39 +0000 (UTC)

 > The following reply was made to PR lib/49816; it has been noted by GNATS.
 > 
 > From: Joerg Sonnenberger <joerg@britannica.bec.de>
 > To: gnats-bugs@NetBSD.org
 > Cc: 
 > Subject: Re: PR/49816 CVS commit: src/libexec/ld.elf_so
 > Date: Mon, 6 Apr 2015 20:16:22 +0200
 > 
 >  On Mon, Apr 06, 2015 at 04:45:01PM +0000, YAMAMOTO Takashi wrote:
 >  >  >  On Mon, Apr 06, 2015 at 04:10:02PM +0000, YAMAMOTO Takashi wrote:
 >  >  >  >  except that:
 >  >  >  >  * _rtld_exclusive_exit doesn't use CAS
 >  >  >  >  * this code is MI
 >  >  >  >  
 >  >  >  >  i agree that something like PTHREAD__ATOMIC_IS_MEMBAR
 >  >  >  >  would be a nice optimization, though.
 >  >  >  
 >  >  >  So which platform are you worried about that doesn't have TSO and
 >  >  >  doesn't have implicit membars for CAS? I'm asking because the only reason
 >  >  >  those changes should help your problem is if they massively penalize the
 >  >  >  operation.
 >  >  
 >  >  well, can you explain why _rtld_exclusive_exit is safe
 >  >  without cas or barrier?
 >  
 >  All sane MP platforms at least implement Total Store Ordering. So all
 >  unrelated stores are visible no later than the reset of the mutex.
 >  That's why I am surprised that it changes anything at all for you.
 >  
 >  Joerg

 the Intel manual says (section 8.2.2):

     Reads may be reordered with older writes to different locations
     but not with older writes to the same location.

 see also 8.2.3.4 Example 8-3.

 isn't it the case for _rtld_exclusive_exit?

 YAMAMOTO Takashi
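
The pattern that the quoted sentence describes (each CPU writes one location
and then reads a different one) can be probed with a small litmus test. This
is a minimal sketch under stated assumptions, unrelated to the rtld sources:
x, y, go, worker0(), worker1() and count_both_zero() are names invented for
the illustration. It counts how often both threads read 0, an outcome that is
only possible if a read was satisfied before the older write to the other
location became visible. Whether a non-zero count is ever observed depends on
the hardware, the compiler and timing, so no particular result is claimed.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

static atomic_int x, y, go;
static int r1, r2;	/* written by the workers, read after pthread_join() */

/* Hold both workers back until the main thread releases them. */
static void
wait_for_go(void)
{
	while (!atomic_load_explicit(&go, memory_order_acquire))
		sched_yield();
}

static void *
worker0(void *arg)
{
	(void)arg;
	wait_for_go();
	atomic_store_explicit(&x, 1, memory_order_relaxed);	/* older write */
	r1 = atomic_load_explicit(&y, memory_order_relaxed);	/* newer read */
	return NULL;
}

static void *
worker1(void *arg)
{
	(void)arg;
	wait_for_go();
	atomic_store_explicit(&y, 1, memory_order_relaxed);	/* older write */
	r2 = atomic_load_explicit(&x, memory_order_relaxed);	/* newer read */
	return NULL;
}

/* Returns how many runs ended with r1 == 0 && r2 == 0, or -1 on error. */
int
count_both_zero(int iterations)
{
	int i, hits = 0;

	for (i = 0; i < iterations; i++) {
		pthread_t a, b;

		atomic_store(&x, 0);
		atomic_store(&y, 0);
		atomic_store(&go, 0);
		if (pthread_create(&a, NULL, worker0, NULL) != 0)
			return -1;
		if (pthread_create(&b, NULL, worker1, NULL) != 0) {
			atomic_store(&go, 1);
			pthread_join(a, NULL);
			return -1;
		}
		atomic_store_explicit(&go, 1, memory_order_release);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		if (r1 == 0 && r2 == 0)
			hits++;
	}
	return hits;
}
```

Relaxed atomics are used on purpose so that neither the compiler nor the CPU
is asked to order the write before the read.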

From: Joerg Sonnenberger <joerg@britannica.bec.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: PR/49816 CVS commit: src/libexec/ld.elf_so
Date: Tue, 7 Apr 2015 13:03:14 +0200

 On Tue, Apr 07, 2015 at 01:55:01AM +0000, YAMAMOTO Takashi wrote:
 >  the Intel manual says (section 8.2.2):
 >  
 >      Reads may be reordered with older writes to different locations
 >      but not with older writes to the same location.
 >  
 >  see also 8.2.3.4 Example 8-3.
 >  
 >  isn't it the case for _rtld_exclusive_exit?

 I don't think that's relevant; this is about global visibility of
 changes. E.g. if another thread tries to get the mutex, it must see all
 changes from the current thread once it has successfully acquired the mutex.

 Joerg

From: Masao Uebayashi <uebayasi@gmail.com>
To: gnats-bugs@netbsd.org
Cc: lib-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, 
	YAMAMOTO Takashi <yamt@netbsd.org>
Subject: Re: PR/49816 CVS commit: src/libexec/ld.elf_so
Date: Wed, 8 Apr 2015 12:42:28 +0900

 >  I don't think that's relevant; this is about global visibility of
 >  changes. E.g. if another thread tries to get the mutex, it must see all
 >  changes from the current thread once it has successfully acquired the mutex.
 >
 >  Joerg

 I can't follow... Your first message was:

 > None of those should matter on amd64? CAS has an implicit total memory
 > barrier, so this seems to just add a lot of overhead for no reason.

 If you're talking about atomic vs. memory barrier, isn't it something
 addressed by the new C11 atomic API?  Kernel mutex code has
 MUTEX_RECEIVE()/MUTEX_GIVE() for that purpose (IIUC).

From: Joerg Sonnenberger <joerg@britannica.bec.de>
To: gnats-bugs@NetBSD.org
Cc: lib-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, yamt@NetBSD.org
Subject: Re: PR/49816 CVS commit: src/libexec/ld.elf_so
Date: Wed, 8 Apr 2015 11:39:06 +0200

 On Wed, Apr 08, 2015 at 03:45:01AM +0000, Masao Uebayashi wrote:
 > The following reply was made to PR lib/49816; it has been noted by GNATS.
 > 
 > From: Masao Uebayashi <uebayasi@gmail.com>
 > To: gnats-bugs@netbsd.org
 > Cc: lib-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, 
 > 	YAMAMOTO Takashi <yamt@netbsd.org>
 > Subject: Re: PR/49816 CVS commit: src/libexec/ld.elf_so
 > Date: Wed, 8 Apr 2015 12:42:28 +0900
 > 
 >  >  I don't think that's relevant; this is about global visibility of
 >  >  changes. E.g. if another thread tries to get the mutex, it must see all
 >  >  changes from the current thread once it has successfully acquired the mutex.
 >  >
 >  >  Joerg
 >  
 >  I can't follow... Your first message was:

 The diff committed by yamt-san should not make any difference on
 architectures with Total Store Ordering, including x86, which he says he
 is running.

 >  > None of those should matter on amd64? CAS has an implicit total memory
 >  > barrier, so this seems to just add a lot of overhead for no reason.
 >  
 >  If you're talking about atomic vs. memory barrier, isn't it something
 >  addressed by the new C11 atomic API?  Kernel mutex code has
 >  MUTEX_RECEIVE()/MUTEX_GIVE() for that purpose (IIUC).

 Yes, except we don't have it available. I'm not sure if there even is
 support for it in gcc-4.8. For Clang, it would be just a question of
 installing stdatomic.h somewhere.

 Joerg

From: yamt@netbsd.org (YAMAMOTO Takashi)
To: gnats-bugs@NetBSD.org
Cc: lib-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
	yamt@NetBSD.org
Subject: Re: PR/49816 CVS commit: src/libexec/ld.elf_so
Date: Wed, 29 Apr 2015 01:45:51 +0000 (UTC)

 > The following reply was made to PR lib/49816; it has been noted by GNATS.
 > 
 > From: Joerg Sonnenberger <joerg@britannica.bec.de>
 > To: gnats-bugs@NetBSD.org
 > Cc: 
 > Subject: Re: PR/49816 CVS commit: src/libexec/ld.elf_so
 > Date: Tue, 7 Apr 2015 13:03:14 +0200
 > 
 >  On Tue, Apr 07, 2015 at 01:55:01AM +0000, YAMAMOTO Takashi wrote:
 >  >  the Intel manual says (section 8.2.2):
 >  >  
 >  >      Reads may be reordered with older writes to different locations
 >  >      but not with older writes to the same location.
 >  >  
 >  >  see also 8.2.3.4 Example 8-3.
 >  >  
 >  >  isn't it the case for _rtld_exclusive_exit?
 >  
 >  I don't think that's relevant; this is about global visibility of
 >  changes. E.g. if another thread tries to get the mutex, it must see all
 >  changes from the current thread once it has successfully acquired the mutex.
 >  
 >  Joerg

 sorry for late reply.

 _rtld_exclusive_exit
     _rtld_mutex = 0;  // older write to different location
     waiter = _rtld_waiter_exclusive  // read

 my interpretation of the manual text is that these two accesses
 can be reordered.
 if that happens, _rtld_exclusive_enter can block on a now-unlocked mutex.

 YAMAMOTO Takashi
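
One way to rule out the reordering described in the message above is a full
barrier between the write that releases the mutex and the read of the waiter
word. The following is a minimal sketch using the C11 atomic API mentioned
earlier in this thread. It is not the code in rtld.c and not the change that
was committed: mutex_word and waiter_exclusive are hypothetical stand-ins for
_rtld_mutex and _rtld_waiter_exclusive, and exclusive_exit_sketch() is a name
invented for the illustration. It returns the waiter id that the caller
would then wake, or 0 if there is none.

```c
#include <stdatomic.h>

/* Hypothetical stand-ins for _rtld_mutex and _rtld_waiter_exclusive. */
static atomic_uint mutex_word;
static atomic_uint waiter_exclusive;

unsigned int
exclusive_exit_sketch(void)
{
	/* Release the lock; earlier writes become visible with it. */
	atomic_store_explicit(&mutex_word, 0, memory_order_release);

	/*
	 * Full barrier: the read below may not be satisfied before the
	 * write above is globally visible.
	 */
	atomic_thread_fence(memory_order_seq_cst);

	/* Only now look for a waiter that needs to be woken. */
	return atomic_load_explicit(&waiter_exclusive, memory_order_relaxed);
}
```

For the barrier to help, the entering side would need a matching full
barrier between publishing itself as a waiter and re-reading the mutex word;
that half is assumed here and not shown.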

State-Changed-From-To: open->feedback
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Mon, 04 Jul 2022 16:17:42 +0000
State-Changed-Why:
Was this addressed by the following commits?

https://mail-index.netbsd.org/source-changes/2020/04/16/msg116256.html
https://mail-index.netbsd.org/source-changes/2020/04/19/msg116337.html


>Unformatted:
