NetBSD Problem Report #57075

From gson@gson.org  Wed Oct 26 15:29:21 2022
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id DE24E1A921F
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 26 Oct 2022 15:29:21 +0000 (UTC)
Message-Id: <20221026152915.378BC254819@guava.gson.org>
Date: Wed, 26 Oct 2022 18:29:15 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: nbctfmerge looped
X-Send-Pr-Version: 3.95

>Number:         57075
>Category:       toolchain
>Synopsis:       nbctfmerge looped
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    toolchain-manager
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Oct 26 15:30:01 +0000 2022
>Closed-Date:    
>Last-Modified:  Thu May 25 16:55:01 +0000 2023
>Originator:     Andreas Gustafsson
>Release:        NetBSD 9.2_STABLE
>Organization:

>Environment:
System: NetBSD lyta.NetBSD.org 9.2_STABLE NetBSD 9.2_STABLE (WWUSERVER) #2: Sun Apr 17 07:46:02 UTC 2022  spz@franklin.NetBSD.org:/home/netbsd/9/amd64/obj/sys/arch/amd64/compile/WWUSERVER amd64
Architecture: x86_64
Machine: amd64
>Description:

The testbed on lyta.netbsd.org recently got stuck because a nbctfmerge
process entered what looks like some kind of inifinite loop:

 UID   PID  PPID    CPU PRI NI    VSZ   RSS WCHAN   STAT TTY           TIME COMMAND
 666  1192  3076 216588  15 10  62236  8348 -       ONl  ?     103091:53.38 /tmp/bracket/build/2022.09.18.06.03.19-evbarm-earmv7hf/tools/bin/nbctfmerge -t -g -L VERSION -o amd am_ops.o amd.o amfs_auto.o amfs_generic.o amfs_direct.o amfs_error.o amfs_host.o amfs_link.o amfs_linkx.o amfs_nfsl.o amfs_nfsx.o amfs_program.o amfs_root.o amfs_toplvl.o amfs_union.o amq_subr.o amq_svc.o autil.o clock.o conf.o get_args.o info_exec.o info_file.o info_ndbm.o info_passwd.o info_sun.o info_union.o map.o mapc.o mntfs.o nfs_prot_svc.o nfs_start.o nfs_subr.o ops_cdfs.o ops_efs.o ops_mfs.o ops_nfs.o ops_nfs3.o ops_nullfs.o ops_pcfs.o ops_tfs.o ops_tmpfs.o ops_udf.o ops_ufs.o ops_umapfs.o ops_unionfs.o opts.o readdir.o restart.o rpc_fwd.o sched.o srvr_amfs_auto.o srvr_nfs.o sun_map.o sun_map_parse.o sun_map_tok.o conf_parse.o conf_tok.o info_hesiod.o info_ldap.o info_nis.o

Attaching with gdb showed:

(gdb) info threads
  Id   Target Id                Frame 
* 1    LWP 6 of process 1192 "" 0x00007f23412a220a in ___lwp_park60 () from /usr/lib/libc.so.12
  2    LWP 5 of process 1192 "" 0x00007f23412a220a in ___lwp_park60 () from /usr/lib/libc.so.12
  3    LWP 4 of process 1192 "" 0x00007f23412a220a in ___lwp_park60 () from /usr/lib/libc.so.12
  4    LWP 3 of process 1192 "" 0x00007f23412435ca in _lwp_unpark_all () from /usr/lib/libc.so.12
  5    LWP 2 of process 1192 "" 0x00007f23412a220a in ___lwp_park60 () from /usr/lib/libc.so.12
  6    LWP 1 of process 1192 "" 0x00007f23412a220a in ___lwp_park60 () from /usr/lib/libc.so.12
(gdb) thread 1
[Switching to thread 1 (LWP 6 of process 1192)]
#0  0x00007f23412a220a in ___lwp_park60 () from /usr/lib/libc.so.12
(gdb) bt
#0  0x00007f23412a220a in ___lwp_park60 () from /usr/lib/libc.so.12
#1  0x00007f234180aeaf in pthread_cond_timedwait (cond=0x624ff8 <wq+88>, mutex=0x624fc0 <wq+32>, abstime=0x0) at /home/netbsd/9/src/lib/libpthread/pthread_cond.c:168
#2  0x00000000004049fe in worker_thread ()
#3  0x00007f234180c863 in pthread__create_tramp (cookie=0x7f2341ff7000) at /home/netbsd/9/src/lib/libpthread/pthread.c:592
#4  0x00007f2341287e10 in ?? () from /usr/lib/libc.so.12
#5  0x0000000000000000 in ?? ()
(gdb) thread 2
[Switching to thread 2 (LWP 5 of process 1192)]
#0  0x00007f23412a220a in ___lwp_park60 () from /usr/lib/libc.so.12
(gdb) bt
#0  0x00007f23412a220a in ___lwp_park60 () from /usr/lib/libc.so.12
#1  0x00007f234180aeaf in pthread_cond_timedwait (cond=0x624ff8 <wq+88>, mutex=0x624fc0 <wq+32>, abstime=0x0) at /home/netbsd/9/src/lib/libpthread/pthread_cond.c:168
#2  0x00000000004049fe in worker_thread ()
#3  0x00007f234180c863 in pthread__create_tramp (cookie=0x7f2341ff9000) at /home/netbsd/9/src/lib/libpthread/pthread.c:592
#4  0x00007f2341287e10 in ?? () from /usr/lib/libc.so.12
#5  0x0000000000000000 in ?? ()
(gdb) thread 3
[Switching to thread 3 (LWP 4 of process 1192)]
#0  0x00007f23412a220a in ___lwp_park60 () from /usr/lib/libc.so.12
(gdb) bt
#0  0x00007f23412a220a in ___lwp_park60 () from /usr/lib/libc.so.12
#1  0x00007f234180aeaf in pthread_cond_timedwait (cond=0x624ff8 <wq+88>, mutex=0x624fc0 <wq+32>, abstime=0x0) at /home/netbsd/9/src/lib/libpthread/pthread_cond.c:168
#2  0x00000000004049fe in worker_thread ()
#3  0x00007f234180c863 in pthread__create_tramp (cookie=0x7f2341ffb000) at /home/netbsd/9/src/lib/libpthread/pthread.c:592
#4  0x00007f2341287e10 in ?? () from /usr/lib/libc.so.12
#5  0x0000000000000000 in ?? ()
(gdb) thread 4
[Switching to thread 4 (LWP 3 of process 1192)]
#0  0x00007f23412435ca in _lwp_unpark_all () from /usr/lib/libc.so.12
(gdb) bt
#0  0x00007f23412435ca in _lwp_unpark_all () from /usr/lib/libc.so.12
#1  0x00007f2341809b22 in pthread__mutex_wakeup (self=self@entry=0x7f2342203000, ptm=ptm@entry=0x7f2342205068) at /home/netbsd/9/src/lib/libpthread/pthread_mutex.c:606
#2  0x00007f2341809f48 in pthread__mutex_unlock_slow (ptm=0x7f2342205068) at /home/netbsd/9/src/lib/libpthread/pthread_mutex.c:516
#3  0x00000000004049cd in worker_thread ()
#4  0x00007f234180c863 in pthread__create_tramp (cookie=0x7f2342203000) at /home/netbsd/9/src/lib/libpthread/pthread.c:592
#5  0x00007f2341287e10 in ?? () from /usr/lib/libc.so.12
#6  0x0000000000000000 in ?? ()
(gdb) thread 5
[Switching to thread 5 (LWP 2 of process 1192)]
#0  0x00007f23412a220a in ___lwp_park60 () from /usr/lib/libc.so.12
(gdb) bt
#0  0x00007f23412a220a in ___lwp_park60 () from /usr/lib/libc.so.12
#1  0x00007f234180aeaf in pthread_cond_timedwait (cond=0x624ff8 <wq+88>, mutex=0x624fc0 <wq+32>, abstime=0x0) at /home/netbsd/9/src/lib/libpthread/pthread_cond.c:168
#2  0x00000000004049fe in worker_thread ()
#3  0x00007f234180c863 in pthread__create_tramp (cookie=0x7f234220f000) at /home/netbsd/9/src/lib/libpthread/pthread.c:592
#4  0x00007f2341287e10 in ?? () from /usr/lib/libc.so.12
#5  0x0000000000400000 in ?? ()
#6  0x00007f2341000000 in ?? ()
#7  0x0000001003a0efff in ?? ()
#8  0x00007f2340e000c0 in ?? ()
#9  0x00000000001fff40 in ?? ()
#10 0x0000000000000000 in ?? ()
(gdb) thread 6
[Switching to thread 6 (LWP 1 of process 1192)]
#0  0x00007f23412a220a in ___lwp_park60 () from /usr/lib/libc.so.12
(gdb) bt
#0  0x00007f23412a220a in ___lwp_park60 () from /usr/lib/libc.so.12
#1  0x00007f2341809d36 in pthread__mutex_lock_slow (ptm=0x624fc0 <wq+32>, ts=0x0) at /home/netbsd/9/src/lib/libpthread/pthread_mutex.c:384
#2  0x0000000000404738 in merge_ctf_cb ()
#3  0x0000000000406a1d in read_file ()
#4  0x0000000000406ddd in read_ctf ()
#5  0x00000000004056c6 in main ()
(gdb) 

The nbctfmerge binary and a core file (forced using SIGABRT) are saved
in lyta.netbsd.org:~bracket/bugs/nbctfmerge-loop and can be made
available if someone wants to examine them.

See also PR 55014, where nbctfmerge hung rather than looped.

>How-To-Repeat:

Don't know, it has only happened once so far.

>Fix:

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: toolchain-manager->lukem
Responsible-Changed-By: lukem@NetBSD.org
Responsible-Changed-When: Thu, 25 May 2023 13:27:07 +0000
Responsible-Changed-Why:


State-Changed-From-To: open->feedback
State-Changed-By: lukem@NetBSD.org
State-Changed-When: Thu, 25 May 2023 13:27:07 +0000
State-Changed-Why:

This may be a dupe of toolchain/55014 ?

This may have been fixed by:
  https://mail-index.netbsd.org/source-changes-hg/2022/05/31/msg356307.html

Have you seen this recently?


Responsible-Changed-From-To: lukem->toolchain-manager
Responsible-Changed-By: lukem@NetBSD.org
Responsible-Changed-When: Thu, 25 May 2023 13:36:16 +0000
Responsible-Changed-Why:


State-Changed-From-To: feedback->open
State-Changed-By: lukem@NetBSD.org
State-Changed-When: Thu, 25 May 2023 13:36:16 +0000
State-Changed-Why:

I misread the symptoms. THis isn't quite like PR 55014


From: Andreas Gustafsson <gson@gson.org>
To: lukem@netbsd.org
Cc: gnats-bugs@netbsd.org
Subject: Re: toolchain/57075 (nbctfmerge looped)
Date: Thu, 25 May 2023 17:15:33 +0300

 lukem@NetBSD.org wrote:
 > This may have been fixed by:
 >   https://mail-index.netbsd.org/source-changes-hg/2022/05/31/msg356307.html
 >
 > Have you seen this recently?

 I don't recall seeing it happen again.  But even if the bug was fixed
 by the referenced commit, it could still happen on the TNF testbeds
 since it affects the nbctfmerge tool, which is linked with the host
 libpthread and therefore may still have the bug even if the system
 being built doesn't.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Luke Mewburn <luke@mewburn.net>
To: gnats-bugs@netbsd.org
Cc: toolchain-manager@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, Andreas Gustafsson <gson@gson.org>
Subject: Re: toolchain/57075 (nbctfmerge looped)
Date: Fri, 26 May 2023 01:35:03 +1000

 On 23-05-25 15:10, Andreas Gustafsson wrote:
   |  lukem@NetBSD.org wrote:
   |  > This may have been fixed by:
   |  >   https://mail-index.netbsd.org/source-changes-hg/2022/05/31/msg356307.html
   |  >
   |  > Have you seen this recently?
   |  
   |  I don't recall seeing it happen again.  But even if the bug was fixed
   |  by the referenced commit, it could still happen on the TNF testbeds
   |  since it affects the nbctfmerge tool, which is linked with the host
   |  libpthread and therefore may still have the bug even if the system
   |  being built doesn't.

 Whether or not this specific bug is fixed, as a matter of GNATS
 convention, do we hold up closing a bug after a fix because there might
 be a system with the old code still deployed?  This is a bit rhetorical,
 because we can't use that as our convention because we'd never close
 bugs!

 Regarding ctfmerge: given the previous lack of error handling/reporting
 in the concurrency / thread code in ctfmerge, and the bugs I found and
 fixed on at least one host (macOS) that were previously not obvious
 because of lack of error handling, I don't know if we get much value
 keeping open older ctfmerge tickets that relate to thread issues.

 Happy to read other opinions, of course.

From: Luke Mewburn <luke@mewburn.net>
To: gnats-bugs@netbsd.org
Cc: toolchain-manager@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, Andreas Gustafsson <gson@gson.org>
Subject: Re: toolchain/57075 (nbctfmerge looped)
Date: Fri, 26 May 2023 01:39:15 +1000

 On 23-05-26 01:35, Luke Mewburn wrote:
   | On 23-05-25 15:10, Andreas Gustafsson wrote:
   |   |  lukem@NetBSD.org wrote:
   |   |  > This may have been fixed by:
   |   |  >   https://mail-index.netbsd.org/source-changes-hg/2022/05/31/msg356307.html
   |   |  >
   |   |  > Have you seen this recently?
   |   |  
   |   |  I don't recall seeing it happen again.  But even if the bug was fixed
   |   |  by the referenced commit, it could still happen on the TNF testbeds
   |   |  since it affects the nbctfmerge tool, which is linked with the host
   |   |  libpthread and therefore may still have the bug even if the system
   |   |  being built doesn't.
   | 
   | Regarding ctfmerge: given the previous lack of error handling/reporting
   | in the concurrency / thread code in ctfmerge, and the bugs I found and
   | fixed on at least one host (macOS) that were previously not obvious
   | because of lack of error handling, I don't know if we get much value
   | keeping open older ctfmerge tickets that relate to thread issues.

 To clarify, I meant:
   "older ctfmerge tickets that relate to *intermittent faults with* thread issues"

 Intermittent bugs are the worst! Especially in code not using the thread
 APIs as well as they should.

From: Andreas Gustafsson <gson@gson.org>
To: Luke Mewburn <luke@mewburn.net>
Cc: gnats-bugs@netbsd.org
Subject: Re: toolchain/57075 (nbctfmerge looped)
Date: Thu, 25 May 2023 19:46:55 +0300

 Luke Mewburn wrote:
 >   |  > Have you seen this recently?
 >   |  
 >   |  I don't recall seeing it happen again.  But even if the bug was fixed
 >   |  by the referenced commit, it could still happen on the TNF testbeds
 >   |  since it affects the nbctfmerge tool, which is linked with the host
 >   |  libpthread and therefore may still have the bug even if the system
 >   |  being built doesn't.
 > 
 > Whether or not this specific bug is fixed, as a matter of GNATS
 > convention, do we hold up closing a bug after a fix because there might
 > be a system with the old code still deployed?  This is a bit rhetorical,
 > because we can't use that as our convention because we'd never close
 > bugs!

 My intended meaning was not "the bug cannot be closed because the
 testbeds are still affected", but rather "since my experience is with
 testbeds that are not running -current, I'm afraid the question of
 whether or not I have seen the problem recently is not actually
 relevant to determining whether or not the bug has been fixed".
 -- 
 Andreas Gustafsson, gson@gson.org

From: Luke Mewburn <luke@mewburn.net>
To: Andreas Gustafsson <gson@gson.org>
Cc: Luke Mewburn <luke@mewburn.net>, gnats-bugs@netbsd.org
Subject: Re: toolchain/57075 (nbctfmerge looped)
Date: Fri, 26 May 2023 02:53:18 +1000

 On 23-05-25 19:46, Andreas Gustafsson wrote:
   | Luke Mewburn wrote:
   | >   |  > Have you seen this recently?
   | >   |  
   | >   |  I don't recall seeing it happen again.  But even if the bug was fixed
   | >   |  by the referenced commit, it could still happen on the TNF testbeds
   | >   |  since it affects the nbctfmerge tool, which is linked with the host
   | >   |  libpthread and therefore may still have the bug even if the system
   | >   |  being built doesn't.
   | > 
   | > Whether or not this specific bug is fixed, as a matter of GNATS
   | > convention, do we hold up closing a bug after a fix because there might
   | > be a system with the old code still deployed?  This is a bit rhetorical,
   | > because we can't use that as our convention because we'd never close
   | > bugs!
   | 
   | My intended meaning was not "the bug cannot be closed because the
   | testbeds are still affected", but rather "since my experience is with
   | testbeds that are not running -current, I'm afraid the question of
   | whether or not I have seen the problem recently is not actually
   | relevant to determining whether or not the bug has been fixed".

 Fair enough!
 And I didn't intend to minimise the issue by any means.

 cheers,
 Luke.

>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.47 2022/09/11 19:34:41 kim Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2023 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.