NetBSD Problem Report #53285

From gson@gson.org  Sun May 13 09:23:30 2018
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id D6F7F7A1AE
	for <gnats-bugs@gnats.NetBSD.org>; Sun, 13 May 2018 09:23:30 +0000 (UTC)
Message-Id: <20180513092324.F0CF19899BD@guava.gson.org>
Date: Sun, 13 May 2018 12:23:24 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: Build times tripled with make/job.c 1.193
X-Send-Pr-Version: 3.95

>Number:         53285
>Category:       bin
>Synopsis:       Build times tripled with make/job.c 1.193
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    sjg
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun May 13 09:25:00 +0000 2018
>Closed-Date:    Mon May 14 19:25:43 +0000 2018
>Last-Modified:  Mon May 14 19:35:00 +0000 2018
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current, source date >= 2018.05.12.18.17.04
>Organization:
>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:

The amount of time (real, not cpu) to build an amd64 relase on
babylon5.netbsd.org more than tripled with the commit of
src/usr.bin/make/job.c 1.193:

  babylon5.netbsd.org /bracket/i386/results/2018$ grep build_wall_time */bracket.db | tail -2
  2018.05.12.15.14.49/bracket.db:build_wall_time=4896.09
  2018.05.12.18.17.04/bracket.db:build_wall_time=16268.98

babylon5 is an 8-core, 16-hyperthread machine, building with -j 16.
The builds are niced and there is a few threads of other load, but I'm
also seeing a similar slowdown on my own build machine which has
no load other than the build.

Needless to say, this is negatively impacting the operation of the
TNF test infrastructure.

>How-To-Repeat:

Do a parallel release build on a multi-core machine with and without
make/job.c 1.193.  Compare build wall times.

>Fix:

Please revert the commit in case.

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: bin-bug-people->sjg
Responsible-Changed-By: gson@NetBSD.org
Responsible-Changed-When: Sun, 13 May 2018 09:27:31 +0000
Responsible-Changed-Why:
Over to committer.


From: "Christos Zoulas" <christos@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/53285 CVS commit: src/usr.bin/make
Date: Sun, 13 May 2018 08:10:37 -0400

 Module Name:	src
 Committed By:	christos
 Date:		Sun May 13 12:10:37 UTC 2018

 Modified Files:
 	src/usr.bin/make: job.c

 Log Message:
 PR/53285: Andreas Gustafsson: Build times tripled with make/job.c 1.193
 Revert previous:
     2018.05.12.15.14.49/bracket.db:build_wall_time=4896.09
     2018.05.12.18.17.04/bracket.db:build_wall_time=16268.98


 To generate a diff of this commit:
 cvs rdiff -u -r1.193 -r1.194 src/usr.bin/make/job.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: bin/53285: Build times tripled with make/job.c 1.193
Date: Sun, 13 May 2018 19:45:43 +0700

     Date:        Sun, 13 May 2018 09:25:00 +0000 (UTC)
     From:        gson@gson.org (Andreas Gustafsson)
     Message-ID:  <20180513092500.C9A487A1DF@mollari.NetBSD.org>

   | Please revert the commit in case.

 Before giving up completely, perhaps it would be worth trying changing the
 "sleep(1)" into a much smaller usleep() or nanosleep() - which would have
 the same effect, of avoiding busy waiting, while not adding nearly as much
 delay (if it happens that the sleep needs to be a second occasionally, then
 that might mean 100 * usleep(10000) rather than 1 sleep(1), but it seems that
 most of the time, the wait does not need to be nearly that long).

 kre

From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org, sjg@NetBSD.org, kre@NetBSD.org
Cc: 
Subject: Re: bin/53285: Build times tripled with make/job.c 1.193
Date: Sun, 13 May 2018 16:08:29 +0300

 Robert Elz wrote:
 >  Before giving up completely, perhaps it would be worth trying changing the
 >  "sleep(1)" into a much smaller usleep() or nanosleep() - which would have
 >  the same effect, of avoiding busy waiting, while not adding nearly as much
 >  delay (if it happens that the sleep needs to be a second occasionally, then
 >  that might mean 100 * usleep(10000) rather than 1 sleep(1), but it seems that
 >  most of the time, the wait does not need to be nearly that long).

 I don't know what problem the original commit was trying to solve
 (someone please file a PR), but arbitrary sleeps are never the right
 solution.
 -- 
 Andreas Gustafsson, gson@gson.org

From: christos@zoulas.com (Christos Zoulas)
To: gnats-bugs@NetBSD.org, sjg@NetBSD.org, gnats-admin@netbsd.org, 
	netbsd-bugs@netbsd.org, gson@gson.org (Andreas Gustafsson)
Cc: 
Subject: Re: bin/53285: Build times tripled with make/job.c 1.193
Date: Sun, 13 May 2018 09:14:31 -0400

 On May 13, 12:50pm, kre@munnari.OZ.AU (Robert Elz) wrote:
 -- Subject: Re: bin/53285: Build times tripled with make/job.c 1.193

 | Before giving up completely, perhaps it would be worth trying changing the
 | "sleep(1)" into a much smaller usleep() or nanosleep() - which would have
 | the same effect, of avoiding busy waiting, while not adding nearly as much
 | delay (if it happens that the sleep needs to be a second occasionally, then
 | that might mean 100 * usleep(10000) rather than 1 sleep(1), but it seems that
 | most of the time, the wait does not need to be nearly that long).

 Or change this into a better/cheaper form of IPC? Sleeping without someone
 to wake you up does not sound right.

 christos

From: "Simon J. Gerraty" <sjg@juniper.net>
To: <gnats-bugs@NetBSD.org>
Cc: <gnats-admin@NetBSD.org>, <netbsd-bugs@NetBSD.org>,
        Andreas Gustafsson
	<gson@gson.org>, <sjg@juniper.net>
Subject: Re: bin/53285: Build times tripled with make/job.c 1.193
Date: Sun, 13 May 2018 10:18:06 -0700

 Robert Elz <kre@munnari.OZ.AU> wrote:
 >    | Please revert the commit in case.
 >  =

 >  Before giving up completely, perhaps it would be worth trying changing =
 the
 >  "sleep(1)" into a much smaller usleep() or nanosleep() - which would ha=
 ve
 >  the same effect, of avoiding busy waiting, while not adding nearly as m=
 uch
 >  delay (if it happens that the sleep needs to be a second occasionally, =
 then
 >  that might mean 100 * usleep(10000) rather than 1 sleep(1), but it seem=
 s that
 >  most of the time, the wait does not need to be nearly that long).

 Removing the sleep entirely, would not be unreasonable since
 the busy wait issue has existed for several years.
 Ie. when there are no tokens and no jobs running is when you'd
 want wantToken=3D1 but that's not when it was being set.

 Will try usleep(10000)....



From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: bin/53285: Build times tripled with make/job.c 1.193
Date: Sun, 13 May 2018 20:03:22 +0200

 Function Make_Run() does:

     while (!Lst_IsEmpty(toBeMade) || jobTokensRunning > 0) {
         Job_CatchOutput();
         (void)MakeStartJobs();
     }

 where Job_CatchOutput() does:

     do {
         nready = poll(fds + 1 - wantToken, nfds - 1 + wantToken, POLL_MSEC);
     } while (nready < 0 && errno == EINTR);

 with POLL_MSEC == 5000.

 Where is the busy waiting?

 --
 J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)

From: "Simon J. Gerraty" <sjg@juniper.net>
To: <gnats-bugs@NetBSD.org>
Cc: <gnats-admin@NetBSD.org>, <netbsd-bugs@NetBSD.org>,
        Andreas Gustafsson
	<gson@gson.org>, <sjg@juniper.net>
Subject: Re: bin/53285: Build times tripled with make/job.c 1.193
Date: Sun, 13 May 2018 11:33:14 -0700

 J. Hannken-Illjes <hannken@eis.cs.tu-bs.de> wrote:
 >  Function Make_Run() does:
 >  
 >      while (!Lst_IsEmpty(toBeMade) || jobTokensRunning > 0) {
 >          Job_CatchOutput();
 >          (void)MakeStartJobs();
 >      }
 >  

 >  Where is the busy waiting?

 Hmmm good point, I either ignored that ||
 or confused start of Job_CatchChildren that returns unless
 jobTokensRunning > 0

 Obviously need more coffee...


From: "Simon J. Gerraty" <sjg@juniper.net>
To: 
Cc: <gnats-bugs@NetBSD.org>, <gnats-admin@NetBSD.org>,
        <netbsd-bugs@NetBSD.org>, Andreas Gustafsson <gson@gson.org>,
        <sjg@juniper.net>
Subject: Re: bin/53285: Build times tripled with make/job.c 1.193
Date: Sun, 13 May 2018 11:58:00 -0700

 Simon J. Gerraty <sjg@juniper.net> wrote:
 > Obviously need more coffee...

 But anyway, means the originally propsed change (below) should be ok
 which is what a lot of folk have been running for some time with no
 ill effect.

 I'm interested though to see if simply fixing the when job token pipe
 was polled also results in reduced sys overhead.

 Index: job.c
 ===================================================================
 RCS file: /cvsroot/src/usr.bin/make/job.c,v
 retrieving revision 1.194
 diff -u -p -r1.194 job.c
 --- job.c	13 May 2018 12:10:36 -0000	1.194
 +++ job.c	13 May 2018 18:52:38 -0000
 @@ -2950,7 +2950,6 @@ Job_TokenWithdraw(void)
  	}
  	if (DEBUG(JOB))
  	    fprintf(debug_file, "(%d) blocked for token\n", getpid());
 -	wantToken = 1;
  	return FALSE;
      }


From: "Simon J. Gerraty" <sjg@juniper.net>
To: <gnats-bugs@NetBSD.org>
Cc: <gnats-admin@NetBSD.org>, <netbsd-bugs@NetBSD.org>,
        Andreas Gustafsson
	<gson@gson.org>, <sjg@juniper.net>
Subject: Re: bin/53285: Build times tripled with make/job.c 1.193
Date: Sun, 13 May 2018 10:59:30 -0700

 The patch below, is probably a better 1st step.
 It fixes the busy wait with no delay.

 It does still poll the job token pipe, but only when we would
 otherwise be busy waiting, rather when we have no need to.

 Index: job.c
 ===================================================================
 RCS file: /cvsroot/src/usr.bin/make/job.c,v
 retrieving revision 1.194
 diff -u -p -r1.194 job.c
 --- job.c	13 May 2018 12:10:36 -0000	1.194
 +++ job.c	13 May 2018 17:49:16 -0000
 @@ -2944,14 +2944,15 @@ Job_TokenWithdraw(void)
      count = read(tokenWaitJob.inPipe, &tok, 1);
      if (count == 0)
  	Fatal("eof on job pipe!");
 -    if (count < 0 && jobTokensRunning != 0) {
 +    if (count < 0) {
  	if (errno != EAGAIN) {
  	    Fatal("job pipe read: %s", strerror(errno));
  	}
  	if (DEBUG(JOB))
  	    fprintf(debug_file, "(%d) blocked for token\n", getpid());
 +	if (jobTokensRunning)
 +	    return FALSE;
  	wantToken = 1;
 -	return FALSE;
      }

      if (count == 1 && tok != '+') {

From: Martin Husemann <martin@duskware.de>
To: "Simon J. Gerraty" <sjg@juniper.net>
Cc: gnats-bugs@NetBSD.org
Subject: Re: bin/53285: Build times tripled with make/job.c 1.193
Date: Mon, 14 May 2018 09:55:02 +0200

 Stupid question: can't we use proper sem_timedwait(3) or similar and
 be done?

 Martin

State-Changed-From-To: open->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Mon, 14 May 2018 19:25:43 +0000
State-Changed-Why:
Change reverted, build times are back to normal.


From: "Simon J. Gerraty" <sjg@juniper.net>
To: Martin Husemann <martin@duskware.de>
Cc: <gnats-bugs@NetBSD.org>, <sjg@juniper.net>
Subject: Re: bin/53285: Build times tripled with make/job.c 1.193
Date: Mon, 14 May 2018 12:19:58 -0700

 Martin Husemann <martin@duskware.de> wrote:

 > Stupid question: can't we use proper sem_timedwait(3) or similar and
 > be done?

 Yes and no.
 First off the busy wait issue was red-herring so no need to do anything
 on that front.
 Second - for bmake anyway, use of something like sem_timedwait could
 likely pose portability issue, nothing insurmountable but still a
 potential issue.

 Thanks
 --sjg

>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.43 2018/01/16 07:36:43 maya Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2017 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.