NetBSD Problem Report #53285
From gson@gson.org Sun May 13 09:23:30 2018
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id D6F7F7A1AE
for <gnats-bugs@gnats.NetBSD.org>; Sun, 13 May 2018 09:23:30 +0000 (UTC)
Message-Id: <20180513092324.F0CF19899BD@guava.gson.org>
Date: Sun, 13 May 2018 12:23:24 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: Build times tripled with make/job.c 1.193
X-Send-Pr-Version: 3.95
>Number: 53285
>Category: bin
>Synopsis: Build times tripled with make/job.c 1.193
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: sjg
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sun May 13 09:25:00 +0000 2018
>Closed-Date: Mon May 14 19:25:43 +0000 2018
>Last-Modified: Mon May 14 19:35:00 +0000 2018
>Originator: Andreas Gustafsson
>Release: NetBSD-current, source date >= 2018.05.12.18.17.04
>Organization:
>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:
The amount of time (real, not cpu) to build an amd64 relase on
babylon5.netbsd.org more than tripled with the commit of
src/usr.bin/make/job.c 1.193:
babylon5.netbsd.org /bracket/i386/results/2018$ grep build_wall_time */bracket.db | tail -2
2018.05.12.15.14.49/bracket.db:build_wall_time=4896.09
2018.05.12.18.17.04/bracket.db:build_wall_time=16268.98
babylon5 is an 8-core, 16-hyperthread machine, building with -j 16.
The builds are niced and there is a few threads of other load, but I'm
also seeing a similar slowdown on my own build machine which has
no load other than the build.
Needless to say, this is negatively impacting the operation of the
TNF test infrastructure.
>How-To-Repeat:
Do a parallel release build on a multi-core machine with and without
make/job.c 1.193. Compare build wall times.
>Fix:
Please revert the commit in case.
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: bin-bug-people->sjg
Responsible-Changed-By: gson@NetBSD.org
Responsible-Changed-When: Sun, 13 May 2018 09:27:31 +0000
Responsible-Changed-Why:
Over to committer.
From: "Christos Zoulas" <christos@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/53285 CVS commit: src/usr.bin/make
Date: Sun, 13 May 2018 08:10:37 -0400
Module Name: src
Committed By: christos
Date: Sun May 13 12:10:37 UTC 2018
Modified Files:
src/usr.bin/make: job.c
Log Message:
PR/53285: Andreas Gustafsson: Build times tripled with make/job.c 1.193
Revert previous:
2018.05.12.15.14.49/bracket.db:build_wall_time=4896.09
2018.05.12.18.17.04/bracket.db:build_wall_time=16268.98
To generate a diff of this commit:
cvs rdiff -u -r1.193 -r1.194 src/usr.bin/make/job.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: bin/53285: Build times tripled with make/job.c 1.193
Date: Sun, 13 May 2018 19:45:43 +0700
Date: Sun, 13 May 2018 09:25:00 +0000 (UTC)
From: gson@gson.org (Andreas Gustafsson)
Message-ID: <20180513092500.C9A487A1DF@mollari.NetBSD.org>
| Please revert the commit in case.
Before giving up completely, perhaps it would be worth trying changing the
"sleep(1)" into a much smaller usleep() or nanosleep() - which would have
the same effect, of avoiding busy waiting, while not adding nearly as much
delay (if it happens that the sleep needs to be a second occasionally, then
that might mean 100 * usleep(10000) rather than 1 sleep(1), but it seems that
most of the time, the wait does not need to be nearly that long).
kre
From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org, sjg@NetBSD.org, kre@NetBSD.org
Cc:
Subject: Re: bin/53285: Build times tripled with make/job.c 1.193
Date: Sun, 13 May 2018 16:08:29 +0300
Robert Elz wrote:
> Before giving up completely, perhaps it would be worth trying changing the
> "sleep(1)" into a much smaller usleep() or nanosleep() - which would have
> the same effect, of avoiding busy waiting, while not adding nearly as much
> delay (if it happens that the sleep needs to be a second occasionally, then
> that might mean 100 * usleep(10000) rather than 1 sleep(1), but it seems that
> most of the time, the wait does not need to be nearly that long).
I don't know what problem the original commit was trying to solve
(someone please file a PR), but arbitrary sleeps are never the right
solution.
--
Andreas Gustafsson, gson@gson.org
From: christos@zoulas.com (Christos Zoulas)
To: gnats-bugs@NetBSD.org, sjg@NetBSD.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org, gson@gson.org (Andreas Gustafsson)
Cc:
Subject: Re: bin/53285: Build times tripled with make/job.c 1.193
Date: Sun, 13 May 2018 09:14:31 -0400
On May 13, 12:50pm, kre@munnari.OZ.AU (Robert Elz) wrote:
-- Subject: Re: bin/53285: Build times tripled with make/job.c 1.193
| Before giving up completely, perhaps it would be worth trying changing the
| "sleep(1)" into a much smaller usleep() or nanosleep() - which would have
| the same effect, of avoiding busy waiting, while not adding nearly as much
| delay (if it happens that the sleep needs to be a second occasionally, then
| that might mean 100 * usleep(10000) rather than 1 sleep(1), but it seems that
| most of the time, the wait does not need to be nearly that long).
Or change this into a better/cheaper form of IPC? Sleeping without someone
to wake you up does not sound right.
christos
From: "Simon J. Gerraty" <sjg@juniper.net>
To: <gnats-bugs@NetBSD.org>
Cc: <gnats-admin@NetBSD.org>, <netbsd-bugs@NetBSD.org>,
Andreas Gustafsson
<gson@gson.org>, <sjg@juniper.net>
Subject: Re: bin/53285: Build times tripled with make/job.c 1.193
Date: Sun, 13 May 2018 10:18:06 -0700
Robert Elz <kre@munnari.OZ.AU> wrote:
> | Please revert the commit in case.
> =
> Before giving up completely, perhaps it would be worth trying changing =
the
> "sleep(1)" into a much smaller usleep() or nanosleep() - which would ha=
ve
> the same effect, of avoiding busy waiting, while not adding nearly as m=
uch
> delay (if it happens that the sleep needs to be a second occasionally, =
then
> that might mean 100 * usleep(10000) rather than 1 sleep(1), but it seem=
s that
> most of the time, the wait does not need to be nearly that long).
Removing the sleep entirely, would not be unreasonable since
the busy wait issue has existed for several years.
Ie. when there are no tokens and no jobs running is when you'd
want wantToken=3D1 but that's not when it was being set.
Will try usleep(10000)....
From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: bin/53285: Build times tripled with make/job.c 1.193
Date: Sun, 13 May 2018 20:03:22 +0200
Function Make_Run() does:
while (!Lst_IsEmpty(toBeMade) || jobTokensRunning > 0) {
Job_CatchOutput();
(void)MakeStartJobs();
}
where Job_CatchOutput() does:
do {
nready = poll(fds + 1 - wantToken, nfds - 1 + wantToken, POLL_MSEC);
} while (nready < 0 && errno == EINTR);
with POLL_MSEC == 5000.
Where is the busy waiting?
--
J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)
From: "Simon J. Gerraty" <sjg@juniper.net>
To: <gnats-bugs@NetBSD.org>
Cc: <gnats-admin@NetBSD.org>, <netbsd-bugs@NetBSD.org>,
Andreas Gustafsson
<gson@gson.org>, <sjg@juniper.net>
Subject: Re: bin/53285: Build times tripled with make/job.c 1.193
Date: Sun, 13 May 2018 11:33:14 -0700
J. Hannken-Illjes <hannken@eis.cs.tu-bs.de> wrote:
> Function Make_Run() does:
>
> while (!Lst_IsEmpty(toBeMade) || jobTokensRunning > 0) {
> Job_CatchOutput();
> (void)MakeStartJobs();
> }
>
> Where is the busy waiting?
Hmmm good point, I either ignored that ||
or confused start of Job_CatchChildren that returns unless
jobTokensRunning > 0
Obviously need more coffee...
From: "Simon J. Gerraty" <sjg@juniper.net>
To:
Cc: <gnats-bugs@NetBSD.org>, <gnats-admin@NetBSD.org>,
<netbsd-bugs@NetBSD.org>, Andreas Gustafsson <gson@gson.org>,
<sjg@juniper.net>
Subject: Re: bin/53285: Build times tripled with make/job.c 1.193
Date: Sun, 13 May 2018 11:58:00 -0700
Simon J. Gerraty <sjg@juniper.net> wrote:
> Obviously need more coffee...
But anyway, means the originally propsed change (below) should be ok
which is what a lot of folk have been running for some time with no
ill effect.
I'm interested though to see if simply fixing the when job token pipe
was polled also results in reduced sys overhead.
Index: job.c
===================================================================
RCS file: /cvsroot/src/usr.bin/make/job.c,v
retrieving revision 1.194
diff -u -p -r1.194 job.c
--- job.c 13 May 2018 12:10:36 -0000 1.194
+++ job.c 13 May 2018 18:52:38 -0000
@@ -2950,7 +2950,6 @@ Job_TokenWithdraw(void)
}
if (DEBUG(JOB))
fprintf(debug_file, "(%d) blocked for token\n", getpid());
- wantToken = 1;
return FALSE;
}
From: "Simon J. Gerraty" <sjg@juniper.net>
To: <gnats-bugs@NetBSD.org>
Cc: <gnats-admin@NetBSD.org>, <netbsd-bugs@NetBSD.org>,
Andreas Gustafsson
<gson@gson.org>, <sjg@juniper.net>
Subject: Re: bin/53285: Build times tripled with make/job.c 1.193
Date: Sun, 13 May 2018 10:59:30 -0700
The patch below, is probably a better 1st step.
It fixes the busy wait with no delay.
It does still poll the job token pipe, but only when we would
otherwise be busy waiting, rather when we have no need to.
Index: job.c
===================================================================
RCS file: /cvsroot/src/usr.bin/make/job.c,v
retrieving revision 1.194
diff -u -p -r1.194 job.c
--- job.c 13 May 2018 12:10:36 -0000 1.194
+++ job.c 13 May 2018 17:49:16 -0000
@@ -2944,14 +2944,15 @@ Job_TokenWithdraw(void)
count = read(tokenWaitJob.inPipe, &tok, 1);
if (count == 0)
Fatal("eof on job pipe!");
- if (count < 0 && jobTokensRunning != 0) {
+ if (count < 0) {
if (errno != EAGAIN) {
Fatal("job pipe read: %s", strerror(errno));
}
if (DEBUG(JOB))
fprintf(debug_file, "(%d) blocked for token\n", getpid());
+ if (jobTokensRunning)
+ return FALSE;
wantToken = 1;
- return FALSE;
}
if (count == 1 && tok != '+') {
From: Martin Husemann <martin@duskware.de>
To: "Simon J. Gerraty" <sjg@juniper.net>
Cc: gnats-bugs@NetBSD.org
Subject: Re: bin/53285: Build times tripled with make/job.c 1.193
Date: Mon, 14 May 2018 09:55:02 +0200
Stupid question: can't we use proper sem_timedwait(3) or similar and
be done?
Martin
State-Changed-From-To: open->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Mon, 14 May 2018 19:25:43 +0000
State-Changed-Why:
Change reverted, build times are back to normal.
From: "Simon J. Gerraty" <sjg@juniper.net>
To: Martin Husemann <martin@duskware.de>
Cc: <gnats-bugs@NetBSD.org>, <sjg@juniper.net>
Subject: Re: bin/53285: Build times tripled with make/job.c 1.193
Date: Mon, 14 May 2018 12:19:58 -0700
Martin Husemann <martin@duskware.de> wrote:
> Stupid question: can't we use proper sem_timedwait(3) or similar and
> be done?
Yes and no.
First off the busy wait issue was red-herring so no need to do anything
on that front.
Second - for bmake anyway, use of something like sem_timedwait could
likely pose portability issue, nothing insurmountable but still a
potential issue.
Thanks
--sjg
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.43 2018/01/16 07:36:43 maya Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2017
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.