NetBSD Problem Report #47524
From gson@gson.org Sat Feb 2 14:10:45 2013
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
by www.NetBSD.org (Postfix) with ESMTP id DBBEE63D7B3
for <gnats-bugs@gnats.NetBSD.org>; Sat, 2 Feb 2013 14:10:44 +0000 (UTC)
Message-Id: <20130202141040.27D8775EC5@guava.gson.org>
Date: Sat, 2 Feb 2013 16:10:39 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@gnats.NetBSD.org
Subject: make(1) loops in Job_CatchOutput()
X-Send-Pr-Version: 3.95
>Number: 47524
>Category: bin
>Synopsis: make(1) loops in Job_CatchOutput()
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: bin-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Feb 02 14:15:01 +0000 2013
>Closed-Date: Thu Jul 06 09:40:49 +0000 2023
>Last-Modified: Thu Jul 06 09:40:49 +0000 2023
>Originator: Andreas Gustafsson
>Release: NetBSD-current source date 2013.01.29.01.05.57
>Organization:
>Environment:
System: NetBSD 6.0 amd64 (build host)
Architecture: amd64
Machine: amd64
>Description:
I have been running automated builds of NetBSD/i386 several times per
day for a number of years now. The current build host is a quad-core
NetBSD/amd64 6.0 system, building with -j 4.
Recently, one of these builds hung with two make(1) processes stuck in
endless loops. They have now used more than 24 hours of CPU each and
are still running as I write this.
Running ktruss against the looping processes shows them making
repeated calls to read() that always returned EAGAIN. Attaching
with gdb shows them looping in Job_CatchOutput() in job.c.
Since this is the first time I have seen this particular failure in
many thousands of builds, I suspect it was caused by a recent change.
On Jan 26, usr.bin/make/job.c 1.65 was committed, containing the
following change to Job_CatchOutput() (among other changes):
@@ -2046,7 +2050,8 @@
if (nready < 0 || readyfd(&childExitJob)) {
char token = 0;
nready -= 1;
- (void)read(childExitJob.inPipe, &token, 1);
+ while (read(childExitJob.inPipe, &token, 1) == -1 && errno == EAGAIN)
+ continue;
if (token == DO_JOB_RESUME[0])
/* Complete relay requested from our SIGCONT handler */
JobRestartJobs();
I believe the "while" loop added by this change is where make is now
looping.
Some highlights from a transcript of the debugging session follow.
$ ps -glaxw|grep make
101 10825 22842 0 85 0 10804 1176 pipe_rd S+ ttyp0 0:00.00 grep make
101 26878 22842 0 43 0 30680 7528 - T ttyp0 0:06.34 gdb /tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake
101 2312 7193 34745 17 10 7704 2268 - ON pts/3 1648:33.01 /tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake _THISDIR_ DISTRIBUTION_DONE release
101 20821 9556 0 80 10 94744 78508 pipe_wr IN pts/3 0:01.71 /tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake
101 26400 28328 34745 17 10 7704 1976 - ON+ pts/3 1650:08.03 /tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake -j 4 release
$ top
load averages: 1.99, 1.97, 1.92; up 27+19:38:26 13:31:14
77 processes: 66 sleeping, 3 stopped, 4 zombie, 4 on CPU
CPU0 states: 0.0% user, 0.0% nice, 2.2% system, 28.9% interrupt, 68.8% idle
CPU1 states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU2 states: 0.0% user, 23.2% nice, 76.8% system, 0.0% interrupt, 0.0% idle
CPU3 states: 0.0% user, 22.4% nice, 77.6% system, 0.0% interrupt, 0.0% idle
Memory: 8702M Act, 4272M Inact, 48K Wired, 27M Exec, 8290M File, 81M Free
Swap: 16G Total, 1917M Used, 14G Free
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
26400 gson 15 10 7704K 1976K CPU/2 27.7H 99.02% 99.02% nbmake
2312 gson 15 10 7704K 2268K CPU/3 27.7H 99.02% 99.02% nbmake
[...]
$ ktruss -p 26400
26400 1 nbmake read(0x9, 0x7f7fffffccbf, 0x1) Err#35 EAGAIN
26400 1 nbmake read(0x9, 0x7f7fffffccbf, 0x1) Err#35 EAGAIN
26400 1 nbmake read(0x9, 0x7f7fffffccbf, 0x1) Err#35 EAGAIN
[...]
$ proctree
[...]
| | \-+- 19473 gson time ./build.sh -j 4 -D /tmp/bracket/build/2013.01.29.01.05.57-i386/destdir -R /bracket/i386/test/2013.01.29.01.05.57/release -T /tmp/bracket/build/2013.01.29.01.05.57-i386/tools -O /tmp/bracket/build/2013.01.29.01.05.57-i386/obj -m i386 -V TMPDIR=/tmp -U release iso-image
| | \-+- 28328 gson sh ./build.sh -j 4 -D /tmp/bracket/build/2013.01.29.01.05.57-i386/destdir -R /bracket/i386/test/2013.01.29.01.05.57/release -T /tmp/bracket/build/2013.01.29.01.05.57-i386/tools -O /tmp/bracket/build/2013.01.29.01.05.57-i386/obj -m i386 -V TMPDIR=/tmp -U release iso-image
| | \-+- 26400 gson /tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake -j 4 release
| | \-+= 07193 gson sh
| | \-+- 02312 gson /tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake _THISDIR_ DISTRIBUTION_DONE release
| | \-+= 09556 gson sh
| | \-+- 20821 gson /tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake
| | |--= 12759 gson (sh)
| | |--= 21395 gson (sh)
| | |--= 21444 gson (sh)
| | \--= 29829 gson (sh)
[...]
$ gdb /tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake
GNU gdb (GDB) 7.3.1
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64--netbsd".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake...(no debugging symbols found)...done.
(gdb) attach 2312
Attaching to program: /tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake, process 2312
Reading symbols from /usr/lib/libc.so.12...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libc.so.12
Reading symbols from /usr/libexec/ld.elf_so...(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/ld.elf_so
0x00007f7ff783912a in read () from /usr/lib/libc.so.12
(gdb) where
#0 0x00007f7ff783912a in read () from /usr/lib/libc.so.12
#1 0x000000000040afad in Job_CatchOutput ()
#2 0x000000000040ff2c in Make_Run ()
#3 0x000000000040e729 in main ()
>How-To-Repeat:
Do lots of builds until one of them hangs.
Alternatively, since the processes are still running, I may be able to
perform additional diagnostics on request, if asked quickly enough.
>Fix:
>Release-Note:
>Audit-Trail:
From: David Laight <david@l8s.co.uk>
To: gnats-bugs@NetBSD.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Sat, 2 Feb 2013 19:10:56 +0000
On Sat, Feb 02, 2013 at 02:15:01PM +0000, Andreas Gustafsson wrote:
> >Number: 47524
> >Category: bin
> >Synopsis: make(1) loops in Job_CatchOutput()
...
> @@ -2046,7 +2050,8 @@
> if (nready < 0 || readyfd(&childExitJob)) {
> char token = 0;
> nready -= 1;
> - (void)read(childExitJob.inPipe, &token, 1);
> + while (read(childExitJob.inPipe, &token, 1) == -1 && errno == EAGAIN)
> + continue;
> if (token == DO_JOB_RESUME[0])
> /* Complete relay requested from our SIGCONT handler */
> JobRestartJobs();
>
> I believe the "while" loop added by this change is where make is now
> looping.
I'd guess that fd in non-blocking so EAGAIN would just mean that the
pipe is empty.
The purpose of the read() is to empty the pipe, If it is already empty
it won't matter.
I think you'll find the sigchild handler writes into it, possibly
with some check of nready (not looked) to wake things up.
Most likely two children finishing at once only write 1 character,
or some path comes in with nready < 0 ...
David
--
David Laight: david@l8s.co.uk
From: Andreas Gustafsson <gson@gson.org>
To: dsl@NetBSD.org
Cc: christos@NetBSD.org,
gnats-bugs@NetBSD.org,
gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org,
gson@gson.org (Andreas Gustafsson)
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Mon, 4 Feb 2013 21:23:02 +0200
Hi David,
You wrote:
> > I believe the "while" loop added by this change is where make is now
> > looping.
>
> I'd guess that fd in non-blocking
Yes - I believe you made it non-blocking as part of job.c revision
1.122, where the commit message says "Common up the code that creates
all the pipes - making them all non-block on the read side in the
process".
> so EAGAIN would just mean that the pipe is empty.
Yes.
> The purpose of the read() is to empty the pipe, If it is already empty
> it won't matter.
Agreed.
> I think you'll find the sigchild handler writes into it
Yes, I know - I wrote that code back in 2002.
> Most likely two children finishing at once only write 1 character,
I don't think that can trigger the bug. The read() happens right
after a poll(), and regardless of how many characters are in the pipe,
if the poll indicates the fd is readable, it should be readable.
> or some path comes in with nready < 0 ...
That seems more likely. There is actually only one "path in" - nready
is set by the poll on the previous line
nready = poll(fds + 1 - wantToken, nfds - 1 + wantToken, POLL_MSEC);
Can you please explain the reasoning behind the following change you
made in job.c 1.122 - why is a failing poll treated as if the
childExitJob fd is ready?
- nready = poll(fds + 1 - wantToken, nfds + 1 - wantToken, POLL_MSEC);
- if (nready <= 0)
- return;
+ nready = poll(fds + 1 - wantToken, nfds - 1 + wantToken, POLL_MSEC);
- if (readyfd(&childExitJob)) {
+ if (nready < 0 || readyfd(&childExitJob)) {
In any case, christos' "fix" of looping five times can't possibly be
correct. The code worked before the loop was added, and should be
fixed by removing the loop, not by looping some magic number of times.
--
Andreas Gustafsson, gson@gson.org
From: David Laight <david@l8s.co.uk>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Mon, 4 Feb 2013 21:37:07 +0000
On Mon, Feb 04, 2013 at 09:23:02PM +0200, Andreas Gustafsson wrote:
>
> Can you please explain the reasoning behind the following change you
> made in job.c 1.122 - why is a failing poll treated as if the
> childExitJob fd is ready?
>
> - nready = poll(fds + 1 - wantToken, nfds + 1 - wantToken, POLL_MSEC);
> - if (nready <= 0)
> - return;
> + nready = poll(fds + 1 - wantToken, nfds - 1 + wantToken, POLL_MSEC);
>
> - if (readyfd(&childExitJob)) {
> + if (nready < 0 || readyfd(&childExitJob)) {
I moved the Job_CatchChildren() from being called unconditionally
everytime around the loop, to be (usually) only called when a
token was read from the pipe.
But I didn't want to break the code if it was actually relying on the
poll timeout (5 seconds) to collect children.
> In any case, christos' "fix" of looping five times can't possibly be
> correct. The code worked before the loop was added, and should be
> fixed by removing the loop, not by looping some magic number of times.
I hadn't seen xtos's 'fix' - but it is broken!
The code could try to read more bytes - would save system calls.
David
--
David Laight: david@l8s.co.uk
From: christos@zoulas.com (Christos Zoulas)
To: Andreas Gustafsson <gson@gson.org>, dsl@NetBSD.org
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Mon, 4 Feb 2013 18:06:02 -0500
On Feb 4, 9:23pm, gson@gson.org (Andreas Gustafsson) wrote:
-- Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
| Hi David,
|
| You wrote:
| > > I believe the "while" loop added by this change is where make is now
| > > looping.
| >
| > I'd guess that fd in non-blocking
|
| Yes - I believe you made it non-blocking as part of job.c revision
| 1.122, where the commit message says "Common up the code that creates
| all the pipes - making them all non-block on the read side in the
| process".
|
| > so EAGAIN would just mean that the pipe is empty.
|
| Yes.
|
| > The purpose of the read() is to empty the pipe, If it is already empty
| > it won't matter.
|
| Agreed.
|
| > I think you'll find the sigchild handler writes into it
|
| Yes, I know - I wrote that code back in 2002.
|
| > Most likely two children finishing at once only write 1 character,
|
| I don't think that can trigger the bug. The read() happens right
| after a poll(), and regardless of how many characters are in the pipe,
| if the poll indicates the fd is readable, it should be readable.
|
| > or some path comes in with nready < 0 ...
|
| That seems more likely. There is actually only one "path in" - nready
| is set by the poll on the previous line
|
| nready = poll(fds + 1 - wantToken, nfds - 1 + wantToken, POLL_MSEC);
|
| Can you please explain the reasoning behind the following change you
| made in job.c 1.122 - why is a failing poll treated as if the
| childExitJob fd is ready?
|
| - nready = poll(fds + 1 - wantToken, nfds + 1 - wantToken, POLL_MSEC);
| - if (nready <= 0)
| - return;
| + nready = poll(fds + 1 - wantToken, nfds - 1 + wantToken, POLL_MSEC);
|
| - if (readyfd(&childExitJob)) {
| + if (nready < 0 || readyfd(&childExitJob)) {
|
| In any case, christos' "fix" of looping five times can't possibly be
| correct. The code worked before the loop was added, and should be
| fixed by removing the loop, not by looping some magic number of times.
I agree with that (magic loop count), but poll returning < 0 should not
attempt a read. the the sentinel can be removed from the loop.
christos
From: Andreas Gustafsson <gson@gson.org>
To: David Laight <david@l8s.co.uk>
Cc: christos@NetBSD.org,
Andreas Gustafsson <gson@gson.org>,
gnats-bugs@NetBSD.org
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Tue, 5 Feb 2013 09:56:49 +0200
David Laight wrote:
> I moved the Job_CatchChildren() from being called unconditionally
> everytime around the loop, to be (usually) only called when a
> token was read from the pipe.
>
> But I didn't want to break the code if it was actually relying on the
> poll timeout (5 seconds) to collect children.
But in the timeout case, poll() will return 0, not < 0, therefore
Job_CatchChildren() will in fact *not* be called when there is a
timeout. It will, however, be called if the poll() system call
fails. Should the "nready < 0" have been "nready == 0"?
--
Andreas Gustafsson, gson@gson.org
From: David Laight <david@l8s.co.uk>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Tue, 5 Feb 2013 20:55:41 +0000
On Tue, Feb 05, 2013 at 08:00:13AM +0000, Andreas Gustafsson wrote:
> The following reply was made to PR bin/47524; it has been noted by GNATS.
>
> David Laight wrote:
> > I moved the Job_CatchChildren() from being called unconditionally
> > everytime around the loop, to be (usually) only called when a
> > token was read from the pipe.
> >
> > But I didn't want to break the code if it was actually relying on the
> > poll timeout (5 seconds) to collect children.
>
> But in the timeout case, poll() will return 0, not < 0, therefore
> Job_CatchChildren() will in fact *not* be called when there is a
> timeout. It will, however, be called if the poll() system call
> fails. Should the "nready < 0" have been "nready == 0"?
Quite possibly, I think I was just reducing the number of times
Job_CatchChildren() would be called unnecessarily.
The only time this all goes awry is when the number of jobs gets near
to the number of bytes that can be written into a pipe.
The 'job' pipe will just lose credit, you'd have to be very unluky to
lose anywhere else.
David
--
David Laight: david@l8s.co.uk
From: Andreas Gustafsson <gson@gson.org>
To: dsl@NetBSD.org
Cc: gnats-bugs@NetBSD.org,
christos@NetBSD.org
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Tue, 5 Feb 2013 23:46:07 +0200
David Laight wrote:
> > But in the timeout case, poll() will return 0, not < 0, therefore
> > Job_CatchChildren() will in fact *not* be called when there is a
> > timeout. It will, however, be called if the poll() system call
> > fails. Should the "nready < 0" have been "nready == 0"?
>
> Quite possibly, I think I was just reducing the number of times
> Job_CatchChildren() would be called unnecessarily.
>
> The only time this all goes awry is when the number of jobs gets near
> to the number of bytes that can be written into a pipe.
> The 'job' pipe will just lose credit, you'd have to be very unluky to
> lose anywhere else.
I'm afraid I can't share your confidence in that being the only
remaining bug. I also think that bug is unrelated to the problem
at hand.
I think I now know how the infinite loop happened: poll() was
interrupted by a signal (probably a SIGCHILD) and returned nready= -1
with errno=EINTR. As a result of your addition of the "nready < 0" in
"if (nready < 0 || readyfd(&childExitJob))", a read() was then
attempted on the childExitJob pipe even though the select had not
indicated it was ready. Since the pipe was not ready, the read()
failed with EAGAIN, which used to be harmlessly ignored until Christos
turned it into an infinite loop.
I propose the following change, where the return values from both the
poll() and the read() are actually checked in the sense of reporting
unexpected errors instead of just looping on EAGAIN and silently
ignoring all other errors. I have successfully tested it with one
native -j4 build and one -j12 cross-build from a Linux host.
The other loops Christos added should probably also be replaced
by something similar.
Index: job.c
===================================================================
RCS file: /cvsroot/src/usr.bin/make/job.c,v
retrieving revision 1.168
diff -u -r1.168 job.c
--- job.c 2 Feb 2013 15:24:08 -0000 1.168
+++ job.c 5 Feb 2013 08:26:03 -0000
@@ -139,6 +139,7 @@
#include <sys/time.h>
#include <sys/wait.h>
+#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#ifndef USE_SELECT
@@ -2040,22 +2041,34 @@
{
int nready;
Job *job;
- int i, n;
+ int i;
(void)fflush(stdout);
/* The first fd in the list is the job token pipe */
- nready = poll(fds + 1 - wantToken, nfds - 1 + wantToken, POLL_MSEC);
-
- if (nready < 0 || readyfd(&childExitJob)) {
- char token = 0;
- nready -= 1;
- for (n = 0; n < 5 &&
- read(childExitJob.inPipe, &token, 1) == -1 && errno == EAGAIN; n++)
- continue;
- if (token == DO_JOB_RESUME[0])
- /* Complete relay requested from our SIGCONT handler */
- JobRestartJobs();
+ do {
+ nready = poll(fds + 1 - wantToken, nfds - 1 + wantToken, POLL_MSEC);
+ } while (nready < 0 && errno == EINTR);
+ if (nready < 0)
+ Punt("poll: %s", strerror(errno));
+
+ if (nready == 0 || readyfd(&childExitJob)) {
+ if (nready > 0) {
+ char token = 0;
+ int count;
+ nready -= 1;
+ count = read(childExitJob.inPipe, &token, 1);
+ if (count < 0) {
+ Punt("token pipe read: %s", strerror(errno));
+ } else if (count == 0) {
+ Punt("unexpected eof on token pipe");
+ } else { /* count > 0 */
+ assert(count == 1);
+ if (token == DO_JOB_RESUME[0])
+ /* Complete relay requested from our SIGCONT handler */
+ JobRestartJobs();
+ }
+ }
Job_CatchChildren();
}
--
Andreas Gustafsson, gson@gson.org
From: christos@zoulas.com (Christos Zoulas)
To: Andreas Gustafsson <gson@gson.org>, dsl@NetBSD.org
Cc: gnats-bugs@NetBSD.org
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Tue, 5 Feb 2013 17:05:18 -0500
On Feb 5, 11:46pm, gson@gson.org (Andreas Gustafsson) wrote:
-- Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
| I think I now know how the infinite loop happened: poll() was
| interrupted by a signal (probably a SIGCHILD) and returned nready= -1
| with errno=EINTR. As a result of your addition of the "nready < 0" in
| "if (nready < 0 || readyfd(&childExitJob))", a read() was then
| attempted on the childExitJob pipe even though the select had not
| indicated it was ready. Since the pipe was not ready, the read()
| failed with EAGAIN, which used to be harmlessly ignored until Christos
| turned it into an infinite loop.
|
| I propose the following change, where the return values from both the
| poll() and the read() are actually checked in the sense of reporting
| unexpected errors instead of just looping on EAGAIN and silently
| ignoring all other errors. I have successfully tested it with one
| native -j4 build and one -j12 cross-build from a Linux host.
|
| The other loops Christos added should probably also be replaced
| by something similar.
|
| Index: job.c
The whole logic is fishy. Why is nready decremented when it is not
used anymore? Why treat some descriptors specially and not deal
with them in the general way?
christos
http://www.netbsd.org/~christos/job.c.diff
From: David Laight <david@l8s.co.uk>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Tue, 5 Feb 2013 22:34:17 +0000
On Tue, Feb 05, 2013 at 11:46:07PM +0200, Andreas Gustafsson wrote:
> + do {
> + nready = poll(fds + 1 - wantToken, nfds - 1 + wantToken, POLL_MSEC);
> + } while (nready < 0 && errno == EINTR);
> + if (nready < 0)
> + Punt("poll: %s", strerror(errno));
> +
> + if (nready == 0 || readyfd(&childExitJob)) {
> + if (nready > 0) {
> + char token = 0;
> + int count;
> + nready -= 1;
> + count = read(childExitJob.inPipe, &token, 1);
> + if (count < 0) {
> + Punt("token pipe read: %s", strerror(errno));
> + } else if (count == 0) {
> + Punt("unexpected eof on token pipe");
> + } else { /* count > 0 */
> + assert(count == 1);
> + if (token == DO_JOB_RESUME[0])
> + /* Complete relay requested from our SIGCONT handler */
> + JobRestartJobs();
> + }
> + }
That is a riduculous number of tests, you can't see what the code
is doing any more.
Indeed, you now error out if the read(childExitJob.inPipe, &token, 1)
manages to fail to return some data. I wouldn't like to guarantee
that all systems don't have some point where such a read can get
interrupted (even though non-blocking) before committing to copy
a byte to userspace.
The 'token = 0' assignent is enough that it doesn't matter whether
the read fails of not.
Apart from issues with poll() actually returning a read error - so
it doesn't block next time around either, it really doesn't matter
why the code woke up.
I might even have decided that if the poll() call was interrupted
we might as well act as it the childExitJob pipe had a character in it.
David
--
David Laight: david@l8s.co.uk
From: Andreas Gustafsson <gson@gson.org>
To: christos@zoulas.com (Christos Zoulas)
Cc: dsl@NetBSD.org,
gnats-bugs@NetBSD.org
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Wed, 6 Feb 2013 10:31:21 +0200
Christos Zoulas wrote:
> | Index: job.c
>
> The whole logic is fishy. Why is nready decremented when it is not
> used anymore?
I believe this is vestigial; nready was used after the decrement in
revision 1.121, for the purpose you allude to in your comment
"/* Why not use nready here to cut things short? */", but dsl
removed that shortcut in 1.122, without removing the decrement.
Since this is unrealated to the bug at hand, I felt it was best not to
clutter my patch with a fix for this but to leave it for a separate
commit.
> Why treat some descriptors specially and not deal with them in the
> general way?
If you mean why handle some in the "if (nready > 0) {...}" and others
after "if (nready <= 0) return;", I agree that it would be cleaner to
handle them all the same place. In 1.121, they still were. Again,
with my patch, I was not trying to overhaul the entire function, but
to address the bug at hand with a local change.
--
Andreas Gustafsson, gson@gson.org
From: Andreas Gustafsson <gson@gson.org>
To: dsl@NetBSD.org
Cc: gnats-bugs@NetBSD.org, christos@NetBSD.org
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Wed, 6 Feb 2013 11:40:22 +0200
David Laight wrote:
> That is a riduculous number of tests, you can't see what the code
> is doing any more.
>
> Indeed, you now error out if the read(childExitJob.inPipe, &token, 1)
> manages to fail to return some data. I wouldn't like to guarantee
> that all systems don't have some point where such a read can get
> interrupted (even though non-blocking) before committing to copy
> a byte to userspace.
That is not supposed to happen since bmake_signal() sets SA_RESTART.
With the changes in my patch, at least we will get a diagnostic if
SA_RESTART turns out to be ineffective on some platform, and can then
add the necessary retry-on-EINTR loops around any *other* read/write
calls where this situation actually matters.
> The 'token = 0' assignent is enough that it doesn't matter whether
> the read fails of not.
>
> Apart from issues with poll() actually returning a read error - so
> it doesn't block next time around either, it really doesn't matter
> why the code woke up.
>
> I might even have decided that if the poll() call was interrupted
> we might as well act as it the childExitJob pipe had a character in it.
Or in other words, the code was working before revision 1.165; I concur.
Reverting job.c to version 1.164 would be acceptable, preferably with
some comments added to explain the above reasoning.
If the purpose of 1.165 was specifically to suppress a compiler
warning on Linux, surely there must be some way of doing that without
changing the generated object code. On the other hand, if what we
want is to change the behavior of the code to actually check the
return value instead of just suppressing a warning, then I think the
complexity of my "ridiculous number of tests" is pretty much
unavoidable.
--
Andreas Gustafsson, gson@gson.org
From: christos@zoulas.com (Christos Zoulas)
To: Andreas Gustafsson <gson@gson.org>, dsl@NetBSD.org
Cc: gnats-bugs@NetBSD.org
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Wed, 6 Feb 2013 11:02:21 -0500
On Feb 6, 11:40am, gson@gson.org (Andreas Gustafsson) wrote:
-- Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
| If the purpose of 1.165 was specifically to suppress a compiler
| warning on Linux, surely there must be some way of doing that without
| changing the generated object code. On the other hand, if what we
| want is to change the behavior of the code to actually check the
| return value instead of just suppressing a warning, then I think the
| complexity of my "ridiculous number of tests" is pretty much
| unavoidable.
Irrespective for the reason of the change the current code is more complicated
than it needs to be and incorrect. It just happens to work.
christos
State-Changed-From-To: open->feedback
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Thu, 06 Jul 2023 02:32:19 +0000
State-Changed-Why:
Something resembling the requested change was committed in 1.169
approximately a decade ago. Has the problem persisted since then?
State-Changed-From-To: feedback->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Thu, 06 Jul 2023 09:40:49 +0000
State-Changed-Why:
The changes Christos committed in job.c 1.169 (without referencing the
PR) were indeed quite similar to what I proposed, and in the decade
since, the TNF testbed has done more than 70,000 builds without the
problem happening again as far as I know.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.47 2022/09/11 19:34:41 kim Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2023
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.