NetBSD Problem Report #47524

From gson@gson.org  Sat Feb  2 14:10:45 2013
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	by www.NetBSD.org (Postfix) with ESMTP id DBBEE63D7B3
	for <gnats-bugs@gnats.NetBSD.org>; Sat,  2 Feb 2013 14:10:44 +0000 (UTC)
Message-Id: <20130202141040.27D8775EC5@guava.gson.org>
Date: Sat,  2 Feb 2013 16:10:39 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@gnats.NetBSD.org
Subject: make(1) loops in Job_CatchOutput()
X-Send-Pr-Version: 3.95

>Number:         47524
>Category:       bin
>Synopsis:       make(1) loops in Job_CatchOutput()
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    bin-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Feb 02 14:15:01 +0000 2013
>Closed-Date:    Thu Jul 06 09:40:49 +0000 2023
>Last-Modified:  Thu Jul 06 09:40:49 +0000 2023
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current source date 2013.01.29.01.05.57
>Organization:
>Environment:
System: NetBSD 6.0 amd64 (build host)
Architecture: amd64
Machine: amd64
>Description:

I have been running automated builds of NetBSD/i386 several times per
day for a number of years now.  The current build host is a quad-core
NetBSD/amd64 6.0 system, building with -j 4.

Recently, one of these builds hung with two make(1) processes stuck in
endless loops.  They have now used more than 24 hours of CPU each and
are still running as I write this.

Running ktruss against the looping processes shows them making
repeated calls to read() that always returned EAGAIN.  Attaching
with gdb shows them looping in Job_CatchOutput() in job.c.

Since this is the first time I have seen this particular failure in
many thousands of builds, I suspect it was caused by a recent change.
On Jan 26, usr.bin/make/job.c 1.65 was committed, containing the
following change to Job_CatchOutput() (among other changes):

@@ -2046,7 +2050,8 @@
     if (nready < 0 || readyfd(&childExitJob)) {
 	char token = 0;
 	nready -= 1;
-	(void)read(childExitJob.inPipe, &token, 1);
+	while (read(childExitJob.inPipe, &token, 1) == -1 && errno == EAGAIN)
+		continue;
 	if (token == DO_JOB_RESUME[0])
 	    /* Complete relay requested from our SIGCONT handler */
 	    JobRestartJobs();

I believe the "while" loop added by this change is where make is now
looping.

Some highlights from a transcript of the debugging session follow.

   $ ps -glaxw|grep make
   101 10825 22842     0  85  0 10804  1176 pipe_rd S+   ttyp0     0:00.00 grep make 
   101 26878 22842     0  43  0 30680  7528 -       T    ttyp0     0:06.34 gdb /tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake 
   101  2312  7193 34745  17 10  7704  2268 -       ON   pts/3  1648:33.01 /tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake _THISDIR_ DISTRIBUTION_DONE release 
   101 20821  9556     0  80 10 94744 78508 pipe_wr IN   pts/3     0:01.71 /tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake 
   101 26400 28328 34745  17 10  7704  1976 -       ON+  pts/3  1650:08.03 /tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake -j 4 release 

   $ top
   load averages:  1.99,  1.97,  1.92;               up 27+19:38:26      13:31:14
   77 processes: 66 sleeping, 3 stopped, 4 zombie, 4 on CPU
   CPU0 states:  0.0% user,  0.0% nice,  2.2% system, 28.9% interrupt, 68.8% idle
   CPU1 states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
   CPU2 states:  0.0% user, 23.2% nice, 76.8% system,  0.0% interrupt,  0.0% idle
   CPU3 states:  0.0% user, 22.4% nice, 77.6% system,  0.0% interrupt,  0.0% idle
   Memory: 8702M Act, 4272M Inact, 48K Wired, 27M Exec, 8290M File, 81M Free
   Swap: 16G Total, 1917M Used, 14G Free

     PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
   26400 gson      15   10  7704K 1976K CPU/2     27.7H 99.02% 99.02% nbmake
    2312 gson      15   10  7704K 2268K CPU/3     27.7H 99.02% 99.02% nbmake
   [...]

   $ ktruss -p 26400
    26400      1 nbmake   read(0x9, 0x7f7fffffccbf, 0x1) Err#35 EAGAIN
    26400      1 nbmake   read(0x9, 0x7f7fffffccbf, 0x1) Err#35 EAGAIN
    26400      1 nbmake   read(0x9, 0x7f7fffffccbf, 0x1) Err#35 EAGAIN
   [...]

   $ proctree
   [...]
    | |           \-+- 19473 gson time ./build.sh -j 4 -D /tmp/bracket/build/2013.01.29.01.05.57-i386/destdir -R /bracket/i386/test/2013.01.29.01.05.57/release -T /tmp/bracket/build/2013.01.29.01.05.57-i386/tools -O /tmp/bracket/build/2013.01.29.01.05.57-i386/obj -m i386 -V TMPDIR=/tmp -U release iso-image 
    | |             \-+- 28328 gson sh ./build.sh -j 4 -D /tmp/bracket/build/2013.01.29.01.05.57-i386/destdir -R /bracket/i386/test/2013.01.29.01.05.57/release -T /tmp/bracket/build/2013.01.29.01.05.57-i386/tools -O /tmp/bracket/build/2013.01.29.01.05.57-i386/obj -m i386 -V TMPDIR=/tmp -U release iso-image 
    | |               \-+- 26400 gson /tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake -j 4 release 
    | |                 \-+= 07193 gson sh 
    | |                   \-+- 02312 gson /tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake _THISDIR_ DISTRIBUTION_DONE release 
    | |                     \-+= 09556 gson sh 
    | |                       \-+- 20821 gson /tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake 
    | |                         |--= 12759 gson (sh)
    | |                         |--= 21395 gson (sh)
    | |                         |--= 21444 gson (sh)
    | |                         \--= 29829 gson (sh)
   [...]

   $ gdb /tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake
   GNU gdb (GDB) 7.3.1
   Copyright (C) 2011 Free Software Foundation, Inc.
   License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
   This is free software: you are free to change and redistribute it.
   There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
   and "show warranty" for details.
   This GDB was configured as "x86_64--netbsd".
   For bug reporting instructions, please see:
   <http://www.gnu.org/software/gdb/bugs/>...
   Reading symbols from /tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake...(no debugging symbols found)...done.
   (gdb) attach 2312
   Attaching to program: /tmp/bracket/build/2013.01.29.01.05.57-i386/tools/bin/nbmake, process 2312
   Reading symbols from /usr/lib/libc.so.12...(no debugging symbols found)...done.
   Loaded symbols for /usr/lib/libc.so.12
   Reading symbols from /usr/libexec/ld.elf_so...(no debugging symbols found)...done.
   Loaded symbols for /usr/libexec/ld.elf_so
   0x00007f7ff783912a in read () from /usr/lib/libc.so.12
   (gdb) where
   #0  0x00007f7ff783912a in read () from /usr/lib/libc.so.12
   #1  0x000000000040afad in Job_CatchOutput ()
   #2  0x000000000040ff2c in Make_Run ()
   #3  0x000000000040e729 in main ()

>How-To-Repeat:

Do lots of builds until one of them hangs.

Alternatively, since the processes are still running, I may be able to
perform additional diagnostics on request, if asked quickly enough.

>Fix:

>Release-Note:

>Audit-Trail:
From: David Laight <david@l8s.co.uk>
To: gnats-bugs@NetBSD.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Sat, 2 Feb 2013 19:10:56 +0000

 On Sat, Feb 02, 2013 at 02:15:01PM +0000, Andreas Gustafsson wrote:
 > >Number:         47524
 > >Category:       bin
 > >Synopsis:       make(1) loops in Job_CatchOutput()
 ...
 > @@ -2046,7 +2050,8 @@
 >      if (nready < 0 || readyfd(&childExitJob)) {
 >  	char token = 0;
 >  	nready -= 1;
 > -	(void)read(childExitJob.inPipe, &token, 1);
 > +	while (read(childExitJob.inPipe, &token, 1) == -1 && errno == EAGAIN)
 > +		continue;
 >  	if (token == DO_JOB_RESUME[0])
 >  	    /* Complete relay requested from our SIGCONT handler */
 >  	    JobRestartJobs();
 > 
 > I believe the "while" loop added by this change is where make is now
 > looping.

 I'd guess that fd in non-blocking so EAGAIN would just mean that the
 pipe is empty.
 The purpose of the read() is to empty the pipe, If it is already empty
 it won't matter.
 I think you'll find the sigchild handler writes into it, possibly
 with some check of nready (not looked) to wake things up.

 Most likely two children finishing at once only write 1 character,
 or some path comes in with nready < 0 ...

 	David

 -- 
 David Laight: david@l8s.co.uk

From: Andreas Gustafsson <gson@gson.org>
To: dsl@NetBSD.org
Cc: christos@NetBSD.org,
    gnats-bugs@NetBSD.org,
    gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org,
    gson@gson.org (Andreas Gustafsson)
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Mon, 4 Feb 2013 21:23:02 +0200

 Hi David,

 You wrote:
 >  > I believe the "while" loop added by this change is where make is now
 >  > looping.
 >  
 >  I'd guess that fd in non-blocking

 Yes - I believe you made it non-blocking as part of job.c revision
 1.122, where the commit message says "Common up the code that creates
 all the pipes - making them all non-block on the read side in the
 process".

 > so EAGAIN would just mean that the pipe is empty.

 Yes.

 >  The purpose of the read() is to empty the pipe, If it is already empty
 >  it won't matter.

 Agreed.

 >  I think you'll find the sigchild handler writes into it

 Yes, I know - I wrote that code back in 2002.

 >  Most likely two children finishing at once only write 1 character,

 I don't think that can trigger the bug.  The read() happens right
 after a poll(), and regardless of how many characters are in the pipe,
 if the poll indicates the fd is readable, it should be readable.

 >  or some path comes in with nready < 0 ...

 That seems more likely.  There is actually only one "path in" - nready
 is set by the poll on the previous line

    nready = poll(fds + 1 - wantToken, nfds - 1 + wantToken, POLL_MSEC);

 Can you please explain the reasoning behind the following change you
 made in job.c 1.122 - why is a failing poll treated as if the
 childExitJob fd is ready?

 -    nready = poll(fds + 1 - wantToken, nfds + 1 - wantToken, POLL_MSEC);
 -    if (nready <= 0)
 -       return;
 +    nready = poll(fds + 1 - wantToken, nfds - 1 + wantToken, POLL_MSEC);

 -    if (readyfd(&childExitJob)) {
 +    if (nready < 0 || readyfd(&childExitJob)) {

 In any case, christos' "fix" of looping five times can't possibly be
 correct.  The code worked before the loop was added, and should be
 fixed by removing the loop, not by looping some magic number of times.
 -- 
 Andreas Gustafsson, gson@gson.org

From: David Laight <david@l8s.co.uk>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Mon, 4 Feb 2013 21:37:07 +0000

 On Mon, Feb 04, 2013 at 09:23:02PM +0200, Andreas Gustafsson wrote:
 > 
 > Can you please explain the reasoning behind the following change you
 > made in job.c 1.122 - why is a failing poll treated as if the
 > childExitJob fd is ready?
 > 
 > -    nready = poll(fds + 1 - wantToken, nfds + 1 - wantToken, POLL_MSEC);
 > -    if (nready <= 0)
 > -       return;
 > +    nready = poll(fds + 1 - wantToken, nfds - 1 + wantToken, POLL_MSEC);
 > 
 > -    if (readyfd(&childExitJob)) {
 > +    if (nready < 0 || readyfd(&childExitJob)) {

 I moved the Job_CatchChildren() from being called unconditionally
 everytime around the loop, to be (usually) only called when a
 token was read from the pipe.

 But I didn't want to break the code if it was actually relying on the
 poll timeout (5 seconds) to collect children.

 > In any case, christos' "fix" of looping five times can't possibly be
 > correct.  The code worked before the loop was added, and should be
 > fixed by removing the loop, not by looping some magic number of times.

 I hadn't seen xtos's 'fix' - but it is broken!
 The code could try to read more bytes - would save system calls.

 	David

 -- 
 David Laight: david@l8s.co.uk

From: christos@zoulas.com (Christos Zoulas)
To: Andreas Gustafsson <gson@gson.org>, dsl@NetBSD.org
Cc: gnats-bugs@NetBSD.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Mon, 4 Feb 2013 18:06:02 -0500

 On Feb 4,  9:23pm, gson@gson.org (Andreas Gustafsson) wrote:
 -- Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()

 | Hi David,
 | 
 | You wrote:
 | >  > I believe the "while" loop added by this change is where make is now
 | >  > looping.
 | >  
 | >  I'd guess that fd in non-blocking
 | 
 | Yes - I believe you made it non-blocking as part of job.c revision
 | 1.122, where the commit message says "Common up the code that creates
 | all the pipes - making them all non-block on the read side in the
 | process".
 | 
 | > so EAGAIN would just mean that the pipe is empty.
 | 
 | Yes.
 | 
 | >  The purpose of the read() is to empty the pipe, If it is already empty
 | >  it won't matter.
 | 
 | Agreed.
 | 
 | >  I think you'll find the sigchild handler writes into it
 | 
 | Yes, I know - I wrote that code back in 2002.
 | 
 | >  Most likely two children finishing at once only write 1 character,
 | 
 | I don't think that can trigger the bug.  The read() happens right
 | after a poll(), and regardless of how many characters are in the pipe,
 | if the poll indicates the fd is readable, it should be readable.
 | 
 | >  or some path comes in with nready < 0 ...
 | 
 | That seems more likely.  There is actually only one "path in" - nready
 | is set by the poll on the previous line
 | 
 |    nready = poll(fds + 1 - wantToken, nfds - 1 + wantToken, POLL_MSEC);
 | 
 | Can you please explain the reasoning behind the following change you
 | made in job.c 1.122 - why is a failing poll treated as if the
 | childExitJob fd is ready?
 | 
 | -    nready = poll(fds + 1 - wantToken, nfds + 1 - wantToken, POLL_MSEC);
 | -    if (nready <= 0)
 | -       return;
 | +    nready = poll(fds + 1 - wantToken, nfds - 1 + wantToken, POLL_MSEC);
 | 
 | -    if (readyfd(&childExitJob)) {
 | +    if (nready < 0 || readyfd(&childExitJob)) {
 | 
 | In any case, christos' "fix" of looping five times can't possibly be
 | correct.  The code worked before the loop was added, and should be
 | fixed by removing the loop, not by looping some magic number of times.

 I agree with that (magic loop count), but poll returning < 0 should not
 attempt a read.  the the sentinel can be removed from the loop.

 christos

From: Andreas Gustafsson <gson@gson.org>
To: David Laight <david@l8s.co.uk>
Cc: christos@NetBSD.org,
    Andreas Gustafsson <gson@gson.org>,
    gnats-bugs@NetBSD.org
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Tue, 5 Feb 2013 09:56:49 +0200

 David Laight wrote:
 > I moved the Job_CatchChildren() from being called unconditionally
 > everytime around the loop, to be (usually) only called when a
 > token was read from the pipe.
 > 
 > But I didn't want to break the code if it was actually relying on the
 > poll timeout (5 seconds) to collect children.

 But in the timeout case, poll() will return 0, not < 0, therefore
 Job_CatchChildren() will in fact *not* be called when there is a
 timeout.  It will, however, be called if the poll() system call
 fails.  Should the "nready < 0" have been "nready == 0"?
 -- 
 Andreas Gustafsson, gson@gson.org

From: David Laight <david@l8s.co.uk>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Tue, 5 Feb 2013 20:55:41 +0000

 On Tue, Feb 05, 2013 at 08:00:13AM +0000, Andreas Gustafsson wrote:
 > The following reply was made to PR bin/47524; it has been noted by GNATS.
 > 
 >  David Laight wrote:
 >  > I moved the Job_CatchChildren() from being called unconditionally
 >  > everytime around the loop, to be (usually) only called when a
 >  > token was read from the pipe.
 >  > 
 >  > But I didn't want to break the code if it was actually relying on the
 >  > poll timeout (5 seconds) to collect children.
 >  
 >  But in the timeout case, poll() will return 0, not < 0, therefore
 >  Job_CatchChildren() will in fact *not* be called when there is a
 >  timeout.  It will, however, be called if the poll() system call
 >  fails.  Should the "nready < 0" have been "nready == 0"?

 Quite possibly, I think I was just reducing the number of times
 Job_CatchChildren() would be called unnecessarily.

 The only time this all goes awry is when the number of jobs gets near
 to the number of bytes that can be written into a pipe.
 The 'job' pipe will just lose credit, you'd have to be very unluky to
 lose anywhere else.

 	David

 -- 
 David Laight: david@l8s.co.uk

From: Andreas Gustafsson <gson@gson.org>
To: dsl@NetBSD.org
Cc: gnats-bugs@NetBSD.org,
    christos@NetBSD.org
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Tue, 5 Feb 2013 23:46:07 +0200

 David Laight wrote:
 >  >  But in the timeout case, poll() will return 0, not < 0, therefore
 >  >  Job_CatchChildren() will in fact *not* be called when there is a
 >  >  timeout.  It will, however, be called if the poll() system call
 >  >  fails.  Should the "nready < 0" have been "nready == 0"?
 >  
 >  Quite possibly, I think I was just reducing the number of times
 >  Job_CatchChildren() would be called unnecessarily.
 >  
 >  The only time this all goes awry is when the number of jobs gets near
 >  to the number of bytes that can be written into a pipe.
 >  The 'job' pipe will just lose credit, you'd have to be very unluky to
 >  lose anywhere else.

 I'm afraid I can't share your confidence in that being the only
 remaining bug.  I also think that bug is unrelated to the problem
 at hand.

 I think I now know how the infinite loop happened: poll() was
 interrupted by a signal (probably a SIGCHILD) and returned nready= -1
 with errno=EINTR.  As a result of your addition of the "nready < 0" in
 "if (nready < 0 || readyfd(&childExitJob))", a read() was then
 attempted on the childExitJob pipe even though the select had not
 indicated it was ready.  Since the pipe was not ready, the read()
 failed with EAGAIN, which used to be harmlessly ignored until Christos
 turned it into an infinite loop.

 I propose the following change, where the return values from both the
 poll() and the read() are actually checked in the sense of reporting
 unexpected errors instead of just looping on EAGAIN and silently
 ignoring all other errors.  I have successfully tested it with one
 native -j4 build and one -j12 cross-build from a Linux host.

 The other loops Christos added should probably also be replaced
 by something similar.

 Index: job.c
 ===================================================================
 RCS file: /cvsroot/src/usr.bin/make/job.c,v
 retrieving revision 1.168
 diff -u -r1.168 job.c
 --- job.c	2 Feb 2013 15:24:08 -0000	1.168
 +++ job.c	5 Feb 2013 08:26:03 -0000
 @@ -139,6 +139,7 @@
  #include <sys/time.h>
  #include <sys/wait.h>

 +#include <assert.h>
  #include <errno.h>
  #include <fcntl.h>
  #ifndef USE_SELECT
 @@ -2040,22 +2041,34 @@
  {
      int nready;
      Job *job;
 -    int i, n;
 +    int i;

      (void)fflush(stdout);

      /* The first fd in the list is the job token pipe */
 -    nready = poll(fds + 1 - wantToken, nfds - 1 + wantToken, POLL_MSEC);
 -
 -    if (nready < 0 || readyfd(&childExitJob)) {
 -	char token = 0;
 -	nready -= 1;
 -	for (n = 0; n < 5 &&
 -	    read(childExitJob.inPipe, &token, 1) == -1 && errno == EAGAIN; n++)
 -		continue;
 -	if (token == DO_JOB_RESUME[0])
 -	    /* Complete relay requested from our SIGCONT handler */
 -	    JobRestartJobs();
 +    do {
 +	nready = poll(fds + 1 - wantToken, nfds - 1 + wantToken, POLL_MSEC);
 +    } while (nready < 0 && errno == EINTR);
 +    if (nready < 0)
 +	Punt("poll: %s", strerror(errno));
 +
 +    if (nready == 0 || readyfd(&childExitJob)) {
 +	if (nready > 0) {
 +	    char token = 0;
 +	    int count;
 +	    nready -= 1;
 +	    count = read(childExitJob.inPipe, &token, 1);
 +	    if (count < 0) {
 +		Punt("token pipe read: %s", strerror(errno));
 +	    } else if (count == 0) {
 +		Punt("unexpected eof on token pipe");
 +	    } else { /* count > 0 */
 +		assert(count == 1);
 +		 if (token == DO_JOB_RESUME[0])
 +		     /* Complete relay requested from our SIGCONT handler */
 +		     JobRestartJobs();
 +	    }
 +	}
  	Job_CatchChildren();
      }

 -- 
 Andreas Gustafsson, gson@gson.org

From: christos@zoulas.com (Christos Zoulas)
To: Andreas Gustafsson <gson@gson.org>, dsl@NetBSD.org
Cc: gnats-bugs@NetBSD.org
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Tue, 5 Feb 2013 17:05:18 -0500

 On Feb 5, 11:46pm, gson@gson.org (Andreas Gustafsson) wrote:
 -- Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()

 | I think I now know how the infinite loop happened: poll() was
 | interrupted by a signal (probably a SIGCHILD) and returned nready= -1
 | with errno=EINTR.  As a result of your addition of the "nready < 0" in
 | "if (nready < 0 || readyfd(&childExitJob))", a read() was then
 | attempted on the childExitJob pipe even though the select had not
 | indicated it was ready.  Since the pipe was not ready, the read()
 | failed with EAGAIN, which used to be harmlessly ignored until Christos
 | turned it into an infinite loop.
 | 
 | I propose the following change, where the return values from both the
 | poll() and the read() are actually checked in the sense of reporting
 | unexpected errors instead of just looping on EAGAIN and silently
 | ignoring all other errors.  I have successfully tested it with one
 | native -j4 build and one -j12 cross-build from a Linux host.
 | 
 | The other loops Christos added should probably also be replaced
 | by something similar.
 | 
 | Index: job.c

 The whole logic is fishy. Why is nready decremented when it is not
 used anymore? Why treat some descriptors specially and not deal
 with them in the general way?

 christos

 http://www.netbsd.org/~christos/job.c.diff

From: David Laight <david@l8s.co.uk>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Tue, 5 Feb 2013 22:34:17 +0000

 On Tue, Feb 05, 2013 at 11:46:07PM +0200, Andreas Gustafsson wrote:
 > +    do {
 > +	nready = poll(fds + 1 - wantToken, nfds - 1 + wantToken, POLL_MSEC);
 > +    } while (nready < 0 && errno == EINTR);
 > +    if (nready < 0)
 > +	Punt("poll: %s", strerror(errno));
 > +
 > +    if (nready == 0 || readyfd(&childExitJob)) {
 > +	if (nready > 0) {
 > +	    char token = 0;
 > +	    int count;
 > +	    nready -= 1;
 > +	    count = read(childExitJob.inPipe, &token, 1);
 > +	    if (count < 0) {
 > +		Punt("token pipe read: %s", strerror(errno));
 > +	    } else if (count == 0) {
 > +		Punt("unexpected eof on token pipe");
 > +	    } else { /* count > 0 */
 > +		assert(count == 1);
 > +		 if (token == DO_JOB_RESUME[0])
 > +		     /* Complete relay requested from our SIGCONT handler */
 > +		     JobRestartJobs();
 > +	    }
 > +	}

 That is a riduculous number of tests, you can't see what the code
 is doing any more.

 Indeed, you now error out if the read(childExitJob.inPipe, &token, 1)
 manages to fail to return some data. I wouldn't like to guarantee
 that all systems don't have some point where such a read can get
 interrupted (even though non-blocking) before committing to copy
 a byte to userspace.

 The 'token = 0' assignent is enough that it doesn't matter whether
 the read fails of not.

 Apart from issues with poll() actually returning a read error - so
 it doesn't block next time around either, it really doesn't matter
 why the code woke up.

 I might even have decided that if the poll() call was interrupted
 we might as well act as it the childExitJob pipe had a character in it.

 	David

 -- 
 David Laight: david@l8s.co.uk

From: Andreas Gustafsson <gson@gson.org>
To: christos@zoulas.com (Christos Zoulas)
Cc: dsl@NetBSD.org,
    gnats-bugs@NetBSD.org
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Wed, 6 Feb 2013 10:31:21 +0200

 Christos Zoulas wrote:
 > | Index: job.c
 > 
 > The whole logic is fishy. Why is nready decremented when it is not
 > used anymore?

 I believe this is vestigial; nready was used after the decrement in
 revision 1.121, for the purpose you allude to in your comment
 "/* Why not use nready here to cut things short? */", but dsl
 removed that shortcut in 1.122, without removing the decrement.

 Since this is unrealated to the bug at hand, I felt it was best not to
 clutter my patch with a fix for this but to leave it for a separate
 commit.

 > Why treat some descriptors specially and not deal with them in the
 > general way?

 If you mean why handle some in the "if (nready > 0) {...}" and others
 after "if (nready <= 0) return;", I agree that it would be cleaner to
 handle them all the same place.  In 1.121, they still were.  Again,
 with my patch, I was not trying to overhaul the entire function, but
 to address the bug at hand with a local change.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Andreas Gustafsson <gson@gson.org>
To: dsl@NetBSD.org
Cc: gnats-bugs@NetBSD.org, christos@NetBSD.org
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Wed, 6 Feb 2013 11:40:22 +0200

 David Laight wrote:
 >  That is a riduculous number of tests, you can't see what the code
 >  is doing any more.
 >  
 >  Indeed, you now error out if the read(childExitJob.inPipe, &token, 1)
 >  manages to fail to return some data. I wouldn't like to guarantee
 >  that all systems don't have some point where such a read can get
 >  interrupted (even though non-blocking) before committing to copy
 >  a byte to userspace.

 That is not supposed to happen since bmake_signal() sets SA_RESTART.
 With the changes in my patch, at least we will get a diagnostic if
 SA_RESTART turns out to be ineffective on some platform, and can then
 add the necessary retry-on-EINTR loops around any *other* read/write
 calls where this situation actually matters.

 >  The 'token = 0' assignent is enough that it doesn't matter whether
 >  the read fails of not.
 >
 >  Apart from issues with poll() actually returning a read error - so
 >  it doesn't block next time around either, it really doesn't matter
 >  why the code woke up.
 >  
 >  I might even have decided that if the poll() call was interrupted
 >  we might as well act as it the childExitJob pipe had a character in it.

 Or in other words, the code was working before revision 1.165; I concur.
 Reverting job.c to version 1.164 would be acceptable, preferably with
 some comments added to explain the above reasoning.

 If the purpose of 1.165 was specifically to suppress a compiler
 warning on Linux, surely there must be some way of doing that without
 changing the generated object code.  On the other hand, if what we
 want is to change the behavior of the code to actually check the
 return value instead of just suppressing a warning, then I think the
 complexity of my "ridiculous number of tests" is pretty much
 unavoidable.
 -- 
 Andreas Gustafsson, gson@gson.org

From: christos@zoulas.com (Christos Zoulas)
To: Andreas Gustafsson <gson@gson.org>, dsl@NetBSD.org
Cc: gnats-bugs@NetBSD.org
Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()
Date: Wed, 6 Feb 2013 11:02:21 -0500

 On Feb 6, 11:40am, gson@gson.org (Andreas Gustafsson) wrote:
 -- Subject: Re: bin/47524: make(1) loops in Job_CatchOutput()

 | If the purpose of 1.165 was specifically to suppress a compiler
 | warning on Linux, surely there must be some way of doing that without
 | changing the generated object code.  On the other hand, if what we
 | want is to change the behavior of the code to actually check the
 | return value instead of just suppressing a warning, then I think the
 | complexity of my "ridiculous number of tests" is pretty much
 | unavoidable.

 Irrespective for the reason of the change the current code is more complicated
 than it needs to be and incorrect. It just happens to work.

 christos

State-Changed-From-To: open->feedback
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Thu, 06 Jul 2023 02:32:19 +0000
State-Changed-Why:
Something resembling the requested change was committed in 1.169
approximately a decade ago.  Has the problem persisted since then?


State-Changed-From-To: feedback->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Thu, 06 Jul 2023 09:40:49 +0000
State-Changed-Why:
The changes Christos committed in job.c 1.169 (without referencing the
PR) were indeed quite similar to what I proposed, and in the decade
since, the TNF testbed has done more than 70,000 builds without the
problem happening again as far as I know.


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.47 2022/09/11 19:34:41 kim Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2023 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.