NetBSD Problem Report #54514

From www@netbsd.org  Fri Aug 30 19:39:40 2019
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id E5BDF7A153
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 30 Aug 2019 19:39:39 +0000 (UTC)
Message-Id: <20190830193938.CFF857A188@mollari.NetBSD.org>
Date: Fri, 30 Aug 2019 19:39:38 +0000 (UTC)
From: logix@foobar.franken.de
Reply-To: logix@foobar.franken.de
To: gnats-bugs@NetBSD.org
Subject: script(1) sometimes swallows last line(s) of output
X-Send-Pr-Version: www-1.0

>Number:         54514
>Category:       bin
>Synopsis:       script(1) sometimes swallows last line(s) of output
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    bin-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Aug 30 19:40:00 +0000 2019
>Closed-Date:    Tue May 09 20:58:28 +0000 2023
>Last-Modified:  Tue May 09 20:58:28 +0000 2023
>Originator:     Harold Gutch
>Release:        8.0
>Organization:
>Environment:
NetBSD localhost 8.0 NetBSD 8.0 (GENERIC) #0: Tue Jul 17 14:59:51 UTC 2018  mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
>Description:
script(1) can terminate before the last few lines of output are printed and dumped to the logfile.  The reason is that the script process handling output and writing to file might receive a SIGCHLD and terminate before finishing its job.
>How-To-Repeat:
This is non-deterministic, but in my tests (see "Fix") I got wrong output depending on the exact test script ~30%-70% of the time.


% cat > test.sh << EOF
? #!/bin/sh
? echo 1
? echo 2
? echo 3
? EOF
% chmod u+x test.sh
% ./test.sh
1
2
3
% script -c ./test.sh typescript
Script started, output file is typescript
1
Script done, output file is typescript
% cat typescript
Script started on Fri Aug 30 21:16:26 2019
1

Script done on Fri Aug 30 21:16:26 2019
%
>Fix:
Inspired by OpenBSD ( http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/script/script.c.diff?r1=1.26&r2=1.27 ):

-Index: script.c
===================================================================
RCS file: /usr/cvs/netbsd/src/usr.bin/script/script.c,v
retrieving revision 1.21
diff -u -p -r1.21 script.c
--- src/usr.bin/script/script.c	6 Sep 2011 18:29:56 -0000	1.21
+++ src/usr.bin/script/script.c	30 Aug 2019 21:22:48 -0000
@@ -166,7 +166,6 @@ main(int argc, char *argv[])
 	rtt.c_lflag &= ~ECHO;
 	(void)tcsetattr(STDIN_FILENO, TCSAFLUSH, &rtt);

-	(void)signal(SIGCHLD, finish);
 	child = fork();
 	if (child < 0) {
 		warn("fork");
@@ -183,6 +182,7 @@ main(int argc, char *argv[])
 		else
 			doshell(command);
 	}
+	(void)signal(SIGCHLD, finish);

 	if (!rawout)
 		(void)fclose(fscript);
@@ -226,6 +226,7 @@ dooutput(void)
 		(void)fprintf(fscript, "Script started on %s", ctime(&tvec));

 	(void)signal(SIGALRM, scriptflush);
+	(void)signal(SIGCHLD, SIG_IGN);
 	value.it_interval.tv_sec = SECSPERMIN / 2;
 	value.it_interval.tv_usec = 0;
 	value.it_value = value.it_interval;


Result:
1) without patch:
$ for i in `jot 100`; do script -c ./test.sh typescript.${i}; done
[...]
$ wc -l typescript.* | awk '/typescript/{print $1}' | sort | uniq -c
  71 4
  29 6

===> only in 29 out of 100 runs the output contained 6 lines (1 line header, 3 lines regular output, 1 new line, 1 line footer)

2) with patch:
$ for i in `jot 100`; do ./script -c ./test.sh typescript.${i}; done
[...]
$ wc -l typescript.* | awk '/typescript/{print $1}' | sort | uniq -c
 100 6

===> every run contained the expected 6 lines

>Release-Note:

>Audit-Trail:
From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/54514: script(1) sometimes swallows last line(s) of output
Date: Sat, 31 Aug 2019 08:58:18 +0700

 I don't believe that is the correct fix.   Your first change
 (moving the signal(SIGCHLD, finish) later) opens a possible
 (if unlikely) race if the child exits very quickly (before the
 signal is established) - which would result in the parent
 hanging until it reads EOF from stdin (which for your tests would
 be essentially forever - until manually killed).

 The correct way to handle this is to leave the initial signal() call
 catching SIGCHLD where it is currently, and insert
 	(void)signal(SIGCHLD, SIG_DFL);
 in the
 	if (child == 0) {
 case, before the subsequent (second) fork().   (An alternative would be
 to save the result of the initial signal call, and use that instead of
 SIG_DFL in the one to be added - but that should make no real difference.)

 Your second change is wrong (setting SIGCHLD to SIG_IGN is almost
 never the right thing to do), and in any case unnecessary, both
 because after your first change (or after the version of it I
 suggest above) when dooutput() is called the state of SIGCHLD will
 be SIG_DFL (which results in what I suspect you wanted to achieve
 by setting it to SIG_IGN) and also because dooutput() is called in the
 grandchild process, which never forks, so never has any children
 and so will never receive a SIGCHLD in any case.   This change can
 simply be deleted.

 Please try again with the changes as suggested, and let us know
 if my suggested variation for your patch fixes the problem.

 kre

 ps: this has most likely never been a problem, as script is almost
 always used interactively, and only exits when the user enters ^D
 to the terminal, which they typically don't do until all the output
 that is expected has been seen.   For the kind of usage in your test.
 the input is already known (it is in the test script) so isn't needed,
 to capture output simple shell redirection is enough.   I am not sure
 why anyone would ever normally run script as it is being run here.

From: Harold Gutch <logix@foobar.franken.de>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/54514: script(1) sometimes swallows last line(s) of output
Date: Sun, 1 Sep 2019 21:42:52 +0200

 Hi,

 On Sat, Aug 31, 2019 at 02:00:02AM +0000, Robert Elz wrote:
 > The following reply was made to PR bin/54514; it has been noted by GNATS.
 > 
 > From: Robert Elz <kre@munnari.OZ.AU>
 > To: gnats-bugs@netbsd.org
 > Cc: 
 > Subject: Re: bin/54514: script(1) sometimes swallows last line(s) of output
 > Date: Sat, 31 Aug 2019 08:58:18 +0700
 > 
 >  I don't believe that is the correct fix.   Your first change
 >  (moving the signal(SIGCHLD, finish) later) opens a possible
 >  (if unlikely) race if the child exits very quickly (before the
 >  signal is established) - which would result in the parent
 >  hanging until it reads EOF from stdin (which for your tests would
 >  be essentially forever - until manually killed).

 Good catch, you're right - thanks!


 >  The correct way to handle this is to leave the initial signal() call
 >  catching SIGCHLD where it is currently, and insert
 >  	(void)signal(SIGCHLD, SIG_DFL);
 >  in the
 >  	if (child == 0) {
 >  case, before the subsequent (second) fork().   (An alternative would be
 >  to save the result of the initial signal call, and use that instead of
 >  SIG_DFL in the one to be added - but that should make no real difference.)
 >  
 >  Your second change is wrong (setting SIGCHLD to SIG_IGN is almost
 >  never the right thing to do), and in any case unnecessary, both
 >  because after your first change (or after the version of it I
 >  suggest above) when dooutput() is called the state of SIGCHLD will
 >  be SIG_DFL (which results in what I suspect you wanted to achieve
 >  by setting it to SIG_IGN) and also because dooutput() is called in the
 >  grandchild process, which never forks, so never has any children
 >  and so will never receive a SIGCHLD in any case.   This change can
 >  simply be deleted.

 Thanks for pointing this out.  I wasn't aware of the difference
 between SIG_IGN and SIG_DFL for the SIGCHILD case.


 >  Please try again with the changes as suggested, and let us know
 >  if my suggested variation for your patch fixes the problem.

 Yes, your suggestion fixes the problem for me.

 --- src/usr.bin/script/script.c	2011-09-06 18:29:56.000000000 +0000
 +++ src/usr.bin/script/script.c	2019-08-31 12:41:26.368033644 +0000
 @@ -173,6 +173,7 @@
  		fail();
  	}
  	if (child == 0) {
 +		(void)signal(SIGCHLD, SIG_DFL);
  		subchild = child = fork();
  		if (child < 0) {
  			warn("fork");


 >  kre
 >  
 >  ps: this has most likely never been a problem, as script is almost
 >  always used interactively, and only exits when the user enters ^D
 >  to the terminal, which they typically don't do until all the output
 >  that is expected has been seen.   For the kind of usage in your test.
 >  the input is already known (it is in the test script) so isn't needed,
 >  to capture output simple shell redirection is enough.   I am not sure
 >  why anyone would ever normally run script as it is being run here.

 I actually use script(1) that way somewhat regularly if I want to log
 the output of a longer job.  The alternative would be to redirect stdout
 and stderr, open a second terminal and "tail -f" the log file (but
 very occasionally I also want timestamping via "script -t").  I find
 it simpler to run "script ./blah logfile".  And sure, I could do it in
 two steps, call script and then call whatever I want interactively.
 So yes, I really did stumble over this issue a while ago in a real
 situation.  And of course the test case is contrived and just
 illustrates the issue.


 Thanks for your input!
   Harold

From: Harold Gutch <logix@foobar.franken.de>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/54514: script(1) sometimes swallows last line(s) of output
Date: Tue, 9 May 2023 17:37:25 +0200

 On Sun, Sep 01, 2019 at 09:42:52PM +0200, Harold Gutch wrote:
 > On Sat, Aug 31, 2019 at 02:00:02AM +0000, Robert Elz wrote:
 > >  Please try again with the changes as suggested, and let us know
 > >  if my suggested variation for your patch fixes the problem.
 > 
 > Yes, your suggestion fixes the problem for me.
 > 
 > --- src/usr.bin/script/script.c	2011-09-06 18:29:56.000000000 +0000
 > +++ src/usr.bin/script/script.c	2019-08-31 12:41:26.368033644 +0000
 > @@ -173,6 +173,7 @@
 >  		fail();
 >  	}
 >  	if (child == 0) {
 > +		(void)signal(SIGCHLD, SIG_DFL);
 >  		subchild = child = fork();
 >  		if (child < 0) {
 >  			warn("fork");

 script.c has received a number of changes since, and in the meanwhile
 the symptomps of this have changed a bit and no longer swallow lines
 but instead duplicate the "done" line:

   $ script -c ./test.sh typescript
   Script started, output file is typescript
   1
   2
   3
   Script done, output file is typescript
   $ cat typescript
   Script started on Tue May  9 17:34:31 2023
   1
   2
   3

   Script done on Tue May  9 17:34:31 2023

   Script done on Tue May  9 17:34:31 2023

 resetting the SIGCHLD signal handler to SIG_DFL for the child process
 still fixes this.


   Harold

State-Changed-From-To: open->closed
State-Changed-By: hgutch@NetBSD.org
State-Changed-When: Tue, 09 May 2023 20:58:28 +0000
State-Changed-Why:
Fixed with rev. 1.34 of src/usr.bin/script/script.c


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.47 2022/09/11 19:34:41 kim Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2023 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.