NetBSD Problem Report #50046

From www@NetBSD.org  Sun Jul 12 21:55:42 2015
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 50CEAA5B2E
	for <gnats-bugs@gnats.NetBSD.org>; Sun, 12 Jul 2015 21:55:42 +0000 (UTC)
Message-Id: <20150712215540.F3988A6555@mollari.NetBSD.org>
Date: Sun, 12 Jul 2015 21:55:40 +0000 (UTC)
From: frank@phoenix.owl.de
Reply-To: frank@phoenix.owl.de
To: gnats-bugs@NetBSD.org
Subject: Slow rc multiuser boot in 7.0 and current
X-Send-Pr-Version: www-1.0

>Number:         50046
>Category:       misc
>Synopsis:       Slow rc multiuser boot in 7.0 and current
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    phx
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Jul 12 22:00:00 +0000 2015
>Closed-Date:    Wed Nov 11 11:03:10 +0000 2015
>Last-Modified:  Wed Nov 11 11:03:10 +0000 2015
>Originator:     Frank Wille
>Release:        netbsd-7
>Organization:
NetBSD
>Environment:
NetBSD/amiga 7.0_RC1

>Description:
Since the workaround for PR 48714 the multiuser boot in 7.0 and current has become very slow, which especially affects smaller single-CPU platforms.

It has been reported on port-amiga (A3000 and A4000/68060) by myself:
  http://mail-index.netbsd.org/port-amiga/2015/01/25/msg007942.html
On x68k (XM6i/68030) by Tetsuya Isaki:
  http://mail-index.netbsd.org/port-amiga/2015/02/01/msg007951.html
And on Sparc (via qemu) by Arto Huusko:
  http://mail-index.netbsd.org/netbsd-users/2015/07/12/msg016476.html

The above mentioned workaround introduces a background process in rc, which sends a "nop" every three seconds. Removing this passage from rc:

--- rc.orig     2015-01-16 22:17:31.000000000 +0100
+++ rc  2015-05-31 14:47:06.000000000 +0200
@@ -120,24 +120,6 @@
        kill -0 $RC_PID >/dev/null 2>&1 || RC_PID=$$

        #
-       # As long as process $RC_PID is still running, send a "nop"
-       # metadata message to the postprocessor every few seconds.
-       # This should help flush partial lines that may appear when
-       # rc.d scripts that are NOT marked with "KEYWORD: interactive"
-       # nevertheless attempt to print prompts and wait for input.
-       #
-       (
-           # First detach from tty, to avoid intercepting SIGINFO.
-           eval "exec ${_rc_original_stdout_fd}<&-"
-           eval "exec ${_rc_original_stderr_fd}<&-"
-           exec </dev/null >/dev/null 2>&1
-           while kill -0 $RC_PID ; do
-               print_rc_metadata "nop"
-               sleep 3
-           done
-       ) &

... makes booting noticable faster. In this case from 5 minutes to 4 minutes.
From /var/run/rc.log:

Original:
[/etc/rc starting at Sun May 31 13:27:54 CEST 2015]
...
[/etc/rc finished at Sun May 31 13:32:55 CEST 2015]

Without "nop":
[/etc/rc starting at Sun May 31 14:48:48 CEST 2015]
...
[/etc/rc finished at Sun May 31 14:52:59 CEST 2015]


There are probably other factors involved to make booting much slower
than under netbsd-6, but this one is unnecessary.

>How-To-Repeat:
Boot a slow single-CPU system into multiuser mode and compare the time it needs with netbsd-6.

>Fix:
Reverting the workaround for PR 48714 partly fixes it.

>Release-Note:

>Audit-Trail:
From: Arto Huusko <armihu@gmail.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: misc/50046
Date: Mon, 13 Jul 2015 13:43:07 +0300

 Reverting revision 1.94 of etc/rc.subr fixes the slow boot.
 The change causes a considerable amount of forking.
 On a test system the number of forks right after boot
 with rc.subr 1.94: 1080
 with rc.subr 1.94 reverted: 280

 Arto Huusko

From: christos@zoulas.com (Christos Zoulas)
To: gnats-bugs@NetBSD.org, misc-bug-people@netbsd.org, 
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, frank@phoenix.owl.de
Cc: 
Subject: Re: misc/50046
Date: Wed, 15 Jul 2015 10:34:31 -0400

 On Jul 13, 10:45am, armihu@gmail.com (Arto Huusko) wrote:
 -- Subject: Re: misc/50046

 | From: Arto Huusko <armihu@gmail.com>
 | To: gnats-bugs@netbsd.org
 | Cc: 
 | Subject: Re: misc/50046
 | Date: Mon, 13 Jul 2015 13:43:07 +0300
 | 
 |  Reverting revision 1.94 of etc/rc.subr fixes the slow boot.
 |  The change causes a considerable amount of forking.
 |  On a test system the number of forks right after boot
 |  with rc.subr 1.94: 1080
 |  with rc.subr 1.94 reverted: 280

 Can you please test the stdbuf change I posted on tech-userlevel (compile
 and re-install libc), together with adding:

 export STDBUF0=L
 export STDBUF1=L
 export STDBUF2=L

 in the beginning of rc and reverting rc and rc.subr to pre-pinger changes?

 Thanks,

 christos

From: Frank Wille <frank@phoenix.owl.de>
To: Christos Zoulas <christos@zoulas.com>,
	gnats-bugs@NetBSD.org,
	misc-bug-people@netbsd.org,
	gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org
Cc: 
Subject: Re: misc/50046
Date: Tue, 21 Jul 2015 17:01:47 +0200

 Christos Zoulas wrote:

 > Can you please test the stdbuf change I posted on tech-userlevel
 > (compile and re-install libc), together with adding:
 >
 > export STDBUF0=L
 > export STDBUF1=L
 > export STDBUF2=L
 >
 > in the beginning of rc and reverting rc and rc.subr to pre-pinger
 > changes?

 I did some test runs today on an Amiga 1200, 68030/40MHz with 64MB. The
 multiuser boot process takes here 8 minutes and 44 seconds, which was less
 than half the time with NetBSD 6.


 In the first tests I reverted /etc/rc to 1.167, but left the _rc_pid
 definition in line 87 in. Otherwise booting will hang (it is needed by
 rc.subr). Results ("new clib" includes your stdbuf changes):

 old clib: 7:00 min.
 new clib: 6:57 min.
 new clib with STDBUFn=L: 6:58 min.
 new clib with STDBUFn=U: 7:01 min.

 The stdbuf change doesn't make a big difference here. But removing the "nop"
 pinger code is already noticable.


 In the next tests I reverted /etc/rc to 1.167 and /etc/rc.subr to 1.93, as
 suggested by Arto Huusko:

 old clib: 3:44 min.
 new clib: 3:38 min.

 So I can confirm that reverting /etc/rc.subr is the key. Would be great to
 find a solution here.

 -- 
 Frank Wille

From: Alan Barrett <apb@cequrux.com>
To: gnats-bugs@NetBSD.org
Cc: misc-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org
Subject: Re: misc/50046: Slow rc multiuser boot in 7.0 and current
Date: Wed, 12 Aug 2015 10:24:27 +0400

 On Sun, 12 Jul 2015, frank@phoenix.owl.de wrote:
 >>Description:
 >Since the workaround for PR 48714 the multiuser boot in 7.0 and current has become very slow, which especially affects smaller single-CPU platforms.

 I suspect the tests in the _have_rc_postprocessor function in rc.subr.
 There are even comments saying that they are expensive.

 Please try the appended patch.

 --apb (Alan Barrett)

 Index: etc/rc.subr
 ===================================================================
 --- rc.subr	7 Oct 2014 19:09:45 -0000	1.96
 +++ rc.subr	12 Aug 2015 06:14:04 -0000
 @@ -814,19 +814,9 @@ $command $rc_flags $command_args"
  _have_rc_postprocessor()
  {
  	# Cheap tests that fd and pid are set, fd is writable.
 -	[ -n "${_rc_postprocessor_fd}" ] || return 1
 -	[ -n "${_rc_pid}" ] || return 1
 -	eval ": >&${_rc_postprocessor_fd}" 2>/dev/null || return 1
 -
 -	# More expensive test that pid is running.
 -	# Unset _rc_pid if this fails.
 -	kill -0 "${_rc_pid}" 2>/dev/null \
 -	|| { unset _rc_pid; return 1; }
 -
 -	# More expensive test that pid appears to be
 -	# a shell running an rc script.
 -	# Unset _rc_pid if this fails.
 -	expr "$(ps -p "${_rc_pid}" -o command=)" : ".*sh .*/rc.*" >/dev/null \
 +	[ -n "${_rc_pid}" ] || { unset _rc_pid; return 1; }
 +	[ -n "${_rc_postprocessor_fd}" ] || { unset _rc_pid; return 1; }
 +	eval ": >&${_rc_postprocessor_fd}" 2>/dev/null \
  	|| { unset _rc_pid; return 1; }

  	return 0

From: Frank Wille <frank@phoenix.owl.de>
To: gnats-bugs@NetBSD.org
Cc: Alan Barrett <apb@cequrux.com>, misc-bug-people@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: misc/50046: Slow rc multiuser boot in 7.0 and current
Date: Sat, 15 Aug 2015 13:00:08 +0200

 On Wed, 12 Aug 2015 08:00:01 +0000 (UTC)
 Alan Barrett <apb@cequrux.com> wrote:

 >  I suspect the tests in the _have_rc_postprocessor function in rc.subr.
 >  There are even comments saying that they are expensive.
 >  
 >  Please try the appended patch.

 I tested your patch with the same hardware as in my previous reply
 (Amiga 1200, 68030/40MHz, 64MB).

 Now (with the patch applied) the multiuser boot takes exactly 4:00 minutes.
 It was 8:44 minutes before.

 I think this is acceptable. The remaining 15 seconds are probably lost
 through the nop-pinger code in /etc/rc.

 -- 
 Frank Wille

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: misc/50046: Slow rc multiuser boot in 7.0 and current
Date: Sat, 5 Sep 2015 22:57:57 +0000

 On Wed, Aug 12, 2015 at 08:00:01AM +0000, Alan Barrett wrote:
  >  I suspect the tests in the _have_rc_postprocessor function in rc.subr.
  >  There are even comments saying that they are expensive.
  >  
  >  Please try the appended patch.

 Did this get committed?

 -- 
 David A. Holland
 dholland@netbsd.org

From: Frank Wille <frank@phoenix.owl.de>
To: gnats-bugs@NetBSD.org,
	misc-bug-people@netbsd.org,
	gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org
Cc: 
Subject: Re: misc/50046: Slow rc multiuser boot in 7.0 and current
Date: Tue, 08 Sep 2015 15:08:58 +0200

 David Holland wrote:

 >  Did this get committed?

 No, unfortunately not.
 I think we should just commit the patch, when Alan is too busy...

 -- 
 Frank Wille

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: misc/50046: Slow rc multiuser boot in 7.0 and current
Date: Wed, 9 Sep 2015 06:30:29 +0000

 On Tue, Sep 08, 2015 at 01:10:07PM +0000, Frank Wille wrote:
  >>  Did this get committed?
  >  
  >  No, unfortunately not.
  >  I think we should just commit the patch, when Alan is too busy...

 Yes, probably. I sort of wonder why it needs to check so extensively
 for a postprocessor when it ought to be able to know based on whether
 one got forked (and it shouldn't just die arbitrarily); but I'm
 probably missing something and I'm not curious enough to actually wade
 into the logic right now. :-/

 -- 
 David A. Holland
 dholland@netbsd.org

From: Frank Wille <frank@phoenix.owl.de>
To: gnats-bugs@NetBSD.org,
	misc-bug-people@netbsd.org,
	gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org
Cc: apb@cequrux.com
Subject: Re: misc/50046: Slow rc multiuser boot in 7.0 and current
Date: Wed, 30 Sep 2015 17:22:17 +0200

 Alan Barrett wrote:

 >  I suspect the tests in the _have_rc_postprocessor function in rc.subr.
 >  There are even comments saying that they are expensive.
 >  
 >  Please try the appended patch.
 > [...]

 Alan, can you commit that please? It has been tested and was a real
 improvement. Even when it is not the final solution we need it in 7.0 now. 

 Thanks!

 -- 
 Frank Wille

Responsible-Changed-From-To: misc-bug-people->phx
Responsible-Changed-By: phx@NetBSD.org
Responsible-Changed-When: Sat, 31 Oct 2015 12:21:01 +0000
Responsible-Changed-Why:
Applying the patch myself.


From: "Frank Wille" <phx@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/50046 CVS commit: src/etc
Date: Sat, 31 Oct 2015 12:31:37 +0000

 Module Name:	src
 Committed By:	phx
 Date:		Sat Oct 31 12:31:37 UTC 2015

 Modified Files:
 	src/etc: rc.subr

 Log Message:
 PR misc/50046
 Remove the expensive tests in _have_rc_postprocessor(), as proposed by apb@.
 It more than halves the multiuser boot time on slow machines and brings
 it back near to the previous level.


 To generate a diff of this commit:
 cvs rdiff -u -r1.96 -r1.97 src/etc/rc.subr

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->pending-pullups
State-Changed-By: phx@NetBSD.org
State-Changed-When: Sat, 31 Oct 2015 16:19:20 +0000
State-Changed-Why:
Pullup for NetBSD-7 requested.


From: "Jeff Rizzo" <riz@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/50046 CVS commit: [netbsd-7] src/etc
Date: Sun, 8 Nov 2015 00:45:23 +0000

 Module Name:	src
 Committed By:	riz
 Date:		Sun Nov  8 00:45:22 UTC 2015

 Modified Files:
 	src/etc [netbsd-7]: rc.subr

 Log Message:
 Pull up following revision(s) (requested by phx in ticket #1015):
 	etc/rc.subr: revision 1.97
 PR misc/50046
 Remove the expensive tests in _have_rc_postprocessor(), as proposed by apb@.
 It more than halves the multiuser boot time on slow machines and brings
 it back near to the previous level.


 To generate a diff of this commit:
 cvs rdiff -u -r1.94 -r1.94.2.1 src/etc/rc.subr

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: pending-pullups->closed
State-Changed-By: phx@NetBSD.org
State-Changed-When: Wed, 11 Nov 2015 11:03:10 +0000
State-Changed-Why:
Fixed, tested and pulled up to netbsd-7.


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2014 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.