NetBSD Problem Report #32519

From dholland@eecs.harvard.edu  Sat Jan 14 07:14:47 2006
Return-Path: <dholland@eecs.harvard.edu>
Received: from tanaqui.eecs.harvard.edu (tanaqui.eecs.harvard.edu [140.247.60.239])
	by narn.netbsd.org (Postfix) with ESMTP id 6F7B263B87A
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 14 Jan 2006 07:14:47 +0000 (UTC)
Message-Id: <20060114071218.3DB72F787@tanaqui.eecs.harvard.edu>
Date: Sat, 14 Jan 2006 02:12:18 -0500 (EST)
From: dholland@eecs.harvard.edu
Reply-To: dholland@eecs.harvard.edu
To: gnats-bugs@netbsd.org
Subject: ypbind spams syslog
X-Send-Pr-Version: 3.95

>Number:         32519
>Category:       bin
>Synopsis:       ypbind spams syslog if it loses connectivity
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    dholland
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Jan 14 07:15:00 +0000 2006
>Closed-Date:    
>Last-Modified:  Thu Sep 11 19:13:49 +0000 2014
>Originator:     David A. Holland <dholland@eecs.harvard.edu>
>Release:        NetBSD 3.99.10 (-20051026) (ypbind.c 1.53)
>Organization:
   Harvard EECS
>Environment:
System: NetBSD tanaqui 3.99.10 NetBSD 3.99.10 (TANAQUI) #3: Wed Oct 26 18:52:27 EDT 2005 root@tanaqui:/usr/src/sys/arch/i386/compile/TANAQUI i386
Architecture: i386
Machine: i386
>Description:
	A couple of days ago we had network maintenance in our building
	overnight, and so everything lost connectivity for a while. I
	came back the next morning to find the system log stuffed with
	messages like these:


Jan 11 07:30:26 tanaqui ypbind[307]: [server1]: Host name lookup failure
Jan 11 07:30:26 tanaqui ypbind[307]: [server2]: Host name lookup failure
Jan 11 07:30:26 tanaqui ypbind[307]: no contactable servers found in
					/var/yp/binding/[domain].ypservers
Jan 11 07:30:26 tanaqui ypbind[307]: broadcast: sendto: Host is down
Jan 11 07:30:26 tanaqui ypbind[307]: [server1]: Host name lookup failure
Jan 11 07:30:26 tanaqui ypbind[307]: [server2]: Host name lookup failure
Jan 11 07:30:26 tanaqui ypbind[307]: no contactable servers found in
					/var/yp/binding/[domain].ypservers

	(Yes, more than one of these groups per second...)

	The problem, or at least a problem, appears to be this code:

		if (ypdb->dom_check_t >= now) {
			/* don't flood it */
			ypdb->dom_check_t = 0;
			check++;
		}

	(lines 298-302 of ypbind.c 1.53)

	Setting dom_check_t to 0 forces a check of the server at the
	next call to checkwork(), which runs at least once a second
	and also after servicing rpc calls. So when the network goes
	down and processes start contacting ypbind, it'll discover
	right away that it's lost all the servers. However, once
	things are dead, every subsequent request it gets will trigger
	another try at contacting the servers, possibly many times per
	second. I think this is what I was seeing.

	(And of course every time it tries to contact servers and
	fails, it prints assorted messages to the system log.)

>How-To-Repeat:

	Find a system using ypbind (that isn't also a yp server),
	pull the network plug, and wait a while. If necessary, have
	some cron jobs or the like start up while the network is down.

>Fix:

	I think the right answer is to make the test

		(ypdb->dom_check_t >= now && ypdb->dom_alive == 1)

	because once everything has been found to be unreachable,
	dom_alive will apparently be 2. However, I could easily be
	missing something and I'd like someone else to think about it.
	So I'm not including a patch. :-)

	Even with this issue fixed, ypbind will probably still spam the
	system log once every five seconds (the traditional behavior).
	It would be nice to have some exponential backoff on that,
	and/or to have it suppress the error messages after the first
	time. However, this is a separate problem, and I'd be inclined
	right now to leave it for later.
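
	For reference, a minimal sketch of what backoff plus message
	suppression could look like. Nothing here is from ypbind:
	retry_delay(), should_log_failure() and the 600-second cap are
	assumptions; only the five-second base comes from the
	traditional behavior mentioned above.

```c
#include <stdbool.h>

#define RETRY_BASE_SECS	5	/* traditional retry interval */
#define RETRY_MAX_SECS	600	/* assumed upper bound on the delay */

/*
 * Delay before the next attempt, given the number of consecutive
 * failures so far. Doubles each time and is clamped to RETRY_MAX_SECS.
 */
static unsigned
retry_delay(unsigned failures)
{
	unsigned delay = RETRY_BASE_SECS;

	for (unsigned i = 0; i < failures && delay < RETRY_MAX_SECS; i++)
		delay *= 2;
	return delay < RETRY_MAX_SECS ? delay : RETRY_MAX_SECS;
}

/*
 * Log only the first failure in a run of failures; "failures" is the
 * count of consecutive failures seen before the current one.
 */
static bool
should_log_failure(unsigned failures)
{
	return failures == 0;
}
```

	The caller would reset its failure count to zero whenever a
	server answers, so the next outage is logged again and starts
	from the short delay.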

	I do however suggest the following patch to fix a misleading
	log message:

Index: ypbind.c
===================================================================
RCS file: /cvsroot/src/usr.sbin/ypbind/ypbind.c,v
retrieving revision 1.53
diff -u -r1.53 ypbind.c
--- ypbind.c	30 Oct 2004 15:57:43 -0000	1.53
+++ ypbind.c	14 Jan 2006 06:52:28 -0000
@@ -774,7 +774,7 @@

 		if (sendto(rpcsock, buf, outlen, 0, (struct sockaddr *)&bindsin,
 			   sizeof bindsin) == -1)
-			yp_log(LOG_WARNING, "broadcast: sendto: %m");
+			yp_log(LOG_WARNING, "nag_servers: sendto: %m");
 	}

 	switch (ypbindmode) {


	(And I know, the real fix for these problems is to not run
	ypbind. Unfortunately, that's not always an option.)

>Release-Note:

>Audit-Trail:
From: Elad Efrat <elad@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: PR/32519 CVS commit: src/usr.sbin/ypbind
Date: Sun,  1 Oct 2006 19:43:15 +0000 (UTC)

 Module Name:	src
 Committed By:	elad
 Date:		Sun Oct  1 19:43:15 UTC 2006

 Modified Files:
 	src/usr.sbin/ypbind: ypbind.c

 Log Message:
 Fix misleading error message (from PR/32519).


 To generate a diff of this commit:
 cvs rdiff -r1.54 -r1.55 src/usr.sbin/ypbind/ypbind.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->closed
State-Changed-By: dholland@narn.netbsd.org
State-Changed-When: Sat, 19 Jan 2008 18:18:20 +0000
State-Changed-Why:
This particular pathological behavior no longer happens, though ypbind is
still noisier than one would like.


State-Changed-From-To: closed->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Wed, 13 Aug 2008 07:23:07 +0000
State-Changed-Why:
Spoke too soon; it still happens and hit me last bight.
Er, last *night*.


Responsible-Changed-From-To: bin-bug-people->dholland
Responsible-Changed-By: dholland@NetBSD.org
Responsible-Changed-When: Mon, 23 May 2011 06:29:30 +0000
Responsible-Changed-Why:
I am rototilling ypbind.


From: "David A. Holland" <dholland@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/32519 CVS commit: src/usr.sbin/ypbind
Date: Tue, 10 Jun 2014 17:19:22 +0000

 Module Name:	src
 Committed By:	dholland
 Date:		Tue Jun 10 17:19:22 UTC 2014

 Modified Files:
 	src/usr.sbin/ypbind: ypbind.c

 Log Message:
 Instead of using magic numbers in what looks like a boolean
 (dom_alive), create a state enumeration (domainstates) and use it
 instead.

 Instead of three states (new, alive, and, effectively, 'troubled') go
 to five: new, alive, pinging, lost, and dead.

 Domains start in the NEW state. When we get a reply from a server, the
 state goes to ALIVE. The state is set to PINGING when we ping the
 server (once a minute normally) and if the ping times out, it goes to
 LOST. If we stay lost for a minute, go to DEAD, and in DEAD, do
 exponential backoff of nag_servers calls.

 Getting rid of the broken logic attached to the 'troubled' state fixes
 PR 15355 (ypbind defeats disk idle spindown) -- it will now only
 rewrite the binding file when the binding changes.

 Also, fix the HEURISTIC code so it doesn't trigger except in ALIVE
 state. I think this was the source of a lot of the spamming behavior
 seen in PR 32519, which is now fixed.

 Might also fix PR 23135 (broadcast ypbind sometimes fails to find
 servers).
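
 The transitions described above can be sketched roughly as follows.
 This is an illustration, not the committed code: the state names
 follow the log message, but the identifiers, the event enumeration,
 next_state() and heuristic_allowed() are all assumed.

```c
/* State names follow the log message; the DOM_ prefix is assumed. */
enum domainstate { DOM_NEW, DOM_ALIVE, DOM_PINGING, DOM_LOST, DOM_DEAD };

/* Hypothetical events that drive the transitions. */
enum domainevent {
	EV_REPLY,		/* a server answered */
	EV_PING_SENT,		/* the periodic ping went out */
	EV_PING_TIMEOUT,	/* the ping got no answer in time */
	EV_LOST_ONE_MINUTE	/* we have been LOST for a minute */
};

static enum domainstate
next_state(enum domainstate s, enum domainevent ev)
{
	switch (ev) {
	case EV_REPLY:
		return DOM_ALIVE;	/* any reply makes the domain alive */
	case EV_PING_SENT:
		return s == DOM_ALIVE ? DOM_PINGING : s;
	case EV_PING_TIMEOUT:
		return s == DOM_PINGING ? DOM_LOST : s;
	case EV_LOST_ONE_MINUTE:
		return s == DOM_LOST ? DOM_DEAD : s;
	}
	return s;
}

/* The request-time heuristic may force an early check only when ALIVE. */
static int
heuristic_allowed(enum domainstate s)
{
	return s == DOM_ALIVE;
}
```

 Events that do not apply to the current state leave it unchanged, so
 a stray timeout in DEAD, for example, cannot restart the cycle.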


 To generate a diff of this commit:
 cvs rdiff -u -r1.95 -r1.96 src/usr.sbin/ypbind/ypbind.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->pending-pullups
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Tue, 10 Jun 2014 17:41:00 +0000
State-Changed-Why:
pullup-6 #1083
(I have filed it, but I've asked to stall until June 20th as a precaution.
Also, depending on what releng thinks we might just punt.)

It is not clear whether trying to fix -5 is worthwhile.


From: "SAITOH Masanobu" <msaitoh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/32519 CVS commit: [netbsd-6] src/usr.sbin/ypbind
Date: Tue, 9 Sep 2014 08:24:29 +0000

 Module Name:	src
 Committed By:	msaitoh
 Date:		Tue Sep  9 08:24:29 UTC 2014

 Modified Files:
 	src/usr.sbin/ypbind [netbsd-6]: ypbind.8 ypbind.c

 Log Message:
 Pull up following revision(s) (requested by dholland in ticket #1083):
 	usr.sbin/ypbind/ypbind.c: revision 1.91
 	usr.sbin/ypbind/ypbind.c: revision 1.92
 	usr.sbin/ypbind/ypbind.c: revision 1.93
 	usr.sbin/ypbind/ypbind.c: revision 1.94
 	usr.sbin/ypbind/ypbind.c: revision 1.95
 	usr.sbin/ypbind/ypbind.c: revision 1.96
 	usr.sbin/ypbind/ypbind.c: revision 1.97
 	usr.sbin/ypbind/ypbind.c: revision 1.98
 	usr.sbin/ypbind/ypbind.8: revision 1.20
 	usr.sbin/ypbind/ypbind.8: revision 1.19
 Don't store the default domain name in a global. While running we
 really don't care which domain is the system's default domain.
 Factor out some rpc validation code.
 While there are times it's appropriate to call a state variable
 "evil", this isn't one of them. Since the logic involved is to wait
 until the default domain binds before backgrounding, call the variable
 "started" instead.
 Don't rake up the default domain until after processing arguments.
 Processing arguments just sets flags -- may as well do it first, and
 this way detection of silly errors isn't contingent on having things
 fully configured and operating.
 Load up with comments.
 Instead of using magic numbers in what looks like a boolean
 (dom_alive), create a state enumeration (domainstates) and use it
 instead.
 Instead of three states (new, alive, and, effectively, 'troubled') go
 to five: new, alive, pinging, lost, and dead.
 Domains start in the NEW state. When we get a reply from a server, the
 state goes to ALIVE. The state is set to PINGING when we ping the
 server (once a minute normally) and if the ping times out, it goes to
 LOST. If we stay lost for a minute, go to DEAD, and in DEAD, do
 exponential backoff of nag_servers calls.
 Getting rid of the broken logic attached to the 'troubled' state fixes
 PR 15355 (ypbind defeats disk idle spindown) -- it will now only
 rewrite the binding file when the binding changes.
 Also, fix the HEURISTIC code so it doesn't trigger except in ALIVE
 state. I think this was the source of a lot of the spamming behavior
 seen in PR 32519, which is now fixed.
 Might also fix PR 23135 (broadcast ypbind sometimes fails to find
 servers).
 Add a SIGHUP handler; upon SIGHUP do an extra nag_servers on any
 domain that's in DEAD state. This lets you explicitly rescue ypbind
 from its exponential backoff when you know the world's back up.
 Log state transitions.
 Document exponential backoff behavior and SIGHUP support, plus a couple
 other minor edits.
 Use more markup.
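
 As an illustration of the SIGHUP behavior described above (a sketch
 under assumptions, not the pulled-up code): the handler only sets a
 flag, and the main loop later gives every DEAD domain one extra
 nag. struct domain, the nagged counter, install_sighup() and
 handle_sighup() are hypothetical names.

```c
#include <signal.h>
#include <stddef.h>

/* State names follow the log message; the DOM_ prefix is assumed. */
enum domainstate { DOM_NEW, DOM_ALIVE, DOM_PINGING, DOM_LOST, DOM_DEAD };

/* Hypothetical per-domain record; "nagged" stands in for nag_servers(). */
struct domain {
	enum domainstate state;
	int nagged;
};

static volatile sig_atomic_t hupped;

/* Async-signal-safe: just record that SIGHUP arrived. */
static void
sighup_handler(int sig)
{
	(void)sig;
	hupped = 1;
}

/* signal() keeps the sketch short; a daemon may prefer sigaction(2). */
static int
install_sighup(void)
{
	return signal(SIGHUP, sighup_handler) == SIG_ERR ? -1 : 0;
}

/*
 * Called from the main loop. If a SIGHUP is pending, nag every DEAD
 * domain once, ignoring its backoff timer. Returns the number nagged.
 */
static size_t
handle_sighup(struct domain *doms, size_t ndoms)
{
	size_t count = 0;

	if (!hupped)
		return 0;
	hupped = 0;
	for (size_t i = 0; i < ndoms; i++) {
		if (doms[i].state == DOM_DEAD) {
			doms[i].nagged++;
			count++;
		}
	}
	return count;
}
```

 Doing the work in the main loop rather than in the handler keeps
 the handler trivial and avoids calling non-reentrant code from
 signal context.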


 To generate a diff of this commit:
 cvs rdiff -u -r1.18 -r1.18.22.1 src/usr.sbin/ypbind/ypbind.8
 cvs rdiff -u -r1.90 -r1.90.4.1 src/usr.sbin/ypbind/ypbind.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: pending-pullups->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Thu, 11 Sep 2014 19:13:49 +0000
State-Changed-Why:
pullup-6 done, but I need to do pullup-5 too.


>Unformatted:

Copyright © 1994-2014 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.