NetBSD Problem Report #32519
From dholland@eecs.harvard.edu Sat Jan 14 07:14:47 2006
Return-Path: <dholland@eecs.harvard.edu>
Received: from tanaqui.eecs.harvard.edu (tanaqui.eecs.harvard.edu [140.247.60.239])
by narn.netbsd.org (Postfix) with ESMTP id 6F7B263B87A
for <gnats-bugs@gnats.NetBSD.org>; Sat, 14 Jan 2006 07:14:47 +0000 (UTC)
Message-Id: <20060114071218.3DB72F787@tanaqui.eecs.harvard.edu>
Date: Sat, 14 Jan 2006 02:12:18 -0500 (EST)
From: dholland@eecs.harvard.edu
Reply-To: dholland@eecs.harvard.edu
To: gnats-bugs@netbsd.org
Subject: ypbind spams syslog
X-Send-Pr-Version: 3.95
>Number: 32519
>Category: bin
>Synopsis: ypbind spams syslog if it loses connectivity
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: dholland
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Jan 14 07:15:00 +0000 2006
>Closed-Date: Wed Nov 05 09:41:44 +0000 2014
>Last-Modified: Wed Nov 05 09:41:44 +0000 2014
>Originator: David A. Holland <dholland@eecs.harvard.edu>
>Release: NetBSD 3.99.10 (-20051026) (ypbind.c 1.53)
>Organization:
Harvard EECS
>Environment:
System: NetBSD tanaqui 3.99.10 NetBSD 3.99.10 (TANAQUI) #3: Wed Oct 26 18:52:27 EDT 2005 root@tanaqui:/usr/src/sys/arch/i386/compile/TANAQUI i386
Architecture: i386
Machine: i386
>Description:
A couple of days ago we had network maintenance in our building
overnight, and so everything lost connectivity for a while. I
came back the next morning to find the system log stuffed with
messages like these:
Jan 11 07:30:26 tanaqui ypbind[307]: [server1]: Host name lookup failure
Jan 11 07:30:26 tanaqui ypbind[307]: [server2]: Host name lookup failure
Jan 11 07:30:26 tanaqui ypbind[307]: no contactable servers found in
/var/yp/binding/[domain].ypservers
Jan 11 07:30:26 tanaqui ypbind[307]: broadcast: sendto: Host is down
Jan 11 07:30:26 tanaqui ypbind[307]: [server1]: Host name lookup failure
Jan 11 07:30:26 tanaqui ypbind[307]: [server2]: Host name lookup failure
Jan 11 07:30:26 tanaqui ypbind[307]: no contactable servers found in
/var/yp/binding/[domain].ypservers
(Yes, more than one of these groups per second...)
The problem, or at least a problem, appears to be this code:
if (ypdb->dom_check_t >= now) {
/* don't flood it */
ypdb->dom_check_t = 0;
check++;
}
(lines 298-302 of ypbind.c 1.53)
Setting dom_check_t to 0 forces a check of the server at the
next call to checkwork(). This happens at least once a second,
but also after servicing rpc calls. So when the network goes
down and processes start contacting ypbind, it'll discover
right away that it's lost all the servers. However, once
things are dead every subsequent request it gets will trigger
another try at contacting the servers, possibly many times per
second. I think this is what I was seeing.
(And of course every time it tries to contact servers and
fails, it prints assorted messages to the system log.)
>How-To-Repeat:
Find a system using ypbind (that isn't also a yp server),
pull the network plug, and wait a while. If necessary, have
some cron jobs or the like start up while the network is down.
>Fix:
I think the right answer is to make the test
(ypdb->dom_check_t >= now && ypdb->dom_alive == 1)
as if everything's already been found to be unreachable
dom_alive will apparently be 2. However, I could easily be
missing something and I'd like someone else to think about it.
So I'm not including a patch. :-)
Even with this issue fixed ypbind will still probably spam the
system log once every five seconds (the traditional behavior)
and it would be nice to have some exponential backoff on that,
and/or have it suppress the error messages after the first
time. However, this is a separate problem and I'd be inclined
right now to leave it for later.
I do however suggest the following patch to fix a misleading
log message:
Index: ypbind.c
===================================================================
RCS file: /cvsroot/src/usr.sbin/ypbind/ypbind.c,v
retrieving revision 1.53
diff -u -r1.53 ypbind.c
--- ypbind.c 30 Oct 2004 15:57:43 -0000 1.53
+++ ypbind.c 14 Jan 2006 06:52:28 -0000
@@ -774,7 +774,7 @@
if (sendto(rpcsock, buf, outlen, 0, (struct sockaddr *)&bindsin,
sizeof bindsin) == -1)
- yp_log(LOG_WARNING, "broadcast: sendto: %m");
+ yp_log(LOG_WARNING, "nag_servers: sendto: %m");
}
switch (ypbindmode) {
(And I know, the real fix for these problems is to not run
ypbind. Unfortunately, that's not always an option.)
>Release-Note:
>Audit-Trail:
From: Elad Efrat <elad@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: PR/32519 CVS commit: src/usr.sbin/ypbind
Date: Sun, 1 Oct 2006 19:43:15 +0000 (UTC)
Module Name: src
Committed By: elad
Date: Sun Oct 1 19:43:15 UTC 2006
Modified Files:
src/usr.sbin/ypbind: ypbind.c
Log Message:
Fix misleading error message (from PR/32519).
To generate a diff of this commit:
cvs rdiff -r1.54 -r1.55 src/usr.sbin/ypbind/ypbind.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->closed
State-Changed-By: dholland@narn.netbsd.org
State-Changed-When: Sat, 19 Jan 2008 18:18:20 +0000
State-Changed-Why:
This particular pathological behavior no longer happens, though ypbind is
still noisier than one would like.
State-Changed-From-To: closed->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Wed, 13 Aug 2008 07:23:07 +0000
State-Changed-Why:
Spoke too soon; it still happens and hit me last bight.
Er, last *night*.
Responsible-Changed-From-To: bin-bug-people->dholland
Responsible-Changed-By: dholland@NetBSD.org
Responsible-Changed-When: Mon, 23 May 2011 06:29:30 +0000
Responsible-Changed-Why:
I am rototilling ypbind.
From: "David A. Holland" <dholland@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/32519 CVS commit: src/usr.sbin/ypbind
Date: Tue, 10 Jun 2014 17:19:22 +0000
Module Name: src
Committed By: dholland
Date: Tue Jun 10 17:19:22 UTC 2014
Modified Files:
src/usr.sbin/ypbind: ypbind.c
Log Message:
Instead of using magic numbers in what looks like a boolean
(dom_alive), create a state enumeration (domainstates) and use it
instead.
Instead of three states (new, alive, and, effectively, 'troubled') go
to five: new, alive, pinging, lost, and dead.
Domains start in the NEW state. When we get a reply from a server, the
state goes to ALIVE. The state is set to PINGING when we ping the
server (once a minute normally) and if the ping times out, it goes to
LOST. If we stay lost for a minute, go to DEAD, and in DEAD, do
exponential backoff of nag_servers calls.
Getting rid of the broken logic attached to the 'troubled' state fixes
PR 15355 (ypbind defeats disk idle spindown) -- it will now only
rewrite the binding file when the binding changes.
Also, fix the HEURISTIC code so it doesn't trigger except in ALIVE
state. I think this was the source of a lot of the spamming behavior
seen in PR 32519, which is now fixed.
Might also fix PR 23135 (broadcast ypbind sometimes fails to find
servers).
To generate a diff of this commit:
cvs rdiff -u -r1.95 -r1.96 src/usr.sbin/ypbind/ypbind.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->pending-pullups
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Tue, 10 Jun 2014 17:41:00 +0000
State-Changed-Why:
pullup-6 #1083
(I have filed it, but I've asked to stall until June 20th as a precaution.
Also, depending on what releng things we might just punt.)
It is not clear whether trying to fix -5 is worthwhile.
From: "SAITOH Masanobu" <msaitoh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/32519 CVS commit: [netbsd-6] src/usr.sbin/ypbind
Date: Tue, 9 Sep 2014 08:24:29 +0000
Module Name: src
Committed By: msaitoh
Date: Tue Sep 9 08:24:29 UTC 2014
Modified Files:
src/usr.sbin/ypbind [netbsd-6]: ypbind.8 ypbind.c
Log Message:
Pull up following revision(s) (requested by dholland in ticket #1083):
usr.sbin/ypbind/ypbind.c: revision 1.91
usr.sbin/ypbind/ypbind.c: revision 1.92
usr.sbin/ypbind/ypbind.c: revision 1.93
usr.sbin/ypbind/ypbind.c: revision 1.94
usr.sbin/ypbind/ypbind.c: revision 1.95
usr.sbin/ypbind/ypbind.c: revision 1.96
usr.sbin/ypbind/ypbind.c: revision 1.97
usr.sbin/ypbind/ypbind.c: revision 1.98
usr.sbin/ypbind/ypbind.8: revision 1.20
usr.sbin/ypbind/ypbind.8: revision 1.19
Don't store the default domain name in a global. While running we
really don't care which domain is the system's default domain.
Factor out some rpc validation code.
While there are times it's appropriate to call a state variable
"evil", this isn't one of them. Since the logic involved is to wait
until the default domain binds before backgrounding, call the variable
"started" instead.
Don't rake up the default domain until after processing arguments.
Processing arguments just sets flags -- may as well do it first, and
this way detection of silly errors isn't contingent on having things
fully configured and operating.
Load up with comments.
Instead of using magic numbers in what looks like a boolean
(dom_alive), create a state enumeration (domainstates) and use it
instead.
Instead of three states (new, alive, and, effectively, 'troubled') go
to five: new, alive, pinging, lost, and dead.
Domains start in the NEW state. When we get a reply from a server, the
state goes to ALIVE. The state is set to PINGING when we ping the
server (once a minute normally) and if the ping times out, it goes to
LOST. If we stay lost for a minute, go to DEAD, and in DEAD, do
exponential backoff of nag_servers calls.
Getting rid of the broken logic attached to the 'troubled' state fixes
PR 15355 (ypbind defeats disk idle spindown) -- it will now only
rewrite the binding file when the binding changes.
Also, fix the HEURISTIC code so it doesn't trigger except in ALIVE
state. I think this was the source of a lot of the spamming behavior
seen in PR 32519, which is now fixed.
Might also fix PR 23135 (broadcast ypbind sometimes fails to find
servers).
Add a SIGHUP handler; upon SIGHUP do an extra nag_servers on any
domain that's in DEAD state. This lets you explicitly rescue ypbind
from its exponential backoff when you know the world's back up.
Log state transitions.
Don't store the default domain name in a global. While running we
really don't care which domain is the system's default domain.
Factor out some rpc validation code.
While there are times it's appropriate to call a state variable
"evil", this isn't one of them. Since the logic involved is to wait
until the default domain binds before backgrounding, call the variable
"started" instead.
Don't rake up the default domain until after processing arguments.
Processing arguments just sets flags -- may as well do it first, and
this way detection of silly errors isn't contingent on having things
fully configured and operating.
Load up with comments.
Instead of using magic numbers in what looks like a boolean
(dom_alive), create a state enumeration (domainstates) and use it
instead.
Instead of three states (new, alive, and, effectively, 'troubled') go
to five: new, alive, pinging, lost, and dead.
Domains start in the NEW state. When we get a reply from a server, the
state goes to ALIVE. The state is set to PINGING when we ping the
server (once a minute normally) and if the ping times out, it goes to
LOST. If we stay lost for a minute, go to DEAD, and in DEAD, do
exponential backoff of nag_servers calls.
Getting rid of the broken logic attached to the 'troubled' state fixes
PR 15355 (ypbind defeats disk idle spindown) -- it will now only
rewrite the binding file when the binding changes.
Also, fix the HEURISTIC code so it doesn't trigger except in ALIVE
state. I think this was the source of a lot of the spamming behavior
seen in PR 32519, which is now fixed.
Might also fix PR 23135 (broadcast ypbind sometimes fails to find
servers).
Add a SIGHUP handler; upon SIGHUP do an extra nag_servers on any
domain that's in DEAD state. This lets you explicitly rescue ypbind
from its exponential backoff when you know the world's back up.
Log state transitions.
Document exponential backoff behavior and SIGHUP support, plus a couple
other minor edits.
Use more markup.
To generate a diff of this commit:
cvs rdiff -u -r1.18 -r1.18.22.1 src/usr.sbin/ypbind/ypbind.8
cvs rdiff -u -r1.90 -r1.90.4.1 src/usr.sbin/ypbind/ypbind.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: pending-pullups->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Thu, 11 Sep 2014 19:13:49 +0000
State-Changed-Why:
pullup-6 done, but I need to do pullup-5 too.
State-Changed-From-To: open->pending-pullups
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sat, 20 Sep 2014 21:16:37 +0000
State-Changed-Why:
pullup-5 #1925
From: "SAITOH Masanobu" <msaitoh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/32519 CVS commit: [netbsd-5] src/usr.sbin/ypbind
Date: Mon, 3 Nov 2014 18:48:42 +0000
Module Name: src
Committed By: msaitoh
Date: Mon Nov 3 18:48:42 UTC 2014
Modified Files:
src/usr.sbin/ypbind [netbsd-5]: ypbind.8 ypbind.c
Log Message:
Pull up following revision(s) (requested by jmcneill in dholland ticket #1925):
usr.sbin/ypbind/ypbind.8 1.19-1.20
usr.sbin/ypbind/ypbind.c 1.63-1.98
State handling fixes and cleanups in ypbind; it no longer churns the
disk, spams syslogd when the server goes down, or gets confused about
its binding state. Fixes PR#15355, PR#32519 and also PR#43900.
To generate a diff of this commit:
cvs rdiff -u -r1.18 -r1.18.6.1 src/usr.sbin/ypbind/ypbind.8
cvs rdiff -u -r1.57.18.1 -r1.57.18.2 src/usr.sbin/ypbind/ypbind.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: pending-pullups->closed
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Wed, 05 Nov 2014 09:41:44 +0000
State-Changed-Why:
Fixed.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2014
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.