NetBSD Problem Report #32519

From dholland@eecs.harvard.edu  Sat Jan 14 07:14:47 2006
Return-Path: <dholland@eecs.harvard.edu>
Received: from tanaqui.eecs.harvard.edu (tanaqui.eecs.harvard.edu [140.247.60.239])
	by narn.netbsd.org (Postfix) with ESMTP id 6F7B263B87A
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 14 Jan 2006 07:14:47 +0000 (UTC)
Message-Id: <20060114071218.3DB72F787@tanaqui.eecs.harvard.edu>
Date: Sat, 14 Jan 2006 02:12:18 -0500 (EST)
From: dholland@eecs.harvard.edu
Reply-To: dholland@eecs.harvard.edu
To: gnats-bugs@netbsd.org
Subject: ypbind spams syslog
X-Send-Pr-Version: 3.95

>Number:         32519
>Category:       bin
>Synopsis:       ypbind spams syslog if it loses connectivity
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    dholland
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Jan 14 07:15:00 +0000 2006
>Closed-Date:    
>Last-Modified:  Mon May 23 06:29:30 +0000 2011
>Originator:     David A. Holland <dholland@eecs.harvard.edu>
>Release:        NetBSD 3.99.10 (-20051026) (ypbind.c 1.53)
>Organization:
   Harvard EECS
>Environment:
System: NetBSD tanaqui 3.99.10 NetBSD 3.99.10 (TANAQUI) #3: Wed Oct 26 18:52:27 EDT 2005 root@tanaqui:/usr/src/sys/arch/i386/compile/TANAQUI i386
Architecture: i386
Machine: i386
>Description:
	A couple of days ago we had network maintenance in our building
	overnight, and so everything lost connectivity for a while. I
	came back the next morning to find the system log stuffed with
	messages like these:


Jan 11 07:30:26 tanaqui ypbind[307]: [server1]: Host name lookup failure
Jan 11 07:30:26 tanaqui ypbind[307]: [server2]: Host name lookup failure
Jan 11 07:30:26 tanaqui ypbind[307]: no contactable servers found in
					/var/yp/binding/[domain].ypservers
Jan 11 07:30:26 tanaqui ypbind[307]: broadcast: sendto: Host is down
Jan 11 07:30:26 tanaqui ypbind[307]: [server1]: Host name lookup failure
Jan 11 07:30:26 tanaqui ypbind[307]: [server2]: Host name lookup failure
Jan 11 07:30:26 tanaqui ypbind[307]: no contactable servers found in
					/var/yp/binding/[domain].ypservers

	(Yes, more than one of these groups per second...)

	The problem, or at least a problem, appears to be this code:

		if (ypdb->dom_check_t >= now) {
			/* don't flood it */
			ypdb->dom_check_t = 0;
			check++;
		}

	(lines 298-302 of ypbind.c 1.53)

	Setting dom_check_t to 0 forces a check of the server at the
	next call to checkwork(). This happens at least once a second,
	but also after servicing rpc calls. So when the network goes
	down and processes start contacting ypbind, it'll discover
	right away that it's lost all the servers. However, once
	things are dead every subsequent request it gets will trigger
	another try at contacting the servers, possibly many times per
	second. I think this is what I was seeing.

	(And of course every time it tries to contact servers and
	fails, it prints assorted messages to the system log.)

>How-To-Repeat:

	Find a system using ypbind (that isn't also a yp server),
	pull the network plug, and wait a while. If necessary, have
	some cron jobs or the like start up while the network is down.

>Fix:

	I think the right answer is to make the test

		(ypdb->dom_check_t >= now && ypdb->dom_alive == 1)

	as if everything's already been found to be unreachable
	dom_alive will apparently be 2. However, I could easily be
	missing something and I'd like someone else to think about it.
	So I'm not including a patch. :-)

	Even with this issue fixed ypbind will still probably spam the
	system log once every five seconds (the traditional behavior)
	and it would be nice to have some exponential backoff on that,
	and/or have it suppress the error messages after the first
	time. However, this is a separate problem and I'd be inclined
	right now to leave it for later.

	I do however suggest the following patch to fix a misleading
	log message:

Index: ypbind.c
===================================================================
RCS file: /cvsroot/src/usr.sbin/ypbind/ypbind.c,v
retrieving revision 1.53
diff -u -r1.53 ypbind.c
--- ypbind.c	30 Oct 2004 15:57:43 -0000	1.53
+++ ypbind.c	14 Jan 2006 06:52:28 -0000
@@ -774,7 +774,7 @@

 		if (sendto(rpcsock, buf, outlen, 0, (struct sockaddr *)&bindsin,
 			   sizeof bindsin) == -1)
-			yp_log(LOG_WARNING, "broadcast: sendto: %m");
+			yp_log(LOG_WARNING, "nag_servers: sendto: %m");
 	}

 	switch (ypbindmode) {


	(And I know, the real fix for these problems is to not run
	ypbind. Unfortunately, that's not always an option.)

>Release-Note:

>Audit-Trail:
From: Elad Efrat <elad@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: PR/32519 CVS commit: src/usr.sbin/ypbind
Date: Sun,  1 Oct 2006 19:43:15 +0000 (UTC)

 Module Name:	src
 Committed By:	elad
 Date:		Sun Oct  1 19:43:15 UTC 2006

 Modified Files:
 	src/usr.sbin/ypbind: ypbind.c

 Log Message:
 Fix misleading error message (from PR/32519).


 To generate a diff of this commit:
 cvs rdiff -r1.54 -r1.55 src/usr.sbin/ypbind/ypbind.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->closed
State-Changed-By: dholland@narn.netbsd.org
State-Changed-When: Sat, 19 Jan 2008 18:18:20 +0000
State-Changed-Why:
This particular pathological behavior no longer happens, though ypbind is
still noisier than one would like.


State-Changed-From-To: closed->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Wed, 13 Aug 2008 07:23:07 +0000
State-Changed-Why:
Spoke too soon; it still happens and hit me last bight.
Er, last *night*.


Responsible-Changed-From-To: bin-bug-people->dholland
Responsible-Changed-By: dholland@NetBSD.org
Responsible-Changed-When: Mon, 23 May 2011 06:29:30 +0000
Responsible-Changed-Why:
I am rototilling ypbind.


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.