NetBSD Problem Report #51313

From hf@spg.tu-darmstadt.de  Tue Jul  5 13:50:30 2016
Return-Path: <hf@spg.tu-darmstadt.de>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 344987A227
	for <gnats-bugs@gnats.NetBSD.org>; Tue,  5 Jul 2016 13:50:30 +0000 (UTC)
Message-Id: <201607051321.u65DLM0q023888@Gstoder.nt.e-technik.tu-darmstadt.de>
Date: Tue, 5 Jul 2016 15:21:22 +0200 (CEST)
From: Hauke Fath <hf@spg.tu-darmstadt.de>
Reply-To: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@NetBSD.org
Cc: Hauke Fath <hf@spg.tu-darmstadt.de>
Subject: named caching server timeouts
X-Send-Pr-Version: 3.95

>Number:         51313
>Category:       bin
>Synopsis:       named caching server timeouts
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    bin-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Jul 05 13:55:01 +0000 2016
>Last-Modified:  Sun Sep 16 02:05:00 +0000 2018
>Originator:     Hauke Fath
>Release:        NetBSD 7.0_STABLE
>Organization:
Technische Universitaet Darmstadt
>Environment:


System: NetBSD Gstoder 7.0_STABLE NetBSD 7.0_STABLE (MONOLITHIC) #1: Fri Apr 1 14:41:59 CEST 2016 hf@Hochstuhl:/var/obj/netbsd-builds/7/i386/sys/arch/i386/compile/MONOLITHIC i386
Architecture: i386
Machine: i386
>Description:

	We have a named instance running on a XEN DomU that caches the
	group's dns requests.

	After switching from netbsd-5 to netbsd-7, users report
	intermittent UI hangs that feel a lot like DNS failures, and
	the named log has plenty of lines like


Jul  5 12:43:40 Cellon named[631]: client 0x7f7ff07d1800 (e1948.a.akamaiedge.net): query_find: unexpected error after resuming: timed out
Jul  5 12:43:50 Cellon named[631]: client 0x7f7ff07d1800 (springer.com.edgekey.net): query_find: unexpected error after resuming: timed out
Jul  5 12:43:50 Cellon named[631]: client 0x7f7ff0def800 (video-stats.l.google.com): query_find: unexpected error after resuming: timed out
Jul  5 12:43:50 Cellon named[631]: client 0x7f7ff0df3000 (0.client-channel.google.com): query_find: unexpected error after resuming: timed out
Jul  5 12:43:50 Cellon named[631]: client 0x7f7ff0df1000 (0.client-channel.google.com): query_find: unexpected error after resuming: timed out
Jul  5 12:43:50 Cellon named[631]: client 0x7f7ff0df2800 (doodle.com): query_find: unexpected error after resuming: timed out
Jul  5 12:43:50 Cellon named[631]: client 0x7f7ff0df0000 (blocklist.addons.mozilla.org): query_find: unexpected error after resuming: timed out
Jul  5 12:43:50 Cellon named[631]: client 0x7f7ff07d2800 (services.addons.mozilla.org): query_find: unexpected error after resuming: timed out
Jul  5 12:43:50 Cellon named[631]: client 0x7f7ff0def000 (kicker.de.122.2o7.net): query_find: unexpected error after resuming: timed out
Jul  5 12:43:50 Cellon named[631]: client 0x7f7ff0dee000 (kicker.de.122.2o7.net): query_find: unexpected error after resuming: timed out
Jul  5 12:43:50 Cellon named[631]: client 0x7f7ff0dec800 (imap.gmail.com): query_find: unexpected error after resuming: timed out
Jul  5 12:43:50 Cellon named[631]: client 0x7f7ff5377000 (cs485.wac.gammacdn.net): query_find: unexpected error after resuming: timed out
Jul  5 12:43:50 Cellon named[631]: client 0x7f7ff43dd000 (dns.msftncsi.com): query_find: unexpected error after resuming: timed out

	See also
	<https://mail-index.netbsd.org/netbsd-users/2016/03/24/msg018231.html>.


>How-To-Repeat:

	Set up a caching nameserver on netbsd-7.


>Fix:

	For now, we are back to the netbsd-5 DomU which does not
	exhibit the problem.

>Audit-Trail:
From: Thor Lancelot Simon <tls@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: bin/51313: named caching server timeouts
Date: Thu, 5 Apr 2018 21:10:52 +0000

 I see this misbehavior with NetBSD-8 built on 2017-12-30.

 Here are some snippets from /var/log/messages as well as from packet captures
 on the client-serving (interior) and internet-facing (exterior) interfaces
 of my NetBSD firewall, showing a query for pool.ntp.org, which:

 1) Appears to have succeded, returning a response to the caching server, but
 2) Results in ServFail being returned to the interior client, after a delay.

 The client appears to repeatedly query beginning at 20:17:49
 (as seen in the packet capture immediately following) but gets no response
 until at 20:17:59 finally it elicits a ServFail (corresponding to the
 log message you'll see next).  After the syslog messages you'll see the
 packet capture of the outside-interface traffic, witha successful query
 recursing all the way up at 20:17:49, corresponding to the client's first
 query, but no further traffic, perhaps because the successful lookup's cached).

 Inside-interface traffic:

 20:17:49.051563 IP (tos 0x0, ttl 255, id 2935, offset 0, flags [none], proto UDP (17), length 58)
     192.168.100.112.64751 > 192.168.100.1.53: [udp sum ok] 53844+ A? pool.ntp.org. (30)
 20:17:49.054393 IP (tos 0x0, ttl 255, id 48697, offset 0, flags [none], proto UDP (17), length 58)
     192.168.100.112.62578 > 192.168.100.1.53: [udp sum ok] 62879+ AAAA? pool.ntp.org. (30)
 20:17:50.043729 IP (tos 0x0, ttl 255, id 55255, offset 0, flags [none], proto UDP (17), length 58)
     192.168.100.112.64751 > 192.168.100.1.53: [udp sum ok] 53844+ A? pool.ntp.org. (30)
 20:17:50.054569 IP (tos 0x0, ttl 255, id 9128, offset 0, flags [none], proto UDP (17), length 58) 
     192.168.100.112.62578 > 192.168.100.1.53: [udp sum ok] 62879+ AAAA? pool.ntp.org. (30)
 20:17:52.046166 IP (tos 0x0, ttl 255, id 26324, offset 0, flags [none], proto UDP (17), length 58)
     192.168.100.112.64751 > 192.168.100.1.53: [udp sum ok] 53844+ A? pool.ntp.org. (30)
 20:17:52.083412 IP (tos 0x0, ttl 255, id 42478, offset 0, flags [none], proto UDP (17), length 58)
     192.168.100.112.62578 > 192.168.100.1.53: [udp sum ok] 62879+ AAAA? pool.ntp.org. (30)
 20:17:56.084653 IP (tos 0x0, ttl 255, id 64050, offset 0, flags [none], proto UDP (17), length 58)
     192.168.100.112.64751 > 192.168.100.1.53: [udp sum ok] 53844+ A? pool.ntp.org. (30)
 20:17:56.084745 IP (tos 0x0, ttl 255, id 64500, offset 0, flags [none], proto UDP (17), length 58)
     192.168.100.112.62578 > 192.168.100.1.53: [udp sum ok] 62879+ AAAA? pool.ntp.org. (30)
 20:17:59.073639 IP (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 58)
     192.168.100.1.53 > 192.168.100.112.64751: [udp sum ok] 53844 ServFail q: A? pool.ntp.org. 0/0/0 (30)
 20:17:59.073758 IP (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 58)
     192.168.100.1.53 > 192.168.100.112.62578: [udp sum ok] 62879 ServFail q: AAAA? pool.ntp.org. 0/0/0 (30)

 Log messages from named:

 Apr  5 20:17:59 hotpoint named[451]: query client=0x72f4cd103800 thread=0x72f4d4105000 (pool.ntp.org/A): query_find: unexpected error after resuming: timed out 
 Apr  5 20:17:59 hotpoint named[451]: query client=0x72f4cd105800 thread=0x72f4d4107000 (pool.ntp.org/AAAA): query_find: unexpected error after resuming: timed out

 Outside-interface traffic (there's no further occurrence of the string "ntp" in this
 capture after what's quoted here):

 20:17:49.193754 IP (tos 0x0, ttl 51, id 15102, offset 0, flags [DF], proto UDP (17), length 170) 
     207.171.17.42.53 > 24.104.251.72.64596: [udp sum ok] 59751*- q: A? pool.ntp.org. 4/0/0 pool.ntp.org. [2m30s] A 129.250.35.250, pool.ntp.org. [2m30s] A 173.230.144.109, pool.ntp.org. [2m30s] A 96.126.100.203, pool.ntp.org. [2m30s] A 173.255.206.154 (142) 
 20:17:49.194893 IP (tos 0x0, ttl 64, id 48395, offset 0, flags [none], proto UDP (17), length 76, bad cksum 0 (->a9d0)!)
     24.104.251.72.54793 > 199.19.57.1.53: [udp sum ok] 21225 [1au] DS? ntp.org. ar: . OPT UDPsize=512 DO (48)
 20:17:49.199728 IP (tos 0x0, ttl 51, id 15104, offset 0, flags [DF], proto UDP (17), length 111)
     207.171.17.42.53 > 24.104.251.72.53033: [udp sum ok] 34246*- q: A? g.ntpns.org. 2/0/0 g.ntpns.org. [2h53m20s] A 199.188.48.59, g.ntpns.org. [2h53m20s] A 198.105.223.32 (83)
 20:17:49.200701 IP (tos 0x0, ttl 51, id 15105, offset 0, flags [DF], proto UDP (17), length 137)
     207.171.17.42.53 > 24.104.251.72.57221: [udp sum ok] 32487*- q: AAAA? g.ntpns.org. 0/1/0 ns: g.ntpns.org. [1h] SOA a.ntpns.org. hostmaster.g.ntpns.org. 1516976433 5400 5400 1209600 3600 (109)
 20:17:49.207982 IP (tos 0x0, ttl 51, id 15107, offset 0, flags [DF], proto UDP (17), length 140)
     207.171.17.42.53 > 24.104.251.72.56537: [udp sum ok] 59870*- q: AAAA? pool.ntp.org. 0/1/0 ns: pool.ntp.org. [25m] SOA h.ntpns.org. hostmaster.pool.ntp.org. 1522959062 5400 5400 1209600 3600 (112)
 20:17:49.217755 IP (tos 0x0, ttl 54, id 32683, offset 0, flags [none], proto UDP (17), length 64)
     199.19.57.1.53 > 24.104.251.72.54793: [udp sum ok] 21225*-| q: DS? ntp.org. 0/0/1 ar: . OPT UDPsize=4096 DO (36)
 20:17:49.218310 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60, bad cksum 0 (->26f7)!)
     24.104.251.72.50135 > 199.19.57.1.53: Flags [S], cksum 0xa627 (correct), seq 3888800318, win 32768, options [mss 1460,nop,wscale 3,sackOK,TS val 1 ecr 0], length 0
 20:17:49.240802 IP (tos 0x0, ttl 56, id 47144, offset 0, flags [DF], proto TCP (6), length 60)
     199.19.57.1.53 > 24.104.251.72.50135: Flags [S.], cksum 0xb370 (correct), seq 1121150816, ack 3888800319, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 1740464306 ecr 1], length 0
 20:17:49.240898 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52, bad cksum 0 (->26ff)!)
     24.104.251.72.50135 > 199.19.57.1.53: Flags [.], cksum 0xd1d6 (correct), seq 1, ack 1, win 4197, options [nop,nop,TS val 1 ecr 1740464306], length 0
 20:17:49.241199 IP (tos 0x0, ttl 64, id 48422, offset 0, flags [DF], proto TCP (6), length 102, bad cksum 0 (->69a6)!)
     24.104.251.72.50135 > 199.19.57.1.53: Flags [P.], cksum 0xeed4 (correct), seq 1:51, ack 1, win 4197, options [nop,nop,TS val 1 ecr 1740464306], length 506747 [1au] DS? ntp.org. ar: . OPT UDPsize=4096 DO (48)
 20:17:49.263006 IP (tos 0x0, ttl 56, id 47173, offset 0, flags [DF], proto TCP (6), length 801)
     199.19.57.1.53 > 24.104.251.72.50135: Flags [P.], cksum 0x4dbc (correct), seq 1:750, ack 51, win 1040, options [nop,nop,TS val 1740464328 ecr 1], length 7496747*- q: DS? ntp.org. 0/6/1 ns: h9p7u7tr2u91d0v0ljs9l1gidnp90u3h.org. [1d] Type50, h9p7u7tr2u91d0v0ljs9l1gidnp
 90u3h.org. [1d] RRSIG, org. [15m] SOA a0.org.afilias-nst.info. noc.afilias-nst.info. 2012934004 1800 900 604800 86400, org. [15m] RRSIG, d7fdd278p5up3itk58hk4vor3le6df4s.org. [1d] Type50, d7fdd278p5up3itk58hk4vor3le6df4s.org. [1d] RRSIG ar: . OPT UDPsize=4096 DO (747)
 20:17:49.263397 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52, bad cksum 0 (->26ff)!)
     24.104.251.72.50135 > 199.19.57.1.53: Flags [F.], cksum 0xce9f (correct), seq 51, ack 750, win 4197, options [nop,nop,TS val 2 ecr 1740464328], length 0
 20:17:49.290458 IP (tos 0x0, ttl 56, id 47197, offset 0, flags [DF], proto TCP (6), length 52)
     199.19.57.1.53 > 24.104.251.72.50135: Flags [.], cksum 0xdad9 (correct), seq 750, ack 52, win 1040, options [nop,nop,TS val 1740464355 ecr 2], length 0
 20:17:49.291434 IP (tos 0x0, ttl 56, id 47198, offset 0, flags [DF], proto TCP (6), length 52)
     199.19.57.1.53 > 24.104.251.72.50135: Flags [F.], cksum 0xdad8 (correct), seq 750, ack 52, win 1040, options [nop,nop,TS val 1740464355 ecr 2], length 0
 20:17:49.291472 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52, bad cksum 0 (->26ff)!)
     24.104.251.72.50135 > 199.19.57.1.53: Flags [.], cksum 0xce83 (correct), seq 52, ack 751, win 4197, options [nop,nop,TS val 2 ecr 1740464355], length 0


From: Havard Eidnes <he@NetBSD.org>
To: tls@netbsd.org
Cc: gnats-bugs@NetBSD.org, netbsd-bugs@netbsd.org
Subject: Re: bin/51313: named caching server timeouts
Date: Fri, 06 Apr 2018 15:14:36 +0200 (CEST)

 > Here are some snippets from /var/log/messages as well as from packet =
 captures
 > on the client-serving (interior) and internet-facing (exterior) inter=
 faces
 > of my NetBSD firewall, showing a query for pool.ntp.org, which:
 > =

 > 1) Appears to have succeded, returning a response to the caching serv=
 er, but
 > 2) Results in ServFail being returned to the interior client, after a=
  delay.

 That would be identical to the sort of failure you would get if
 DNSSEC validation for the looked-up name fails.

 However, when looked at from my vantage point, neither ntp.org
 nor pool.ntp.org do not appear to be DNSSEC-signed.

 Keying pool.ntp.org into dnsviz.net reveals a few warnings,
 though, ref.

   http://dnsviz.net/d/pool.ntp.org/dnssec/

 but nothing catastrophic I think.

 Regards,

 - H=E5vard

From: "David H. Gutteridge" <david@gutteridge.ca>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/51313: named caching server timeouts
Date: Sat, 15 Sep 2018 22:00:15 -0400

 I've also been encountering this regularly with a NetBSD 8.0 system.
 It's presently running a releng build of the netbsd-8 branch from
 August 28th, with BIND 9.10.5-P1. (Which rules out any relation to
 PR/53421.)

 Dave


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.43 2018/01/16 07:36:43 maya Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2017 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.