NetBSD Problem Report #57049
From kardel@Kardel.name Fri Oct 7 11:54:38 2022
Return-Path: <kardel@Kardel.name>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 1C57A1A923C
for <gnats-bugs@gnats.NetBSD.org>; Fri, 7 Oct 2022 11:54:38 +0000 (UTC)
Message-Id: <20221007114857.BE0F644B33@Andromeda.Kardel.name>
Date: Fri, 7 Oct 2022 13:48:57 +0200 (CEST)
From: kardel@netbsd.org
Reply-To: kardel@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: ARP nd.c issue? large TCP transfers NetBSD-Xen-Guest -> NetBSD-Xen-DOM0 abort with EHOSTDOWN
X-Send-Pr-Version: 3.95
>Number: 57049
>Category: kern
>Synopsis: ARP nd.c issue? large TCP transfers NetBSD-Xen-Guest -> NetBSD-Xen-DOM0 abort with EHOSTDOWN
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Oct 07 11:55:00 +0000 2022
>Last-Modified: Sat Nov 19 17:15:01 +0000 2022
>Originator: kardel@netbsd.org
>Release: NetBSD 9.99.100
>Organization:
>Environment:
System: NetBSD TEMPLATE-fix-hostname 9.99.100 NetBSD 9.99.100 (GENERIC) #6: Fri Oct 7 13:19:45 CEST 2022 kardel@gaiatest:/src/NetBSD/current/src/obj.amd64/sys/arch/amd64/compile/GENERIC amd64
Architecture: x86_64
Machine: amd64
>Description:
When copying large files (e.g. 20GB) via scp from a Xen guest to a Xen DOM0 the transfers often fail with EHOSTDOWN.
The errorcode comes from sys/net/nd.c:nd_resolve() (nd.c:384)
The error can be replicated with a simple ttcp test - see below:
Also, during the transfer, following routing messages can be observed on the guest:
got message of size 152 on Fri Oct 7 11:41:36 2022
RTM_MISS: Lookup failed on this address: len 152, pid 0, seq 0, errno 0, flags: 0x40<DONE>
locks: 0 inits: 0
sockaddrs: 0x3<DST,GATEWAY>
10.0.2.16 link#1
got message of size 152 on Fri Oct 7 11:41:39 2022
RTM_MISS: Lookup failed on this address: len 152, pid 0, seq 0, errno 0, flags: 0x40<DONE>
locks: 0 inits: 0
sockaddrs: 0x3<DST,GATEWAY>
10.0.2.16 link#1
got message of size 152 on Fri Oct 7 11:41:39 2022
RTM_ADD: Add Route: len 152, pid 0, seq 0, errno 0, flags: 0x2445<UP,HOST,DONE,LLINFO,CLONED>
locks: 0 inits: 0
sockaddrs: 0x3<DST,GATEWAY>
10.0.2.16 aa:bb:cc:dd:ee:ff
>How-To-Repeat:
on DOM0:
Zugspitze# ttcp -s -r
ttcp-r: buflen=8192, nbuf=2048, align=16384/0, port=5001 tcp
ttcp-r: socket
on guest:
Guest# ttcp -s -t -n 1000000 zugspitze
ttcp-t: socket
ttcp-t: connect
ttcp-t: IO: Host is down
errno=64
>Fix:
no workaround so far
>Release-Note:
>Audit-Trail:
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/57049: large TCP transfers NetBSD-Xen-Guest ->
NetBSD-Xen-DOM0 abort with EHOSTDOWN
Date: Fri, 7 Oct 2022 16:30:26 +0200
On Fri, Oct 07, 2022 at 11:55:00AM +0000, kardel@netbsd.org wrote:
> >Description:
> When copying large files (e.g. 20GB) via scp from a Xen guest to a Xen DOM0 the transfers often fail with EHOSTDOWN.
> The errorcode comes from sys/net/nd.c:nd_resolve() (nd.c:384)
> The error can be replicated with a simple ttcp test - see below:
> Also, during the transfer, following routing messages can be observed on the guest:
> got message of size 152 on Fri Oct 7 11:41:36 2022
> RTM_MISS: Lookup failed on this address: len 152, pid 0, seq 0, errno 0, flags: 0x40<DONE>
> locks: 0 inits: 0
> sockaddrs: 0x3<DST,GATEWAY>
> 10.0.2.16 link#1
> got message of size 152 on Fri Oct 7 11:41:39 2022
> RTM_MISS: Lookup failed on this address: len 152, pid 0, seq 0, errno 0, flags: 0x40<DONE>
> locks: 0 inits: 0
> sockaddrs: 0x3<DST,GATEWAY>
> 10.0.2.16 link#1
> got message of size 152 on Fri Oct 7 11:41:39 2022
> RTM_ADD: Add Route: len 152, pid 0, seq 0, errno 0, flags: 0x2445<UP,HOST,DONE,LLINFO,CLONED>
> locks: 0 inits: 0
> sockaddrs: 0x3<DST,GATEWAY>
> 10.0.2.16 aa:bb:cc:dd:ee:ff
>
> >How-To-Repeat:
> on DOM0:
> Zugspitze# ttcp -s -r
> ttcp-r: buflen=8192, nbuf=2048, align=16384/0, port=5001 tcp
> ttcp-r: socket
>
> on guest:
> Guest# ttcp -s -t -n 1000000 zugspitze
> ttcp-t: socket
> ttcp-t: connect
> ttcp-t: IO: Host is down
> errno=64
I can't reproduce this (-HEAD dom0, 9.3 domU):
proto:/home/bouyer>./ttcp -s -t -n 1000000 borneo
ttcp-t: buflen=8192, nbuf=1000000, align=16384/0, port=5001 tcp -> borneo
ttcp-t: socket
ttcp-t: connect
ttcp-t: 8192000000 bytes in 162.54 real seconds = 49218.69 KB/sec +++
ttcp-t: 1000000 I/O calls, msec/call = 0.17, calls/sec = 6152.34
ttcp-t: 3.0user 80.2sys 2:42real 51% 0i+0d 778maxrss 0+2pf 297382+270931csw
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc:
Subject: Re: kern/57049: large TCP transfers NetBSD-Xen-Guest ->
NetBSD-Xen-DOM0 abort with EHOSTDOWN
Date: Fri, 7 Oct 2022 16:43:52 +0200
Hi Manuel,
that is probably because the DOMU is 9.2 which still had the classic ARP
resolution code. In 9.99.x the ARP resolution
was replaced with a neighbour discovery derived code in nd.c. On Xen I
tripped over this issue with a 99.100 GENERIC guest quickly. It may be
that it
happens with other true network peers also, but I was not able to
trigger it with a true network peer right away.
Frank
On 10/07/22 16:35, Manuel Bouyer wrote:
> The following reply was made to PR kern/57049; it has been noted by GNATS.
>
> From: Manuel Bouyer <bouyer@antioche.eu.org>
> To: gnats-bugs@netbsd.org
> Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
> Subject: Re: kern/57049: large TCP transfers NetBSD-Xen-Guest ->
> NetBSD-Xen-DOM0 abort with EHOSTDOWN
> Date: Fri, 7 Oct 2022 16:30:26 +0200
>
> On Fri, Oct 07, 2022 at 11:55:00AM +0000, kardel@netbsd.org wrote:
> > >Description:
> > When copying large files (e.g. 20GB) via scp from a Xen guest to a Xen DOM0 the transfers often fail with EHOSTDOWN.
> > The errorcode comes from sys/net/nd.c:nd_resolve() (nd.c:384)
> > The error can be replicated with a simple ttcp test - see below:
> > Also, during the transfer, following routing messages can be observed on the guest:
> > got message of size 152 on Fri Oct 7 11:41:36 2022
> > RTM_MISS: Lookup failed on this address: len 152, pid 0, seq 0, errno 0, flags: 0x40<DONE>
> > locks: 0 inits: 0
> > sockaddrs: 0x3<DST,GATEWAY>
> > 10.0.2.16 link#1
> > got message of size 152 on Fri Oct 7 11:41:39 2022
> > RTM_MISS: Lookup failed on this address: len 152, pid 0, seq 0, errno 0, flags: 0x40<DONE>
> > locks: 0 inits: 0
> > sockaddrs: 0x3<DST,GATEWAY>
> > 10.0.2.16 link#1
> > got message of size 152 on Fri Oct 7 11:41:39 2022
> > RTM_ADD: Add Route: len 152, pid 0, seq 0, errno 0, flags: 0x2445<UP,HOST,DONE,LLINFO,CLONED>
> > locks: 0 inits: 0
> > sockaddrs: 0x3<DST,GATEWAY>
> > 10.0.2.16 aa:bb:cc:dd:ee:ff
> >
> > >How-To-Repeat:
> > on DOM0:
> > Zugspitze# ttcp -s -r
> > ttcp-r: buflen=8192, nbuf=2048, align=16384/0, port=5001 tcp
> > ttcp-r: socket
> >
> > on guest:
> > Guest# ttcp -s -t -n 1000000 zugspitze
> > ttcp-t: socket
> > ttcp-t: connect
> > ttcp-t: IO: Host is down
> > errno=64
>
> I can't reproduce this (-HEAD dom0, 9.3 domU):
> proto:/home/bouyer>./ttcp -s -t -n 1000000 borneo
> ttcp-t: buflen=8192, nbuf=1000000, align=16384/0, port=5001 tcp -> borneo
> ttcp-t: socket
> ttcp-t: connect
> ttcp-t: 8192000000 bytes in 162.54 real seconds = 49218.69 KB/sec +++
> ttcp-t: 1000000 I/O calls, msec/call = 0.17, calls/sec = 6152.34
> ttcp-t: 3.0user 80.2sys 2:42real 51% 0i+0d 778maxrss 0+2pf 297382+270931csw
>
> --
> Manuel Bouyer <bouyer@antioche.eu.org>
> NetBSD: 26 ans d'experience feront toujours la difference
> --
>
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Frank Kardel <kardel@netbsd.org>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: kern/57049: large TCP transfers NetBSD-Xen-Guest ->
NetBSD-Xen-DOM0 abort with EHOSTDOWN
Date: Fri, 7 Oct 2022 17:07:37 +0200
On Fri, Oct 07, 2022 at 04:43:52PM +0200, Frank Kardel wrote:
> Hi Manuel,
>
> that is probably because the DOMU is 9.2 which still had the classic ARP
> resolution code. In 9.99.x the ARP resolution
>
> was replaced with a neighbour discovery derived code in nd.c. On Xen I
> tripped over this issue with a 99.100 GENERIC guest quickly. It may be that
> it
>
> happens with other true network peers also, but I was not able to trigger it
> with a true network peer right away.
OK, with a HEAD domU I can reproduce this.
But I don't think this is Xen-specific. Maybe it's just some timing or
ressource issue that makes it more likely to happen on Xen.
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc:
Subject: Re: kern/57049: large TCP transfers NetBSD-Xen-Guest ->
NetBSD-Xen-DOM0 abort with EHOSTDOWN
Date: Sat, 19 Nov 2022 16:31:50 +0100
Yes I think it is somehow related to the new ARP code in nd.c.
New datapoint:
Reversing the transfer to send from DOM0 to guest survive longer. DOM0
is fine. on the guest the EHOSTDOWN return in nd.c:~390 is triggered often.
As failure of sending ACK does not terminate TCP connections that is why
the connection survives.
The timing pattern seems to be a mixture of 200ms (possible ACK
re-sends) and ~41 seconds (possibly the nd.c effect).
The EHOSTDOWN pattern looks like this:
[ 2174.429719] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2174.639722] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2217.636882] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2217.841108] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2218.051081] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2218.261120] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2259.258196] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2259.462455] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2259.672445] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2300.669503] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2300.873729] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2301.083775] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2301.293752] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2342.290807] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2342.495060] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2342.705039] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2342.915041] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2383.912120] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2384.116365] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2384.326380] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2384.536361] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2427.533491] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2427.737675] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2427.947745] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2428.157763] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2469.154813] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2469.358585] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2469.568578] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
Best regards,
Frank
On 10/07/22 17:10, Manuel Bouyer wrote:
> The following reply was made to PR kern/57049; it has been noted by GNATS.
>
> From: Manuel Bouyer <bouyer@antioche.eu.org>
> To: Frank Kardel <kardel@netbsd.org>
> Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
> netbsd-bugs@netbsd.org
> Subject: Re: kern/57049: large TCP transfers NetBSD-Xen-Guest ->
> NetBSD-Xen-DOM0 abort with EHOSTDOWN
> Date: Fri, 7 Oct 2022 17:07:37 +0200
>
> On Fri, Oct 07, 2022 at 04:43:52PM +0200, Frank Kardel wrote:
> > Hi Manuel,
> >
> > that is probably because the DOMU is 9.2 which still had the classic ARP
> > resolution code. In 9.99.x the ARP resolution
> >
> > was replaced with a neighbour discovery derived code in nd.c. On Xen I
> > tripped over this issue with a 99.100 GENERIC guest quickly. It may be that
> > it
> >
> > happens with other true network peers also, but I was not able to trigger it
> > with a true network peer right away.
>
> OK, with a HEAD domU I can reproduce this.
> But I don't think this is Xen-specific. Maybe it's just some timing or
> ressource issue that makes it more likely to happen on Xen.
>
> --
> Manuel Bouyer <bouyer@antioche.eu.org>
> NetBSD: 26 ans d'experience feront toujours la difference
> --
>
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc:
Subject: Re: kern/57049: large TCP transfers NetBSD-Xen-Guest ->
NetBSD-Xen-DOM0 abort with EHOSTDOWN
Date: Sat, 19 Nov 2022 18:12:36 +0100
More data points:
The arp state loops through following state sequence:
host (ip-address) at MAC on xennet0 expired U
host (ip-address) at MAC on xennet0 expired U
host (ip-address) at MAC on xennet0 30s R
host (ip-address) at MAC on xennet0 28s R
host (ip-address) at MAC on xennet0 27s R
host (ip-address) at MAC on xennet0 26s R
host (ip-address) at MAC on xennet0 25s R
host (ip-address) at MAC on xennet0 23s R
host (ip-address) at MAC on xennet0 22s R
host (ip-address) at MAC on xennet0 21s R
host (ip-address) at MAC on xennet0 20s R
host (ip-address) at MAC on xennet0 18s R
host (ip-address) at MAC on xennet0 17s R
host (ip-address) at MAC on xennet0 16s R
host (ip-address) at MAC on xennet0 15s R
host (ip-address) at MAC on xennet0 14s R
host (ip-address) at MAC on xennet0 13s R
host (ip-address) at MAC on xennet0 11s R
host (ip-address) at MAC on xennet0 10s R
host (ip-address) at MAC on xennet0 9s R
host (ip-address) at MAC on xennet0 7s R
host (ip-address) at MAC on xennet0 5s R
host (ip-address) at MAC on xennet0 4s R
host (ip-address) at MAC on xennet0 3s R
host (ip-address) at MAC on xennet0 2s R
host (ip-address) at MAC on xennet0 1s R
host (ip-address) at MAC on xennet0 5s D
host (ip-address) at MAC on xennet0 4s D
host (ip-address) at MAC on xennet0 3s D
host (ip-address) at MAC on xennet0 1s D
host (ip-address) at MAC on xennet0 1s P
host (ip-address) at MAC on xennet0 1s P
host (ip-address) at MAC on xennet0 1s P
host (ip-address) at MAC on xennet0 expired U
host (ip-address) at MAC on xennet0 expired U
host (ip-address) at MAC on xennet0 expired U
host (ip-address) at MAC on xennet0 30s R
host (ip-address) at MAC on xennet0 29s R
host (ip-address) at MAC on xennet0 28s R
host (ip-address) at MAC on xennet0 27s R
tcpdump on xennet0:
16:13:26.259366 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
host (MAC) tell guest-host, length 28
16:13:27.259380 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
host (MAC) tell guest-host, length 28
16:13:28.259414 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
host (MAC) tell guest-host, length 28
16:13:29.259411 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
host (MAC) tell guest-host, length 28
16:13:30.259434 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
host (MAC) tell guest-host, length 28
16:13:31.259465 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
host (MAC) tell guest-host, length 28
16:13:32.259516 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
host tell guest-host, length 28
16:13:32.259606 ARP, Ethernet (len 6), IPv4 (len 4), Reply host is-at
MAC, length 28
16:13:32.259608 ARP, Ethernet (len 6), IPv4 (len 4), Reply host is-at
MAC, length 28
16:13:32.259609 ARP, Ethernet (len 6), IPv4 (len 4), Reply host is-at
MAC, length 28
16:13:32.259609 ARP, Ethernet (len 6), IPv4 (len 4), Reply host is-at
MAC, length 28
16:13:32.259610 ARP, Ethernet (len 6), IPv4 (len 4), Reply host is-at
MAC, length 28
16:13:32.259610 ARP, Ethernet (len 6), IPv4 (len 4), Reply host is-at
MAC, length 28
tcpdump on guest interfac in DOM0 connected to bridge0:
16:26:03.745507 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
host (MAC) tell guest-host, length 28
16:26:04.745537 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
host (MAC) tell guest-host, length 28
16:26:05.745559 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
host (MAC) tell guest-host, length 28
16:26:06.745599 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
host (MAC) tell guest-host, length 28
16:26:07.745636 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
host (MAC) tell guest-host, length 28
16:26:08.745725 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
host tell guest-host, length 28
16:26:08.745793 ARP, Ethernet (len 6), IPv4 (len 4), Reply host is-at
MAC, length 28
16:26:08.745797 ARP, Ethernet (len 6), IPv4 (len 4), Reply host is-at
MAC, length 28
16:26:08.745800 ARP, Ethernet (len 6), IPv4 (len 4), Reply host is-at
MAC, length 28
16:26:08.745803 ARP, Ethernet (len 6), IPv4 (len 4), Reply host is-at
MAC, length 28
16:26:08.745812 ARP, Ethernet (len 6), IPv4 (len 4), Reply host is-at
MAC, length 28
16:26:08.745815 ARP, Ethernet (len 6), IPv4 (len 4), Reply host is-at
MAC, length 28
So we see that Replys are delayed (possibly queued) for 6 seconds. of
the 7 requests 6 are being replied to.
Why do we see this delay/queueing on a busy bridge interface?
arp.c has bridge conditional code.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.47 2022/09/11 19:34:41 kim Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2022
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.