NetBSD Problem Report #55800
From kim@tac.gw.fi Tue Nov 10 08:39:52 2020
Return-Path: <kim@tac.gw.fi>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id A655A1A9246
for <gnats-bugs@gnats.NetBSD.org>; Tue, 10 Nov 2020 08:39:52 +0000 (UTC)
Message-Id: <20201110083944.E8A1794179@rendez-vous.gw.fi>
Date: Tue, 10 Nov 2020 10:39:44 +0200 (EET)
From: kim@netbsd.org (Kimmo Suominen)
Reply-To:
To: gnats-bugs@NetBSD.org
Subject: Data transfers stall when SACK is enabled
X-Send-Pr-Version: 3.95
>Number: 55800
>Category: kern
>Synopsis: Data transfers stall when SACK is enabled
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Nov 10 08:40:00 +0000 2020
>Last-Modified: Tue Nov 22 09:15:01 +0000 2022
>Originator: kim@netbsd.org (Kimmo Suominen)
>Release: NetBSD 9.99.75 (202011081900Z)
>Organization:
>Environment:
System: NetBSD rendez-vous.gw.fi 9.99.75 NetBSD 9.99.75 (GENERIC) #0: Sun Nov 8 18:27:14 UTC 2020 mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
Architecture: x86_64
Machine: amd64
>Description:
Transferring files using rsync over ssh stalls after about 1 GB
of data. (The stall might not be related to the amount of data,
though.) The connection is over IPv4.
When the transfer stalls there is always some unresolved SACK
outstanding. SACKs also occurred regularly throughout the transfer,
so not every occurrence of SACK results in a stall.
In the stalled state it looks like ssh is not getting any data
through (and therefore rsync is not receiving anything). I have
tcpdump output available here:
https://www.netbsd.org/~kim/NB-RSYNC-PROBLEM.txt
The last transfer stalled at 1:27. Then there are some packets
exchanged at 2:27 and 3:27. At 4:27 the connection is closed.
This would appear to match the sshd_config settings I have:
    TCPKeepAlive no
    ClientAliveInterval 3600
    ClientAliveCountMax 3
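To make the arithmetic behind that match explicit, here is a minimal sketch. It assumes, as the observation above suggests, one keepalive check per ClientAliveInterval after the stall and a close on check number ClientAliveCountMax. The function name alive_timeline is made up for this illustration, and the sketch says nothing about sshd's exact internals.

```sh
#!/bin/sh
# Illustration only: print the wall-clock times at which keepalive checks
# and the final close would be expected after a stall.
# Arguments: stall hour, stall minute, interval in seconds, count max.
# Assumption: one check per interval, connection closed on the last one.
alive_timeline() {
    start_min=$(( $1 * 60 + $2 ))
    interval_min=$(( $3 / 60 ))
    count_max=$4
    n=1
    while [ "$n" -le "$count_max" ]; do
        t=$(( start_min + n * interval_min ))
        if [ "$n" -lt "$count_max" ]; then
            what="keepalive check"
        else
            what="connection closed"
        fi
        printf '%d:%02d %s\n' $(( t / 60 % 24 )) $(( t % 60 )) "$what"
        n=$(( n + 1 ))
    done
}

# Stall at 1:27 with ClientAliveInterval 3600 and ClientAliveCountMax 3:
alive_timeline 1 27 3600 3
# 2:27 keepalive check
# 3:27 keepalive check
# 4:27 connection closed
```

The three printed times line up with the packets seen at 2:27 and 3:27 and the close at 4:27.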
The output on the terminal running rsync is as follows:
    Timeout, server equinoxe not responding.
    rsync: connection unexpectedly closed (949438584 bytes received so far) [receiver]
    rsync error: error in rsync protocol data stream (code 12) at io.c(228) [receiver=3.2.3]
    rsync: connection unexpectedly closed (14688411 bytes received so far) [generator]
    rsync error: unexplained error (code 255) at io.c(228) [generator=3.2.3]
    rsync: [generator] write error: Broken pipe (32)
I'm guessing the first line is from ssh, although I have not
verified that.
The remote side is running the NetBSD 9.1 release:
    NetBSD 9.1 (GENERIC) #0: Sun Oct 18 19:24:30 UTC 2020
The local side is running the most recent -current snapshot:
    NetBSD 9.99.75 (GENERIC) #0: Sun Nov 8 18:27:14 UTC 2020
When I first noticed the issue I was running a slightly older
-current (build ID derived from CVS checkout timestamp):
    NetBSD 9.99.74 (GENERIC.202010172211Z~GW) #1: Sun Oct 18 02:20:50 EEST 2020
>How-To-Repeat:
This is the command I ran:
    rsync -aHSs --delete --exclude /branch/ --exclude /daily/ \
        --exclude /git/ --exclude /hg/ --exclude /releases/ \
        --exclude /work/ --exclude /www/ equinoxe:/p/netbsd/ \
        /p/netbsd/
Possibly any data transfer with enough data will do.
>Fix:
A successful workaround was to disable SACK on the local side:
    sysctl -w net.inet.tcp.sack.enable=0
This transfer was using IPv4, but I did also disable IPv6 SACK:
    sysctl -w net.inet6.tcp6.sack.enable=0
>Audit-Trail:
From: Kimmo Suominen <kim@netbsd.org>
To: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Cc:
Subject: Re: kern/55800: Data transfers stall when SACK is enabled
Date: Tue, 10 Nov 2020 17:01:32 +0200
Sadly this also happens if both ends are running 9.1. The transfer
stalled after about 1 GB of data having been transferred. After
disabling SACK (on the local end) and restarting the transfer, it
completed without any stalling (transferring another 9 GB of data).
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/55800: Data transfers stall when SACK is enabled
Date: Thu, 12 Nov 2020 07:08:01 -0000 (UTC)
kim@netbsd.org (Kimmo Suominen) writes:
> Sadly this also happens if both ends are running 9.1. The transfer
> stalled after about 1 GB of data having been transferred. After
> disabling SACK (on the local end) and restarting the transfer, it
> completed without any stalling (transferring another 9 GB of data).
The tcpdump shows that some packet from the sender gets lost and
the receiver SACKs subsequent segments. The sender then stops.
After an hour, the receiver pushes an ACK (still only for the
received segments), but the sender continues sending packets of the
stream as if nothing has happened. No retry packet is received
in the sequence.
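For anyone who wants to look at the same pattern in a saved trace, a rough sketch follows. It assumes the capture is tcpdump's text output in its current form, where SACK blocks appear as "sack N {left:right}" inside the options list and the SYN-time negotiation appears as "sackOK". Older tcpdump versions print options differently, so the pattern may need adjusting. The function name sack_lines is made up for this example.

```sh
#!/bin/sh
# Illustration only: print the lines of a tcpdump text capture that carry
# SACK blocks, e.g. "... options [nop,nop,sack 1 {2897:4345}] ...".
# The pattern requires a block count and an opening brace, so SYN packets
# that only advertise "sackOK" are not matched.
sack_lines() {
    grep -E 'sack [0-9]+ \{' "$@"
}
```

Something like "sack_lines NB-RSYNC-PROBLEM.txt | tail -n 20" would then show the last SACK-bearing ACKs before the capture goes quiet.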
Is there anything between sender and receiver, such as a firewall or
packet filter (also on either machine), that could drop the retry packets?
--
Michael van Elst
Internet: mlelstv@serpens.de
"A potential Snark may lurk in every tree."
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/55800: Data transfers stall when SACK is enabled
Date: Tue, 22 Nov 2022 10:13:31 +0100
Another data point:
The problem still exists in 9.99.100 (and likely thereafter).
After a missed sequence of 17k bytes, the receiver sends an ACK for the
last successful segment before the hole, including SACK blocks for the
segments received past the missed segments.
These ACKs (with SACK) reach the sender, and the sender resumes sending
for about 4-5k and then stalls. (Traces are available.)
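As a rough illustration of what such an ACK tells the sender (in the sense of RFC 2018, not NetBSD's actual kernel logic): the cumulative ACK marks the left edge of the missing data, and every gap between that edge and the SACK blocks is a hole that still needs retransmitting. The sketch below assumes the blocks are given in ascending order and ignores 32-bit sequence wraparound. The function name sack_holes and the sequence numbers are made up, not taken from the traces.

```sh
#!/bin/sh
# Illustration only: list the holes implied by a cumulative ACK plus SACK
# blocks. Arguments: ack, then left/right pairs, one pair per SACK block.
# Assumptions: blocks sorted ascending, no sequence number wraparound.
sack_holes() {
    edge=$1
    shift
    while [ "$#" -ge 2 ]; do
        left=$1
        right=$2
        shift 2
        # Anything between the current edge and this block is missing.
        if [ "$left" -gt "$edge" ]; then
            echo "hole $edge:$left ($(( left - edge )) bytes)"
        fi
        # Data up to the right edge of this block has been received.
        if [ "$right" -gt "$edge" ]; then
            edge=$right
        fi
    done
}

# Hypothetical numbers: ACK at 1000, one SACK block covering 18000-22000.
sack_holes 1000 18000 22000
# hole 1000:18000 (17000 bytes)
```

With numbers like these, a sender honouring SACK would be expected to retransmit the 17000-byte hole before sending new data.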
Disabling SACK helps as already observed.
Frank