NetBSD Problem Report #55800

From kim@tac.gw.fi  Tue Nov 10 08:39:52 2020
Return-Path: <kim@tac.gw.fi>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id A655A1A9246
	for <gnats-bugs@gnats.NetBSD.org>; Tue, 10 Nov 2020 08:39:52 +0000 (UTC)
Message-Id: <20201110083944.E8A1794179@rendez-vous.gw.fi>
Date: Tue, 10 Nov 2020 10:39:44 +0200 (EET)
From: kim@netbsd.org (Kimmo Suominen)
Reply-To:
To: gnats-bugs@NetBSD.org
Subject: Data transfers stall when SACK is enabled
X-Send-Pr-Version: 3.95

>Number:         55800
>Category:       kern
>Synopsis:       Data transfers stall when SACK is enabled
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Nov 10 08:40:00 +0000 2020
>Last-Modified:  Tue Nov 22 09:15:01 +0000 2022
>Originator:     kim@netbsd.org (Kimmo Suominen)
>Release:        NetBSD 9.99.75 (202011081900Z)
>Organization:
>Environment:
System: NetBSD rendez-vous.gw.fi 9.99.75 NetBSD 9.99.75 (GENERIC) #0: Sun Nov 8 18:27:14 UTC 2020 mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
Architecture: x86_64
Machine: amd64
>Description:

	Transferring files using rsync over ssh stalls after about 1 GB
	of data has been transferred.  (The stall may not be related to
	the amount of data, though.)  The connection is over IPv4.

	When the transfer stalls, there is always some unresolved SACK.
	I also observed regular bouts of SACK throughout the transfer,
	so not every occurrence of SACK results in a stall.

	In the stalled state it looks like ssh is not getting any data
	through (and therefore rsync is not receiving anything).  I have
	tcpdump output available here:

	    https://www.netbsd.org/~kim/NB-RSYNC-PROBLEM.txt

	The last transfer stalled at 1:27.  Then there are some packets
	exchanged at 2:27 and 3:27.  At 4:27 the connection is closed.
	This would appear to match the sshd_config settings I have:

	    TCPKeepAlive no
	    ClientAliveInterval 3600
	    ClientAliveCountMax 3
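
	As a rough check of that match, the arithmetic can be sketched in
	shell.  This assumes sshd sends about one probe per
	ClientAliveInterval and gives up after roughly ClientAliveCountMax
	silent intervals; alive_timeline is a hypothetical helper, not
	part of ssh.

```shell
# Sketch only: print the expected probe/close times after a stall.
# Assumption: one probe per interval, close after count-max intervals.
alive_timeline() {
    # $1 = stall time as H:MM, $2 = interval in seconds, $3 = count max
    start_h=${1%%:*}
    start_m=${1##*:}
    start_m=${start_m#0}          # avoid octal parsing of e.g. "08"
    start_s=$(( start_h * 3600 + ${start_m:-0} * 60 ))
    i=1
    while [ "$i" -le "$3" ]; do
        t=$(( start_s + i * $2 ))
        if [ "$i" -lt "$3" ]; then label=probe; else label=close; fi
        printf '%d:%02d %s\n' $(( t / 3600 )) $(( t % 3600 / 60 )) "$label"
        i=$(( i + 1 ))
    done
}

# Values from this report: stall at 1:27, interval 3600, count max 3.
alive_timeline 1:27 3600 3
```

	With the values from this report it prints probes at 2:27 and
	3:27 and the close at 4:27.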

	The output on the terminal running rsync is as follows:

	    Timeout, server equinoxe not responding.
	    rsync: connection unexpectedly closed (949438584 bytes received so far) [receiver]
	    rsync error: error in rsync protocol data stream (code 12) at io.c(228) [receiver=3.2.3]
	    rsync: connection unexpectedly closed (14688411 bytes received so far) [generator]
	    rsync error: unexplained error (code 255) at io.c(228) [generator=3.2.3]
	    rsync: [generator] write error: Broken pipe (32)

	I'm guessing the first line is from ssh, although I have not
	verified that.

	The remote side is running the NetBSD 9.1 release:

	    NetBSD 9.1 (GENERIC) #0: Sun Oct 18 19:24:30 UTC 2020

	The local side is running the most recent -current snapshot:

	    NetBSD 9.99.75 (GENERIC) #0: Sun Nov 8 18:27:14 UTC 2020

	When I first noticed the issue I was running a slightly older
	-current (build ID derived from CVS checkout timestamp):

	    NetBSD 9.99.74 (GENERIC.202010172211Z~GW) #1: Sun Oct 18 02:20:50 EEST 2020

>How-To-Repeat:

	This is the command I ran:

	    rsync -aHSs --delete --exclude /branch/ --exclude /daily/ \
		--exclude /git/ --exclude /hg/ --exclude /releases/ \
		--exclude /work/ --exclude /www/ equinoxe:/p/netbsd/ \
		/p/netbsd/

	Possibly any data transfer with enough data will do.

>Fix:

	A successful workaround was to disable SACK on the local side:

	    sysctl -w net.inet.tcp.sack.enable=0

	This transfer was using IPv4, but I did also disable IPv6 SACK:

	    sysctl -w net.inet6.tcp6.sack.enable=0
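
	To keep the workaround across reboots, the same settings could be
	added to /etc/sysctl.conf.  A minimal sketch, assuming the usual
	name=value format of that file; persist_sack_off is a hypothetical
	helper and takes the file path as an argument so it can be tried
	on a scratch file first.

```shell
# Sketch only: append the SACK workaround to a sysctl.conf-style file.
persist_sack_off() {
    conf=${1:-/etc/sysctl.conf}
    for name in net.inet.tcp.sack.enable net.inet6.tcp6.sack.enable; do
        # Leave any existing entry for this name alone (whatever its
        # value), so repeated runs do not add duplicate lines.
        grep -q "^${name}=" "$conf" 2>/dev/null || echo "${name}=0" >> "$conf"
    done
}

# Try it on a scratch file rather than the live configuration.
scratch=$(mktemp)
persist_sack_off "$scratch"
cat "$scratch"
rm -f "$scratch"
```

	Note that an existing entry with a different value is not changed
	by this sketch and would have to be edited by hand.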

>Audit-Trail:
From: Kimmo Suominen <kim@netbsd.org>
To: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org
Cc: 
Subject: Re: kern/55800: Data transfers stall when SACK is enabled
Date: Tue, 10 Nov 2020 17:01:32 +0200

 Sadly this also happens if both ends are running 9.1.  The transfer
 stalled after about 1 GB of data having been transferred.  After
 disabling SACK (on the local end) and restarting the transfer, it
 completed without any stalling (transferring another 9 GB of data).

From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/55800: Data transfers stall when SACK is enabled
Date: Thu, 12 Nov 2020 07:08:01 -0000 (UTC)

 kim@netbsd.org (Kimmo Suominen) writes:

 > Sadly this also happens if both ends are running 9.1.  The transfer
 > stalled after about 1 GB of data having been transferred.  After
 > disabling SACK (on the local end) and restarting the transfer, it
 > completed without any stalling (transferring another 9 GB of data).

 The tcpdump shows that some packet from the sender gets lost and
 the receiver SACKs subsequent segments. The sender then stops.

 After an hour, the receiver pushes an ACK (still only for the
 received segments), but the sender continues sending packets of the
 stream as if nothing has happened. No retry packet is received
 in the sequence.
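
 For illustration, the sender-side bookkeeping this implies can be
 sketched as follows.  The sequence numbers are made up, wraparound is
 ignored, and sack_holes is a hypothetical helper, not the kernel's
 code: given the cumulative ACK and the SACK blocks, it lists the
 ranges the sender would be expected to retransmit.

```shell
# Sketch only: list the unacknowledged gaps implied by an ACK + SACK.
sack_holes() {
    # $1 = cumulative ACK; remaining args = SACK blocks as left-right,
    # assumed to be in ascending order.
    next=$1
    shift
    for blk in "$@"; do
        left=${blk%-*}
        right=${blk#*-}
        # Anything between the highest byte covered so far and the
        # start of this block is a hole.
        if [ "$left" -gt "$next" ]; then
            echo "$next-$left"
        fi
        if [ "$right" -gt "$next" ]; then
            next=$right
        fi
    done
}

# Hypothetical numbers: ACK 1000, two SACKed blocks beyond it.
sack_holes 1000 2448-3896 5344-6792
```

 For these made-up values the holes are 1000-2448 and 3896-5344.  In
 the stalled state described here, no retransmission for such a range
 shows up in the trace.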

 Is there anything between the sender and the receiver, such as a
 firewall or a packet filter (also on either machine), that could drop
 the retry packets?

 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/55800: Data transfers stall when SACK is enabled
Date: Tue, 22 Nov 2022 10:13:31 +0100

 Another data point:

 The problem still exists in 9.99.100 (and likely thereafter).

 After a missed sequence of 17k bytes, the receiver sends an ACK for
 the last successful segment before the hole, including SACK for the
 segments received past the missed ones.

 These ACKs (with SACKs) reach the sender, which resumes sending for
 about 4-5k and then stalls.  (Traces are available.)

 Disabling SACK helps as already observed.

 Frank

