NetBSD Problem Report #51753

From www@NetBSD.org  Fri Dec 30 09:48:58 2016
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 2F6947A318
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 30 Dec 2016 09:48:58 +0000 (UTC)
Message-Id: <20161230094856.AE0F57A344@mollari.NetBSD.org>
Date: Fri, 30 Dec 2016 09:48:56 +0000 (UTC)
From: marcotte@panix.com
Reply-To: marcotte@panix.com
To: gnats-bugs@NetBSD.org
Subject: tcp SACK causes SSH disconnects
X-Send-Pr-Version: www-1.0

>Number:         51753
>Category:       kern
>Synopsis:       tcp SACK causes SSH disconnects
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Dec 30 09:50:00 +0000 2016
>Closed-Date:    Mon Jun 04 10:34:20 +0000 2018
>Last-Modified:  Mon Jun 04 10:34:20 +0000 2018
>Originator:     Brian Marcotte
>Release:        7.0
>Organization:
Public Access Networks, Corp.
>Environment:
NetBSD trinity.nyc.access.net 7.0.2 NetBSD 7.0.2 (PANIX-XEN-STD) #1: Mon Nov 21 12:57:01 EST 2016  root@juggler.panix.com:/misc/obj/misc/devel/netbsd/7.0.2/src/sys/arch/i386/compile/PANIX-XEN-STD i386
>Description:
Ever since we started upgrading to NetBSD-7, we've been getting weird
SSH disconnects:

  client: Corrupted MAC on input. Disconnecting: Packet corrupt
  server: panix5 sshd[23482]: error: Received disconnect from x.x.x.x:
          2: Packet corrupt

It turns out that replacing only the kernel, while keeping the NetBSD-6
userland, is enough to make the problem show up. SSH client/server
versions don't appear to matter.

I traced this down to a change in the kernel between 2013-Nov-12 and
2013-Nov-13. I suspect the problem is in one of these files committed
on that day:

  sys/netinet/tcp_congctl.c   1.18
  sys/netinet/tcp_congctl.h   1.7
  sys/netinet/tcp_input.c     1.330
  sys/netinet/tcp_sack.c      1.29
  sys/netinet/tcp_subr.c      1.251
  sys/netinet/tcp_var.h       1.171

The above commits added "cubic" congestion control but also moved SACK
code around.


>How-To-Repeat:
In our case, certain types of terminal output can cause the problem.

I can now get it to happen somewhat reliably by compiling a NetBSD
kernel.

Some other network problem may also need to be present for this to
happen, as I've not seen anyone else report it.


>Fix:
I don't know how to fix it but turning off SACK seems to be a
workaround:

    sysctl -w net.inet.tcp.sack.enable=0
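
To keep the workaround across reboots, the same setting can go in
/etc/sysctl.conf, which the NetBSD rc scripts apply at boot. A minimal
sketch, assuming the affected machine uses the stock /etc/sysctl.conf
mechanism (the comment text is illustrative only):

```
# /etc/sysctl.conf
# Workaround for kern/51753: disable TCP selective acknowledgements.
net.inet.tcp.sack.enable=0
```

The value currently in effect can be checked with
`sysctl net.inet.tcp.sack.enable`.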


>Release-Note:

>Audit-Trail:
From: "John D. Baker" <jdbaker@mylinuxisp.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/51753: tcp SACK causes SSH disconnects
Date: Fri, 30 Dec 2016 09:20:52 -0600 (CST)

 This looks a lot like something I was seeing when working on some remote
 systems.  It was most likely to occur with high output traffic (such as
 when running an update build of the system).

 It also seemed to be affected by one or more intervening routers between
 the client and server.  A "direct" connection to the problem machine
 would be unstable but SSH-ing to another host first and then connecting
 to the problem system would be rock solid.  That "other host" could be
 another machine on the same LAN as the problem machine, also running
 NetBSD-7/amd64, or another remote machine (not on same LAN) running
 FreeBSD-10/amd64.

 The problem machine was later replaced and the behavior has not been
 observed on the replacement.  Said problem machine is now in my possession
 and did not exhibit the behavior when connected to my LAN.

 There was a thread about this in one of the other mailing lists some
 time back (either current-users@ or netbsd-users@), but I haven't been
 able to locate it so far.

 -- 
 |/"\ John D. Baker, KN5UKS               NetBSD     Darwin/MacOS X
 |\ / jdbaker[snail]mylinuxisp[flyspeck]com    OpenBSD            FreeBSD
 | X  No HTML/proprietary data in email.   BSD just sits there and works!
 |/ \ GPGkeyID:  D703 4A7E 479F 63F8 D3F4  BD99 9572 8F23 E4AD 1645

From: "John D. Baker" <jdbaker@mylinuxisp.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/51753: tcp SACK causes SSH disconnects
Date: Fri, 30 Dec 2016 09:48:51 -0600 (CST)

 On Fri, 30 Dec 2016, John D. Baker wrote:

 > There was a thread about this in one of the other mailing lists some
 > time back (either current-users@ or netbsd-users@), but I haven't been
 > able to locate it so far.

 I found the thread (started by the PR originator, actually), here:

   https://mail-index.netbsd.org/netbsd-users/2016/01/18/msg017647.html

 and my observation here:

   https://mail-index.netbsd.org/netbsd-users/2016/01/18/msg017654.html

 but it has only marginally more information than what I posted to this
 PR before.

 There was another, earlier thread I originated relating to the same
 issue, but I still can't find it.

 -- 
 |/"\ John D. Baker, KN5UKS               NetBSD     Darwin/MacOS X
 |\ / jdbaker[snail]mylinuxisp[flyspeck]com    OpenBSD            FreeBSD
 | X  No HTML/proprietary data in email.   BSD just sits there and works!
 |/ \ GPGkeyID:  D703 4A7E 479F 63F8 D3F4  BD99 9572 8F23 E4AD 1645

From: Nick Hudson <skrll@netbsd.org>
To: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc: 
Subject: Re: kern/51753: tcp SACK causes SSH disconnects
Date: Fri, 30 Dec 2016 17:11:18 +0000

 This is a multi-part message in MIME format.
 --------------030809060906030209040700
 Content-Type: text/plain; charset=windows-1252; format=flowed
 Content-Transfer-Encoding: 7bit

 On 12/30/16 09:50, marcotte@panix.com wrote:
 >> Number:         51753
 >> Category:       kern
 >> Synopsis:       tcp SACK causes SSH disconnects

 Does this help?

 Nick

 --------------030809060906030209040700
 Content-Type: text/x-patch;
  name="pr51753.diff"
 Content-Transfer-Encoding: 7bit
 Content-Disposition: attachment;
  filename="pr51753.diff"

 Index: sys/netinet/tcp_congctl.c
 ===================================================================
 RCS file: /cvsroot/src/sys/netinet/tcp_congctl.c,v
 retrieving revision 1.22
 diff -u -p -r1.22 tcp_congctl.c
 --- sys/netinet/tcp_congctl.c	13 Dec 2016 08:29:03 -0000	1.22
 +++ sys/netinet/tcp_congctl.c	30 Dec 2016 17:10:27 -0000
 @@ -707,7 +707,6 @@ tcp_newreno_fast_retransmit_newack(struc
  		tp->t_partialacks++;
  		TCP_TIMER_DISARM(tp, TCPT_REXMT);
  		tp->t_rtttime = 0;
 -		tp->snd_nxt = th->th_ack;

  		if (TCP_SACK_ENABLED(tp)) {
  			/*
 @@ -734,6 +733,7 @@ tcp_newreno_fast_retransmit_newack(struc
  			tp->t_flags |= TF_ACKNOW;
  			(void) tcp_output(tp);
  		} else {
 +			tp->snd_nxt = th->th_ack;
  			/*
  			 * Set snd_cwnd to one segment beyond ACK'd offset
  			 * snd_una is not yet updated when we're called


 --------------030809060906030209040700--
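
The effect of the diff above on snd_nxt can be sketched with a small
user-space model. This is not the kernel code: `toy_tcpcb` and
`toy_partial_ack` are hypothetical names, the `patched` flag exists only
to compare the two revisions, and everything else the real
tcp_newreno_fast_retransmit_newack() does is omitted. It shows only
which branch rewinds snd_nxt to the ACK'd offset before and after the
change.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily reduced stand-in for the kernel's struct tcpcb. */
struct toy_tcpcb {
	uint32_t snd_nxt;       /* next sequence number to send */
	bool     sack_enabled;  /* models TCP_SACK_ENABLED(tp) */
};

/*
 * Model of the snd_nxt handling on a partial ACK.
 *   patched == false: as in revision 1.22, snd_nxt is set to th_ack
 *                     before the SACK / non-SACK split, so both paths
 *                     see the rewound value.
 *   patched == true:  as in the diff, the assignment happens only in
 *                     the non-SACK branch.
 * Returns snd_nxt as left by this step.
 */
uint32_t
toy_partial_ack(struct toy_tcpcb *tp, uint32_t th_ack, bool patched)
{
	if (!patched)
		tp->snd_nxt = th_ack;   /* the line removed by the diff */

	if (tp->sack_enabled) {
		/* SACK path: with the patch, snd_nxt is left untouched here. */
	} else {
		if (patched)
			tp->snd_nxt = th_ack;   /* the line added by the diff */
	}
	return tp->snd_nxt;
}
```

For example, with snd_nxt at 5000 and a partial ACK for 2000 on a
SACK-enabled connection, the unpatched model returns 2000 and the
patched model returns 5000. Without SACK, both return 2000.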

From: Brian Marcotte <marcotte@panix.com>
To: Nick Hudson <skrll@netbsd.org>
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, gnats-bugs@NetBSD.org
Subject: Re: kern/51753: tcp SACK causes SSH disconnects
Date: Fri, 30 Dec 2016 13:18:02 -0500

 >  Does this help?

 It may be helping, but I'll need another day to be sure. I'll let you
 know over the weekend.

 Thanks.

 --
 - Brian

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/51753: tcp SACK causes SSH disconnects
Date: Sat, 31 Dec 2016 10:15:59 +0000

 On Fri, Dec 30, 2016 at 03:25:01PM +0000, John D. Baker wrote:
  >  This looks a lot like something I was seeing when working on some remote
  >  systems.  It was most likely to occur with high output traffic (such as
  >  when running an update build of the system).

 I've seen this too; it's readily triggered by compiling, and seems to
 be correlated with output volume. I'd thought it was flaky hardware :-/

 -- 
 David A. Holland
 dholland@netbsd.org

From: Brian Marcotte <marcotte@panix.com>
To: Nick Hudson <skrll@netbsd.org>
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, gnats-bugs@NetBSD.org
Subject: Re: kern/51753: tcp SACK causes SSH disconnects
Date: Sun, 1 Jan 2017 10:18:49 -0500

 > Subject: Re: kern/51753: tcp SACK causes SSH disconnects
 >  Does this help?

 I've not seen any disconnects since I applied the patch. It appears
 to resolve the problem.

 Thank you.

 --
 - Brian

From: "Nick Hudson" <skrll@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/51753 CVS commit: src/sys/netinet
Date: Mon, 2 Jan 2017 09:29:38 +0000

 Module Name:	src
 Committed By:	skrll
 Date:		Mon Jan  2 09:29:38 UTC 2017

 Modified Files:
 	src/sys/netinet: tcp_congctl.c

 Log Message:
 Restore behaviour to pre- tcp_congctl.c:1.18 for SACK.  Further analysis
 of the change is required.

 OK kefren@

 PR/51753 tcp SACK causes SSH disconnect


 To generate a diff of this commit:
 cvs rdiff -u -r1.22 -r1.23 src/sys/netinet/tcp_congctl.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/51753 CVS commit: [netbsd-7] src/sys/netinet
Date: Thu, 5 Jan 2017 08:08:46 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Thu Jan  5 08:08:46 UTC 2017

 Modified Files:
 	src/sys/netinet [netbsd-7]: tcp_congctl.c

 Log Message:
 Pull up following revision(s) (requested by skrll in ticket #1347):
 	sys/netinet/tcp_congctl.c: revision 1.23
 Restore behaviour to pre- tcp_congctl.c:1.18 for SACK.  Further analysis
 of the change is required.
 OK kefren@
 PR/51753 tcp SACK causes SSH disconnect


 To generate a diff of this commit:
 cvs rdiff -u -r1.19 -r1.19.4.1 src/sys/netinet/tcp_congctl.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/51753 CVS commit: [netbsd-7-0] src/sys/netinet
Date: Thu, 5 Jan 2017 09:11:14 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Thu Jan  5 09:11:14 UTC 2017

 Modified Files:
 	src/sys/netinet [netbsd-7-0]: tcp_congctl.c

 Log Message:
 Pull up following revision(s) (requested by skrll in ticket #1347):
 	sys/netinet/tcp_congctl.c: revision 1.23
 Restore behaviour to pre- tcp_congctl.c:1.18 for SACK.  Further analysis
 of the change is required.
 OK kefren@
 PR/51753 tcp SACK causes SSH disconnect


 To generate a diff of this commit:
 cvs rdiff -u -r1.19 -r1.19.8.1 src/sys/netinet/tcp_congctl.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->feedback
State-Changed-By: maya@NetBSD.org
State-Changed-When: Mon, 04 Jun 2018 09:34:29 +0000
State-Changed-Why:
Has this problem shown up since?


From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/51753 (tcp SACK causes SSH disconnects)
Date: Mon, 4 Jun 2018 11:37:29 +0200

 I am not the original submitter, but I used to see this problem
 regularly, and for me it has never happened again. Whatever that is worth.

 Martin

State-Changed-From-To: feedback->closed
State-Changed-By: maya@NetBSD.org
State-Changed-When: Mon, 04 Jun 2018 10:34:20 +0000
State-Changed-Why:
Feedback received from martin, who had the same bug in the past. Feel free to report another bug or reply if you are seeing the problem again. Thanks for the report & skrll for the fix.


>Unformatted:
