NetBSD Problem Report #51057
From www@NetBSD.org Sat Apr 9 09:47:49 2016
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
by mollari.NetBSD.org (Postfix) with ESMTPS id DDF6F7A497
for <gnats-bugs@gnats.NetBSD.org>; Sat, 9 Apr 2016 09:47:49 +0000 (UTC)
Message-Id: <20160409094748.C365F7AAA1@mollari.NetBSD.org>
Date: Sat, 9 Apr 2016 09:47:48 +0000 (UTC)
From: cryintothebluesky@googlemail.com
Reply-To: cryintothebluesky@googlemail.com
To: gnats-bugs@NetBSD.org
Subject: hme(4) device driver bug when tcp4csum and udp4csum are enabled
X-Send-Pr-Version: www-1.0
>Number: 51057
>Category: kern
>Synopsis: hme(4) device driver bug when tcp4csum and udp4csum are enabled
>Confidential: no
>Severity: non-critical
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Apr 09 09:50:00 +0000 2016
>Last-Modified: Mon Apr 11 16:00:01 +0000 2016
>Originator: Roman
>Release: 7.0_STABLE
>Organization:
>Environment:
NetBSD ultra10 7.0_STABLE NetBSD 7.0_STABLE (GENERIC) #0: Sun Apr 3 17:13:43 BST 2016 roman@atom510:/opt/obj.sparc64/sys/arch/sparc64/compile/GENERIC sparc64
>Description:
All machines are on the same LAN, connected to a single
100Mbps switch. When I run scp from NetBSD-7 on amd64 to NetBSD-7 on
sparc64, sooner or later sshd on sparc64 exits with an error.
This is how I reproduce it:
1. Enable tcp4csum and udp4csum for hme0 on sparc64
2. Simulate heavy I/O on sparc64 by unpacking src.tgz which contains
pkgsrc, src and xsrc
3. Start scp for a 3GB file from amd64 to sparc64
Sooner or later sshd on sparc64 exits with an error.
Disabling tcp4csum and udp4csum for hme0 and repeating the above steps,
scp always copies the entire 3GB file without any errors.
So I'm inclined to think it's either an hme0 hardware issue or a NetBSD
kernel bug.
ktrace shows the following:
161 1 sshd CALL read(3,0xffffffffffff7310,0x4000)
161 1 sshd GIO fd 3 read 56 bytes
"QB\M-u\M-?J\M-V\^C\M-^\M-;i$\M^L\M-kWJ\^S\M-6\M^M\M-C\M-5;\M-u\M^OpV\^Z\M-o7\M-|\M-v8\M^G-dP\r\M^R\M-h7\M-15j\
\M-)\^V)\M-UI\M-s\^Q\^E\M-S\^C\M^E?\M-02"
161 1 sshd RET read 56/0x38
161 1 sshd CALL __gettimeofday50(0xffffffffffff8710,0)
161 1 sshd RET __gettimeofday50 0
161 1 sshd CALL __sysctl(0xffffffffffff8608,2,0xffffffffffff99e0,0xffffffffffff8600,0,0)
161 1 sshd RET __sysctl 0
161 1 sshd CALL getpid
161 1 sshd RET getpid 161/0xa1, 767/0x2ff
161 1 sshd CALL __socket30(1,0x10000002,0)
161 1 sshd RET __socket30 9
161 1 sshd CALL connect(9,0xfffffffffca65218,0x6a)
161 1 sshd MISC mbsoname: [/var/run/log]
161 1 sshd NAMI "/var/run/log"
161 1 sshd RET connect 0
161 1 sshd CALL sendto(9,0xffffffffffff9048,0x53,0,0,0)
161 1 sshd MISC msghdr: [name=0x0, namelen=0, iov=0x12d967b10, iovlen=1, control=0x0, controllen=0, flags=0]
161 1 sshd GIO fd 9 wrote 83 bytes
"<38>1 2016-04-07T15:28:58.973455+01:00 ultra10 sshd 161 - - Corrupted MAC on input."
161 1 sshd RET sendto 83/0x53
161 1 sshd CALL close(9)
161 1 sshd RET close 0
161 1 sshd CALL __gettimeofday50(0xffffffffffff89c0,0)
161 1 sshd RET __gettimeofday50 0
161 1 sshd CALL __sysctl(0xffffffffffff88b8,2,0xffffffffffff9c90,0xffffffffffff88b0,0,0)
161 1 sshd RET __sysctl 0
161 1 sshd CALL getpid
161 1 sshd RET getpid 161/0xa1, 767/0x2ff
161 1 sshd CALL __socket30(1,0x10000002,0)
161 1 sshd RET __socket30 9
161 1 sshd CALL connect(9,0xfffffffffca65218,0x6a)
161 1 sshd MISC mbsoname: [/var/run/log]
161 1 sshd NAMI "/var/run/log"
161 1 sshd RET connect 0
161 1 sshd CALL sendto(9,0xffffffffffff92f8,0x80,0,0,0)
161 1 sshd MISC msghdr: [name=0x0, namelen=0, iov=0x12d967b10, iovlen=1, control=0x0, controllen=9, flags=0]
161 1 sshd GIO fd 9 wrote 128 bytes
"<34>1 2016-04-07T15:28:59.045004+01:00 ultra10 sshd 161 - - fatal: ssh_dispatch_run_fatal: message authenticat\
ion code incorrect"
161 1 sshd RET sendto 128/0x80
161 1 sshd CALL close(9)
161 1 sshd RET close 0
I suspect that "Corrupted MAC on input" has something to do with TCP checksum offload on the network card.
ultra10# cat /etc/ifconfig.hme0
up
192.168.1.3 netmask 255.255.255.0 tcp4csum udp4csum
Others report similar issues:
https://tty1.net/blog/2014/ssh-corrupted-mac-on-input_en.html
>How-To-Repeat:
>Fix:
>Audit-Trail:
From: "John D. Baker" <jdbaker@mylinuxisp.com>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/51057: hme(4) device driver bug when tcp4csum and udp4csum
are enabled
Date: Sat, 9 Apr 2016 09:32:35 -0500 (CDT)
I see something similar on a machine with a qfe (4 x hme) sbus card that
I use as a NAT firewall/router. Both internal and external networks are
on the qfe card. If "tcp4csum-rx" is enabled on the internal interface,
FTP connections from the local LAN are not forwarded to 'ftp-proxy'.
There was one remote machine I connected to that would drop the SSH
connection with messages similar to those reported here, but it was
the only one that exhibited the problem.
Lately, I have run into one particular web site that starts to work and
then fails and, once failed, fails for all hosts on my internal LAN for a
couple of days, including one running redmond-OS. The website works
for hosts outside my LAN (via excruciatingly slow X11 forwarding over
SSH). I briefly disabled all tcp4csum options on both the internal and
external interfaces, but that didn't help. I'll try again by rebooting
the firewall so it comes up with those options disabled.
--
|/"\ John D. Baker, KN5UKS NetBSD Darwin/MacOS X
|\ / jdbaker[snail]mylinuxisp[flyspeck]com OpenBSD FreeBSD
| X No HTML/proprietary data in email. BSD just sits there and works!
|/ \ GPGkeyID: D703 4A7E 479F 63F8 D3F4 BD99 9572 8F23 E4AD 1645
From: "John D. Baker" <jdbaker@mylinuxisp.com>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/51057: hme(4) device driver bug when tcp4csum and udp4csum
are enabled
Date: Sun, 10 Apr 2016 15:04:08 -0500 (CDT)
I arranged to reboot my NAT router/firewall with all qfe/hme hardware
assist options off. The problem website no longer fails catastrophically,
but neither does it work correctly.
I have a PCI qfe card lying around somewhere, so I could potentially test
the behavior in a non-SPARC system (macppc, i386, or amd64 fairly readily,
prep if I can get my MTX604 to boot again). I have to find it first.
Back when I was using a qec/qe card (and the built-in le interface), I
don't recall having any problems.
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org, cryintothebluesky@googlemail.com
Subject: re: kern/51057: hme(4) device driver bug when tcp4csum and udp4csum are enabled
Date: Mon, 11 Apr 2016 06:09:29 +1000
FWIW, i am successfully using the PCI quad hme(4) in a sunblade 2500,
although i am only using one interface so far.
i have this memory of people saying quad hme(4) @ sbus was a problem
a *long* time ago.
.mrg.
From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: matthew green <mrg@eterna.com.au>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
netbsd-bugs@NetBSD.org, cryintothebluesky@googlemail.com
Subject: re: kern/51057: hme(4) device driver bug when tcp4csum and udp4csum
are enabled
Date: Sun, 10 Apr 2016 23:06:38 +0200
On Mon, 11 Apr 2016 06:09:29 +1000, matthew green wrote:
> i have this memory of people saying quad hme(4) @ sbus was a problem
> a *long* time ago.
I ran NetBSD on an SS20 with a quad hme as a router 24/7 for a few
years. It was replaced because the SM71 cpus died, not because of the
hme.
FWIW,
hauke
--
Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
Ernst-Ludwig-Straße 15
64625 Bensheim
Germany
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org, cryintothebluesky@googlemail.com
Subject: re: kern/51057: hme(4) device driver bug when tcp4csum and udp4csum are enabled
Date: Mon, 11 Apr 2016 06:14:03 +1000
hmm, i see there is an uncommitted patch in PR 24310 for hme(4).
i don't see it being directly related, but it's the only thing i can
see that isn't committed or clearly unrelated.
.mrg.
From: "John D. Baker" <jdbaker@mylinuxisp.com>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/51057: hme(4) device driver bug when tcp4csum and udp4csum
are enabled
Date: Mon, 11 Apr 2016 10:57:45 -0500 (CDT)
As a (relatively) quick test, I replaced the SPARC/qfe-based router with
a Soekris net4501, and the problem website works fine now.
I have a few single-port hme@sbus cards that I could substitute for up to
three of the qfe ports to see how they behave.
Copyright © 1994-2014
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.