NetBSD Problem Report #41140

From kardel@pip.acrys.com  Sat Apr  4 15:56:06 2009
Return-Path: <kardel@pip.acrys.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by www.NetBSD.org (Postfix) with ESMTP id 4C5AC63B8A5
	for <gnats-bugs@gnats.NetBSD.org>; Sat,  4 Apr 2009 15:56:06 +0000 (UTC)
Message-Id: <200904041453.n34ErrTJ002169@pip.acrys.com>
Date: Sat, 4 Apr 2009 16:53:53 +0200 (MEST)
From: kardel@netbsd.org
Reply-To: kardel@netbsd.org
To: gnats-bugs@gnats.NetBSD.org
Subject: 5-RC3 msk driver possibly broken
X-Send-Pr-Version: 3.95

>Number:         41140
>Category:       kern
>Synopsis:       ssh/bacula disconnect with errors when msk iface is used
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Apr 04 16:00:01 +0000 2009
>Closed-Date:    Wed Mar 29 10:13:13 +0000 2017
>Last-Modified:  Wed Mar 29 10:13:13 +0000 2017
>Originator:     Frank Kardel
>Release:        NetBSD 5.0_RC3-090330
>Organization:
>Environment:
NetBSD gaia.acrys.com 5.0_RC3 NetBSD 5.0_RC3 (GAIA) #2: Mon Mar 30 10:40:31 CEST 2009  kardel@gaia.acrys.com:/usr/obj/sys/arch/i386/compile/GAIA i386
Architecture: i386
Machine: i386
>Description:
	High-volume ssh sessions (e.g. using rsync) break with:
		- MAC corruption
		- bad packet length (with a ridiculous length value in the millions)

	Bacula backups fail with:
		03-Apr 02:05 Orcus-sd JobId 15806: Fatal error: bsock.c:415 Packet size too big from "client:x.y.z.u:36643. Terminating connection.	

	Symptoms seem similar to PR #31178.
	The failure also seems more likely the fewer mbufs are available.
	It happens at a 100 Mb/s link rate.

	Bacula seems fine when using the elinkxl (ex*) driver. With msk*, no full backup finished; with ex*, the full backup went through.
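
	The "bad packet length" and "Packet size too big" errors are what a
	length-prefixed protocol reports when bytes in the stream are damaged.
	The sketch below is only an illustration of that effect, not the actual
	ssh or Bacula code: the 4-byte big-endian header, the MAX_PACKET limit
	and the helper names are all assumptions made for the example.

```python
import struct

# Hypothetical sanity limit; not the real limit used by ssh or Bacula.
MAX_PACKET = 1_000_000


def frame(payload: bytes) -> bytes:
    """Prefix a payload with a 4-byte big-endian length."""
    return struct.pack("!I", len(payload)) + payload


def read_length(header: bytes) -> int:
    """Decode the length prefix and reject implausible values."""
    (length,) = struct.unpack("!I", header[:4])
    if length > MAX_PACKET:
        raise ValueError(f"packet size too big: {length}")
    return length


def flip_header_bit(framed: bytes) -> bytes:
    """Simulate on-the-wire damage: flip one bit in the second header byte."""
    return framed[:1] + bytes([framed[1] ^ 0x40]) + framed[2:]


if __name__ == "__main__":
    good = frame(b"x" * 1024)
    print(read_length(good))            # intact header decodes to 1024
    try:
        read_length(flip_header_bit(good))
    except ValueError as err:
        print(err)                      # a length in the millions is rejected
```

	A single flipped bit in the header is enough to turn a 1 KB packet into
	a claimed length of several million bytes, which is consistent with the
	symptoms above if data is being corrupted somewhere below the
	application.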

	dmesg snippets:
	mainbus0 (root)
	cpu0 at mainbus0 apid 0: Intel 686-class, 2831MHz, id 0x10677
	cpu0: Enhanced SpeedStep (1244 mV) 800 MHz
	cpu0: Enhanced SpeedStep frequencies available (MHz): 7200 6400 5600 4800 4000 3100 2300 1500 700
	cpu1 at mainbus0 apid 3: Intel 686-class, 2831MHz, id 0x10677
	cpu2 at mainbus0 apid 1: Intel 686-class, 2831MHz, id 0x10677
	cpu3 at mainbus0 apid 2: Intel 686-class, 2831MHz, id 0x10677
	ioapic0 at mainbus0 apid 4: pa 0xfec00000, version 20, 24 pins
	acpi0 at mainbus0: Intel ACPICA 20080321
	acpi0: X/RSDT: OemId <IntelR,AWRDACPI,42302e31>, AslId <AWRD,00000000>
	acpi0: SCI interrupting at int 9
	acpi0: fixed-feature power button present
	...
	mskc0 at pci3 dev 0 function 0mskc0: interrupt moderation is 0 us
	, Yukon-2 EC rev. A3 (0x2): ioapic0 pin 19
	msk0 at mskc0 port A: Ethernet address 00:xx:xx:xx:xx:xx
	makphy0 at msk0 phy 0: Marvell 88E1111 Gigabit PHY, rev. 2
	makphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
	mskc1 at pci5 dev 0 function 0mskc1: interrupt moderation is 0 us
	, Yukon-2 EC rev. A3 (0x2): ioapic0 pin 17
	msk1 at mskc1 port A: Ethernet address 00:xx:xx:xx:xx:xx
	makphy1 at msk1 phy 0: Marvell 88E1111 Gigabit PHY, rev. 2
	makphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
	...


>How-To-Repeat:
	Run rsync over ssh, or Bacula, for a high-volume data transfer using the msk* driver on a 5.0_RC3 4-CPU (Q9550) system. Connections will break
	on protocol sanity checks (MAC, length issues).
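
	To take ssh and Bacula out of the picture, a plain TCP transfer with a
	per-block digest can show whether payload bytes arrive damaged. The
	following is a minimal sketch under stated assumptions: the block size,
	the header layout and all function names are invented for this example,
	and the demo runs over loopback only.

```python
import hashlib
import os
import socket
import struct
import threading

BLOCK = 64 * 1024                 # assumed block size for the test stream
HDR = struct.Struct("!I32s")      # payload length + SHA-256 digest


def recv_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed early")
        buf += chunk
    return bytes(buf)


def send_blocks(sock, count):
    """Send count random blocks, each preceded by its length and digest."""
    for _ in range(count):
        payload = os.urandom(BLOCK)
        digest = hashlib.sha256(payload).digest()
        sock.sendall(HDR.pack(len(payload), digest) + payload)
    sock.sendall(HDR.pack(0, b"\0" * 32))   # zero length marks end of stream


def verify_blocks(sock):
    """Return (good, bad) block counts; raise on an implausible length."""
    good = bad = 0
    while True:
        length, digest = HDR.unpack(recv_exact(sock, HDR.size))
        if length == 0:
            return good, bad
        if length > BLOCK:
            raise ValueError(f"bad block length: {length}")
        payload = recv_exact(sock, length)
        if hashlib.sha256(payload).digest() == digest:
            good += 1
        else:
            bad += 1


def loopback_demo(count=8):
    """Run sender and verifier in one process over 127.0.0.1."""
    srv = socket.create_server(("127.0.0.1", 0))
    port = srv.getsockname()[1]
    result = {}

    def serve():
        conn, _ = srv.accept()
        with conn:
            result["counts"] = verify_blocks(conn)

    worker = threading.Thread(target=serve)
    worker.start()
    with socket.create_connection(("127.0.0.1", port)) as client:
        send_blocks(client, count)
    worker.join()
    srv.close()
    return result["counts"]


if __name__ == "__main__":
    print(loopback_demo())
```

	Loopback does not touch the msk hardware. To exercise the driver,
	verify_blocks() would have to run on a second host and send_blocks()
	on the affected machine (or the reverse), with the traffic routed
	through an msk* interface; a non-zero bad count or a length error
	there would point below the application layer.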
>Fix:
	Workaround only: ignore the two built-in msk interfaces and fall back to e.g. ex*.

>Release-Note:

>Audit-Trail:
From: "Jean-Yves Migeon (NetBSD)" <jym@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/41140
Date: Thu, 09 Jul 2015 12:28:39 +0200

 FWIW the problem still happens on NetBSD 6.1.5, amd64.

 Got an Asus P6T recently with two Marvell LAN chips (Marvell 88E8056),
 and on occasion (after about 5-10 min of connectivity) under high transfer
 loads rsync errors out with a "corrupt packet received, disconnected."

 No message in dmesg, vmstat -i seems pretty normal and no error count
 increased in netstat -i. I originally thought it was a RAM error, but
 neither memtest nor any other program crashes. The chip + msk driver
 seems to be the culprit.

 Happens both with 1000baseT and 100baseTX media, although 100baseTX 
 seems to have a lower rate of failure.

 Will report later with a -current kernel once I can reboot this machine.

 -- 
 Jean-Yves Migeon
 jym@

State-Changed-From-To: open->closed
State-Changed-By: kardel@NetBSD.org
State-Changed-When: Wed, 29 Mar 2017 10:13:13 +0000
State-Changed-Why:
timeout - hw not available - closed by submitter (me)


>Unformatted: