NetBSD Problem Report #44376

From Wolfgang.Stukenbrock@nagler-company.com  Wed Jan 12 16:58:28 2011
Return-Path: <Wolfgang.Stukenbrock@nagler-company.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by www.NetBSD.org (Postfix) with ESMTP id C908663B89F
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 12 Jan 2011 16:58:28 +0000 (UTC)
Message-Id: <20110112165818.A77441E80CE@test-s0.nagler-company.com>
Date: Wed, 12 Jan 2011 17:58:18 +0100 (CET)
From: Wolfgang.Stukenbrock@nagler-company.com
Reply-To: Wolfgang.Stukenbrock@nagler-company.com
To: gnats-bugs@gnats.NetBSD.org
Subject: wm interface 82574 on Supermicro X8SIL (with Xeon L3406) -> kernel deadlock
X-Send-Pr-Version: 3.95

>Number:         44376
>Category:       kern
>Synopsis:       wm interface 82574 on Supermicro X8SIL (with Xeon L3406) -> kernel deadlock
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Jan 12 17:00:00 +0000 2011
>Originator:     Dr. Wolfgang Stukenbrock
>Release:        NetBSD 5.1, NetBSD-Head (12.01.2011)
>Organization:
Dr. Nagler & Company GmbH
>Environment:


System: NetBSD s0g7 5.99.43 NetBSD 5.99.43 (GENERIC) #0: Tue Jan 11 15:20:29 UTC 2011  builds@b7.netbsd.org:/home/builds/ab/HEAD/amd64/201101112100Z-obj/home/builds/ab/HEAD/src/sys/arch/amd64/compile/GENERIC amd64
Architecture: x86_64
Machine: amd64
>Description:
	We've tried to set up NetBSD 5.1 on a new system with a Supermicro X8SIL board and a Xeon L3406 CPU.
	This board uses the 3400 chipset with two 82574L (rev 0) on board. There are two makphy (88E1149 ver 1) present.

	I've tried the original 5.1, a 5.1 with an updated wm driver (plus some other PHY changes required for it, including
	some header file extensions) and finally an "original" GENERIC kernel from HEAD, taken from the NetBSD FTP server. (The system info
	above is from that kernel.)
	All of these setups lead to a kernel deadlock, with no further interrupt processing, after a short time if the onboard
	82574 interfaces are used.
	To get further information about the cause, I've added a dual-port PCIe card (Intel PRO/1000 PT (82571EB)).
	That one runs much more stably in all of the setups above. I've had only one - not reproducible - "crash" with it; see below.

	The scenario:
	The system is set up as a gateway that routes between two other networks.
	Only "the normal" Unix services are running on it. (named, syslogd, rpcbind, ypserv, ypbind, amd, ntpd, sshd, postfix, cron)
	I don't believe the problem is related to any of them. If desired, I can stop some of them ...

	Most of the time the system simply freezes.
	In some cases the system reports problems when accessing the PHY. (This has happened once with the PCIe card too.)
	When that happened with the PCIe card, the system continued working; with the onboard interfaces it freezes
	in most cases a short time afterwards.
	I've already seen this during system boot, before I had started any FTP test transfer through the system.

	When WM_DEBUG with LINK_DEBUG is enabled, only the "normal" switch to HDX is reported - nothing else.
	When RX and TX debugging are enabled too, there is too much output, and I've seen some crashes with bad kernel accesses - there
	seems to be a timing problem somewhere here ...

	I've put some debug output into the 5.1 kernel that tracks the kernel lock, but I haven't found anything that
	points to a problem.
	Nevertheless, the famous last words in this output are always a little bit strange. Some of the CPUs (not always all of them;
	there are 4: 2 cores with 2 threads each) try to take the kernel lock and stay there.
	(Remark: I've tried disabling hyperthreading on the second core - no change. I haven't tested a single-processor setup so far.)
	The output preceding that does not always report that a CPU is inside the lock, and my output that prints the
	number of the CPU holding the lock while waiting says that no CPU is in there, even when the output prior to the system freeze
	reports that a CPU has entered the lock and is still in there.
	The enter/leave output in wm_intr() says that no wm interrupt is active at the time of the freeze.
	About half a minute before the system freezes, the wm interrupts suddenly stop arriving regularly.
	(The traffic load is still the same.)
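
	For illustration only, here is a minimal user-space sketch of the kind of lock-owner tracking described above.
	It is not the instrumentation actually used in the 5.1 kernel; the names traced_lock, traced_lock_try and
	traced_lock_release are hypothetical, and C11 atomics stand in for the kernel's own primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NO_OWNER (-1)

/* Hypothetical traced lock: a lock word plus the id of the CPU holding it. */
struct traced_lock {
	atomic_bool locked;
	atomic_int owner;	/* CPU id of the holder, or NO_OWNER */
};

void
traced_lock_init(struct traced_lock *l)
{
	atomic_init(&l->locked, false);
	atomic_init(&l->owner, NO_OWNER);
}

/*
 * Try once to take the lock for CPU "cpu".
 * Returns 1 on success. On failure returns 0 and stores in *holder the
 * CPU id currently recorded as owner (possibly NO_OWNER, see note below).
 */
int
traced_lock_try(struct traced_lock *l, int cpu, int *holder)
{
	if (atomic_exchange(&l->locked, true)) {
		/* Lock word already set: report who is recorded as holder. */
		*holder = atomic_load(&l->owner);
		return 0;
	}
	/* Owner is recorded in a separate step after the lock word is set. */
	atomic_store(&l->owner, cpu);
	return 1;
}

void
traced_lock_release(struct traced_lock *l)
{
	/* Releasing a lock nobody is recorded as holding is a bug. */
	assert(atomic_load(&l->owner) != NO_OWNER);
	atomic_store(&l->owner, NO_OWNER);
	atomic_store(&l->locked, false);
}
```

	In a design like this the owner field is updated separately from the lock word, so a waiter can briefly
	observe "no owner" while the lock is held. Whether the instrumentation in this report had such a window is not known.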


	I have no further ideas on how to continue debugging, but I will continue if I get some new hints.
	(We need that system in production soon, so I will lose it for testing during this month ...)


	The output of my kernel-lock debugging looks as if memory access and cache synchronisation get out of sync.
	But I cannot believe this, because in that case the problem should also happen under heavy system load without
	network traffic, and I haven't seen that so far.

	I have no idea where to place debugging code to find the point where the system stops processing interrupts.
	I've failed to add short printouts to the interrupt stub routines, but that is of course my own personal problem ...
	At least wm_intr() is not called at that time - as long as printf() on the serial console is working correctly.
	(Remark: when running on the graphics console there seems to be much more kernel-lock "activity" than on the serial console.
		 And the famous last words on the graphics console are truncated in the middle of a printf of 3 chars - on the serial console
		 I've always seen the last printf completely ...)
	Because absolutely no interrupts are processed any more, I cannot get into DDB ...

	Help ....
>How-To-Repeat:
	Boot a current GENERIC kernel and generate some network traffic on the hardware setup above.
	The system will deadlock soon - no interrupts are processed any more.
>Fix:
	Unfortunately not known so far ...
	Not even an idea how to continue debugging ...

>Unformatted:
