NetBSD Problem Report #49711

From tih@hamartun.priv.no  Mon Mar  2 17:05:18 2015
Return-Path: <tih@hamartun.priv.no>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id CF983A65B0
	for <gnats-bugs@gnats.NetBSD.org>; Mon,  2 Mar 2015 17:05:18 +0000 (UTC)
Message-Id: <20150302170513.F00F31C70EA@barsoom.hamartun.priv.no>
Date: Mon,  2 Mar 2015 18:05:13 +0100 (CET)
From: tih@hamartun.priv.no
Reply-To: tih@hamartun.priv.no
To: gnats-bugs@NetBSD.org
Subject: Spurious interrupts with Intel E7520 chip set can cause hangs
X-Send-Pr-Version: 3.95

>Number:         49711
>Category:       kern
>Synopsis:       Interrupts get routed to wrong handler under load
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    kern-bug-people
>State:          closed
>Class:          support
>Submitter-Id:   net
>Arrival-Date:   Mon Mar 02 17:10:00 +0000 2015
>Closed-Date:    Sat Jul 11 15:12:30 +0000 2020
>Last-Modified:  Sat Jul 11 15:12:30 +0000 2020
>Originator:     Tom Ivar Helbekkmo
>Release:        NetBSD 7.99.5
>Organization:
>Environment:


System: NetBSD barsoom.hamartun.priv.no 7.99.5 NetBSD 7.99.5 (BARSOOM) #128: Wed Feb 25 16:32:01 CET 2015 root@barsoom.hamartun.priv.no:/usr/obj/sys/arch/amd64/compile.amd64/BARSOOM amd64
Architecture: x86_64
Machine: amd64
>Description:

What happens seems to be that the Intel E7520 chip set has a bug where
an interrupt is being handled, and the ioapic pin temporarily masked,
and the chip set decides to make the masked interrupt pop up somewhere
else.  This somewhere else depends on the physical configuration of
the motherboard, and on the BIOS: it follows the underlying sharing of
IRQs.

I'm observing the problem on a Dell PowerEdge 2850, but it is known to
happen also on other Dell systems from the same period, as well as on
machines built by other manufacturers using the same chip set.  In my
case, the primary network interface, wm0, shares an underlying IRQ
with one uhci device, while the Dell PERC RAID controller, amr0,
shares with another uhci device.  As the machine gets loaded down with
increasing disk and network I/O, the number of interrupts falsely
routed to the uhci devices increases exponentially.  With a light
load, this will cause hangs of a few seconds at a time, where the
machine does not respond to anything.  With a sufficiently high load,
the machine will hang completely, and need a power cycle.  The problem
is worse with more CPUs active; I've not seen hangs of over half a
minute in single CPU mode.  'vmstat -i' reveals that the machine is
racking up very large numbers of interrupts on the (idle) uhci devices
during the hangs.

Since interrupts can get misrouted while previous interrupts are still
being handled, increasing I/O load increases both the number of
interrupts that could get hit, and the amount of time spent handling
interrupts, and thus being susceptible to the accidents.  This would
explain the non-linear growth of the problem with increasing load.

Googling reveals that the problem is well known, and has been worked
around in Linux and FreeBSD by introducing interrupt storm
mitigation.

>How-To-Repeat:

Load down a system built around the E7520 chip set with disk and
network I/O, and observe.

>Fix:

A workaround has been found for the particular case seen with this
Dell hardware, where disk and network interrupts get misrouted to the
uhci driver.  Joerg Sonnenberger suggested a way of modifying the uhci
driver to a) not enable interrupts from the uhci devices, and b) run
in a quasi-polled mode, using callout_schedule() to make the driver
check each uhci device HZ times per second.  This permits the uhci
driver to have a minimal execution path for its interrupt handler,
which simply returns without doing anything.

I am running with this hack in place, and have found myself unable to
get the machine to hang.  I do see periods of high interrupt rates on
the uhci ioapic lines, and the effect can be seen in increased ICMP
ECHO round trip times between this system and neighbors on the local
network (increasing from sub-millisecond to 15 to 20 milliseconds
during very heavily loaded periods), but it is a good workaround, and
running serial communications over a USB dongle over an affected uchi
device works flawlessly at the same time.

Here is the patch (replace "BARSOOM" with whatever option string or
other conditional you would like it to depend on):

Index: sys/dev/usb/uhci.c
===================================================================
RCS file: /cvsroot/src/sys/dev/usb/uhci.c,v
retrieving revision 1.264
diff -u -r1.264 uhci.c
--- sys/dev/usb/uhci.c	5 Aug 2014 06:35:24 -0000	1.264
+++ sys/dev/usb/uhci.c	2 Mar 2015 12:47:25 -0000
@@ -385,6 +385,19 @@
 	UHCICMD(sc, 0);			/* do nothing */
 }

+#ifdef BARSOOM
+Static int uhci_intr1(uhci_softc_t *);
+
+static void uhci_intr1_wrap(void *cookie)
+{
+	uhci_softc_t *sc = cookie;
+
+	mutex_spin_enter(&sc->sc_intr_lock);
+	uhci_intr1(sc);
+	mutex_spin_exit(&sc->sc_intr_lock);
+}
+#endif
+
 usbd_status
 uhci_init(uhci_softc_t *sc)
 {
@@ -542,8 +555,16 @@
 	DPRINTFN(1,("uhci_init: enabling\n"));

 	err =  uhci_run(sc, 1, 0);		/* and here we go... */
+
+#ifdef BARSOOM
+	callout_init(&sc->sc_worker, CALLOUT_MPSAFE);
+	callout_setfunc(&sc->sc_worker, uhci_intr1_wrap, sc);
+	callout_schedule(&sc->sc_worker, 1);
+#else
 	UWRITE2(sc, UHCI_INTR, UHCI_INTR_TOCRCIE | UHCI_INTR_RIE |
 		UHCI_INTR_IOCE | UHCI_INTR_SPIE);	/* enable interrupts */
+#endif
+
 	return err;
 }

@@ -720,8 +741,10 @@
 	UHCICMD(sc, cmd | UHCI_CMD_FGR); /* force resume */
 	usb_delay_ms_locked(&sc->sc_bus, USB_RESUME_DELAY, &sc->sc_intr_lock);
 	UHCICMD(sc, cmd & ~UHCI_CMD_EGSM); /* back to normal */
+#ifndef BARSOOM
 	UWRITE2(sc, UHCI_INTR, UHCI_INTR_TOCRCIE |
 	    UHCI_INTR_RIE | UHCI_INTR_IOCE | UHCI_INTR_SPIE);
+#endif
 	UHCICMD(sc, UHCI_CMD_MAXP);
 	uhci_run(sc, 1, 1); /* and start traffic again */
 	usb_delay_ms_locked(&sc->sc_bus, USB_RESUME_RECOVERY, &sc->sc_intr_lock);
@@ -735,6 +758,11 @@
 #endif

 	sc->sc_suspend = PWR_RESUME;
+
+#ifdef BARSOOM
+	callout_schedule(&sc->sc_worker, 1);
+#endif
+
 	mutex_spin_exit(&sc->sc_intr_lock);

 	return true;
@@ -1270,14 +1298,21 @@
 		sc->sc_bulk_end = pqh;
 }

+#ifndef BARSOOM
 Static int uhci_intr1(uhci_softc_t *);
+#endif

 int
 uhci_intr(void *arg)
 {
-	uhci_softc_t *sc = arg;
 	int ret = 0;

+#ifndef BARSOOM
+	uhci_softc_t *sc = arg;
+
+	if (sc == NULL)
+		return 0;
+
 	mutex_spin_enter(&sc->sc_intr_lock);

 	if (sc->sc_dying || !device_has_power(sc->sc_dev))
@@ -1294,6 +1329,8 @@

  done:
 	mutex_spin_exit(&sc->sc_intr_lock);
+#endif /* BARSOOM */
+
 	return ret;
 }

@@ -1313,8 +1350,13 @@
 	KASSERT(mutex_owned(&sc->sc_intr_lock));

 	status = UREAD2(sc, UHCI_STS) & UHCI_STS_ALLINTRS;
-	if (status == 0)	/* The interrupt was not for us. */
+
+	if (status == 0) {	/* The interrupt was not for us. */
+#ifdef BARSOOM
+		callout_schedule(&sc->sc_worker, 1);
+#endif
 		return (0);
+	}

 	if (sc->sc_suspend != PWR_RESUME) {
 #ifdef DIAGNOSTIC
@@ -1322,6 +1364,9 @@
 		       device_xname(sc->sc_dev));
 #endif
 		UWRITE2(sc, UHCI_STS, status); /* acknowledge the ints */
+#ifdef BARSOOM
+		callout_schedule(&sc->sc_worker, 1);
+#endif
 		return (0);
 	}

@@ -1359,8 +1404,13 @@
 		sc->sc_dying = 1;
 	}

-	if (!ack)
+	if (!ack) {
+#ifdef BARSOOM
+		callout_schedule(&sc->sc_worker, 1);
+#endif
 		return (0);	/* nothing to acknowledge */
+	}
+
 	UWRITE2(sc, UHCI_STS, ack); /* acknowledge the ints */

 	sc->sc_bus.no_intrs++;
@@ -1368,6 +1418,9 @@

 	DPRINTFN(15, ("%s: uhci_intr: exit\n", device_xname(sc->sc_dev)));

+#ifdef BARSOOM
+	callout_schedule(&sc->sc_worker, 1);
+#endif
 	return (1);
 }

Index: sys/dev/usb/uhcivar.h
===================================================================
RCS file: /cvsroot/src/sys/dev/usb/uhcivar.h,v
retrieving revision 1.52
diff -u -r1.52 uhcivar.h
--- sys/dev/usb/uhcivar.h	29 Jan 2013 00:00:15 -0000	1.52
+++ sys/dev/usb/uhcivar.h	2 Mar 2015 12:47:25 -0000
@@ -137,6 +137,10 @@
 	bus_space_handle_t ioh;
 	bus_size_t sc_size;

+#ifdef BARSOOM
+	struct callout sc_worker;
+#endif
+
 	kmutex_t sc_lock;
 	kmutex_t sc_intr_lock;
 	kcondvar_t sc_softwake_cv;

>Release-Note:

>Audit-Trail:

State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Mon, 15 Jun 2020 16:58:28 +0000
State-Changed-Why:
Can you please try with up-to-date -current? -current should be using MSI
for the devices, so there shouldn't be any interrupt sharing any more,
which triggered this behaviour.


From: Tom Ivar Helbekkmo <tih@hamartun.priv.no>
To: jdolecek@netbsd.org
Cc: kern-bug-people@netbsd.org,  netbsd-bugs@netbsd.org,  gnats-admin@netbsd.org,  gnats-bugs@netbsd.org
Subject: Re: kern/49711 (Interrupts get routed to wrong handler under load)
Date: Mon, 06 Jul 2020 09:45:17 +0200

 No improvement.  This is a hardware bug, and the only way to mitigate it
 is to make sure the affected interrupt vectors either are disabled, or
 have extremely short handling times.  Running in polled mode achieves
 that, as does the FreeBSD approach of detecting and suppressing storms.

 Should probably be closed "won't fix", as the affected Intel hardware
 gets older, and less likely to run NetBSD.

 -tih
 -- 
 Most people who graduate with CS degrees don't understand the significance
 of Lisp.  Lisp is the most important idea in computer science.  --Alan Kay

State-Changed-From-To: feedback->closed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Sat, 11 Jul 2020 15:12:30 +0000
State-Changed-Why:
Won't fix, E7520 hardware bug. Thanks for report.


>Unformatted:
Home
PR Database Search
(Contact us) $NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.