NetBSD Problem Report #49305
From www@NetBSD.org Wed Oct 22 09:47:13 2014
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
by mollari.NetBSD.org (Postfix) with ESMTPS id BC222A6687
for <gnats-bugs@gnats.NetBSD.org>; Wed, 22 Oct 2014 09:47:13 +0000 (UTC)
Message-Id: <20141022094712.88A2AA6699@mollari.NetBSD.org>
Date: Wed, 22 Oct 2014 09:47:12 +0000 (UTC)
From: macallan@netbsd.org
Reply-To: macallan@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: deadlocks on sparc64 SMP
X-Send-Pr-Version: www-1.0
>Number: 49305
>Category: kern
>Synopsis: deadlocks on sparc64 SMP
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Oct 22 09:50:00 +0000 2014
>Last-Modified: Sun Nov 09 16:35:00 +0000 2014
>Originator: Michael Lorenz
>Release: 6.99.45 and newer
>Organization:
>Environment:
NetBSD blackbush 7.99.1 NetBSD 7.99.1 (BLACKBUSH) #1: Sun Oct 19 11:11:01 EDT 2014 root@blackbush:/stuff/build/obj_sparc64/sys/arch/sparc64/compile/BLACKBUSH sparc64
>Description:
Under load my Sun Blade 2500 will sooner or later deadlock.
I tracked this down to one specific commit - net/pktqueue.c -r1.7. A -current kernel with just this reversed is stable.
I have not seen any deadlocks on non-SMP hardware or on slower machines, like my Ultra 60.
>How-To-Repeat:
build.sh -j4 distribution with sources over nfs. Usually deadlocks within an hour, may take significantly longer.
>Fix:
downgrade net/pktqueue.c to -r1.6
>Audit-Trail:
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: re: kern/49305: deadlocks on sparc64 SMP
Date: Wed, 22 Oct 2014 23:12:04 +1100
> >Description:
> Under load my Sun Blade 2500 will sooner or later deadlock.
> I tracked this down to one specific commit - net/pktqueue.c -r1.7. A -current kernel with just this reversed is stable.
> I have not seen any deadlocks on non-SMP hardware or on slower machines, like my Ultra 60.
> >How-To-Repeat:
> build.sh -j4 distribution with sources over nfs. Usually deadlocks within an hour, may take significantly longer.
> >Fix:
> downgrade net/pktqueue.c to -r1.6
interesting.

 bool
 pktq_enqueue(pktqueue_t *pq, struct mbuf *m, const u_int hash __unused)
 {
-	const unsigned cpuid = curcpu()->ci_index /* hash % ncpu */;
+#ifdef _RUMPKERNEL
+	const unsigned cpuid = curcpu()->ci_index;
+#else
+	const unsigned cpuid = hash % ncpu;
+#endif
 
 	KASSERT(kpreempt_disabled());

i wonder if the #ifdef polarity is wrong, and we want to keep
the prior code for the real kernel.
.mrg.
From: Martin Husemann <martin@duskware.de>
To: matthew green <mrg@eterna.com.au>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/49305: deadlocks on sparc64 SMP
Date: Wed, 22 Oct 2014 14:28:15 +0200
On Wed, Oct 22, 2014 at 11:12:04PM +1100, matthew green wrote:
> interesting.
>
> bool
> pktq_enqueue(pktqueue_t *pq, struct mbuf *m, const u_int hash __unused)
> {
> - const unsigned cpuid = curcpu()->ci_index /* hash % ncpu */;
[..]
> + const unsigned cpuid = hash % ncpu;
> +#endif
>
> KASSERT(kpreempt_disabled());
Note that only ether_input() passes a real hash value here (all other
callers hard code it to 0), and also that the only way to get to this
function is typically in the RX interrupt of a network driver, which
will very likely always happen on cpu0 on sparc64 (currently).
Martin
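
The two CPU-selection policies under discussion can be compared in isolation. What follows is a minimal userland sketch, not kernel code: sketch_ncpu, select_cpu_r16() and select_cpu_r17() are hypothetical names standing in for ncpu and for the two variants of the cpuid line in the diff quoted above.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's ncpu (assumed: a 2-CPU machine). */
static unsigned sketch_ncpu = 2;

/*
 * Policy of pktqueue.c r1.6: queue the packet on the CPU that is
 * running the caller, which is normally the CPU taking the RX interrupt.
 */
static unsigned
select_cpu_r16(unsigned current_cpu, unsigned hash)
{
	(void)hash;		/* the hash is ignored */
	return current_cpu;
}

/*
 * Policy of pktqueue.c r1.7 outside rump kernels: pick the target
 * CPU from the hash, regardless of where the caller is running.
 */
static unsigned
select_cpu_r17(unsigned current_cpu, unsigned hash)
{
	(void)current_cpu;	/* the current CPU is ignored */
	assert(sketch_ncpu > 0);
	return hash % sketch_ncpu;
}
```

Under these assumptions, a caller that passes a hard-coded hash of 0 still lands on CPU 0 with either policy. Only a caller passing a real hash, which per the message above means ether_input(), can make the r1.7 policy select a CPU other than the one that took the interrupt.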
From: Michael <macallan@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/49305: deadlocks on sparc64 SMP
Date: Wed, 22 Oct 2014 08:44:38 -0400
On Wed, 22 Oct 2014 12:15:01 +0000 (UTC)
matthew green <mrg@eterna.com.au> wrote:
> i wonder if the #ifdef polarity is wrong, and we want to keep
> the prior code for the real kernel.
That's what I thought ( and mlelstv@ as well )
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: macallan@netbsd.org
Subject: Re: kern/49305: deadlocks on sparc64 SMP
Date: Wed, 22 Oct 2014 14:49:56 +0200
On Wed, Oct 22, 2014 at 12:45:01PM +0000, Michael wrote:
> That's what I thought ( and mlelstv@ as well )
That would mean that we always handle the softint consuming the packet
just queued on cpu 0.
The whole idea is to load balance and give others a chance to deal with
the queued packets.
The question is: what exactly goes wrong when we dispatch the softint on
another cpu?
Could you try hardcoding cpuid = 1 and see if that works?
Martin
From: Michael <macallan@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/49305: deadlocks on sparc64 SMP
Date: Tue, 28 Oct 2014 05:17:22 -0400
> Could you try hardcoding cpuid = 1 and see if that works?
I did, and it seemed to work just fine, no crashes or deadlocks.
With cpuid alternating between CPUs I got spontaneous reboots, although
they took a while to reproduce ( as in, 3/4 of a build.sh -j4 -m
mips64eb )
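
For reference, the two experiments described above might look roughly like this when reduced to plain C. The thread does not show the actual test patches, so both helpers are hypothetical: select_cpu_fixed() models "hardcode cpuid = 1", and select_cpu_alternating() is one assumed way of alternating between CPUs using a simple counter.

```c
#include <assert.h>

/* Assumed CPU count for the sketch; the thread does not state it. */
enum { SKETCH_NCPU = 2 };

/* Experiment 1: always hand the packet to CPU 1. */
static unsigned
select_cpu_fixed(void)
{
	return 1;
}

/*
 * Experiment 2 (assumed form): alternate between CPUs on each call.
 * The caller owns the counter; the real test patch may have differed.
 */
static unsigned
select_cpu_alternating(unsigned *counter)
{
	assert(counter != 0);
	return (*counter)++ % SKETCH_NCPU;
}
```

The difference that matters for the report is that the fixed variant always targets the same remote CPU, while the alternating variant targets a different CPU on consecutive calls.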
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/49305: deadlocks on sparc64 SMP
Date: Tue, 28 Oct 2014 10:45:04 +0100
Some race or missing sync ops in pcq(9)?
Martin
From: Masao Uebayashi <uebayasi@gmail.com>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/49305: deadlocks on sparc64 SMP
Date: Thu, 30 Oct 2014 12:21:08 +0900
If that's true, a missing membar_consumer() before:
http://nxr.netbsd.org/xref/src/sys/kern/subr_pcq.c#165
otherwise pcq_items[] values may be cached?
From: Dennis Ferguson <dennis.c.ferguson@gmail.com>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org,
gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org,
macallan@netbsd.org
Subject: Re: kern/49305: deadlocks on sparc64 SMP
Date: Thu, 30 Oct 2014 08:18:48 -0700
On 29 Oct, 2014, at 20:25 , Masao Uebayashi <uebayasi@gmail.com> wrote:
> If that's true, a missing membar_consumer() before:
>
> http://nxr.netbsd.org/xref/src/sys/kern/subr_pcq.c#165
>
> otherwise pcq_items[] values may be cached?
I don't think so. A membar_consumer() would ensure the
read at line 165 was ordered after previous reads, with
the read at line 159 being the only relevant one. In this
case, though, the read at 165 is dependent on the read
at 159 (it uses the result from 159 to determine what to
read at 165) and this is normally sufficient to ensure
correct ordering without a barrier. I'm pretty sure the
only machine where a barrier would change anything here
is a multiprocessor DEC Alpha, which had an odd cache
arrangement; for everything else it is okay the way it
is.
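
The dependent-read argument above can be illustrated with a small C11 sketch. This is not the code in subr_pcq.c; struct sketch_box, sketch_publish() and sketch_consume() are hypothetical names. The consumer's second read uses an address computed from the value returned by its first read, which is the situation described for lines 159 and 165.

```c
#include <assert.h>
#include <stdatomic.h>

#define SKETCH_SLOTS 4

/* Hypothetical structure: a few slots plus a published index. */
struct sketch_box {
	int slots[SKETCH_SLOTS];
	atomic_uint ready;	/* index of the most recently filled slot */
};

/* Producer: fill the slot first, then publish its index. */
static void
sketch_publish(struct sketch_box *b, unsigned idx, int value)
{
	assert(idx < SKETCH_SLOTS);
	b->slots[idx] = value;
	/* Release: the slot write is visible before the new index. */
	atomic_store_explicit(&b->ready, idx, memory_order_release);
}

/* Consumer: read the index, then read the slot that index names. */
static int
sketch_consume(struct sketch_box *b)
{
	/*
	 * First read.  Portable C has to request the ordering explicitly,
	 * so acquire is used here.  The argument in the message above is
	 * that most hardware already orders the second read after this
	 * one because of the address dependency, Alpha being the exception.
	 */
	unsigned idx = atomic_load_explicit(&b->ready, memory_order_acquire);

	/* Second read: its address depends on the value just loaded. */
	return b->slots[idx];
}
```

In this sketch an extra read barrier between the two reads in sketch_consume() would add nothing on hardware that honours address dependencies, which matches the reasoning given for why membar_consumer() is not expected to help.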
From: Michael <macallan@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/49305: deadlocks on sparc64 SMP
Date: Fri, 31 Oct 2014 16:50:57 -0400
On Thu, 30 Oct 2014 03:25:01 +0000 (UTC)
Masao Uebayashi <uebayasi@gmail.com> wrote:
> If that's true, a missing membar_consumer() before:
>
> http://nxr.netbsd.org/xref/src/sys/kern/subr_pcq.c#165
>
> otherwise pcq_items[] values may be cached?
Just tried that, got a deadlock after a couple hours of build.sh -j4
From: Mindaugas Rasiukevicius <rmind@netbsd.org>
To: Martin Husemann <martin@duskware.de>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, macallan@netbsd.org
Subject: Re: kern/49305: deadlocks on sparc64 SMP
Date: Sun, 9 Nov 2014 15:13:21 +0000
Martin Husemann <martin@duskware.de> wrote:
> The following reply was made to PR kern/49305; it has been noted by GNATS.
>
> From: Martin Husemann <martin@duskware.de>
> To: gnats-bugs@NetBSD.org
> Cc:
> Subject: Re: kern/49305: deadlocks on sparc64 SMP
> Date: Tue, 28 Oct 2014 10:45:04 +0100
>
> Some race or missing sync ops in pcq(9)?
>
Unlikely to be pcq(9). Most likely this is related to software interrupts
or IPIs on sparc64, see softint_schedule_cpu(9). Unfortunately, I do not
have time to look into sparc64 MD code right now.
--
Mindaugas
From: Martin Husemann <martin@duskware.de>
To: Mindaugas Rasiukevicius <rmind@netbsd.org>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, macallan@netbsd.org
Subject: Re: kern/49305: deadlocks on sparc64 SMP
Date: Sun, 9 Nov 2014 17:16:01 +0100
On Sun, Nov 09, 2014 at 03:13:21PM +0000, Mindaugas Rasiukevicius wrote:
> Unlikely to be pcq(9). Most likely this is related to software interrupts
> or IPIs on sparc64, see softint_schedule_cpu(9). Unfortunately, I do not
> have time to look into sparc64 MD code right now.
Possible, but why does the "always use the other cpu for softint processing"
version not trigger it?
Martin
From: Michael <macallan@netbsd.org>
To: Martin Husemann <martin@duskware.de>
Cc: Mindaugas Rasiukevicius <rmind@netbsd.org>, gnats-bugs@NetBSD.org,
kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/49305: deadlocks on sparc64 SMP
Date: Sun, 9 Nov 2014 11:33:43 -0500
Hello,
On Sun, 9 Nov 2014 17:16:01 +0100
Martin Husemann <martin@duskware.de> wrote:
> On Sun, Nov 09, 2014 at 03:13:21PM +0000, Mindaugas Rasiukevicius wrote:
> > Unlikely to be pcq(9). Most likely this is related to software interrupts
> > or IPIs on sparc64, see softint_schedule_cpu(9). Unfortunately, I do not
> > have time to look into sparc64 MD code right now.
>
> Possible, but why does the "always use the other cpu for softint processing"
> version not trigger it?
Might be a fluke but to me it looks like the problem happens
( sometimes ) when we schedule two software interrupts on different
CPUs in quick succession.
have fun
Michael