NetBSD Problem Report #49305

From www@NetBSD.org  Wed Oct 22 09:47:13 2014
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id BC222A6687
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 22 Oct 2014 09:47:13 +0000 (UTC)
Message-Id: <20141022094712.88A2AA6699@mollari.NetBSD.org>
Date: Wed, 22 Oct 2014 09:47:12 +0000 (UTC)
From: macallan@netbsd.org
Reply-To: macallan@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: deadlocks on sparc64 SMP
X-Send-Pr-Version: www-1.0

>Number:         49305
>Category:       kern
>Synopsis:       deadlocks on sparc64 SMP
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Oct 22 09:50:00 +0000 2014
>Last-Modified:  Sun Nov 09 16:35:00 +0000 2014
>Originator:     Michael Lorenz
>Release:        6.99.45 and newer
>Organization:
>Environment:
NetBSD blackbush 7.99.1 NetBSD 7.99.1 (BLACKBUSH) #1: Sun Oct 19 11:11:01 EDT 2014  root@blackbush:/stuff/build/obj_sparc64/sys/arch/sparc64/compile/BLACKBUSH sparc64
/
>Description:
Under load my Sun Blade 2500 will sooner or later deadlock.
I tracked this down to one specific commit - net/pktqueue.c -r1.7. A -current kernel with just this reversed is stable.
I have not seen any deadlocks on non-SMP hardware or on slower machines, like my Ultra 60.
>How-To-Repeat:
build.sh -j4 distribution with sources over nfs. Usually deadlocks within an hour, may take significantly longer.
>Fix:
downgrade net/pktqueue.c to -r1.6

>Audit-Trail:
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org
Subject: re: kern/49305: deadlocks on sparc64 SMP
Date: Wed, 22 Oct 2014 23:12:04 +1100

 > >Description:
 > Under load my Sun Blade 2500 will sooner or later deadlock.
 > I tracked this down to one specific commit - net/pktqueue.c -r1.7. A -current kernel with just this reversed is stable.
 > I have not seen any deadlocks on non-SMP hardware or on slower machines, like my Ultra 60.
 > >How-To-Repeat:
 > build.sh -j4 distribution with sources over nfs. Usually deadlocks within an hour, may take significantly longer.
 > >Fix:
 > downgrade net/pktqueue.c to -r1.6

 interesting.

  bool
  pktq_enqueue(pktqueue_t *pq, struct mbuf *m, const u_int hash __unused)
  {
 -       const unsigned cpuid = curcpu()->ci_index /* hash % ncpu */;
 +#ifdef _RUMPKERNEL
 +       const unsigned cpuid = curcpu()->ci_index;
 +#else
 +       const unsigned cpuid = hash % ncpu;
 +#endif

         KASSERT(kpreempt_disabled());

 i wonder if the #ifdef polarity is wrong, and we want to keep
 the prior code for the real kernel.


 .mrg.
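
For readers comparing the two behaviours, the CPU-selection change in the diff quoted above can be sketched outside the kernel as follows. This is an illustration only: pick_cpu_current() and pick_cpu_hashed() are hypothetical helper names, and the current CPU index and the CPU count are passed as plain arguments instead of being taken from curcpu() and ncpu.

```c
#include <assert.h>

/*
 * Behaviour of the removed line: always queue on the CPU that is
 * running the enqueue.  cur_index stands in for curcpu()->ci_index.
 */
static unsigned
pick_cpu_current(unsigned cur_index)
{
	return cur_index;
}

/*
 * Behaviour of the added #else branch (non-rump kernel): spread
 * packets across CPUs according to the caller-supplied hash.
 */
static unsigned
pick_cpu_hashed(unsigned hash, unsigned ncpu)
{
	assert(ncpu > 0);	/* modulo by zero is undefined */
	return hash % ncpu;
}
```

With two CPUs, pick_cpu_hashed(5, 2) selects CPU 1, while pick_cpu_hashed(0, 2) selects CPU 0, so a caller that always passes a hash of 0 still ends up on CPU 0.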

From: Martin Husemann <martin@duskware.de>
To: matthew green <mrg@eterna.com.au>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/49305: deadlocks on sparc64 SMP
Date: Wed, 22 Oct 2014 14:28:15 +0200

 On Wed, Oct 22, 2014 at 11:12:04PM +1100, matthew green wrote:
 > interesting.
 > 
 >  bool
 >  pktq_enqueue(pktqueue_t *pq, struct mbuf *m, const u_int hash __unused)
 >  {
 > -       const unsigned cpuid = curcpu()->ci_index /* hash % ncpu */;
 [..]
 > +       const unsigned cpuid = hash % ncpu;
 > +#endif
 >  
 >         KASSERT(kpreempt_disabled());

 Note that only ether_input() passes a real hash value here (all other
 callers hard code it to 0), and also that the only way to get to this
 function is typically in the RX interrupt of a network driver, which
 will very likely always happen on cpu0 on sparc64 (currently).

 Martin

From: Michael <macallan@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/49305: deadlocks on sparc64 SMP
Date: Wed, 22 Oct 2014 08:44:38 -0400

 On Wed, 22 Oct 2014 12:15:01 +0000 (UTC)
 matthew green <mrg@eterna.com.au> wrote:

 >  i wonder if the #ifdef polarity is wrong, and we want to keep
 >  the prior code for the real kernel.

 That's what I thought ( and mlelstv@ as well )

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: macallan@netbsd.org
Subject: Re: kern/49305: deadlocks on sparc64 SMP
Date: Wed, 22 Oct 2014 14:49:56 +0200

 On Wed, Oct 22, 2014 at 12:45:01PM +0000, Michael wrote:
 >  That's what I thought ( and mlelstv@ as well )

 That would mean that we always handle the softint consuming the packet
 just queued on cpu 0.

 The whole idea is to load balance and give others a chance to deal with
 the queued packets.

 The question is: what exactly goes wrong when we dispatch the softint on
 another cpu?

 Could you try hardcoding cpuid = 1 and see if that works?

 Martin

From: Michael <macallan@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/49305: deadlocks on sparc64 SMP
Date: Tue, 28 Oct 2014 05:17:22 -0400

 >  Could you try hardcoding cpuid = 1 and see if that works?

 I did, and it seemed to work just fine, no crashes or deadlocks.
 With cpuid alternating between CPUs I got spontaneous reboots, although
 they took a while to reproduce ( as in, 3/4 of a build.sh -j4 -m
 mips64eb )
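
The two test variants described above can be sketched as follows. The sketch is hypothetical: the report does not show how the alternating variant was written, so pick_alternating() and its caller-owned counter are assumptions made for illustration, as is the helper name pick_fixed().

```c
#include <assert.h>
#include <stddef.h>

/* Variant suggested in the thread: always target CPU 1. */
static unsigned
pick_fixed(void)
{
	return 1;
}

/*
 * Assumed round-robin variant: successive calls cycle through the
 * CPUs.  The caller owns the counter; real kernel code would need it
 * to be per-CPU or updated atomically.
 */
static unsigned
pick_alternating(unsigned *counter, unsigned ncpu)
{
	assert(counter != NULL && ncpu > 0);
	return (*counter)++ % ncpu;
}
```

Starting from a counter of 0 on a two-CPU machine, pick_alternating() returns 0, 1, 0, 1 and so on, so consecutive packets are handed to different CPUs.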

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/49305: deadlocks on sparc64 SMP
Date: Tue, 28 Oct 2014 10:45:04 +0100

 Some race or missing sync ops in pcq(9)?

 Martin

From: Masao Uebayashi <uebayasi@gmail.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/49305: deadlocks on sparc64 SMP
Date: Thu, 30 Oct 2014 12:21:08 +0900

 If that's true, a missing membar_consumer() before:

 http://nxr.netbsd.org/xref/src/sys/kern/subr_pcq.c#165

 otherwise pcq_items[] values may be cached?

From: Dennis Ferguson <dennis.c.ferguson@gmail.com>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org,
 gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org,
 macallan@netbsd.org
Subject: Re: kern/49305: deadlocks on sparc64 SMP
Date: Thu, 30 Oct 2014 08:18:48 -0700

 On 29 Oct, 2014, at 20:25 , Masao Uebayashi <uebayasi@gmail.com> wrote:
 > If that's true, a missing membar_consumer() before:
 > 
 > http://nxr.netbsd.org/xref/src/sys/kern/subr_pcq.c#165
 > 
 > otherwise pcq_items[] values may be cached?

 I don't think so.  A membar_consumer() would ensure the
 read at line 165 was ordered after previous reads, with
 the read at line 159 being the only relevant one.  In this
 case, though, the read at 165 is dependent on the read
 at 159 (it uses the result from 159 to determine what to
 read at 165) and this is normally sufficient to ensure
 correct ordering without a barrier.  I'm pretty sure the
 only machine where a barrier would change anything here
 is a multiprocessor DEC Alpha, which had an odd cache
 arrangement; for everything else it is okay the way it
 is.
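
The dependent-read argument above can be illustrated with C11 atomics. This is a sketch, not the pcq(9) code: struct ring, ring_init(), ring_publish() and ring_peek() are hypothetical names, and memory_order_consume is used only to express "ordered by data dependency" (current compilers generally implement it as an acquire load).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SLOTS 4u

struct ring {
	_Atomic unsigned head;		/* most recently published slot */
	void *_Atomic items[SLOTS];	/* published item pointers */
};

static void
ring_init(struct ring *r)
{
	atomic_init(&r->head, 0u);
	for (unsigned i = 0; i < SLOTS; i++)
		atomic_init(&r->items[i], NULL);
}

/* Producer: store the item first, then publish its index. */
static void
ring_publish(struct ring *r, unsigned slot, void *item)
{
	assert(slot < SLOTS);
	atomic_store_explicit(&r->items[slot], item, memory_order_relaxed);
	/* release: the item store is visible before the new index is */
	atomic_store_explicit(&r->head, slot, memory_order_release);
}

/* Consumer: the second load's address is computed from the first. */
static void *
ring_peek(struct ring *r)
{
	unsigned slot = atomic_load_explicit(&r->head, memory_order_consume);

	/*
	 * Data-dependent load: on most CPUs this cannot be satisfied
	 * before the index load above, so no extra read barrier is
	 * needed between the two.
	 */
	return atomic_load_explicit(&r->items[slot], memory_order_relaxed);
}
```

After ring_publish(&r, 2, p), a later ring_peek(&r) returns p. The point of the sketch is only that the index load and the item load are linked by an address dependency, which is the property the message above relies on.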

From: Michael <macallan@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/49305: deadlocks on sparc64 SMP
Date: Fri, 31 Oct 2014 16:50:57 -0400

 On Thu, 30 Oct 2014 03:25:01 +0000 (UTC)
 Masao Uebayashi <uebayasi@gmail.com> wrote:

 >  If that's true, a missing membar_consumer() before:
 >  
 >  http://nxr.netbsd.org/xref/src/sys/kern/subr_pcq.c#165
 >  
 >  otherwise pcq_items[] values may be cached?

 Just tried that, got a deadlock after a couple hours of build.sh -j4

From: Mindaugas Rasiukevicius <rmind@netbsd.org>
To: Martin Husemann <martin@duskware.de>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, macallan@netbsd.org
Subject: Re: kern/49305: deadlocks on sparc64 SMP
Date: Sun, 9 Nov 2014 15:13:21 +0000

 Martin Husemann <martin@duskware.de> wrote:
 > The following reply was made to PR kern/49305; it has been noted by GNATS.
 > 
 > From: Martin Husemann <martin@duskware.de>
 > To: gnats-bugs@NetBSD.org
 > Cc: 
 > Subject: Re: kern/49305: deadlocks on sparc64 SMP
 > Date: Tue, 28 Oct 2014 10:45:04 +0100
 > 
 >  Some race or missing sync ops in pcq(9)?
 >  

 Unlikely to be pcq(9).  Most likely this is related to software interrupts
 or IPIs on sparc64, see softint_schedule_cpu(9).  Unfortunately, I do not
 have time to look into sparc64 MD code right now.

 -- 
 Mindaugas

From: Martin Husemann <martin@duskware.de>
To: Mindaugas Rasiukevicius <rmind@netbsd.org>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, macallan@netbsd.org
Subject: Re: kern/49305: deadlocks on sparc64 SMP
Date: Sun, 9 Nov 2014 17:16:01 +0100

 On Sun, Nov 09, 2014 at 03:13:21PM +0000, Mindaugas Rasiukevicius wrote:
 > Unlikely to be pcq(9).  Most likely this is related to software interrupts
 > or IPIs on sparc64, see softint_schedule_cpu(9).  Unfortunately, I do not
 > have time to look into sparc64 MD code right now.

 Possible, but why does the "always use the other cpu for softint processing"
 version not trigger it?

 Martin

From: Michael <macallan@netbsd.org>
To: Martin Husemann <martin@duskware.de>
Cc: Mindaugas Rasiukevicius <rmind@netbsd.org>, gnats-bugs@NetBSD.org,
 kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/49305: deadlocks on sparc64 SMP
Date: Sun, 9 Nov 2014 11:33:43 -0500

 Hello,

 On Sun, 9 Nov 2014 17:16:01 +0100
 Martin Husemann <martin@duskware.de> wrote:

 > On Sun, Nov 09, 2014 at 03:13:21PM +0000, Mindaugas Rasiukevicius wrote:
 > > Unlikely to be pcq(9).  Most likely this is related to software interrupts
 > > or IPIs on sparc64, see softint_schedule_cpu(9).  Unfortunately, I do not
 > > have time to look into sparc64 MD code right now.
 > 
 > Possible, but why does the "always use the other cpu for softint processing"
 > version not trigger it?

 Might be a fluke but to me it looks like the problem happens
 ( sometimes ) when we schedule two software interrupts on different
 CPUs in quick succession.

 have fun
 Michael