NetBSD Problem Report #49462
From martin@duskware.de Wed Dec 10 17:31:25 2014
Return-Path: <martin@duskware.de>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
by mollari.NetBSD.org (Postfix) with ESMTPS id E8DEBA57FE
for <gnats-bugs@gnats.NetBSD.org>; Wed, 10 Dec 2014 17:31:25 +0000 (UTC)
From: martin@NetBSD.org
Reply-To: martin@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: if_slowtimo callout mangled?
X-Send-Pr-Version: 3.95
>Number: 49462
>Category: kern
>Synopsis: if_slowtimo callout mangled?
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Dec 10 17:35:00 +0000 2014
>Last-Modified: Wed Dec 17 22:20:01 +0000 2014
>Originator: Martin Husemann
>Release: NetBSD 7.99.2
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD thirdstage.duskware.de 7.99.2 NetBSD 7.99.2 (MODULAR) #237: Wed Dec 10 17:53:26 CET 2014 martin@thirdstage.duskware.de:/usr/src/sys/arch/sparc64/compile/MODULAR sparc64
Architecture: sparc64
Machine: sparc64
>Description:
I get a (not quite, but pretty often) reproducable KASSERT firing on reboot
on a SMP sparc64 machine with four bge interfaces:
panic: kernel diagnostic assertion "(c->c_flags & CALLOUT_PENDING) == 0" failed: file "../../../../kern/kern_timeout.c", line 314 callout 0x103b29aa8: c_func (0x119f100) c_flags (0x3) destroyed from 0x11a0e40
The function is if_slowtimo and the caller of callout_destroy() is
if_detach.
However, this is basically impossible:
if (ifp->if_slowtimo != NULL) {
callout_halt(ifp->if_slowtimo_ch, NULL);
callout_destroy(ifp->if_slowtimo_ch);
kmem_free(ifp->if_slowtimo_ch, sizeof(*ifp->if_slowtimo_ch));
}
and callout_halt() certainly kills bit 0 in flags (CALLOUT_BOUND), and should
also not return before CALLOUT_PENDING has cleared.
So something(tm) is wrong, but it is not obvious to me.
>How-To-Repeat:
Reboot an MP sparc64 machine with a -current DIAGNOSTIC kernel, best from a
ssh login.
>Fix:
n/a
>Audit-Trail:
From: Ryota Ozaki <ozaki-r@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/49462: if_slowtimo callout mangled?
Date: Thu, 11 Dec 2014 10:51:06 +0900
On Thu, Dec 11, 2014 at 2:35 AM, <martin@netbsd.org> wrote:
>>Number: 49462
>>Category: kern
>>Synopsis: if_slowtimo callout mangled?
>>Confidential: no
>>Severity: critical
>>Priority: high
>>Responsible: kern-bug-people
>>State: open
>>Class: sw-bug
>>Submitter-Id: net
>>Arrival-Date: Wed Dec 10 17:35:00 +0000 2014
>>Originator: Martin Husemann
>>Release: NetBSD 7.99.2
>>Organization:
> The NetBSD Foundation, Inc.
>>Environment:
> System: NetBSD thirdstage.duskware.de 7.99.2 NetBSD 7.99.2 (MODULAR) #237: Wed Dec 10 17:53:26 CET 2014 martin@thirdstage.duskware.de:/usr/src/sys/arch/sparc64/compile/MODULAR sparc64
> Architecture: sparc64
> Machine: sparc64
>>Description:
>
> I get a (not quite, but pretty often) reproducable KASSERT firing on reboot
> on a SMP sparc64 machine with four bge interfaces:
>
> panic: kernel diagnostic assertion "(c->c_flags & CALLOUT_PENDING) == 0" failed: file "../../../../kern/kern_timeout.c", line 314 callout 0x103b29aa8: c_func (0x119f100) c_flags (0x3) destroyed from 0x11a0e40
>
> The function is if_slowtimo and the caller of callout_destroy() is
> if_detach.
>
> However, this is basically impossible:
>
> if (ifp->if_slowtimo != NULL) {
> callout_halt(ifp->if_slowtimo_ch, NULL);
> callout_destroy(ifp->if_slowtimo_ch);
> kmem_free(ifp->if_slowtimo_ch, sizeof(*ifp->if_slowtimo_ch));
> }
>
> and callout_halt() certainly kills bit 0 in flags (CALLOUT_BOUND), and should
> also not return before CALLOUT_PENDING has cleared.
>
> So something(tm) is wrong, but it is not obvious to me.
It can happen. See PR 47881 and this thread:
http://mail-index.netbsd.org/netbsd-bugs/2014/11/12/msg039065.html .
The problem is that callout_halt waits for a callout handler
completion but doesn't prevent the handler from scheduling
a new callout.
I fixed the problem by using an existing flag of the user (in6m->in6m_timer)
as "don't callout_schedule anymore" flag for callout. I think the fix can
be applied to this case.
Nonetheless, I'm thinking that we maybe should do it in callout_halt itself.
For example, introduce CALLOUT_HALTING flag and set it before waiting a
callout handler finished, while callout_schedule first checks the flag and
do nothing if the flag is set. By doing so, we can prevent a new callout
from being scheduled during callout_halt.
Off topic: callout_schedule_locked takes a (held) mutex but it's just
released only just before returning. We can release the mutex outside
callout_schedule_locked so that we don't need to pass it at all.
ozaki-r
>
>>How-To-Repeat:
> Reboot an MP sparc64 machine with a -current DIAGNOSTIC kernel, best from a
> ssh login.
>
>>Fix:
> n/a
>
From: Martin Husemann <martin@duskware.de>
To: Ryota Ozaki <ozaki-r@netbsd.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/49462: if_slowtimo callout mangled?
Date: Thu, 11 Dec 2014 10:38:07 +0100
--Q68bSM7Ycu6FN28Q
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Ok, how about this change?
Seems to fix it for me.
Martin
--Q68bSM7Ycu6FN28Q
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="if_slowtimo.patch"
Index: if.c
===================================================================
RCS file: /cvsroot/src/sys/net/if.c,v
retrieving revision 1.304
diff -u -p -r1.304 if.c
--- if.c 8 Dec 2014 04:55:47 -0000 1.304
+++ if.c 11 Dec 2014 09:36:53 -0000
@@ -745,6 +745,7 @@ if_detach(struct ifnet *ifp)
s = splnet();
if (ifp->if_slowtimo != NULL) {
+ ifp->if_slowtimo = NULL;
callout_halt(ifp->if_slowtimo_ch, NULL);
callout_destroy(ifp->if_slowtimo_ch);
kmem_free(ifp->if_slowtimo_ch, sizeof(*ifp->if_slowtimo_ch));
@@ -1515,15 +1516,19 @@ static void
if_slowtimo(void *arg)
{
struct ifnet *ifp = arg;
- int s = splnet();
+ int s;
- KASSERT(ifp->if_slowtimo != NULL);
+ if (__predict_false(ifp->if_slowtimo == NULL))
+ return;
+ s = splnet();
if (ifp->if_timer != 0 && --ifp->if_timer == 0)
(*ifp->if_slowtimo)(ifp);
splx(s);
- callout_schedule(ifp->if_slowtimo_ch, hz / IFNET_SLOWHZ);
+
+ if (__predict_true(ifp->if_slowtimo != NULL))
+ callout_schedule(ifp->if_slowtimo_ch, hz / IFNET_SLOWHZ);
}
/*
--Q68bSM7Ycu6FN28Q--
From: Ryota Ozaki <ozaki-r@netbsd.org>
To: Martin Husemann <martin@duskware.de>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/49462: if_slowtimo callout mangled?
Date: Thu, 11 Dec 2014 18:57:13 +0900
On Thu, Dec 11, 2014 at 6:38 PM, Martin Husemann <martin@duskware.de> wrote:
> Ok, how about this change?
> Seems to fix it for me.
The fix is what I consider :) And it works for me too.
Thanks!
ozaki-r
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/49462 CVS commit: src/sys/net
Date: Thu, 11 Dec 2014 14:33:22 +0000
Module Name: src
Committed By: martin
Date: Thu Dec 11 14:33:22 UTC 2014
Modified Files:
src/sys/net: if.c
Log Message:
Avoid scheduling more slow timeouts while we are in the process of detaching
the interface: set if_slowtimo to NULL before doing the callout_halt()
and test for that in the callout. Fixes PR kern/49462.
To generate a diff of this commit:
cvs rdiff -u -r1.304 -r1.305 src/sys/net/if.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Masao Uebayashi <uebayasi@gmail.com>
To: gnats-bugs@netbsd.org
Cc: Martin Husemann <martin@duskware.de>
Subject: Re: kern/49462: if_slowtimo callout mangled?
Date: Thu, 11 Dec 2014 23:54:27 +0900
I would do this, to close possible race window, by dereference ifp->if_slowtimo
only once:
void
if_slowtimo(void *arg)
{
void (*slowtimo)(void *);
int s;
slowtimo = ifp->if_slowtimo;
if (slowtimo == NULL)
return;
s = splnet();
if (ifp->if_timer != 0 && --ifp->if_timer == 0)
(*slowtimo)(ifp);
splx(s);
callout_schedule(ifp->if_slowtimo_ch, hz / IFNET_SLOWHZ);
}
From: Martin Husemann <martin@duskware.de>
To: Masao Uebayashi <uebayasi@gmail.com>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/49462: if_slowtimo callout mangled?
Date: Thu, 11 Dec 2014 19:57:01 +0100
On Thu, Dec 11, 2014 at 11:54:27PM +0900, Masao Uebayashi wrote:
> I would do this, to close possible race window, by dereference ifp->if_slowtimo
> only once:
I would at least re-check before the callout_schedule() (after splx),
but otherwise feel free to change it.
Martin
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/49462 CVS commit: src/sys/net
Date: Sun, 14 Dec 2014 08:57:14 +0000
Module Name: src
Committed By: martin
Date: Sun Dec 14 08:57:14 UTC 2014
Modified Files:
src/sys/net: if.c
Log Message:
Avoid a race when the ifp->if_slowtimo pointer is changed while we are
running in if_slowtimo already. Suggested by Masao Uebayashi
in PR kern/49462.
To generate a diff of this commit:
cvs rdiff -u -r1.305 -r1.306 src/sys/net/if.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: David Laight <david@l8s.co.uk>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/49462: if_slowtimo callout mangled?
Date: Wed, 17 Dec 2014 22:16:36 +0000
On Thu, Dec 11, 2014 at 02:55:01PM +0000, Masao Uebayashi wrote:
> The following reply was made to PR kern/49462; it has been noted by GNATS.
>
> I would do this, to close possible race window, by dereference ifp->if_slowtimo
> only once:
>
> void
> if_slowtimo(void *arg)
> {
> void (*slowtimo)(void *);
> int s;
>
> slowtimo = ifp->if_slowtimo;
> if (slowtimo == NULL)
> return;
> s = splnet();
> if (ifp->if_timer != 0 && --ifp->if_timer == 0)
> (*slowtimo)(ifp);
> splx(s);
> callout_schedule(ifp->if_slowtimo_ch, hz / IFNET_SLOWHZ);
> }
That still looks racy to me.
I'd have expected that if_slowtimo would need to read inside splnet().
Otherwuse reading it twice wouldn't matter.
It does rather depend on what else might clear it.
David
--
David Laight: david@l8s.co.uk
(Contact us)
$NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2014
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.