NetBSD Problem Report #49462

From martin@duskware.de  Wed Dec 10 17:31:25 2014
Return-Path: <martin@duskware.de>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id E8DEBA57FE
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 10 Dec 2014 17:31:25 +0000 (UTC)
From: martin@NetBSD.org
Reply-To: martin@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: if_slowtimo callout mangled?
X-Send-Pr-Version: 3.95

>Number:         49462
>Category:       kern
>Synopsis:       if_slowtimo callout mangled?
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Dec 10 17:35:00 +0000 2014
>Last-Modified:  Wed Dec 17 22:20:01 +0000 2014
>Originator:     Martin Husemann
>Release:        NetBSD 7.99.2
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD thirdstage.duskware.de 7.99.2 NetBSD 7.99.2 (MODULAR) #237: Wed Dec 10 17:53:26 CET 2014 martin@thirdstage.duskware.de:/usr/src/sys/arch/sparc64/compile/MODULAR sparc64
Architecture: sparc64
Machine: sparc64
>Description:

I get a (not quite, but pretty often) reproducable KASSERT firing on reboot
on a SMP sparc64 machine with four bge interfaces:

panic: kernel diagnostic assertion "(c->c_flags & CALLOUT_PENDING) == 0" failed: file "../../../../kern/kern_timeout.c", line 314 callout 0x103b29aa8: c_func (0x119f100) c_flags (0x3) destroyed from 0x11a0e40

The function is if_slowtimo and the caller of callout_destroy() is
if_detach.

However, this is basically impossible:

         if (ifp->if_slowtimo != NULL) {
                 callout_halt(ifp->if_slowtimo_ch, NULL);
                 callout_destroy(ifp->if_slowtimo_ch);
                 kmem_free(ifp->if_slowtimo_ch, sizeof(*ifp->if_slowtimo_ch));
         }

and callout_halt() certainly kills bit 0 in flags (CALLOUT_BOUND), and should
also not return before CALLOUT_PENDING has cleared.

So something(tm) is wrong, but it is not obvious to me.

>How-To-Repeat:
Reboot an MP sparc64 machine with a -current DIAGNOSTIC kernel, best from a
ssh login.

>Fix:
n/a

>Audit-Trail:
From: Ryota Ozaki <ozaki-r@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/49462: if_slowtimo callout mangled?
Date: Thu, 11 Dec 2014 10:51:06 +0900

 On Thu, Dec 11, 2014 at 2:35 AM,  <martin@netbsd.org> wrote:
 >>Number:         49462
 >>Category:       kern
 >>Synopsis:       if_slowtimo callout mangled?
 >>Confidential:   no
 >>Severity:       critical
 >>Priority:       high
 >>Responsible:    kern-bug-people
 >>State:          open
 >>Class:          sw-bug
 >>Submitter-Id:   net
 >>Arrival-Date:   Wed Dec 10 17:35:00 +0000 2014
 >>Originator:     Martin Husemann
 >>Release:        NetBSD 7.99.2
 >>Organization:
 > The NetBSD Foundation, Inc.
 >>Environment:
 > System: NetBSD thirdstage.duskware.de 7.99.2 NetBSD 7.99.2 (MODULAR) #237: Wed Dec 10 17:53:26 CET 2014 martin@thirdstage.duskware.de:/usr/src/sys/arch/sparc64/compile/MODULAR sparc64
 > Architecture: sparc64
 > Machine: sparc64
 >>Description:
 >
 > I get a (not quite, but pretty often) reproducable KASSERT firing on reboot
 > on a SMP sparc64 machine with four bge interfaces:
 >
 > panic: kernel diagnostic assertion "(c->c_flags & CALLOUT_PENDING) == 0" failed: file "../../../../kern/kern_timeout.c", line 314 callout 0x103b29aa8: c_func (0x119f100) c_flags (0x3) destroyed from 0x11a0e40
 >
 > The function is if_slowtimo and the caller of callout_destroy() is
 > if_detach.
 >
 > However, this is basically impossible:
 >
 >          if (ifp->if_slowtimo != NULL) {
 >                  callout_halt(ifp->if_slowtimo_ch, NULL);
 >                  callout_destroy(ifp->if_slowtimo_ch);
 >                  kmem_free(ifp->if_slowtimo_ch, sizeof(*ifp->if_slowtimo_ch));
 >          }
 >
 > and callout_halt() certainly kills bit 0 in flags (CALLOUT_BOUND), and should
 > also not return before CALLOUT_PENDING has cleared.
 >
 > So something(tm) is wrong, but it is not obvious to me.

 It can happen. See PR 47881 and this thread:
 http://mail-index.netbsd.org/netbsd-bugs/2014/11/12/msg039065.html .

 The problem is that callout_halt waits for a callout handler
 completion but doesn't prevent the handler from scheduling
 a new callout.

 I fixed the problem by using an existing flag of the user (in6m->in6m_timer)
 as "don't callout_schedule anymore" flag for callout. I think the fix can
 be applied to this case.

 Nonetheless, I'm thinking that we maybe should do it in callout_halt itself.
 For example, introduce CALLOUT_HALTING flag and set it before waiting a
 callout handler finished, while callout_schedule first checks the flag and
 do nothing if the flag is set. By doing so, we can prevent a new callout
 from being scheduled during callout_halt.


 Off topic: callout_schedule_locked takes a (held) mutex but it's just
 released only just before returning. We can release the mutex outside
 callout_schedule_locked so that we don't need to pass it at all.

   ozaki-r

 >
 >>How-To-Repeat:
 > Reboot an MP sparc64 machine with a -current DIAGNOSTIC kernel, best from a
 > ssh login.
 >
 >>Fix:
 > n/a
 >

From: Martin Husemann <martin@duskware.de>
To: Ryota Ozaki <ozaki-r@netbsd.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/49462: if_slowtimo callout mangled?
Date: Thu, 11 Dec 2014 10:38:07 +0100

 --Q68bSM7Ycu6FN28Q
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: inline

 Ok, how about this change?
 Seems to fix it for me.

 Martin

 --Q68bSM7Ycu6FN28Q
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: attachment; filename="if_slowtimo.patch"

 Index: if.c
 ===================================================================
 RCS file: /cvsroot/src/sys/net/if.c,v
 retrieving revision 1.304
 diff -u -p -r1.304 if.c
 --- if.c	8 Dec 2014 04:55:47 -0000	1.304
 +++ if.c	11 Dec 2014 09:36:53 -0000
 @@ -745,6 +745,7 @@ if_detach(struct ifnet *ifp)
  	s = splnet();

  	if (ifp->if_slowtimo != NULL) {
 +		ifp->if_slowtimo = NULL;
  		callout_halt(ifp->if_slowtimo_ch, NULL);
  		callout_destroy(ifp->if_slowtimo_ch);
  		kmem_free(ifp->if_slowtimo_ch, sizeof(*ifp->if_slowtimo_ch));
 @@ -1515,15 +1516,19 @@ static void
  if_slowtimo(void *arg)
  {
  	struct ifnet *ifp = arg;
 -	int s = splnet();
 +	int s;

 -	KASSERT(ifp->if_slowtimo != NULL);
 +	if (__predict_false(ifp->if_slowtimo == NULL))
 +		return;

 +	s = splnet();
  	if (ifp->if_timer != 0 && --ifp->if_timer == 0)
  		(*ifp->if_slowtimo)(ifp);

  	splx(s);
 -	callout_schedule(ifp->if_slowtimo_ch, hz / IFNET_SLOWHZ);
 +
 +	if (__predict_true(ifp->if_slowtimo != NULL))
 +		callout_schedule(ifp->if_slowtimo_ch, hz / IFNET_SLOWHZ);
  }

  /*

 --Q68bSM7Ycu6FN28Q--

From: Ryota Ozaki <ozaki-r@netbsd.org>
To: Martin Husemann <martin@duskware.de>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/49462: if_slowtimo callout mangled?
Date: Thu, 11 Dec 2014 18:57:13 +0900

 On Thu, Dec 11, 2014 at 6:38 PM, Martin Husemann <martin@duskware.de> wrote:
 > Ok, how about this change?
 > Seems to fix it for me.

 The fix is what I consider :) And it works for me too.

 Thanks!
   ozaki-r

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/49462 CVS commit: src/sys/net
Date: Thu, 11 Dec 2014 14:33:22 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Thu Dec 11 14:33:22 UTC 2014

 Modified Files:
 	src/sys/net: if.c

 Log Message:
 Avoid scheduling more slow timeouts while we are in the process of detaching
 the interface: set if_slowtimo to NULL before doing the callout_halt()
 and test for that in the callout. Fixes PR kern/49462.


 To generate a diff of this commit:
 cvs rdiff -u -r1.304 -r1.305 src/sys/net/if.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Masao Uebayashi <uebayasi@gmail.com>
To: gnats-bugs@netbsd.org
Cc: Martin Husemann <martin@duskware.de>
Subject: Re: kern/49462: if_slowtimo callout mangled?
Date: Thu, 11 Dec 2014 23:54:27 +0900

 I would do this, to close possible race window, by dereference ifp->if_slowtimo
 only once:

 void
 if_slowtimo(void *arg)
 {
 	void (*slowtimo)(void *);
 	int s;

 	slowtimo = ifp->if_slowtimo;
 	if (slowtimo == NULL)
 		return;
 	s = splnet();
 	if (ifp->if_timer != 0 && --ifp->if_timer == 0)
 		(*slowtimo)(ifp);
 	splx(s);
 	callout_schedule(ifp->if_slowtimo_ch, hz / IFNET_SLOWHZ);
 }

From: Martin Husemann <martin@duskware.de>
To: Masao Uebayashi <uebayasi@gmail.com>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/49462: if_slowtimo callout mangled?
Date: Thu, 11 Dec 2014 19:57:01 +0100

 On Thu, Dec 11, 2014 at 11:54:27PM +0900, Masao Uebayashi wrote:
 > I would do this, to close possible race window, by dereference ifp->if_slowtimo
 > only once:

 I would at least re-check before the callout_schedule() (after splx),
 but otherwise feel free to change it.

 Martin

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/49462 CVS commit: src/sys/net
Date: Sun, 14 Dec 2014 08:57:14 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Sun Dec 14 08:57:14 UTC 2014

 Modified Files:
 	src/sys/net: if.c

 Log Message:
 Avoid a race when the ifp->if_slowtimo pointer is changed while we are
 running in if_slowtimo already. Suggested by Masao Uebayashi
 in PR kern/49462.


 To generate a diff of this commit:
 cvs rdiff -u -r1.305 -r1.306 src/sys/net/if.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: David Laight <david@l8s.co.uk>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/49462: if_slowtimo callout mangled?
Date: Wed, 17 Dec 2014 22:16:36 +0000

 On Thu, Dec 11, 2014 at 02:55:01PM +0000, Masao Uebayashi wrote:
 > The following reply was made to PR kern/49462; it has been noted by GNATS.
 > 
 >  I would do this, to close possible race window, by dereference ifp->if_slowtimo
 >  only once:
 >  
 >  void
 >  if_slowtimo(void *arg)
 >  {
 >  	void (*slowtimo)(void *);
 >  	int s;
 >  
 >  	slowtimo = ifp->if_slowtimo;
 >  	if (slowtimo == NULL)
 >  		return;
 >  	s = splnet();
 >  	if (ifp->if_timer != 0 && --ifp->if_timer == 0)
 >  		(*slowtimo)(ifp);
 >  	splx(s);
 >  	callout_schedule(ifp->if_slowtimo_ch, hz / IFNET_SLOWHZ);
 >  }

 That still looks racy to me.
 I'd have expected that if_slowtimo would need to read inside splnet().
 Otherwuse reading it twice wouldn't matter.
 It does rather depend on what else might clear it.

 	David

 -- 
 David Laight: david@l8s.co.uk

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2014 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.