NetBSD Problem Report #40419

From www@NetBSD.org  Fri Jan 16 22:04:38 2009
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id 6C82763BA90
	for <gnats-bugs@gnats.netbsd.org>; Fri, 16 Jan 2009 22:04:38 +0000 (UTC)
Message-Id: <20090116220437.85E7663B8BA@narn.NetBSD.org>
Date: Fri, 16 Jan 2009 22:04:37 +0000 (UTC)
From: ad@netbsd.org
Reply-To: ad@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: processor sets broken on 5.99.6
X-Send-Pr-Version: www-1.0

>Number:         40419
>Category:       kern
>Synopsis:       processor sets broken on 5.99.6
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    rmind
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Jan 16 22:05:00 +0000 2009
>Closed-Date:    Wed Jan 21 00:04:29 +0000 2009
>Last-Modified:  Thu Jan 22 20:10:03 +0000 2009
>Originator:     Andrew Doran
>Release:        5.99.6
>Organization:
The NetBSD Project
>Environment:
NetBSD spanners.hairylemon.org 5.99.6 NetBSD 5.99.6 (GENERIC) #6: Tue Dec 30 13:19:29 GMT 2008  ad@spanners.hairylemon.org:/local/home/ad/src/sys/arch/i386/compile/GENERIC i386

>Description:
# cpuctl list     
Num  HwId Unbound LWPs Interrupts     Last change
---- ---- ------------ -------------- ----------------------------
0    0    online       intr           Fri Jan 16 17:30:28 2009
1    1    online       intr           Fri Jan 16 17:30:28 2009
# psrset
system processor set 0: processor(s) 0
user processor set 1: processor(s) 1
# psrset -e 1 top | grep top
  561 root      43    0   752K 1116K CPU/0      0:00  0.00%  0.00% top


CPU/0 ???

>How-To-Repeat:
See above.
>Fix:
Not investigated.

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: kern-bug-people->rmind
Responsible-Changed-By: rmind@NetBSD.org
Responsible-Changed-When: Sun, 18 Jan 2009 01:44:43 +0000
Responsible-Changed-Why:
Take.


From: Mindaugas Rasiukevicius <rmind@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/40419 CVS commit: src/sys/kern
Date: Sun, 18 Jan 2009 05:07:51 +0000 (UTC)

 Module Name:	src
 Committed By:	rmind
 Date:		Sun Jan 18 05:07:51 UTC 2009

 Modified Files:
 	src/sys/kern: kern_runq.c

 Log Message:
 - Avoid calling sched_catchlwp() if CPUs have different processor-sets.
 - sched_takecpu: check for psid earlier (be more strict).

 PR/40419.


 To generate a diff of this commit:
 cvs rdiff -r1.23 -r1.24 src/sys/kern/kern_runq.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Mindaugas Rasiukevicius <rmind@netbsd.org>
To: ad@netbsd.org
Cc: gnats-bugs@NetBSD.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/40419 (processor sets broken on 5.99.6)
Date: Sun, 18 Jan 2009 05:26:10 +0000

 > >Number:         40419
 > >Category:       kern
 > >Synopsis:       processor sets broken on 5.99.6
 >
 > ...
 > 
 > # cpuctl list     
 > Num  HwId Unbound LWPs Interrupts     Last change
 > ---- ---- ------------ -------------- ----------------------------
 > 0    0    online       intr           Fri Jan 16 17:30:28 2009
 > 1    1    online       intr           Fri Jan 16 17:30:28 2009
 > # psrset
 > system processor set 0: processor(s) 0
 > user processor set 1: processor(s) 1
 > # psrset -e 1 top | grep top
 >   561 root      43    0   752K 1116K CPU/0      0:00  0.00%  0.00% top

 What happens later?

 I have committed one fix (please review), sched_takecpu() would keep the
 thread on local CPU if it idles. Now function is strict. But even with
 previous behaviour, after some load process should have been migrated to
 the appropriate CPU (and then it should not be able to get out).

 However, in your test case, application might still be on local CPU,
 because curlwp (LSONRPOC) might not migrate that fast. We can do yield()
 in sys__pset_bind(). Do you think it is worth?

 Otherwise, it seems to be working for me.

 -- 
 Best regards,
 Mindaugas

From: Andrew Doran <ad@netbsd.org>
To: Mindaugas Rasiukevicius <rmind@netbsd.org>
Cc: gnats-bugs@NetBSD.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/40419 (processor sets broken on 5.99.6)
Date: Mon, 19 Jan 2009 09:02:17 +0000

 > I have committed one fix (please review), sched_takecpu() would keep the
 > thread on local CPU if it idles.

 Thanks. I will look at it later, don't have the time right now.

 > However, in your test case, application might still be on local CPU,
 > because curlwp (LSONRPOC) might not migrate that fast.

 I think this is what's happending.

 > We can do yield() in sys__pset_bind(). Do you think it is worth?

 I was thinking of a function that scans all threads, with cpu_lock held, and
 checks to see if their l_cpu is allowed by their affinity mask, processor
 set or LP_BOUND flag. If not, change l_cpu (or migrate if online), then do a
 broadcast xcall to nullop() if there have been migrations.

 Hmm, shouldn't psets and affinity masks be mutually exclusive?

 Andrew

From: Mindaugas Rasiukevicius <rmind@netbsd.org>
To: Andrew Doran <ad@netbsd.org>
Cc: gnats-bugs@NetBSD.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/40419 (processor sets broken on 5.99.6)
Date: Tue, 20 Jan 2009 02:08:53 +0000

 Andrew Doran <ad@netbsd.org> wrote:
 > > We can do yield() in sys__pset_bind(). Do you think it is worth?
 > 
 > I was thinking of a function that scans all threads, with cpu_lock held, and
 > checks to see if their l_cpu is allowed by their affinity mask, processor
 > set or LP_BOUND flag. If not, change l_cpu (or migrate if online), then do a
 > broadcast xcall to nullop() if there have been migrations.
 > 
 > Hmm, shouldn't psets and affinity masks be mutually exclusive?
 > 

 Yes, exactly what I thought while fixing this. While originally mixing was
 permitted, intervention to the "jailed" processor-set is really wrong!

 Fix committed (without xcall bit):

 http://mail-index.netbsd.org/source-changes/2009/01/20/msg215921.html

 By the way, does it make more sense to return EPERM or EBUSY, or other?

 > Andrew

 -- 
 Best regards,
 Mindaugas

State-Changed-From-To: open->closed
State-Changed-By: rmind@NetBSD.org
State-Changed-When: Wed, 21 Jan 2009 00:04:29 +0000
State-Changed-Why:
Fixed, thanks.


From: Mindaugas Rasiukevicius <rmind@netbsd.org>
To: Andrew Doran <ad@netbsd.org>
Cc: gnats-bugs@NetBSD.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/40419 (processor sets broken on 5.99.6)
Date: Wed, 21 Jan 2009 00:01:49 +0000

 Andrew Doran <ad@netbsd.org> wrote:
 > > We can do yield() in sys__pset_bind(). Do you think it is worth?
 > 
 > I was thinking of a function that scans all threads, with cpu_lock held, and
 > checks to see if their l_cpu is allowed by their affinity mask, processor
 > set or LP_BOUND flag. If not, change l_cpu (or migrate if online), then do a
 > broadcast xcall to nullop() if there have been migrations.

 After some thinking, I do not think it is worth. Theoretically, xc_broadcast
 might still not ensure that all LWPs have migrated, eg. in a case when there
 are many migrating LWPs in the same run-queue.

 After last fixes, I think migration should be very fast (~immediate) anyway,
 just re-scheduling overhead.

 -- 
 Best regards,
 Mindaugas

From: Andrew Doran <ad@netbsd.org>
To: Mindaugas Rasiukevicius <rmind@netbsd.org>
Cc: gnats-bugs@NetBSD.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/40419 (processor sets broken on 5.99.6)
Date: Wed, 21 Jan 2009 09:56:38 +0000

 On Wed, Jan 21, 2009 at 12:01:49AM +0000, Mindaugas Rasiukevicius wrote:

 > Andrew Doran <ad@netbsd.org> wrote:
 > > > We can do yield() in sys__pset_bind(). Do you think it is worth?
 > > 
 > > I was thinking of a function that scans all threads, with cpu_lock held, and
 > > checks to see if their l_cpu is allowed by their affinity mask, processor
 > > set or LP_BOUND flag. If not, change l_cpu (or migrate if online), then do a
 > > broadcast xcall to nullop() if there have been migrations.
 > 
 > After some thinking, I do not think it is worth. Theoretically, xc_broadcast
 > might still not ensure that all LWPs have migrated, eg. in a case when there
 > are many migrating LWPs in the same run-queue.

 Hmm. I can't look at the code right now. If we can have LWPs in the wrong
 runqueue after a pset/affinity change, we should move them to prevent them
 running on that CPU after there is a context switch. Maybe it would be
 useful to add a syncobj_t::sobj_changecpu()?

 Thanks,
 Andrew

From: Mindaugas Rasiukevicius <rmind@netbsd.org>
To: Andrew Doran <ad@netbsd.org>
Cc: gnats-bugs@NetBSD.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/40419 (processor sets broken on 5.99.6)
Date: Wed, 21 Jan 2009 10:17:14 +0000

 Andrew Doran <ad@netbsd.org> wrote:
 > > > I was thinking of a function that scans all threads, with cpu_lock
 > > > held, and checks to see if their l_cpu is allowed by their affinity
 > > > mask, processor set or LP_BOUND flag. If not, change l_cpu (or migrate
 > > > if online), then do a broadcast xcall to nullop() if there have been
 > > > migrations.
 > > 
 > > After some thinking, I do not think it is worth. Theoretically,
 > > xc_broadcast might still not ensure that all LWPs have migrated, eg. in a
 > > case when there are many migrating LWPs in the same run-queue.
 > 
 > Hmm. I can't look at the code right now. If we can have LWPs in the wrong
 > runqueue after a pset/affinity change, we should move them to prevent them
 > running on that CPU after there is a context switch. Maybe it would be
 > useful to add a syncobj_t::sobj_changecpu()?

 It would not migrate immediately only if LWP is in LSRUN (and if it's in the
 runqueue) or LSONPROC state. For both cases LWP would migrate just after the
 context-switch via idle loop (l_target_cpu != NULL).

 If we really want to be synchronous about this, then a simple cv_wait() in
 lwp_migrate() and cv_broadcast() in sched_idle() would work. Do you see some
 good reason to do this?

 -- 
 Best regards,
 Mindaugas

From: Soren Jacobsen <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/40419 CVS commit: [netbsd-5] src/sys/kern
Date: Thu, 22 Jan 2009 20:06:47 +0000 (UTC)

 Module Name:	src
 Committed By:	snj
 Date:		Thu Jan 22 20:06:47 UTC 2009

 Modified Files:
 	src/sys/kern [netbsd-5]: kern_runq.c

 Log Message:
 Pull up following revision(s) (requested by rmind in ticket #284):
 	sys/kern/kern_runq.c: revision 1.24
 - Avoid calling sched_catchlwp() if CPUs have different processor-sets.
 - sched_takecpu: check for psid earlier (be more strict).
 PR/40419.


 To generate a diff of this commit:
 cvs rdiff -r1.22 -r1.22.4.1 src/sys/kern/kern_runq.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.