NetBSD Problem Report #51615
From feyrer@vmnetbsd.promi.se Tue Nov 8 23:46:51 2016
Return-Path: <feyrer@vmnetbsd.promi.se>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
by mollari.NetBSD.org (Postfix) with ESMTPS id 2AC517A227
for <gnats-bugs@gnats.NetBSD.org>; Tue, 8 Nov 2016 23:46:51 +0000 (UTC)
Message-Id: <20161108234647.6C752DEF3F@vmnetbsd.promi.se>
Date: Wed, 9 Nov 2016 00:46:47 +0100 (CET)
From: feyrer@vmnetbsd.promi.se
Reply-To: hubert@feyrer.de
To: gnats-bugs@NetBSD.org
Subject: Userland processes not evenly distributed on all CPUs
X-Send-Pr-Version: 3.95
>Number: 51615
>Category: kern
>Synopsis: Userland processes not evenly distributed on all CPUs
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: kern-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Nov 08 23:50:00 +0000 2016
>Closed-Date: Thu Dec 22 19:51:22 +0000 2016
>Last-Modified: Thu Dec 22 19:51:22 +0000 2016
>Originator: hubert@feyrer.de
>Release: NetBSD 7.0_STABLE + -current, both as of 20161108
>Organization:
>Environment:
1) NetBSD vmnetbsd.promi.se 7.0_STABLE NetBSD 7.0_STABLE (GENERIC) #0: Tue Nov 8 22:19:10 CET 2016 feyrer@promise.local:/Users/feyrer/work/NetBSD/cvs/src-7/obj.amd64/sys/arch/amd64/compile/GENERIC amd64
2) NetBSD vmnetbsd.home.feyrer.net 7.99.42 NetBSD 7.99.42 (GENERIC) #4: Tue Nov 8 13:46:43 CET 2016 feyrer@promise.local:/Volumes/netbsd-src-objdestdir/obj.amd64-Darwin-XXX/sys/arch/amd64/compile/GENERIC amd64
System: NetBSD vmnetbsd.promi.se 7.0_STABLE NetBSD 7.0_STABLE (GENERIC) #0: Tue Nov 8 22:19:10 CET 2016 feyrer@promise.local:/Users/feyrer/work/NetBSD/cvs/src-7/obj.amd64/sys/arch/amd64/compile/GENERIC amd64
Architecture: x86_64
Machine: amd64
>Description:
On an amd64 system with two CPU cores, running two processes
that each hog CPU time, one would expect each process to run
on its own CPU. This is not the case: both compete for
one CPU, and the other one is left idle.
This happens on both NetBSD-current and 7.0_STABLE
with sources as of 2016-11-08.
The bug was first observed in a Xen environment on Amazon AWS.
>How-To-Repeat:
1) run "top" and type '1' to see all CPUs
2) run two CPU-hogging processes at the same time:
loop & loop &
3) Notice two things in top:
a) CPU and WCPU are about 50% for both processes, i.e. neither gets
a CPU to itself:
 PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
1222 feyrer    27    0    13M 1344K CPU/1      7:15 54.00% 54.00% sh
 147 feyrer    29    0    13M 1344K RUN/1      7:05 42.97% 42.97% sh
b) in the CPU stats at the top of the display, one CPU shows
100% user time and the other one 0%.
Expected is 100% on both:
CPU0 states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU1 states: 100% user, 0.0% nice, 0.0% system, 0.0% interrupt, 0.0% idle
I've put a screenshot here that shows two VMware VMs
with 2 CPU cores each, the left one running -current and
the right one running 7.0_STABLE, as can be seen from the top:
http://www.feyrer.de/Misc/priv/bad-scheduling-7.0_STABLE+7.99.42.png
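
The "loop" command used in step 2 is not shown in this report. As a
minimal sketch, assuming it is a plain sh busy loop (top lists both
processes as "sh"), a CPU hog could look like the following. The
function name "spin" and its optional iteration limit are hypothetical
and only serve as an illustration:

```shell
#!/bin/sh
# Hypothetical stand-in for the "loop" command from step 2.
# spin [N]: busy-loop for N iterations; with no argument (or 0),
# spin forever and keep one CPU fully busy in user time.
spin() {
    limit=${1:-0}
    i=0
    while [ "$limit" -eq 0 ] || [ "$i" -lt "$limit" ]; do
        i=$((i + 1))
    done
}
```

With this defined in the current shell, "spin & spin &" starts two
unbounded hogs in the background, which can then be watched in top
as described above.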
>Fix:
No idea.
A workaround exists using psrset(8), see
http://www.feyrer.de/NetBSD/blog.html/nb_20161105_1754.html
>Release-Note:
>Audit-Trail:
From: Hubert Feyrer <hubert@feyrer.de>
To: gnats-bugs@NetBSD.org
Cc: Hubert Feyrer <hubert@feyrer.de>
Subject: Re: kern/51615: Userland processes not evenly distributed on all CPUs
Date: Wed, 9 Nov 2016 10:11:16 +0100 (CET)
Pondering on the "rounding error" mentioned by Michael, I've set up both
VMs with 4 CPUs, and the behaviour shown there is that load is distributed
to about three and a half CPUs - three CPUs under full load, and one not
reaching 100%. There's definitely something fishy in there.
See screenshot:
http://www.feyrer.de/Misc/priv/bad-scheduling-7.0_STABLE+7.99.42-4CPUs.png
- Hubert
From: Hubert Feyrer <hubert@feyrer.de>
To: gnats-bugs@NetBSD.org
Cc: Hubert Feyrer <hubert@feyrer.de>
Subject: Re: kern/51615: Userland processes not evenly distributed on all CPUs
Date: Wed, 9 Nov 2016 10:20:45 +0100 (CET)
On Wed, 9 Nov 2016, Hubert Feyrer wrote:
> Pondering on the "rounding error" mentioned by Michael, I've set up both VMs
> with 4 CPUs, and the behaviour shown there is that load is distributed to
> about three and a half CPUs - three CPUs under full load, and one not reaching
> 100%. There's definitely something fishy in there.
>
> See screenshot:
> http://www.feyrer.de/Misc/priv/bad-scheduling-7.0_STABLE+7.99.42-4CPUs.png
Splitting up the four CPUs into different processor sets with one process
assigned to each set (using psrset(8)) leads to an even load distribution
here, too. This leads me to think that the NetBSD scheduling works well
between different processor sets, but is busted within one set.
- Hubert
From: Hubert Feyrer <hubert@feyrer.de>
To: gnats-bugs@NetBSD.org
Cc: Hubert Feyrer <hubert@feyrer.de>
Subject: Re: kern/51615: Userland processes not evenly distributed on all CPUs
Date: Thu, 10 Nov 2016 22:48:37 +0100 (CET)
Update:
Michael van Elst suspects that there's a rounding error in the calculation
of r_avgcount, and he proposed a patch, see first URL below.
After discussing with him, we agreed that a bigger factor may help,
which is what the second URL is about. It works, and doesn't seem to make
things worse.
http://mail-index.netbsd.org/tech-kern/2016/11/09/msg021222.html
http://mail-index.netbsd.org/tech-kern/2016/11/10/msg021227.html
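
To illustrate how such a rounding error can arise and why a bigger
factor helps, here is a rough sketch of an integer weighted moving
average. The formula, the weight of 50 and the function name
"avg_after" are assumptions chosen for illustration; they are not
taken from the kernel source, and the actual r_avgcount code is in the
patches linked above:

```shell
#!/bin/sh
# avg_after SCALE WEIGHT COUNT STEPS
# Starting from avg = 0, apply STEPS times (integer arithmetic only):
#   avg = (WEIGHT * avg + (100 - WEIGHT) * SCALE * COUNT) / 100
# and print the final value. SCALE = 1 models an unscaled average,
# a larger SCALE keeps fractional precision (value is SCALE * average).
avg_after() {
    scale=$1
    weight=$2
    count=$3
    steps=$4
    avg=0
    n=0
    while [ "$n" -lt "$steps" ]; do
        avg=$(( (weight * avg + (100 - weight) * scale * count) / 100 ))
        n=$((n + 1))
    done
    echo "$avg"
}

avg_after 1 50 1 20     # prints 0: every step truncates 50/100 to 0
avg_after 256 50 1 20   # prints 255: close to 256, i.e. about 1 thread
```

In the unscaled case a steady count of one never shows up in the
average, so a CPU with one runnable thread would look just as idle as
an empty one. Scaling by a factor such as 256 keeps enough precision
for the average to approach the real count.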
Review & comments wanted!
- Hubert
From: Hubert Feyrer <hubert@feyrer.de>
To: gnats-bugs@NetBSD.org
Cc: Hubert Feyrer <hubert@feyrer.de>
Subject: Re: kern/51615: Userland processes not evenly distributed on all CPUs
Date: Sun, 13 Nov 2016 01:20:22 +0100 (CET)
Update:
I have learned that this PR is a duplicate of PR 43561.
Funnily enough, the patch proposed there was the same as here,
though implemented slightly differently.
Following advice to test it, I did so:
I started a "build.sh -j8" on a (VMware Fusion) VM with 4 CPUs on a
MacBook Pro, and it nearly brought the machine to a halt, yet what I saw
was lots of idle time on all CPUs. I aborted the exercise to get some
CPU cycles back for myself.
I restarted the exercise with 2 CPUs in the same VM, and there I saw load
distributed across both CPUs (no wonder with -j8), but there was also
quite some idle time in the 'make clean / install' phases that I'm not
sure is normal. During the actual build phases I saw no idle
time, though the system spent quite some time in the kernel (system).
Example top(1) output is appended below.
All in all, I'd say the patch is a good step forward from the current
situation, which does not properly distribute pure CPU hogs at all.
- Hubert
load averages: 9.01, 8.60, 7.15; up 0+01:24:11 01:19:33
67 processes: 7 runnable, 58 sleeping, 2 on CPU
CPU0 states: 0.0% user, 55.4% nice, 44.6% system, 0.0% interrupt, 0.0% idle
CPU1 states: 0.0% user, 69.3% nice, 30.7% system, 0.0% interrupt, 0.0% idle
Memory: 311M Act, 99M Inact, 6736K Wired, 23M Exec, 322M File, 395M Free
Swap: 1536M Total, 21M Used, 1516M Free
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
27028 feyrer 20 5 62M 27M CPU/1 0:00 9.74% 0.93% cc1
728 feyrer 85 0 78M 3808K select/1 1:03 0.73% 0.73% sshd
23274 feyrer 21 5 36M 14M RUN/0 0:00 10.00% 0.49% cc1
21634 feyrer 20 5 44M 20M RUN/0 0:00 7.00% 0.34% cc1
24697 feyrer 77 5 7988K 2480K select/1 0:00 0.31% 0.15% nbmake
24964 feyrer 74 5 11M 5496K select/1 0:00 0.44% 0.15% nbmake
18221 feyrer 21 5 49M 15M RUN/0 0:00 2.00% 0.10% cc1
14513 feyrer 20 5 43M 16M RUN/0 0:00 2.00% 0.10% cc1
518 feyrer 43 0 15M 1764K CPU/0 0:02 0.00% 0.00% top
20842 feyrer 21 5 6992K 340K RUN/0 0:00 0.00% 0.00% x86_64--netb
16215 feyrer 21 5 28M 172K RUN/0 0:00 0.00% 0.00% cc1
8922 feyrer 20 5 51M 14M RUN/0 0:00 0.00% 0.00% cc1
State-Changed-From-To: open->closed
State-Changed-By: hubertf@NetBSD.org
State-Changed-When: Thu, 22 Dec 2016 19:51:22 +0000
State-Changed-Why:
Michael van Elst committed his code today:
http://mail-index.netbsd.org/source-changes/2016/12/22/msg080093.html
This addresses the issue of this PR.
- Hubert
>Unformatted:
Copyright © 1994-2014
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.