NetBSD Problem Report #52184

From gson@gson.org  Fri Apr 21 09:36:52 2017
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id B90617A1F5
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 21 Apr 2017 09:36:52 +0000 (UTC)
Message-Id: <20170421093646.92D3F7446DB@guava.gson.org>
Date: Fri, 21 Apr 2017 12:36:46 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: Recent qemu performs badly on NetBSD hosts under load
X-Send-Pr-Version: 3.95

>Number:         52184
>Category:       misc
>Synopsis:       Recent qemu performs badly on NetBSD hosts under load
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    misc-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Apr 21 09:40:00 +0000 2017
>Closed-Date:    Mon Apr 16 14:45:17 +0000 2018
>Last-Modified:  Mon Apr 16 14:45:17 +0000 2018
>Originator:     Andreas Gustafsson
>Release:        NetBSD 7.1
>Organization:

>Environment:
System: NetBSD 7.1
Architecture: x86_64
Machine: amd64
>Description:

The NetBSD testbed babylon5.netbsd.org runs automated full-system
tests using qemu.  After the qemu on babylon5 was upgraded from
version 0.15.1 to 2.8.0 in early March, the tests have been taking
much longer to run than they did with 0.15.1, at least twice as long
on average.

The exact reason for the slowdown is still unknown, and could be a
qemu issue, a NetBSD issue, or a combination of the two; hence the
"misc" category of this PR.

I have run a number of tests to try to track down the circumstances in
which the problem occurs.  As far as I can tell, they are as follows:

  1. The version of qemu must include commit
     05e514b1d4d5bd4209e2c8bbc76ff05c85a235f3
     (found by bisection).

  2. The host system that qemu is executed on must be NetBSD
     rather than Linux.

  3. Some minor slowdown may occur under just the two above
     conditions, but for the slowdown to be dramatic, the host system
     must be under some CPU load.  On babylon5, this load typically
     consists of a parallel build of NetBSD (started with "nice 10")
     and other qemu processes.  In my own tests, I have simulated it
     by running multiple infinite loop processes in the background
     (again with "nice 10").

A shell script to automate reproducing the problem in available at

   http://www.gson.org/netbsd/bugs/qemu-slow/test.sh

This script builds two versions of qemu from git, from before and
after the above commit, downloads a NetBSD 6.0.1 image pre-configured
to run the ATF tests, starts infinite loop processes in the background
to generate CPU load, and runs the test image using each qemu version
in turn.  It needs about 2.5 GB of disk and several hours to run, and
due to the high CPU load it generates, I would not recommend running
it on a production system.

When run under NetBSD 7.1/amd64 on a 12-core HP DL360 G7 server, the
script reported the following execution times for qemu git revision
21a03d17f2edb1e63f7137d97ba355cc6f19d79f:

   real      3871.12
   user      2934.44
   sys         17.25

And for qemu git revision 05e514b1d4d5bd4209e2c8bbc76ff05c85a235f3:

   real     16813.29
   user     17774.38
   sys       5884.99

Note that the difference is especially dramatic for the system time,
which increased by a factor of more than 300.  When the script was run
on a Linux system, there was no significant difference in execution
time between the two qemu versions.

The commit message for qemu commit 05e514b1d4d5bd4209e2c8bbc76ff05c85a235f3
reads:

    AioContext: optimize clearing the EventNotifier

    It is pretty rare for aio_notify to actually set the EventNotifier.  It
    can happen with worker threads such as thread-pool.c's, but otherwise it
    should never be set thanks to the ctx->notify_me optimization.  The
    previous patch, unfortunately, added an unconditional call to
    event_notifier_test_and_clear; now add a userspace fast path that
    avoids the call.

    Note that it is not possible to do the same with event_notifier_set;
    it would break, as proved (again) by the included formal model.

    This patch survived over 3000 reboots on aarch64 KVM.

    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    Reviewed-by: Fam Zheng <famz@redhat.com>
    Tested-by: Richard W.M. Jones <rjones@redhat.com>
    Message-id: 1437487673-23740-7-git-send-email-pbonzini@redhat.com
    Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Reading the diff of the commit in case did not make it immediately
clear to me why this should cause a performance regression on NetBSD.
Any help in explaining this is appreciated.

>How-To-Repeat:

Run the test script.

>Fix:

>Release-Note:

>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: misc/52184: Recent qemu performs badly on NetBSD hosts under load
Date: Fri, 21 Apr 2017 12:00:20 +0200

 On Fri, Apr 21, 2017 at 09:40:00AM +0000, Andreas Gustafsson wrote:
 > Reading the diff of the commit in case did not make it immediately
 > clear to me why this should cause a performance regression on NetBSD.
 > Any help in explaining this is appreciated.

 We should probably investigate what atomic_xchg() actually expands to:

 +void aio_notify_accept(AioContext *ctx)
 +{
 +    if (atomic_xchg(&ctx->notified, false)) {
 +        event_notifier_test_and_clear(&ctx->notifier);
      }

 ctx->notified is a boolean, so I guess __atomic_exchange_1().
 For testing, it might be interesting to align it on a cache line boundary
 or make it a 32bit value.

 It may also be a compiler issue, can you test a qemu compiled with a newer
 gcc, or even repeat the test in NetBSD -current?

 Martin

From: coypu@sdf.org
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: misc/52184: Recent qemu performs badly on NetBSD hosts under load
Date: Fri, 21 Apr 2017 10:27:10 +0000

 These came up while searching for the commit ID

 https://patchwork.kernel.org/patch/9497653/
 https://bugs.launchpad.net/qemu/+bug/1481272

From: Andreas Gustafsson <gson@gson.org>
To: copyu@sdf.org
Cc: gnats-bugs@NetBSD.org
Subject: Re: misc/52184: Recent qemu performs badly on NetBSD hosts under load
Date: Fri, 21 Apr 2017 15:17:29 +0300

 coypu@sdf.org wrote:
 >  These came up while searching for the commit ID
 >  
 >  https://patchwork.kernel.org/patch/9497653/
 >  https://bugs.launchpad.net/qemu/+bug/1481272

 For what it's worth, the message "main-loop: WARNING: I/O thread spun
 for 1000 iterations" reported in the latter link also appears in the
 output of the NetBSD test runs where qemu includes commit
 05e514b1d4d5bd4209e2c8bbc76ff05c85a235f3, whether hosted on NetBSD
 on Linux.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Andreas Gustafsson <gson@gson.org>
To: Martin Husemann <martin@duskware.de>
Cc: gnats-bugs@NetBSD.org
Subject: Re: misc/52184: Recent qemu performs badly on NetBSD hosts under load
Date: Mon, 24 Apr 2017 15:17:39 +0300

 Martin Husemann wrote:
 >  We should probably investigate what atomic_xchg() actually expands to:

 Just a plain xchg of an 8-bit register and memory:

 $ gdb qemu-system-x86_64.05e514b1d4d5bd4209e2c8bbc76ff05c85a235f3
 (gdb) x/10i aio_notify_accept
    0x61afb8 <aio_notify_accept>:        sub    $0x18,%rsp
    0x61afbc <aio_notify_accept+4>:      mov    0x4eaddd(%rip),%rax        # 0xb05da0 <__stack_chk_guard>
    0x61afc3 <aio_notify_accept+11>:     mov    %rax,0x8(%rsp)
    0x61afc8 <aio_notify_accept+16>:     xor    %eax,%eax
    0x61afca <aio_notify_accept+18>:     xchg   %al,0x12c(%rdi)
    0x61afd0 <aio_notify_accept+24>:     test   %al,%al
    0x61afd2 <aio_notify_accept+26>:     jne    0x61afe7 <aio_notify_accept+47>

 >  For testing, it might be interesting to align it on a cache line boundary
 >  or make it a 32bit value.
 >  
 >  It may also be a compiler issue, can you test a qemu compiled with a newer
 >  gcc, or even repeat the test in NetBSD -current?

 I repeated the test in NetBSD-current (source date
 2017.04.17.09.54.59) using qemu prerequisite packages built from
 source using a recent pkgsrc.  Since PaX mprotect is enabled by
 default in -current, I had to disable it using "sysctl -w
 security.pax.mprotect.enabled=0" to get qemu to run.

 The results were similar to those from NetSBD 7.1:

 qemu git revision 21a03d17f2edb1e63f7137d97ba355cc6f19d79f:
 real      4468.86
 user      2870.44
 sys         25.24

 qemu git revision 05e514b1d4d5bd4209e2c8bbc76ff05c85a235f3:
 real     13180.34
 user     10890.35
 sys       6131.38

 -- 
 Andreas Gustafsson, gson@gson.org

From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: misc/52184: Recent qemu performs badly on NetBSD hosts under load
Date: Mon, 16 Apr 2018 17:24:28 +0300

 I believe the performance issues reported in this PR have now mostly
 been resolved.  There were multiple causes:

 1. The increase in system time was because of QEMU bug 1481272, which
 has now been fixed.  See https://bugs.launchpad.net/qemu/+bug/1481272/
 for details.

 2. Much of the increase in user time appears to have been due to QEMU
 2 running slowly when using a small translation buffer size.  By
 default, QEMU sizes the translation buffer based on the amount of RAM
 configured for the VM, and i386 in particular was configured with a
 small amount of RAM (32 MB), which caused a translation buffer of only
 8 MB to be used.  This was partly addressed some time ago by
 increasing the RAM configured for the i386 tests, and performance
 improved further when I recently added the qemu command line option
 "-tb-size 128" to the configuration on babylon5.netbsd.org to increase
 the size further.

 The size of the translation buffer appears to have a much smaller
 effect on performance in qemu 0.15, but I don't know why.

 3. There was also an issue with qemu failing to detect the presence
 of posix_memalign() on NetBSD, falling back to valloc() and wasting
 memory by unnecessarily rounding allocations up to full pages.  This
 does not appear to have been a major cause of the performance issues,
 but in any case, this too is now fixed.

 4. The NetBSD tests themselves have grown substantially, and now
 require several times more CPU time than they did a couple of years
 ago.

 Testing historic versions of NetBSD using a recent qemu with -m 128
 -tb-size 128 shows it running the tests faster than qemu 0.15 did
 (both running on the current babylon5.netbsd.org hardware):

 NetBSD source date 2016.01.01.19.14.45, qemu-0.15 -m 32
     18986.10 real     15865.06 user       604.28 sys                                           

 NetBSD source date 2016.01.01.19.46.48, qemu-2.11.1 -m 128 -tb-size 128 
     14637.12 real      9664.66 user       142.80 sys
 -- 
 Andreas Gustafsson, gson@gson.org

State-Changed-From-To: open->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Mon, 16 Apr 2018 14:45:17 +0000
State-Changed-Why:
After bug fixes and configuration changes, recent qemu runs faster than 0.15.


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.43 2018/01/16 07:36:43 maya Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2017 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.