NetBSD Problem Report #50791
From gson@gson.org Fri Feb 12 10:25:08 2016
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.NetBSD.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
by mollari.NetBSD.org (Postfix) with ESMTPS id 7C96A7A213
for <gnats-bugs@gnats.NetBSD.org>; Fri, 12 Feb 2016 10:25:08 +0000 (UTC)
Message-Id: <20160212102500.64CA974508D@guava.gson.org>
Date: Fri, 12 Feb 2016 12:25:00 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@gnats.NetBSD.org
Subject: NetBSD's malloc has performance issues
X-Send-Pr-Version: 3.95
>Number: 50791
>Category: lib
>Synopsis: NetBSD's malloc has performance issues
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: lib-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Feb 12 10:30:00 +0000 2016
>Last-Modified: Thu May 02 14:10:01 +0000 2019
>Originator: Andreas Gustafsson
>Release: NetBSD 6.1.5 and 7.0
>Organization:
>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:
I noticed that an application I'm working on was performing
significantly worse on NetBSD than on other operating systems, and
tracked down the problem to malloc/free, which were using several
times more CPU time than on the other systems for this particular
workload.
>How-To-Repeat:
I have prepared a test program that performs a sequence of 100226
mallocs and frees, derived from a trace of the actual malloc/free
calls made by the application. To run it:
ftp http://www.gson.org/netbsd/bugs/malloc-perf/test.c.gz
gunzip test.c.gz
cc test.c
time ./a.out
On an Intel Core 2 Duo CPU E8400 @3.00GHz running NetBSD 6.1.5, this
took 0.666 seconds. On an identical machine running Debian 7, it
took 0.118 seconds, more than 5x faster.
For another data point, on an Intel Core i5 @3.3 GHz running NetBSD
7.0, it took 0.224 seconds. On an Intel Core i5 @2.5 GHz running
Mac OS X 10.11.3, it took 0.081 seconds, more than 2.5x faster
despite the lower clock speed.
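For illustration only (the real test.c is mechanically generated from
a trace of the application), a replay harness of the same shape, with
made-up events, looks roughly like this:

/* Hypothetical sketch, not the real test.c: replay a recorded
 * sequence of malloc/free events. */
#include <stdlib.h>

struct event {
	size_t size;	/* > 0: p[slot] = malloc(size); 0: free(p[slot]) */
	unsigned slot;
};

static const struct event trace[] = {
	{ 100000, 0 },	/* the real trace has 100226 events */
	{ 200000, 1 },
	{ 0,      0 },
	{ 150000, 0 },
	{ 0,      1 },
	{ 0,      0 },
};

int main(void) {
	static void *p[2];
	for (size_t i = 0; i < sizeof(trace) / sizeof(trace[0]); i++) {
		if (trace[i].size != 0)
			p[trace[i].slot] = malloc(trace[i].size);
		else {
			free(p[trace[i].slot]);
			p[trace[i].slot] = NULL;
		}
	}
	return 0;
}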
>Fix:
>Audit-Trail:
From: Onno van der Linden <o.vd.linden@quicknet.nl>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: lib/50791: NetBSD's malloc has performance issues
Date: Fri, 12 Feb 2016 18:17:20 +0100
> I have prepared a test program that performs a sequence of 100226
> mallocs and frees, derived from a trace of the actual malloc/free
> calls made by the application. To run it:
On my machine:
NetBSD sheep 7.99.25 NetBSD 7.99.25 (SHEEP) #9: Sun Jan 3 09:04:37 MET 2016 onno@sheep:/usr/src/sys/arch/i386/compile/SHEEP i386
$ cc mtest.c
$ time ./a.out
1.35s real 1.28s user 0.02s system
$ time ./a.out
1.30s real 1.29s user 0.01s system
$ time ./a.out
1.30s real
$ cc mtest.c -lbsdmalloc
$ time ./a.out
0.09s real 0.02s user 0.06s system
$ time ./a.out
0.09s real 0.03s user 0.06s system
Then I grabbed and installed the latest jemalloc from github
(https://github.com/jemalloc/jemalloc/archive/dev.zip)
$ cc mtest.c -Wl,-R/usr/local/lib -L/usr/local/lib -ljemalloc
$ time ./a.out
0.14s real 0.08s user 0.05s system
$ time ./a.out
0.14s real 0.09s user 0.06s system
Then I grabbed and installed jemalloc 4.0.4
$ cc mtest.c -L/usr/local/lib -Wl,-R/usr/local/lib -ljemalloc
$ time ./a.out
0.14s real 0.10s user 0.03s system
$ time ./a.out
0.13s real 0.08s user 0.05s system
Looks like NetBSD's jemalloc could use an upgrade
to 4.0.4.
Regards,
Onno
From: Andreas Gustafsson <gson@gson.org>
To: Onno van der Linden <o.vd.linden@quicknet.nl>
Cc: gnats-bugs@NetBSD.org
Subject: Re: lib/50791: NetBSD's malloc has performance issues
Date: Thu, 18 Feb 2016 18:12:56 +0200
Onno van der Linden wrote:
> Looks like NetBSD's jemalloc could use an upgrade
> to 4.0.4.
I have now imported jemalloc 4.0.4 into pkgsrc (from wip), and I'm
linking my application with that until the libc malloc gets fixed.
With this work-around, my application is showing performance similar
to that on other operating systems.
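For the record, the link command is essentially Onno's, adjusted for
the pkgsrc prefix; e.g., assuming the default /usr/pkg prefix, with
app.c standing in for the application sources:
cc app.c -L/usr/pkg/lib -Wl,-R/usr/pkg/lib -ljemalloc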
I wonder how much faster release builds would be with a newer
jemalloc in libc...
--
Andreas Gustafsson, gson@gson.org
From: Joerg Sonnenberger <joerg@britannica.bec.de>
To: gnats-bugs@NetBSD.org
Cc: lib-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org, Andreas Gustafsson <gson@gson.org>
Subject: Re: lib/50791: NetBSD's malloc has performance issues
Date: Fri, 19 Feb 2016 04:23:59 +0100
On Thu, Feb 18, 2016 at 04:15:00PM +0000, Andreas Gustafsson wrote:
> I wonder how much faster release builds would be with a newer
> jemalloc in libc...
I somehow doubt that it makes a big difference. The provided trace is
somewhat atypical in that roughly half of the allocations are between
64KB and 256KB. For jemalloc, those are large allocations, so runs of
pages are carved out of the larger 1MB chunks for them. What happens is
that we use a single arena, both in the single-threaded case and when
always running on the same CPU. The search for the next free run of
enough pages is done in chunk address order, so most of the allocations
have to scan most of the chunks. Keeping the chunk tree ordered by the
estimated size of the largest free run avoids this worst case at the
cost of shuffling things around a bit more often. A proof of concept is
attached. I haven't done any exhaustive testing of this...
Joerg
[Attachment: jemalloc.c.diff]
Index: jemalloc.c
===================================================================
RCS file: /home/joerg/repo/netbsd/src/lib/libc/stdlib/jemalloc.c,v
retrieving revision 1.39
diff -u -p -r1.39 jemalloc.c
--- jemalloc.c 24 Jan 2016 21:56:43 -0000 1.39
+++ jemalloc.c 19 Feb 2016 03:11:25 -0000
@@ -1661,6 +1661,11 @@ arena_chunk_comp(arena_chunk_t *a, arena
assert(a != NULL);
assert(b != NULL);
+ if (a->max_frun_npages < a->max_frun_npages)
+ return -1;
+ if (a->max_frun_npages > a->max_frun_npages)
+ return 1;
+
if ((uintptr_t)a < (uintptr_t)b)
return (-1);
else if (a == b)
@@ -1912,9 +1917,6 @@ arena_chunk_alloc(arena_t *arena)
chunk->arena = arena;
- /* LINTED */
- RB_INSERT(arena_chunk_tree_s, &arena->chunks, chunk);
-
/*
* Claim that no pages are in use, since the header is merely
* overhead.
@@ -1925,6 +1927,10 @@ arena_chunk_alloc(arena_t *arena)
arena_chunk_header_npages;
chunk->min_frun_ind = arena_chunk_header_npages;
+ /* Insert now that max_frun_npages is up-to-date. */
+ /* LINTED */
+ RB_INSERT(arena_chunk_tree_s, &arena->chunks, chunk);
+
/*
* Initialize enough of the map to support one maximal free run.
*/
@@ -1970,7 +1976,7 @@ arena_chunk_dealloc(arena_t *arena, aren
static arena_run_t *
arena_run_alloc(arena_t *arena, size_t size)
{
- arena_chunk_t *chunk;
+ arena_chunk_t *chunk, *chunk_tmp;
arena_run_t *run;
unsigned need_npages, limit_pages, compl_need_npages;
@@ -1986,14 +1992,30 @@ arena_run_alloc(arena_t *arena, size_t s
limit_pages = chunk_npages - arena_chunk_header_npages;
compl_need_npages = limit_pages - need_npages;
/* LINTED */
- RB_FOREACH(chunk, arena_chunk_tree_s, &arena->chunks) {
+ for (;;) {
+ chunk_tmp = RB_ROOT(&arena->chunks);
+ chunk = NULL;
+ while (chunk_tmp) {
+ if (chunk_tmp->max_frun_npages == need_npages) {
+ chunk = chunk_tmp;
+ break;
+ }
+ if (chunk_tmp->max_frun_npages < need_npages) {
+ chunk_tmp = RB_RIGHT(chunk_tmp, link);
+ continue;
+ }
+ chunk = chunk_tmp;
+ chunk_tmp = RB_LEFT(chunk, link);
+ }
+ if (chunk == NULL)
+ break;
/*
* Avoid searching this chunk if there are not enough
* contiguous free pages for there to possibly be a large
* enough free run.
*/
- if (chunk->pages_used <= compl_need_npages &&
- need_npages <= chunk->max_frun_npages) {
+ assert(need_npages <= chunk->max_frun_npages);
+ {
arena_chunk_map_t *mapelm;
unsigned i;
unsigned max_frun_npages = 0;
@@ -2031,7 +2053,9 @@ arena_run_alloc(arena_t *arena, size_t s
* chunk->min_frun_ind was already reset above (if
* necessary).
*/
+ RB_REMOVE(arena_chunk_tree_s, &arena->chunks, chunk);
chunk->max_frun_npages = max_frun_npages;
+ RB_INSERT(arena_chunk_tree_s, &arena->chunks, chunk);
}
}
@@ -2114,8 +2138,11 @@ arena_run_dalloc(arena_t *arena, arena_r
assert(chunk->map[run_ind + run_pages - 1].pos == POS_FREE);
}
- if (chunk->map[run_ind].npages > chunk->max_frun_npages)
+ if (chunk->map[run_ind].npages > chunk->max_frun_npages) {
+ RB_REMOVE(arena_chunk_tree_s, &arena->chunks, chunk);
chunk->max_frun_npages = chunk->map[run_ind].npages;
+ RB_INSERT(arena_chunk_tree_s, &arena->chunks, chunk);
+ }
if (run_ind < chunk->min_frun_ind)
chunk->min_frun_ind = run_ind;
From: Onno van der Linden <o.vd.linden@quicknet.nl>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: lib/50791: NetBSD's malloc has performance issues
Date: Fri, 19 Feb 2016 16:50:06 +0100
> I somehow doubt that it makes a big difference. The provided trace is
> somewhat atypical in that roughly half of the allocations are between
> 64KB and 256KB. For jemalloc, those are large allocations, so runs of
> pages are carved out of the larger 1MB chunks for them. What happens is
> that we use a single arena, both in the single-threaded case and when
> always running on the same CPU. The search for the next free run of
> enough pages is done in chunk address order, so most of the allocations
> have to scan most of the chunks. Keeping the chunk tree ordered by the
> estimated size of the largest free run avoids this worst case at the
> cost of shuffling things around a bit more often. A proof of concept is
> attached. I haven't done any exhaustive testing of this...
FWIW,
jemalloc.c with the provided patch and the original test program:
$ cc -I/usr/src/lib/libc/include mtest.c jemalloc.c
$ time ./a.out
0.19s real 0.17s user 0.01s system
$ time ./a.out
0.18s real 0.15s user 0.03s system
$ time ./a.out
0.19s real 0.16s user 0.03s system
$ time ./a.out
0.19s real 0.14s user 0.05s system
$ time ./a.out
0.19s real 0.15s user 0.04s system
$ time ./a.out
0.19s real 0.14s user 0.06s system
Onno
From: Andreas Gustafsson <gson@gson.org>
To: Joerg Sonnenberger <joerg@britannica.bec.de>
Cc: gnats-bugs@NetBSD.org
Subject: Re: lib/50791: NetBSD's malloc has performance issues
Date: Fri, 19 Feb 2016 19:02:26 +0200
Joerg Sonnenberger wrote:
> A proof of concept is attached. I haven't done any exhaustive
> testing for this...
[...]
> + if (a->max_frun_npages < a->max_frun_npages)
That can't be right. Presumably the second "a" should be "b"?
> + if (a->max_frun_npages > a->max_frun_npages)
Ditto.
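If so, the intended comparison is presumably by max_frun_npages first,
falling through to the existing address comparison as the tie break:

	if (a->max_frun_npages < b->max_frun_npages)
		return -1;
	if (a->max_frun_npages > b->max_frun_npages)
		return 1;
	/* fall through to the existing address comparison */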
--
Andreas Gustafsson, gson@gson.org
From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: lib/50791: NetBSD's malloc has performance issues
Date: Tue, 30 Apr 2019 17:02:13 +0300
To see if the recent jemalloc update had improved performance, I reran
my test case from this PR using versions of NetBSD-current from before
and after the update. The result was the opposite of what I expected:
my test case ran more than ten times slower after the update than
before it. Most of the difference seems to come from running
unexpectedly fast before the update, not from running unexpectedly
slow after it.
Here's the test case again, with an added -Wno-int-to-pointer-cast
option which makes compilation run more than 50 times faster (an
interesting performance anomaly in itself, but outside the scope of
this PR).
ftp http://www.gson.org/netbsd/bugs/malloc-perf/test.c.gz
gunzip test.c.gz
cc -Wno-int-to-pointer-cast test.c
time ./a.out
Before the jemalloc update (using 2019.01.01.10.09.26 sources):
# time ./a.out
0.08 real 0.06 user 0.02 sys
After the jemalloc update (using 2019.04.29.12.53.15 sources):
# time ./a.out
1.33 real 0.68 user 0.65 sys
This is NetBSD/amd64 on
cpu0: Intel(R) Xeon(R) CPU L5630 @ 2.13GHz, id 0x206c2
--
Andreas Gustafsson, gson@gson.org
From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: joerg@NetBSD.org, christos@NetBSD.org
Subject: Re: lib/50791: NetBSD's malloc has performance issues
Date: Thu, 2 May 2019 17:05:41 +0300
Here's an updated test case that eliminates the compiler warnings
that made compilation slow and returns a meaningful exit status (the
timeout value may have to be adjusted depending on the speed of the
machine):
ftp http://www.gson.org/netbsd/bugs/malloc-perf/test2.c.gz && gunzip test2.c.gz && cc test2.c && time ./a.out && timeout 0.15 ./a.out
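If relying on timeout(1) is inconvenient, the harness could also time
itself; a hypothetical sketch of such a self-checking main (not what
test2.c actually does):

#include <stdio.h>
#include <time.h>

#define LIMIT 0.15	/* seconds; needs tuning per machine, like the timeout above */

int main(void) {
	struct timespec t0, t1;
	double dt;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	/* ... replay the recorded malloc/free trace here ... */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	dt = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.3f seconds\n", dt);
	return dt > LIMIT ? 1 : 0;
}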
I ran an automated bisection using this to find out when malloc
unexpectedly got fast. It was this commit:
2016.04.12.18.07.08 joerg src/lib/libc/stdlib/jemalloc.c 1.40
lib/50791: Instead of sorting the arena chunks by address only,
sort by size of the longest run and address as tie break. Avoids long
linear searches for code heavy on medium sized allocations.
which unfortunately did not get recorded in the PR at the time.
In an email to netbsd-bugs which also did not get recorded in the PR,
christos said:
> This is expected because of -DJEMALLOC_DEBUG
I will recheck the performance once that is turned off, and close the
PR if it's back to where it was between joerg's commit and the jemalloc
update.
--
Andreas Gustafsson, gson@gson.org