NetBSD Problem Report #50791

From gson@gson.org  Fri Feb 12 10:25:08 2016
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.NetBSD.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 7C96A7A213
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 12 Feb 2016 10:25:08 +0000 (UTC)
Message-Id: <20160212102500.64CA974508D@guava.gson.org>
Date: Fri, 12 Feb 2016 12:25:00 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@gnats.NetBSD.org
Subject: NetBSD's malloc has performance issues
X-Send-Pr-Version: 3.95

>Number:         50791
>Category:       lib
>Synopsis:       NetBSD's malloc has performance issues
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    lib-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Feb 12 10:30:00 +0000 2016
>Last-Modified:  Thu May 02 14:10:01 +0000 2019
>Originator:     Andreas Gustafsson
>Release:        NetBSD 6.1.5 and 7.0
>Organization:

>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:

I noticed that an application I'm working on was performing
significantly worse on NetBSD than on other operating systems, and
tracked down the problem to malloc/free, which were using several
times more CPU time than on the other systems for this particular
workload.

>How-To-Repeat:

I have prepared a test program that performs a sequence of 100226
mallocs and frees, derived from a trace of the actual malloc/free
calls made by the application.  To run it:

  ftp http://www.gson.org/netbsd/bugs/malloc-perf/test.c.gz
  gunzip test.c.gz
  cc test.c
  time ./a.out

On an Intel Core 2 Duo CPU E8400 @3.00GHz running NetBSD 6.1.5, this
took 0.666 seconds.  On an identical machine running Debian 7, it
took 0.118 seconds, more than 5x faster.

For another data point, on an Intel Core i5 @3.3 GHz running NetBSD
7.0, it took 0.224 seconds.  On a Intel Core i5 @2.5 GHz running
Mac OS X 10.11.3, it took 0.081 seconds, more than 2.5x faster
despite the lower clock speed.

>Fix:

>Audit-Trail:
From: Onno van der Linden <o.vd.linden@quicknet.nl>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: lib/50791: NetBSD's malloc has performance issues
Date: Fri, 12 Feb 2016 18:17:20 +0100

 > I have prepared a test program that performs a sequence of 100226
 > mallocs and frees, derived from a trace of the actual malloc/free
 > calls made by the application.  To run it:

 On my machine:
 NetBSD sheep 7.99.25 NetBSD 7.99.25 (SHEEP) #9: Sun Jan  3 09:04:37 MET 2016  onno@sheep:/usr/src/sys/arch/i386/compile/SHEEP i386

 $ cc mtest.c
 $ time ./a.out
     1.35s real     1.28s user     0.02s system
 $ time ./a.out 
     1.30s real     1.29s user     0.01s system
 $ time ./a.out 
     1.30s real
 $ cc mtest.c -lbsdmalloc
 $ time ./a.out           
     0.09s real     0.02s user     0.06s system
 $ time ./a.out 
     0.09s real     0.03s user     0.06s system

 Then I grabbed and installed the latest jemalloc from github
 (https://github.com/jemalloc/jemalloc/archive/dev.zip)

 $ cc mtest.c -Wl,-R/usr/local/lib -L/usr/local/lib -ljemalloc
 $ time ./a.out
     0.14s real     0.08s user     0.05s system
 $ time ./a.out 
     0.14s real     0.09s user     0.06s system

 Then I grabbed and installed jemalloc 4.0.4
 $ cc mtest.c -L/usr/local/lib -Wl,-R/usr/local/lib  -ljemalloc 
 $ time ./a.out                                                 
     0.14s real     0.10s user     0.03s system
 $ time ./a.out 
     0.13s real     0.08s user     0.05s system

 Looks like NetBSD's jemalloc could use an upgrade
 to 4.0.4.

 Regards,

 Onno

From: Andreas Gustafsson <gson@gson.org>
To: Onno van der Linden <o.vd.linden@quicknet.nl>
Cc: gnats-bugs@NetBSD.org
Subject: Re: lib/50791: NetBSD's malloc has performance issues
Date: Thu, 18 Feb 2016 18:12:56 +0200

 Onno van der Linden wrote:
 >  Looks like NetBSD's jemalloc could use an upgrade
 >  to 4.0.4.

 I have now imported jemalloc 4.0.4 into pkgsrc (from wip), and I'm
 linking my application with that until the libc malloc gets fixed.
 With this work-around, my application is showing performance similar
 to that on other operating systems.

 I wonder how much faster release builds would be with a newer
 jemalloc in libc...
 -- 
 Andreas Gustafsson, gson@gson.org

From: Joerg Sonnenberger <joerg@britannica.bec.de>
To: gnats-bugs@NetBSD.org
Cc: lib-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, Andreas Gustafsson <gson@gson.org>
Subject: Re: lib/50791: NetBSD's malloc has performance issues
Date: Fri, 19 Feb 2016 04:23:59 +0100

 --vtzGhvizbBRQ85DL
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: inline

 On Thu, Feb 18, 2016 at 04:15:00PM +0000, Andreas Gustafsson wrote:
 >  I wonder how much faster release builds would be with a newer
 >  jemalloc in libc...

 I somehow doubt that it makes a big difference. The provided trace is
 somewhat atypical as roughly half of the allocations is between 64KB and
 256KB. For jemalloc, that's a large allocation, so chunks out of the
 larger 1MB blocks are allocated for it. What happens is that we are
 using a single arena in either the single threaded or always running on
 the same CPU case. The search of for the next free run of enough pages
 is done in chunk address order, so the most of the allocations
 process have to process most of the chunks. Keeping the chunk tree
 ordered by the estimated number size of the largest run avoids this
 worst case at the cost of shuffling things around a bit more often. A
 proof of concept is attached. I haven't done any exhaustive testing for
 this...

 Joerg

 --vtzGhvizbBRQ85DL
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: attachment; filename="jemalloc.c.diff"

 Index: jemalloc.c
 ===================================================================
 RCS file: /home/joerg/repo/netbsd/src/lib/libc/stdlib/jemalloc.c,v
 retrieving revision 1.39
 diff -u -p -r1.39 jemalloc.c
 --- jemalloc.c	24 Jan 2016 21:56:43 -0000	1.39
 +++ jemalloc.c	19 Feb 2016 03:11:25 -0000
 @@ -1661,6 +1661,11 @@ arena_chunk_comp(arena_chunk_t *a, arena
  	assert(a != NULL);
  	assert(b != NULL);

 +	if (a->max_frun_npages < a->max_frun_npages)
 +		return -1;
 +	if (a->max_frun_npages > a->max_frun_npages)
 +		return 1;
 +
  	if ((uintptr_t)a < (uintptr_t)b)
  		return (-1);
  	else if (a == b)
 @@ -1912,9 +1917,6 @@ arena_chunk_alloc(arena_t *arena)

  		chunk->arena = arena;

 -		/* LINTED */
 -		RB_INSERT(arena_chunk_tree_s, &arena->chunks, chunk);
 -
  		/*
  		 * Claim that no pages are in use, since the header is merely
  		 * overhead.
 @@ -1925,6 +1927,10 @@ arena_chunk_alloc(arena_t *arena)
  		    arena_chunk_header_npages;
  		chunk->min_frun_ind = arena_chunk_header_npages;

 +		/* Insert now that max_frun_npages is up-to-date. */
 +		/* LINTED */
 +		RB_INSERT(arena_chunk_tree_s, &arena->chunks, chunk);
 +
  		/*
  		 * Initialize enough of the map to support one maximal free run.
  		 */
 @@ -1970,7 +1976,7 @@ arena_chunk_dealloc(arena_t *arena, aren
  static arena_run_t *
  arena_run_alloc(arena_t *arena, size_t size)
  {
 -	arena_chunk_t *chunk;
 +	arena_chunk_t *chunk, *chunk_tmp;
  	arena_run_t *run;
  	unsigned need_npages, limit_pages, compl_need_npages;

 @@ -1986,14 +1992,30 @@ arena_run_alloc(arena_t *arena, size_t s
  	limit_pages = chunk_npages - arena_chunk_header_npages;
  	compl_need_npages = limit_pages - need_npages;
  	/* LINTED */
 -	RB_FOREACH(chunk, arena_chunk_tree_s, &arena->chunks) {
 +	for (;;) {
 +		chunk_tmp = RB_ROOT(&arena->chunks);
 +		chunk = NULL;
 +		while (chunk_tmp) {
 +			if (chunk_tmp->max_frun_npages == need_npages) {
 +				chunk = chunk_tmp;
 +				break;
 +			}
 +			if (chunk_tmp->max_frun_npages < need_npages) {
 +				chunk_tmp = RB_RIGHT(chunk_tmp, link);
 +				continue;
 +			}
 +			chunk = chunk_tmp;
 +			chunk_tmp = RB_LEFT(chunk, link);
 +		}
 +		if (chunk == NULL)
 +			break;
  		/*
  		 * Avoid searching this chunk if there are not enough
  		 * contiguous free pages for there to possibly be a large
  		 * enough free run.
  		 */
 -		if (chunk->pages_used <= compl_need_npages &&
 -		    need_npages <= chunk->max_frun_npages) {
 +		assert(need_npages <= chunk->max_frun_npages);
 +		{
  			arena_chunk_map_t *mapelm;
  			unsigned i;
  			unsigned max_frun_npages = 0;
 @@ -2031,7 +2053,9 @@ arena_run_alloc(arena_t *arena, size_t s
  			 * chunk->min_frun_ind was already reset above (if
  			 * necessary).
  			 */
 +			RB_REMOVE(arena_chunk_tree_s, &arena->chunks, chunk);
  			chunk->max_frun_npages = max_frun_npages;
 +			RB_INSERT(arena_chunk_tree_s, &arena->chunks, chunk);
  		}
  	}

 @@ -2114,8 +2138,11 @@ arena_run_dalloc(arena_t *arena, arena_r
  		assert(chunk->map[run_ind + run_pages - 1].pos == POS_FREE);
  	}

 -	if (chunk->map[run_ind].npages > chunk->max_frun_npages)
 +	if (chunk->map[run_ind].npages > chunk->max_frun_npages) {
 +		RB_REMOVE(arena_chunk_tree_s, &arena->chunks, chunk);
  		chunk->max_frun_npages = chunk->map[run_ind].npages;
 +		RB_INSERT(arena_chunk_tree_s, &arena->chunks, chunk);
 +	}
  	if (run_ind < chunk->min_frun_ind)
  		chunk->min_frun_ind = run_ind;


 --vtzGhvizbBRQ85DL--

From: Onno van der Linden <o.vd.linden@quicknet.nl>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: lib/50791: NetBSD's malloc has performance issues
Date: Fri, 19 Feb 2016 16:50:06 +0100

 >  I somehow doubt that it makes a big difference. The provided trace is
 >  somewhat atypical as roughly half of the allocations is between 64KB and
 >  256KB. For jemalloc, that's a large allocation, so chunks out of the
 >  larger 1MB blocks are allocated for it. What happens is that we are
 >  using a single arena in either the single threaded or always running on
 >  the same CPU case. The search of for the next free run of enough pages
 >  is done in chunk address order, so the most of the allocations
 >  process have to process most of the chunks. Keeping the chunk tree
 >  ordered by the estimated number size of the largest run avoids this
 >  worst case at the cost of shuffling things around a bit more often. A
 >  proof of concept is attached. I haven't done any exhaustive testing for
 >  this...

 FWIW,

 jemalloc.c with the provided patch and the original test program:

 $ cc -I/usr/src/lib/libc/include  mtest.c jemalloc.c
 $ time ./a.out
     0.19s real     0.17s user     0.01s system
 $ time ./a.out 
     0.18s real     0.15s user     0.03s system
 $ time ./a.out 
     0.19s real     0.16s user     0.03s system
 $ time ./a.out 
     0.19s real     0.14s user     0.05s system
 $ time ./a.out 
     0.19s real     0.15s user     0.04s system
 $ time ./a.out 
     0.19s real     0.14s user     0.06s system


 Onno

From: Andreas Gustafsson <gson@gson.org>
To: Joerg Sonnenberger <joerg@britannica.bec.de>
Cc: gnats-bugs@NetBSD.org
Subject: Re: lib/50791: NetBSD's malloc has performance issues
Date: Fri, 19 Feb 2016 19:02:26 +0200

 Joerg Sonnenberger wrote:
 > A proof of concept is attached. I haven't done any exhaustive
 > testing for this...
 [...]
 > +	if (a->max_frun_npages < a->max_frun_npages)

 That can't be right.  Presumably the second "a" should be "b"?

 > +	if (a->max_frun_npages > a->max_frun_npages)

 Ditto.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: lib/50791: NetBSD's malloc has performance issues
Date: Tue, 30 Apr 2019 17:02:13 +0300

 To see if the recent jemalloc update had improved performance, I reran
 my test case from this PR using versions of NetBSD-current from before
 and after the update.  The result was the opposite of what I expected:
 my test case ran more than ten times slower after the update than
 before it.  Most of the difference seems to come from running
 unexpectedly fast before the update, not running unexpectely slow
 after it.

 Here's the test case again, with an added -Wno-int-to-pointer-cast
 option which makes compilation run more than 50 times faster (an
 interesting performance anomaly in itself, but outside the scope of
 this PR).

   ftp http://www.gson.org/netbsd/bugs/malloc-perf/test.c.gz
   gunzip test.c.gz
   cc -Wno-int-to-pointer-cast test.c
   time ./a.out

 Before the jemalloc update (using 2019.01.01.10.09.26 sources):

 # time ./a.out
         0.08 real         0.06 user         0.02 sys

 After the jemalloc update (using 2019.04.29.12.53.15 sources)

 # time ./a.out
         1.33 real         0.68 user         0.65 sys

 This is NetBSD/amd64 on

   cpu0: Intel(R) Xeon(R) CPU           L5630  @ 2.13GHz, id 0x206c2

 -- 
 Andreas Gustafsson, gson@gson.org

From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: joerg@NetBSD.org, christos@NetBSD.org
Subject: Re: lib/50791: NetBSD's malloc has performance issues
Date: Thu, 2 May 2019 17:05:41 +0300

 Here's an updated test case that eliminates slow compiler warnings
 and returns a meaningful exit status (the timeout value may have to
 be adjusted depending on the speed of the machine):

   ftp http://www.gson.org/netbsd/bugs/malloc-perf/test2.c.gz && gunzip test2.c.gz && cc test2.c && time ./a.out && timeout 0.15 ./a.out

 I ran an automated bisection using this to find out when malloc
 unexpectedly got fast.  It was this commit:

   2016.04.12.18.07.08 joerg src/lib/libc/stdlib/jemalloc.c 1.40

   lib/50791: Instead of using sorting the arena chunks by address only,
   sort by size of the longest run and address as tie break. Avoids long
   linear searches for code heavy on medium sized allocations.

 which unfortunately did not get recorded in the PR at the time.

 In an email to netbsd-bugs which also did not get recorded in the PR,
 christos said:

 > This is expected because of -DJEMALLOC_DEBUG

 I will recheck the performance once that is turned off, and close the
 PR if it's back to where it was between joerg's commit and the jemalloc
 update.
 -- 
 Andreas Gustafsson, gson@gson.org

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.43 2018/01/16 07:36:43 maya Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2017 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.