NetBSD Problem Report #50791

From gson@gson.org  Fri Feb 12 10:25:08 2016
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.NetBSD.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 7C96A7A213
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 12 Feb 2016 10:25:08 +0000 (UTC)
Message-Id: <20160212102500.64CA974508D@guava.gson.org>
Date: Fri, 12 Feb 2016 12:25:00 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@gnats.NetBSD.org
Subject: NetBSD's malloc has performance issues
X-Send-Pr-Version: 3.95

>Number:         50791
>Category:       lib
>Synopsis:       NetBSD's malloc has performance issues
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    lib-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Feb 12 10:30:00 +0000 2016
>Last-Modified:  Fri Feb 19 17:05:01 +0000 2016
>Originator:     Andreas Gustafsson
>Release:        NetBSD 6.1.5 and 7.0
>Organization:

>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:

I noticed that an application I'm working on was performing
significantly worse on NetBSD than on other operating systems, and
tracked down the problem to malloc/free, which were using several
times more CPU time than on the other systems for this particular
workload.

>How-To-Repeat:

I have prepared a test program that performs a sequence of 100226
mallocs and frees, derived from a trace of the actual malloc/free
calls made by the application.  To run it:

  ftp http://www.gson.org/netbsd/bugs/malloc-perf/test.c.gz
  gunzip test.c.gz
  cc test.c
  time ./a.out

On an Intel Core 2 Duo CPU E8400 @3.00GHz running NetBSD 6.1.5, this
took 0.666 seconds.  On an identical machine running Debian 7, it
took 0.118 seconds, more than 5x faster.

For another data point, on an Intel Core i5 @3.3 GHz running NetBSD
7.0, it took 0.224 seconds.  On a Intel Core i5 @2.5 GHz running
Mac OS X 10.11.3, it took 0.081 seconds, more than 2.5x faster
despite the lower clock speed.

>Fix:

>Audit-Trail:
From: Onno van der Linden <o.vd.linden@quicknet.nl>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: lib/50791: NetBSD's malloc has performance issues
Date: Fri, 12 Feb 2016 18:17:20 +0100

 > I have prepared a test program that performs a sequence of 100226
 > mallocs and frees, derived from a trace of the actual malloc/free
 > calls made by the application.  To run it:

 On my machine:
 NetBSD sheep 7.99.25 NetBSD 7.99.25 (SHEEP) #9: Sun Jan  3 09:04:37 MET 2016  onno@sheep:/usr/src/sys/arch/i386/compile/SHEEP i386

 $ cc mtest.c
 $ time ./a.out
     1.35s real     1.28s user     0.02s system
 $ time ./a.out 
     1.30s real     1.29s user     0.01s system
 $ time ./a.out 
     1.30s real
 $ cc mtest.c -lbsdmalloc
 $ time ./a.out           
     0.09s real     0.02s user     0.06s system
 $ time ./a.out 
     0.09s real     0.03s user     0.06s system

 Then I grabbed and installed the latest jemalloc from github
 (https://github.com/jemalloc/jemalloc/archive/dev.zip)

 $ cc mtest.c -Wl,-R/usr/local/lib -L/usr/local/lib -ljemalloc
 $ time ./a.out
     0.14s real     0.08s user     0.05s system
 $ time ./a.out 
     0.14s real     0.09s user     0.06s system

 Then I grabbed and installed jemalloc 4.0.4
 $ cc mtest.c -L/usr/local/lib -Wl,-R/usr/local/lib  -ljemalloc 
 $ time ./a.out                                                 
     0.14s real     0.10s user     0.03s system
 $ time ./a.out 
     0.13s real     0.08s user     0.05s system

 Looks like NetBSD's jemalloc could use an upgrade
 to 4.0.4.

 Regards,

 Onno

From: Andreas Gustafsson <gson@gson.org>
To: Onno van der Linden <o.vd.linden@quicknet.nl>
Cc: gnats-bugs@NetBSD.org
Subject: Re: lib/50791: NetBSD's malloc has performance issues
Date: Thu, 18 Feb 2016 18:12:56 +0200

 Onno van der Linden wrote:
 >  Looks like NetBSD's jemalloc could use an upgrade
 >  to 4.0.4.

 I have now imported jemalloc 4.0.4 into pkgsrc (from wip), and I'm
 linking my application with that until the libc malloc gets fixed.
 With this work-around, my application is showing performance similar
 to that on other operating systems.

 I wonder how much faster release builds would be with a newer
 jemalloc in libc...
 -- 
 Andreas Gustafsson, gson@gson.org

From: Joerg Sonnenberger <joerg@britannica.bec.de>
To: gnats-bugs@NetBSD.org
Cc: lib-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, Andreas Gustafsson <gson@gson.org>
Subject: Re: lib/50791: NetBSD's malloc has performance issues
Date: Fri, 19 Feb 2016 04:23:59 +0100

 --vtzGhvizbBRQ85DL
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: inline

 On Thu, Feb 18, 2016 at 04:15:00PM +0000, Andreas Gustafsson wrote:
 >  I wonder how much faster release builds would be with a newer
 >  jemalloc in libc...

 I somehow doubt that it makes a big difference. The provided trace is
 somewhat atypical as roughly half of the allocations is between 64KB and
 256KB. For jemalloc, that's a large allocation, so chunks out of the
 larger 1MB blocks are allocated for it. What happens is that we are
 using a single arena in either the single threaded or always running on
 the same CPU case. The search of for the next free run of enough pages
 is done in chunk address order, so the most of the allocations
 process have to process most of the chunks. Keeping the chunk tree
 ordered by the estimated number size of the largest run avoids this
 worst case at the cost of shuffling things around a bit more often. A
 proof of concept is attached. I haven't done any exhaustive testing for
 this...

 Joerg

 --vtzGhvizbBRQ85DL
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: attachment; filename="jemalloc.c.diff"

 Index: jemalloc.c
 ===================================================================
 RCS file: /home/joerg/repo/netbsd/src/lib/libc/stdlib/jemalloc.c,v
 retrieving revision 1.39
 diff -u -p -r1.39 jemalloc.c
 --- jemalloc.c	24 Jan 2016 21:56:43 -0000	1.39
 +++ jemalloc.c	19 Feb 2016 03:11:25 -0000
 @@ -1661,6 +1661,11 @@ arena_chunk_comp(arena_chunk_t *a, arena
  	assert(a != NULL);
  	assert(b != NULL);

 +	if (a->max_frun_npages < a->max_frun_npages)
 +		return -1;
 +	if (a->max_frun_npages > a->max_frun_npages)
 +		return 1;
 +
  	if ((uintptr_t)a < (uintptr_t)b)
  		return (-1);
  	else if (a == b)
 @@ -1912,9 +1917,6 @@ arena_chunk_alloc(arena_t *arena)

  		chunk->arena = arena;

 -		/* LINTED */
 -		RB_INSERT(arena_chunk_tree_s, &arena->chunks, chunk);
 -
  		/*
  		 * Claim that no pages are in use, since the header is merely
  		 * overhead.
 @@ -1925,6 +1927,10 @@ arena_chunk_alloc(arena_t *arena)
  		    arena_chunk_header_npages;
  		chunk->min_frun_ind = arena_chunk_header_npages;

 +		/* Insert now that max_frun_npages is up-to-date. */
 +		/* LINTED */
 +		RB_INSERT(arena_chunk_tree_s, &arena->chunks, chunk);
 +
  		/*
  		 * Initialize enough of the map to support one maximal free run.
  		 */
 @@ -1970,7 +1976,7 @@ arena_chunk_dealloc(arena_t *arena, aren
  static arena_run_t *
  arena_run_alloc(arena_t *arena, size_t size)
  {
 -	arena_chunk_t *chunk;
 +	arena_chunk_t *chunk, *chunk_tmp;
  	arena_run_t *run;
  	unsigned need_npages, limit_pages, compl_need_npages;

 @@ -1986,14 +1992,30 @@ arena_run_alloc(arena_t *arena, size_t s
  	limit_pages = chunk_npages - arena_chunk_header_npages;
  	compl_need_npages = limit_pages - need_npages;
  	/* LINTED */
 -	RB_FOREACH(chunk, arena_chunk_tree_s, &arena->chunks) {
 +	for (;;) {
 +		chunk_tmp = RB_ROOT(&arena->chunks);
 +		chunk = NULL;
 +		while (chunk_tmp) {
 +			if (chunk_tmp->max_frun_npages == need_npages) {
 +				chunk = chunk_tmp;
 +				break;
 +			}
 +			if (chunk_tmp->max_frun_npages < need_npages) {
 +				chunk_tmp = RB_RIGHT(chunk_tmp, link);
 +				continue;
 +			}
 +			chunk = chunk_tmp;
 +			chunk_tmp = RB_LEFT(chunk, link);
 +		}
 +		if (chunk == NULL)
 +			break;
  		/*
  		 * Avoid searching this chunk if there are not enough
  		 * contiguous free pages for there to possibly be a large
  		 * enough free run.
  		 */
 -		if (chunk->pages_used <= compl_need_npages &&
 -		    need_npages <= chunk->max_frun_npages) {
 +		assert(need_npages <= chunk->max_frun_npages);
 +		{
  			arena_chunk_map_t *mapelm;
  			unsigned i;
  			unsigned max_frun_npages = 0;
 @@ -2031,7 +2053,9 @@ arena_run_alloc(arena_t *arena, size_t s
  			 * chunk->min_frun_ind was already reset above (if
  			 * necessary).
  			 */
 +			RB_REMOVE(arena_chunk_tree_s, &arena->chunks, chunk);
  			chunk->max_frun_npages = max_frun_npages;
 +			RB_INSERT(arena_chunk_tree_s, &arena->chunks, chunk);
  		}
  	}

 @@ -2114,8 +2138,11 @@ arena_run_dalloc(arena_t *arena, arena_r
  		assert(chunk->map[run_ind + run_pages - 1].pos == POS_FREE);
  	}

 -	if (chunk->map[run_ind].npages > chunk->max_frun_npages)
 +	if (chunk->map[run_ind].npages > chunk->max_frun_npages) {
 +		RB_REMOVE(arena_chunk_tree_s, &arena->chunks, chunk);
  		chunk->max_frun_npages = chunk->map[run_ind].npages;
 +		RB_INSERT(arena_chunk_tree_s, &arena->chunks, chunk);
 +	}
  	if (run_ind < chunk->min_frun_ind)
  		chunk->min_frun_ind = run_ind;


 --vtzGhvizbBRQ85DL--

From: Onno van der Linden <o.vd.linden@quicknet.nl>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: lib/50791: NetBSD's malloc has performance issues
Date: Fri, 19 Feb 2016 16:50:06 +0100

 >  I somehow doubt that it makes a big difference. The provided trace is
 >  somewhat atypical as roughly half of the allocations is between 64KB and
 >  256KB. For jemalloc, that's a large allocation, so chunks out of the
 >  larger 1MB blocks are allocated for it. What happens is that we are
 >  using a single arena in either the single threaded or always running on
 >  the same CPU case. The search of for the next free run of enough pages
 >  is done in chunk address order, so the most of the allocations
 >  process have to process most of the chunks. Keeping the chunk tree
 >  ordered by the estimated number size of the largest run avoids this
 >  worst case at the cost of shuffling things around a bit more often. A
 >  proof of concept is attached. I haven't done any exhaustive testing for
 >  this...

 FWIW,

 jemalloc.c with the provided patch and the original test program:

 $ cc -I/usr/src/lib/libc/include  mtest.c jemalloc.c
 $ time ./a.out
     0.19s real     0.17s user     0.01s system
 $ time ./a.out 
     0.18s real     0.15s user     0.03s system
 $ time ./a.out 
     0.19s real     0.16s user     0.03s system
 $ time ./a.out 
     0.19s real     0.14s user     0.05s system
 $ time ./a.out 
     0.19s real     0.15s user     0.04s system
 $ time ./a.out 
     0.19s real     0.14s user     0.06s system


 Onno

From: Andreas Gustafsson <gson@gson.org>
To: Joerg Sonnenberger <joerg@britannica.bec.de>
Cc: gnats-bugs@NetBSD.org
Subject: Re: lib/50791: NetBSD's malloc has performance issues
Date: Fri, 19 Feb 2016 19:02:26 +0200

 Joerg Sonnenberger wrote:
 > A proof of concept is attached. I haven't done any exhaustive
 > testing for this...
 [...]
 > +	if (a->max_frun_npages < a->max_frun_npages)

 That can't be right.  Presumably the second "a" should be "b"?

 > +	if (a->max_frun_npages > a->max_frun_npages)

 Ditto.
 -- 
 Andreas Gustafsson, gson@gson.org

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2014 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.