NetBSD Problem Report #49275

From www@NetBSD.org  Sat Oct 11 23:55:45 2014
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 41222A65D1
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 11 Oct 2014 23:55:45 +0000 (UTC)
Message-Id: <20141011235543.B7DD3A6665@mollari.NetBSD.org>
Date: Sat, 11 Oct 2014 23:55:43 +0000 (UTC)
From: dennis.c.ferguson@gmail.com
Reply-To: dennis.c.ferguson@gmail.com
To: gnats-bugs@NetBSD.org
Subject: compiler_rt version of __clzsi2() seems to result in universally inferior code
X-Send-Pr-Version: www-1.0

>Number:         49275
>Category:       toolchain
>Synopsis:       compiler_rt version of __clzsi2() seems to result in universally inferior code
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    toolchain-manager
>State:          closed
>Class:          change-request
>Submitter-Id:   net
>Arrival-Date:   Sun Oct 12 00:00:00 +0000 2014
>Closed-Date:    Mon Oct 31 03:35:43 +0000 2016
>Last-Modified:  Mon Oct 31 03:35:43 +0000 2016
>Originator:     Dennis Ferguson
>Release:        NetBSD 7.99.1
>Organization:
none at all
>Environment:
>Description:
I don't think the compiler_rt version of __clzsi2() in

   sys/external/bsd/compiler_rt/dist/lib/builtins/clzsi2.c

is very useful.  Compiling it results in a longer code sequence
(often significantly longer) than a more straight forward (in my view)
C implementation of the same function with all the machine/compiler
combinations available to me.

Three implementations of __clzsi2() are provided below, all computing
the same result for the same input.  The first, _crt, version is
copied from the compiler_rt library, the _opa version keeps the
use-results-of-comparisons-as-values style of the library but
reorganizes it to eliminate some arithmetic in the C, while the
_opb version recodes _opa into a more conventional (in my opinion)
implementation with if() statements.  These were compiled with
the compilers I had (-O2) and the generated instructions were
counted.  When branches were present in the generated code I counted
the instructions in the longest path through the code and appended
a `-' to the number to indicate there are paths with fewer (but
maybe not faster) instructions.  Here are the results:

                crt     opa     opb
    i386        60      41      26-
    amd64       53      37      25-
    arm         38      25      19-
    ppc32       35-     29      23-
    ppc64       37-     35      27-
    coldfire    44-     42-     37-
    riscv32     39-     25      21-
    riscv64     37-     25      21-
    amd64cl     43      34      27-
    i386cl      46      37      28-

The compilers are all gcc except for the last two.  In a perfect
world the results in a row would all be the same (minimal) value
since the C implementations all compute the same thing with very
similar algorithms, but it is clear the compiler_rt version does
not agree well with the world as it is.

Of these I think this is of practical consequence only for RISC-V,
which has no clz instruction.  The 21 instructions generated from
_opb appears to me to be the minimal sequence for this instruction
set, making hand-crafted assembly for top speed unnecessary.  A
reason to be interested in making clz fast is that a clz operation
can be a significant fraction of the cost of doing a (well-cached)
IP route lookup on a machine with no instruction.
>How-To-Repeat:
Compile this and look at the generated assembly:

typedef int si_int;
typedef unsigned int su_int;

si_int
__clzsi2_crt(si_int a)
{
        su_int x = (su_int) a;
        si_int t = ((x & 0xffff0000) == 0) << 4;
        si_int r = t;

        x >>= 16 - t;
        t = ((x & 0xff00) == 0) << 3;
        x >>= 8 - t;
        r += t;
        t = ((x & 0xf0) == 0) << 2;
        x >>= 4 - t;
        r += t;
        t = ((x & 0xc) == 0) << 1;
        x >>= 2 - t;
        r += t;
        return (r + ((2 - x) & -((x & 2) == 0)));
}

si_int
__clzsi2_opa(si_int a)
{
        su_int x = (su_int) a;
        si_int r = (((x >> 16) & 0xffff) == 0) << 4;
        si_int t = r;

        x <<= t;
        t = (((x >> 24) & 0xff) == 0) << 3;
        x <<= t;
        r += t;
        t = (((x >> 28) & 0xf) == 0) << 2;
        x <<= t;
        r += t;
        t = (((x >> 30) & 0x3) == 0) << 1;
        x <<= t;
        r += t;
        r += ((x & (0x1u << 31)) == 0) + (x == 0);
        return (r);
}

si_int
__clzsi2_opb(si_int a)
{
        su_int x = (su_int) a;
        si_int r = 0;

        if (((x >> 16) & 0xffff) == 0) {
                r = 16;
                x <<= 16;
        }
        if (((x >> 24) & 0xff) == 0) {
                r += 8;
                x <<= 8;
        }
        if (((x >> 28) & 0xf) == 0) {
                r += 4;
                x <<= 4;
        }
        if (((x >> 30) & 0x3) == 0) {
                r += 2;
                x <<= 2;
        }
        if ((x & (0x1 << 31)) == 0) {
                r += 1;
                if (x == 0) {
                        r += 1;
                }
        }
        return (r);
}

>Fix:

>Release-Note:

>Audit-Trail:
From: David Laight <david@l8s.co.uk>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: toolchain/49275: compiler_rt version of __clzsi2() seems to result in universally inferior code
Date: Sun, 7 Dec 2014 22:44:52 +0000

 On Sun, Oct 12, 2014 at 12:00:01AM +0000, dennis.c.ferguson@gmail.com wrote:
 > >Number:         49275
 > >Synopsis:       compiler_rt version of __clzsi2() seems to result in universally inferior code
 > I don't think the compiler_rt version of __clzsi2() in
 > 
 >    sys/external/bsd/compiler_rt/dist/lib/builtins/clzsi2.c
 > 
 > is very useful.  Compiling it results in a longer code sequence
 > (often significantly longer) than a more straight forward (in my view)
 > C implementation of the same function with all the machine/compiler
 > combinations available to me.
 > 
 > Three implementations of __clzsi2() are provided below, all computing
 > the same result for the same input.  The first, _crt, version is
 > copied from the compiler_rt library, the _opa version keeps the
 > use-results-of-comparisons-as-values style of the library but
 > reorganizes it to eliminate some arithmetic in the C, while the
 > _opb version recodes _opa into a more conventional (in my opinion)
 > implementation with if() statements.  These were compiled with
 > the compilers I had (-O2) and the generated instructions were
 > counted.  When branches were present in the generated code I counted
 > the instructions in the longest path through the code and appended
 > a `-' to the number to indicate there are paths with fewer (but
 > maybe not faster) instructions.  Here are the results:
 > 
 >                 crt     opa     opb
 >     i386        60      41      26-
 >     amd64       53      37      25-
 >     arm         38      25      19-
 >     ppc32       35-     29      23-
 >     ppc64       37-     35      27-
 >     coldfire    44-     42-     37-
 >     riscv32     39-     25      21-
 >     riscv64     37-     25      21-
 >     amd64cl     43      34      27-
 >     i386cl      46      37      28-

 Instruction count isn't a very accurate way of determining the
 execution time on modern cpus.
 Mispredicted branches can get very expensive, and multiple execution
 units mean that long dependency chains matter.

 This probably means that 'opb' is slower than you expect.
 Don't benchmark by running with a fixed value in a loop.

 With the very old compiler I have to hand changing 'opa' to
 test '((x & 0xff000000) == 0)' etc (ie shift the constant, not x)
 generates better code.

 	David

 -- 
 David Laight: david@l8s.co.uk

From: Dennis Ferguson <dennis.c.ferguson@gmail.com>
To: gnats-bugs@NetBSD.org
Cc: toolchain-manager@netbsd.org,
 gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org
Subject: Re: toolchain/49275: compiler_rt version of __clzsi2() seems to result in universally inferior code
Date: Tue, 9 Dec 2014 11:33:13 -0800

 On 7 Dec, 2014, at 14:50 , David Laight <david@l8s.co.uk> wrote:
 >>                crt     opa     opb
 >>    i386        60      41      26-
 >>    amd64       53      37      25-
 >>    arm         38      25      19-
 >>    ppc32       35-     29      23-
 >>    ppc64       37-     35      27-
 >>    coldfire    44-     42-     37-
 >>    riscv32     39-     25      21-
 >>    riscv64     37-     25      21-
 >>    amd64cl     43      34      27-
 >>    i386cl      46      37      28-
 > 
 > Instruction count isn't a very accurate way of determining the
 > execution time on modern cpus.
 > Mispredicted branches can get very expensive, and multiple execution
 > units mean that long dependency chains matter.
 > 
 > This probably means that 'opb' is slower than you expect.
 > Don't benchmark by running with a fixed value in a loop.
 > 
 > With the very old compiler I have to hand changing 'opa' to
 > test '((x & 0xff000000) == 0)' etc (ie shift the constant, not x)
 > generates better code.

 You can shift the variable or shift the mask and get the same
 boolean result, but I think the better of these depends strongly
 on the CPU and its instruction set.  I just pick one way for the C
 code and assume the compiler will be smart enough to do it the
 other way if that is cheaper for the target.  If the compiler
 isn't that smart now I"m willing to wait for a version that is.

 Note, though, that while everything you say is true in general
 we're not really discussing the general case but rather 3 alternative
 implementations of one particular function.  All of these are one
 long dependency chain (the computation of `r' is a sideline, but
 `x' and `t' computations are pretty much constrained to be
 sequential), so multiple issue CPUs have less advantage here
 than in more "normal" code.  In cases like this the relative
 instruction count may be a bit more accurate predictor of
 execution cost (though maybe still not on an x86, which doesn't
 execute its own instruction set), but that's not exactly what I
 expected the table to convey.  Additionally, adds of and shifts
 by constants (a la _opb) can be speculatively executed in parallel
 with the if() computation and tossed if they aren't needed,
 but adds of and shifts by `t' are generally going to have to wait
 until it knows `t', so that plus limited parallelism elsewhere may
 mean that 'opb' is in fact often faster than you seem to expect.

 But even that is kind of beside the point.  The issue isn't something
 subtle about code generation or machine execution, it is that the
 the compiler_rt __clzsi2() implementation is crappy.  It does
 too much arithmetic to compute its result, it is too complex.  This
 is true on every machine. The reason that 'opa' and 'opb' need fewer
 instructions than the current 'crt' algorithm on all the machines we
 do know about is that 'opa' and 'opb' compute fewer things in the
 C code as written to find the same result, and doing less work requires
 fewer instructions.  That (plus the inability of the compiler to
 recognize the extra work done in 'crt' is useless) is what I expected
 the instruction count to convey.  Between "opa" and "opb" there is
 less to choose since the basic algorithm is the same, and I'd be happy
 with either, but "opb" is just simpler to my eye and that is usually
 better.  This is C code after all; the goal isn't to write code which
 is optimal for any machine, it is to write code which is clear enough
 about what is being computed that the compiler (or some future compiler)
 can understand the intent well enough to produce code which is optimal
 for a particular target.

 Note that execution times on particular machines are consistent with
 the conclusion that the compiler_rt code is crap.  Averaged over all
 values of the 32 bit argument, for RISC-V, which is particularly
 relevant since it has no clz instruction, the simulator says that 'opb'
 is not quite twice the speed of 'crt'.  On my arm machine 'opb' is
 twice the speed of 'crt', while the hand-written assembler __clzsi2()
 it uses now is 2.5 times faster than 'crt' (i.e. 25% faster than 'opb')
 while producing slightly different results.  For amd64 and i386 'opb'
 is 55% faster than 'crt' on my fancy Core i7, while on my crappy Intel
 Atom machine the same i386 'opb' binary runs 3 times faster than 'crt'
 (and more than twice as fast as 'opa') for reasons I can guess at.
 This is a fairly trivial function, the magnitude of those differences
 is indicative of just how poor the compiler_rt implementation is.  I'm
 pretty confident you won't find any machine where 'crt' is faster; it
 would take a miracle for a compiler to understand what 'crt' is doing
 well enough to eliminate the junk and produce the same machine code as
 'opb' (which is what would happen in a perfect world since they produce
 the same results for the same input).

 I have a library which, among other things, provides a 32 bit clz
 operation that can be made optimal for any machine which someone cares
 about enough to write the specific code for.  For most machines the
 optimal implementation is __builtin_clz() because most machines have
 an instruction to compute this and __builtin_clz() generates it.  I
 would like __builtin_clz() to become more-or-less optimal for all
 machines (rather than just most) so that this part of my own library
 could go away.  That means dealing with machines which don't have an
 instruction and instead call __clzsi2() in the runtime library.

 For RISC-V, which has no instruction and calls __clzsi2(), 'opb'
 produces code which is as good as I could write in assembly and
 performs significantly better than the compiler_rt implementation.
 I could add this as a RISC-V special, the way the arm assembly
 implementation is handled, but doing it this way would imply the
 compiler_rt implementation might have some value on some other
 machine and I don't believe that is true.  The compiler_rt code
 is crappy and replacing it would do everyone a favor.

 You seem reluctant to replace it.  If you don't like the instruction
 count metric (and I could add more machines to that table; 'crt'
 always loses) is there some other measureable that would more reliably
 indicate to you that the compiler_rt code is indeed just universally
 crappy?  I haven't found one that even hints that it isn't.

 Dennis Ferguson

From: Joerg Sonnenberger <joerg@britannica.bec.de>
To: gnats-bugs@NetBSD.org
Cc: toolchain-manager@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, dennis.c.ferguson@gmail.com
Subject: Re: toolchain/49275: compiler_rt version of __clzsi2() seems to
 result in universally inferior code
Date: Tue, 9 Dec 2014 22:37:11 +0100

 On Tue, Dec 09, 2014 at 07:35:00PM +0000, Dennis Ferguson wrote:
 >  > Instruction count isn't a very accurate way of determining the
 >  > execution time on modern cpus.
 >  > Mispredicted branches can get very expensive, and multiple execution
 >  > units mean that long dependency chains matter.

 David is fully correct here.

 >  subtle about code generation or machine execution, it is that the
 >  the compiler_rt __clzsi2() implementation is crappy.

 It is written on the idea that (a) all sane platforms have assembler
 instructions or (b) have fine tuned assembler version or (c) have some
 decent form of branch-free test-and-set.

 Feel free to submit your implementation upstream, but personally I don't
 see a strong enough reason to change anything here. Just way too much
 text.

 Joerg

State-Changed-From-To: open->closed
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Mon, 31 Oct 2016 03:35:43 +0000
State-Changed-Why:
This is an issue for upstream - please take it up with them if you haven't
already.


>Unformatted:
Home
PR Database Search
(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2014 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.