NetBSD Problem Report #35535
From www@NetBSD.org Thu Feb 1 11:19:49 2007
Return-Path: <www@NetBSD.org>
Received: by narn.NetBSD.org (Postfix, from userid 31301)
id 6CB0063B938; Thu, 1 Feb 2007 11:19:49 +0000 (UTC)
Message-Id: <20070201111949.6CB0063B938@narn.NetBSD.org>
Date: Thu, 1 Feb 2007 11:19:49 +0000 (UTC)
From: fuyuki@hadaly.org
Reply-To: fuyuki@hadaly.org
To: gnats-bugs@NetBSD.org
Subject: memcpy() is very slow if not aligned
X-Send-Pr-Version: www-1.0
>Number: 35535
>Category: port-amd64
>Synopsis: memcpy() is very slow if not aligned
>Confidential: no
>Severity: non-critical
>Priority: low
>Responsible: port-amd64-maintainer
>State: closed
>Class: change-request
>Submitter-Id: net
>Arrival-Date: Thu Feb 01 11:20:00 +0000 2007
>Closed-Date: Sun Nov 22 17:29:35 +0000 2009
>Last-Modified: Sun Nov 22 17:30:03 +0000 2009
>Originator: Kimura Fuyuki
>Release: 4.99.9
>Organization:
>Environment:
NetBSD lapis.hadaly.org 4.99.9 NetBSD 4.99.9 (LAPIS) #59: Thu Feb 1 16:18:21 JST 2007 fuyuki@lapis.hadaly.org:/usr/obj/sys/arch/amd64/compile/LAPIS amd64
>Description:
On NetBSD/amd64 (perhaps on i386 too) memcpy() can be very slow because no effort is made to align the pointers.
It is sometimes difficult or impossible for applications to align the destination address themselves, so at least a minimal alignment effort should be made in the library code.
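The alignment effort being asked for can be sketched in C roughly as follows. This is an illustrative sketch only, not the actual fix (which is in the bcopy.S assembly linked under >Fix:); the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: copy byte-by-byte until the destination is 8-byte
 * aligned, then copy full words. Misaligned stores are the
 * expensive case on the CPUs discussed in this PR. */
static void *aligned_memcpy(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Align the destination first. */
    while (len > 0 && ((uintptr_t)d & 7) != 0) {
        *d++ = *s++;
        len--;
    }
    /* Bulk copy in 8-byte words (memcpy stands in here for the
     * word-copy loop the assembly would use). */
    while (len >= 8) {
        memcpy(d, s, 8);
        d += 8;
        s += 8;
        len -= 8;
    }
    /* Trailing bytes. */
    while (len-- > 0)
        *d++ = *s++;
    return dst;
}
```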
>How-To-Repeat:
Use the following benchmark program to see that unaligned memcpy() is slow.
http://www.hadaly.org/fuyuki/memcpy_bench.c
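The benchmark program itself is only available at the URL above. A hypothetical reconstruction of its inner loop, inferred from the command lines and output shown below, might look like this; the argument order (len, iterations, dst offset, src offset) and the function name are assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Allocate two buffers, offset them by dst_off/src_off to control
 * alignment, and time `iters` calls to memcpy(). Returns elapsed
 * CPU seconds, or -1.0 on allocation failure. */
static double bench_memcpy(size_t len, unsigned long iters,
                           size_t dst_off, size_t src_off)
{
    unsigned char *dbuf = malloc(len + dst_off);
    unsigned char *sbuf = malloc(len + src_off);
    double elapsed = -1.0;

    if (dbuf != NULL && sbuf != NULL) {
        unsigned char *dst = dbuf + dst_off;   /* misalign on purpose */
        unsigned char *src = sbuf + src_off;
        memset(sbuf, 0xa5, len + src_off);

        clock_t t0 = clock();
        for (unsigned long i = 0; i < iters; i++)
            memcpy(dst, src, len);
        elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;
    }
    free(dbuf);
    free(sbuf);
    return elapsed;
}
```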
<plain libc>
$ time ./memcpy_bench 65536 1000000 0 0
dst:0x503000 src:0x513000 len:65536
./memcpy_bench 65536 1000000 0 0 7.30s user 0.00s system 99% cpu 7.349 total
$ time ./memcpy_bench 65536 1000000 1 1
dst:0x503001 src:0x514001 len:65536
./memcpy_bench 65536 1000000 1 1 48.46s user 0.00s system 99% cpu 48.713 total
<patch (below) applied>
$ time ./memcpy_bench 65536 1000000 0 0
dst:0x503000 src:0x513000 len:65536
./memcpy_bench 65536 1000000 0 0 7.36s user 0.00s system 99% cpu 7.406 total
$ time ./memcpy_bench 65536 1000000 1 1
dst:0x503001 src:0x514001 len:65536
./memcpy_bench 65536 1000000 1 1 11.40s user 0.00s system 99% cpu 11.468 total
>Fix:
The following patch reduces amarok's CPU usage to under 1% on my Prescott Celeron.
http://www.hadaly.org/fuyuki/bcopy.S.patch
See also (especially 8.2):
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: lib-bug-people->port-amd64-maintainer
Responsible-Changed-By: tron@netbsd.org
Responsible-Changed-When: Thu, 01 Feb 2007 12:44:02 +0000
Responsible-Changed-Why:
This is a NetBSD-amd64 specific problem.
From: Kimura Fuyuki <fuyuki@hadaly.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: lib/35535: memcpy() is very slow if not aligned
Date: Sat, 3 Feb 2007 19:29:17 +0900
Sorry for misfiling...
If you want further improvement, here's also an SSE version, which scales to
multi-megabyte copies and preserves cached data.
http://www.hadaly.org/fuyuki/bcopy-sse.patch
plain:
$ time ./memcpy_bench 1000000 10000 0 0
dst:0x503000 src:0x5f8000 len:1000000
./memcpy_bench 1000000 10000 0 0 9.50s user 0.00s system 99% cpu 9.566 total
patched:
$ time ./memcpy_bench 1000000 10000 0 0
dst:0x503000 src:0x5f8000 len:1000000
./memcpy_bench 1000000 10000 0 0 6.38s user 0.00s system 99% cpu 6.425 total
xine's score:
Benchmarking memcpy methods (smaller is better):
libc memcpy() : 218605133
linux kernel memcpy() : 308372235
MMX optimized memcpy() : 312928386
MMXEXT optimized memcpy() : 216660195
SSE optimized memcpy() : 214245218
From: David Laight <david@l8s.co.uk>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: lib/35535: memcpy() is very slow if not aligned
Date: Sat, 3 Feb 2007 11:15:56 +0000
On Sat, Feb 03, 2007 at 10:30:02AM +0000, Kimura Fuyuki wrote:
> The following reply was made to PR port-amd64/35535; it has been noted by GNATS.
>
> From: Kimura Fuyuki <fuyuki@hadaly.org>
> To: gnats-bugs@netbsd.org
> Cc:
> Subject: Re: lib/35535: memcpy() is very slow if not aligned
> Date: Sat, 3 Feb 2007 19:29:17 +0900
>
> Sorry for misfiling...
>
> If you want further improvement, here's also an SSE version, which scales to
> multi-megabyte copies and preserves cached data.
>
> http://www.hadaly.org/fuyuki/bcopy-sse.patch
1) I'm not sure that optimisations for > 128k copies are necessarily
worthwhile. Code ought to be passing such data by reference!
In the kernel, the only common large copy is (ought to be) the
copy-on-write of shared pages.
2) You want to look at the costs for short copies. They are much more
common than you think.
I've not done any timings for 'rep movsx', but I did do some for
'rep stosx' a couple of years ago. The instruction setup costs on
modern cpus are significant, so they shouldn't be used for small loops.
A common non-optimisation is the use of a 'rep movsb' instruction to
move the remaining bytes - which is likely to be zero [1].
One option is to copy the last 4/8 bytes first!
I also discovered that the pentium IV needs the target address to be
8 byte aligned!
3) (2) may well apply to the use of movsb to align copies.
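The "copy the last 4/8 bytes first" idea in (2) can be sketched in C as follows. This is illustrative only (the function name is hypothetical, and the buffers must not overlap): store the final, possibly misaligned, 8-byte word first, then copy whole words from the front. The word loop may rewrite part of that tail with identical data, which is harmless, and no byte-wise "rep movsb" tail loop is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `len` bytes from src to dst, handling the tail with one
 * overlapping 8-byte store instead of a byte loop. */
static void copy_tail_first(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (len >= 8) {
        memcpy(d + len - 8, s + len - 8, 8);  /* tail word first */
        for (size_t i = 0; i + 8 <= len; i += 8)
            memcpy(d + i, s + i, 8);          /* whole words */
    } else {
        while (len-- > 0)                     /* short copies byte-wise */
            *d++ = *s++;
    }
}
```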
David
[1] Certain compilers convert:
while (a < b)
*a++ = ' ';
into the inlined version of memset(), including 2 'expensive to setup'
'rep stosx' instructions, when I explicitly wrote the loop because the
loop count is short....
--
David Laight: david@l8s.co.uk
From: Kimura Fuyuki <fuyuki@hadaly.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: lib/35535: memcpy() is very slow if not aligned
Date: Sat, 3 Feb 2007 23:24:24 +0900
On Saturday 03 February 2007, David Laight wrote:
>
> 1) I'm not sure that optimisations for > 128k copies are necessarily
> worthwhile. Code ought to be passing such data by reference!
> In the kernel, the only common large copy is (ought to be) the
> copy-on-write of shared pages.
For kernel use, it's true that the >128k code is not so useful. I included it
only because these libraries are shared between the kernel and userland. If you
think the optimization for larger buffers is not a good idea, it could be
removed or #ifdef'd out for the kernel.
> 2) You want to look at the costs for short copies. They are much more
> common than you think.
> I've not done any timings for 'rep movsx', but I did do some for
> 'rep stosx' a couple of years ago. The instruction setup costs on
> modern cpus is significant, so they shouldn't be used for small loops.
> A common non-optimisation is the use of a 'rep movsb' instruction to
> move the remaining bytes - which is likely to be zero [1].
> One option is to copy the last 4/8 bytes first!
> I also discovered that the pentium IV needs the target address to be
> 8 byte aligned!
Fact 1: I misunderstood gcc's optimization policy a bit; I thought memcpy()s
were more aggressively inlined or unrolled into mov ops. So yes, short copies
are important. But they *are* properly inlined in many cases.
from gcc(1):
-mmemcpy
-mno-memcpy
Force (do not force) the use of "memcpy()" for non-trivial block
moves. The default is -mno-memcpy, which allows GCC to inline most
constant-sized copies.
Fact 2: I think just one extra branch is not a burden for modern cpus. Real
numbers follow. (Well, it could be a small burden..)
plain:
$ time ./memcpy_bench 64 100000000 0 0
dst:0x502080 src:0x5020c0 len:64
./memcpy_bench 64 100000000 0 0 3.36s user 0.00s system 99% cpu 3.390 total
patched:
$ time ./memcpy_bench 64 100000000 0 0
dst:0x502080 src:0x5020c0 len:64
./memcpy_bench 64 100000000 0 0 3.49s user 0.00s system 99% cpu 3.517 total
Fact 3: I didn't touch the rep part of the code; I kept the patch as small as
I could. I agree that the rep prefix should be used carefully.
> 3) (2) may well apply to the use to movsb to align copies.
Actually, I tried three versions of the alignment code, including a movsb-less
one, and took the simpler and faster one. Anyway, there's no big difference
between the three. Note also that memcpy's destination address is very likely
to be aligned already.
The real (what's real?) latency for rep instructions can be seen here (8.3):
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF
Thanks for your comment.
> [1] Certain compilers convert:
> while (a < b)
> *a++ = ' ';
> into the inlined version of memset(), including 2 'expensive to setup'
> 'rep stosx' instructions, when I explictily wrote the loop because the
> loop count is short....
gcc 4 is a little bit smarter than that, I think. :)
From: David Laight <david@l8s.co.uk>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: lib/35535: memcpy() is very slow if not aligned
Date: Sat, 3 Feb 2007 21:23:31 +0000
On Sat, Feb 03, 2007 at 02:25:02PM +0000, Kimura Fuyuki wrote:
>
> The real (what's real?) latency for rep instructions can be seen here (8.3):
> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF
Hmmm... I'm not entirely certain some of the suggestions in that document are correct!
Some of the C code certainly isn't!
Page 17 suggests the use of:
#define FLOAT2INTCAST(f) (*((int *)(&f)))
for speeding up float comparisons against constants.
Someone hasn't read up on the C aliasing rules.
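A well-defined alternative to the macro quoted above is to copy the float's representation with memcpy(), which compilers optimize down to a plain register move. A minimal sketch (the helper name is not from the document):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the bit pattern of an IEEE-754 single-precision float.
 * memcpy has defined behavior here, unlike *(int *)&f, which
 * violates the aliasing rules. */
static int32_t float_bits(float f)
{
    int32_t i;
    memcpy(&i, &f, sizeof i);
    return i;
}
```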
Page 106 also suggests you need to be a lot more careful with your write-combining
code. Thinking about it further, it probably can't be used without disabling interrupts (or
maybe making the write to each cache line a RAS sequence).
(But maybe I'm misunderstanding exactly what happens to the partially written line.)
eg stuff in appendix B :-)
Page 167 suggests never (ok hardly ever) using the rep string opcodes.
The algorithm on pages 181+ looks like a good way to kill the I-cache.
Oh, and for good measure, the code has to run on Intel cpus as well.
David
--
David Laight: david@l8s.co.uk
From: murray@river-styx.org
To: gnats-bugs@NetBSD.org
Cc: david@l8s.co.uk,
david@l8s.co.uk
Subject: Re: lib/35535: memcpy() is very slow if not aligned
Date: Mon, 30 Mar 2009 23:25:12 +1100 (EST)
Hi Folks,
Some time ago (2+ years?) I did a review of all the low-level
implementations of memcpy and friends on the amd64 platform, across
all the operating systems and gcc versions I could find. The best
implementation I found was in OpenSolaris, which used multiple
optimisations depending on the size of the copy. I ported all their
code to NetBSD/amd64 at the time and mentioned it to Andrew Doran and
Christos IIRC, but there were concerns about the OpenSolaris licensing.
Worth having another look?
Take care,
Murray Armfield
From: christos@zoulas.com (Christos Zoulas)
To: gnats-bugs@NetBSD.org, port-amd64-maintainer@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, fuyuki@hadaly.org
Cc:
Subject: Re: lib/35535: memcpy() is very slow if not aligned
Date: Mon, 30 Mar 2009 08:44:56 -0400
On Mar 30, 12:30pm, murray@river-styx.org (murray@river-styx.org) wrote:
-- Subject: Re: lib/35535: memcpy() is very slow if not aligned
| The following reply was made to PR port-amd64/35535; it has been noted by GNATS.
|
| From: murray@river-styx.org
| To: gnats-bugs@NetBSD.org
| Cc: david@l8s.co.uk,
| david@l8s.co.uk
| Subject: Re: lib/35535: memcpy() is very slow if not aligned
| Date: Mon, 30 Mar 2009 23:25:12 +1100 (EST)
|
| Hi Folks,
| Some time ago (2+ years?) I did a review of all the low-level
| implementations of memcpy and friends on the amd64 platform, across
| all the operating systems and gcc versions I could find. The best
| implementation I found was in OpenSolaris, which used multiple
| optimisations depending on the size of the copy. I ported all their
| code to NetBSD/amd64 at the time and mentioned it to Andrew Doran and
| Christos IIRC, but there were concerns about the OpenSolaris licensing.
| Worth having another look?
|
| Take care,
| Murray Armfield
Yes, definitely.
christos
State-Changed-From-To: open->closed
State-Changed-By: dsl@NetBSD.org
State-Changed-When: Sun, 22 Nov 2009 17:29:35 +0000
State-Changed-Why:
Code added to align destination.
From: David Laight <dsl@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/35535 CVS commit: src/common/lib/libc/arch/x86_64/string
Date: Sun, 22 Nov 2009 17:25:48 +0000
Module Name: src
Committed By: dsl
Date: Sun Nov 22 17:25:47 UTC 2009
Modified Files:
src/common/lib/libc/arch/x86_64/string: bcopy.S
Log Message:
Align to the destination buffer.
This probably costs 1 clock (on modern cpus) in the normal case.
But gives a big benefit when the destination is misaligned.
In particular when the source has the same misalignment - although
that may not be a gain on Nehalem!
Fixes PR/35535
To generate a diff of this commit:
cvs rdiff -u -r1.3 -r1.4 src/common/lib/libc/arch/x86_64/string/bcopy.S
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
>Unformatted: