NetBSD Problem Report #35535

From www@NetBSD.org  Thu Feb  1 11:19:49 2007
Return-Path: <www@NetBSD.org>
Received: by narn.NetBSD.org (Postfix, from userid 31301)
	id 6CB0063B938; Thu,  1 Feb 2007 11:19:49 +0000 (UTC)
Message-Id: <20070201111949.6CB0063B938@narn.NetBSD.org>
Date: Thu,  1 Feb 2007 11:19:49 +0000 (UTC)
From: fuyuki@hadaly.org
Reply-To: fuyuki@hadaly.org
To: gnats-bugs@NetBSD.org
Subject: memcpy() is very slow if not aligned
X-Send-Pr-Version: www-1.0

>Number:         35535
>Category:       port-amd64
>Synopsis:       memcpy() is very slow if not aligned
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    port-amd64-maintainer
>State:          closed
>Class:          change-request
>Submitter-Id:   net
>Arrival-Date:   Thu Feb 01 11:20:00 +0000 2007
>Closed-Date:    Sun Nov 22 17:29:35 +0000 2009
>Last-Modified:  Sun Nov 22 17:30:03 +0000 2009
>Originator:     Kimura Fuyuki
>Release:        4.99.9
>Organization:
>Environment:
NetBSD lapis.hadaly.org 4.99.9 NetBSD 4.99.9 (LAPIS) #59: Thu Feb  1 16:18:21 JST 2007  fuyuki@lapis.hadaly.org:/usr/obj/sys/arch/amd64/compile/LAPIS amd64
>Description:
On NetBSD/amd64 (perhaps on i386 too) memcpy() can be very slow because no
effort is made to align the copy. It is sometimes difficult or impossible
for applications to align the destination address themselves, so at least a
minimal alignment effort should be made in the library code.
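
The idea, as a rough C sketch (illustration only; the actual patch below
changes the x86_64 assembler in bcopy.S):

	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>

	void *
	aligned_memcpy(void *restrict dst, const void *restrict src, size_t n)
	{
		unsigned char *d = dst;
		const unsigned char *s = src;

		/* Byte copies until the destination is 8-byte aligned. */
		while (n > 0 && ((uintptr_t)d & 7) != 0) {
			*d++ = *s++;
			n--;
		}
		/* Bulk copy: whole 64-bit words, aligned stores. */
		for (; n >= 8; n -= 8, d += 8, s += 8)
			memcpy(d, s, 8);	/* one 8-byte load/store */
		/* Tail bytes. */
		while (n-- > 0)
			*d++ = *s++;
		return dst;
	}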

>How-To-Repeat:
Use the following benchmark program to see how slow the unaligned memcpy() is.

http://www.hadaly.org/fuyuki/memcpy_bench.c
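
A minimal sketch of what such a program might look like (the actual source
is at the URL above; the argument order is inferred from the runs below):

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	/* usage: memcpy_bench len iterations dst_offset src_offset
	 * Compile with -fno-builtin so the libc memcpy() really gets called. */
	int
	main(int argc, char **argv)
	{
		size_t len, doff, soff;
		long iters;
		char *dst, *src;

		if (argc != 5) {
			fprintf(stderr,
			    "usage: %s len iters dst_off src_off\n", argv[0]);
			return 1;
		}
		len   = strtoul(argv[1], NULL, 0);
		iters = strtol(argv[2], NULL, 0);
		doff  = strtoul(argv[3], NULL, 0);
		soff  = strtoul(argv[4], NULL, 0);
		dst = malloc(len + doff);
		src = calloc(1, len + soff);
		if (dst == NULL || src == NULL)
			return 1;
		printf("dst:%p src:%p len:%zu\n",
		    (void *)(dst + doff), (void *)(src + soff), len);
		while (iters-- > 0)
			memcpy(dst + doff, src + soff, len);
		return 0;
	}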

<plain libc>

$ time ./memcpy_bench 65536 1000000 0 0
dst:0x503000 src:0x513000 len:65536
./memcpy_bench 65536 1000000 0 0  7.30s user 0.00s system 99% cpu 7.349 total

$ time ./memcpy_bench 65536 1000000 1 1
dst:0x503001 src:0x514001 len:65536
./memcpy_bench 65536 1000000 1 1  48.46s user 0.00s system 99% cpu 48.713 total

<patch (below) applied>

$ time ./memcpy_bench 65536 1000000 0 0
dst:0x503000 src:0x513000 len:65536
./memcpy_bench 65536 1000000 0 0  7.36s user 0.00s system 99% cpu 7.406 total

$ time ./memcpy_bench 65536 1000000 1 1
dst:0x503001 src:0x514001 len:65536
./memcpy_bench 65536 1000000 1 1  11.40s user 0.00s system 99% cpu 11.468 total

>Fix:
The following patch decreases amarok's CPU usage to under 1% on my Prescott Celeron.

http://www.hadaly.org/fuyuki/bcopy.S.patch

See also (especially 8.2):
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: lib-bug-people->port-amd64-maintainer
Responsible-Changed-By: tron@netbsd.org
Responsible-Changed-When: Thu, 01 Feb 2007 12:44:02 +0000
Responsible-Changed-Why:
This is a NetBSD-amd64 specific problem.


From: Kimura Fuyuki <fuyuki@hadaly.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: lib/35535: memcpy() is very slow if not aligned
Date: Sat, 3 Feb 2007 19:29:17 +0900

 Sorry for misfiling...

 If you want further improvement, here's also an SSE version which scales to
 megabyte-sized copies and preserves cached data.

 http://www.hadaly.org/fuyuki/bcopy-sse.patch
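
 [Sketch: "preserves cached data" presumably means non-temporal stores.
 The patch itself is assembler; a hypothetical C equivalent of the bulk
 loop, using SSE2 intrinsics:]

 	#include <stddef.h>
 	#include <emmintrin.h>

 	/* Cache-preserving bulk copy; assumes dst is 16-byte aligned and
 	 * n is a multiple of 16. */
 	static void
 	copy_nontemporal(void *dst, const void *src, size_t n)
 	{
 		__m128i *d = dst;
 		const __m128i *s = src;

 		for (; n >= 16; n -= 16)
 			/* movntdq: store bypassing the caches */
 			_mm_stream_si128(d++, _mm_loadu_si128(s++));
 		_mm_sfence();	/* make the non-temporal stores globally visible */
 	}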

 plain:

 $ time ./memcpy_bench 1000000 10000 0 0
 dst:0x503000 src:0x5f8000 len:1000000
 ./memcpy_bench 1000000 10000 0 0  9.50s user 0.00s system 99% cpu 9.566 total

 patched:

 $ time ./memcpy_bench 1000000 10000 0 0
 dst:0x503000 src:0x5f8000 len:1000000
 ./memcpy_bench 1000000 10000 0 0  6.38s user 0.00s system 99% cpu 6.425 total

 xine's score:

 Benchmarking memcpy methods (smaller is better):
         libc memcpy() : 218605133
         linux kernel memcpy() : 308372235
         MMX optimized memcpy() : 312928386
         MMXEXT optimized memcpy() : 216660195
         SSE optimized memcpy() : 214245218

From: David Laight <david@l8s.co.uk>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: lib/35535: memcpy() is very slow if not aligned
Date: Sat, 3 Feb 2007 11:15:56 +0000

 On Sat, Feb 03, 2007 at 10:30:02AM +0000, Kimura Fuyuki wrote:
 >  If you want further improvement, here's also an SSE version which scales to
 >  megabyte-sized copies and preserves cached data.
 >
 >  http://www.hadaly.org/fuyuki/bcopy-sse.patch

 1) I'm not sure that optimisations for > 128k copies are necessarily
    worthwhile.  Code ought to be passing such data by reference!
    In the kernel, the only common large copy is (ought to be) the
    copy-on-write of shared pages.

 2) You want to look at the costs for short copies.  They are much more
    common than you think.
    I've not done any timings for 'rep movsx', but I did do some for
    'rep stosx' a couple of years ago.  The instruction setup costs on
    modern cpus are significant, so they shouldn't be used for small loops.
    A common non-optimisation is the use of a 'rep movsb' instruction to
    move the remaining bytes - which is likely to be zero [1].
    One option is to copy the last 4/8 bytes first (see the sketch after
    these points)!
    I also discovered that the Pentium IV needs the target address to be
    8-byte aligned!

 3) (2) may well apply to the use of movsb to align copies.
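
 [A rough C sketch of the 'copy the last 4/8 bytes first' idea in (2);
 an editorial illustration, not code from any of the patches:]

 	#include <string.h>

 	/* Copy the tail word up front so that no 'rep movsb' (or byte
 	 * loop) is needed for the 1..7 remaining bytes. */
 	void
 	copy_tail_first(unsigned char *d, const unsigned char *s, size_t n)
 	{
 		size_t i;

 		if (n < 8) {			/* too short for word copies */
 			while (n-- > 0)
 				d[n] = s[n];
 			return;
 		}
 		memcpy(d + n - 8, s + n - 8, 8);	/* last 8 bytes first */
 		/* Whole words; up to 7 bytes near the end are copied twice,
 		 * which is harmless for non-overlapping buffers. */
 		for (i = 0; i + 8 <= n; i += 8)
 			memcpy(d + i, s + i, 8);
 	}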

 	David

 [1] Certain compilers convert:
 	while (a < b)
 	    *a++ = ' ';
     into the inlined version of memset(), including 2 'expensive to set up'
     'rep stosx' instructions, even though I explicitly wrote the loop out
     because the loop count is short....

 -- 
 David Laight: david@l8s.co.uk

From: Kimura Fuyuki <fuyuki@hadaly.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: lib/35535: memcpy() is very slow if not aligned
Date: Sat, 3 Feb 2007 23:24:24 +0900

 On Saturday 03 February 2007, David Laight wrote:
 >
 >  1) I'm not sure that optimisations for > 128k copies are necessarily
 >     worthwhile.  Code ought to be passing such data by reference!
 >     In the kernel, the only common large copy is (ought to be) the
 >     copy-on-write of shared pages.

 For kernel use, it's true that the code for >128k copies is not very
 useful; I included it only because the library is shared between the
 kernel and userland. If you think the optimization for larger buffers is
 a bad idea there, it could be removed or #ifdef'd out for the kernel.

 >  2) You want to look at the costs for short copies.  They are much more
 >     common than you think.
 >     I've not done any timings for 'rep movsx', but I did do some for
 >     'rep stosx' a couple of years ago.  The instruction setup costs on
 >     modern cpus are significant, so they shouldn't be used for small loops.
 >     A common non-optimisation is the use of a 'rep movsb' instruction to
 >     move the remaining bytes - which is likely to be zero [1].
 >     One option is to copy the last 4/8 bytes first!
 >     I also discovered that the Pentium IV needs the target address to be
 >     8-byte aligned!

 Fact 1: I misunderstood gcc's optimization policy a bit; I had thought
 memcpy() calls were more aggressively inlined or unrolled into mov ops. So
 yes, short copies are important. But they *are* properly inlined in many
 cases.

 from gcc(1):
        -mmemcpy
        -mno-memcpy
            Force (do not force) the use of "memcpy()" for non-trivial block
            moves.  The default is -mno-memcpy, which allows GCC to inline most
            constant-sized copies.
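
 For instance (an illustrative snippet, not from the PR), a constant-sized
 copy like the following is typically compiled to a single 8-byte mov with
 no function call:

 	#include <string.h>

 	struct pt { int x, y; };		/* 8 bytes */

 	void
 	assign(struct pt *a, const struct pt *b)
 	{
 		/* constant size: gcc normally inlines this */
 		memcpy(a, b, sizeof *a);
 	}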

 Fact 2: I think just one extra branch is not a burden for modern cpus. Real
 numbers follow. (ya, it could be a little burden...)

 plain:
 $ time ./memcpy_bench 64 100000000 0 0
 dst:0x502080 src:0x5020c0 len:64
 ./memcpy_bench 64 100000000 0 0  3.36s user 0.00s system 99% cpu 3.390 total

 patched:
 $ time ./memcpy_bench 64 100000000 0 0
 dst:0x502080 src:0x5020c0 len:64
 ./memcpy_bench 64 100000000 0 0  3.49s user 0.00s system 99% cpu 3.517 total

 Fact 3: I didn't touch the rep part of the code; I kept the patch as small
 as I could. I agree that the rep prefix should be used carefully.

 >  3) (2) may well apply to the use of movsb to align copies.

 Actually, I tried three versions of the alignment code, including a
 movsb-less one, and took the simpler and faster one. Anyway, there's no big
 difference between the three. Note also that memcpy's destination address
 is very likely to be aligned already.


 The real (what's real?) latency for rep instructions can be seen here  (8.3): 
 http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF


 Thanks for your comment.


 >  [1] Certain compilers convert:
 >  	while (a < b)
 >  	    *a++ = ' ';
 >      into the inlined version of memset(), including 2 'expensive to set up'
 >      'rep stosx' instructions, even though I explicitly wrote the loop out
 >      because the loop count is short....

 gcc 4 is a little bit smarter than that, I think. :)

From: David Laight <david@l8s.co.uk>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: lib/35535: memcpy() is very slow if not aligned
Date: Sat, 3 Feb 2007 21:23:31 +0000

 On Sat, Feb 03, 2007 at 02:25:02PM +0000, Kimura Fuyuki wrote:
 >  
 >  The real (what's real?) latency for rep instructions can be seen here  (8.3): 
 >  http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF

 Hmmm... I'm not entirely certain some of the suggestions in that document are correct!
 Some of the C code certainly isn't!
 Page 17 suggests the use of:
 	#define FLOAT2INTCAST(f)  (*((int *)(&f)))
 for speeding up float comparisons against constants.
 Someone hasn't read up on the C aliasing rules.
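
 [An aliasing-safe way to get the same effect, as a sketch:]

 	#include <string.h>

 	/* assumes sizeof(int) == sizeof(float) */
 	static inline int
 	float_bits(float f)
 	{
 		int i;

 		/* memcpy() is the well-defined way to reinterpret the bit
 		 * pattern; gcc compiles this to a single move. */
 		memcpy(&i, &f, sizeof i);
 		return i;
 	}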

 Page 106 also suggests you need to be a lot more careful with your
 write-combining code.  Thinking about it further, it probably can't be used
 without disabling interrupts (or maybe making the write to each cache line a
 RAS sequence).
 (But maybe I'm misunderstanding exactly what happens to the partially
 written line.)  E.g. the stuff in appendix B :-)

 Page 167 suggests never (ok hardly ever) using the rep string opcodes.
 The algorithm on pages 181+ looks like a good way to kill the I-cache.

 Oh, and for good measure, the code has to run on Intel cpus as well.

 	David

 -- 
 David Laight: david@l8s.co.uk

From: murray@river-styx.org
To: gnats-bugs@NetBSD.org
Cc: david@l8s.co.uk,
 david@l8s.co.uk
Subject: Re: lib/35535: memcpy() is very slow if not aligned
Date: Mon, 30 Mar 2009 23:25:12 +1100 (EST)

 Hi Folks,
    Some time ago (2+ years?) I did a review of all the low-level
 implementations of memcpy and friends for the amd64 platform, across all
 the operating systems and gcc versions I could find. The best
 implementation I found was in OpenSolaris, which used multiple
 optimisations depending on the size of the copy. I ported all their
 stuff to NetBSD/amd64 at the time and mentioned it to Andrew Doran and
 Christos IIRC, but there were concerns about the OpenSolaris licensing.
 Worth having another look?

 Take care,
     Murray Armfield

From: christos@zoulas.com (Christos Zoulas)
To: gnats-bugs@NetBSD.org, port-amd64-maintainer@netbsd.org, 
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, fuyuki@hadaly.org
Cc: 
Subject: Re: lib/35535: memcpy() is very slow if not aligned
Date: Mon, 30 Mar 2009 08:44:56 -0400

 On Mar 30, 12:30pm, murray@river-styx.org (murray@river-styx.org) wrote:
 -- Subject: Re: lib/35535: memcpy() is very slow if not aligned

 |  [...]
 |  I ported all their stuff to NetBSD/amd64 at the time and mentioned it
 |  to Andrew Doran and Christos IIRC, but there were concerns about the
 |  OpenSolaris licensing.  Worth having another look?

 Yes, definitely.

 christos

State-Changed-From-To: open->closed
State-Changed-By: dsl@NetBSD.org
State-Changed-When: Sun, 22 Nov 2009 17:29:35 +0000
State-Changed-Why:
Code added to align destination.


From: David Laight <dsl@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/35535 CVS commit: src/common/lib/libc/arch/x86_64/string
Date: Sun, 22 Nov 2009 17:25:48 +0000

 Module Name:	src
 Committed By:	dsl
 Date:		Sun Nov 22 17:25:47 UTC 2009

 Modified Files:
 	src/common/lib/libc/arch/x86_64/string: bcopy.S

 Log Message:
 Align to the destination buffer.
 This probably costs 1 clock (on modern cpus) in the normal case.
 But gives a big benefit when the destination is misaligned.
 In particular when the source has the same misalignment - although
 that may not be a gain on Nehalem!
 Fixes PR/35535


 To generate a diff of this commit:
 cvs rdiff -u -r1.3 -r1.4 src/common/lib/libc/arch/x86_64/string/bcopy.S

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

>Unformatted:
