NetBSD Problem Report #21255
Received: (qmail 15716 invoked by uid 605); 21 Apr 2003 23:41:08 -0000
Message-Id: <200304212341.h3LNf1mV027247@nool.mines.edu>
Date: Mon, 21 Apr 2003 17:41:01 -0600 (MDT)
From: jbernard@mines.edu
Sender: gnats-bugs-owner@netbsd.org
Reply-To: jbernard@mines.edu
To: gnats-bugs@gnats.netbsd.org
Subject: i386 random stack alignment causes wildly inconsistent double-precision performance
X-Send-Pr-Version: 3.95
>Number: 21255
>Category: port-i386
>Synopsis: i386 random stack alignment causes wildly inconsistent double-precision performance
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: port-i386-maintainer
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Apr 21 23:42:01 +0000 2003
>Closed-Date:
>Last-Modified: Sun Apr 15 21:47:41 +0000 2012
>Originator: Jim Bernard
>Release: NetBSD 1.6Q
>Organization:
>Environment:
System: NetBSD nool 1.6Q NetBSD 1.6Q (NOOL-$Revision: 1.35 $) #0: Sat Mar 22 17:59:26 MST 2003 jim@roc:/wd1/var/tmp/compile/sys/arch/i386/compile/NOOL i386
Architecture: i386
Machine: i386
>Description:
The performance of programs that use double-precision floating-point
variables heavily can vary dramatically, depending on accidents of
alignment of stack-resident variables. I'll include a small
demonstration program below that illustrates the effect. With that
program, I observed a 55% increase in execution time when the stack
misaligns the double-precision array, relative to the time when it
is aligned on an 8-byte boundary. This is similar to the slowdown
I've found in a production code written in Fortran. (FWIW, this same
test code exhibits stable, 8-byte alignment of doubles on a Linux/i386
system. I haven't tried it on other BSDs, etc.)
There appear to be two places where the alignment problems occur.
The first is in the execve system call (sys_execve in
sys/kern/kern_exec.c). There the argv array and the environment
array are copied to the stack, and the stack size is rounded up
to the nearest 4-byte boundary, using the ALIGN and ALIGNBYTES
macros from sys/arch/i386/include/param.h. Because the size is
rounded only to a 4-byte boundary, the alignment of stack-resident
double-precision variables in the program can be inefficient,
depending on details of the argument list (e.g., the program name)
and the environment (e.g., PWD and OLDPWD). Variations in these
cause the execution time to vary, seemingly randomly, between a
fast time and a much slower time.
Rounding up the stack size in execve to an 8-byte boundary removes
the variability, but the result is guaranteed misalignment and
slow execution all of the time. Adding an additional 4-byte
increment to the stack size in execve yields correct alignment
and fast execution all the time. So, there is evidently another
place where the stack size is changed by an odd multiple of 4 bytes,
and that change seems to be constant.
I owe many thanks to Sverre Froyen for coming up with several
critical insights that nailed this down. In particular, he
pointed the finger at execve and figured out that the environment
was causing variable execution times, as well as generating a
fix that proved this was the key to the varying alignment.
>How-To-Repeat:
The code below demonstrates the effect. It's a slightly modified
version of a program found in:
http://compilers.iecc.com/comparch/article/00-11-168
originally designed to demonstrate the importance of 8-byte
alignment for double-precision calculations. Here the array is
declared, rather than being allocated with malloc, so we can
observe the effects of execve on alignment of stack-resident
double-precision variables. (Thanks to Sverre for the idea to
declare the array.)
#include <stdio.h>
#define N 10000
int main (int argc, char **argv) {
double *x;
int i, j;
/*
x=(double*)malloc((N+1)*sizeof(double));
*/
double y[N+1];
x = y;
/*
if(argc==2) x=(double*)((int)x+4);
*/
printf("0x%x\n", (int)x);
printf("%d\n", (int)x%8);
for(i=0;i<N;i++) x[i]=(double)i;
for(i=0;i<N;i++) for (j=0;j<N;j++) x[i]=0.5*(x[j]+x[i]);
printf("%f\n", x[N-1]);
exit(0);
}
To observe the effect:
copy the code to a file, say dbl.c
cc -O2 dbl.c
ln a.out a
ln a.out aa
ln a.out aaa
ln a.out aaaa
time ./a
time ./aa
time ./aaa
time ./aaaa
You will notice that two of the executions are fast, and two
are slow. For those that are fast, the code will print a 0
for x%8, indicating 8-byte alignment of x. For those that are
slow, it will print +-4, indicating misalignment. The variation
is due to the different lengths of argv[0]. The numerical results
of all of the calculations will be identical.
You can also try changing your environment (e.g., through a
couple of cd's) to see its effect on the execution times.
Changes in the lengths of PWD and OLDPWD (e.g.) can change
the stack size and thus the alignment of x.
>Fix:
Changing ALIGNBYTES to (sizeof(double) - 1) in param.h and
adding 4 to len (the stack size) in sys_execve in kern_exec.c
(just after len = ALIGN(len)) does the trick, at least if only
the kernel is rebuilt.
But both of these changes may be a bit dangerous, and they
don't seem ideal. Both ALIGN and ALIGNBYTES are used in a
number of places in the kernel, in libc, and elsewhere, and
it may not be desirable to change the alignment in all those
places. I note that the arm port defines STACKALIGN and
STACKALIGNBYTES macros in its param.h, which sound like good
names to use in execve, but they're not used there. Maybe
they should be.
Certainly it would not be friendly to other ports
to arbitrarily add 4 to the stack size in execve, but maybe
adding sizeof(int) or some such would be acceptable.
I don't know what is the source of the additional change in
the stack size by an odd multiple of 4 bytes. I've looked
a bit at crt0.c (lib/csu/i386_elf/crt0.c), which seems like a
good candidate, but I haven't managed to figure out exactly
what its effect on the stack is, so I'll leave it to someone
familiar with that code to sort out whether it is at fault
and whether the fix should go there or in execve.
Also, given that this is probably going to be easy to break
in the future, it might be a good idea to put something like
the test program above into a regression test.
>Release-Note:
>Audit-Trail:
From: "Charles M. Hannum" <mycroft@netbsd.org>
To: gnats-bugs@netbsd.org, jbernard@mines.edu
Cc:
Subject: Re: port-i386/21255
Date: 21 Apr 2003 23:57:51 +0000
This is an incorrect analysis of the problem. The real cause of this
problem is that the x86 ABI simply does not do 8-byte alignment of the
stack, ever. Function calls regularly push odd multiples of 4 bytes on
the stack.
"Fixing" this is really an ABI change, and requires significant compiler
modifications. This is a subject to bring up to the GCC maintainers.
From: Jason Thorpe <thorpej@wasabisystems.com>
To: jbernard@mines.edu
Cc: gnats-bugs@gnats.netbsd.org
Subject: Re: port-i386/21255: i386 random stack alignment causes wildly inconsistent double-precision performance
Date: Mon, 21 Apr 2003 18:04:27 -0700
On Monday, April 21, 2003, at 04:41 PM, jbernard@mines.edu wrote:
> Rounding up the stack size in execve to an 8-byte boundary removes
> the variability, but the result is guaranteed misalignment and
> slow execution all of the time. Adding an additional 4-byte
> increment to the stack size in execve yields correct alignment
> and fast execution all the time. So, there is evidently another
> place where the stack size is changed by an odd multiple of 4 bytes,
> and that change seems to be constant.
Actually, I would almost say that this is a bug the compiler, but I'm
not totally convinced.
E.g. on ARM using ATPCS, the compiler goes to great lengths to ensure
that the stack pointer is aligned to an 8 byte boundary in non-leaf
functions. Our library assembly code for ARM also does the same.
I'm not sure the compiler for i386 does this, and I'm positive our
assembly routines don't go to great lengths to preserve such alignment.
That said, however, it probably is worth coming up with a macro that MI
code can use to set up the stack pointer in such a way as to make it
more likely that userland code can do the right thing.
Would you mind spending some time tracking down why the +4 is needed?
It would be nice to place a comment in the appropriate place suggesting
that whatever code it happens to be not be changed unless the
performance implications are well-understood :-)
-- Jason R. Thorpe <thorpej@wasabisystems.com>
Responsible-Changed-From-To: port-i386-maintainer->thorpej
Responsible-Changed-By: thorpej
Responsible-Changed-When: Tue Apr 22 01:05:02 UTC 2003
Responsible-Changed-Why:
I will handle this bug.
From: Jim Bernard <jbernard@mines.edu>
To: "Charles M. Hannum" <mycroft@netbsd.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: port-i386/21255
Date: Mon, 21 Apr 2003 19:50:29 -0600
On Mon, Apr 21, 2003 at 11:57:51PM +0000, Charles M. Hannum wrote:
> This is an incorrect analysis of the problem. The real cause of this
> problem is that the x86 ABI simply does not do 8-byte alignment of the
> stack, ever. Function calls regularly push odd multiples of 4 bytes on
> the stack.
>
> "Fixing" this is really an ABI change, and requires significant compiler
> modifications. This is a subject to bring up to the GCC maintainers.
OK, to test this, I tried putting the calculation from the test code
into a function rather than the main program. I tried defining/calling
the function with no arguments and with a single int argument. In neither
case, on my patched system, did I get slow execution or misalignment of the
double-precision array. Only if I intentionally misaligned the array pointer,
could I produce slow execution.
Could you construct an example that will demonstrate this ABI problem by
causing the array to be misaligned on a system that has been patched to
avoid it (according to the proof-of-concept fix in the PR or any improved
fix)? I'll be happy to test your suggestions on my already-patched system.
It would be very useful to demonstrate that there is a deeper problem.
From: Jim Bernard <jbernard@mines.edu>
To: Jason Thorpe <thorpej@wasabisystems.com>
Cc: gnats-bugs@gnats.netbsd.org
Subject: Re: port-i386/21255: i386 random stack alignment causes wildly inconsistent double-precision performance
Date: Tue, 22 Apr 2003 09:16:48 -0600
Jason,
On Mon, Apr 21, 2003 at 06:04:27PM -0700, Jason Thorpe wrote:
>
> On Monday, April 21, 2003, at 04:41 PM, jbernard@mines.edu wrote:
>
> > Rounding up the stack size in execve to an 8-byte boundary removes
> > the variability, but the result is guaranteed misalignment and
> > slow execution all of the time. Adding an additional 4-byte
> > increment to the stack size in execve yields correct alignment
> > and fast execution all the time. So, there is evidently another
> > place where the stack size is changed by an odd multiple of 4 bytes,
> > and that change seems to be constant.
>
> Actually, I would almost say that this is a bug the compiler, but I'm
> not totally convinced.
I can buy that the compiler could, and maybe should, fix up the alignment.
I can also buy that the OS should make sure the alignment be correct, if
that is possible. Charles seems to indicate that the latter is not possible,
but I haven't managed to construct a case that demonstrates failure of the
proof-of-concept fix. So I asked him to try to come up with one.
But ultimately, it seems like there should be some sort of formal agreement
on where the responsibility lies. Charles obviously thinks it's the
compiler's job. Is there a clear standard specified somewhere?
BTW: I tried compiling the test code on a Linux box (gcc 2.96). When it
is executed on the Linux box, the double alignment is correct regardless of
the length of the name of the program. Running the same binary on a NetBSD
system yields incorrect alignment in all cases on both patched and unpatched
systems. So, it would seem that somewhere in the Linux code or the Linux
emulation there is something that tries to get the alignment right (i.e.,
independent of the size of argv), but that something in NetBSD adds an extra
4-byte offset, and that lies outside the scope of the patch.
> E.g. on ARM using ATPCS, the compiler goes to great lengths to ensure
> that the stack pointer is aligned to an 8 byte boundary in non-leaf
> functions. Our library assembly code for ARM also does the same.
ATPCS is a commercial ARM compiler?
> That said, however, it probably is worth coming up with a macro that MI
> code can use to set up the stack pointer in such a way as to make it
> more likely that userland code can do the right thing.
At least until the compiler is made to do a better job of aligning doubles,
if that's where the responsibility really lies.
> Would you mind spending some time tracking down why the +4 is needed?
> It would be nice to place a comment in the appropriate place suggesting
> that whatever code it happens to be not be changed unless the
> performance implications are well-understood :-)
I'll give it another try.
--Jim
From: Jim Bernard <jbernard@mines.edu>
To: Jason Thorpe <thorpej@wasabisystems.com>
Cc: gnats-bugs@gnats.netbsd.org, sverre@viewmark.com
Subject: Re: port-i386/21255: i386 random stack alignment causes wildly inconsistent double-precision performance
Date: Wed, 23 Apr 2003 14:13:06 -0600
Jason,
On Mon, Apr 21, 2003 at 06:04:27PM -0700, Jason Thorpe wrote:
>
> Would you mind spending some time tracking down why the +4 is needed?
> It would be nice to place a comment in the appropriate place suggesting
> that whatever code it happens to be not be changed unless the
> performance implications are well-understood :-)
OK, I've narrowed it down at least---it occurs between the calls to _init()
and main() in crt0.o. It smells like a compiler bug, and Sverre Froyen
has now compiled crt0.c with gcc 3.2.2, finding that it _doesn't_ require
the extra 4-byte offset to get the stack aligned. This would seem to
confirm that it is caused by a bug in gcc 2.95.3. I'll include details below.
Regarding the appropriate fix, Sverre and I concluded that it probably
makes more sense to align the stack in crt0 rather than in execve. This
appears to be what is done on the arm32 port, as well, so there is precedent
for it. As well, it should get around Charles' objection to the ABI change
that would result from doing the alignment in execve (though it's hard to
imagine how changing from randomly varying alignment to any fixed alignment
could actually break anything). We've put together a patch for crt0.c,
included below.
Also, I've looked at some assembly listings and tried setting up traps
for the compiler to see if it would misalign the stack (nothing as challenging
as crt0, though). What I've seen is that the compiler always seems to go
out of its way to align the stack before making a function call. So, I
think it's safe to assume that in the absence of a compiler blunder,
alignment of the stack will be preserved through function calls. If you
would like to see listings that demonstrate this, let me know.
There is also a question of the alignment boundary. The ia32 architecture
includes instructions that _require_ 16-byte alignment of memory operands
(the SSE instructions and kin). In my examinations of listings, I've seen
that the compiler _usually_ aligns the stack to a 16-byte boundary prior to
function calls, but I can make it fail to do that (and only align to 8 bytes).
So, with this compiler, it may not be useful to attempt 16-byte alignment.
But maybe gcc 3.2.2 or later might successfully preserve 16-byte alignment.
The patch to crt0.c (below) includes only 8-byte alignment, but it's trivial
to switch to 16.
Here's the relevant section of the disassembled ___start function from
a program compiled with gcc 2.95.3:
Up to this point, it's easy to track stack growth and see that the stack
pointer is aligned on a 16-byte boundary. I haven't figured out how to
treat the calls to atexit and printf+16, since they don't look like normal
function calls (which are stack neutral). If they're assumed to be stack
neutral, the instruction at ___start+153 (starred) introduces a spurious
4-byte increase in the stack, which yields misalignment when main is
called. The parenthesized arithmetic on the far right is my stack-pointer
counting (the rhs is the stack pointer relative to that just prior to
the CALL instruction that began ___start). (The syntax is Intel.)
0x8048646 <___start+134>: call 0x8048568 <atexit> (?)
0x804864b <___start+139>: call 0x8048548 <printf+16> (?)
0x8048650 <___start+144>: mov eax,DWORD PTR [ebx+52]
0x8048656 <___start+150>: add esp,0xfffffff4 (s - 12 = -60)
0x8048659 <___start+153>: * add esp,0xfffffffc (s - 4 = -64)
0x804865c <___start+156>: mov eax,DWORD PTR [eax]
0x804865e <___start+158>: push eax (s - 4 = -68)
0x804865f <___start+159>: push edi (s - 4 = -72)
0x8048660 <___start+160>: mov eax,DWORD PTR [ebp+8]
0x8048663 <___start+163>: push eax (s - 4 = -76)
0x8048664 <___start+164>: call 0x80489c4 <main> (misaligned!)
And here's the patch for crt0.c. There are also some minor rearrangements
of registers to premit reduction of the instruction count and to make it
easier to follow. The subl can be deleted when the compiler is fixed.
While we've both tested it and found it to work, we didn't test it extensively
(e.g., checking the args to main), so some additional testing would be wise.
--- crt0.c-dist 2001-12-31 04:04:20.000000000 -0700
+++ crt0.c 2003-04-23 12:48:41.000000000 -0600
@@ -44,19 +44,22 @@
" .align 4 \n"
" .globl __start \n"
" .globl _start \n"
"_start: \n"
"__start: \n"
+" movl %esp,%eax # save stack pointer \n"
+" and $0xfffffff8,%esp # align to 8-byte boundary \n"
+" subl $4,%esp # compensate for later offset \n"
" pushl %ebx # ps_strings \n"
" pushl %ecx # obj \n"
" pushl %edx # cleanup \n"
-" movl 12(%esp),%eax \n"
-" leal 20(%esp,%eax,4),%ecx \n"
-" leal 16(%esp),%edx \n"
-" pushl %ecx \n"
+" movl (%eax),%ebx # argc \n"
+" leal 4(%eax),%ecx # *argv \n"
+" leal 8(%eax,%ebx,4),%edx # *envp \n"
" pushl %edx \n"
-" pushl %eax \n"
+" pushl %ecx \n"
+" pushl %ebx \n"
" call ___start");
void
___start(argc, argv, envp, cleanup, obj, ps_strings)
int argc;
From: Jim Bernard <jbernard@mines.edu>
To: mrg@netbsd.org
Cc: gnats-bugs@netbsd.org, thorpej@netbsd.org,
gnats-admin@netbsd.org, sverre@viewmark.com
Subject: Re: PR/21255 CVS commit: src/gnu/dist/gcc4
Date: Sat, 20 May 2006 16:58:36 -0600
On Thu, Apr 20, 2006 at 10:25:56AM +0000, matthew green wrote:
> The following reply was made to PR port-i386/21255; it has been noted by GNATS.
>
> From: matthew green <mrg@netbsd.org>
> To: gnats-bugs@netbsd.org
> Cc:
> Subject: PR/21255 CVS commit: src/gnu/dist/gcc4
> Date: Thu, 20 Apr 2006 10:20:40 +0000 (UTC)
>
> Module Name: src
> Committed By: mrg
> Date: Thu Apr 20 10:19:48 UTC 2006
>
> Update of /cvsroot/src/gnu/dist/gcc4
> In directory ivanova.netbsd.org:/tmp/cvs-serv12371
>
> Log Message:
> initial import of GCC 4.1 branch, from 2006-04-20. list of changes way
> too large to list, but see:
> http://gcc.gnu.org/gcc-3.4/changes.html
> http://gcc.gnu.org/gcc-4.0/changes.html
> http://gcc.gnu.org/gcc-4.1/changes.html
> for the details.
>
> Status:
>
> Vendor Tag: FSF
> Release Tags: gcc-4-1-20060420
Well, I finally found a bit of time to try out gcc 4.1.1---built a 3.99.20
system from sources updated May 18, 2006 and _without_ the patch to crt0.c,
which I have been using happily ever since the spring of 2003. And it appears
that the 8-byte stack-alignment problem has been fixed. Here's a slightly
modified version of the test code:
#include <stdio.h>
#define N 10000
int main (int argc, char **argv) {
double *x;
int i, j;
/*
x=(double*)malloc((N+1)*sizeof(double));
*/
double y[N+1];
x = y;
if(argc==2) x=(double*)((int)x+4);
printf("0x%x\n", (int)x);
printf("%d\n", (int)x%8);
printf("%d\n", (int)x%16);
for(i=0;i<N;i++) x[i]=(double)i;
for(i=0;i<N;i++) for (j=0;j<N;j++) x[i]=0.5*(x[j]+x[i]);
printf("%f\n", x[N-1]);
exit(0);
}
If you compile it (e.g., gcc -O3) and link it to 4 different names:
ln a.out l
ln a.out ll
ln a.out lll
ln a.out llll
you'll find that execution times are all essentially the same, and the
address of the double-precision array x changes. Curiously, it changes
by 16 bytes, always remaining on an odd 8-byte boundary. Also, if the
array is allocated with malloc, instead of being declared, the alignment
is (still) optimal.
I haven't been able to test with fortran code, since the fortran compiler
seems to have disappeared, to my considerable disappointment. Once I figure
out how to get it back, I'll do some more testing with fortran code, but at
this point it looks like this alignment problem is solved.
Thanks!
BTW: Just for the record, here's the latest version of the patch I was
using for crt0.o. The only difference from the one previously in the PR is
the conditional inclusion of the 4-byte extra offset of the stack pointer
when GCC < 3 is used.
--- crt0.c-dist 2001-12-31 04:04:20.000000000 -0700
+++ crt0.c 2003-04-26 17:45:54.000000000 -0600
@@ -44,19 +44,24 @@
" .align 4 \n"
" .globl __start \n"
" .globl _start \n"
"_start: \n"
"__start: \n"
+" movl %esp,%eax # save stack pointer \n"
+" and $0xfffffff8,%esp # align to 8-byte boundary \n"
+#if defined(__GNUC__) && (__GNUC__ < 3)
+" subl $4,%esp # compensate for later offset \n"
+#endif
" pushl %ebx # ps_strings \n"
" pushl %ecx # obj \n"
" pushl %edx # cleanup \n"
-" movl 12(%esp),%eax \n"
-" leal 20(%esp,%eax,4),%ecx \n"
-" leal 16(%esp),%edx \n"
-" pushl %ecx \n"
+" movl (%eax),%ebx # argc \n"
+" leal 4(%eax),%ecx # *argv \n"
+" leal 8(%eax,%ebx,4),%edx # *envp \n"
" pushl %edx \n"
-" pushl %eax \n"
+" pushl %ecx \n"
+" pushl %ebx \n"
" call ___start");
void
___start(argc, argv, envp, cleanup, obj, ps_strings)
int argc;
--Jim
Responsible-Changed-From-To: thorpej->port-i386-maintainer
Responsible-Changed-By: wiz@NetBSD.org
Responsible-Changed-When: Sun, 15 Apr 2012 21:47:41 +0000
Responsible-Changed-Why:
Back to role account, thorpej left
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.