NetBSD Problem Report #21255

Received: (qmail 15716 invoked by uid 605); 21 Apr 2003 23:41:08 -0000
Message-Id: <200304212341.h3LNf1mV027247@nool.mines.edu>
Date: Mon, 21 Apr 2003 17:41:01 -0600 (MDT)
From: jbernard@mines.edu
Sender: gnats-bugs-owner@netbsd.org
Reply-To: jbernard@mines.edu
To: gnats-bugs@gnats.netbsd.org
Subject: i386 random stack alignment causes wildly inconsistent double-precision performance
X-Send-Pr-Version: 3.95

>Number:         21255
>Category:       port-i386
>Synopsis:       i386 random stack alignment causes wildly inconsistent double-precision performance
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    port-i386-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Apr 21 23:42:01 +0000 2003
>Closed-Date:    
>Last-Modified:  Sun Apr 15 21:47:41 +0000 2012
>Originator:     Jim Bernard
>Release:        NetBSD 1.6Q
>Organization:
>Environment:
System: NetBSD nool 1.6Q NetBSD 1.6Q (NOOL-$Revision: 1.35 $) #0: Sat Mar 22 17:59:26 MST 2003 jim@roc:/wd1/var/tmp/compile/sys/arch/i386/compile/NOOL i386
Architecture: i386
Machine: i386
>Description:
	The performance of programs that use double-precision floating-point
	variables heavily can vary dramatically, depending on accidents of
	alignment of stack-resident variables.  I'll include a small
	demonstration program below that illustrates the effect.  With that
	program, I observed a 55% increase in execution time when the stack
	misaligns the double-precision array, relative to the time when it
	is aligned on an 8-byte boundary.  This is similar to the slowdown
	I've found in a production code written in Fortran.  (FWIW, this same
	test code exhibits stable, 8-byte alignment of doubles on a Linux/i386
	system.  I haven't tried it on other BSDs, etc.)

	There appear to be two places where the alignment problems occur.
	The first is in the execve system call (sys_execve in
	sys/kern/kern_exec.c).  There the argv array and the environment
	array are copied to the stack, and the stack size is rounded up
	to the nearest 4-byte boundary, using the ALIGN and ALIGNBYTES
	macros from sys/arch/i386/include/param.h.  Because the size is
	rounded only to a 4-byte boundary, the alignment of stack-resident
	double-precision variables in the program can be inefficient,
	depending on details of the argument list (e.g., the program name)
	and the environment (e.g., PWD and OLDPWD).  Variations in these
	cause the execution time to vary, seemingly randomly, between a
	fast time and a much slower time.

	Rounding up the stack size in execve to an 8-byte boundary removes
	the variability, but the result is guaranteed misalignment and
	slow execution all of the time.  Adding an additional 4-byte
	increment to the stack size in execve yields correct alignment
	and fast execution all the time.  So, there is evidently another
	place where the stack size is changed by an odd multiple of 4 bytes,
	and that change seems to be constant.

	I owe many thanks to Sverre Froyen for coming up with several
	critical insights that nailed this down.  In particular, he
	pointed the finger at execve and figured out that the environment
	was causing variable execution times, as well as generating a
	fix that proved this was the key to the varying alignment.

>How-To-Repeat:
	The code below demonstrates the effect.  It's a slightly modified
	version of a program found in:

	  http://compilers.iecc.com/comparch/article/00-11-168

	originally designed to demonstrate the importance of 8-byte
	alignment for double-precision calculations.  Here the array is
	declared, rather than being allocated with malloc, so we can
	observe the effects of execve on alignment of stack-resident
	double-precision variables.  (Thanks to Sverre for the idea to
	declare the array.)

#include <stdio.h>
#define N 10000
int main (int argc, char **argv) {
double *x;
int i, j;

/*
x=(double*)malloc((N+1)*sizeof(double));
 */
double y[N+1];
x = y;
/*
if(argc==2) x=(double*)((int)x+4);
 */

printf("0x%x\n", (int)x);
printf("%d\n", (int)x%8);
for(i=0;i<N;i++) x[i]=(double)i;
for(i=0;i<N;i++) for (j=0;j<N;j++) x[i]=0.5*(x[j]+x[i]);

printf("%f\n", x[N-1]);
exit(0);
}

	To observe the effect:

	  copy the code to a file, say dbl.c
	  cc -O2 dbl.c
	  ln a.out a
	  ln a.out aa
	  ln a.out aaa
	  ln a.out aaaa
	  time ./a
	  time ./aa
	  time ./aaa
	  time ./aaaa

	You will notice that two of the executions are fast, and two
	are slow.  For those that are fast, the code will print a 0
	for x%8, indicating 8-byte alignment of x.  For those that are
	slow, it will print +-4, indicating misalignment.  The variation
	is due to the different lengths of argv[0].  The numerical results
	of all of the calculations will be identical.

	You can also try changing your environment (e.g., through a
	couple of cd's) to see its effect on the execution times.
	Changes in the lengths of PWD and OLDPWD (e.g.) can change
	the stack size and thus the alignment of x.

>Fix:
	Changing ALIGNBYTES to (sizeof(double) - 1) in param.h and
	adding 4 to len (the stack size) in sys_execve in kern_exec.c
	(just after len = ALIGN(len)) does the trick, at least if only
	the kernel is rebuilt.

	But both of these changes may be a bit dangerous, and they
	don't seem ideal.  Both ALIGN and ALIGNBYTES are used in a
	number of places in the kernel, in libc, and elsewhere, and
	it may not be desirable to change the alignment in all those
	places.  I note that the arm port defines STACKALIGN and
	STACKALIGNBYTES macros in its param.h, which sound like good
	names to use in execve, but they're not used there.  Maybe
	they should be.

	Certainly it would not be friendly to other ports
	to arbitrarily add 4 to the stack size in execve, but maybe
	adding sizeof(int) or some such would be acceptable.
	I don't know what is the source of the additional change in
	the stack size by an odd multiple of 4 bytes.  I've looked
	a bit at crt0.c (lib/csu/i386_elf/crt0.c), which seems like a
	good candidate, but I haven't managed to figure out exactly
	what its effect on the stack is, so I'll leave it to someone
	familiar with that code to sort out whether it is at fault
	and whether the fix should go there or in execve.

	Also, given that this is probably going to be easy to break
	in the future, it might be a good idea to put something like
	the test program above into a regression test.
>Release-Note:
>Audit-Trail:

From: "Charles M. Hannum" <mycroft@netbsd.org>
To: gnats-bugs@netbsd.org, jbernard@mines.edu
Cc:  
Subject: Re: port-i386/21255
Date: 21 Apr 2003 23:57:51 +0000

 This is an incorrect analysis of the problem.  The real cause of this
 problem is that the x86 ABI simply does not do 8-byte alignment of the
 stack, ever.  Function calls regularly push odd multiples of 4 bytes on
 the stack.

 "Fixing" this is really an ABI change, and requires significant compiler
 modifications.  This is a subject to bring up to the GCC maintainers.


From: Jason Thorpe <thorpej@wasabisystems.com>
To: jbernard@mines.edu
Cc: gnats-bugs@gnats.netbsd.org
Subject: Re: port-i386/21255: i386 random stack alignment causes wildly inconsistent double-precision performance
Date: Mon, 21 Apr 2003 18:04:27 -0700

 On Monday, April 21, 2003, at 04:41  PM, jbernard@mines.edu wrote:

 > 	Rounding up the stack size in execve to an 8-byte boundary removes
 > 	the variability, but the result is guaranteed misalignment and
 > 	slow execution all of the time.  Adding an additional 4-byte
 > 	increment to the stack size in execve yields correct alignment
 > 	and fast execution all the time.  So, there is evidently another
 > 	place where the stack size is changed by an odd multiple of 4 bytes,
 > 	and that change seems to be constant.

 Actually, I would almost say that this is a bug the compiler, but I'm 
 not totally convinced.

 E.g. on ARM using ATPCS, the compiler goes to great lengths to ensure 
 that the stack pointer is aligned to an 8 byte boundary in non-leaf 
 functions.  Our library assembly code for ARM also does the same.

 I'm not sure the compiler for i386 does this, and I'm positive our 
 assembly routines don't go to great lengths to preserve such alignment.

 That said, however, it probably is worth coming up with a macro that MI 
 code can use to set up the stack pointer in such a way as to make it 
 more likely that userland code can do the right thing.

 Would you mind spending some time tracking down why the +4 is needed?  
 It would be nice to place a comment in the appropriate place suggesting 
 that whatever code it happens to be not be changed unless the 
 performance implications are well-understood :-)

          -- Jason R. Thorpe <thorpej@wasabisystems.com>

Responsible-Changed-From-To: port-i386-maintainer->thorpej 
Responsible-Changed-By: thorpej 
Responsible-Changed-When: Tue Apr 22 01:05:02 UTC 2003 
Responsible-Changed-Why:  
I will handle this bug. 

From: Jim Bernard <jbernard@mines.edu>
To: "Charles M. Hannum" <mycroft@netbsd.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: port-i386/21255
Date: Mon, 21 Apr 2003 19:50:29 -0600

 On Mon, Apr 21, 2003 at 11:57:51PM +0000, Charles M. Hannum wrote:
 > This is an incorrect analysis of the problem.  The real cause of this
 > problem is that the x86 ABI simply does not do 8-byte alignment of the
 > stack, ever.  Function calls regularly push odd multiples of 4 bytes on
 > the stack.
 > 
 > "Fixing" this is really an ABI change, and requires significant compiler
 > modifications.  This is a subject to bring up to the GCC maintainers.

   OK, to test this, I tried putting the calculation from the test code
 into a function rather than the main program.  I tried defining/calling
 the function with no arguments and with a single int argument.  In neither
 case, on my patched system, did I get slow execution or misalignment of the
 double-precision array.  Only if I intentionally misaligned the array pointer,
 could I produce slow execution.

   Could you construct an example that will demonstrate this ABI problem by
 causing the array to be misaligned on a system that has been patched to
 avoid it (according to the proof-of-concept fix in the PR or any improved
 fix)?  I'll be happy to test your suggestions on my already-patched system.
 It would be very useful to demonstrate that there is a deeper problem.

From: Jim Bernard <jbernard@mines.edu>
To: Jason Thorpe <thorpej@wasabisystems.com>
Cc: gnats-bugs@gnats.netbsd.org
Subject: Re: port-i386/21255: i386 random stack alignment causes wildly inconsistent double-precision performance
Date: Tue, 22 Apr 2003 09:16:48 -0600

 Jason,

 On Mon, Apr 21, 2003 at 06:04:27PM -0700, Jason Thorpe wrote:
 > 
 > On Monday, April 21, 2003, at 04:41  PM, jbernard@mines.edu wrote:
 > 
 > >	Rounding up the stack size in execve to an 8-byte boundary removes
 > >	the variability, but the result is guaranteed misalignment and
 > >	slow execution all of the time.  Adding an additional 4-byte
 > >	increment to the stack size in execve yields correct alignment
 > >	and fast execution all the time.  So, there is evidently another
 > >	place where the stack size is changed by an odd multiple of 4 bytes,
 > >	and that change seems to be constant.
 > 
 > Actually, I would almost say that this is a bug the compiler, but I'm 
 > not totally convinced.

   I can buy that the compiler could, and maybe should, fix up the alignment.
 I can also buy that the OS should make sure the alignment be correct, if
 that is possible.  Charles seems to indicate that the latter is not possible,
 but I haven't managed to construct a case that demonstrates failure of the
 proof-of-concept fix.  So I asked him to try to come up with one.

   But ultimately, it seems like there should be some sort of formal agreement
 on where the responsibility lies.  Charles obviously thinks it's the
 compiler's job.  Is there a clear standard specified somewhere?

   BTW: I tried compiling the test code on a Linux box (gcc 2.96).  When it
 is executed on the Linux box, the double alignment is correct regardless of
 the length of the name of the program.  Running the same binary on a NetBSD
 system yields incorrect alignment in all cases on both patched and unpatched
 systems.  So, it would seem that somewhere in the Linux code or the Linux
 emulation there is something that tries to get the alignment right (i.e.,
 independent of the size of argv), but that something in NetBSD adds an extra
 4-byte offset, and that lies outside the scope of the patch.

 > E.g. on ARM using ATPCS, the compiler goes to great lengths to ensure 
 > that the stack pointer is aligned to an 8 byte boundary in non-leaf 
 > functions.  Our library assembly code for ARM also does the same.

   ATPCS is a commercial ARM compiler?

 > That said, however, it probably is worth coming up with a macro that MI 
 > code can use to set up the stack pointer in such a way as to make it 
 > more likely that userland code can do the right thing.

   At least until the compiler is made to do a better job of aligning doubles,
 if that's where the responsibility really lies.

 > Would you mind spending some time tracking down why the +4 is needed?  
 > It would be nice to place a comment in the appropriate place suggesting 
 > that whatever code it happens to be not be changed unless the 
 > performance implications are well-understood :-)

   I'll give it another try.

 --Jim

From: Jim Bernard <jbernard@mines.edu>
To: Jason Thorpe <thorpej@wasabisystems.com>
Cc: gnats-bugs@gnats.netbsd.org, sverre@viewmark.com
Subject: Re: port-i386/21255: i386 random stack alignment causes wildly inconsistent double-precision performance
Date: Wed, 23 Apr 2003 14:13:06 -0600

 Jason,

 On Mon, Apr 21, 2003 at 06:04:27PM -0700, Jason Thorpe wrote:
 > 
 > Would you mind spending some time tracking down why the +4 is needed?  
 > It would be nice to place a comment in the appropriate place suggesting 
 > that whatever code it happens to be not be changed unless the 
 > performance implications are well-understood :-)

   OK, I've narrowed it down at least---it occurs between the calls to _init()
 and main() in crt0.o.  It smells like a compiler bug, and Sverre Froyen
 has now compiled crt0.c with gcc 3.2.2, finding that it _doesn't_ require
 the extra 4-byte offset to get the stack aligned.  This would seem to
 confirm that it is caused by a bug in gcc 2.95.3.  I'll include details below.

   Regarding the appropriate fix, Sverre and I concluded that it probably
 makes more sense to align the stack in crt0 rather than in execve.  This
 appears to be what is done on the arm32 port, as well, so there is precedent
 for it.  As well, it should get around Charles' objection to the ABI change
 that would result from doing the alignment in execve (though it's hard to
 imagine how changing from randomly varying alignment to any fixed alignment
 could actually break anything).  We've put together a patch for crt0.c,
 included below.

   Also, I've looked at some assembly listings and tried setting up traps
 for the compiler to see if it would misalign the stack (nothing as challenging
 as crt0, though).  What I've seen is that the compiler always seems to go
 out of its way to align the stack before making a function call.  So, I
 think it's safe to assume that in the absence of a compiler blunder,
 alignment of the stack will be preserved through function calls.  If you
 would like to see listings that demonstrate this, let me know.

   There is also a question of the alignment boundary.  The ia32 architecture
 includes instructions that _require_ 16-byte alignment of memory operands
 (the SSE instructions and kin).  In my examinations of listings, I've seen
 that the compiler _usually_ aligns the stack to a 16-byte boundary prior to
 function calls, but I can make it fail to do that (and only align to 8 bytes).
 So, with this compiler, it may not be useful to attempt 16-byte alignment.
 But maybe gcc 3.2.2 or later might successfully preserve 16-byte alignment.
 The patch to crt0.c (below) includes only 8-byte alignment, but it's trivial
 to switch to 16.


   Here's the relevant section of the disassembled ___start function from
 a program compiled with gcc 2.95.3:

   Up to this point, it's easy to track stack growth and see that the stack
 pointer is aligned on a 16-byte boundary.  I haven't figured out how to
 treat the calls to atexit and printf+16, since they don't look like normal
 function calls (which are stack neutral).  If they're assumed to be stack
 neutral, the instruction at ___start+153 (starred) introduces a spurious
 4-byte increase in the stack, which yields misalignment when main is
 called.  The parenthesized arithmetic on the far right is my stack-pointer
 counting (the rhs is the stack pointer relative to that just prior to
 the CALL instruction that began ___start).  (The syntax is Intel.)

 0x8048646 <___start+134>:       call   0x8048568 <atexit>       (?)
 0x804864b <___start+139>:       call   0x8048548 <printf+16>    (?)
 0x8048650 <___start+144>:       mov    eax,DWORD PTR [ebx+52]
 0x8048656 <___start+150>:       add    esp,0xfffffff4           (s - 12 = -60)
 0x8048659 <___start+153>:   *   add    esp,0xfffffffc           (s - 4 = -64)
 0x804865c <___start+156>:       mov    eax,DWORD PTR [eax]
 0x804865e <___start+158>:       push   eax                      (s - 4 = -68)
 0x804865f <___start+159>:       push   edi                      (s - 4 = -72)
 0x8048660 <___start+160>:       mov    eax,DWORD PTR [ebp+8]
 0x8048663 <___start+163>:       push   eax                      (s - 4 = -76)
 0x8048664 <___start+164>:       call   0x80489c4 <main>         (misaligned!)


   And here's the patch for crt0.c.  There are also some minor rearrangements
 of registers to premit reduction of the instruction count and to make it
 easier to follow.  The subl can be deleted when the compiler is fixed.
 While we've both tested it and found it to work, we didn't test it extensively
 (e.g., checking the args to main), so some additional testing would be wise.

 --- crt0.c-dist	2001-12-31 04:04:20.000000000 -0700
 +++ crt0.c	2003-04-23 12:48:41.000000000 -0600
 @@ -44,19 +44,22 @@
  "	.align	4			\n"
  "	.globl	__start			\n"
  "	.globl	_start			\n"
  "_start:				\n"
  "__start:				\n"
 +"	movl	%esp,%eax		# save stack pointer		\n"
 +"	and	$0xfffffff8,%esp	# align to 8-byte boundary	\n"
 +"	subl	$4,%esp			# compensate for later offset	\n"
  "	pushl	%ebx			# ps_strings	\n"
  "	pushl	%ecx			# obj		\n"
  "	pushl	%edx			# cleanup	\n"
 -"	movl	12(%esp),%eax		\n"
 -"	leal	20(%esp,%eax,4),%ecx	\n"
 -"	leal	16(%esp),%edx		\n"
 -"	pushl	%ecx			\n"
 +"	movl	(%eax),%ebx		# argc		\n"
 +"	leal	4(%eax),%ecx		# *argv		\n"
 +"	leal	8(%eax,%ebx,4),%edx	# *envp		\n"
  "	pushl	%edx			\n"
 -"	pushl	%eax			\n"
 +"	pushl	%ecx			\n"
 +"	pushl	%ebx			\n"
  "	call	___start");

  void
  ___start(argc, argv, envp, cleanup, obj, ps_strings)
  	int argc;
From: Jim Bernard <jbernard@mines.edu>
To: mrg@netbsd.org
Cc: gnats-bugs@netbsd.org, thorpej@netbsd.org,
	gnats-admin@netbsd.org, sverre@viewmark.com
Subject: Re: PR/21255 CVS commit: src/gnu/dist/gcc4
Date: Sat, 20 May 2006 16:58:36 -0600

 On Thu, Apr 20, 2006 at 10:25:56AM +0000, matthew green wrote:
 > The following reply was made to PR port-i386/21255; it has been noted by GNATS.
 > 
 > From: matthew green <mrg@netbsd.org>
 > To: gnats-bugs@netbsd.org
 > Cc: 
 > Subject: PR/21255 CVS commit: src/gnu/dist/gcc4
 > Date: Thu, 20 Apr 2006 10:20:40 +0000 (UTC)
 > 
 >  Module Name:	src
 >  Committed By:	mrg
 >  Date:		Thu Apr 20 10:19:48 UTC 2006
 >  
 >  Update of /cvsroot/src/gnu/dist/gcc4
 >  In directory ivanova.netbsd.org:/tmp/cvs-serv12371
 >  
 >  Log Message:
 >  initial import of GCC 4.1 branch, from 2006-04-20.  list of changes way
 >  too large to list, but see:
 >  	http://gcc.gnu.org/gcc-3.4/changes.html
 >  	http://gcc.gnu.org/gcc-4.0/changes.html
 >  	http://gcc.gnu.org/gcc-4.1/changes.html
 >  for the details.
 >  
 >  Status:
 >  
 >  Vendor Tag:	FSF
 >  Release Tags:	gcc-4-1-20060420

   Well, I finally found a bit of time to try out gcc 4.1.1---built a 3.99.20
 system from sources updated May 18, 2006 and _without_ the patch to crt0.c,
 which I have been using happily ever since the spring of 2003.  And it appears
 that the 8-byte stack-alignment problem has been fixed.  Here's a slightly
 modified version of the test code:

 #include <stdio.h>
 #define N 10000
 int main (int argc, char **argv) {
 double *x;
 int i, j;

 /*
 x=(double*)malloc((N+1)*sizeof(double));
  */
 double y[N+1];
 x = y;
 if(argc==2) x=(double*)((int)x+4);

 printf("0x%x\n", (int)x);
 printf("%d\n", (int)x%8);
 printf("%d\n", (int)x%16);
 for(i=0;i<N;i++) x[i]=(double)i;
 for(i=0;i<N;i++) for (j=0;j<N;j++) x[i]=0.5*(x[j]+x[i]);

 printf("%f\n", x[N-1]);
 exit(0);
 }

  If you compile it (e.g., gcc -O3) and link it to 4 different names:

    ln a.out l
    ln a.out ll
    ln a.out lll
    ln a.out llll

 you'll find that execution times are all essentially the same, and the
 address of the double-precision array x changes.  Curiously, it changes
 by 16 bytes, always remaining on an odd 8-byte boundary.  Also, if the
 array is allocated with malloc, instead of being declared, the alignment
 is (still) optimal.

   I haven't been able to test with fortran code, since the fortran compiler
 seems to have disappeared, to my considerable disappointment.  Once I figure
 out how to get it back, I'll do some more testing with fortran code, but at
 this point it looks like this alignment problem is solved.

   Thanks!

   BTW: Just for the record, here's the latest version of the patch I was
 using for crt0.o.  The only difference from the one previously in the PR is
 the conditional inclusion of the 4-byte extra offset of the stack pointer
 when GCC < 3 is used.

 --- crt0.c-dist	2001-12-31 04:04:20.000000000 -0700
 +++ crt0.c	2003-04-26 17:45:54.000000000 -0600
 @@ -44,19 +44,24 @@
  "	.align	4			\n"
  "	.globl	__start			\n"
  "	.globl	_start			\n"
  "_start:				\n"
  "__start:				\n"
 +"	movl	%esp,%eax		# save stack pointer		\n"
 +"	and	$0xfffffff8,%esp	# align to 8-byte boundary	\n"
 +#if defined(__GNUC__) && (__GNUC__ < 3)
 +"	subl	$4,%esp			# compensate for later offset	\n"
 +#endif
  "	pushl	%ebx			# ps_strings	\n"
  "	pushl	%ecx			# obj		\n"
  "	pushl	%edx			# cleanup	\n"
 -"	movl	12(%esp),%eax		\n"
 -"	leal	20(%esp,%eax,4),%ecx	\n"
 -"	leal	16(%esp),%edx		\n"
 -"	pushl	%ecx			\n"
 +"	movl	(%eax),%ebx		# argc		\n"
 +"	leal	4(%eax),%ecx		# *argv		\n"
 +"	leal	8(%eax,%ebx,4),%edx	# *envp		\n"
  "	pushl	%edx			\n"
 -"	pushl	%eax			\n"
 +"	pushl	%ecx			\n"
 +"	pushl	%ebx			\n"
  "	call	___start");

  void
  ___start(argc, argv, envp, cleanup, obj, ps_strings)
  	int argc;

 --Jim

Responsible-Changed-From-To: thorpej->port-i386-maintainer
Responsible-Changed-By: wiz@NetBSD.org
Responsible-Changed-When: Sun, 15 Apr 2012 21:47:41 +0000
Responsible-Changed-Why:
Back to role account, thorpej left


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.