NetBSD Problem Report #51654

From dholland@netbsd.org  Sat Nov 26 03:58:50 2016
Return-Path: <dholland@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 3AE7B7A2A4
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 26 Nov 2016 03:58:50 +0000 (UTC)
Message-Id: <20161126035849.BAC35855F6@mail.netbsd.org>
Date: Sat, 26 Nov 2016 03:58:49 +0000 (UTC)
From: dholland@NetBSD.org
Reply-To: dholland@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: wrong ps_strings breaks emacs20
X-Send-Pr-Version: 3.95

>Number:         51654
>Category:       lib
>Synopsis:       wrong ps_strings breaks emacs20
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    lib-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Nov 26 04:00:00 +0000 2016
>Closed-Date:    Wed Feb 28 16:54:10 +0000 2018
>Last-Modified:  Wed Feb 28 16:54:10 +0000 2018
>Originator:     David A. Holland
>Release:        NetBSD 7.99.42 (20161125)
>Organization:
>Environment:
System: NetBSD valkyrie 7.99.42 NetBSD 7.99.42 (VALKYRIE) #20: Fri Nov 25 18:33:19 EST 2016  dholland@valkyrie:/usr/src/sys/arch/amd64/compile/VALKYRIE amd64
Architecture: x86_64
Machine: amd64
>Description:

After updating emacs dumps core. After spending a long time barking up
the wrong PaX tree, it seems that the problem is that an invalid
pointer is being provided in __ps_strings and this causes _libc_init
to segv.

I stuck some debugging code into a copy of _libc_init and inserted it
with LD_PRELOAD, and found that __ps_strings is 0x7f7fffffffe0 while
the highest valid range in the address space in /proc/pid maps is

   00007f7fffff0000-00007f7ffffff000 rw-p 0000000000000000 00:00 0

..................... 0x7f7fffffffe0 is off the end.

It looks to me like this is because the kernel is providing the wrong
ps_strings address, not because ld.elf_so is mucking about and
garbaging the value, but it's hard to tell because even at its most
verbose ktrace doesn't record this information. (That is itself a bug
and should get fixed.)

I haven't the slightest idea why this happens only with emacs but I
imagine it's a consequence of the emacs dump/undump mechanism somehow.

>How-To-Repeat:

You can build a new emacs binary that exhibits the same behavior as my
old one by building editors/emacs20-20.7nb21 (which I just committed,
with fixes for some other problems) on current amd64. Or at least, I
can. YMMV, but hopefully it's not just me.

>Fix:

Not being able to run my editor is making it rather difficult to do
stuff, so I'd appreciate help...

>Release-Note:

>Audit-Trail:
From: Joerg Sonnenberger <joerg@bec.de>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org
Subject: Re: kern/51654: wrong ps_strings breaks emacs20
Date: Sat, 26 Nov 2016 06:49:24 +0100

 On Sat, Nov 26, 2016 at 04:00:00AM +0000, dholland@NetBSD.org wrote:
 > After updating emacs dumps core. After spending a long time barking up
 > the wrong PaX tree, it seems that the problem is that an invalid
 > pointer is being provided in __ps_strings and this causes _libc_init
 > to segv.

 The address space layout changed. ps_strings is placed at the very top,
 without PAX it is at a fixed address.

 > I haven't the slightest idea why this happens only with emacs but I
 > imagine it's a consequence of the emacs dump/undump mechanism somehow.

 Combine this with the changed MAXUSER address and now certain cached
 locations no longer match.

 Joerg

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/51654: wrong ps_strings breaks emacs20
Date: Sat, 26 Nov 2016 06:14:39 +0000

 On Sat, Nov 26, 2016 at 05:50:01AM +0000, Joerg Sonnenberger wrote:
  >  On Sat, Nov 26, 2016 at 04:00:00AM +0000, dholland@NetBSD.org wrote:
  >  > After updating emacs dumps core. After spending a long time barking up
  >  > the wrong PaX tree, it seems that the problem is that an invalid
  >  > pointer is being provided in __ps_strings and this causes _libc_init
  >  > to segv.
  >  
  >  The address space layout changed. ps_strings is placed at the very top,
  >  without PAX it is at a fixed address.

 Turning off aslr with paxctl has no effect.

  >  > I haven't the slightest idea why this happens only with emacs but I
  >  > imagine it's a consequence of the emacs dump/undump mechanism somehow.
  >  
  >  Combine this with the changed MAXUSER address and now certain cached
  >  locations no longer match.

 Cached in what? The value for __ps_strings is provided by the kernel
 to crt0 and propagates to _libc_init (and causes it to croak) before
 anything in emacs gets control.

 -- 
 David A. Holland
 dholland@netbsd.org

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/51654: wrong ps_strings breaks emacs20
Date: Sat, 26 Nov 2016 06:20:43 +0000

 On Sat, Nov 26, 2016 at 06:15:00AM +0000, David Holland wrote:
  >   >  The address space layout changed. ps_strings is placed at the very top,
  >   >  without PAX it is at a fixed address.
  >  
  >  Turning off aslr with paxctl has no effect.

 I take it back: making sure that emacs is dumped from a temacs with
 aslr disabled fixes the problem.

 But I don't follow why. Also, this doesn't explain why my preexisting
 emacs binary (which was built long before we started mucking with aslr
 this year) doesn't work and fails in the same way.

 Or did the default non-randomized layout also change, and by the same
 unclear mechanism thereby cause things to stop working?

  > >  > I haven't the slightest idea why this happens only with emacs but I
  > >  > imagine it's a consequence of the emacs dump/undump mechanism somehow.
  > >  
  > >  Combine this with the changed MAXUSER address and now certain cached
  > >  locations no longer match.
  >  
  >  Cached in what? The value for __ps_strings is provided by the kernel
  >  to crt0 and propagates to _libc_init (and causes it to croak) before
  >  anything in emacs gets control.

 Even if the prior value is saved when undumped, that should be
 overwritten by a new value from the current execution, no?

 -- 
 David A. Holland
 dholland@netbsd.org

From: "David A. Holland" <dholland@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/51654 CVS commit: pkgsrc/editors/emacs20
Date: Sat, 26 Nov 2016 06:38:06 +0000

 Module Name:	pkgsrc
 Committed By:	dholland
 Date:		Sat Nov 26 06:38:06 UTC 2016

 Modified Files:
 	pkgsrc/editors/emacs20: Makefile distinfo
 	pkgsrc/editors/emacs20/patches: patch-af

 Log Message:
 Use paxctl +a like in emacs21 to make the build work on -current. It seems
 that if you dump with a non-ASLR temacs you get a working emacs binary, and
 if you don't you don't, although I don't really see why -- perhaps it's
 something broken in crtstuff. Closes PR 51654.

 Note that pre-ASLR emacs20 binaries not dumped by an ASLR temacs also
 blow up in the same way, which doesn't make much sense either, but
 undoubtedly it's all connected.

 It's not particularly good that we apparently don't have backwards
 compatibility for old Emacs binaries because of this, but for the time
 being I'm more worried about it working at all.

 PKGREVISION++ again, to 22.


 To generate a diff of this commit:
 cvs rdiff -u -r1.55 -r1.56 pkgsrc/editors/emacs20/Makefile
 cvs rdiff -u -r1.33 -r1.34 pkgsrc/editors/emacs20/distinfo
 cvs rdiff -u -r1.2 -r1.3 pkgsrc/editors/emacs20/patches/patch-af

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->closed
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sat, 26 Nov 2016 06:47:31 +0000
State-Changed-Why:
fixed


State-Changed-From-To: closed->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sat, 26 Nov 2016 06:49:46 +0000
State-Changed-Why:
on reflection, let's keep this open until we understand the mechanism.

Also, it seems that the reason old binaries broke is that a few days ago
maxv unilaterally changed the stack top address:
http://mail-index.netbsd.org/source-changes-d/2016/11/24/msg008895.html

presumably via the same mechanism that somehow causes dumping to bake the
ps_strings value into crt0 or libc when it should be getting passed through
from the kernel.


From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/51654 (wrong ps_strings breaks emacs20)
Date: Sat, 26 Nov 2016 10:10:53 +0100

 The only way I see we could provide backward compat is to mark binaries
 like emacs with a special "fixed VA layout" note including a version
 number (similar to the compiler memory model notes used on sparc64)
 and then exec them with special compat VA layout.

 Sane binaries, of course, don't care about this details.

 Lots of compat code and rarely tested complexity for stupid binaries,
 so probably better just ignore it.

 Martin

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/51654 (wrong ps_strings breaks emacs20)
Date: Sun, 27 Nov 2016 00:26:17 +0000

 On Sat, Nov 26, 2016 at 09:15:01AM +0000, Martin Husemann wrote:
  >  The only way I see we could provide backward compat is to mark binaries
  >  like emacs with a special "fixed VA layout" note including a version
  >  number (similar to the compiler memory model notes used on sparc64)
  >  and then exec them with special compat VA layout.

 That or back out the stack change that broke them all.

 Anyway, we figured out what's going on:

  (1) _libc_init is called twice, but the comment on it (since updated)
      was wrong. It is called from crt0 explicitly and from global
      constructor handling, but the order of these is inverted for
      static binaries vs. dynamically linked binaries. In a dynamically
      linked binary the global constructors for libc are called from
      the dynamic linker before crt0 gets control at all. (This is why
      a broken emacs cores before __start.)

  (2) Therefore, in a dynamic executable, _libc_init is reached before
      __ps_strings is initialized by crt0. Ordinarily, then it will be
      null and the logic that sets __libc_dlauxinfo will be skipped.
      This is fine because __libc_dlauxinfo is only used in static
      executables; it's there to help substitute for not having the
      dynamic linker image.

  (3) However, when emacs dumps it saves the static data of crt0, which
      is part of the linked program, but not the static data of libc,
      which isn't.

  (4) Therefore when starting the dumped emacs, libc is not initialized
      yet, but the value from the previous startup is still sitting in
      __ps_strings. Then _libc_init sees that it's not null and tries
      to set __libc_dlauxinfo from it. If the stack location is not the
      same as it was for the previous execution, the old __ps_strings
      can point off the end of the stack and this results in SIGSEGV.

 Thanks to Joerg for providing necessary explanations.

 Disabling ASLR for both the original temacs binary and the dumped
 emacs binary works around the problem (this fix has been put in place
 for emacs20/21 and I think later ones do it on their own) but it isn't
 particularly desirable. (While emacs will never run as a pie, it
 should still be possible to randomize the stack and library mappings.
 temacs itself runs fine with ASLR enabled.)

 There are several possible real fixes, but they all have problems:

  (a) Have emacs explicitly zero __ps_strings before dumping. This
      will work, but exposes the dirty laundry and makes it that much
      harder to tidy the internals up later. However, it works
      transparently for all or nearly all NetBSD versions going a long
      way back.

  (b) Give emacs some otherwise-private function to call before
      dumping, something like __crt_reset(). On the plus side, this
      doesn't expose the details, but on the minus side it would
      require configure tests and other fussing to use properly.

  (c) Add another variable to crt0 for libc to use in a way that allows
      it to tell that the __ps_strings value is bogus. E.g. add "int
      __initcount" to crt0 and have _libc_init increment it; then if
      on entry to _libc_init it's > 0 we know we're running after
      dumping and shouldn't trust __ps_strings. This has the advantage
      of not being exposed, but it's messy and adds more crud to crt0
      which we don't want.

  (d) Misuse #ifdef __PIC__ or similar to detect at compile time in
      _libc_init whether the current execution is statically linked.
      (This is abusive, and also, a binary statically linked against
      libc_pic.a wouldn't work properly.)

  (e) Come up with some other way to distinguish being dynamically or
      statically linked in _libc_init, either at compile time or
      runtime. The problem is: what/how? I don't have any bright ideas,
      but if someone else does, speak up: the advantage of this is that
      (unlike the other alternatives) it can fix the crash for already-
      existing Emacs binaries.

  (f) Kick __ps_strings (and also __environ and __progname) out of
      crt0.o entirely, and instead provide private entry points in libc
      for crt0 to assign them. This has a number of tidiness advantages
      going forward, but Joerg is worried that it might generate compat
      issues.

 -- 
 David A. Holland
 dholland@netbsd.org

From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/51654 (wrong ps_strings breaks emacs20)
Date: Sun, 27 Nov 2016 08:40:49 +0700

     Date:        Sun, 27 Nov 2016 00:30:00 +0000 (UTC)
     From:        David Holland <dholland-bugs@netbsd.org>
     Message-ID:  <20161127003000.E514B7A33D@mollari.NetBSD.org>

   |  There are several possible real fixes, but they all have problems:

  (g) Get rid of the undump procedure completely, and just use temacs
      (if you really have to use emacs of any form) - undump is an
      abomination and always has been.

 kre

From: "Joerg Sonnenberger" <joerg@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/51654 CVS commit: src/lib/libc/misc
Date: Sat, 17 Jun 2017 15:26:44 +0000

 Module Name:	src
 Committed By:	joerg
 Date:		Sat Jun 17 15:26:44 UTC 2017

 Modified Files:
 	src/lib/libc/misc: initfini.c

 Log Message:
 PR 51654: Don't initialize _dlauxinfo for static binaries.

 Emacs likes save all memory of the main binary and the first run of
 _libc_init via .init will get the wrong (old) value of __ps_strings.
 By avoiding the initialization of _dlauxinfo for shared applications,
 it will be touched only by the _libc_init call from crt0.o itself,
 at which point __ps_strings is correct.


 To generate a diff of this commit:
 cvs rdiff -u -r1.13 -r1.14 src/lib/libc/misc/initfini.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

Responsible-Changed-From-To: kern-bug-people->lib-bug-people
Responsible-Changed-By: dholland@NetBSD.org
Responsible-Changed-When: Sun, 18 Jun 2017 02:45:54 +0000
Responsible-Changed-Why:
misfiled; was originally thought to be the kernel providing wrong
__ps_strings but actually it was a crtstuff issue.


State-Changed-From-To: open->needs-pullups
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sun, 18 Jun 2017 02:45:54 +0000
State-Changed-Why:
fixed for real, but the fix needs to get into -8


From: "Soren Jacobsen" <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/51654 CVS commit: [netbsd-8] src/lib/libc/misc
Date: Wed, 21 Jun 2017 17:46:03 +0000

 Module Name:	src
 Committed By:	snj
 Date:		Wed Jun 21 17:46:02 UTC 2017

 Modified Files:
 	src/lib/libc/misc [netbsd-8]: initfini.c

 Log Message:
 Pull up following revision(s) (requested by joerg in ticket #44):
 	lib/libc/misc/initfini.c: revision 1.14
 PR 51654: Don't initialize _dlauxinfo for static binaries.
 Emacs likes save all memory of the main binary and the first run of
 _libc_init via .init will get the wrong (old) value of __ps_strings.
 By avoiding the initialization of _dlauxinfo for shared applications,
 it will be touched only by the _libc_init call from crt0.o itself,
 at which point __ps_strings is correct.


 To generate a diff of this commit:
 cvs rdiff -u -r1.13 -r1.13.6.1 src/lib/libc/misc/initfini.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: needs-pullups->closed
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Wed, 28 Feb 2018 16:54:10 +0000
State-Changed-Why:
pulled up back in june


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.43 2018/01/16 07:36:43 maya Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2017 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.