NetBSD Problem Report #55719

From www@netbsd.org  Sun Oct 11 23:26:45 2020
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 304B31A921F
	for <gnats-bugs@gnats.NetBSD.org>; Sun, 11 Oct 2020 23:26:45 +0000 (UTC)
Message-Id: <20201011232643.E7E971A9241@mollari.NetBSD.org>
Date: Sun, 11 Oct 2020 23:26:43 +0000 (UTC)
From: nikhil.benesch@gmail.com
Reply-To: nikhil.benesch@gmail.com
To: gnats-bugs@NetBSD.org
Subject: Unwind tables for signal trampoline on amd64 are incorrect
X-Send-Pr-Version: www-1.0

>Number:         55719
>Category:       lib
>Synopsis:       Unwind tables for signal trampoline on amd64 are incorrect
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    kamil
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Oct 11 23:30:00 +0000 2020
>Closed-Date:    Mon Oct 12 17:58:00 +0000 2020
>Last-Modified:  Mon Nov 02 04:55:01 +0000 2020
>Originator:     Nikhil Benesch
>Release:        current
>Organization:
>Environment:
NetBSD  9.99.73 NetBSD 9.99.73 (GENERIC) #0: Sat Oct 10 07:28:04 UTC 2020  mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
>Description:
The current unwind tables for the amd64 version of __sigtramp_siginfo_2 look incorrect to me.

Per my reading of the DWARF specification and the GCC source code, the unwind tables for a signal trampoline need to specify the CFA as the value of RSP *in the saved context*, rather than the standard location of RSP + 8. They also need to specify where to find the values of each register in the saved context.

The CFI directives added to the amd64 in http://cvsweb.netbsd.org/bsdweb.cgi/src/lib/libc/arch/x86_64/sys/__sigtramp2.S?rev=1.7&content-type=text/x-cvsweb-markup&only_with_tag=MAIN instead seem to specify a standard function frame, rather than accounting for any of the special behavior of the signal trampoline. I believe the reason that GDB is able to backtrace through signal handlers on NetBSD is due to custom code in GDB (see "nbsd_pc_in_sigtramp" here: https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/nbsd-tdep.c;h=2db994af21811f82183cb9d2f342a8f3806deb51;hb=HEAD), not due to the unwind tables as presently generated.

For context, I am in the process of porting gccgo to NetBSD. The Go language has the unusual property that it promises that segfaults are recoverable panics (which in gccgo work using the same infrastructure as C++ exceptions). This requires that the unwinder be able to unwind through signal handlers. The WIP port can be seen here: https://go-review.googlesource.com/c/gofrontend/+/261137 . The resulting compiler generally works, but fails several test cases in the test suite that test whether segfaults are properly caught.

The patch I've included below fixes these tests in the test suite. It should also make unwinding "just work" when using a DWARF-based unwinder that are not specially-aware of NetBSD's signal trampolines.

In theory, an alternative would be to hardcode a NetBSD sigtramp sniffer in libgcc, like exists for FreeBSD (https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libgcc/config/i386/freebsd-unwind.h;h=0ea4e522712a0fb9ad0eb4681bc1cff9697c6505;hb=HEAD), but this approach seems a bit like madness to me. (Generating unwind tables by hand is also a bit crazy, but at least it only needs to be done *once*, rather than for every unwinder.)
>How-To-Repeat:
Attempt to unwind through a signal handler on NetBSD using an unwinder that is not NetBSD-aware (like the unwinder that ships in libgcc).
>Fix:
The following patch to libc fixes the issue on amd64 for me:

RCS file: /cvsroot/src/lib/libc/arch/x86_64/sys/__sigtramp2.S,v
retrieving revision 1.7
diff -u -r1.7 __sigtramp2.S
--- arch/x86_64/sys/__sigtramp2.S       2 Dec 2019 01:38:54 -0000       1.7
+++ arch/x86_64/sys/__sigtramp2.S       11 Oct 2020 23:10:32 -0000
@@ -36,19 +36,103 @@
  */

 #include "SYS.h"
+#include "assym.h"

 /*
  * The x86-64 signal trampoline is invoked only to return from
  * the signal; the kernel calls the signal handler directly.
  */
+
+/*
+ * Unwinders find the FDE for the caller by looking up (return PC - 1),
+ * assuming that the return PC points to the instruction after a call
+ * instruction and so (return PC - 1) points into the middle of the call
+ * instruction. Since the trampoline doesn't actually call the signal
+ * handler, we need to manually pad its FDE with a nop to account for
+ * the subtraction.
+ */
+.Ltramp_start:
+       nop
 NENTRY(__sigtramp_siginfo_2)
-       .cfi_startproc
-       .cfi_def_cfa rsp, 8
        movq    %r15,%rdi
        movq    $SYS_setcontext, %rax
        syscall
        movq    $-1,%rdi /* if we return here, something is wrong */
        movq    $SYS_exit, %rax
        syscall
-       .cfi_endproc
+.Ltramp_end:
 END(__sigtramp_siginfo_2)
+
+/*
+ * Manually build an unwind table for the trampoline.
+ * The unwind table consists of a CIE and an FDE within that CIE.
+ * The .cfi_* directives don't work here because they don't support the
+ * complex expressions required to restore registers relative to the
+ * address in R15 rather than relative to the CFA.
+ */
+
+/* CIE */
+       .section .eh_frame, "a", @progbits
+.Lcie:
+       .int .Lcie_end - .Lcie_start /* CIE length */
+.Lcie_start:
+       .int      0x00  /* CIE ID */
+       .byte     0x01  /* CIE version */
+       .string   "zRS" /* Augmentation string */
+       .uleb128  0x01  /* Code alignment factor */
+       .sleb128  0x01  /* Data alignment factor */
+       .byte     0x10  /* Return address column */
+       .uleb128  0x01  /* Augmentation length */
+       .byte     0x1b  /* Augmentation data: FDE encoding (DW_EH_PE_pcrel | DW_EH_PE_sdata4) */
+       .align    0x08
+.Lcie_end:
+
+/* FDE */
+       .int .Lfde_end - .Lfde_start            /* FDE length */
+.Lfde_start:
+       .int .Lfde_start - .Lcie                /* Relative CIE offset */
+       .int .Ltramp_start - .                  /* PC-relative start address */
+       .int .Ltramp_end - .Ltramp_start        /* PC-relative end address */
+       .uleb128  0                             /* Augmentation length */
+
+       /* Frame table instructions. */
+
+       /* Set the CFA to the value of RSP in the saved context. */
+       .byte     0x0f                          /* DW_CFA_def_cfa_expression */
+       .uleb128  1f - 0f                       /* Length of expression */
+0:     .byte     0x7f                          /* DW_OP_breg15 */
+       .sleb128  _OFFSETOF_UC_GREGS_RSP        /* Offset of RSP */
+       .byte     0x06                          /* DW_OP_deref */
+1:
+
+       /* Restore each register from its value in the saved context. */
+
+#define RESTORE_REG(regno, reg) \
+       .byte     0x10;                         /* DW_CFA_expression */         \
+       .uleb128  regno;                        /* Register number */           \
+       .uleb128  1f - 0f;                      /* Length of expression */      \
+0:     .byte     0x7f;                         /* DW_OP_breg15 */              \
+       .sleb128  _OFFSETOF_UC_GREGS_##reg;     /* Offset of <reg> */           \
+1:
+
+       RESTORE_REG(0, RAX)
+       RESTORE_REG(1, RDX)
+       RESTORE_REG(2, RCX)
+       RESTORE_REG(3, RBX)
+       RESTORE_REG(4, RSI)
+       RESTORE_REG(5, RDI)
+       RESTORE_REG(6, RBP)
+       /* The unwinder will use the CFA to restore RSP. */
+       RESTORE_REG(8, R8)
+       RESTORE_REG(9, R9)
+       RESTORE_REG(10, R10)
+       RESTORE_REG(11, R11)
+       RESTORE_REG(12, R12)
+       RESTORE_REG(13, R13)
+       RESTORE_REG(14, R14)
+       RESTORE_REG(15, R15)
+       RESTORE_REG(16, RIP) 
+
+       .align    0x08
+.Lfde_end:
+
Index: arch/x86_64/genassym.cf
===================================================================
RCS file: arch/x86_64/genassym.cf
diff -N arch/x86_64/genassym.cf
--- /dev/null   1 Jan 1970 00:00:00 -0000
+++ arch/x86_64/genassym.cf     11 Oct 2020 23:10:32 -0000
@@ -0,0 +1,19 @@
+include <ucontext.h>
+
+define _OFFSETOF_UC_GREGS_RAX offsetof(ucontext_t, uc_mcontext.__gregs[_REG_RAX])
+define _OFFSETOF_UC_GREGS_RDX offsetof(ucontext_t, uc_mcontext.__gregs[_REG_RDX])
+define _OFFSETOF_UC_GREGS_RCX offsetof(ucontext_t, uc_mcontext.__gregs[_REG_RCX])
+define _OFFSETOF_UC_GREGS_RBX offsetof(ucontext_t, uc_mcontext.__gregs[_REG_RBX])
+define _OFFSETOF_UC_GREGS_RSI offsetof(ucontext_t, uc_mcontext.__gregs[_REG_RSI])
+define _OFFSETOF_UC_GREGS_RDI offsetof(ucontext_t, uc_mcontext.__gregs[_REG_RDI])
+define _OFFSETOF_UC_GREGS_RBP offsetof(ucontext_t, uc_mcontext.__gregs[_REG_RBP])
+define _OFFSETOF_UC_GREGS_RSP offsetof(ucontext_t, uc_mcontext.__gregs[_REG_RSP])
+define _OFFSETOF_UC_GREGS_R8  offsetof(ucontext_t, uc_mcontext.__gregs[_REG_R8])
+define _OFFSETOF_UC_GREGS_R9  offsetof(ucontext_t, uc_mcontext.__gregs[_REG_R9])
+define _OFFSETOF_UC_GREGS_R10 offsetof(ucontext_t, uc_mcontext.__gregs[_REG_R10])
+define _OFFSETOF_UC_GREGS_R11 offsetof(ucontext_t, uc_mcontext.__gregs[_REG_R11])
+define _OFFSETOF_UC_GREGS_R12 offsetof(ucontext_t, uc_mcontext.__gregs[_REG_R12])
+define _OFFSETOF_UC_GREGS_R13 offsetof(ucontext_t, uc_mcontext.__gregs[_REG_R13])
+define _OFFSETOF_UC_GREGS_R14 offsetof(ucontext_t, uc_mcontext.__gregs[_REG_R14])
+define _OFFSETOF_UC_GREGS_R15 offsetof(ucontext_t, uc_mcontext.__gregs[_REG_R15])
+define _OFFSETOF_UC_GREGS_RIP offsetof(ucontext_t, uc_mcontext.__gregs[_REG_RIP])

>Release-Note:

>Audit-Trail:
From: Nikhil Benesch <nikhil.benesch@gmail.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: lib/55719
Date: Sun, 11 Oct 2020 19:34:22 -0400

 I should mention that the NetBSD blog post entitled "LLDB work
 completed" has a section about incomplete support for backtracing
 through a signal trampoline. I believe the patch enclosed in the
 initial report would complete the task as described there.

 [0]: https://blog.netbsd.org/tnf/entry/lldb_work_concluded

Responsible-Changed-From-To: lib-bug-people->kamil
Responsible-Changed-By: kamil@NetBSD.org
Responsible-Changed-When: Mon, 12 Oct 2020 01:54:22 +0200
Responsible-Changed-Why:
Take.


From: Kamil Rytarowski <kamil@netbsd.org>
To: gnats-bugs@netbsd.org, kamil@netbsd.org, lib-bug-people@netbsd.org,
 netbsd-bugs@netbsd.org, gnats-admin@netbsd.org, nikhil.benesch@gmail.com,
 Andrew Cagney <andrew.cagney@gmail.com>, =?UTF-8?B?TWljaGHFgiBHw7Nybnk=?=
 <mgorny@netbsd.org>
Cc: 
Subject: Re: lib/55719 (Unwind tables for signal trampoline on amd64 are
 incorrect)
Date: Mon, 12 Oct 2020 02:42:45 +0200

 This is an OpenPGP/MIME signed message (RFC 4880 and 3156)
 --MoMSqsNyukP3IwqCwBW4DBquHGCFsf2Nf
 Content-Type: multipart/mixed; boundary="RbaqZ3yQFUR2fI5rpTEZ3J26qa13dZV6h";
  protected-headers="v1"
 From: Kamil Rytarowski <kamil@netbsd.org>
 To: gnats-bugs@netbsd.org, kamil@netbsd.org, lib-bug-people@netbsd.org,
  netbsd-bugs@netbsd.org, gnats-admin@netbsd.org, nikhil.benesch@gmail.com,
  Andrew Cagney <andrew.cagney@gmail.com>, =?UTF-8?B?TWljaGHFgiBHw7Nybnk=?=
  <mgorny@netbsd.org>
 Message-ID: <11d8b6d4-b46d-4155-558f-a445b3ac5f31@netbsd.org>
 Subject: Re: lib/55719 (Unwind tables for signal trampoline on amd64 are
  incorrect)
 References: <pr-lib-55719@gnats.netbsd.org>
  <20201011232643.E7E971A9241@mollari.NetBSD.org>
  <20201011235422.E6EFA1A9241@mollari.NetBSD.org>
 In-Reply-To: <20201011235422.E6EFA1A9241@mollari.NetBSD.org>

 --RbaqZ3yQFUR2fI5rpTEZ3J26qa13dZV6h
 Content-Type: text/plain; charset=utf-8
 Content-Language: en-US
 Content-Transfer-Encoding: quoted-printable

 On 12.10.2020 01:54, kamil@NetBSD.org wrote:
 > Synopsis: Unwind tables for signal trampoline on amd64 are incorrect
 >=20
 > Responsible-Changed-From-To: lib-bug-people->kamil
 > Responsible-Changed-By: kamil@NetBSD.org
 > Responsible-Changed-When: Mon, 12 Oct 2020 01:54:22 +0200
 > Responsible-Changed-Why:
 > Take.
 >=20
 >=20
 >=20

 We ended up with something like this, but ran out of time:

 http://netbsd.org/~kamil/patch-00283-amd64-__sigtramp2-cfi-unwinding.txt

 Could you please check this code?

 This approach was proposed/inspired by Andrew Cagney and written by mysel=
 f.

 We could combine this with the genassym.cf approach.


 --RbaqZ3yQFUR2fI5rpTEZ3J26qa13dZV6h--

 --MoMSqsNyukP3IwqCwBW4DBquHGCFsf2Nf
 Content-Type: application/pgp-signature; name="signature.asc"
 Content-Description: OpenPGP digital signature
 Content-Disposition: attachment; filename="signature.asc"

 -----BEGIN PGP SIGNATURE-----

 iQIzBAEBCAAdFiEELaxVpweEzw+lMDwuS7MI6bAudmwFAl+DppYACgkQS7MI6bAu
 dmxMAhAAqQT3yZMBFfVsYHW8OtnHLsX2j1PDFT8hrwxZSuQBD1X1VQgre1e8rfWO
 dyBKiOT3Qo7Qf0rUMF0YZA2dFs+YwwqLwwF62y1xMSR0XOcyWvXugErSxZhbVAlQ
 utq3sREA0NGqI9awR6/69ujG6zMZppUc1EJl4ZSv4AuHZrPF1YAadoYsfr9RHedz
 p7vU6jLjL2pJlkGXFeaZj3RRh992KhHQjumN7OBJq6MxPaVhrbXBJ0SILpa0dUEg
 gWQAxvi8UzvxifZ/72hwXxG57ab63/KWsmjWz7vhFRv4jGUQLjs6tbk0okJBYiSa
 Eo83g651ZQf9F3/cxurEEbE7pwN2wYjoyQEOFqNSUjZYUVE9MXT1tZXZrcxGw+ut
 30Vlxw0+iGfz7EYwXP4pAdSBs1TOoHJ70rq2fyNo+p0SHwAyPaSm+v4U4gLZRoV+
 6kBLd12vadCBQ7gEyLR1WTkKslfP3lv/eMmW8UzmDXYr2mGRXbH6P+zphIvaGOVU
 ThaXanNSmW78ZAJE+wSFSJTjWdcTJxHEB6RZMTkg0e5OTU9U9ji/hfz3P+kK98P/
 6oJFxOADAyo+O4eq36Zu2bKj0al0mgnRWrX1HSYJcqKQs2Ttbo1CaPpOV1pLfxj1
 cdjRM75Fd0pW30tXPuyQgLBHZpv3kyFr8WoowdArz/Vh6y1ly94=
 =n+A/
 -----END PGP SIGNATURE-----

 --MoMSqsNyukP3IwqCwBW4DBquHGCFsf2Nf--

From: Nikhil Benesch <nikhil.benesch@gmail.com>
To: Kamil Rytarowski <kamil@netbsd.org>
Cc: gnats-bugs@netbsd.org, lib-bug-people@netbsd.org, netbsd-bugs@netbsd.org, 
	gnats-admin@netbsd.org, Andrew Cagney <andrew.cagney@gmail.com>, 
	=?UTF-8?B?TWljaGHFgiBHw7Nybnk=?= <mgorny@netbsd.org>
Subject: Re: lib/55719 (Unwind tables for signal trampoline on amd64 are incorrect)
Date: Sun, 11 Oct 2020 21:18:04 -0400

 --0000000000009eda3505b16f1257
 Content-Type: text/plain; charset="UTF-8"

 Sure thing! I can see if that works later tonight.

 One thing jumps out at me, though: it doesn't seem valid to redefine the
 CFA like that? As I understand it, the CFA on amd64 is defined to be the
 stack pointer in the caller's frame at the time of the call, and you're not
 allowed to define it otherwise. The reason to use multiple .cfi_def_cfa
 directives in one function would be to update the CFA *computation* to
 account for changes to the variables used in the computation (e.g. RSP),
 but not to actually change the CFA itself.

 That said, it's possible that changing the CFA as you propose works just
 fine in practice. Someone more knowledgeable than me would be better
 equipped to say for sure.

 I should note that I am far from an expert on this topic. Everything I know
 about unwind tables has been pieced together in the last two days while
 trying to fix this gccgo test.

 On Sun, Oct 11, 2020 at 8:46 PM Kamil Rytarowski <kamil@netbsd.org> wrote:

 > On 12.10.2020 01:54, kamil@NetBSD.org wrote:
 > > Synopsis: Unwind tables for signal trampoline on amd64 are incorrect
 > >
 > > Responsible-Changed-From-To: lib-bug-people->kamil
 > > Responsible-Changed-By: kamil@NetBSD.org
 > > Responsible-Changed-When: Mon, 12 Oct 2020 01:54:22 +0200
 > > Responsible-Changed-Why:
 > > Take.
 > >
 > >
 > >
 >
 > We ended up with something like this, but ran out of time:
 >
 > http://netbsd.org/~kamil/patch-00283-amd64-__sigtramp2-cfi-unwinding.txt
 >
 > Could you please check this code?
 >
 > This approach was proposed/inspired by Andrew Cagney and written by myself.
 >
 > We could combine this with the genassym.cf approach.
 >
 >

 --0000000000009eda3505b16f1257
 Content-Type: text/html; charset="UTF-8"
 Content-Transfer-Encoding: quoted-printable

 <div dir=3D"ltr"><div dir=3D"ltr">Sure thing! I can see if that works later=
  tonight.<div><br></div><div>One thing jumps out at me, though: it doesn&#3=
 9;t seem valid to redefine the CFA like that? As I understand it, the CFA o=
 n amd64 is defined to be the stack pointer in the caller&#39;s frame at the=
  time of the call, and you&#39;re not allowed to define it otherwise. The r=
 eason to use multiple .cfi_def_cfa directives in one function would be to u=
 pdate the CFA *computation* to account for changes to the variables used in=
  the computation (e.g. RSP), but not to actually change the CFA itself.</di=
 v><div><br></div><div>That said, it&#39;s possible that changing the CFA as=
  you propose=C2=A0works just fine in practice. Someone more knowledgeable t=
 han me would be better equipped to say for sure.</div><div><br></div><div>I=
  should note that I am far from an expert on this topic. Everything I know =
 about unwind tables has been pieced together in the last two days while try=
 ing to fix this gccgo test.</div></div></div><br><div class=3D"gmail_quote"=
 ><div dir=3D"ltr" class=3D"gmail_attr">On Sun, Oct 11, 2020 at 8:46 PM Kami=
 l Rytarowski &lt;<a href=3D"mailto:kamil@netbsd.org" target=3D"_blank">kami=
 l@netbsd.org</a>&gt; wrote:<br></div><blockquote class=3D"gmail_quote" styl=
 e=3D"margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid=
 ;border-left-color:rgb(204,204,204);padding-left:1ex">On 12.10.2020 01:54, =
 kamil@NetBSD.org wrote:<br>
 &gt; Synopsis: Unwind tables for signal trampoline on amd64 are incorrect<b=
 r>
 &gt; <br>
 &gt; Responsible-Changed-From-To: lib-bug-people-&gt;kamil<br>
 &gt; Responsible-Changed-By: kamil@NetBSD.org<br>
 &gt; Responsible-Changed-When: Mon, 12 Oct 2020 01:54:22 +0200<br>
 &gt; Responsible-Changed-Why:<br>
 &gt; Take.<br>
 &gt; <br>
 &gt; <br>
 &gt; <br>
 <br>
 We ended up with something like this, but ran out of time:<br>
 <br>
 <a href=3D"http://netbsd.org/~kamil/patch-00283-amd64-__sigtramp2-cfi-unwin=
 ding.txt" rel=3D"noreferrer" target=3D"_blank">http://netbsd.org/~kamil/pat=
 ch-00283-amd64-__sigtramp2-cfi-unwinding.txt</a><br>
 <br>
 Could you please check this code?<br>
 <br>
 This approach was proposed/inspired by Andrew Cagney and written by myself.=
 <br>
 <br>
 We could combine this with the <a href=3D"http://genassym.cf" rel=3D"norefe=
 rrer" target=3D"_blank">genassym.cf</a> approach.<br>
 <br>
 </blockquote></div>

 --0000000000009eda3505b16f1257--

From: Andrew Cagney <andrew.cagney@gmail.com>
To: Nikhil Benesch <nikhil.benesch@gmail.com>
Cc: Kamil Rytarowski <kamil@netbsd.org>, gnats-bugs@netbsd.org, lib-bug-people@netbsd.org, 
	netbsd-bugs@netbsd.org, gnats-admin@netbsd.org, 
	=?UTF-8?B?TWljaGHFgiBHw7Nybnk=?= <mgorny@netbsd.org>
Subject: Re: lib/55719 (Unwind tables for signal trampoline on amd64 are incorrect)
Date: Sun, 11 Oct 2020 22:35:35 -0400

 On Sun, 11 Oct 2020 at 21:18, Nikhil Benesch <nikhil.benesch@gmail.com> wro=
 te:
 >
 > Sure thing! I can see if that works later tonight.
 >
 > One thing jumps out at me, though: it doesn't seem valid to redefine the =
 CFA like that? As I understand it, the CFA on amd64 is defined to be the st=
 ack pointer in the caller's frame at the time of the call, and you're not a=
 llowed to define it otherwise. The reason to use multiple .cfi_def_cfa dire=
 ctives in one function would be to update the CFA *computation* to account =
 for changes to the variables used in the computation (e.g. RSP), but not to=
  actually change the CFA itself.

 Now you've gone and made me read the manual :-( :-)

 6.4 Call Frame Information
 =E2=80=A2 An area of memory that is allocated on a stack called a =E2=80=9C=
 call
 frame.=E2=80=9D The call frame is identified by an address on the stack. We
 refer to this address as the Canonical Frame Address or CFA.
 Typically, the CFA is defined to be the value of the stack pointer at
 the call site in the previous frame (which may be different from its
 value on entry to the current frame).

 So any code wanting the caller's stack pointer will need to parse the
 CFI and then unwind the stack-pointer (CFA is a convenient fiction
 invented by DWARF).

 The only real expectation is that the CFA is constant through out the
 lifetime of the frame (one of the things identifying  a single-stepped
 call/return instruction is the changed CFA).

 > That said, it's possible that changing the CFA as you propose works just =
 fine in practice. Someone more knowledgeable than me would be better equipp=
 ed to say for sure.
 >
 > I should note that I am far from an expert on this topic. Everything I kn=
 ow about unwind tables has been pieced together in the last two days while =
 trying to fix this gccgo test.
 >
 > On Sun, Oct 11, 2020 at 8:46 PM Kamil Rytarowski <kamil@netbsd.org> wrote=
 :
 >>
 >> On 12.10.2020 01:54, kamil@NetBSD.org wrote:
 >> > Synopsis: Unwind tables for signal trampoline on amd64 are incorrect
 >> >
 >> > Responsible-Changed-From-To: lib-bug-people->kamil
 >> > Responsible-Changed-By: kamil@NetBSD.org
 >> > Responsible-Changed-When: Mon, 12 Oct 2020 01:54:22 +0200
 >> > Responsible-Changed-Why:
 >> > Take.
 >> >
 >> >
 >> >
 >>
 >> We ended up with something like this, but ran out of time:
 >>
 >> http://netbsd.org/~kamil/patch-00283-amd64-__sigtramp2-cfi-unwinding.txt
 >>
 >> Could you please check this code?
 >>
 >> This approach was proposed/inspired by Andrew Cagney and written by myse=
 lf.
 >>
 >> We could combine this with the genassym.cf approach.
 >>

From: Nikhil Benesch <nikhil.benesch@gmail.com>
To: Andrew Cagney <andrew.cagney@gmail.com>
Cc: Kamil Rytarowski <kamil@netbsd.org>, gnats-bugs@netbsd.org, lib-bug-people@netbsd.org, 
	netbsd-bugs@netbsd.org, gnats-admin@netbsd.org, 
	=?UTF-8?B?TWljaGHFgiBHw7Nybnk=?= <mgorny@netbsd.org>
Subject: Re: lib/55719 (Unwind tables for signal trampoline on amd64 are incorrect)
Date: Sun, 11 Oct 2020 23:30:00 -0400

 On Sun, Oct 11, 2020 at 10:35 PM Andrew Cagney <andrew.cagney@gmail.com> wrote:
 > So any code wanting the caller's stack pointer will need to parse the
 > CFI and then unwind the stack-pointer (CFA is a convenient fiction
 > invented by DWARF).
 >
 > The only real expectation is that the CFA is constant through out the
 > lifetime of the frame (one of the things identifying  a single-stepped
 > call/return instruction is the changed CFA).

 Hmm, ok, so perhaps just picking a different CFA for this frame is
 acceptable then. My assertion that the CFA must be equal to RSP on
 amd64 is based only on this blog post [0], which is hardly
 authoritative.

 Kamil's patch as written sets the CFA to both R15 and later (RSP + 8),
 but that is easy enough to fix to drop the second .cfi_def_cfa
 directive. There is still the issue of generating the FDE one PC
 earlier than the beginning of the sigtramp symbol, but maybe I can
 work something out.

 Also apologies that my last mail was sent in HTML, not plain text.
 That should be fixed now.

 [0]: https://www.corsix.org/content/cfa-rsp-x86-64

From: Nikhil Benesch <nikhil.benesch@gmail.com>
To: Andrew Cagney <andrew.cagney@gmail.com>
Cc: Kamil Rytarowski <kamil@netbsd.org>, gnats-bugs@netbsd.org, lib-bug-people@netbsd.org, 
	netbsd-bugs@netbsd.org, gnats-admin@netbsd.org, 
	=?UTF-8?B?TWljaGHFgiBHw7Nybnk=?= <mgorny@netbsd.org>
Subject: Re: lib/55719 (Unwind tables for signal trampoline on amd64 are incorrect)
Date: Mon, 12 Oct 2020 00:24:38 -0400

 Ok, I've combined the two approaches like so:

 https://gist.githubusercontent.com/benesch/7052d40ed9de9a3a5d3886a060ec3a3d/raw/93abaac2ae83e6b4e63951bd0c44f1100e5e7f0e/netbsd-sigtramp-unwind.patch

 The new patch continues to work for gccgo and does so with far less
 complexity. I'm quite hopeful that this is now in a mergeable state.

 On Sun, Oct 11, 2020 at 11:30 PM Nikhil Benesch
 <nikhil.benesch@gmail.com> wrote:
 >
 > On Sun, Oct 11, 2020 at 10:35 PM Andrew Cagney <andrew.cagney@gmail.com> wrote:
 > > So any code wanting the caller's stack pointer will need to parse the
 > > CFI and then unwind the stack-pointer (CFA is a convenient fiction
 > > invented by DWARF).
 > >
 > > The only real expectation is that the CFA is constant through out the
 > > lifetime of the frame (one of the things identifying  a single-stepped
 > > call/return instruction is the changed CFA).
 >
 > Hmm, ok, so perhaps just picking a different CFA for this frame is
 > acceptable then. My assertion that the CFA must be equal to RSP on
 > amd64 is based only on this blog post [0], which is hardly
 > authoritative.
 >
 > Kamil's patch as written sets the CFA to both R15 and later (RSP + 8),
 > but that is easy enough to fix to drop the second .cfi_def_cfa
 > directive. There is still the issue of generating the FDE one PC
 > earlier than the beginning of the sigtramp symbol, but maybe I can
 > work something out.
 >
 > Also apologies that my last mail was sent in HTML, not plain text.
 > That should be fixed now.
 >
 > [0]: https://www.corsix.org/content/cfa-rsp-x86-64

From: Andrew Cagney <andrew.cagney@gmail.com>
To: Nikhil Benesch <nikhil.benesch@gmail.com>
Cc: Kamil Rytarowski <kamil@netbsd.org>, gnats-bugs@netbsd.org, lib-bug-people@netbsd.org, 
	netbsd-bugs@netbsd.org, gnats-admin@netbsd.org, 
	=?UTF-8?B?TWljaGHFgiBHw7Nybnk=?= <mgorny@netbsd.org>
Subject: Re: lib/55719 (Unwind tables for signal trampoline on amd64 are incorrect)
Date: Mon, 12 Oct 2020 01:02:29 -0400

 On Sun, 11 Oct 2020 at 23:30, Nikhil Benesch <nikhil.benesch@gmail.com> wrote:
 >
 > On Sun, Oct 11, 2020 at 10:35 PM Andrew Cagney <andrew.cagney@gmail.com> wrote:
 > > So any code wanting the caller's stack pointer will need to parse the
 > > CFI and then unwind the stack-pointer (CFA is a convenient fiction
 > > invented by DWARF).
 > >
 > > The only real expectation is that the CFA is constant through out the
 > > lifetime of the frame (one of the things identifying  a single-stepped
 > > call/return instruction is the changed CFA).
 >
 > Hmm, ok, so perhaps just picking a different CFA for this frame is
 > acceptable then. My assertion that the CFA must be equal to RSP on
 > amd64 is based only on this blog post [0], which is hardly
 > authoritative.

 The other thing to keep in mind is that this is for signal trampolines
 which are outside the scope of DWARF.

 > Kamil's patch as written sets the CFA to both R15 and later (RSP + 8),
 > but that is easy enough to fix to drop the second .cfi_def_cfa
 > directive.

 Code uses the tuple {CFA,FUNCTION_START} to identify a frame.  So if
 the CFA changes, code will think the frame changed.

 Testing is "easy".  Code up a SIGILL, say, and then breakpoint on the
 signal handler.  Then instruction-step back to the trapping
 instruction.  Every single instruction needs to breaktrace correctly.
 GDB should still have tests (I wrote) that do this.

 > There is still the issue of generating the FDE one PC
 > earlier than the beginning of the sigtramp symbol, but maybe I can
 > work something out.

 You mean the non-normative text in: Call Frame Calling Address where
 it discusses the hack of subtracting 1 from the return address?

 + * The unwind entry includes the one byte prior to the trampoline
 + * because the unwinder will look up (return PC - 1) while unwinding.
 + * Normally (return PC - 1) computes an address inside the call
 + * instruction that created the child frame, but here there is no call
 + * instruction so we have to manually add padding.

 yep.

 Is the PC-1 hack also used when looking up the sigtramp's [elf] symbol
 (I don't remember)?  Just look at the backtrace out of the signal
 handler.

 >
 > Also apologies that my last mail was sent in HTML, not plain text.
 > That should be fixed now.

 It's 2020.  Personally I don't care.

 > [0]: https://www.corsix.org/content/cfa-rsp-x86-64

From: Nikhil Benesch <nikhil.benesch@gmail.com>
To: gnats-bugs@netbsd.org
Cc: kamil@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: lib/55719 (Unwind tables for signal trampoline on amd64 are incorrect)
Date: Mon, 12 Oct 2020 01:57:32 -0400

 On Mon, Oct 12, 2020 at 1:05 AM Andrew Cagney <andrew.cagney@gmail.com> wrote:
 >  Code uses the tuple {CFA,FUNCTION_START} to identify a frame.  So if
 >  the CFA changes, code will think the frame changed.
 >
 >  Testing is "easy".  Code up a SIGILL, say, and then breakpoint on the
 >  signal handler.  Then instruction-step back to the trapping
 >  instruction.  Every single instruction needs to breaktrace correctly.
 >  GDB should still have tests (I wrote) that do this.

 Via reverse-debugging, yeah? I couldn't get this to work in the GDB
 that ships with NetBSD.

 (gdb) target record-full
 Process record: the current architecture doesn't support record function.

 Possibly due to the fact that I'm using a VM? Not sure.

 >  You mean the non-normative text in: Call Frame Calling Address where
 >  it discusses the hack of subtracting 1 from the return address?
 >
 >  + * The unwind entry includes the one byte prior to the trampoline
 >  + * because the unwinder will look up (return PC - 1) while unwinding.
 >  + * Normally (return PC - 1) computes an address inside the call
 >  + * instruction that created the child frame, but here there is no call
 >  + * instruction so we have to manually add padding.
 >
 >  yep.
 >
 >  Is the PC-1 hack also used when looking up the sigtramp's [elf] symbol
 >  (I don't remember)?  Just look at the backtrace out of the signal
 >  handler.

 Looks the same to me in GDB with and without this patch. The signal
 handler frame is labeled as "<signal handler called>".

 #0  sig_handler (no=11) at main.c:7
 #1  <signal handler called>
 #2  0x0000000000400bd8 in f1 (x=42, y=0x0) at main.c:12
 #3  0x0000000000400c35 in main () at main.c:23

 >  > Also apologies that my last mail was sent in HTML, not plain text.
 >  > That should be fixed now.
 >
 >  It's 2020.  Personally I don't care.

 GNATS seems to be stuck in 2004 and renders multipart emails terribly.

From: Kamil Rytarowski <kamil@netbsd.org>
To: gnats-bugs@netbsd.org, kamil@netbsd.org, gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org, nikhil.benesch@gmail.com,
 Andrew Cagney <andrew.cagney@gmail.com>
Cc: 
Subject: Re: lib/55719 (Unwind tables for signal trampoline on amd64 are
 incorrect)
Date: Mon, 12 Oct 2020 16:01:01 +0200

 On 12.10.2020 08:00, Nikhil Benesch wrote:
 > The following reply was made to PR lib/55719; it has been noted by GNATS.
 > 
 > From: Nikhil Benesch <nikhil.benesch@gmail.com>
 > To: gnats-bugs@netbsd.org
 > Cc: kamil@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
 > Subject: Re: lib/55719 (Unwind tables for signal trampoline on amd64 are incorrect)
 > Date: Mon, 12 Oct 2020 01:57:32 -0400
 > 
 >  On Mon, Oct 12, 2020 at 1:05 AM Andrew Cagney <andrew.cagney@gmail.com> wrote:
 >  >  Code uses the tuple {CFA,FUNCTION_START} to identify a frame.  So if
 >  >  the CFA changes, code will think the frame changed.
 >  >
 >  >  Testing is "easy".  Code up a SIGILL, say, and then breakpoint on the
 >  >  signal handler.  Then instruction-step back to the trapping
 >  >  instruction.  Every single instruction needs to breaktrace correctly.
 >  >  GDB should still have tests (I wrote) that do this.
 >  
 >  Via reverse-debugging, yeah? I couldn't get this to work in the GDB
 >  that ships with NetBSD.
 >  
 >  (gdb) target record-full
 >  Process record: the current architecture doesn't support record function.
 >  
 >  Possibly due to the fact that I'm using a VM? Not sure.
 >  
 >  >  You mean the non-normative text in: Call Frame Calling Address where
 >  >  it discusses the hack of subtracting 1 from the return address?
 >  >
 >  >  + * The unwind entry includes the one byte prior to the trampoline
 >  >  + * because the unwinder will look up (return PC - 1) while unwinding.
 >  >  + * Normally (return PC - 1) computes an address inside the call
 >  >  + * instruction that created the child frame, but here there is no call
 >  >  + * instruction so we have to manually add padding.
 >  >
 >  >  yep.
 >  >
 >  >  Is the PC-1 hack also used when looking up the sigtramp's [elf] symbol
 >  >  (I don't remember)?  Just look at the backtrace out of the signal
 >  >  handler.
 >  
 >  Looks the same to me in GDB with and without this patch. The signal
 >  handler frame is labeled as "<signal handler called>".
 >  
 >  #0  sig_handler (no=11) at main.c:7
 >  #1  <signal handler called>
 >  #2  0x0000000000400bd8 in f1 (x=42, y=0x0) at main.c:12
 >  #3  0x0000000000400c35 in main () at main.c:23
 >  
 >  >  > Also apologies that my last mail was sent in HTML, not plain text.
 >  >  > That should be fixed now.
 >  >
 >  >  It's 2020.  Personally I don't care.
 >  
 >  GNATS seems to be stuck in 2004 and renders multipart emails terribly.
 >  
 > 

 Nikhil, great work! This code works!

 http://netbsd.org/~kamil/backtrace/libc-x86_64-PR55719.txt

 Andrew, is this ready to commit?

 http://netbsd.org/~kamil/patch-00283-amd64-__sigtramp2-cfi-unwinding.2.txt

From: Kamil Rytarowski <kamil@netbsd.org>
To: Kamil Rytarowski <kamil@netbsd.org>, gnats-bugs@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, nikhil.benesch@gmail.com,
 Andrew Cagney <andrew.cagney@gmail.com>
Cc: 
Subject: Re: lib/55719 (Unwind tables for signal trampoline on amd64 are
 incorrect)
Date: Mon, 12 Oct 2020 16:24:33 +0200

 On 12.10.2020 16:01, Kamil Rytarowski wrote:
 > On 12.10.2020 08:00, Nikhil Benesch wrote:
 >> The following reply was made to PR lib/55719; it has been noted by GNATS.
 >>
 >> From: Nikhil Benesch <nikhil.benesch@gmail.com>
 >> To: gnats-bugs@netbsd.org
 >> Cc: kamil@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
 >> Subject: Re: lib/55719 (Unwind tables for signal trampoline on amd64 are incorrect)
 >> Date: Mon, 12 Oct 2020 01:57:32 -0400
 >>
 >>  On Mon, Oct 12, 2020 at 1:05 AM Andrew Cagney <andrew.cagney@gmail.com> wrote:
 >>  >  Code uses the tuple {CFA,FUNCTION_START} to identify a frame.  So if
 >>  >  the CFA changes, code will think the frame changed.
 >>  >
 >>  >  Testing is "easy".  Code up a SIGILL, say, and then breakpoint on the
 >>  >  signal handler.  Then instruction-step back to the trapping
 >>  >  instruction.  Every single instruction needs to breaktrace correctly.
 >>  >  GDB should still have tests (I wrote) that do this.
 >>  
 >>  Via reverse-debugging, yeah? I couldn't get this to work in the GDB
 >>  that ships with NetBSD.
 >>  
 >>  (gdb) target record-full
 >>  Process record: the current architecture doesn't support record function.
 >>  
 >>  Possibly due to the fact that I'm using a VM? Not sure.

 Unfortunately recording and reverse-debugging is still unsupported in
 GDB on NetBSD. I would need to research how much of work is needed to
 get support and maybe Andrew can shed some light on it too.

From: Andrew Cagney <andrew.cagney@gmail.com>
To: Kamil Rytarowski <kamil@netbsd.org>
Cc: gnats-bugs@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, 
	nikhil.benesch@gmail.com
Subject: Re: lib/55719 (Unwind tables for signal trampoline on amd64 are incorrect)
Date: Mon, 12 Oct 2020 10:55:37 -0400

 On Mon, 12 Oct 2020 at 10:27, Kamil Rytarowski <kamil@netbsd.org> wrote:

 > Unfortunately recording and reverse-debugging is still unsupported in
 > GDB on NetBSD. I would need to research how much of work is needed to
 > get support and maybe Andrew can shed some light on it too.

 The native reverse debugging code was pushed after I stopped hacking on GDB.
 Is it working on any non Linux backend?  I can see two ways of making this work:
 1. sucking out the contents of the process
 2. some sort of "spawn of linux" to capture the thread state

From: Nick Hudson <nick.hudson@gmx.co.uk>
To: Kamil Rytarowski <kamil@netbsd.org>, gnats-bugs@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, nikhil.benesch@gmail.com,
 Andrew Cagney <andrew.cagney@gmail.com>
Cc: 
Subject: Re: lib/55719 (Unwind tables for signal trampoline on amd64 are
 incorrect)
Date: Mon, 12 Oct 2020 15:56:59 +0100

 On 12/10/2020 15:01, Kamil Rytarowski wrote:
 [snip]

 Good job.

 > Andrew, is this ready to commit?
 >
 > http://netbsd.org/~kamil/patch-00283-amd64-__sigtramp2-cfi-unwinding.2.t=
 xt
 >

 Is '_OFFSETOF_' really needed as a prefix?


 nick@thinkbook:~/netbsd/nbcvs/src$ find lib/libc -name "genassym.cf"
 -exec grep -H offsetof {} \; | grep -v _OFFSETOF | wc -l
 71
 nick@thinkbook:~/netbsd/nbcvs/src$ find lib/libc -name "genassym.cf"
 -exec grep -H offsetof {} \; | grep  _OFFSETOF | wc -l
 41
 nick@thinkbook:~/netbsd/nbcvs/src$


 Thanks,
 Nick

From: Nikhil Benesch <nikhil.benesch@gmail.com>
To: Nick Hudson <nick.hudson@gmx.co.uk>
Cc: Kamil Rytarowski <kamil@netbsd.org>, gnats-bugs@netbsd.org, gnats-admin@netbsd.org, 
	netbsd-bugs@netbsd.org, Andrew Cagney <andrew.cagney@gmail.com>
Subject: Re: lib/55719 (Unwind tables for signal trampoline on amd64 are incorrect)
Date: Mon, 12 Oct 2020 11:13:29 -0400

 Ah, indeed, I copied the naming from the MIPS port, which does seem to
 be the outlier in using the "OFFSETOF_" prefix for these macros.

 On Mon, Oct 12, 2020 at 10:57 AM Nick Hudson <nick.hudson@gmx.co.uk> wrote:
 >
 > On 12/10/2020 15:01, Kamil Rytarowski wrote:
 > [snip]
 >
 > Good job.
 >
 > > Andrew, is this ready to commit?
 > >
 > > http://netbsd.org/~kamil/patch-00283-amd64-__sigtramp2-cfi-unwinding.2.txt
 > >
 >
 > Is '_OFFSETOF_' really needed as a prefix?
 >
 >
 > nick@thinkbook:~/netbsd/nbcvs/src$ find lib/libc -name "genassym.cf"
 > -exec grep -H offsetof {} \; | grep -v _OFFSETOF | wc -l
 > 71
 > nick@thinkbook:~/netbsd/nbcvs/src$ find lib/libc -name "genassym.cf"
 > -exec grep -H offsetof {} \; | grep  _OFFSETOF | wc -l
 > 41
 > nick@thinkbook:~/netbsd/nbcvs/src$
 >
 >
 > Thanks,
 > Nick

State-Changed-From-To: open->closed
State-Changed-By: kamil@NetBSD.org
State-Changed-When: Mon, 12 Oct 2020 19:58:00 +0200
State-Changed-Why:
Fix committe.


From: Nikhil Benesch <nikhil.benesch@gmail.com>
To: gnats-bugs@netbsd.org
Cc: Kamil Rytarowski <kamil@netbsd.org>, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Subject: Re: lib/55719 (Unwind tables for signal trampoline on amd64 are incorrect)
Date: Sun, 18 Oct 2020 12:01:08 -0400

 Ah, unfortunately I introduced a buglet with the revisions. Apologies
 for not noticing sooner. The problem only presents when backtracing
 through *multiple* frames before a signal handler and when those
 frames don't use a frame pointer. I've included a modified version of
 Kamil's test program that surfaces the bug here:

 https://gist.github.com/benesch/2b2b4ffb76019d7a2cda6c6002bb2aa0

 With -O0, the backtrace works perfectly on NetBSD-current:

 $ ./test
 Backtrace 4 stack frames.
 0x400bb0 <handler+0x30> at ./test
 0x7a64384a7870 <__sigtramp_siginfo_2> at /usr/lib/libc.so.12
 0x400be9 <run+0x4> at ./test
 0x400c0e <main+0x23> at ./test

 With -O2 the backtrace gets stuck at the run function:

 $ ./test
 Backtrace 3 stack frames.
 0x400ba0 <handler+0x20> at ./test
 0x73959f0a7870 <__sigtramp_siginfo_2> at /usr/lib/libc.so.12
 0x400bc8 <run> at ./test

 Looking at the generated CFI for the test program makes the issue
 clear. Without optimizations, the CFA is computed via the base
 pointer. With optimizations, it is computed via the stack pointer. So
 the CFI we added to NetBSD's sigtramp mishandles restoration of RSP.
 The fix is quite straightforward:

 ===================================================================
 RCS file: /cvsroot/src/lib/libc/arch/x86_64/sys/__sigtramp2.S,v
 retrieving revision 1.8
 diff -u -r1.8 __sigtramp2.S
 --- arch/x86_64/sys/__sigtramp2.S       12 Oct 2020 17:55:54 -0000      1.8
 +++ arch/x86_64/sys/__sigtramp2.S       18 Oct 2020 15:51:51 -0000
 @@ -58,7 +58,7 @@
         .cfi_offset rsi, UC_GREGS_RSI
         .cfi_offset rdi, UC_GREGS_RDI
         .cfi_offset rbp, UC_GREGS_RBP
 -       /* The unwinder will use the CFA to restore RSP. */
 +       .cfi_offset rsp, UC_GREGS_RSP
         .cfi_offset r8,  UC_GREGS_R8
         .cfi_offset r9,  UC_GREGS_R9
         .cfi_offset r10, UC_GREGS_R10

 (My email client has a tendency to misformat patches, so I've included
 another copy of the patch in the GitHub Gist I linked to above.)

 Basically, because we changed the CFI for the sigtramp to *not* use
 RSP as the CFA, we need to explicitly restore RSP to the value stored
 in the context. Otherwise the unwinder will implicitly set RSP to the
 CFA, which is not correct in this case. (Of course, without
 optimizations things work out ok, since the CFI doesn't refer to the
 mishandled RSP.)

 On Mon, Oct 12, 2020 at 1:58 PM <kamil@netbsd.org> wrote:
 >
 > Synopsis: Unwind tables for signal trampoline on amd64 are incorrect
 >
 > State-Changed-From-To: open->closed
 > State-Changed-By: kamil@NetBSD.org
 > State-Changed-When: Mon, 12 Oct 2020 19:58:00 +0200
 > State-Changed-Why:
 > Fix committe.
 >
 >
 >

From: "Kamil Rytarowski" <kamil@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55719 CVS commit: src/lib/libc/arch/x86_64/sys
Date: Mon, 19 Oct 2020 11:29:26 +0000

 Module Name:	src
 Committed By:	kamil
 Date:		Mon Oct 19 11:29:26 UTC 2020

 Modified Files:
 	src/lib/libc/arch/x86_64/sys: __sigtramp2.S

 Log Message:
 Restore RSP from mcontext

 Fixes unwinding of multiple frames without base pointer.

 Patch by: Nikhil Benesch via PR lib/55719


 To generate a diff of this commit:
 cvs rdiff -u -r1.8 -r1.9 src/lib/libc/arch/x86_64/sys/__sigtramp2.S

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Kamil Rytarowski <n54@gmx.com>
To: gnats-bugs@netbsd.org, kamil@netbsd.org, gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org, nikhil.benesch@gmail.com,
 Andrew Cagney <andrew.cagney@gmail.com>
Cc: 
Subject: Re: lib/55719 (Unwind tables for signal trampoline on amd64 are
 incorrect)
Date: Mon, 19 Oct 2020 13:38:29 +0200

 Thanks for the patch. I've applied it and repeated it in i386.

 I was wondering whether we need to restore the stack pointer too.

 While there, I'm getting the following backtrace for NetBSD/i386:

 Backtrace 6 stack frames.
 0x8048972 <handler+0x2d> at ./a.out
 0xfbd93010 <__sigtramp_siginfo_2> at /usr/lib/i386/libc.so.12
 0x804899e <run1> at ./a.out
 0x80489a5 <run3> at ./a.out
 0x80489aa <run3+0x5> at ./a.out
 0x80489de <__x86.get_pc_thunk.ax> at ./a.out

 for "gcc -fomit-frame-pointer  -O3 test.c -lexecinfo -m32". Is this
 correct? I don't see <main> over there.

 http://netbsd.org/~kamil/backtrace/libc-i386-PR55719-multiple-frames.txt

 On 18.10.2020 18:05, Nikhil Benesch wrote:
 > The following reply was made to PR lib/55719; it has been noted by GNATS=
 .
 >
 > From: Nikhil Benesch <nikhil.benesch@gmail.com>
 > To: gnats-bugs@netbsd.org
 > Cc: Kamil Rytarowski <kamil@netbsd.org>, netbsd-bugs@netbsd.org, gnats-a=
 dmin@netbsd.org
 > Subject: Re: lib/55719 (Unwind tables for signal trampoline on amd64 are=
  incorrect)
 > Date: Sun, 18 Oct 2020 12:01:08 -0400
 >
 >  Ah, unfortunately I introduced a buglet with the revisions. Apologies
 >  for not noticing sooner. The problem only presents when backtracing
 >  through *multiple* frames before a signal handler and when those
 >  frames don't use a frame pointer. I've included a modified version of
 >  Kamil's test program that surfaces the bug here:
 >
 >  https://gist.github.com/benesch/2b2b4ffb76019d7a2cda6c6002bb2aa0
 >
 >  With -O0, the backtrace works perfectly on NetBSD-current:
 >
 >  $ ./test
 >  Backtrace 4 stack frames.
 >  0x400bb0 <handler+0x30> at ./test
 >  0x7a64384a7870 <__sigtramp_siginfo_2> at /usr/lib/libc.so.12
 >  0x400be9 <run+0x4> at ./test
 >  0x400c0e <main+0x23> at ./test
 >
 >  With -O2 the backtrace gets stuck at the run function:
 >
 >  $ ./test
 >  Backtrace 3 stack frames.
 >  0x400ba0 <handler+0x20> at ./test
 >  0x73959f0a7870 <__sigtramp_siginfo_2> at /usr/lib/libc.so.12
 >  0x400bc8 <run> at ./test
 >
 >  Looking at the generated CFI for the test program makes the issue
 >  clear. Without optimizations, the CFA is computed via the base
 >  pointer. With optimizations, it is computed via the stack pointer. So
 >  the CFI we added to NetBSD's sigtramp mishandles restoration of RSP.
 >  The fix is quite straightforward:
 >
 >  =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
 =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
 =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
 >  RCS file: /cvsroot/src/lib/libc/arch/x86_64/sys/__sigtramp2.S,v
 >  retrieving revision 1.8
 >  diff -u -r1.8 __sigtramp2.S
 >  --- arch/x86_64/sys/__sigtramp2.S       12 Oct 2020 17:55:54 -0000     =
  1.8
 >  +++ arch/x86_64/sys/__sigtramp2.S       18 Oct 2020 15:51:51 -0000
 >  @@ -58,7 +58,7 @@
 >          .cfi_offset rsi, UC_GREGS_RSI
 >          .cfi_offset rdi, UC_GREGS_RDI
 >          .cfi_offset rbp, UC_GREGS_RBP
 >  -       /* The unwinder will use the CFA to restore RSP. */
 >  +       .cfi_offset rsp, UC_GREGS_RSP
 >          .cfi_offset r8,  UC_GREGS_R8
 >          .cfi_offset r9,  UC_GREGS_R9
 >          .cfi_offset r10, UC_GREGS_R10
 >
 >  (My email client has a tendency to misformat patches, so I've included
 >  another copy of the patch in the GitHub Gist I linked to above.)
 >
 >  Basically, because we changed the CFI for the sigtramp to *not* use
 >  RSP as the CFA, we need to explicitly restore RSP to the value stored
 >  in the context. Otherwise the unwinder will implicitly set RSP to the
 >  CFA, which is not correct in this case. (Of course, without
 >  optimizations things work out ok, since the CFI doesn't refer to the
 >  mishandled RSP.)
 >
 >  On Mon, Oct 12, 2020 at 1:58 PM <kamil@netbsd.org> wrote:
 >  >
 >  > Synopsis: Unwind tables for signal trampoline on amd64 are incorrect
 >  >
 >  > State-Changed-From-To: open->closed
 >  > State-Changed-By: kamil@NetBSD.org
 >  > State-Changed-When: Mon, 12 Oct 2020 19:58:00 +0200
 >  > State-Changed-Why:
 >  > Fix committe.
 >  >
 >  >
 >  >
 >
 >

From: Nikhil Benesch <nikhil.benesch@gmail.com>
To: gnats-bugs@netbsd.org, kamil@netbsd.org, gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org
Cc: 
Subject: Re: lib/55719 (Unwind tables for signal trampoline on amd64 are
 incorrect)
Date: Sun, 1 Nov 2020 23:54:34 -0500

 On Mon, Oct 19, 2020 at 7:45 AM Kamil Rytarowski <n54@gmx.com> wrote:
 >  While there, I'm getting the following backtrace for NetBSD/i386:
 >
 >  Backtrace 6 stack frames.
 >  0x8048972 <handler+0x2d> at ./a.out
 >  0xfbd93010 <__sigtramp_siginfo_2> at /usr/lib/i386/libc.so.12
 >  0x804899e <run1> at ./a.out
 >  0x80489a5 <run3> at ./a.out
 >  0x80489aa <run3+0x5> at ./a.out
 >  0x80489de <__x86.get_pc_thunk.ax> at ./a.out
 >
 >  for "gcc -fomit-frame-pointer  -O3 test.c -lexecinfo -m32". Is this
 >  correct? I don't see <main> over there.
 >
 >  http://netbsd.org/~kamil/backtrace/libc-i386-PR55719-multiple-frames.txt

 Sorry for the slow reply. I finally got the chance to look into this and I think
 figured out what is going on. It is a classic case of GCC being too clever. Here
 is the disassembly of your test program:

 0804899e <run1>:
   804899e:       eb fe                   jmp    804899e <run1>

 080489a0 <run2>:
   80489a0:       e8 f9 ff ff ff          call   804899e <run1>

 080489a5 <run3>:
   80489a5:       e8 f6 ff ff ff          call   80489a0 <run2>
   80489aa:       66 90                   xchg   %ax,%ax
   80489ac:       66 90                   xchg   %ax,%ax
   80489ae:       66 90                   xchg   %ax,%ax
   80489b0:       e8 eb fa ff ff          call   80484a0 <abort@plt>
   80489b5:       89 fb                   mov    %edi,%ebx
   80489b7:       e8 e4 fa ff ff          call   80484a0 <abort@plt>

 080489bc <main>:
   80489bc:       55                      push   %ebp
   80489bd:       89 e5                   mov    %esp,%ebp
   80489bf:       83 e4 f0                and    $0xfffffff0,%esp
   80489c2:       83 ec 10                sub    $0x10,%esp
   80489c5:       c7 44 24 04 45 89 04    movl   $0x8048945,0x4(%esp)
   80489cc:       08
   80489cd:       c7 04 24 02 00 00 00    movl   $0x2,(%esp)
   80489d4:       e8 e7 fa ff ff          call   80484c0 <signal@plt>
   80489d9:       e8 c7 ff ff ff          call   80489a5 <run3>

 080489de <__x86.get_pc_thunk.ax>:
   80489de:       8b 04 24                mov    (%esp),%eax
   80489e1:       c3                      ret

 Notice how none of the functions have ret instructions. I guess GCC has realized
 that all of them terminate in the infinite loop in run1. So when the unwinder
 tries to unwind from run3, it ends up falling off the end of main and into
 the __x86.get_pc_thunk.ax function instead, due to that (ra + 1) hack we were
 discussing previously.

 A very simple change to inhibit this optimatization...

 diff -u test.c test2.c
 --- test.c      2020-11-02 04:49:00.693887594 +0000
 +++ test2.c     2020-11-02 04:48:33.693998384 +0000
 @@ -14,10 +14,12 @@
           backtrace_symbols_fd (array, size, 2);
   }

 +volatile int v;
 +
   __attribute__ ((noinline))
   int
   run1(void) {
 -       for (;;)
 +       for (v = 1; v;)
                   continue;
   }

 ...results in the correct backtraces:

 $ ./a.out
 ^Cx 2
 Backtrace 4 stack frames.
 0x8048972 <handler+0x2d> at ./a.out
 0xf4c3c0c0 <__sigtramp_siginfo_2> at /usr/lib/i386/libc.so.12
 0x80489a8 <run1+0xa> at ./a.out
 0x80489ee <main+0x22> at ./a.out

 And the assembly, as you would expect, includes the ret instructions:

 080489cc <main>:
   80489cc:       55                      push   %ebp
   80489cd:       89 e5                   mov    %esp,%ebp
   80489cf:       83 e4 f0                and    $0xfffffff0,%esp
   80489d2:       83 ec 10                sub    $0x10,%esp
   80489d5:       c7 44 24 04 45 89 04    movl   $0x8048945,0x4(%esp)
   80489dc:       08
   80489dd:       c7 04 24 02 00 00 00    movl   $0x2,(%esp)
   80489e4:       e8 d7 fa ff ff          call   80484c0 <signal@plt>
   80489e9:       e8 c6 ff ff ff          call   80489b4 <run3>
   80489ee:       31 c0                   xor    %eax,%eax
   80489f0:       c9                      leave
   80489f1:       c3                      ret

 I'm not sure there is anything that can be done about this. This seems like a flaw
 inherent to the design of the DWARF-based unwinders. At the very least, analyzing
 and fixing this properly exceeds my expertise.

 Thanks again for getting the earlier unwinding patches committed so quickly,
 Kamil. gccgo is working great on NetBSD now as a result.

 Cheers,
 Nikhil

>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.