NetBSD Problem Report #55352

From gson@gson.org  Sat Jun  6 11:17:04 2020
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id A466F1A921E
	for <gnats-bugs@gnats.NetBSD.org>; Sat,  6 Jun 2020 11:17:04 +0000 (UTC)
Message-Id: <20200606111659.37B51253D4C@guava.gson.org>
Date: Sat,  6 Jun 2020 14:16:59 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: access_regs_set_unaligned_pc_0x7 test cases sometimes fail
X-Send-Pr-Version: 3.95

>Number:         55352
>Category:       kern
>Synopsis:       access_regs_set_unaligned_pc_0x7 test cases sometimes fail
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kamil
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Jun 06 11:20:00 +0000 2020
>Last-Modified:  Tue Nov 02 09:10:01 +0000 2021
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current
>Organization:
>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:

The access_regs_set_unaligned_pc_0x7 test case in multiple
lib/libc/sys/t_ptrace_wait* tests is now randomly failing on my amd64
bare metal testbed which runs the tests on a physical HP DL360 G7
system.  Each test case fails in maybe 3/4 of the runs, but there are
six of them, so at least one of them is almost certain to fail in
each run.

Here is a tabular representation of the outcomes of the last 30 test
runs, with passing test cases marked by "-" and failing ones by "X":

  -------------------------XX-X-X-X-X--XXX   lib/libc/sys/t_ptrace_wait3:access_regs_set_unaligned_pc_0x7
  -------------------------X--XXXXX-X-XXXX   lib/libc/sys/t_ptrace_wait4:access_regs_set_unaligned_pc_0x7
  -------------------------X-XX--XXX-X-X-X   lib/libc/sys/t_ptrace_wait6:access_regs_set_unaligned_pc_0x7
  --------------------------XXX-X--XXX-XX-   lib/libc/sys/t_ptrace_wait:access_regs_set_unaligned_pc_0x7
  -------------------------XX-XXXXXX-XXXX-   lib/libc/sys/t_ptrace_waitid:access_regs_set_unaligned_pc_0x7
  -------------------------XXX--XXXXXX---X   lib/libc/sys/t_ptrace_waitpid:access_regs_set_unaligned_pc_0x7

Although the failures are easily reproduced on the system in question,
I'm having a hard time reproducing them on other systems.  None of the
following are failing:

 - Tests run in qemu
 - Tests run on a Dell R630 rather than a HP DL360
 - Tests run on the same hardware under NetBSD/i386 rather than NetBSD/amd64
 - The access_regs_set_unaligned_pc_0xN test case for any N other than 7

The first failures appeared after this commit:

  2020.05.31.01.39.33 ad src/sys/dev/acpi/acpi_cpu_cstate.c 1.61

Logs:

  http://www.gson.org/netbsd/bugs/build/amd64-baremetal/commits-2020.05.html#2020.05.31.01.39.33

>How-To-Repeat:

With difficulty.

>Fix:

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: kern-bug-people->kamil
Responsible-Changed-By: kamil@NetBSD.org
Responsible-Changed-When: Sat, 06 Jun 2020 13:22:20 +0200
Responsible-Changed-Why:
Take.


From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/55352: access_regs_set_unaligned_pc_0x7 test cases sometimes fail
Date: Fri, 15 Oct 2021 19:34:13 +0300

 My NetBSD/amd64 real hardware testbed is still reporting several
 access_regs_set_unaligned_pc_0x7 test case failures per run.
 Here's log output from a recent run with five of them:

   https://www.gson.org/netbsd/bugs/build/amd64-baremetal/2021/2021.10.08.21.32.28/test.html#failed-tcs-summary

 I tried to debug this by setting the variable "debug" in
 t_ptrace_wait.c, but that didn't work.  Setting it from the debugger
 doesn't work because the variable has been optimized away, and if you
 change the initializer in the source and rebuild t_ptrace_wait, you do
 get debug output but the test no longer fails.

 When the test fails, the t_ptrace_wait process hangs until the ATF
 5-minute timeout, and ps shows it has forked a child.  If I attach to
 the parent with gdb, it's hung in a wait() syscall:

   (gdb) bt
   #0  0x000076c7db846a8a in _sys___wait450 () from /usr/lib/libc.so.12
   #1  0x000076c7dc008821 in __wait450 (wpid=wpid@entry=-1, status=status@entry=0x7f7fff88ec5c, options=options@entry=0, rusage=rusage@entry=0x0)
       at /usr/src/lib/libpthread/pthread_cancelstub.c:661
   #2  0x000076c7db872f28 in _wait (istat=istat@entry=0x7f7fff88ec5c) at /usr/src/lib/libc/gen/wait.c:55
   #3  0x00000001b2e270e4 in access_regs (regset=<optimized out>, aux=<optimized out>) at /usr/src/tests/lib/libc/sys/t_ptrace_register_wait.h:147
   #4  0x000076c7dc40a434 in atf_tc_run (tc=0x1b30541d0 <atfu_access_regs_set_unaligned_pc_0x7_tc>, resfile=<optimized out>) at /usr/src/external/bsd/atf/dist/atf-c/tc.c:1024
   #5  0x000076c7dc406dac in run_tc (exitcode=<synthetic pointer>, p=0x7f7fff88f040, tp=0x7f7fff88f020) at /usr/src/external/bsd/atf/dist/atf-c/detail/tp_main.c:510
   #6  controlled_main (exitcode=<synthetic pointer>, add_tcs_hook=0x1b2e0a51c <atfu_tp_add_tcs>, argv=<optimized out>, argc=<optimized out>)
       at /usr/src/external/bsd/atf/dist/atf-c/detail/tp_main.c:580
   #7  atf_tp_main (argc=<optimized out>, argv=<optimized out>, add_tcs_hook=0x1b2e0a51c <atfu_tp_add_tcs>) at /usr/src/external/bsd/atf/dist/atf-c/detail/tp_main.c:610
   #8  0x00000001b2e0994d in ___start ()

 Specifically, it hangs at the second of the two wait calls performed
 by the test, the one expected to fail with ECHILD because the first
 one succeeded:

   #3  0x00000001b2e270e4 in access_regs (regset=<optimized out>, aux=<optimized out>) at /usr/src/tests/lib/libc/sys/t_ptrace_register_wait.h:147
   147                     TWAIT_REQUIRE_FAILURE(ECHILD,

 I can't attach to the child process with gdb because it's already
 being ptrace'd by the parent process.

 If I understand the code correctly, the test sets the program counter
 of a traced process to a somewhat arbitrary value that may well point
 into the middle of a multi-byte instruction, tells the process to
 continue execution from that point, then immediately kills it, and
 finally waits for it to exit, twice, expecting the first wait to
 succeed and the second one to fail.

 Did I get that right?  If so, it's a weird test in that it invokes
 undefined behavior in the child process, but arguably it should still
 be possible to trace, kill, and wait for such a process without a
 second wait hanging.

 Can anyone else reproduce this?  This hangs with about 50% probability
 on my machine:

   cd /usr/tests/lib/libc/sys
   ./t_ptrace_wait access_regs_set_unaligned_pc_0x7
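
To estimate how often the hang occurs, the command above can be run repeatedly under a timeout shorter than the ATF one. The following is a rough sketch in C, not part of the test suite: `run_once` and `count_failures` are hypothetical helper names, and the sketch assumes the test program exits with status 0 when the test case passes.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[] once.
 * Returns 0 if it exited with status 0 within timeout_s seconds,
 * 1 if it terminated any other way, 2 if it timed out and was killed,
 * and -1 on a fork/waitpid error.
 */
int
run_once(char *const argv[], int timeout_s)
{
	struct timespec tick = { 0, 100 * 1000 * 1000 };	/* 100 ms */
	int status, waited_ms = 0;
	pid_t pid, wpid;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		execvp(argv[0], argv);
		_exit(127);	/* exec failed */
	}

	for (;;) {
		/* Poll without blocking so that a hung run can be detected. */
		wpid = waitpid(pid, &status, WNOHANG);
		if (wpid == pid)
			return (WIFEXITED(status) &&
			    WEXITSTATUS(status) == 0) ? 0 : 1;
		if (wpid == -1)
			return -1;
		if (waited_ms >= timeout_s * 1000) {
			/* Only the direct child is killed here. */
			kill(pid, SIGKILL);
			(void)waitpid(pid, &status, 0);
			return 2;
		}
		nanosleep(&tick, NULL);
		waited_ms += 100;
	}
}

/* Count how many of `runs` invocations did not pass. */
int
count_failures(char *const argv[], int runs, int timeout_s)
{
	int i, failures = 0;

	for (i = 0; i < runs; i++)
		if (run_once(argv, timeout_s) != 0)
			failures++;
	return failures;
}
```

Called with an argv of `{ "./t_ptrace_wait", "access_regs_set_unaligned_pc_0x7", NULL }` from /usr/tests/lib/libc/sys, `count_failures` would give a failure count over a chosen number of runs. Note that the sketch kills only the test program itself, not the child that program has forked.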

 --
 Andreas Gustafsson, gson@gson.org

From: Kamil Rytarowski <n54@gmx.com>
To: gnats-bugs@netbsd.org, kamil@netbsd.org, gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org, Andreas Gustafsson <gson@gson.org>
Cc: 
Subject: Re: kern/55352: access_regs_set_unaligned_pc_0x7 test cases sometimes
 fail
Date: Tue, 2 Nov 2021 10:06:24 +0100

 The test is supposed to set an invalid Program Counter and check that
 the kernel does not crash. On strict architectures (notably SPARC) a
 kernel panic is possible, and one has been reported.

 Double wait is supposed to first wait for the child to exit cleanly,
 then assert that it is already gone.

 If the problem is still there, I plan to have a look over the weekend.
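
The double wait described above can be sketched in portable C. This is a minimal sketch, not the code from t_ptrace_register_wait.h: the function name `double_wait_sketch` is hypothetical, the ptrace steps (tracing the child, setting the program counter, resuming it) are left out, and `waitpid()` on the child's pid stands in for the test's own wait macros.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from the NetBSD test suite.
 * Returns 0 if the double wait behaves as the test expects, -1 otherwise.
 */
int
double_wait_sketch(void)
{
	int status;
	pid_t child, wpid;

	child = fork();
	if (child == -1)
		return -1;
	if (child == 0) {
		/* Child: block until the parent kills it. */
		for (;;)
			pause();
	}

	/* Parent: kill the child, as the test does after resuming it. */
	if (kill(child, SIGKILL) == -1)
		return -1;

	/* First wait: must reap the child and report death by SIGKILL. */
	wpid = waitpid(child, &status, 0);
	if (wpid != child || !WIFSIGNALED(status) ||
	    WTERMSIG(status) != SIGKILL)
		return -1;

	/* Second wait: the child is gone, so it must fail with ECHILD. */
	errno = 0;
	wpid = waitpid(child, &status, 0);
	if (wpid != -1 || errno != ECHILD)
		return -1;

	return 0;
}
```

In the failing runs reported in this PR, it is the counterpart of the second wait that never returns instead of failing with ECHILD.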

>Unformatted:
