NetBSD Problem Report #54893
From gson@gson.org Sat Jan 25 08:12:42 2020
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id BE7CF8B486
for <gnats-bugs@gnats.NetBSD.org>; Sat, 25 Jan 2020 08:12:42 +0000 (UTC)
Message-Id: <20200125081239.A0046253F50@guava.gson.org>
Date: Sat, 25 Jan 2020 10:12:39 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: New t_ptrace_wait*/resume1 failures since Jan 12
X-Send-Pr-Version: 3.95
>Number: 54893
>Category: bin
>Synopsis: New t_ptrace_wait*/resume1 failures since Jan 12
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: kamil
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Jan 25 08:15:00 +0000 2020
>Closed-Date: Tue Feb 11 00:45:55 +0000 2020
>Last-Modified: Tue Feb 11 00:50:01 +0000 2020
>Originator: Andreas Gustafsson
>Release: NetBSD-current, source date >= 2020.01.12.22.03.23
>Organization:
>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:
Several lib/libc/sys/t_ptrace_wait*:resume1 test cases started failing
with the following commits:
commit 2020.01.12.22.03.22 ad src/sys/kern/kern_exec.c 1.488
commit 2020.01.12.22.03.22 ad src/sys/kern/kern_runq.c 1.58
commit 2020.01.12.22.03.23 ad src/sys/sys/lwp.h 1.196
commit 2020.01.12.22.03.23 ad src/sys/sys/sched.h 1.87
The lib/libc/sys/t_ptrace_wait3:resume1 and
lib/libc/sys/t_ptrace_wait4:resume1 test cases have failed in every
run, and some other resume1 tests have been failing randomly.
These failures are still occurring as of source date 2020.01.25.00.12.42.
For logs, see:
http://www.gson.org/netbsd/bugs/build/amd64-baremetal/commits-2020.01.html#2020.01.12.22.03.23
See PR 50350 for another issue that (re)appeared with the same commits.
>How-To-Repeat:
>Fix:
>Release-Note:
>Audit-Trail:
From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/54893: New t_ptrace_wait*/resume1 failures since Jan 12
Date: Tue, 28 Jan 2020 19:02:37 +0000
It looks to me like there is a race in this test case, which the scheduler
in -current has exposed, because now if any CPU is idle a new LWP will be
running within microseconds of _lwp_create().
===> Working:
child (LID 1)
DPRINTF("Before creating new in child\n");
FORKEE_ASSERT(_lwp_create(&uc, 0, &lid) == 0);
22139 1 t_ptrace_wait CALL _lwp_create(0x7f7fff1659d0,0,0x7f7fff1659b8)
22139 1 t_ptrace_wait RET _lwp_create 0
child (LID 1):
CHILD_TO_PARENT("Message", fds, msg);
child (LID 2)
the_lwp_id = _lwp_self();
raise(SIGTRAP);
_lwp_exit();
22139 1 t_ptrace_wait CALL close(5)
22139 1 t_ptrace_wait RET close 0
22139 1 t_ptrace_wait CALL close(6)
22139 2 t_ptrace_wait CALL _lwp_self
22139 1 t_ptrace_wait RET close 0
22139 2 t_ptrace_wait RET _lwp_self 2
22139 2 t_ptrace_wait CALL _lwp_self
22139 1 t_ptrace_wait CALL write(7,0x7f7fff1659b3,1)
22139 2 t_ptrace_wait RET _lwp_self 2
22139 1 t_ptrace_wait GIO fd 7 wrote 1 bytes
"\M-^"
22139 2 t_ptrace_wait CALL _lwp_kill(2,5)
22139 1 t_ptrace_wait RET write 1
22139 2 t_ptrace_wait RET _lwp_kill 0
22139 1 t_ptrace_wait CALL read(4,0x7f7fff1659b3,1)
Look at that, amazin'. Both LWPs stop here. Later on in the parent:
parent (LID 1)
5671 1 t_ptrace_wait CALL read(6,0x7f7fff1659b3,1)
5671 1 t_ptrace_wait GIO fd 6 read 1 bytes
"\M-^"
===> Failing:
20223 1 t_ptrace_wait CALL _lwp_create(0x7f7fffb27280,0,0x7f7fffb27268)
20223 1 t_ptrace_wait RET _lwp_create 0
20223 1 t_ptrace_wait CALL close(5)
20223 1 t_ptrace_wait RET close 0
20223 2 t_ptrace_wait CALL _lwp_self
20223 2 t_ptrace_wait RET _lwp_self 2
20223 2 t_ptrace_wait CALL _lwp_self
20223 2 t_ptrace_wait RET _lwp_self 2
20223 2 t_ptrace_wait CALL _lwp_kill(2,5)
20223 2 t_ptrace_wait RET _lwp_kill 0
Oops!
...
later in the parent:
8999 1 t_ptrace_wait CALL read(6,0x7f7fffb27263,1)
^ hangs here
Responsible-Changed-From-To: kern-bug-people->bin-bug-people
Responsible-Changed-By: ad@NetBSD.org
Responsible-Changed-When: Tue, 28 Jan 2020 19:13:20 +0000
Responsible-Changed-Why:
Not a kernel problem from what I can see
Responsible-Changed-From-To: bin-bug-people->kamil
Responsible-Changed-By: kamil@NetBSD.org
Responsible-Changed-When: Wed, 29 Jan 2020 05:07:56 +0100
Responsible-Changed-Why:
Take.
State-Changed-From-To: open->closed
State-Changed-By: kamil@NetBSD.org
State-Changed-When: Tue, 11 Feb 2020 01:45:55 +0100
State-Changed-Why:
Fixed in src/tests/lib/libc/sys/t_ptrace_wait.c r1.155, src/tests/lib/libc/sys/t_ptrace_wait.h r1.19.
From: "Kamil Rytarowski" <kamil@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54893 CVS commit: src/tests/lib/libc/sys
Date: Tue, 11 Feb 2020 00:41:37 +0000
Module Name: src
Committed By: kamil
Date: Tue Feb 11 00:41:37 UTC 2020
Modified Files:
src/tests/lib/libc/sys: t_ptrace_wait.c t_ptrace_wait.h
Log Message:
Rewrite the t_ptrace_wait resume1 test
Switch to pthread functions.
Rename to 'resume'.
Synchronize the threads with pthread barriers.
Avoid race in the test.
Test passes correctly.
Fixes PR bin/54893 by Andreas Gustafsson
To generate a diff of this commit:
cvs rdiff -u -r1.154 -r1.155 src/tests/lib/libc/sys/t_ptrace_wait.c
cvs rdiff -u -r1.18 -r1.19 src/tests/lib/libc/sys/t_ptrace_wait.h
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.