NetBSD Problem Report #54618
From www@netbsd.org Mon Oct 14 21:21:54 2019
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 89B097A25D
for <gnats-bugs@gnats.NetBSD.org>; Mon, 14 Oct 2019 21:21:54 +0000 (UTC)
Message-Id: <20191014212153.826827A26E@mollari.NetBSD.org>
Date: Mon, 14 Oct 2019 21:21:53 +0000 (UTC)
From: david@gutteridge.ca
Reply-To: david@gutteridge.ca
To: gnats-bugs@NetBSD.org
Subject: Recurring kernel panics during shutdown (9.99.15/amd64)
X-Send-Pr-Version: www-1.0
>Number: 54618
>Category: kern
>Synopsis: Recurring kernel panics during shutdown (9.99.15/amd64)
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kamil
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Oct 14 21:25:00 +0000 2019
>Closed-Date: Mon Oct 21 22:57:38 +0000 2019
>Last-Modified: Wed Oct 23 19:30:05 +0000 2019
>Originator: David H. Gutteridge
>Release: 9.99.15
>Organization:
>Environment:
NetBSD arcusxvi.nonus-porta.net 9.99.15 NetBSD 9.99.15 (GENERIC) #5: Tue Oct 8 22:53:28 EDT 2019 disciple@arcusxvi.nonus-porta.net:/home/disciple/netbsd-current/src/sys/arch/amd64/compile/obj/GENERIC amd64
>Description:
I've encountered this panic during the shutdown process a few times
recently, with HEAD kernels of an October 1st and October 8th vintage.
I can't consistently duplicate it; it only seems to happen after there
has been an appreciable amount of disk activity (e.g. a bunch of pkgsrc
builds). This is on an oldish Ivy Bridge laptop I use for testing and
such.
I haven't had time to look into it any further, but suspect I can get
this to happen again, to get more details. Each time the backtrace is
essentially the same; the only difference is the CPU it's running on.
Anyway, I thought I'd mention it. Maybe someone else knows more
already.
[ 9688.139991] syncing disks... [ 9688.1399915] Skipping crash dump on recursive panic
[ 9688.139991] panic: lock error: Reader / writer lock: rw_vector_enter,350: locking against myself: lock 0xffff9986d4b623a0 cpu 0 lwp 0xffff9986cd64ab40
[ 9688.139991] cpu0: Begin traceback...
[ 9688.139991] vpanic() at netbsd:vpanic+0x160
[ 9688.139991] snprintf() at netbsd:snprintf
[ 9688.139991] lockdebug_abort() at netbsd:lockdebug_abort+0xee
[ 9688.149995] rw_vector_enter() at netbsd:rw_vector_enter+0x3b0
[ 9688.149995] exit1() at netbsd:exit1+0xf8
[ 9688.159998] lwp_exit() at netbsd:lwp_exit+0x4aa
[ 9688.159998] sigswitch() at netbsd:sigswitch+0x329
[ 9688.159998] issignal() at netbsd:issignal+0x216
[ 9688.159998] sleepq_block() at netbsd:sleepq_block+0x157
[ 9688.170002] cv_timedwait_sig() at netbsd:cv_timedwait_sig+0x107
[ 9688.170002] ttysleep() at netbsd:ttysleep+0x7b
[ 9688.170002] ttywait_timo() at netbsd:ttywait_timo+0x8a
[ 9688.180005] exit1() at netbsd:exit1+0x7f4
[ 9688.180005] sigexit() at netbsd:sigexit+0x20a
[ 9688.190009] sendsig() at netbsd:sendsig
[ 9688.190009] lwp_userret() at netbsd:lwp_userret+0x19b
[ 9688.190009] syscall() at netbsd:syscall+0x1ed
[ 9688.200013] --- syscall (number 54) ---
[ 9688.200013] 78e23df818d8:
[ 9688.200013] cpu0: End traceback...
[ 9688.200013] fatal breakpoint trap in supervisor mode
[ 9688.200013] trap type 1 code 0 rip 0xffffffff8021dd9d cs 0x8 rflags 0x202 cr2 0x7dbfb0bd56f0 ilevel 0 rsp 0xffff9b0068c5e910
[ 9688.200013] curlwp 0xffff9986cd64ab40 pid 603.1 lowest kstack 0xffff9b0068c5b2c0
>How-To-Repeat:
(As above, it seems a certain amount of intensive disk activity is
required, but I'm not sure about that.)
>Fix:
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: kern-bug-people->kamil
Responsible-Changed-By: kamil@NetBSD.org
Responsible-Changed-When: Tue, 15 Oct 2019 11:49:15 +0200
Responsible-Changed-Why:
Take.
From: "Kamil Rytarowski" <kamil@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54618 CVS commit: src/sys/kern
Date: Tue, 15 Oct 2019 13:59:57 +0000
Module Name: src
Committed By: kamil
Date: Tue Oct 15 13:59:57 UTC 2019
Modified Files:
src/sys/kern: kern_sig.c
Log Message:
Remove the short-circuit lwp_exit() path from sigswitch()
sigswitch() can be called from exit1() through:
ttywait()->ttysleep()->cv_timedwait_sig()->sleepq_block()->issignal()->sigswitch()
lwp_exit() called for the last LWP triggers exit1() and this causes a panic.
The debugger related signals have short-circuit demise paths in
eventswitch() and other functions, before calling sigswitch().
This change restores the original behavior, but there is an open question
whether the kernel crash is a red herring of misbehavior of ttywait().
This should fix PR kern/54618 by David H. Gutteridge
To generate a diff of this commit:
cvs rdiff -u -r1.372 -r1.373 src/sys/kern/kern_sig.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
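
The re-entry described in the log message above can be illustrated with a
small user-space model. This is only a sketch under stated assumptions: every
model_* name, the single lock, and run_model() are hypothetical and are not
the kernel's definitions, and the report does not identify which lock exit1()
holds. The model only shows why calling lwp_exit() from sigswitch() while
exit1() is already in progress leads to a "locking against myself" error.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: none of these are the kernel's real definitions. */
struct model_lock {
	int owner;		/* 0 = unowned, otherwise the holding LWP id */
};

static struct model_lock exit_lock;	/* stands in for a lock taken in exit1() */
static int lock_errors;			/* counts "locking against myself" */
static bool short_circuit_enabled;	/* models the path removed in r1.373 */
static bool signal_pending;

static void model_exit1(int lwp);

/* Non-recursive lock: acquiring it twice from the same LWP is an error. */
static bool model_lock_enter(struct model_lock *l, int lwp)
{
	if (l->owner == lwp) {
		lock_errors++;	/* the kernel's lock debugging panics here */
		return false;
	}
	l->owner = lwp;
	return true;
}

static void model_lock_exit(struct model_lock *l)
{
	assert(l->owner != 0);
	l->owner = 0;
}

/* For the last LWP of a process, exiting the LWP exits the process. */
static void model_lwp_exit(int lwp)
{
	model_exit1(lwp);
}

static void model_sigswitch(int lwp)
{
	if (short_circuit_enabled)
		model_lwp_exit(lwp);
}

/* Stands in for ttywait()->ttysleep()->...->issignal(). */
static void model_ttywait(int lwp)
{
	if (signal_pending) {
		signal_pending = false;
		model_sigswitch(lwp);
	}
}

static void model_exit1(int lwp)
{
	if (!model_lock_enter(&exit_lock, lwp))
		return;		/* the model returns where the kernel would panic */
	model_ttywait(lwp);
	model_lock_exit(&exit_lock);
}

/* Returns the number of lock errors seen during one modelled exit. */
static int run_model(bool short_circuit)
{
	exit_lock.owner = 0;
	lock_errors = 0;
	short_circuit_enabled = short_circuit;
	signal_pending = true;
	model_exit1(1);
	return lock_errors;
}
```

In this model, run_model(true) records one lock error because exit1() is
entered a second time while the first call still holds the lock, and
run_model(false) records none, which corresponds to the behavior after the
short-circuit path was removed.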
State-Changed-From-To: open->feedback
State-Changed-By: kamil@NetBSD.org
State-Changed-When: Tue, 15 Oct 2019 16:03:28 +0200
State-Changed-Why:
Potential fix committed src/sys/kern/kern_sig.c r1.373, please test.
State-Changed-From-To: feedback->closed
State-Changed-By: gutteridge@NetBSD.org
State-Changed-When: Mon, 21 Oct 2019 22:57:38 +0000
State-Changed-Why:
No recurrences since the fix was applied, assuming fixed. Thanks for the quick response!
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54618 CVS commit: [netbsd-9] src
Date: Wed, 23 Oct 2019 19:25:39 +0000
Module Name: src
Committed By: martin
Date: Wed Oct 23 19:25:39 UTC 2019
Modified Files:
src/sys/kern [netbsd-9]: kern_sig.c sys_ptrace_common.c
src/tests/lib/libc/sys [netbsd-9]: t_ptrace_wait.c
Log Message:
Pull up following revision(s) (requested by kamil in ticket #366):
tests/lib/libc/sys/t_ptrace_wait.c: revision 1.136
sys/kern/kern_sig.c: revision 1.373
tests/lib/libc/sys/t_ptrace_wait.c: revision 1.138
tests/lib/libc/sys/t_ptrace_wait.c: revision 1.139
sys/kern/kern_sig.c: revision 1.376
tests/lib/libc/sys/t_ptrace_wait.c: revision 1.140
sys/kern/sys_ptrace_common.c: revision 1.64
Fix typo in a comment
Enable TEST_LWP_ENABLED in t_ptrace_wait*
The LWP events (created, exited) are now reliable in my local tests.
PR kern/51420
PR kern/51995
Remove the short-circuit lwp_exit() path from sigswitch()
sigswitch() can be called from exit1() through:
ttywait()->ttysleep()->cv_timedwait_sig()->sleepq_block()->issignal()->sigswitch()
lwp_exit() called for the last LWP triggers exit1() and this causes a panic.
The debugger related signals have short-circuit demise paths in
eventswitch() and other functions, before calling sigswitch().
This change restores the original behavior, but there is an open question
whether the kernel crash is a red herring of misbehavior of ttywait().
This should fix PR kern/54618 by David H. Gutteridge
Fix a race condition when handling concurrent LWP signals and add a test
Fix a race condition that caused PT_GET_SIGINFO to return incorrect
information when multiple signals were delivered concurrently
to different LWPs. Add a regression test that verifies that when 50
threads concurrently use pthread_kill() on themselves, the debugger
receives all signals with correct information.
The kernel uses separate signal queues for each LWP. However,
the signal context used to implement PT_GET_SIGINFO is stored in 'struct
proc' and therefore common to all LWPs in the process. Previously,
this member was filled in kpsignal2(), i.e. when the signal was sent.
This meant that if another LWP managed to send another signal
concurrently, the data was overwritten before the process was stopped.
As a result, PT_GET_SIGINFO did not report the correct LWP and signal
(it could even report a different signal than wait()). This can be
reproduced quite reliably with 20 LWPs, but it can also occur with
10.
This patch moves setting of signal context to issignal(), just before
the process is actually stopped. The data is taken from per-LWP
or per-process signal queue. The added test confirms that the debugger
correctly receives all signals, and PT_GET_SIGINFO reports both correct
LWP and signal number.
Reviewed by kamil.
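
The race described above can be shown with a deterministic user-space model.
This is a sketch under stated assumptions, not the kernel code: the model_*
types and functions, the array of four LWPs, and the signal numbers 10 and 12
are all hypothetical. The fill_at_send flag selects between filling the
shared per-process context when the signal is sent (the previous behavior)
and filling it from the LWP's own queue when the process stops (the behavior
after the change).

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NLWP 4		/* arbitrary size for the model */

/* Hypothetical model types: not the kernel's structures. */
struct model_siginfo {
	int lwpid;
	int signo;
};

struct model_lwp {
	struct model_siginfo pending;	/* per-LWP signal queue (one slot) */
	bool has_pending;
};

struct model_proc {
	struct model_siginfo sigctx;	/* shared; what the debugger reads */
	struct model_lwp lwps[MODEL_NLWP];
};

/* Queue a signal for one LWP; optionally fill the shared context now. */
static void model_send(struct model_proc *p, int lwpid, int signo,
    bool fill_at_send)
{
	struct model_lwp *l = &p->lwps[lwpid];

	l->pending.lwpid = lwpid;
	l->pending.signo = signo;
	l->has_pending = true;
	if (fill_at_send)
		p->sigctx = l->pending;	/* a later send overwrites this */
}

/* Stop the process on behalf of one LWP and return what is reported. */
static struct model_siginfo model_stop(struct model_proc *p, int lwpid,
    bool fill_at_send)
{
	struct model_lwp *l = &p->lwps[lwpid];

	assert(l->has_pending);
	if (!fill_at_send)
		p->sigctx = l->pending;	/* taken from this LWP's own queue */
	l->has_pending = false;
	return p->sigctx;
}

/* Two LWPs receive signals before LWP 1 stops; is LWP 1 reported? */
static bool model_report_is_correct(bool fill_at_send)
{
	struct model_proc p = {0};
	struct model_siginfo seen;

	model_send(&p, 1, 10, fill_at_send);
	model_send(&p, 2, 12, fill_at_send);
	seen = model_stop(&p, 1, fill_at_send);
	return seen.lwpid == 1 && seen.signo == 10;
}
```

In this model, model_report_is_correct(true) returns false because the second
send overwrites the shared context before LWP 1 stops, and
model_report_is_correct(false) returns true because the context is filled
only at stop time.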
Remove preprocessor switch TEST_VFORK_ENABLED in t_ptrace_wait*
vfork(2) tests are now enabled always and confirmed to be stable.
Remove preprocessor switch TEST_LWP_ENABLED in t_ptrace_wait*
LWP tests are now enabled always and confirmed to be stable.
To generate a diff of this commit:
cvs rdiff -u -r1.364.2.7 -r1.364.2.8 src/sys/kern/kern_sig.c
cvs rdiff -u -r1.58.2.8 -r1.58.2.9 src/sys/kern/sys_ptrace_common.c
cvs rdiff -u -r1.131.2.5 -r1.131.2.6 src/tests/lib/libc/sys/t_ptrace_wait.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
>Unformatted: