NetBSD Problem Report #54618

From www@netbsd.org  Mon Oct 14 21:21:54 2019
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 89B097A25D
	for <gnats-bugs@gnats.NetBSD.org>; Mon, 14 Oct 2019 21:21:54 +0000 (UTC)
Message-Id: <20191014212153.826827A26E@mollari.NetBSD.org>
Date: Mon, 14 Oct 2019 21:21:53 +0000 (UTC)
From: david@gutteridge.ca
Reply-To: david@gutteridge.ca
To: gnats-bugs@NetBSD.org
Subject: Recurring kernel panics during shutdown (9.99.15/amd64)
X-Send-Pr-Version: www-1.0

>Number:         54618
>Category:       kern
>Synopsis:       Recurring kernel panics during shutdown (9.99.15/amd64)
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kamil
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Oct 14 21:25:00 +0000 2019
>Closed-Date:    Mon Oct 21 22:57:38 +0000 2019
>Last-Modified:  Wed Oct 23 19:30:05 +0000 2019
>Originator:     David H. Gutteridge
>Release:        9.99.15
>Organization:
>Environment:
NetBSD arcusxvi.nonus-porta.net 9.99.15 NetBSD 9.99.15 (GENERIC) #5: Tue Oct  8 22:53:28 EDT 2019  disciple@arcusxvi.nonus-porta.net:/home/disciple/netbsd-current/src/sys/arch/amd64/compile/obj/GENERIC amd64

>Description:
I've encountered this panic during the shutdown process a few times
recently, with HEAD kernels of an October 1st and October 8th vintage.
I can't consistently duplicate it; it only seems to happen after there
has been an appreciable amount of disk activity (e.g. a bunch of pkgsrc
builds). This is on an oldish Ivy Bridge laptop I use for testing and
such.

I haven't had time to look into it any further, but suspect I can get
this to happen again, to get more details. Each time the backtrace is
essentially the same; the only difference is the CPU it's running on.
Anyway, I thought I'd mention it. Maybe someone else knows more
already.

[  9688.139991] syncing disks... [ 9688.1399915] Skipping crash dump on recursive panic
[  9688.139991] panic: lock error: Reader / writer lock: rw_vector_enter,350: locking against myself: lock 0xffff9986d4b623a0 cpu 0 lwp 0xffff9986cd64ab40
[  9688.139991] cpu0: Begin traceback...
[  9688.139991] vpanic() at netbsd:vpanic+0x160
[  9688.139991] snprintf() at netbsd:snprintf
[  9688.139991] lockdebug_abort() at netbsd:lockdebug_abort+0xee
[  9688.149995] rw_vector_enter() at netbsd:rw_vector_enter+0x3b0
[  9688.149995] exit1() at netbsd:exit1+0xf8
[  9688.159998] lwp_exit() at netbsd:lwp_exit+0x4aa
[  9688.159998] sigswitch() at netbsd:sigswitch+0x329
[  9688.159998] issignal() at netbsd:issignal+0x216
[  9688.159998] sleepq_block() at netbsd:sleepq_block+0x157
[  9688.170002] cv_timedwait_sig() at netbsd:cv_timedwait_sig+0x107
[  9688.170002] ttysleep() at netbsd:ttysleep+0x7b
[  9688.170002] ttywait_timo() at netbsd:ttywait_timo+0x8a
[  9688.180005] exit1() at netbsd:exit1+0x7f4
[  9688.180005] sigexit() at netbsd:sigexit+0x20a
[  9688.190009] sendsig() at netbsd:sendsig
[  9688.190009] lwp_userret() at netbsd:lwp_userret+0x19b
[  9688.190009] syscall() at netbsd:syscall+0x1ed
[  9688.200013] --- syscall (number 54) ---
[  9688.200013] 78e23df818d8:
[  9688.200013] cpu0: End traceback...
[  9688.200013] fatal breakpoint trap in supervisor mode
[  9688.200013] trap type 1 code 0 rip 0xffffffff8021dd9d cs 0x8 rflags 0x202 cr2 0x7dbfb0bd56f0 ilevel 0 rsp 0xffff9b0068c5e910
[  9688.200013] curlwp 0xffff9986cd64ab40 pid 603.1 lowest kstack 0xffff9b0068c5b2c0

>How-To-Repeat:
(As above, it seems a certain amount of intensive disk activity is
required, but I'm not sure about that.)
>Fix:

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: kern-bug-people->kamil
Responsible-Changed-By: kamil@NetBSD.org
Responsible-Changed-When: Tue, 15 Oct 2019 11:49:15 +0200
Responsible-Changed-Why:
Take.


From: "Kamil Rytarowski" <kamil@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54618 CVS commit: src/sys/kern
Date: Tue, 15 Oct 2019 13:59:57 +0000

 Module Name:	src
 Committed By:	kamil
 Date:		Tue Oct 15 13:59:57 UTC 2019

 Modified Files:
 	src/sys/kern: kern_sig.c

 Log Message:
 Remove the short-circuit lwp_exit() path from sigswitch()

 sigswitch() can be called from exit1() through:

    ttywait()->ttysleep()->cv_timedwait_sig()->sleepq_block()->issignal()->sigswitch()

 lwp_exit() called for the last LWP triggers exit1() and this causes a panic.

 The debugger related signals have short-circuit demise paths in
 eventswitch() and other functions, before calling sigswitch().

 This change restores the original behavior, but there is an open question
 whether the kernel crash is a red herring of misbehavior of ttywait().

 This should fix PR kern/54618 by David H. Gutteridge


 To generate a diff of this commit:
 cvs rdiff -u -r1.372 -r1.373 src/sys/kern/kern_sig.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
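
The re-entry described in the log message above can be illustrated with a
small user-space model. This is only a sketch, not the kernel code: every
toy_* name is hypothetical, the call chain from ttywait() to issignal() is
collapsed into a single function, and the assumption (inferred from the
backtrace, where the inner exit1() fails in rw_vector_enter()) is that the
outer exit1() still holds a non-recursive lock while it waits on the tty.

```c
/*
 * Toy user-space model of the re-entrant exit path. NOT kernel code:
 * every toy_* name is hypothetical and the "lock" is only an owner field.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_proc {
	const void *lock_owner;	/* NULL while the exit-time lock is free */
	bool short_circuit;	/* true: sigswitch may exit the LWP itself */
	bool lock_error;	/* set where the kernel would panic */
};

static void toy_exit1(struct toy_proc *p);

/* Non-recursive lock: refuse a second acquisition by the same owner. */
static bool
toy_lock_enter(struct toy_proc *p, const void *self)
{
	if (p->lock_owner == self)
		return false;	/* "locking against myself" */
	assert(p->lock_owner == NULL);	/* single-threaded model */
	p->lock_owner = self;
	return true;
}

/* Stands in for sigswitch(): with the short-circuit, the last LWP exits. */
static void
toy_sigswitch(struct toy_proc *p)
{
	if (p->short_circuit)
		toy_exit1(p);	/* lwp_exit() of the last LWP -> exit1() */
}

/* Collapses ttywait -> ttysleep -> ... -> issignal into one call. */
static void
toy_ttywait(struct toy_proc *p)
{
	toy_sigswitch(p);
}

static void
toy_exit1(struct toy_proc *p)
{
	if (!toy_lock_enter(p, p)) {
		p->lock_error = true;
		return;
	}
	toy_ttywait(p);	/* lock is still held while draining the tty */
}

/* Run one exit and report whether the lock error was reached. */
bool
toy_exit_hits_lock_error(bool short_circuit)
{
	struct toy_proc p = { NULL, short_circuit, false };

	toy_exit1(&p);
	return p.lock_error;
}
```

In this model toy_exit_hits_lock_error(true) reaches the lock error, while
toy_exit_hits_lock_error(false), which corresponds to removing the
short-circuit, lets the outer exit path continue normally.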

State-Changed-From-To: open->feedback
State-Changed-By: kamil@NetBSD.org
State-Changed-When: Tue, 15 Oct 2019 16:03:28 +0200
State-Changed-Why:
Potential fix committed src/sys/kern/kern_sig.c r1.373, please test.


State-Changed-From-To: feedback->closed
State-Changed-By: gutteridge@NetBSD.org
State-Changed-When: Mon, 21 Oct 2019 22:57:38 +0000
State-Changed-Why:
No recurrences since the fix was applied, assuming fixed. Thanks for the quick response!

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54618 CVS commit: [netbsd-9] src
Date: Wed, 23 Oct 2019 19:25:39 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Wed Oct 23 19:25:39 UTC 2019

 Modified Files:
 	src/sys/kern [netbsd-9]: kern_sig.c sys_ptrace_common.c
 	src/tests/lib/libc/sys [netbsd-9]: t_ptrace_wait.c

 Log Message:
 Pull up following revision(s) (requested by kamil in ticket #366):

 	tests/lib/libc/sys/t_ptrace_wait.c: revision 1.136
 	sys/kern/kern_sig.c: revision 1.373
 	tests/lib/libc/sys/t_ptrace_wait.c: revision 1.138
 	tests/lib/libc/sys/t_ptrace_wait.c: revision 1.139
 	sys/kern/kern_sig.c: revision 1.376
 	tests/lib/libc/sys/t_ptrace_wait.c: revision 1.140
 	sys/kern/sys_ptrace_common.c: revision 1.64

 Fix typo in a comment

 Enable TEST_LWP_ENABLED in t_ptrace_wait*
 The LWP events (created, exited) are now reliable in my local tests.
 PR kern/51420
 PR kern/51995

 Remove the short-circuit lwp_exit() path from sigswitch()

 sigswitch() can be called from exit1() through:

    ttywait()->ttysleep()->cv_timedwait_sig()->sleepq_block()->issignal()->sigswitch()

 lwp_exit() called for the last LWP triggers exit1() and this causes a panic.
 The debugger related signals have short-circuit demise paths in
 eventswitch() and other functions, before calling sigswitch().

 This change restores the original behavior, but there is an open question
 whether the kernel crash is a red herring of misbehavior of ttywait().
 This should fix PR kern/54618 by David H. Gutteridge

 Fix a race condition when handling concurrent LWP signals and add a test

 Fix a race condition that caused PT_GET_SIGINFO to return incorrect
 information when multiple signals were delivered concurrently
 to different LWPs.  Add a regression test that verifies that when 50
 threads concurrently use pthread_kill() on themselves, the debugger
 receives all signals with correct information.

 The kernel uses separate signal queues for each LWP.  However,
 the signal context used to implement PT_GET_SIGINFO is stored in 'struct
 proc' and therefore common to all LWPs in the process.  Previously,
 this member was filled in kpsignal2(), i.e. when the signal was sent.

 This meant that if another LWP managed to send another signal
 concurrently, the data was overwritten before the process was stopped.

 As a result, PT_GET_SIGINFO did not report the correct LWP and signal
 (it could even report a different signal than wait()).  This can be
 reproduced quite reliably with 20 LWPs, though it can also occur
 with 10.

 This patch moves setting of signal context to issignal(), just before
 the process is actually stopped.  The data is taken from per-LWP
 or per-process signal queue.  The added test confirms that the debugger
 correctly receives all signals, and PT_GET_SIGINFO reports both correct
 LWP and signal number.
 Reviewed by kamil.

 Remove preprocessor switch TEST_VFORK_ENABLED in t_ptrace_wait*
 vfork(2) tests are now enabled always and confirmed to be stable.

 Remove preprocessor switch TEST_LWP_ENABLED in t_ptrace_wait*
 LWP tests are now enabled always and confirmed to be stable.


 To generate a diff of this commit:
 cvs rdiff -u -r1.364.2.7 -r1.364.2.8 src/sys/kern/kern_sig.c
 cvs rdiff -u -r1.58.2.8 -r1.58.2.9 src/sys/kern/sys_ptrace_common.c
 cvs rdiff -u -r1.131.2.5 -r1.131.2.6 src/tests/lib/libc/sys/t_ptrace_wait.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
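
The PT_GET_SIGINFO race described in the pull-up message above can be
modelled deterministically in user space. The following is a sketch under
stated assumptions, not the kernel implementation: all toy_* names are
hypothetical, the signal numbers are arbitrary integers, and only the
per-LWP queue is modelled (the per-process queue is left out). It contrasts
filling the shared per-process slot at send time with filling it at stop
time from the stopping LWP's own queue entry.

```c
/*
 * Toy model of the PT_GET_SIGINFO race. NOT kernel code: all toy_* names
 * are hypothetical, and the signal numbers are arbitrary integers.
 */
#include <assert.h>
#include <stdbool.h>

#define TOY_NLWP 2

struct toy_siginfo {
	int lwpid;
	int signo;
};

struct toy_sigstate {
	struct toy_siginfo shared;		/* per-process, read by the debugger */
	struct toy_siginfo queue[TOY_NLWP];	/* one pending entry per LWP */
};

/* Send: always queue per LWP; the old scheme also fills the shared slot. */
static void
toy_send(struct toy_sigstate *s, int lwpid, int signo, bool fill_at_stop)
{
	assert(lwpid >= 0 && lwpid < TOY_NLWP);
	s->queue[lwpid].lwpid = lwpid;
	s->queue[lwpid].signo = signo;
	if (!fill_at_stop)
		s->shared = s->queue[lwpid];
}

/* Stop: the new scheme copies the stopping LWP's own queue entry. */
static void
toy_stop(struct toy_sigstate *s, int lwpid, bool fill_at_stop)
{
	assert(lwpid >= 0 && lwpid < TOY_NLWP);
	if (fill_at_stop)
		s->shared = s->queue[lwpid];
}

/*
 * LWP 0 gets signal 10, LWP 1 gets signal 12 before anything stops,
 * then LWP 0 stops the process. Returns what the debugger would read.
 */
struct toy_siginfo
toy_debugger_view(bool fill_at_stop)
{
	struct toy_sigstate s = { { -1, 0 }, { { -1, 0 }, { -1, 0 } } };

	toy_send(&s, 0, 10, fill_at_stop);
	toy_send(&s, 1, 12, fill_at_stop);
	toy_stop(&s, 0, fill_at_stop);
	return s.shared;
}
```

With fill_at_stop false (the old behaviour in this model) the debugger sees
LWP 1 and signal 12 even though LWP 0 stopped the process; with fill_at_stop
true it sees LWP 0 and signal 10.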

>Unformatted:
