NetBSD Problem Report #54688

From www@netbsd.org  Sun Nov 10 18:47:18 2019
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 89A8D7A1CB
	for <gnats-bugs@gnats.NetBSD.org>; Sun, 10 Nov 2019 18:47:18 +0000 (UTC)
Message-Id: <20191110184717.AFAC47A284@mollari.NetBSD.org>
Date: Sun, 10 Nov 2019 18:47:17 +0000 (UTC)
From: tnn@nygren.pp.se
Reply-To: tnn@nygren.pp.se
To: gnats-bugs@NetBSD.org
Subject: aarch64 softint hang, reentrancy problem?
X-Send-Pr-Version: www-1.0

>Number:         54688
>Category:       port-evbarm
>Synopsis:       aarch64 softint hang, reentrancy problem?
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    port-evbarm-maintainer
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Nov 10 18:50:00 +0000 2019
>Closed-Date:    Tue Aug 20 21:30:31 +0000 2024
>Last-Modified:  Tue Aug 20 21:30:31 +0000 2024
>Originator:     Tobias Nygren
>Release:        9.99.17
>Organization:
>Environment:
>Description:
The aarch64 kernel seems prone to deadlocking when softint_dispatch is re-entered while the idle loop is processing softints. I am running a LOCKDEBUG kernel booted with "netbsd -1" and can get it to hang a couple of times per day.

fp ffffffc060e6fa80 cpu_Debugger() at ffffffc000072f14 netbsd:cpu_Debugger
fp ffffffc060e6faf0 gicv3_fdt_intr() at ffffffc000053054 netbsd:gicv3_fdt_intr+0x1c
fp ffffffc060e6fb10 pic_dispatch() at ffffffc0000027b0 netbsd:pic_dispatch+0x28
fp ffffffc060e6fb40 gicv3_irq_handler() at ffffffc000006c04 netbsd:gicv3_irq_handler+0xe4
fp ffffffc060e6fbb0 interrupt() at ffffffc000074d74 netbsd:interrupt+0x2c
tf ffffffc060e6fbd0 el1_trap() at ffffffc000072df4 netbsd:el1_trap
---- trapframe 0xffffffc060e6fbd0 (304 bytes) ----
registers omitted (hangs here, above is me breaking to ddb)
------------------------------------------------
fp ffffffc060e6fd00 splx() at ffffffc000003fc4 netbsd:splx+0x84
fp ffffffc060e6fd30 softint_dispatch() at ffffffc0004710fc netbsd:softint_dispatch+0xec
fp ffffffc060e57720 cpu_switchto_softint() at ffffffc000072d84 netbsd:cpu_switchto_softint+0x68
tf ffffffc060e57770 el1_trap() at ffffffc000072df4 netbsd:el1_trap
---- trapframe 0xffffffc060e57770 (304 bytes) ----
registers omitted
------------------------------------------------
fp ffffffc060e578a0 vlan_transmit() at ffffffc000558e40 netbsd:vlan_transmit+0x180
fp ffffffc060e57930 ether_output() at ffffffc00053ccf0 netbsd:ether_output+0x238
fp ffffffc060e57990 ip_if_output() at ffffffc000223dd8 netbsd:ip_if_output+0x88
fp ffffffc060e579d0 ip_output() at ffffffc000225404 netbsd:ip_output+0xecc
fp ffffffc060e57b70 ip_forward() at ffffffc000221494 netbsd:ip_forward+0x13c
fp ffffffc060e57c00 ipintr() at ffffffc000221df4 netbsd:ipintr+0x39c
fp ffffffc060e57d30 softint_dispatch() at ffffffc000471110 netbsd:softint_dispatch+0x100
fp ffffffc060e47cc0 cpu_switchto_softint() at ffffffc000072d84 netbsd:cpu_switchto_softint+0x68
fp ffffffc060e47df8 cpu_idle() at ffffffc000073cf8 netbsd:cpu_idle+0x58
fp ffffffc060e47e40 idle_loop() at ffffffc000449a00 netbsd:idle_loop+0x188

>How-To-Repeat:
Exercise softint-utilizing code paths moderately on aarch64, for example use vlan(4).
>Fix:
N/A

>Release-Note:

>Audit-Trail:
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@netbsd.org
Cc: port-evbarm-maintainer@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org
Subject: re: port-evbarm/54688: aarch64 softint hang, reentrancy problem?
Date: Mon, 11 Nov 2019 13:48:50 +1100

 i think this problem is more than just from the idle loop.

 i've been able to reproduce this problem fairly easily with my
 rock64 and the builtin awge(4) when using non-NET_MPSAFE kernels with
 the MPSAFE interrupt handling also disabled in the fdt frontend.

 for me, the symptoms are usually pretty consistent:

 	- some thread, maybe user, maybe idle, maybe soft clock
 	  or maybe soft bio is running.

 	- it gets fast softint switched to the softnet lwp.  this
 	  runs for a bit..

 	- it gets fast softint switched to the softclock lwp.
 	  this hangs attempting to take the kernel_lock.

 however, i've never determined why the kernel_lock isn't free
 or where it is held from.

 i think it's related to fast softint switching.  note that when
 fast softints are active, you end up with stack traces that have
 all relevant threads in them.  eg, above i see:

    - few frames of softclock
    - exception
    - few frames of softnet
    - exception
    - frames of whatever was running normally here

 in the stack of the original lwp.  trying to trace the other
 lwps will stop at their start (eg, the softnet thread shows the
 first 3 parts above, and softclock only shows 1 part.)
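
 The layering described above can be sketched as a toy model. This is
 only an illustration of the observation, not ddb's unwinder and not
 NetBSD kernel code; the function name trace_parts and the list
 representation are hypothetical, and the model assumes a trace walks
 from the innermost frames outward and stops at the start of the lwp
 being traced.

```python
# Toy model of the trace layering described above. It is NOT ddb's unwinder;
# the function name and the list representation are made up for illustration.

def trace_parts(chain, target):
    """Return the parts a backtrace of `target` would show in this model.

    `chain` lists lwps innermost first (the most recently switched-to
    softint lwp first, the originally running lwp last). The walk starts
    at the innermost frames and stops once it has covered `target`.
    """
    parts = []
    for depth, lwp in enumerate(chain):
        if depth > 0:
            parts.append("exception")  # boundary left by the softint switch
        parts.append(f"frames of {lwp}")
        if lwp == target:
            break  # the trace stops at the start of the target lwp
    return parts


# Assumed scenario from the description: softclock interrupted softnet,
# which had interrupted whatever lwp was running originally.
chain = ["softclock", "softnet", "original lwp"]
for lwp in reversed(chain):
    print(f"{lwp}: {trace_parts(chain, lwp)}")
```

 In this model the original lwp's trace has all five parts, the softnet
 trace has the first three, and the softclock trace has one, matching
 the counts given above.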


 .mrg.

State-Changed-From-To: open->feedback
State-Changed-By: mrg@NetBSD.org
State-Changed-When: Tue, 20 Aug 2024 19:19:55 +0000
State-Changed-Why:
Tobias, have you seen this problem recently?  i haven't but i also have
been running NET_MPSAFE kernels on the host that saw it, and i haven't
been using it as heavily either.


State-Changed-From-To: feedback->closed
State-Changed-By: tnn@NetBSD.org
State-Changed-When: Tue, 20 Aug 2024 21:30:31 +0000
State-Changed-Why:
No longer reproducible. I recall discussing this issue with
the late ryo@ off-list and a fix was committed at some point.
I don't recall exactly what it was.


>Unformatted:
