NetBSD Problem Report #55485
From mrg@eterna.com.au Mon Jul 13 10:13:08 2020
Return-Path: <mrg@eterna.com.au>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id D51721A921A
for <gnats-bugs@gnats.NetBSD.org>; Mon, 13 Jul 2020 10:13:08 +0000 (UTC)
Message-Id: <20200713101304.B8DEE1590C3@splode.eterna.com.au>
Date: Mon, 13 Jul 2020 20:13:04 +1000 (AEST)
From: mrg@eterna.com.au
Reply-To: mrg@eterna.com.au
To: gnats-bugs@gnats.NetBSD.org
Subject: bwfm @ sdmmc hangs, with some diagnosis
X-Send-Pr-Version: 3.95
>Number: 55485
>Category: kern
>Synopsis: bwfm @ sdmmc hangs, with some diagnosis
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Jul 13 10:15:00 +0000 2020
>Last-Modified: Sat Jul 18 04:20:01 +0000 2020
>Originator: matthew green
>Release: NetBSD 9.99.69
>Organization:
people's front against (bozotic) www (softwar foundation)
>Environment:
System: NetBSD hinotori.eterna23.net 9.99.68 NetBSD 9.99.68 (_hinotori_) #14: Sun Jul 12 22:34:00 PDT 2020 mrg@yesterday-when-i-was-mad.eterna23.net:/var/obj/evbarm-aarch64/usr/src2/sys/arch/evbarm/compile/_hinotori_ evbarm
Architecture: aarch64
Machine: evbarm
>Description:
we've been tracking hangs in bwfm(4) on the pinebookpro for a
while and most of the time there is no information, but i've
found some useful data related to this.
first, NET_MPSAFE kernel doesn't hard-hang the kernel when the
problem occurs. this is because he paths into the bwfm code
that hang aren't holding kernel_lock when they hang. you only
get the specific network interface to hang.
second, in the reproduction i just had, i found that the
hung 'ifconfig -a' command was stuck waiting for the bwfm
adaptive mutex in bwfm_sdio_txctl() with this trace:
fp ffffc00088d07780 mi_switch() at ffffc000004abecc netbsd:mi_switch+0x20c
fp ffffc00088d077e0 sleepq_block() at ffffc000004a88bc netbsd:sleepq_block+0x9c
fp ffffc00088d07820 turnstile_block() at ffffc000004ba0bc netbsd:turnstile_block+0x30c
fp ffffc00088d07890 mutex_enter() at ffffc0000048f58c netbsd:mutex_enter+0x1cc
fp ffffc00088d07910 bwfm_sdio_txctl() at ffffc000000ee9c0 netbsd:bwfm_sdio_txctl+0x60
fp ffffc00088d07940 bwfm_proto_bcdc_query_dcmd() at ffffc000001ea7d4 netbsd:bwfm_proto_bcdc_query_dcmd+0xd4
fp ffffc00088d079a0 bwfm_fwvar_var_get_data() at ffffc000001ec61c netbsd:bwfm_fwvar_var_get_data+0x9c
fp ffffc00088d07a00 bwfm_get_sta_info() at ffffc000001ed5c8 netbsd:bwfm_get_sta_info+0x58
fp ffffc00088d07b60 bwfm_ioctl() at ffffc000001ed808 netbsd:bwfm_ioctl+0x178
fp ffffc00088d07ba0 doifioctl() at ffffc0000057e778 netbsd:doifioctl+0x8f8
fp ffffc00088d07d40 sys_ioctl() at ffffc000004ee770 netbsd:sys_ioctl+0x420
fp ffffc00088d07e20 syscall() at ffffc0000008da2c netbsd:syscall+0x18c
tf ffffc00088d07ed0 el0_trap() at ffffc0000008c524 netbsd:el0_trap
this lock is:
* Lock 0 (initialized at bwfm_sdio_attach)
lock address : 0xffffc00001b2d180 type : sleep/adaptive
initialized : 0xffffc000000ef710
shared holds : 0 exclusive: 1
shares wanted: 0 exclusive: 1
relevant cpu : 5 last held: 5
relevant lwp : 0xffff0000f687b080 last held: 0xffff0000f687b080
last locked* : 0xffffc000000efb00 unlocked : 0xffffc000000efb98
owner field : 0xffff0000f687b080 wait/spin: 1/0
Turnstile:
=> 0 waiting readers:
=> 1 waiting writers: 0xffff00002af94b00
it is held by the sdmmc task thread for bwfm:
fp ffffc00088008b00 splx() at ffffc000000041b0 netbsd:splx+0x80
fp ffffc00088008b30 callout_halt() at ffffc000004b9444 netbsd:callout_halt+0x54
fp ffffc00088008b60 sleepq_block() at ffffc000004a8940 netbsd:sleepq_block+0x120
fp ffffc00088008ba0 cv_timedwait() at ffffc0000046d71c netbsd:cv_timedwait+0x11c
fp ffffc00088008be0 cv_timedwaitbt() at ffffc0000046da74 netbsd:cv_timedwaitbt+0x74
fp ffffc00088008c20 dwc_mmc_exec_command() at ffffc000001e5ac4 netbsd:dwc_mmc_exec_command+0x4b4
fp ffffc00088008ca0 sdmmc_mmc_command() at ffffc000000e80d0 netbsd:sdmmc_mmc_command+0x40
fp ffffc00088008cd0 sdmmc_io_rw_extended() at ffffc000000e9374 netbsd:sdmmc_io_rw_extended+0xa4
fp ffffc00088008d90 sdmmc_io_read_4() at ffffc000000e9d74 netbsd:sdmmc_io_read_4+0x24
fp ffffc00088008db0 bwfm_sdio_task() at ffffc000000efb24 netbsd:bwfm_sdio_task+0x54
fp ffffc00088008e90 sdmmc_task_thread() at ffffc000000e8a00 netbsd:sdmmc_task_thread+0xb0
this thread also holds this lock (which seems unrelated):
* Lock 1 (initialized at dwc_mmc_init)
lock address : 0xffff0000f68f0250 type : sleep/adaptive
initialized : 0xffffc000001e6108
shared holds : 0 exclusive: 1
shares wanted: 0 exclusive: 0
relevant cpu : 5 last held: 5
relevant lwp : 0xffff0000f687b080 last held: 0xffff0000f687b080
last locked* : 0xffffc000001e5634 unlocked : 0xffffc000001e5cbc
owner field : 0xffff0000f687b080 wait/spin: 0/0
Turnstile: no active turnstile for this lock.
this task thread is spinning waiting for a condition:
574 static void
575 dwc_mmc_exec_command(sdmmc_chipset_handle_t sch, struct sdmmc_command *cmd)
...
722 struct bintime timeout = { .sec = 15, .frac = 0 };
723 const struct bintime epsilon = { .sec = 1, .frac = 0 };
724 while (!ISSET(cmd->c_flags, SCF_ITSDONE)) {
725 error = cv_timedwaitbt(&sc->sc_intr_cv,
726 &sc->sc_intr_lock, &timeout, &epsilon);
727 if (error != 0) {
728 cmd->c_error = error;
729 SET(cmd->c_flags, SCF_ITSDONE);
730 mutex_exit(&sc->sc_intr_lock);
731 goto done;
732 }
733 }
this appears to be a loop that should exist after 15 seconds if
the condition fails to come true, but it isn't exiting. this
leaves both the dwc_mmc_init() mutex and the bwfm_sdio_attach()
mutex locked while this doesn't exit, and the bwfm hangs.
as i was attemting to look at the in-memory contents of the
timeout and epsilon values on the stack, i lost ddb and i had
to power cycle.
>How-To-Repeat:
use bwfm. use a lossy network for faster hangs.
>Fix:
>Audit-Trail:
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: re: kern/55485: bwfm @ sdmmc hangs, with some diagnosis
Date: Sat, 18 Jul 2020 14:16:56 +1000
more data. and some likely less broken behaviour fixes, if not
a full fix.
with some help from riastradh@ and jmcneill@ i've tracked down
that when the failure is occuring, the only thing looping is
an interrupt. we appear to be in dwc_mmc_exec_command() all
the time, but kernhist logging shows that we re-enter this
function many times (one time, 1000 times in 60ms.)
initially, byt checking what bits were set in the
BWFM_SDPCMD_INTSTATUS register, we determined that there was
the SDPCMD_INTSTATUS_XMTDATA_AVAIL bit set, which is not used
by our driver. then we noticed that the interrupt mask reg
was set to 0xffffffff, with commented version asking for 2
bits we are interested in. this allowed my test case to work
for about 10x longer than average before failing, and the
failure mode was less hard-hang of bwfm itself. eg, when
i ran 'ifconfig -a', it took 15 seconds, but it did complete,
and 'ifconfig bwfm down up' restored functionality. (there
are 3x5s slow periods here, each accompanied by a 'checksum
error' message on the console.)
the patch below helps a lot, but there are still stability.
.mrg.
Index: if_bwfm_sdio.c
===================================================================
RCS file: /cvsroot/src/sys/dev/sdmmc/if_bwfm_sdio.c,v
retrieving revision 1.19
diff -p -u -r1.19 if_bwfm_sdio.c
--- if_bwfm_sdio.c 23 Jun 2020 10:09:33 -0000 1.19
+++ if_bwfm_sdio.c 18 Jul 2020 04:13:51 -0000
@@ -505,9 +505,12 @@ bwfm_sdio_attachhook(device_t self)
goto err;
}
-// bwfm_sdio_dev_write(sc, SDPCMD_HOSTINTMASK,
-// SDPCMD_INTSTATUS_HMB_SW_MASK | SDPCMD_INTSTATUS_CHIPACTIVE);
+#if 1
+ bwfm_sdio_dev_write(sc, SDPCMD_HOSTINTMASK,
+ SDPCMD_INTSTATUS_HMB_SW_MASK | SDPCMD_INTSTATUS_CHIPACTIVE);
+#else
bwfm_sdio_dev_write(sc, SDPCMD_HOSTINTMASK, 0xffffffff);
+#endif
bwfm_sdio_write_1(sc, BWFM_SDIO_WATERMARK, 8);
if (bwfm_chip_sr_capable(bwfm)) {
(Contact us)
$NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.