NetBSD Problem Report #55485

From mrg@eterna.com.au  Mon Jul 13 10:13:08 2020
Return-Path: <mrg@eterna.com.au>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id D51721A921A
	for <gnats-bugs@gnats.NetBSD.org>; Mon, 13 Jul 2020 10:13:08 +0000 (UTC)
Message-Id: <20200713101304.B8DEE1590C3@splode.eterna.com.au>
Date: Mon, 13 Jul 2020 20:13:04 +1000 (AEST)
From: mrg@eterna.com.au
Reply-To: mrg@eterna.com.au
To: gnats-bugs@gnats.NetBSD.org
Subject: bwfm @ sdmmc hangs, with some diagnosis
X-Send-Pr-Version: 3.95

>Number:         55485
>Category:       kern
>Synopsis:       bwfm @ sdmmc hangs, with some diagnosis
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Jul 13 10:15:00 +0000 2020
>Last-Modified:  Sat Jul 18 04:20:01 +0000 2020
>Originator:     matthew green
>Release:        NetBSD 9.99.69
>Organization:
people's front against (bozotic) www (softwar foundation)
>Environment:
System: NetBSD hinotori.eterna23.net 9.99.68 NetBSD 9.99.68 (_hinotori_) #14: Sun Jul 12 22:34:00 PDT 2020  mrg@yesterday-when-i-was-mad.eterna23.net:/var/obj/evbarm-aarch64/usr/src2/sys/arch/evbarm/compile/_hinotori_ evbarm
Architecture: aarch64
Machine: evbarm
>Description:

	we've been tracking hangs in bwfm(4) on the pinebookpro for a
	while and most of the time there is no information, but i've
	found some useful data related to this.

	first, NET_MPSAFE kernel doesn't hard-hang the kernel when the
	problem occurs.  this is because he paths into the bwfm code
	that hang aren't holding kernel_lock when they hang.  you only
	get the specific network interface to hang.

	second, in the reproduction i just had, i found that the
	hung 'ifconfig -a' command was stuck waiting for the bwfm
	adaptive mutex in bwfm_sdio_txctl() with this trace:

fp ffffc00088d07780 mi_switch() at ffffc000004abecc netbsd:mi_switch+0x20c
fp ffffc00088d077e0 sleepq_block() at ffffc000004a88bc netbsd:sleepq_block+0x9c
fp ffffc00088d07820 turnstile_block() at ffffc000004ba0bc netbsd:turnstile_block+0x30c
fp ffffc00088d07890 mutex_enter() at ffffc0000048f58c netbsd:mutex_enter+0x1cc
fp ffffc00088d07910 bwfm_sdio_txctl() at ffffc000000ee9c0 netbsd:bwfm_sdio_txctl+0x60
fp ffffc00088d07940 bwfm_proto_bcdc_query_dcmd() at ffffc000001ea7d4 netbsd:bwfm_proto_bcdc_query_dcmd+0xd4
fp ffffc00088d079a0 bwfm_fwvar_var_get_data() at ffffc000001ec61c netbsd:bwfm_fwvar_var_get_data+0x9c
fp ffffc00088d07a00 bwfm_get_sta_info() at ffffc000001ed5c8 netbsd:bwfm_get_sta_info+0x58
fp ffffc00088d07b60 bwfm_ioctl() at ffffc000001ed808 netbsd:bwfm_ioctl+0x178
fp ffffc00088d07ba0 doifioctl() at ffffc0000057e778 netbsd:doifioctl+0x8f8
fp ffffc00088d07d40 sys_ioctl() at ffffc000004ee770 netbsd:sys_ioctl+0x420
fp ffffc00088d07e20 syscall() at ffffc0000008da2c netbsd:syscall+0x18c
tf ffffc00088d07ed0 el0_trap() at ffffc0000008c524 netbsd:el0_trap

	this lock is:

* Lock 0 (initialized at bwfm_sdio_attach)
lock address : 0xffffc00001b2d180 type     :     sleep/adaptive
initialized  : 0xffffc000000ef710
shared holds :                  0 exclusive:                  1
shares wanted:                  0 exclusive:                  1
relevant cpu :                  5 last held:                  5
relevant lwp : 0xffff0000f687b080 last held: 0xffff0000f687b080
last locked* : 0xffffc000000efb00 unlocked : 0xffffc000000efb98
owner field  : 0xffff0000f687b080 wait/spin:                1/0
Turnstile:
=> 0 waiting readers:
=> 1 waiting writers: 0xffff00002af94b00

	it is held by the sdmmc task thread for bwfm:

fp ffffc00088008b00 splx() at ffffc000000041b0 netbsd:splx+0x80
fp ffffc00088008b30 callout_halt() at ffffc000004b9444 netbsd:callout_halt+0x54
fp ffffc00088008b60 sleepq_block() at ffffc000004a8940 netbsd:sleepq_block+0x120
fp ffffc00088008ba0 cv_timedwait() at ffffc0000046d71c netbsd:cv_timedwait+0x11c
fp ffffc00088008be0 cv_timedwaitbt() at ffffc0000046da74 netbsd:cv_timedwaitbt+0x74
fp ffffc00088008c20 dwc_mmc_exec_command() at ffffc000001e5ac4 netbsd:dwc_mmc_exec_command+0x4b4
fp ffffc00088008ca0 sdmmc_mmc_command() at ffffc000000e80d0 netbsd:sdmmc_mmc_command+0x40
fp ffffc00088008cd0 sdmmc_io_rw_extended() at ffffc000000e9374 netbsd:sdmmc_io_rw_extended+0xa4
fp ffffc00088008d90 sdmmc_io_read_4() at ffffc000000e9d74 netbsd:sdmmc_io_read_4+0x24
fp ffffc00088008db0 bwfm_sdio_task() at ffffc000000efb24 netbsd:bwfm_sdio_task+0x54
fp ffffc00088008e90 sdmmc_task_thread() at ffffc000000e8a00 netbsd:sdmmc_task_thread+0xb0

	this thread also holds this lock (which seems unrelated):

* Lock 1 (initialized at dwc_mmc_init)
lock address : 0xffff0000f68f0250 type     :     sleep/adaptive
initialized  : 0xffffc000001e6108
shared holds :                  0 exclusive:                  1
shares wanted:                  0 exclusive:                  0
relevant cpu :                  5 last held:                  5
relevant lwp : 0xffff0000f687b080 last held: 0xffff0000f687b080
last locked* : 0xffffc000001e5634 unlocked : 0xffffc000001e5cbc
owner field  : 0xffff0000f687b080 wait/spin:                0/0
Turnstile: no active turnstile for this lock.

	this task thread is spinning waiting for a condition:

    574 static void
    575 dwc_mmc_exec_command(sdmmc_chipset_handle_t sch, struct sdmmc_command *cmd)
...
    722         struct bintime timeout = { .sec = 15, .frac = 0 };
    723         const struct bintime epsilon = { .sec = 1, .frac = 0 };
    724         while (!ISSET(cmd->c_flags, SCF_ITSDONE)) {
    725                 error = cv_timedwaitbt(&sc->sc_intr_cv,
    726                     &sc->sc_intr_lock, &timeout, &epsilon);
    727                 if (error != 0) {
    728                         cmd->c_error = error;
    729                         SET(cmd->c_flags, SCF_ITSDONE);
    730                         mutex_exit(&sc->sc_intr_lock);
    731                         goto done;
    732                 }
    733         }

this appears to be a loop that should exist after 15 seconds if
the condition fails to come true, but it isn't exiting.  this
leaves both the dwc_mmc_init() mutex and the bwfm_sdio_attach()
mutex locked while this doesn't exit, and the bwfm hangs.

as i was attemting to look at the in-memory contents of the
timeout and epsilon values on the stack, i lost ddb and i had
to power cycle.

>How-To-Repeat:

	use bwfm.  use a lossy network for faster hangs.

>Fix:

>Audit-Trail:
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org
Subject: re: kern/55485: bwfm @ sdmmc hangs, with some diagnosis
Date: Sat, 18 Jul 2020 14:16:56 +1000

 more data.  and some likely less broken behaviour fixes, if not
 a full fix.

 with some help from riastradh@ and jmcneill@ i've tracked down
 that when the failure is occuring, the only thing looping is
 an interrupt.  we appear to be in dwc_mmc_exec_command() all
 the time, but kernhist logging shows that we re-enter this
 function many times (one time, 1000 times in 60ms.)

 initially, byt checking what bits were set in the
 BWFM_SDPCMD_INTSTATUS register, we determined that there was
 the SDPCMD_INTSTATUS_XMTDATA_AVAIL bit set, which is not used
 by our driver.  then we noticed that the interrupt mask reg
 was set to 0xffffffff, with commented version asking for 2
 bits we are interested in.  this allowed my test case to work
 for about 10x longer than average before failing, and the
 failure mode was less hard-hang of bwfm itself.   eg, when
 i ran 'ifconfig -a', it took 15 seconds, but it did complete,
 and 'ifconfig bwfm down up' restored functionality.  (there
 are 3x5s slow periods here, each accompanied by a 'checksum
 error' message on the console.)

 the patch below helps a lot, but there are still stability.


 .mrg.


 Index: if_bwfm_sdio.c
 ===================================================================
 RCS file: /cvsroot/src/sys/dev/sdmmc/if_bwfm_sdio.c,v
 retrieving revision 1.19
 diff -p -u -r1.19 if_bwfm_sdio.c
 --- if_bwfm_sdio.c	23 Jun 2020 10:09:33 -0000	1.19
 +++ if_bwfm_sdio.c	18 Jul 2020 04:13:51 -0000
 @@ -505,9 +505,12 @@ bwfm_sdio_attachhook(device_t self)
  		goto err;
  	}

 -//	bwfm_sdio_dev_write(sc, SDPCMD_HOSTINTMASK,
 -//	    SDPCMD_INTSTATUS_HMB_SW_MASK | SDPCMD_INTSTATUS_CHIPACTIVE);
 +#if 1
 +	bwfm_sdio_dev_write(sc, SDPCMD_HOSTINTMASK,
 +	    SDPCMD_INTSTATUS_HMB_SW_MASK | SDPCMD_INTSTATUS_CHIPACTIVE);
 +#else
  	bwfm_sdio_dev_write(sc, SDPCMD_HOSTINTMASK, 0xffffffff);
 +#endif
  	bwfm_sdio_write_1(sc, BWFM_SDIO_WATERMARK, 8);

  	if (bwfm_chip_sr_capable(bwfm)) {

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.