NetBSD Problem Report #55512
From martin@duskware.de Wed Jul 22 10:51:40 2020
Return-Path: <martin@duskware.de>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 35A391A9213
for <gnats-bugs@gnats.NetBSD.org>; Wed, 22 Jul 2020 10:51:40 +0000 (UTC)
From: martin@NetBSD.org
Reply-To: martin@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: sdmmc locking broken
X-Send-Pr-Version: 3.95
>Number: 55512
>Category: kern
>Synopsis: sdmmc locking broken
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Jul 22 10:55:00 +0000 2020
>Last-Modified: Wed Jul 22 15:20:01 +0000 2020
>Originator: Martin Husemann
>Release: NetBSD 9.99.69
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD space-truckin.duskware.de 9.99.69 NetBSD 9.99.69 (GENERIC) #93: Wed Jul 22 12:40:37 CEST 2020 martin@seven-days-to-the-wolves.aprisoft.de:/work/src/sys/arch/evbarm/compile/GENERIC evbarm
Architecture: earmv7hfeb
Machine: evbarm
>Description:
The locking in the sdmmc driver is broken (quite obviously) broken and it
actually works sometimes only by chance.
Not sure this is caused by it, but I do get:
[ 5.2286102] sunximmc1: WARNING: driver submitted a command while the controller was busy
[ 5.2386127] sdmmc1: extended I/O error 16, r=40992 p=0xc3ef6e94 l=4 read
[ 5.3486139] bwfm0: CHIPACTIVE
[ 5.3486139] bwfm0: wl0: May 16 2018 23:42:49 version 5.90.244 FWID 01-0
For broken locking just check the source:
sdmmcvar.h:#define SDMMC_LOCK(sc)
sdmmcvar.h:#define SDMMC_UNLOCK(sc)
and notice that *some* calls of SDMMC_LOCK/SDMMC_UNLOCK have additional
locks around them:
sdmmc_mem.c: SDMMC_LOCK(sc);
sdmmc_mem.c- mutex_enter(&sc->sc_mtx);
... but some do not.
Blindly removing all SDMMC_LOCK/SDMMC_UNLOCK and replacing them with sc_mtx
locks unfortunately does not work (causes locking against myself panics
during interrupt establish) - but probably is a first step forward and then
quite easy to cleanup.
>How-To-Repeat:
code inspection
>Fix:
n/a
>Audit-Trail:
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/55512: sdmmc locking broken
Date: Wed, 22 Jul 2020 12:29:29 -0000 (UTC)
martin@NetBSD.org writes:
>For broken locking just check the source:
>sdmmcvar.h:#define SDMMC_LOCK(sc)
>sdmmcvar.h:#define SDMMC_UNLOCK(sc)
>and notice that *some* calls of SDMMC_LOCK/SDMMC_UNLOCK have additional
>locks around them:
You see that SDMMC_LOCK/UNLOCK are just dummies.
>Blindly removing all SDMMC_LOCK/SDMMC_UNLOCK and replacing them with sc_mtx
>locks unfortunately does not work (causes locking against myself panics
>during interrupt establish) - but probably is a first step forward and then
>quite easy to cleanup.
Or just remove the macros as the macros were ignored when locking was added
later. The code also deviates from the OpenBSD original and from the current
OpenBSD code.
Your original issue:
[ 5.2286102] sunximmc1: WARNING: driver submitted a command while the controller was busy
[ 5.2386127] sdmmc1: extended I/O error 16, r=40992 p=0xc3ef6e94 l=4 read
[ 5.3486139] bwfm0: CHIPACTIVE
[ 5.3486139] bwfm0: wl0: May 16 2018 23:42:49 version 5.90.244 FWID 01-0
The sunxi sdhc driver fails concurrent command requests which is fine for
memory cards. But SDIO allows for some concurrent commands and the check
is bogus and doesn't exist in other sdhc drivers.
The same message with a kernel from last December and some debugging
extra:
sunximmc1: WARNING: driver submitted command 0035 while the controller was busy with 0034
0x9a23de2c: netbsd:sdmmc_mmc_command+0x44
0x9a23ded4: 909f2000
sdmmc1: extended I/O error 16, r=40992 p=0x9a23ded4 l=4 read
An extended I/O command (multiple data bytes) was issued while a direct I/O
command (single data bytes or functions) command was busy.
That's probably not allowed (the concurrent commands are for e.g.
aborting a running extended I/O command).
r=40992 (0xa020) should be the bwfm INTSTATUS register, the interrupt
is probably triggered by the chipactive condition. Reading the status
failed, but since it isn't cleared either, the interrupt condition
persists and the interrupt is still handled later.
The busy direct I/O command might be the driver polling the chip
for enabled clocks in bwfm_sdio_buscore_prepare.
The end of the attach phase is the only place where concurrent I/O
can happen, and you see that it does (with little effect).
To avoid this, the end of the attach phase needs either to be handled
by sdmmc tasks (I was looking at the chipactive interrupt) or the tasks
that handle interrupts must be deferred until the attachment code
that accesses the chip is finished.
--
--
Michael van Elst
Internet: mlelstv@serpens.de
"A potential Snark may lurk in every tree."
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/55512: sdmmc locking broken
Date: Wed, 22 Jul 2020 15:23:13 +0200
On Wed, Jul 22, 2020 at 12:35:02PM +0000, Michael van Elst wrote:
> Or just remove the macros as the macros were ignored when locking was added
> later.
That would be fine, but still the locking seems to be broken. I'm
pretty sure I saw concurrent cmds issued by different CPUs while trying
to debug my eMMC issue (the strange timings you noticed in my debug
lock on tech-kern) on Odroid C2.
Martin
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/55512: sdmmc locking broken
Date: Wed, 22 Jul 2020 15:19:18 -0000 (UTC)
martin@duskware.de (Martin Husemann) writes:
> That would be fine, but still the locking seems to be broken. I'm
> pretty sure I saw concurrent cmds issued by different CPUs while trying
> to debug my eMMC issue (the strange timings you noticed in my debug
> lock on tech-kern) on Odroid C2.
Actually I noticed that the driver returned a timeout while almost
no time had passed, definitely not qualifying for a timeout.
kprintf uses getnanouptime(), I'm not sure if readouts on
different CPUs are guaranteed to be monotonic.
--
--
Michael van Elst
Internet: mlelstv@serpens.de
"A potential Snark may lurk in every tree."
(Contact us)
$NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.