NetBSD Problem Report #55512

From martin@duskware.de  Wed Jul 22 10:51:40 2020
Return-Path: <martin@duskware.de>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 35A391A9213
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 22 Jul 2020 10:51:40 +0000 (UTC)
From: martin@NetBSD.org
Reply-To: martin@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: sdmmc locking broken
X-Send-Pr-Version: 3.95

>Number:         55512
>Category:       kern
>Synopsis:       sdmmc locking broken
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Jul 22 10:55:00 +0000 2020
>Last-Modified:  Wed Jul 22 15:20:01 +0000 2020
>Originator:     Martin Husemann
>Release:        NetBSD 9.99.69
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD space-truckin.duskware.de 9.99.69 NetBSD 9.99.69 (GENERIC) #93: Wed Jul 22 12:40:37 CEST 2020 martin@seven-days-to-the-wolves.aprisoft.de:/work/src/sys/arch/evbarm/compile/GENERIC evbarm
Architecture: earmv7hfeb
Machine: evbarm
>Description:

The locking in the sdmmc driver is broken (quite obviously) broken and it
actually works sometimes only by chance.

Not sure this is caused by it, but I do get:

[   5.2286102] sunximmc1: WARNING: driver submitted a command while the controller was busy
[   5.2386127] sdmmc1: extended I/O error 16, r=40992 p=0xc3ef6e94 l=4 read
[   5.3486139] bwfm0: CHIPACTIVE
[   5.3486139] bwfm0: wl0: May 16 2018 23:42:49 version 5.90.244 FWID 01-0

For broken locking just check the source:

sdmmcvar.h:#define      SDMMC_LOCK(sc)
sdmmcvar.h:#define      SDMMC_UNLOCK(sc)

and notice that *some* calls of SDMMC_LOCK/SDMMC_UNLOCK have additional
locks around them:

sdmmc_mem.c:    SDMMC_LOCK(sc);
sdmmc_mem.c-    mutex_enter(&sc->sc_mtx);

... but some do not.

Blindly removing all SDMMC_LOCK/SDMMC_UNLOCK and replacing them with sc_mtx
locks unfortunately does not work (causes locking against myself panics
during interrupt establish) - but probably is a first step forward and then
quite easy to cleanup.

>How-To-Repeat:
code inspection

>Fix:
n/a

>Audit-Trail:
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/55512: sdmmc locking broken
Date: Wed, 22 Jul 2020 12:29:29 -0000 (UTC)

 martin@NetBSD.org writes:

 >For broken locking just check the source:

 >sdmmcvar.h:#define      SDMMC_LOCK(sc)
 >sdmmcvar.h:#define      SDMMC_UNLOCK(sc)

 >and notice that *some* calls of SDMMC_LOCK/SDMMC_UNLOCK have additional
 >locks around them:

 You see that SDMMC_LOCK/UNLOCK are just dummies.


 >Blindly removing all SDMMC_LOCK/SDMMC_UNLOCK and replacing them with sc_mtx
 >locks unfortunately does not work (causes locking against myself panics
 >during interrupt establish) - but probably is a first step forward and then
 >quite easy to cleanup.

 Or just remove the macros as the macros were ignored when locking was added
 later. The code also deviates from the OpenBSD original and from the current
 OpenBSD code.


 Your original issue:

 [   5.2286102] sunximmc1: WARNING: driver submitted a command while the controller was busy
 [   5.2386127] sdmmc1: extended I/O error 16, r=40992 p=0xc3ef6e94 l=4 read
 [   5.3486139] bwfm0: CHIPACTIVE
 [   5.3486139] bwfm0: wl0: May 16 2018 23:42:49 version 5.90.244 FWID 01-0

 The sunxi sdhc driver fails concurrent command requests which is fine for
 memory cards. But SDIO allows for some concurrent commands and the check
 is bogus and doesn't exist in other sdhc drivers.

 The same message with a kernel from last December and some debugging
 extra:

 sunximmc1: WARNING: driver submitted command 0035 while the controller was busy with 0034
 0x9a23de2c: netbsd:sdmmc_mmc_command+0x44
 0x9a23ded4: 909f2000
 sdmmc1: extended I/O error 16, r=40992 p=0x9a23ded4 l=4 read

 An extended I/O command (multiple data bytes) was issued while a direct I/O
 command (single data bytes or functions) command was busy.

 That's probably not allowed (the concurrent commands are for e.g.
 aborting a running extended I/O command).

 r=40992 (0xa020) should be the bwfm INTSTATUS register, the interrupt
 is probably triggered by the chipactive condition. Reading the status
 failed, but since it isn't cleared either, the interrupt condition
 persists and the interrupt is still handled later.

 The busy direct I/O command might be the driver polling the chip
 for enabled clocks in bwfm_sdio_buscore_prepare.

 The end of the attach phase is the only place where concurrent I/O
 can happen, and you see that it does (with little effect).

 To avoid this, the end of the attach phase needs either to be handled
 by sdmmc tasks (I was looking at the chipactive interrupt) or the tasks
 that handle interrupts must be deferred until the attachment code
 that accesses the chip is finished.

 -- 
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/55512: sdmmc locking broken
Date: Wed, 22 Jul 2020 15:23:13 +0200

 On Wed, Jul 22, 2020 at 12:35:02PM +0000, Michael van Elst wrote:
 >  Or just remove the macros as the macros were ignored when locking was added
 >  later.

 That would be fine, but still the locking seems to be broken. I'm
 pretty sure I saw concurrent cmds issued by different CPUs while trying
 to debug my eMMC issue (the strange timings you noticed in my debug
 lock on tech-kern) on Odroid C2.

 Martin

From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/55512: sdmmc locking broken
Date: Wed, 22 Jul 2020 15:19:18 -0000 (UTC)

 martin@duskware.de (Martin Husemann) writes:

 > That would be fine, but still the locking seems to be broken. I'm
 > pretty sure I saw concurrent cmds issued by different CPUs while trying
 > to debug my eMMC issue (the strange timings you noticed in my debug
 > lock on tech-kern) on Odroid C2.

 Actually I noticed that the driver returned a timeout while almost
 no time had passed, definitely not qualifying for a timeout.

 kprintf uses getnanouptime(), I'm not sure if readouts on
 different CPUs are guaranteed to be monotonic.

 -- 
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."
Home
PR Database Search
(Contact us) $NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.