NetBSD Problem Report #53714

From gson@gson.org  Sat Nov 10 12:05:22 2018
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 696857A1B1
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 10 Nov 2018 12:05:22 +0000 (UTC)
Message-Id: <20181110120515.D6EA59893F6@guava.gson.org>
Date: Sat, 10 Nov 2018 14:05:15 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: Panic during recovery from IDE disk error
X-Send-Pr-Version: 3.95

>Number:         53714
>Category:       kern
>Synopsis:       Panic during recovery from IDE disk error
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    jdolecek
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Nov 10 12:10:00 +0000 2018
>Closed-Date:    Tue Nov 13 08:12:09 +0000 2018
>Last-Modified:  Tue Nov 13 08:12:09 +0000 2018
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current, source date 2018.11.09.15.20.36
>Organization:
>Environment:
System: NetBSD
Architecture: i386
Machine: i386
>Description:

Running the ATF tests on a debug build of NetBSD-current/i386 under
qemu 3.0.0 on my own testbed, there were some disk errors followed by
an attempted downgrade to PIO mode 4 and then a panic.  This PR is
specifically about the panic, because that's clearly a NetBSD bug,
unlike the disk errors which may be a qemu issue.

The console showed:

bin/sh/t_patterns (17/768): 3 test cases
    case_matching: [43.106274s] Passed.
    filename_expansion: piixide0:0:0: lost interrupt
[ 1155.0753536]         type: ata tc_bcount: 16384 tc_skip: 0
piixide0:0:0: bus-master DMA error: missing interrupt, status=0x61
[ 1155.3570396] wd0a: DMA error writing fsbn 3779616 of 3779616-3779647 (wd0 bn 3779679; cn 3749 tn 10 sn 57), xfer 1f4, retry 0
piixide0:0:0: lost interrupt
[ 1177.2215340]         type: ata tc_bcount: 16384 tc_skip: 0
piixide0:0:0: bus-master DMA error: missing interrupt, status=0x61
[ 1178.1212942] wd0a: DMA error writing fsbn 3779584 of 3779584-3779615 (wd0 bn 3779647; cn 3749 tn 10 sn 25), xfer c8, retry 0
[ 1178.1271228] wd0: soft error (corrected) xfer 1f4
[ 1178.9967395] wd0: soft error (corrected) xfer c8
piixide0:0:0: lost interrupt
[ 1258.5017277]         type: ata tc_bcount: 16384 tc_skip: 0
piixide0:0:0: bus-master DMA error: missing interrupt, status=0x61
[ 1258.7546516] wd0a: DMA error writing fsbn 3779136 of 3779136-3779167 (wd0 bn 3779199; cn 3749 tn 3 sn 18), xfer 9c4, retry 0
[ 1259.6734782] wd0: soft error (corrected) xfer 9c4
piixide0:0:0: lost interrupt
[ 1279.6505127]         type: ata tc_bcount: 16384 tc_skip: 0
piixide0:0:0: bus-master DMA error: missing interrupt, status=0x61
[ 1280.7308735] wd0: transfer error, downgrading to PIO mode 4
[ 1283.7601240] panic: kernel diagnostic assertion "mutex_owned(&chp->ch_lock)" failed: file "/usr/src/sys/dev/ata/ata_subr.c", line 275 
[ 1283.8202761] cpu0: Begin traceback...
[ 1316.7572251] vpanic(c116553c,cdcb5e34,cdcb5e34,cdcb5e44,c01af986,c116553c,c11653dd,c116551f,c11653a0,113) at netbsd:vpanic+0x1b2
[ 1316.8132134] kern_assert(c116553c,c11653dd,c116551f,c11653a0,113,cdcb5e80,c01ad669,c2207108,2,c21fe064) at netbsd:kern_assert+0x27
[ 1316.8332269] ata_channel_lock_owned(c2207108,2,c21fe064,0,1,4,c21fe064,c2207000,c2207000,c2272428) at netbsd:ata_channel_lock_owned+0x4c
[ 1316.8732393] ata_thread_run(c2207108,0,200,ff,c2207108,c2207000,c2275d08,0,cdcb5eb8,c01ac108) at netbsd:ata_thread_run+0x1e
[ 1316.9132490] ata_downgrade_mode(c2272008,0,cdcb5f00,c01a9c72,c2272008,0,50,0,10,cdcb5edc) at netbsd:ata_downgrade_mode+0x16b
[ 1316.9332531] ata_dmaerr(c2272008,0,50,0,10,cdcb5edc,c18c2a60,50,2,c2207000) at netbsd:ata_dmaerr+0x40
[ 1316.9932775] wdc_ata_bio_intr(c2207108,c22739c4,1,c2207108,cdcb5f30,c2207108,c2207000,2,c22739c4,c1778d80) at netbsd:wdc_ata_bio_intr+0x3c1
[ 1317.0532938] wdctimeout(c2207108,c0100e1f,cdcb5f74,2,8,c20b3540,c18c2a60,c2207118,129df,8) at netbsd:wdctimeout+0x203
[ 1317.0732976] callout_softclock(0,0,0,0,0,0,0,0,b9d4c6,cdc2520c) at netbsd:callout_softclock+0x579
[ 1317.1132967] softint_execute(cdc25074,c20b3540,2,c20b3540,c20b3d20,c20b3540,cdc25074,0,cdc22000,c20a58c8) at netbsd:softint_execute+0x23e
[ 1317.1332991] softint_dispatch(c20b3d20,2,0,0,0,0,cdcb8ff0,cdcb8f98,c20b32a0,80050033) at netbsd:softint_dispatch+0xd7
[ 1317.1332991] Bad frame pointer: 0xcdc24ec4
[ 1317.1332991] cpu0: End traceback...

[ 1317.1332991] dumping to dev 0,1 offset 1574408
[ 1317.1332991] dump pmap_kenter_pa: mapping already present
[ 1317.1332991] pmap_kenter_pa: mapping already present
[ 1317.1332991] pmap_kenter_pa: mapping already present
[ 1317.1332991] pmap_kenter_pa: mapping already present
[ 1317.1332991] pmap_kenter_pa: mapping already present
[ 1317.1332991] pmap_kenter_pa: mapping already present
[ 1317.1332991] pmap_kenter_pa: mapping already present
[ 1317.1332991] pmap_kenter_pa: mapping already present
(etc)

>How-To-Repeat:

Don't know, only happened once so far.

>Fix:

>Release-Note:

>Audit-Trail:
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/53714: Panic during recovery from IDE disk error
Date: Sat, 10 Nov 2018 12:53:37 -0000 (UTC)

 gson@gson.org (Andreas Gustafsson) writes:

 >[ 1316.8732393] ata_thread_run(c2207108,0,200,ff,c2207108,c2207000,c2275d08,0,cdcb5eb8,c01ac108) at netbsd:ata_thread_run+0x1e
 >[ 1316.9132490] ata_downgrade_mode(c2272008,0,cdcb5f00,c01a9c72,c2272008,0,50,0,10,cdcb5edc) at netbsd:ata_downgrade_mode+0x16b

 ata_downgrade_mode calls ata_thread_run without taking the channel lock.

 Maybe this:

 Index: ata.c
 ===================================================================
 RCS file: /cvsroot/src/sys/dev/ata/ata.c,v
 retrieving revision 1.145
 diff -u -r1.145 ata.c
 --- ata.c       24 Oct 2018 20:25:52 -0000      1.145
 +++ ata.c       10 Nov 2018 12:53:20 -0000
 @@ -1809,7 +1809,9 @@
         (*atac->atac_set_modes)(chp);
         ata_print_modes(chp);
         /* reset the channel, which will schedule all drives for setup */
 +       ata_channel_lock(chp);
         ata_thread_run(chp, flags, ATACH_TH_RESET, ATACH_NODRIVE);
 +       ata_channel_unlock(chp);
         return 1;
  }
  #endif /* NATA_DMA */

 -- 
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

Responsible-Changed-From-To: kern-bug-people->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Sat, 10 Nov 2018 13:43:56 +0000
Responsible-Changed-Why:
I'm looking into this.


State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Mon, 12 Nov 2018 18:54:56 +0000
State-Changed-Why:
Committed a fix. I understand this is tricky to trigger, nevertheless
can you confirm this is fixed if you happen to get the disk errors again?


From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/53714 CVS commit: src/sys/dev
Date: Mon, 12 Nov 2018 18:51:01 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Mon Nov 12 18:51:01 UTC 2018

 Modified Files:
 	src/sys/dev/ata: ata.c ata_wdc.c
 	src/sys/dev/ic: mvsata.c
 	src/sys/dev/scsipi: atapi_wdc.c

 Log Message:
 hold channel lock during whole ata_dmaerr()/ata_downgrade_mode() -
 according to code inspection this is safe, none of the set_modes
 hooks execute anything which would be taking the lock

 adresses PR kern/53714 by Andreas Gustafsson


 To generate a diff of this commit:
 cvs rdiff -u -r1.145 -r1.146 src/sys/dev/ata/ata.c
 cvs rdiff -u -r1.112 -r1.113 src/sys/dev/ata/ata_wdc.c
 cvs rdiff -u -r1.44 -r1.45 src/sys/dev/ic/mvsata.c
 cvs rdiff -u -r1.131 -r1.132 src/sys/dev/scsipi/atapi_wdc.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/53714 (Panic during recovery from IDE disk error)
Date: Tue, 13 Nov 2018 10:06:52 +0200

 jdolecek@NetBSD.org wrote:
 > Committed a fix. I understand this is tricky to trigger, nevertheless
 > can you confirm this is fixed if you happen to get the disk errors again?

 Thank you for the fix.  I have found only two "downgrading to PIO mode 4"
 messages in my logs from before the fix, and only one of them was followed
 by apanic, so it could take a long time before there is another downgrade,
 and even if there is one without a panic, that's not conclusive evidence. 
 It's probably best to just close the PR now, and reopen it if the panic
 recurs.
 -- 
 Andreas Gustafsson, gson@gson.org

State-Changed-From-To: feedback->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Tue, 13 Nov 2018 08:12:09 +0000
State-Changed-Why:
Fix committed


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.43 2018/01/16 07:36:43 maya Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2017 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.