NetBSD Problem Report #47097

From www@NetBSD.org  Sat Oct 20 15:45:04 2012
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	by www.NetBSD.org (Postfix) with ESMTP id 6C3A563E21A
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 20 Oct 2012 15:45:04 +0000 (UTC)
Message-Id: <20121020154503.A9C7763DFAE@www.NetBSD.org>
Date: Sat, 20 Oct 2012 15:45:03 +0000 (UTC)
From: tshiozak@bsdclub.org
Reply-To: tshiozak@bsdclub.org
To: gnats-bugs@NetBSD.org
Subject: ahcisata(4) triggers panic when disk I/O error is caused
X-Send-Pr-Version: www-1.0

>Number:         47097
>Category:       kern
>Synopsis:       ahcisata(4) triggers panic when disk I/O error is caused
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Oct 20 15:50:00 +0000 2012
>Last-Modified:  Fri Oct 26 10:05:05 +0000 2012
>Originator:     Takuya SHIOZAKI
>Release:        trunk
>Organization:
>Environment:
NetBSD ichigo 6.99.11 NetBSD 6.99.11 (GENERIC) #0: Mon Sep 24 11:49:55 JST 2012  root@aoi:/usr/home/current-build/obj/sys/arch/i386/compile/GENERIC i386
>Description:
when disk I/O error is caused, ahcisata(4) seems to call tsleep() under interrupt context and it triggers panic.

call trace is here:

vpanic
kern_assert
callout_halt
sleepq_block
ahci_do_reset_drive
ahci_reset_drive
wddone
ahci_bio_complete
ahci_intr_port
ahci_intr

>How-To-Repeat:
use ahcisata driver with a nearly broken disk.
>Fix:
probably, don't use tsleep() in ahci_do_reset_drive() or don't call drv_done under interrupt context.  we need ata(9) manpage.

>Audit-Trail:
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/47097: ahcisata(4) triggers panic when disk I/O error is
 caused
Date: Sun, 21 Oct 2012 13:04:20 +0200

 --u3/rZRmxL6MmkK24
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: inline

 On Sat, Oct 20, 2012 at 03:50:00PM +0000, tshiozak@bsdclub.org wrote:
 > when disk I/O error is caused, ahcisata(4) seems to call tsleep() under interrupt context and it triggers panic.
 > 
 > call trace is here:
 > 
 > vpanic
 > kern_assert
 > callout_halt
 > sleepq_block
 > ahci_do_reset_drive
 > ahci_reset_drive
 > wddone
 > ahci_bio_complete
 > ahci_intr_port
 > ahci_intr

 The attached patch should work around it. But this is going to
 delay() for half a second, which is bad. Something better is needed,
 so we can reset the drive from the atabus thread.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

 --u3/rZRmxL6MmkK24
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: attachment; filename=diff

 Index: ahcisata_core.c
 ===================================================================
 RCS file: /cvsroot/src/sys/dev/ic/ahcisata_core.c,v
 retrieving revision 1.44
 diff -u -p -u -r1.44 ahcisata_core.c
 --- ahcisata_core.c	27 Sep 2012 00:39:47 -0000	1.44
 +++ ahcisata_core.c	21 Oct 2012 11:02:38 -0000
 @@ -721,7 +721,10 @@ again:
  		if ((((sig & AHCI_P_TFD_ST) >> AHCI_P_TFD_ST_SHIFT)
  		    & WDCS_BSY) == 0)
  			break;
 -		tsleep(&sc, PRIBIO, "ahcid2h", mstohz(10));
 +		if (flags & AT_WAIT)
 +			tsleep(&sc, PRIBIO, "ahcid2h", mstohz(10));
 +		else
 +			delay(10000);
  	}
  	if (i == AHCI_RST_WAIT) {
  		aprint_error("%s: BSY never cleared, TD 0x%x\n",
 @@ -740,7 +743,10 @@ again:
  	    AHCI_READ(sc, AHCI_P_CMD(chp->ch_channel))), DEBUG_PROBE);
  end:
  	ahci_channel_stop(sc, chp, flags);
 -	tsleep(&sc, PRIBIO, "ahcirst", mstohz(500));
 +	if (flags & AT_WAIT)
 +		tsleep(&sc, PRIBIO, "ahcirst", mstohz(500));
 +	else
 +		delay(500000);
  	/* clear port interrupt register */
  	AHCI_WRITE(sc, AHCI_P_IS(chp->ch_channel), 0xffffffff);
  	ahci_channel_start(sc, chp, AT_WAIT,

 --u3/rZRmxL6MmkK24--

From: "T.SHIOZAKI" <tshiozak@bsdclub.org>
To: bouyer@antioche.eu.org
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
        netbsd-bugs@NetBSD.org
Subject: Re: kern/47097: ahcisata(4) triggers panic when disk I/O error is
 caused
Date: Thu, 25 Oct 2012 06:41:06 +0900 (JST)

 > On Sat, Oct 20, 2012 at 03:50:00PM +0000, tshiozak@bsdclub.org wrote:
 > > when disk I/O error is caused, ahcisata(4) seems to call tsleep() under interrupt context and it triggers panic.
 > > 
 > > call trace is here:
 > > 
 > > vpanic
 > > kern_assert
 > > callout_halt
 > > sleepq_block
 > > ahci_do_reset_drive
 > > ahci_reset_drive
 > > wddone
 > > ahci_bio_complete
 > > ahci_intr_port
 > > ahci_intr
 > 
 > The attached patch should work around it. But this is going to
 > delay() for half a second, which is bad. Something better is needed,
 > so we can reset the drive from the atabus thread.

 I confirmed that the patch fixes the problem.  Thanks.

 ---
 Takuya SHIOZAKI

From: "Manuel Bouyer" <bouyer@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/47097 CVS commit: src/sys/dev/ic
Date: Fri, 26 Oct 2012 09:59:11 +0000

 Module Name:	src
 Committed By:	bouyer
 Date:		Fri Oct 26 09:59:11 UTC 2012

 Modified Files:
 	src/sys/dev/ic: ahcisata_core.c

 Log Message:
 Workaround PR kern/47097: use delay() instead of tsleep() ahci_do_reset_drive()
 if not called with AT_WAIT.
 The right fix here is to change the ata layer to reset the drive from
 thread context, to avoid a 0.5 delay() in interrupt context when a drive
 fails.


 To generate a diff of this commit:
 cvs rdiff -u -r1.44 -r1.45 src/sys/dev/ic/ahcisata_core.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: "T.SHIOZAKI" <tshiozak@bsdclub.org>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
        netbsd-bugs@NetBSD.org
Subject: Re: kern/47097: ahcisata(4) triggers panic when disk I/O error is
 caused
Date: Fri, 26 Oct 2012 11:59:48 +0200

 On Thu, Oct 25, 2012 at 06:41:06AM +0900, T.SHIOZAKI wrote:
 > > delay() for half a second, which is bad. Something better is needed,
 > > so we can reset the drive from the atabus thread.
 > 
 > I confirmed that the patch fixes the problem.  Thanks.

 OK, I commited this patch as a stopgap measure, but a real fix will
 require more work.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.