NetBSD Problem Report #47606

From apb@cequrux.com  Sat Mar  2 08:07:13 2013
Return-Path: <apb@cequrux.com>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	by www.NetBSD.org (Postfix) with ESMTP id 8A4B463EFC2
	for <gnats-bugs@gnats.NetBSD.org>; Sat,  2 Mar 2013 08:07:13 +0000 (UTC)
Message-Id: <20130302080707.B653A3B632A@apb-laptoy.apb.alt.za>
Date: Sat,  2 Mar 2013 08:07:07 +0000 (UTC)
From: apb@cequrux.com
To: gnats-bugs@NetBSD.org
Subject: panic with ffs+wapbl on cgd on USB disk (ahci)
X-Send-Pr-Version: 3.95

>Number:         47606
>Category:       port-i386
>Synopsis:       sleep in interrupt mode after error with USB disk (ahci)
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    port-i386-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Mar 02 08:10:01 +0000 2013
>Last-Modified:  Sun Mar 10 22:45:05 +0000 2013
>Originator:     Alan Barrett
>Release:        NetBSD 6.99.16
>Organization:
Not much
>Environment:
System: NetBSD 6.99.16 i386
>Description:
I encountered the following panic while using rsync to copy
files from an internal disk to an external USB disk.  All relevant
file systems are ffs+wapbl on cgd.

panic: kernel diagnostic assertion "!cpu_intr_p()" failed:
       file .../kern_timeout.c, line 471

db{0}> bt
breakpoint ...
vpanic ...
kern_assert ...
callout_halt ...
sleepq_block ...
ahci_channel_start ...
ahci_do_reset_drive ...
ahci_reset_drive ...
wddone ...
ahci_bio_complete ...
ahci_intr_port ...
ahci_intr ...
intr_biglock_wrapper ...
--- switch to interrupt stack ---
Xintr_ioapic_level7() ...
--- interrupt ---
x86_mwait ...
acpicpu_cstate_idle_enter ...
acpicpu_cstate_idle ...
idle_loop ...
db{0}>

I often encounter panics while using rsync to external USB disks.
Usually, I am unable to see the panic message or to get a stack trace,
because of inability to switch from graphics to a text console after a
panic.  This time, the panic was while a text console was active, so
I can report this panic, but I can't tell whether the other frequent
panics are the same or different.

>How-To-Repeat:
Plug in external USB disk; it is attached as follows:

    umass1 at uhub3 port 3 configuration 1 interface 0
    umass1: Western Digital My Book, rev 2.00/1.65, addr 4
    umass1: using SCSI over Bulk-Only
    scsibus1 at umass1: 2 targets, 1 lun per target
    sd1 at scsibus1 target 0 lun 0: <WD, 10EACS External, 1.65> disk fixed version 4
    sd1: 931 GB, 16383 cyl, 16 head, 63 sec, 512 bytes/sect x 1953525168 sectors

The disk has an MBR and a disklabel,
in which the "e" partition is a cgd.

Inside the cgd is a disklabel,
in which the "a" partition is ffs+wapbl.

# cgdconfig cgd2 /dev/sd1e ${configfile}

# mount -t ffs -o log /dev/cgd2a /mnt

Use rsync to copy lots of stuff to /mnt/somesubdir/

The system may panic after some time.  In this instance, the panic
was after about two hours.

Apart from a single rsync task, there was no other activity
on the disk at the time of the panic.

>Fix:
Unknown

>Release-Note:

>Audit-Trail:
From: Alan Barrett <apb@cequrux.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-i386/47606: panic with ffs+wapbl on cgd on USB disk (ahci)
Date: Sat, 2 Mar 2013 11:28:02 +0200

 On Sat, 02 Mar 2013, apb@cequrux.com wrote:
 >db{0}> bt
 >breakpoint ...
 >vpanic ...
 >kern_assert ...
 >callout_halt ...
 >sleepq_block ...
 >ahci_channel_start ...
 >ahci_do_reset_drive ...
 >ahci_reset_drive ...
 >wddone ...
 >ahci_bio_complete ...
 >ahci_intr_port ...
 >ahci_intr ...
 >intr_biglock_wrapper ...
 >--- switch to interrupt stack ---

 So, wddone sees that there was some kind of error, and calls 

 	(*wd->atabus->ata_reset_drive)(wd->drvp, AT_RST_NOCMD, NULL);

 The ata_reset_drive pointer refers to the ahci_reset_drive 
 function, which ends up calling ahci_channel_start, which sleeps, 
 and sleeping is not allowed in interrupt mode.
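
The call chain above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: every `sim_*` name is hypothetical and not part of the NetBSD kernel, and where the real kernel panics on the `!cpu_intr_p()` assertion, the model merely returns -1.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for cpu_intr_p(): true while the simulated
 * interrupt handler is running. */
static bool sim_in_interrupt = false;

/* Hypothetical stand-in for a reset routine that may sleep, as
 * ahci_channel_start does in the backtrace.  Returns 0 when sleeping
 * is allowed and -1 when called from simulated interrupt context
 * (the real kernel would panic on the assertion here). */
static int
sim_sleeping_reset(void)
{
	if (sim_in_interrupt)
		return -1;
	return 0;
}

/* Hypothetical completion handler, loosely modelled on wddone():
 * on an I/O error it resets the drive directly. */
static int
sim_done(bool io_error)
{
	if (!io_error)
		return 0;
	return sim_sleeping_reset();
}

/* Simulated interrupt entry: runs the completion handler with the
 * interrupt flag set, then clears the flag again. */
static int
sim_intr(bool io_error)
{
	int rv;

	sim_in_interrupt = true;
	rv = sim_done(io_error);
	sim_in_interrupt = false;
	return rv;
}
```

In this model, `sim_intr(false)` succeeds, `sim_intr(true)` hits the forbidden path, and calling `sim_done(true)` outside the simulated interrupt succeeds, which matches the observation that the panic only appears once a disk error occurs.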

 --apb (Alan Barrett)

From: Alan Barrett <apb@cequrux.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-i386/47606: sleep in interrupt mode after error with USB
 disk (ahci)
Date: Sat, 2 Mar 2013 15:27:19 +0200

 On Sat, 02 Mar 2013, apb@cequrux.com wrote:
 >>Number:         47606
 >>Category:       port-i386
 >>Synopsis:       panic with ffs+wapbl on cgd on USB disk (ahci)

 I have edited the synopsis, to read:
 Synopsis:	sleep in interrupt mode after error with USB disk (ahci)

 The following patch makes wddone() print the error message before,
 instead of after, resetting the disk.  It won't prevent the panic, but
 at least there will be a message about a disk error just before the
 panic.

 Index: sys/dev/ata/wd.c
 ===================================================================
 --- sys/dev/ata/wd.c	9 Jan 2013 22:03:49 -0000	1.402
 +++ sys/dev/ata/wd.c	2 Mar 2013 10:31:33 -0000
 @@ -748,6 +748,7 @@ wddone(void *v)
   	struct wd_softc *wd = device_private(v);
   	struct buf *bp = wd->sc_bp;
   	const char *errmsg;
 +	int do_reset_drive = 0;
   	int do_perror = 0;

   	ATADEBUG_PRINT(("wddone %s\n", device_xname(wd->sc_dev)),
 @@ -758,26 +759,29 @@ wddone(void *v)
   	switch (wd->sc_wdc_bio.error) {
   	case ERR_DMA:
   		errmsg = "DMA error";
 +		do_reset_drive = 1;
   		goto retry;
   	case ERR_DF:
   		errmsg = "device fault";
 +		do_reset_drive = 1;
   		goto retry;
   	case TIMEOUT:
   		errmsg = "device timeout";
 +		do_reset_drive = 1;
   		goto retry;
   	case ERR_RESET:
   		errmsg = "channel reset";
 -		goto retry2;
 +		do_reset_drive = 0;
 +		goto retry;
   	case ERROR:
   		/* Don't care about media change bits */
   		if (wd->sc_wdc_bio.r_error != 0 &&
   		    (wd->sc_wdc_bio.r_error & ~(WDCE_MC | WDCE_MCR)) == 0)
   			goto noerror;
   		errmsg = "error";
 +		do_reset_drive = 1;
   		do_perror = 1;
 -retry:		/* Just reset and retry. Can we do more ? */
 -		(*wd->atabus->ata_reset_drive)(wd->drvp, AT_RST_NOCMD, NULL);
 -retry2:
 +retry:		/* print message, reset, and retry. Can we do more ? */
   		diskerr(bp, "wd", errmsg, LOG_PRINTF,
   		    wd->sc_wdc_bio.blkdone, wd->sc_dk.dk_label);
   		if (wd->retries < WDIORETRIES)
 @@ -785,6 +789,9 @@ retry2:
   		printf("\n");
   		if (do_perror)
   			wdperror(wd);
 +		if (do_reset_drive)
 +			(*wd->atabus->ata_reset_drive)(wd->drvp, AT_RST_NOCMD,
 +			    NULL);
   		if (wd->retries < WDIORETRIES) {
   			wd->retries++;
   			callout_reset(&wd->sc_restart_ch, RECOVERYTIME,

From: Alan Barrett <apb@cequrux.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-i386/47606: sleep in interrupt mode after error with USB
 disk (ahci)
Date: Sat, 2 Mar 2013 17:08:57 +0200

 On Sat, 02 Mar 2013, Alan Barrett wrote:
 > The following patch makes wddone() print the error message before,
 > instead of after, resetting the disk.  It won't prevent the panic, but
 > at least there will be a message about a disk error just before the
 > panic.

 With that patch, I see this error message just before the panic:

 wd0e: error reading fsbn NNNN of NNNN-NNNN (wd0 bn NNNN ...), retrying
 wd0: (uncorrectable data error)

 I suspect that there really is an unreadable sector on the disk.

 --apb (Alan Barrett)

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@netbsd.org
Cc: port-i386-maintainer@netbsd.org, gnats-admin@netbsd.org,
        netbsd-bugs@netbsd.org
Subject: Re: port-i386/47606: panic with ffs+wapbl on cgd on USB disk (ahci)
Date: Sun, 10 Mar 2013 19:39:21 +0100

 On Sat, Mar 02, 2013 at 08:10:01AM +0000, apb@cequrux.com wrote:
 > files from an internal disk to an external USB disk.  All relevant
 > file systems are ffs+wapbl on cgd.
 > 
 > panic: kernel diagnostic assertion "!cpu_intr_p()" failed:
 >        file .../kern_timeout.c, line 471
 > 
 > db{0}> bt
 > breakpoint ...
 > vpanic ...
 > kern_assert ...
 > callout_halt ...
 > sleepq_block ...
 > ahci_channel_start ...
 > ahci_do_reset_drive ...
 > ahci_reset_drive ...
 > wddone ...
 > ahci_bio_complete ...
 > ahci_intr_port ...
 > ahci_intr ...
 > intr_biglock_wrapper ...
 > --- switch to interrupt stack ---
 > Xintr_ioapic_level7() ...
 > --- interrupt ---
 > x86_mwait ...
 > acpicpu_cstate_idle_enter ...
 > acpicpu_cstate_idle ...
 > idle_loop ...
 > db{0}>

 This is related to kern/47097; it just fails one step further ...
 We should look at using the channel kernel thread to reset the drive,
 instead of doing it directly from interrupt context.
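
The deferral suggested above might look roughly like the following single-threaded user-space model. It is a sketch only: all `sim_*` names are hypothetical, the "thread" is reduced to a function that represents one pass of its loop, and none of this reflects the actual ata(4) channel thread code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for cpu_intr_p(). */
static bool sim_in_interrupt = false;

/* Set by the interrupt handler, consumed by the channel thread. */
static bool sim_reset_pending = false;

/* Counts completed resets so the behaviour can be observed. */
static int sim_resets_done = 0;

/* Hypothetical reset routine that may sleep; it must never run in
 * simulated interrupt context. */
static void
sim_reset_drive(void)
{
	assert(!sim_in_interrupt);
	sim_resets_done++;
}

/* Simulated interrupt handler for an I/O error: it only records that
 * a reset is needed.  A real driver would also wake the thread. */
static void
sim_intr_error(void)
{
	sim_in_interrupt = true;
	sim_reset_pending = true;
	sim_in_interrupt = false;
}

/* One pass of the simulated channel thread loop: performs the
 * deferred reset, if any, outside interrupt context. */
static void
sim_channel_thread_step(void)
{
	if (sim_reset_pending) {
		sim_reset_pending = false;
		sim_reset_drive();
	}
}
```

After `sim_intr_error()` the reset is pending but not yet performed; the next `sim_channel_thread_step()` carries it out where sleeping is permitted, and further steps do nothing until another error is recorded.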

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 years of experience will always make the difference
 --

>Unformatted:
