NetBSD Problem Report #41867

From Wolfgang.Stukenbrock@nagler-company.com  Mon Aug 10 13:39:12 2009
Return-Path: <Wolfgang.Stukenbrock@nagler-company.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by www.NetBSD.org (Postfix) with ESMTP id 3722E63C270
	for <gnats-bugs@gnats.NetBSD.org>; Mon, 10 Aug 2009 13:39:12 +0000 (UTC)
Message-Id: <20090810133908.2A84C3925CD@s013.nagler-company.com>
Date: Mon, 10 Aug 2009 15:39:08 +0200 (CEST)
From: Wolfgang.Stukenbrock@nagler-company.com
Reply-To: Wolfgang.Stukenbrock@nagler-company.com
To: gnats-bugs@gnats.NetBSD.org
Subject: ahc-driver freezes after first device timeout and loses error information
X-Send-Pr-Version: 3.95

>Number:         41867
>Category:       kern
>Synopsis:       ahc-driver freezes after first device timeout and loses error information
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    bouyer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Aug 10 13:40:00 +0000 2009
>Last-Modified:  Mon Aug 10 17:32:46 +0000 2009
>Originator:     Wolfgang Stukenbrock
>Release:        NetBSD 4.0
>Organization:
Dr. Nagler & Company GmbH

>Environment:


System: NetBSD s013 4.0 NetBSD 4.0 (NSW-S013-new) #37: Mon Aug 10 11:55:32 CEST 2009 wgstuken@s013:/usr/src/sys/arch/amd64/compile/NSW-S013-new amd64
Architecture: x86_64
Machine: amd64
>Description:
	After the first device timeout on the SCSI bus, no further commands are executed on that bus.
	We have this problem on several machines (running different versions (3.x and 4.x) of NetBSD) with tape drives
	(DAT and VXA). We are using some 19160, 29160 and 29160N controllers - they all share this problem.
	I've traced the problem to a missing thaw call during device-reset processing:
	the channel is frozen in ahc_set_recoveryscb(), but never thawed again.
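
	The effect of the missing thaw can be illustrated with a small stand-alone model. This is only a sketch:
	struct toy_channel and the toy_* functions are hypothetical stand-ins, not the kernel's scsipi code, and
	they model nothing but a freeze counter that blocks new commands while it is non-zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a SCSI channel: only the freeze counter is modeled. */
struct toy_channel {
	int freeze_count;	/* > 0 means the queue is frozen */
	int executed;		/* number of commands that were started */
};

/* Raise the freeze counter; commands are held back while it is non-zero. */
static void
toy_channel_freeze(struct toy_channel *chan, int count)
{
	chan->freeze_count += count;
}

/* Lower the freeze counter; every freeze needs a matching thaw. */
static void
toy_channel_thaw(struct toy_channel *chan, int count)
{
	assert(chan->freeze_count >= count);
	chan->freeze_count -= count;
}

/* Start a command only while the channel is not frozen. */
static bool
toy_channel_run(struct toy_channel *chan)
{
	if (chan->freeze_count > 0)
		return false;
	chan->executed++;
	return true;
}
```

	In this model, a freeze without a matching thaw makes every later toy_channel_run() call return false,
	which corresponds to the hang described above.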

	The next problem in the ahc driver is that the failed request is then returned to the caller without any error indication.
	I set an error indication in ahc_abort_scbs() - that looks like the correct place to me, because any aborted request
	should return an error from my point of view.
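
	Why the error indication matters can be sketched as follows. All names here (toy_xfer, TOY_XS_*,
	toy_mark_aborted, toy_xfer_errno) are hypothetical and are not the driver's or scsipi's real identifiers.
	The sketch also simplifies the condition: it marks every aborted transfer, which is an assumption and not
	necessarily what the patch does.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical transfer status values, loosely modeled on the idea of XS_* codes. */
enum toy_xs_error {
	TOY_XS_NOERROR = 0,
	TOY_XS_RESET
};

/* Hypothetical transfer descriptor: only the status field is modeled. */
struct toy_xfer {
	enum toy_xs_error error;
};

/* Record that the transfer was aborted by a reset (simplified: unconditional). */
static void
toy_mark_aborted(struct toy_xfer *xs)
{
	if (xs != NULL)
		xs->error = TOY_XS_RESET;
}

/* How an upper layer might translate the status for its caller (assumed mapping). */
static int
toy_xfer_errno(const struct toy_xfer *xs)
{
	return xs->error == TOY_XS_NOERROR ? 0 : EIO;
}
```

	Without the marking step the status stays at "no error", so the caller sees success for a request that
	never completed.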

	During the analysis I noticed another problem in the driver's device-reset processing.
	If an explicit device reset is requested from user level, the same scb is queued to the controller twice. The only driver
	that seems to be affected by this is the CD driver - I've found no other location where the relevant flag is set.
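
	The double queueing can be shown with a reduced control-flow model. This is a sketch under assumptions:
	the toy_* names are hypothetical, and the model assumes that the data-setup path also ends by queueing
	the scb, as the description above implies.

```c
#include <stdbool.h>

/* Hypothetical SCB: counts how often it was handed to the controller. */
struct toy_scb {
	int times_queued;
};

/* Stand-in for queueing the SCB to the controller. */
static void
toy_execute_scb(struct toy_scb *scb)
{
	scb->times_queued++;
}

/* Stand-in for the data-setup path, assumed to end by queueing the SCB. */
static void
toy_setup_data(struct toy_scb *scb)
{
	toy_execute_scb(scb);
}

/* Old flow: the reset branch queues the SCB, then falls through to setup. */
static void
toy_action_unpatched(struct toy_scb *scb, bool reset_requested)
{
	if (reset_requested)
		toy_execute_scb(scb);
	toy_setup_data(scb);
}

/* Patched flow: exactly one of the two paths runs. */
static void
toy_action_patched(struct toy_scb *scb, bool reset_requested)
{
	if (reset_requested)
		toy_execute_scb(scb);
	else
		toy_setup_data(scb);
}
```

	With reset_requested set, the unpatched flow queues the scb twice while the patched flow queues it once;
	without a reset request both flows behave the same.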

	The following two patches fix the above problems when a device timeout occurs or a device reset is requested.

	Nevertheless, I haven't found the root cause of the timeout - I assume there is another bug somewhere in the ahc driver.
	I haven't found a way to trigger it - sometimes it takes minutes until it happens, sometimes days. I'm sure it is not related to
	the wiring of the tape devices on the SCSI bus ...

	With this patch, EIO is returned to the caller after a device timeout and it is no longer necessary to reboot the system.
>How-To-Repeat:
	Connect a tape drive to an ahc controller and write to it. Wait until a device timeout occurs ...
>Fix:
	The following two files in /usr/src/sys/dev/ic must be updated to fix the problem.

===================================================================
RCS file: RCS/aic7xxx.c,v
retrieving revision 1.1
diff -u -r1.1 aic7xxx.c
--- aic7xxx.c	2009/08/07 12:15:32	1.1
+++ aic7xxx.c	2009/08/07 12:27:51
@@ -1290,6 +1290,12 @@
 						    CAM_BDR_SENT,
 						    "Bus Device Reset",
 						    /*verbose_level*/0);
+
+				/* reset freeze status - was set in ahc_set_recoveryscb() - otherwise we will hang ... */
+				scsipi_channel_thaw(&ahc->sc_channel, 1);
+				if (ahc->features & AHC_TWIN)
+				  scsipi_channel_thaw(&ahc->sc_channel_b, 1);
+
 				printerror = 0;
 			} else if (ahc_sent_msg(ahc, AHCMSG_EXT,
 						MSG_EXT_PPR, FALSE)) {
@@ -5880,6 +5886,11 @@
 				ahc_freeze_scb(scbp);
 			if ((scbp->flags & SCB_ACTIVE) == 0)
 				printf("Inactive SCB on pending list\n");
+
+			/* set error status - otherwise these scbs will signal success to the initiator .... */
+			if (scbp->xs != NULL && scbp->xs->error != XS_NOERROR)
+			  scbp->xs->error = XS_RESET; /* we use XS_RESET here - it may be a good idea to retry the command later */
+
 			ahc_done(ahc, scbp);
 			found++;
 		}
===================================================================
RCS file: RCS/aic7xxx_osm.c,v
retrieving revision 1.1
diff -u -r1.1 aic7xxx_osm.c
--- aic7xxx_osm.c	2009/08/10 13:15:03	1.1
+++ aic7xxx_osm.c	2009/08/10 13:16:28
@@ -323,8 +323,8 @@
 			hscb->control |= MK_MESSAGE;
 			ahc_execute_scb(scb, NULL, 0);
 		}
-
-		ahc_setup_data(ahc, xs, scb);
+		else /* do not use the scb a second time - it has been freed by the ahc_execute_scb processing above ... */
+			ahc_setup_data(ahc, xs, scb);

 		break;
 	  }

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: kern-bug-people->bouyer
Responsible-Changed-By: bouyer@NetBSD.org
Responsible-Changed-When: Mon, 10 Aug 2009 17:32:46 +0000
Responsible-Changed-Why:
I have a way to test this.


>Unformatted:
