NetBSD Problem Report #42844

From mhitch@netbsd0.msu.montana.edu  Thu Feb 18 22:18:33 2010
Return-Path: <mhitch@netbsd0.msu.montana.edu>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by www.NetBSD.org (Postfix) with ESMTP id 3FCB363B11D
	for <gnats-bugs@gnats.NetBSD.org>; Thu, 18 Feb 2010 22:18:33 +0000 (UTC)
Message-Id: <20100218211310.9FB1B3CC96@netbsd0.msu.montana.edu>
Date: Thu, 18 Feb 2010 14:13:10 -0700 (MST)
From: mhitch@NetBSD.org
Reply-To: mhitch@NetBSD.org
To: gnats-bugs@gnats.NetBSD.org
Subject: esiop(4)/siop(4) can lose cmd entries under resource shortage conditions
X-Send-Pr-Version: 3.95

>Number:         42844
>Category:       kern
>Synopsis:       esiop(4)/siop(4) can lose cmd entries under resource shortage conditions
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Feb 18 22:20:00 +0000 2010
>Closed-Date:    Sat Nov 20 18:52:56 +0000 2010
>Last-Modified:  Sat Nov 20 18:52:56 +0000 2010
>Originator:     Michael L. Hitch
>Release:        NetBSD 5.99.24 and 5.0
>Organization:
	Montana State University
>Environment:


System: NetBSD netbsd0.msu.montana.edu 5.99.24 NetBSD 5.99.24 (GENERIC-$Revision: 1.330 $) #7: Thu Feb 18 10:24:23 MST 2010 mhitch@net4.msu.montana.edu:/home/mhitch/NetBSD-current/OBJ/alphaev56/home/mhitch/NetBSD-current/src/sys/arch/alpha/compile/LOCKDEBUG alpha
Architecture: alpha
Machine: alpha
>Description:
The siop(4) and esiop(4) driver will dynmaically grow "resources" as needed, and
actually starts out with none.  The resources in this case are cmd structures used
for each disk I/O transaction.  The driver will allocate a page of memory, setup
cmd structures in that page, and add them to the free list.  The adapter openings
are ajusted accordingly.

The problem starts when the driver tries to process an adapter request.  It will
remove a cmd structure from the free list and proceeds to fill in the cmd to
perform the requested operation.  There are several places where the cmd setup
can fail, in which case the driver will indicate a resource failure or a driver
"stuffup" and and terminate the request.  Howver, the driver fails to return the
now unused cmd to the free list, and the number of free items will no longer
match the adapter openings available.  The free list can run out of entries
and the driver will return a resource failure, and the upper layer will sleep
for a short time and retry the request later.  A particularly bad failure can
occur if the bus_dma_load fails due to another resource failure (bus_dma_load
will return EAGAIN), and the driver reports XS_DRIVER_STUFFUP.  This fails
the I/O request, but doesn't appear to report this failure back to the user
program, and seems to result in corrupted data in the file.  This can be
hard to detect some times, since accessing the file seems to be using the
data in the memory buffers, so appears all right.  Once the buffers have been
flushed (either over time as other data replaces the buffer or a system reboot),
then accessing the file will result in the corrupted data.
>How-To-Repeat:
Perform operations on files that write large amounts of data.  In my particular
case, the files are db(1) files with a few *vary* large records (it's a perl
script that keeps large amounts of information in hashes that are all written
as a single record in the db database).  It will also occur when doing a cp
of the database file to another file (I was trying to get backup copies of the
files).  After a reboot, or time enough for the buffer cache to be flushed,
the perl script would fail trying to update the db files.  An easy way to
verify the corrupted data was to use dump to backup the files and restore
them at another location (I did it to a different system) and compare md5
sums of the original and restored files.
>Fix:
When terminating the adapter request after the cmd has been removed from
the free list, put that cmd back on the free list before returing.  This
will keep from losing the cmd structures, and keeps the number of free entries
in sync with the adapter openings.  Check the return from the bus_dma_load()
operations, and return XS_RESOURCE_SHORTAGE if the error return was EAGAIN.
One other minor change is to adjust one of the error messages that occurs
when bus_dma_load() fails.  One bus_dma_load() is to set up the DMA information
for the cmd structure itself, and a second bus_dma_load() is done for the
actual data buffer.  Both messages indicated a 'cmd' failure; change the 2nd
to indicate 'data'.  Then exactly which bus_dma_load() failed can be determined.

The patches I am now using have made my CS20 much more reliable:  I no long
get large numbers of adapter resource shortage messages, and the failures to
load the DMA map no longer appear to corrupt my files.


Index: sys/dev/ic/esiop.c
===================================================================
RCS file: /cvsroot/src/sys/dev/ic/esiop.c,v
retrieving revision 1.42
diff -u -p -r1.42 esiop.c
--- sys/dev/ic/esiop.c	8 Apr 2008 12:07:26 -0000	1.42
+++ sys/dev/ic/esiop.c	16 Feb 2010 03:36:18 -0000
@@ -1555,6 +1555,7 @@ esiop_scsipi_request(chan, req, arg)
 				aprint_error_dev(&sc->sc_c.sc_dev, "can't malloc memory for "
 				    "target %d\n", target);
 				xs->error = XS_RESOURCE_SHORTAGE;
+				TAILQ_INSERT_TAIL(&sc->free_list, esiop_cmd, next);
 				scsipi_done(xs);
 				splx(s);
 				return;
@@ -1581,6 +1582,7 @@ esiop_scsipi_request(chan, req, arg)
 				    "target %d lun %d\n",
 				    target, lun);
 				xs->error = XS_RESOURCE_SHORTAGE;
+				TAILQ_INSERT_TAIL(&sc->free_list, esiop_cmd, next);
 				scsipi_done(xs);
 				splx(s);
 				return;
@@ -1598,7 +1600,10 @@ esiop_scsipi_request(chan, req, arg)
 		if (error) {
 			aprint_error_dev(&sc->sc_c.sc_dev, "unable to load cmd DMA map: %d\n",
 			    error);
-			xs->error = XS_DRIVER_STUFFUP;
+			xs->error = (error == EAGAIN) ? XS_RESOURCE_SHORTAGE :
+							XS_DRIVER_STUFFUP;
+			esiop_cmd->cmd_c.status = CMDST_FREE;
+			TAILQ_INSERT_TAIL(&sc->free_list, esiop_cmd, next);
 			scsipi_done(xs);
 			splx(s);
 			return;
@@ -1610,9 +1615,12 @@ esiop_scsipi_request(chan, req, arg)
 			    ((xs->xs_control & XS_CTL_DATA_IN) ?
 			     BUS_DMA_READ : BUS_DMA_WRITE));
 			if (error) {
-				aprint_error_dev(&sc->sc_c.sc_dev, "unable to load cmd DMA map: %d",
+				aprint_error_dev(&sc->sc_c.sc_dev, "unable to load data DMA map: %d",
 				    error);
-				xs->error = XS_DRIVER_STUFFUP;
+				xs->error = (error == EAGAIN) ? XS_RESOURCE_SHORTAGE :
+								XS_DRIVER_STUFFUP;
+				esiop_cmd->cmd_c.status = CMDST_FREE;
+				TAILQ_INSERT_TAIL(&sc->free_list, esiop_cmd, next);
 				scsipi_done(xs);
 				bus_dmamap_unload(sc->sc_c.sc_dmat,
 				    esiop_cmd->cmd_c.dmamap_cmd);
Index: sys/dev/ic/siop.c
===================================================================
RCS file: /cvsroot/src/sys/dev/ic/siop.c,v
retrieving revision 1.87
diff -u -p -r1.87 siop.c
--- sys/dev/ic/siop.c	8 Apr 2008 12:07:27 -0000	1.87
+++ sys/dev/ic/siop.c	16 Feb 2010 03:36:18 -0000
@@ -1281,6 +1281,7 @@ siop_scsipi_request(chan, req, arg)
 				aprint_error_dev(&sc->sc_c.sc_dev, "can't malloc memory for "
 				    "target %d\n", target);
 				xs->error = XS_RESOURCE_SHORTAGE;
+				TAILQ_INSERT_TAIL(&sc->free_list, siop_cmd, next);
 				scsipi_done(xs);
 				splx(s);
 				return;
@@ -1300,6 +1301,7 @@ siop_scsipi_request(chan, req, arg)
 				aprint_error_dev(&sc->sc_c.sc_dev, "can't alloc lunsw for target %d\n",
 				    target);
 				xs->error = XS_RESOURCE_SHORTAGE;
+				TAILQ_INSERT_TAIL(&sc->free_list, siop_cmd, next);
 				scsipi_done(xs);
 				splx(s);
 				return;
@@ -1317,6 +1319,7 @@ siop_scsipi_request(chan, req, arg)
 				    "target %d lun %d\n",
 				    target, lun);
 				xs->error = XS_RESOURCE_SHORTAGE;
+				TAILQ_INSERT_TAIL(&sc->free_list, siop_cmd, next);
 				scsipi_done(xs);
 				splx(s);
 				return;
@@ -1334,7 +1337,10 @@ siop_scsipi_request(chan, req, arg)
 		if (error) {
 			aprint_error_dev(&sc->sc_c.sc_dev, "unable to load cmd DMA map: %d\n",
 			    error);
-			xs->error = XS_DRIVER_STUFFUP;
+			xs->error = (error == EAGAIN) ? XS_RESOURCE_SHORTAGE :
+							XS_DRIVER_STUFFUP;
+			siop_cmd->cmd_c.status = CMDST_FREE;
+			TAILQ_INSERT_TAIL(&sc->free_list, siop_cmd, next);
 			scsipi_done(xs);
 			splx(s);
 			return;
@@ -1346,9 +1352,12 @@ siop_scsipi_request(chan, req, arg)
 			    ((xs->xs_control & XS_CTL_DATA_IN) ?
 			     BUS_DMA_READ : BUS_DMA_WRITE));
 			if (error) {
-				aprint_error_dev(&sc->sc_c.sc_dev, "unable to load cmd DMA map: %d",
+				aprint_error_dev(&sc->sc_c.sc_dev, "unable to load data DMA map: %d",
 				    error);
-				xs->error = XS_DRIVER_STUFFUP;
+				xs->error = (error == EAGAIN) ? XS_RESOURCE_SHORTAGE :
+								XS_DRIVER_STUFFUP;
+				siop_cmd->cmd_c.status = CMDST_FREE;
+				TAILQ_INSERT_TAIL(&sc->free_list, siop_cmd, next);
 				scsipi_done(xs);
 				bus_dmamap_unload(sc->sc_c.sc_dmat,
 				    siop_cmd->cmd_c.dmamap_cmd);

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: bin-bug-people->kern-bug-people
Responsible-Changed-By: mhitch@NetBSD.org
Responsible-Changed-When: Thu, 18 Feb 2010 22:30:42 +0000
Responsible-Changed-Why:
Grrr.  mis-categorized.


From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: bin/42844: esiop(4)/siop(4) can lose cmd entries under
	resource shortage conditions
Date: Sat, 20 Feb 2010 08:17:16 +0000

 On Thu, Feb 18, 2010 at 10:20:00PM +0000, mhitch@NetBSD.org wrote:
  > The patches I am now using have made my CS20 much more reliable:  I no long
  > get large numbers of adapter resource shortage messages, and the failures to
  > load the DMA map no longer appear to corrupt my files.

 ...any reason you didn't just commit the patch?

 -- 
 David A. Holland
 dholland@netbsd.org

From: "Jonathan A. Kollasch" <jakllsch@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/42844 CVS commit: src/sys/dev/ic
Date: Fri, 9 Apr 2010 19:25:53 +0000

 Module Name:	src
 Committed By:	jakllsch
 Date:		Fri Apr  9 19:25:52 UTC 2010

 Modified Files:
 	src/sys/dev/ic: esiop.c siop.c

 Log Message:
 Improve error paths in (e)siop_scsipi_request():

  - When terminating the adapter request after the cmd has been removed
    from the free list, put that cmd back on the free list before returing.
  - Correctly indicate which bus_dma_load() failed.

 Analysis and fix from Michael L. Hitch in PR/42844.


 To generate a diff of this commit:
 cvs rdiff -u -r1.51 -r1.52 src/sys/dev/ic/esiop.c
 cvs rdiff -u -r1.94 -r1.95 src/sys/dev/ic/siop.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sun, 02 May 2010 06:44:28 +0000
State-Changed-Why:
Fixed?


From: "Michael L. Hitch" <mhitch@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/42844 (esiop(4)/siop(4) can lose cmd entries under resource
 shortage conditions)
Date: Sun, 2 May 2010 20:38:34 -0600 (MDT)

 On Sun, 2 May 2010, dholland@NetBSD.org wrote:

 > State-Changed-From-To: open->feedback
 > State-Changed-By: dholland@NetBSD.org
 > State-Changed-When: Sun, 02 May 2010 06:44:28 +0000
 > State-Changed-Why:
 > Fixed?

    It appears so, and I've requested a pullup for netbsd-5.

 --
 Michael L. Hitch			mhitch@NetBSD.org

State-Changed-From-To: feedback->pending-pullups
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Mon, 03 May 2010 04:43:12 +0000
State-Changed-Why:
awaiting pullup-5 #1390.


From: "Jeff Rizzo" <riz@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/42844 CVS commit: [netbsd-5] src/sys/dev/ic
Date: Sat, 20 Nov 2010 18:20:50 +0000

 Module Name:	src
 Committed By:	riz
 Date:		Sat Nov 20 18:20:49 UTC 2010

 Modified Files:
 	src/sys/dev/ic [netbsd-5]: esiop.c siop.c

 Log Message:
 Pull up following revision(s) (requested by mhitch in ticket #1390):
 	sys/dev/ic/siop.c: revision 1.95
 	sys/dev/ic/esiop.c: revision 1.52
 Improve error paths in (e)siop_scsipi_request():
   - When terminating the adapter request after the cmd has been removed
     from the free list, put that cmd back on the free list before returing.
   - Correctly indicate which bus_dma_load() failed.
 Analysis and fix from Michael L. Hitch in PR/42844.


 To generate a diff of this commit:
 cvs rdiff -u -r1.42 -r1.42.14.1 src/sys/dev/ic/esiop.c
 cvs rdiff -u -r1.87 -r1.87.14.1 src/sys/dev/ic/siop.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: pending-pullups->closed
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sat, 20 Nov 2010 18:52:56 +0000
State-Changed-Why:
Pullups complete.


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.