NetBSD Problem Report #41797

From Wolfgang.Stukenbrock@nagler-company.com  Wed Jul 29 17:43:21 2009
Return-Path: <Wolfgang.Stukenbrock@nagler-company.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by www.NetBSD.org (Postfix) with ESMTP id AD32363B882
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 29 Jul 2009 17:43:21 +0000 (UTC)
Message-Id: <20090729174316.E49CF4EA9FE@s012.nagler-company.com>
Date: Wed, 29 Jul 2009 19:43:16 +0200 (CEST)
From: Wolfgang.Stukenbrock@nagler-company.com
Reply-To: Wolfgang.Stukenbrock@nagler-company.com
To: gnats-bugs@gnats.NetBSD.org
Subject: kernel panic in kern_physio when tape reaches EOM during write if DIAGNOSTIC is enabled; without DIAGNOSTIC the error status is lost
X-Send-Pr-Version: 3.95

>Number:         41797
>Category:       kern
>Synopsis:       kernel panic in kern_physio when tape reaches EOM during write if DIAGNOSTIC is enabled; without DIAGNOSTIC the error status is lost
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Jul 29 17:45:00 +0000 2009
>Last-Modified:  Wed Jul 29 19:15:02 +0000 2009
>Originator:     Wolfgang Stukenbrock
>Release:        NetBSD 4.0
>Organization:
Dr. Nagler & Company GmbH

>Environment:


System: NetBSD s012 4.0 NetBSD 4.0 (NSW-S012) #9: Fri Mar 13 12:31:52 CET 2009 wgstuken@s012:/usr/src/sys/arch/amd64/compile/NSW-S012 amd64
Architecture: x86_64
Machine: amd64
>Description:
	We have a VXA320 tape drive connected to an Adaptec 29160 controller on this system.
	For debugging purposes we run a kernel with DIAGNOSTIC enabled.
	"Sometimes" the system panics with an assertion in kern_physio.c at line 201: "KASSERT((bp->b_flags & B_ERROR) == 0);".
	This assertion is only compiled in when the kernel is built with DIAGNOSTIC, so most users will never see the panic.
	(but the first EOM error status is lost; see the analysis below)
	It took some time to find out that the cause is the st driver.
	We are using the nrst devices, so no fixed block mode, and the default behaviour of these is EEW disabled.

	Now the following happens when EOM is reached on the tape:
	The XS command is returned with XS_SENSE from the ahc driver. The transfer count is equal to the number of bytes requested.
	The st driver detects that EOM is the cause of the problem. Because EEW is not enabled, it returns EIO.
	The st driver is called again to finish the packet with EIO indicated. It sets B_ERROR in the buffer.
	The physio done routine now checks whether all bytes have been transferred, and they have (!), so it reaches the assertion above -> panic.
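
	A minimal user-space sketch of the decision described above, not the kernel code itself.
	MODEL_B_ERROR, enum outcome and model_unpatched_done() are names made up for this illustration;
	only the two conditions are taken from this report. It shows that a fully transferred request
	with B_ERROR set is the one case that reaches the assertion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's B_ERROR flag; the value is arbitrary. */
#define MODEL_B_ERROR 0x1u

enum outcome { OUTCOME_OK, OUTCOME_ERROR_PATH, OUTCOME_ASSERT_FIRES };

/* Models the unpatched check in physio_done() as quoted in this report. */
static enum outcome
model_unpatched_done(size_t done, size_t todo, unsigned flags)
{
	/* Sanity check of the model itself: never more done than requested. */
	assert(done <= todo);

	if (done != todo)
		return OUTCOME_ERROR_PATH;	/* short transfer: error processing runs */
	if ((flags & MODEL_B_ERROR) != 0)
		return OUTCOME_ASSERT_FIRES;	/* KASSERT((bp->b_flags & B_ERROR) == 0) */
	return OUTCOME_OK;
}
```

	With DIAGNOSTIC the OUTCOME_ASSERT_FIRES case is the panic; without DIAGNOSTIC the same case
	is treated like OUTCOME_OK, which is how the error gets lost.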

	If EEW is enabled on the tape, the st driver returns 0 (no error) after detecting EOM and no problem occurs.

	I'm not really familiar with the return semantics of the HW controllers. The code seems to expect that the failed command is
	returned and the sense info has to be requested.
	In the case above, the ahc driver already returns the sense information. The driver seems to be able to handle this too.

	I don't know if it is a legal situation that all bytes have been written by the tape but EOM is signaled anyway.
	Perhaps this is a special case of the VXA tape drive.

	Nevertheless: the code in kern_physio.c physio_done() looks wrong to me, because it does not update the error status in mbp if all bytes
	have been transferred but B_ERROR has been set too. This loses the error information and no error is reported to user level
	as it should be.
	In fact, without DIAGNOSTIC in the kernel config, the first EOM hit by a write is not returned to user level! I've tested it.

	I think the way to fix this is to check the B_ERROR flag in physio_done() too and enter error processing if either not all
	requested bytes have been transferred or an error status is set.
	Remark: this leads to another bug in physio() some lines below: the check with delta must allow 0 too. We must allow an error
	even if all requested data has been transferred.
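
	The proposed change can be sketched as a small user-space model. model_patched_done() and
	model_delta() are hypothetical names, and the model assumes that uio_offset has already been
	advanced by the full request while the end offset is the start offset plus the bytes done.
	Under those assumptions it shows why delta becomes 0 once a full transfer may carry an error.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's B_ERROR flag; the value is arbitrary. */
#define MODEL_B_ERROR 0x1u

/* Returns 1 if the proposed check would enter error processing, else 0. */
static int
model_patched_done(size_t done, size_t todo, unsigned flags)
{
	return done != todo || (flags & MODEL_B_ERROR) != 0;
}

/*
 * Models "delta = uio->uio_offset - mbp->b_endoffset" for a single request
 * that started at 'start', asked for 'todo' bytes and completed 'done'.
 */
static long long
model_delta(long long start, size_t todo, size_t done)
{
	long long uio_offset = start + (long long)todo;	/* assumed: advanced by todo */
	long long endoffset = start + (long long)done;	/* start plus bytes done */
	long long delta = uio_offset - endoffset;

	assert(delta >= 0);	/* relaxed form of the KASSERT, 0 is now allowed */
	return delta;
}
```

	For the EOM case in this report done equals todo, so model_delta() yields 0 and the old
	"delta > 0" assertion would fire in physio() instead.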
>How-To-Repeat:
	Set up a kernel with DIAGNOSTIC, connect a SCSI tape to it and fill up the tape until it hits EOM.
	The system will panic there.
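
	One way to fill the tape is a small helper like the following. It is only a sketch:
	write_until_error() is a name made up here, and the device path and block size are left to
	the caller (for example a descriptor opened on /dev/nrst0 with 64k blocks, both assumptions).
	It writes zero-filled blocks until write(2) fails or returns a short count.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>	/* open() for callers of the helper */
#include <stdlib.h>
#include <unistd.h>

/*
 * Writes zero-filled blocks of block_size bytes to fd, at most max_blocks.
 * Returns the number of blocks written completely. *err_out is set to the
 * errno of a failed write, or 0 if the loop ended without a write error.
 */
static long
write_until_error(int fd, size_t block_size, long max_blocks, int *err_out)
{
	char *buf;
	long written = 0;

	assert(err_out != NULL);
	*err_out = 0;

	buf = calloc(1, block_size);
	if (buf == NULL) {
		*err_out = ENOMEM;
		return 0;
	}
	while (written < max_blocks) {
		ssize_t n = write(fd, buf, block_size);

		if (n < 0) {
			*err_out = errno;	/* first error seen at user level */
			break;
		}
		if ((size_t)n != block_size)
			break;			/* short write: stop, no errno */
		written++;
	}
	free(buf);
	return written;
}
```

	On an affected kernel without DIAGNOSTIC the first EOM hit would not show up in *err_out;
	with DIAGNOSTIC the panic happens before the helper returns.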
>Fix:
	The following fix needs to be applied to sys/kern/kern_physio.c.
	With this fix the system no longer panics and the error is returned to user level on first EOM detection.

--- kern_physio.c       2009/07/29 13:56:10     1.1
+++ kern_physio.c       2009/07/29 17:39:11
@@ -158,7 +158,7 @@
        uvm_vsunlock(bp->b_proc->p_vmspace, bp->b_data, todo);

        simple_lock(&mbp->b_interlock);
-       if (__predict_false(done != todo)) {
+       if (__predict_false(done != todo || (bp->b_flags & B_ERROR) != 0)) {
                off_t endoffset = dbtob(bp->b_blkno) + done;

                /*
@@ -197,8 +197,6 @@
                        mbp->b_error = error;
                }
                mbp->b_flags |= B_ERROR;
-       } else {
-               KASSERT((bp->b_flags & B_ERROR) == 0);
        }

        mbp->b_running--;
@@ -438,7 +436,7 @@
                off_t delta;

                delta = uio->uio_offset - mbp->b_endoffset;
-               KASSERT(delta > 0);
+               KASSERT(delta >= 0);
                uio->uio_resid += delta;
                /* uio->uio_offset = mbp->b_endoffset; */
        } else {

>Audit-Trail:
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/41797: kernel panic in kern_physio when tape reaches EOM
	during write if DIAGNOSTIC is enabled; without DIAGNOSTIC the error
	status is lost
Date: Wed, 29 Jul 2009 19:13:33 +0000

 On Wed, Jul 29, 2009 at 05:45:00PM +0000, Wolfgang.Stukenbrock@nagler-company.com wrote:
  > 	We have a VXA320 tape drive connected to an Adaptec 29160 controller
  > 	on this system.  For debugging purposes we run a kernel with
  > 	DIAGNOSTIC enabled.  "Sometimes" the system panics with an
  > 	assertion in kern_physio.c at line 201: "KASSERT((bp->b_flags &
  > 	B_ERROR) == 0);".

 I swear I've seen this problem reported before, but there isn't an
 open PR that I can find for it, and it doesn't look to have been fixed
 either. Does anyone else remember this or am I confusing it with
 something else?

 Anyhow, also see 38643, which is connected. The I/O completion
 reporting for st seems to be thoroughly unsatisfactory. :-|

 -- 
 David A. Holland
 dholland@netbsd.org

>Unformatted:
