NetBSD Problem Report #41889

From Wolfgang.Stukenbrock@nagler-company.com  Fri Aug 14 11:49:46 2009
Return-Path: <Wolfgang.Stukenbrock@nagler-company.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by www.NetBSD.org (Postfix) with ESMTP id 48A7B63C284
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 14 Aug 2009 11:49:46 +0000 (UTC)
Message-Id: <20090814114942.7D95B4EA9FE@s012.nagler-company.com>
Date: Fri, 14 Aug 2009 13:49:42 +0200 (CEST)
From: Wolfgang.Stukenbrock@nagler-company.com
Reply-To: Wolfgang.Stukenbrock@nagler-company.com
To: gnats-bugs@gnats.NetBSD.org
Subject: st(4) driver splits larger IO request into peaces - compability problem with other OS's
X-Send-Pr-Version: 3.95

>Number:         41889
>Category:       kern
>Synopsis:       st(4) driver splits larger IO request into peaces - compability problem with other OS's
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          change-request
>Submitter-Id:   net
>Arrival-Date:   Fri Aug 14 11:50:00 +0000 2009
>Last-Modified:  Sun Aug 16 11:40:02 +0000 2009
>Originator:     Wolfgang Stukenbrock
>Release:        NetBSD 4.0
>Organization:
Dr. Nagler & Company GmbH

>Environment:


System: NetBSD s012 4.0 NetBSD 4.0 (NSW-S012) #9: Fri Mar 13 12:31:52 CET 2009 wgstuken@s012:/usr/src/sys/arch/amd64/compile/NSW-S012 amd64
Architecture: x86_64
Machine: amd64
>Description:
	The current implementation of physio() (in kern_physio.c) determines the maximum possible transfer size of the underlying HW
	be fist clamping to MAXPHYS and then asking the driver for further restrictions.
	This is fine for disk-IO, but leads to unexpected behaviour when exchangeing tapes with other operating systems.
	The "normal" semantic of a write(2) call on a tape device in variable-block-mode is, that exactly one tape block of that size
	is written to the media. (e.g. Solaris does it like this ...)
	If the tape device is not able to support that size, an error is returned.
	The current NetBSD implementation does not return an error in "all" cases - it will write multiple tape-blocks in some cases.
	This may break applications on other systems or other NetBSD-kernels if the MAXPHYS kernel options is used with a value other than the default.

	The default value for MAXPHYS is currently equal to MAXBSIZE - both are 64k - for nearly all kernel configurations.
	Some HW-driver (e.g. the ata/sata stuff) allow only may 64k transfers too, but the scsi drivers allows larger blocksizes.
	By the way: the 64k-limit for the max tape-blocksize is a very hard restriction and does not make realy sense.
	In order to be able to read tapes written on Solaris with 1MB blocks, you need to set MAYPHYS to 1024k (and MAXBSIZE to 64k - otherwise you need to
	fix some other parts in the kernel ...) in order read the data from the tape.
	OK so far, but the normal NetBSD kernel is not able to read anything from such a tape - and you get a misleading error message ... ugly ..

	You may be also out of luck, if you try to write e.g. 64k+1byte. At least our VXA-tape requires a blocksize of at least 4 bytes and the splitted
	transfer will be only 1 byte -> error ... This is not what I expect from a variable block size device if the device supports blocks up to 240k!

	The next compability problem occures if you have an application that tries to write e.g. 128k (or variable sized) blocks and the
	application knows the number of blocks on the tape. If you try to read the tape on another OS, you suddently get only 64k blocks but much more ...


	For TAPE read and write operations, physio() should be changed in an manner, that it does not split transfers for tape devices.

	The following patch will do this for the read() and write() system call.
	It returns EINVAL and print a message on the console.
	The idea of printing the messages comes from st.c, where a messages is printed if the tape does not support the requested block size.

	The readv() and writev() calls are affected in the following way:
	- each single vector results in a sigle write operation as before
	- no single vector is splitted into peaces anymore
	I'm realy not shure about the correct behaviour of readv/writev() on tape devices - I've never used it there ...
	From my understanding the iovec structure should "only" allow the user to split the memory space in user-level of one read() or write() operation into 
	several peaces. But for tape devices it should result in an operation that act on exactly on tape-block - at least this is my understanding of it.
	This not the case in the current NetBSD implemantion and also not the case with this patch!
	But I'm not shure about the correct behaviour for readv() and writev() - so this patch only affects the single transfers of it...
	The only way to archive this behaviour would be to copy anything into one block, because some comments in physio() seems to point out some
	problems when several user-level memeory peaches resides inside the same physical page. At least as far as I understand it.

	remark: this fix will return an arror after transfering some data if readv() or writev() is used! If this is not acceptable, the size-check must be done
		prior the main transfer loop.

	In this patch the exiting flag B_TAPE is (re)used. It looks a relict of some days, because it is only defined in the header file and ask for in specfs-code
	at exactly one place.  It is not used in the whole NetBSD-sources at any other location.
	The name sounds good, so I took it. Feel free to define an other name if you prefer this - some bits are sill unused.

	If the default value for MAXPHYS should be changed to something more common for tape devices is a religion question.
	There will be a limit in all cases. I would prefere to change it to at least 256k, but you may have a different point.
	remark: the default value of MAXPHYS is not affected by this patch. You need to set it in your kernel config if you need larger block sizes.
		But keep in mind to set MAXBSIZE back to 64k for the filesystem - otherwise the kernel will not compile ....
>How-To-Repeat:
	Write 10 128k-blocks on a tape on NetBSD and try to read 10 blocks on Solaris. You will only read 640k from the tape.
	Try to write a 65537 bytes block on a variable-block tape device with a mininum block size (e.g. VXA-320) - it will fail with the message,
	that the transfer size should be between 4 and 240k bytes. And 65537 is ...
>Fix:
	Apply the folloing patch to /usr/src/sys/kern/kern_physio.c and /usr/src/sys/dev/scsipi/st.c.
	read() and write() call will never be splitted into several transfers anymore.


--- kern_physio.c	2009/07/29 17:45:45	1.2
+++ kern_physio.c	2009/08/14 11:21:19
@@ -285,7 +285,7 @@
 	DPRINTF(("%s: called: off=%" PRIu64 ", resid=%zu\n",
 	    __func__, uio->uio_offset, uio->uio_resid));

-	flags &= B_READ | B_WRITE;
+	flags &= B_READ | B_WRITE | B_TAPE;

 	/* Make sure we have a buffer, creating one if necessary. */
 	if (obp != NULL) {
@@ -375,6 +375,15 @@
 			 * for later comparison.
 			 */
 			(*min_phys)(bp);
+			if (flags & B_TAPE) {
+				if (bp->b_bcount != iovp->iov_len) {
+					printf("TAPE physio - request %zu, but supported by HW max %u\n",
+						iovp->iov_len, bp->b_bcount);
+					error = EINVAL; /* do no split IO-request on tape devices ... */
+					goto done;
+
+				}
+			}
 			todo = bp->b_bufsize = bp->b_bcount;
 #if defined(DIAGNOSTIC)
 			if (todo > MAXPHYS)
--- st.c	2009/08/14 09:06:59	1.1
+++ st.c	2009/08/14 09:09:13
@@ -1380,8 +1380,13 @@
 {
 	struct st_softc *st = st_cd.cd_devs[STUNIT(dev)];

-	return (physio(ststrategy, NULL, dev, B_READ,
-	    st->sc_periph->periph_channel->chan_adapter->adapt_minphys, uio));
+	if (st->flags & ST_FIXEDBLOCKS) {
+		return (physio(ststrategy, NULL, dev, B_READ,
+		    st->sc_periph->periph_channel->chan_adapter->adapt_minphys, uio));
+	} else {
+		return (physio(ststrategy, NULL, dev, B_READ | B_TAPE,
+		    st->sc_periph->periph_channel->chan_adapter->adapt_minphys, uio));
+	}
 }

 static int
@@ -1389,8 +1394,13 @@
 {
 	struct st_softc *st = st_cd.cd_devs[STUNIT(dev)];

-	return (physio(ststrategy, NULL, dev, B_WRITE,
-	    st->sc_periph->periph_channel->chan_adapter->adapt_minphys, uio));
+	if (st->flags & ST_FIXEDBLOCKS) {
+		return (physio(ststrategy, NULL, dev, B_WRITE,
+		    st->sc_periph->periph_channel->chan_adapter->adapt_minphys, uio));
+	} else {
+		return (physio(ststrategy, NULL, dev, B_WRITE | B_TAPE,
+		    st->sc_periph->periph_channel->chan_adapter->adapt_minphys, uio));
+	}
 }

 /*

>Audit-Trail:
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/41889: st(4) driver splits larger IO request into peaces
	- compability problem with other OS's
Date: Sun, 16 Aug 2009 05:20:37 +0000

 On Fri, Aug 14, 2009 at 11:50:00AM +0000, Wolfgang.Stukenbrock@nagler-company.com wrote:
  > The current implementation of physio() (in kern_physio.c)
  > determines the maximum possible transfer size of the underlying HW
  > be fist clamping to MAXPHYS and then asking the driver for further
  > restrictions.

 See also PR 38643. :-|

 -- 
 David A. Holland
 dholland@netbsd.org

From: Wolfgang.Stukenbrock@nagler-company.com
To: gnats-bugs@NetBSD.org, "David Holland" <dholland-bugs@NetBSD.org>
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/41889: st(4) driver splits larger IO request into peaces
	- compability problem with other OS's
Date: Sun, 16 Aug 2009 13:38:12 +0200

 Hi again,

 I haven't seen 38643 before - OK.

 The problem there has the same cause.
 A tape device has a little bit different semantic in reading and =20
 writing as a "normal" block device.
 Blocknumber have no meaning for a tape and a tape block is the atom =20
 for a transfer.
 Therefore a write should write exaclty one tape block - or should fail.
 And a read should try to read exactly on tape block - or should fail.
 A read my succeeed only if the requested size is at least the size of =20
 the tape block.
 There is absolutly no way to get only "a second" half of a tape block =20
 from a tape.

 This problem needs a fix as soon as possible!

 My fix will "just" remove the silent misbehaviour for simple read and =20
 write calls.
 A fix fore readv and writev is mentioned, but perhaps someone has a =20
 better idea ..

 W. Stukenbrock

 Zitat von "David Holland" <dholland-bugs@NetBSD.org>:

 > The following reply was made to PR kern/41889; it has been noted by GNATS.
 >
 > From: David Holland <dholland-bugs@netbsd.org>
 > To: gnats-bugs@NetBSD.org
 > Cc:
 > Subject: Re: kern/41889: st(4) driver splits larger IO request into peaces
 > =09- compability problem with other OS's
 > Date: Sun, 16 Aug 2009 05:20:37 +0000
 >
 >  On Fri, Aug 14, 2009 at 11:50:00AM +0000, =20
 > Wolfgang.Stukenbrock@nagler-company.com wrote:
 >   > The current implementation of physio() (in kern_physio.c)
 >   > determines the maximum possible transfer size of the underlying HW
 >   > be fist clamping to MAXPHYS and then asking the driver for further
 >   > restrictions.
 >
 >  See also PR 38643. :-|
 >
 >  --
 >  David A. Holland
 >  dholland@netbsd.org
 >
 >


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.