NetBSD Problem Report #55965

From hauke@Espresso.Rhein-Neckar.DE  Fri Jan 29 17:31:35 2021
Return-Path: <hauke@Espresso.Rhein-Neckar.DE>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id D3EBB1A9239
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 29 Jan 2021 17:31:35 +0000 (UTC)
Message-Id: <202101291724.10THOmHP009163@henkelmann.causeuse.org>
Date: Fri, 29 Jan 2021 18:24:48 +0100 (CET)
From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
Reply-To: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: gnats-bugs@NetBSD.org
Cc: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
Subject: st(4) on adaptec fails to 'mt erase'
X-Send-Pr-Version: 3.95

>Number:         55965
>Category:       kern
>Synopsis:       st(4) on adaptec fails to 'mt erase'
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Jan 29 17:35:00 +0000 2021
>Last-Modified:  Sun Jan 31 09:20:02 +0000 2021
>Originator:     Hauke Fath
>Release:        NetBSD 9.1_STABLE
>Organization:
Falling Raindrops
>Environment:


System: NetBSD henkelmann.causeuse.org 9.1_STABLE NetBSD 9.1_STABLE (HENKELMANN) #0: Wed Jan 27 15:57:38 CET 2021 hauke@pizza.causeuse.org:/var/obj/netbsd-builds/9/macppc/sys/arch/macppc/compile/HENKELMANN macppc
Architecture: powerpc
Machine: macppc
>Description:

	An Adaptec 2944 (HVD) scsi adapter

[     1.000000] ahc0 at pci1 dev 2 function 0: Adaptec 2944 Ultra SCSI adapter
[     1.000000] genppc_pci_intr_map: pin: 3, line: 23
[     1.000000] ahc0: interrupting at irq 23
[     1.000000] ahc0: aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
[     1.000000] scsibus0 at ahc0: 16 targets, 8 luns per target

	with Quantum DLT4000 drives attached

[     4.319997] st0 at scsibus0 target 5 lun 0: <Quantum, DLT4000, CC1E> tape removable
[     4.319997] st0: drive empty
[     4.329997] st1 at scsibus0 target 6 lun 0: <Quantum, DLT4000, CC1E> tape removable
[     4.329997] st1: density code 25, variable blocks, write-enabled
[     4.329997] st1: sync (100.00ns offset 15), 8-bit (10.000MB/s) transfers

	generally works fine, but an 'mt erase' leads to surprising
	verbosity after a few seconds:

[...]
[   198.050100] ahc0:SCB 0xe - timed out
[   198.050100]  Dump Card State Begins <<<<<<<<<<<<<<<<<
[   198.050100] ahc0: Dumping Card State while idle, at SEQADDR 0x7
[   198.050100] Card was paused
[   198.050100] ACCUM = 0x32, SINDEX = 0x67, DINDEX = 0x27, ARG_2 = 0x3
[   198.050100] HCNT = 0x0 SCBPTR = 0x0
[   198.050100] SCSISIGI[0x0] ERROR[0x0] SCSIBUSL[0x0] LASTPHASE[0x1] 
[   198.050100] SCSISEQ[0x12] SBLKCTL[0x2] SCSIRATE[0x0] SEQCTL[0x10] 
[   198.050100] SEQ_FLAGS[0xc0] SSTAT0[0x5] SSTAT1[0xa] SSTAT2[0x0] 
[   198.050100] SSTAT3[0x0] SIMODE0[0x0] SIMODE1[0xa4] SXFRCTL0[0x80] 
[   198.050100] DFCNTRL[0x0] DFSTATUS[0x2d] 
[   198.050100] STACK: 0x0 0x16c 0x19c 0x3
[   198.050100] SCB count = 16
[   198.050100] Kernel NEXTQSCB = 15
[   198.050100] Card NEXTQSCB = 15
[   198.050100] QINFIFO entries: 
[   198.050100] Waiting Queue entries: 
[   198.050100] Disconnected Queue entries: 0:14 
[   198.050100] QOUTFIFO entries: 
[   198.050100] Sequencer Free SCB List: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 
[   198.050100] Sequencer SCB Info: 
[   198.050100]   0 SCB_CONTROL[0x44] SCB_SCSIID[0x67] 
[   198.050100] SCB_LUN[0x0] SCB_TAG[0xe] 
[   198.050100]   1 SCB_CONTROL[0x0] SCB_SCSIID[0xff] 
[   198.050100] SCB_LUN[0xff] SCB_TAG[0xff] 
[   198.050100]   2 SCB_CONTROL[0x0] SCB_SCSIID[0xff] 
[   198.050100] SCB_LUN[0xff] SCB_TAG[0xff] 
[   198.050100]   3 SCB_CONTROL[0x0] SCB_SCSIID[0xff] 
[   198.050100] SCB_LUN[0xff] SCB_TAG[0xff] 
[   198.050100]   4 SCB_CONTROL[0x0] SCB_SCSIID[0xff] 
[   198.050100] SCB_LUN[0xff] SCB_TAG[0xff] 
[   198.050100]   5 SCB_CONTROL[0x0] SCB_SCSIID[0xff] 
[   198.050100] SCB_LUN[0xff] SCB_TAG[0xff] 
[   198.050100]   6 SCB_CONTROL[0x0] SCB_SCSIID[0xff] 
[   198.050100] SCB_LUN[0xff] SCB_TAG[0xff] 
[   198.050100]   7 SCB_CONTROL[0x0] SCB_SCSIID[0xff] 
[   198.050100] SCB_LUN[0xff] SCB_TAG[0xff] 
[   198.050100]   8 SCB_CONTROL[0x0] SCB_SCSIID[0xff] 
[   198.050100] SCB_LUN[0xff] SCB_TAG[0xff] 
[   198.050100]   9 SCB_CONTROL[0x0] SCB_SCSIID[0xff] 
[   198.050100] SCB_LUN[0xff] SCB_TAG[0xff] 
[   198.050100]  10 SCB_CONTROL[0x0] SCB_SCSIID[0xff] 
[   198.050100] SCB_LUN[0xff] SCB_TAG[0xff] 
[   198.050100]  11 SCB_CONTROL[0x0] SCB_SCSIID[0xff] 
[   198.050100] SCB_LUN[0xff] SCB_TAG[0xff] 
[   198.050100]  12 SCB_CONTROL[0x0] SCB_SCSIID[0xff] 
[   198.050100] SCB_LUN[0xff] SCB_TAG[0xff] 
[   198.050100]  13 SCB_CONTROL[0x0] SCB_SCSIID[0xff] 
[   198.050100] SCB_LUN[0xff] SCB_TAG[0xff] 
[   198.050100]  14 SCB_CONTROL[0x0] SCB_SCSIID[0xff] 
[   198.050100] SCB_LUN[0xff] SCB_TAG[0xff] 
[   198.050100]  15 SCB_CONTROL[0x0] SCB_SCSIID[0xff] 
[   198.050100] SCB_LUN[0xff] SCB_TAG[0xff] 
[   198.050100] Pending list: 
[   198.050100]  14 SCB_CONTROL[0x40] SCB_SCSIID[0x67] 
[   198.050100] SCB_LUN[0x0] 
[   198.050100] Kernel Free SCB list: 13 12 11 10 9 8 7 6 5 4 3 2 1 0 
[   198.050100] Untagged Q(6): 14 

[   198.050100] 
[   198.050100] ahc0:Queuing a BDR SCB
[   198.050100] ahc0:Bus Device Reset Message Sent
[   198.050100] st1(ahc0:0:6:0): ahc0: no longer in timeout, status = 0
[   198.050100] ahc0: Bus Device Reset on A:6. 1 SCBs aborted
[...]

	and no erase.

	I originally encountered this on the machine's netbsd-5
	installation, then upgraded since "surely this has long been
	fixed". Not so, although -5 lost the scsibus as a result till
	the next reboot, whereas -9 will recover, and you can repeat
	the exercise.

	This PowerMac G3 B/W is my only homeoffice machine with a pci
	bus, but I don't think the issue is specific to macppc. This
	used to be my home server for a couple of years, and it ran a
	DLT4000 off a different interface just fine.


>How-To-Repeat:


	Hang a DLT drive off an Adaptec 2944 (HVD), and attempt to
	erase a tape.

>Fix:


	Yes, please. 

	My single-ended drive has died, and I need to erase the tapes...

>Audit-Trail:
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Fri, 29 Jan 2021 19:06:43 +0100

 Also part of the now closed bug from 2005 - PR/kern 29369.

 Seems like the ahc driver did not get any love on timeout processing. 
 Any takers?

 Frank


 On 01/29/21 18:35, Hauke Fath wrote:
 >> Number:         55965
 >> Category:       kern
 >> Synopsis:       st(4) on adaptec fails to 'mt erase'
 >> Confidential:   no
 >> Severity:       serious
 >> Priority:       medium
 >> Responsible:    kern-bug-people
 >> State:          open
 >> Class:          sw-bug
 >> Submitter-Id:   net
 >> Arrival-Date:   Fri Jan 29 17:35:00 +0000 2021
 >> Originator:     Hauke Fath
 >> Release:        NetBSD 9.1_STABLE
 >> Organization:
 > Falling Raindrops
 >> Environment:
 > 	
 > 	
 > System: NetBSD henkelmann.causeuse.org 9.1_STABLE NetBSD 9.1_STABLE (HENKELMANN) #0: Wed Jan 27 15:57:38 CET 2021 hauke@pizza.causeuse.org:/var/obj/netbsd-builds/9/macppc/sys/arch/macppc/compile/HENKELMANN macppc
 > Architecture: powerpc
 > Machine: macppc
 >> Description:
 > 	An Adaptec 2944 (HVD) scsi adapter
 >
 > [     1.000000] ahc0 at pci1 dev 2 function 0: Adaptec 2944 Ultra SCSI adapter
 > [     1.000000] genppc_pci_intr_map: pin: 3, line: 23
 > [     1.000000] ahc0: interrupting at irq 23
 > [     1.000000] ahc0: aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
 > [     1.000000] scsibus0 at ahc0: 16 targets, 8 luns per target
 >
 > 	with Quantum DLT4000 drives attached
 >
 > [     4.319997] st0 at scsibus0 target 5 lun 0: <Quantum, DLT4000, CC1E> tape removable
 > [     4.319997] st0: drive empty
 > [     4.329997] st1 at scsibus0 target 6 lun 0: <Quantum, DLT4000, CC1E> tape removable
 > [     4.329997] st1: density code 25, variable blocks, write-enabled
 > [     4.329997] st1: sync (100.00ns offset 15), 8-bit (10.000MB/s) transfers
 >
 > 	generally works fine, but an 'mt erase' leads to surprising
 > 	verbosity after a few seconds:
 >
 > [...]
 > [   198.050100] ahc0:SCB 0xe - timed out
 > [   198.050100]  Dump Card State Begins <<<<<<<<<<<<<<<<<
 > [   198.050100] ahc0: Dumping Card State while idle, at SEQADDR 0x7
 > [   198.050100] Card was paused
 > [   198.050100] ACCUM = 0x32, SINDEX = 0x67, DINDEX = 0x27, ARG_2 = 0x3
 > [   198.050100] HCNT = 0x0 SCBPTR = 0x0
 > [   198.050100] SCSISIGI[0x0] ERROR[0x0] SCSIBUSL[0x0] LASTPHASE[0x1]
 > [   198.050100] SCSISEQ[0x12] SBLKCTL[0x2] SCSIRATE[0x0] SEQCTL[0x10]
 > [   198.050100] SEQ_FLAGS[0xc0] SSTAT0[0x5] SSTAT1[0xa] SSTAT2[0x0]
 > [   198.050100] SSTAT3[0x0] SIMODE0[0x0] SIMODE1[0xa4] SXFRCTL0[0x80]
 > [   198.050100] DFCNTRL[0x0] DFSTATUS[0x2d]
 > [   198.050100] STACK: 0x0 0x16c 0x19c 0x3
 > [   198.050100] SCB count = 16
 > [   198.050100] Kernel NEXTQSCB = 15
 > [   198.050100] Card NEXTQSCB = 15
 > [   198.050100] QINFIFO entries:
 > [   198.050100] Waiting Queue entries:
 > [   198.050100] Disconnected Queue entries: 0:14
 > [   198.050100] QOUTFIFO entries:
 > [   198.050100] Sequencer Free SCB List: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
 > [   198.050100] Sequencer SCB Info:
 > [   198.050100]   0 SCB_CONTROL[0x44] SCB_SCSIID[0x67]
 > [   198.050100] SCB_LUN[0x0] SCB_TAG[0xe]
 > [   198.050100]   1 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
 > [   198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
 > [   198.050100]   2 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
 > [   198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
 > [   198.050100]   3 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
 > [   198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
 > [   198.050100]   4 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
 > [   198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
 > [   198.050100]   5 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
 > [   198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
 > [   198.050100]   6 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
 > [   198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
 > [   198.050100]   7 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
 > [   198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
 > [   198.050100]   8 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
 > [   198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
 > [   198.050100]   9 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
 > [   198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
 > [   198.050100]  10 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
 > [   198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
 > [   198.050100]  11 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
 > [   198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
 > [   198.050100]  12 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
 > [   198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
 > [   198.050100]  13 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
 > [   198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
 > [   198.050100]  14 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
 > [   198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
 > [   198.050100]  15 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
 > [   198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
 > [   198.050100] Pending list:
 > [   198.050100]  14 SCB_CONTROL[0x40] SCB_SCSIID[0x67]
 > [   198.050100] SCB_LUN[0x0]
 > [   198.050100] Kernel Free SCB list: 13 12 11 10 9 8 7 6 5 4 3 2 1 0
 > [   198.050100] Untagged Q(6): 14
 >
 > [   198.050100]
 > [   198.050100] ahc0:Queuing a BDR SCB
 > [   198.050100] ahc0:Bus Device Reset Message Sent
 > [   198.050100] st1(ahc0:0:6:0): ahc0: no longer in timeout, status = 0
 > [   198.050100] ahc0: Bus Device Reset on A:6. 1 SCBs aborted
 > [...]
 >
 > 	and no erase.
 >
 > 	I originally encountered this on the machine's netbsd-5
 > 	installation, then upgraded since "surely this has long been
 > 	fixed". Not so, although -5 lost the scsibus as a result till
 > 	the next reboot, whereas -9 will recover, and you can repeat
 > 	the exercise.
 >
 > 	This PowerMac G3 B/W is my only homeoffice machine with a pci
 > 	bus, but I don't think the issue is specific to macppc. This
 > 	used to be my home server for a couple of years, and it ran a
 > 	DLT4000 off a different interface just fine.
 >
 > 	
 >> How-To-Repeat:
 > 	
 >
 > 	Hang a DLT drive off an Adaptec 2944 (HVD), and attempt to
 > 	erase a tape.
 >   
 >> Fix:
 > 	
 >
 > 	Yes, please.
 >
 > 	My single-ended drive has died, and I need to erase the tapes...
 >
 >> Unformatted:
 >   	
 >   	

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
        Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Fri, 29 Jan 2021 19:55:10 +0100

 On Fri, Jan 29, 2021 at 06:10:01PM +0000, Frank Kardel wrote:
 > The following reply was made to PR kern/55965; it has been noted by GNATS.
 > 
 > From: Frank Kardel <kardel@netbsd.org>
 > To: gnats-bugs@netbsd.org
 > Cc: 
 > Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
 > Date: Fri, 29 Jan 2021 19:06:43 +0100
 > 
 >  Also part of the now closed bug from 2005 - PR/kern 29369.
 >  
 >  Seems like the ahc driver did not get any love on timeout processing. 
 >  Any takers?

 Actually it seems that the timeout processing did improve, as
 the bus now recovers.
 What's strange here is that the timeout seems to happen way too fast
 (the command should have a 3h timeout).

 The first thing to check is the timeout value, and if it matches what the
 driver did ask.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Fri, 29 Jan 2021 21:48:58 +0100

 --Apple-Mail=_D8C0BEDB-E426-4422-9DCD-12B2ADCD6469
 Content-Type: multipart/alternative;
 	boundary="Apple-Mail=_DAD8B8EE-8EE8-4B0E-8C57-CB5D23EF8EB2"


 --Apple-Mail=_DAD8B8EE-8EE8-4B0E-8C57-CB5D23EF8EB2
 Content-Transfer-Encoding: 7bit
 Content-Type: text/plain;
 	charset=us-ascii

 > On 29. Jan 2021, at 19:55, Manuel Bouyer <bouyer@antioche.eu.org> wrote:
 <snip>
 > What's strange here is that the timeout seems to happen way too fast
 > (the command should have a 3h timeout).

 For a short erase (the default?) it is 3 min.

 --
 J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig

 --Apple-Mail=_DAD8B8EE-8EE8-4B0E-8C57-CB5D23EF8EB2
 Content-Transfer-Encoding: 7bit
 Content-Type: text/html;
 	charset=us-ascii

 <html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"><blockquote type="cite" class="">On 29. Jan 2021, at 19:55, Manuel Bouyer &lt;<a href="mailto:bouyer@antioche.eu.org" class="">bouyer@antioche.eu.org</a>&gt; wrote:<br class=""></blockquote>&lt;snip&gt;<div class=""><blockquote type="cite" class="">What's strange here is that the timeout seems to happen way too fast</blockquote><blockquote type="cite" class="">(the command should have a 3h timeout).<br class=""></blockquote><div class=""><br class=""></div>For a short erase (the default?) it is 3 min.<br class=""><br class="">--<br class="">J. Hannken-Illjes - <a href="mailto:hannken@eis.cs.tu-bs.de" class="">hannken@eis.cs.tu-bs.de</a> - TU Braunschweig<br class=""></div></body></html>
 --Apple-Mail=_DAD8B8EE-8EE8-4B0E-8C57-CB5D23EF8EB2--

 --Apple-Mail=_D8C0BEDB-E426-4422-9DCD-12B2ADCD6469
 Content-Transfer-Encoding: 7bit
 Content-Disposition: attachment;
 	filename=signature.asc
 Content-Type: application/pgp-signature;
 	name=signature.asc
 Content-Description: Message signed with OpenPGP

 -----BEGIN PGP SIGNATURE-----

 iQEzBAEBCAAdFiEE2BL3ha7Xao4WUZVYKoaVJdNr+uEFAmAUdLoACgkQKoaVJdNr
 +uEItgf/Qhc6pUrt4Ytq7NRrKNyQT7snIF98416Yf4ghra/m4wM70P8fSWSKYlb4
 RRvWciP7Zg0IHtQECBJtH92viQPPHIrl5wxYsVQi6oJUSqlOLJC3ITLr3hCUb7wJ
 HgunZxfriMel4MqqslxSLJskSkdKjt2YpqrbssakjudC9xqScmthNhwTUMdKeg2C
 GuDRcajMhgsGeHwZOdHAxTUHWBHTEUkVPs1GGNVuFbaLo8k+fp8uRhpjRiF6BBNc
 HxW7pYV8SzCEf3biYDJve6oxVYJ85u6d1otF9z5MvdcnB4sfEWSzjAwiS4/6I5rq
 o3gJ/8WA5QoYk7iQKuDRPaqzr0pMxA==
 =Bt31
 -----END PGP SIGNATURE-----

 --Apple-Mail=_D8C0BEDB-E426-4422-9DCD-12B2ADCD6469--

From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Fri, 29 Jan 2021 22:00:34 +0100

 On Fri, 29 Jan 2021 19:55:10 +0100, Manuel Bouyer wrote:
 > What's strange here is that the timeout seems to happen way too fast
 > (the command should have a 3h timeout).

 Four hours for the full erase, three minutes for the short one (which=20
 works).

 The full erase errors out within less than a minute, though.

 > The first thing to check is the timeout value, and if it matches what the
 > driver did ask.

 Where would I look? I got a mental stack overflow trying to trace the=20
 scsipi call sequence from an st 'erase' ioctl...

 Cheerio,
 Hauke

 --=20
 Hauke Fath                        <hauke@Espresso.Rhein-Neckar.DE>
 Linn=E9weg 7
 64342 Seeheim-Jugenheim
 Germany

From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Fri, 29 Jan 2021 22:07:23 +0100

 On Fri, 29 Jan 2021 20:50:01 +0000 (UTC), J. Hannken-Illjes wrote:
 > For a short erase (the default?) it is 3 min.

 The default is a long erase, if I read mt.c correctly - without a=20
 parameter to erase, it substitutes '1' for 'count'.

 mt(1) is wrong about ignoring the 'count' parameter.

 Cheerio,
 Hauke

 --=20
 Hauke Fath                        <hauke@Espresso.Rhein-Neckar.DE>
 Linn=E9weg 7
 64342 Seeheim-Jugenheim
 Germany

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Fri, 29 Jan 2021 22:43:11 +0100

 On Fri, Jan 29, 2021 at 10:00:34PM +0100, Hauke Fath wrote:
 > On Fri, 29 Jan 2021 19:55:10 +0100, Manuel Bouyer wrote:
 > > What's strange here is that the timeout seems to happen way too fast
 > > (the command should have a 3h timeout).
 > 
 > Four hours for the full erase, three minutes for the short one (which 
 > works).
 > 
 > The full erase errors out within less than a minute, though.
 > 
 > > The first thing to check is the timeout value, and if it matches what the
 > > driver did ask.
 > 
 > Where would I look? I got a mental stack overflow trying to trace the 
 > scsipi call sequence from an st 'erase' ioctl...

 I would start from the ahc driver.
 In aic7xxx_osm.c, ahc_timeout(), you could print the value of
 scb->xs->timeout. If the value looks good, the issue is probably in the ahc
 driver and you'll have to find where the callout has been scheduled.
 If the value is wrong, then maybe a Debugger() call (to get a stack trace)
 when this specific command appears in ADAPTER_REQ_RUN_XFER can help.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Fri, 29 Jan 2021 22:31:40 +0100

 On Fri, 29 Jan 2021 21:10:01 +0000 (UTC), Hauke Fath wrote:
 >  The default is a long erase, if I read mt.c correctly - without a
 >  parameter to erase, it substitutes '1' for 'count'.
 >  
 >  mt(1) is wrong about ignoring the 'count' parameter.

 See also the discussion in <http://gnats.netbsd.org/9840>

 ;)

From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>, gnats-bugs@NetBSD.org,
        kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sat, 30 Jan 2021 13:19:22 +0100

 At 22:43 Uhr +0100 29.01.2021, Manuel Bouyer wrote:
 >In aic7xxx_osm.c, ahc_timeout(), you could print the value of
 >scb->xs->timeout.

 Hm. I get '5000', which is pretty much what a 'time mt erase' gives me...

 --
 "It's never straight up and down"     (DEVO)


From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sat, 30 Jan 2021 17:54:20 +0100

 Well, from the code it should either be

 in st.c:st_erase:
      cmd.opcode = ERASE;
      if (full) {
          cmd.byte2 = SE_LONG;
          tmo = ST_SPC_TIME;
      } else
          tmo = ST_IO_TIME;

 stvar.h:57:#define    ST_IO_TIME    (3 * 60 * 1000)        /* 3 minutes */
 stvar.h:59:#define    ST_SPC_TIME    (4 * 60 * 60 * 1000)    /* 4 hours */

 180000ms  or 14400000ms (or longer if the drive reported longer timeouts 
 via MAINTENANCE_IN/REPORT_SUPPORTED_OPCODES for
 a SCSI-3 or above device and reasonably new code).

 So 5000 is definitely strange. Where does this value get lost? 5 secs is 
 not even enough for REWIND/SPACE.

 Frank

 On 01/30/21 13:45, Hauke Fath wrote:
 > The following reply was made to PR kern/55965; it has been noted by GNATS.
 >
 > From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
 > To: Manuel Bouyer <bouyer@antioche.eu.org>
 > Cc: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>, gnats-bugs@NetBSD.org,
 >          kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
 > Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
 > Date: Sat, 30 Jan 2021 13:19:22 +0100
 >
 >   At 22:43 Uhr +0100 29.01.2021, Manuel Bouyer wrote:
 >   >In aic7xxx_osm.c, ahc_timeout(), you could print the value of
 >   >scb->xs->timeout.
 >   
 >   Hm. I get '5000', which is pretty much what a 'time mt erase' gives me...
 >   
 >   --
 >   "It's never straight up and down"     (DEVO)
 >   
 >   

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sat, 30 Jan 2021 18:13:44 +0100

 On Sat, Jan 30, 2021 at 01:19:22PM +0100, Hauke Fath wrote:
 > At 22:43 Uhr +0100 29.01.2021, Manuel Bouyer wrote:
 > >In aic7xxx_osm.c, ahc_timeout(), you could print the value of
 > >scb->xs->timeout.
 > 
 > Hm. I get '5000', which is pretty much what a 'time mt erase' gives me...

 this should be in milliseconds. So it's 5s ....

 Now to find where the value comes from.
 You could print the first few bytes of xs->cmd to make sure it's a ERASE
 command. Then add prints from ADAPTER_REQ_RUN_XFER to check the timeout
 value when this command is sent to the adapter.
 Eventuelly add a Debugger() call and get a stack trace to see the code
 path that gets there and find where the timeout value is altered.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sat, 30 Jan 2021 21:16:19 +0100

 On Sat, 30 Jan 2021 18:13:44 +0100, Manuel Bouyer wrote:
 > You could print the first few bytes of xs->cmd to make sure it's a ERASE
 > command.

 It's 0x1e - SCSI_PREVENT_ALLOW_MEDIUM_REMOVAL, not ERASE.

 That shouldn't take seconds?

From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sat, 30 Jan 2021 21:18:54 +0100

 On Sat, 30 Jan 2021 21:16:19 +0100, Hauke Fath wrote:
 > On Sat, 30 Jan 2021 18:13:44 +0100, Manuel Bouyer wrote:
 >> You could print the first few bytes of xs->cmd to make sure it's a ERASE
 >> command.
 > 
 > It's 0x1e - SCSI_PREVENT_ALLOW_MEDIUM_REMOVAL, not ERASE.

 ... which indeed is being issued with a timeout of 5000 msec, so that's 
 explained?

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sat, 30 Jan 2021 21:26:23 +0100

 On Sat, Jan 30, 2021 at 09:18:54PM +0100, Hauke Fath wrote:
 > On Sat, 30 Jan 2021 21:16:19 +0100, Hauke Fath wrote:
 > > On Sat, 30 Jan 2021 18:13:44 +0100, Manuel Bouyer wrote:
 > >> You could print the first few bytes of xs->cmd to make sure it's a ERASE
 > >> command.
 > > 
 > > It's 0x1e - SCSI_PREVENT_ALLOW_MEDIUM_REMOVAL, not ERASE.
 > 
 > ... which indeed is being issued with a timeout of 5000 msec, so that's 
 > explained?

 5s sounds enough. But maybe try to bump the timeout to see what happens ?

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
        Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>,
        Manuel Bouyer <bouyer@antioche.eu.org>
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sat, 30 Jan 2021 22:22:48 +0100

 At 20:30 Uhr +0000 30.01.2021, Manuel Bouyer wrote:
 > > > It's 0x1e - SCSI_PREVENT_ALLOW_MEDIUM_REMOVAL, not ERASE.
 > >
 > > ... which indeed is being issued with a timeout of 5000 msec, so that's
 > > explained?
 >
 > 5s sounds enough. But maybe try to bump the timeout to see what happens ?

 Funky... With timeout bumped to 60s, an 'mt erase' returns after 35s, which
 the drive audibly spent mounting the tape. Then - nothing, no tape running
 for erase.

 An 'mt status' then leads to a

 [ 1433.8827963] st0(ahc0:0:5:0):  DEFERRED ERROR, key = 0x3
 [ 1433.8827963] st0(ahc0:0:5:0):  Check Condition on CDB: 0x00 00 00 00 00 00
 [ 1433.9128026]     SENSE KEY:  Media Error
 [ 1433.9128026]      ASC/ASCQ:  ASC 0x80 ASCQ 0x01
 [ 1433.9128026] st0: mount error (sense key=3) - terminating mount session

 "Rinse, lather repeat."

 --
 "It's never straight up and down"     (DEVO)


From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sat, 30 Jan 2021 21:30:01 -0000 (UTC)

 kardel@netbsd.org (Frank Kardel) writes:

 > So 5000 is definitely strange. Where does this value get lost? 5 secs is 
 > not even enough for REWIND/SPACE.

 It's the timeout for PREVENT/ALLOW.

 -- 
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sat, 30 Jan 2021 22:39:11 +0100

 On Sat, Jan 30, 2021 at 10:22:48PM +0100, Hauke Fath wrote:
 > At 20:30 Uhr +0000 30.01.2021, Manuel Bouyer wrote:
 > > > > It's 0x1e - SCSI_PREVENT_ALLOW_MEDIUM_REMOVAL, not ERASE.
 > > >
 > > > ... which indeed is being issued with a timeout of 5000 msec, so that's
 > > > explained?
 > >
 > > 5s sounds enough. But maybe try to bump the timeout to see what happens ?
 > 
 > Funky... With timeout bumped to 60s, an 'mt erase' returns after 35s, which

 And did you get the timeout message from the kernel ?

 > the drive audibly spent mounting the tape. Then - nothing, no tape running
 > for erase.
 > 
 > An 'mt status' then leads to a
 > 
 > [ 1433.8827963] st0(ahc0:0:5:0):  DEFERRED ERROR, key = 0x3
 > [ 1433.8827963] st0(ahc0:0:5:0):  Check Condition on CDB: 0x00 00 00 00 00 00
 > [ 1433.9128026]     SENSE KEY:  Media Error
 > [ 1433.9128026]      ASC/ASCQ:  ASC 0x80 ASCQ 0x01
 > [ 1433.9128026] st0: mount error (sense key=3) - terminating mount session

 that looks like a red flag; did you try with another tape ?

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>, gnats-bugs@netbsd.org,
        kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sat, 30 Jan 2021 23:30:27 +0100

 At 22:39 Uhr +0100 30.01.2021, Manuel Bouyer wrote:
 >> Funky... With timeout bumped to 60s, an 'mt erase' returns after 35s, which
 >
 >And did you get the timeout message from the kernel ?

 No, not that time... But the next command (the 'mt status') then errored
 out like below.

 >> the drive audibly spent mounting the tape. Then - nothing, no tape running
 >> for erase.
 >>
 >> An 'mt status' then leads to a
 >>
 >> [ 1433.8827963] st0(ahc0:0:5:0):  DEFERRED ERROR, key = 0x3
 >> [ 1433.8827963] st0(ahc0:0:5:0):  Check Condition on CDB: 0x00 00 00 00
 >>00 00
 >> [ 1433.9128026]     SENSE KEY:  Media Error
 >> [ 1433.9128026]      ASC/ASCQ:  ASC 0x80 ASCQ 0x01
 >> [ 1433.9128026] st0: mount error (sense key=3) - terminating mount session
 >
 >that looks like a red flag; did you try with another tape ?

 I did, and also with the other drive in the case. I can happily dd(1) data
 to and from the tape, though, and suspect the above might be the drive
 getting confused about the aborted erase.

 Next, I went all out, and increased the timeout in scsipi_prevent() to 4
 hours (four). Now the drive is happily running the tape, presumably erasing
 away.

 That's a workaround. But mt is hanging there, whereas st.c says erase is an
 asynchronous operation? And why would the operation be governed by the
 scsipi_prevent() timeout? Isn't that a bit overly generic, when it has to
 accommodate any and all devices?

 Cheerio,
 Hauke


 --
 "It's never straight up and down"     (DEVO)


From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sat, 30 Jan 2021 23:43:50 +0100

 On Sat, Jan 30, 2021 at 11:30:27PM +0100, Hauke Fath wrote:
 > At 22:39 Uhr +0100 30.01.2021, Manuel Bouyer wrote:
 > >> Funky... With timeout bumped to 60s, an 'mt erase' returns after 35s, which
 > >
 > >And did you get the timeout message from the kernel ?
 > 
 > No, not that time... But the next command (the 'mt status') then errored
 > out like below.
 > 
 > >> the drive audibly spent mounting the tape. Then - nothing, no tape running
 > >> for erase.
 > >>
 > >> An 'mt status' then leads to a
 > >>
 > >> [ 1433.8827963] st0(ahc0:0:5:0):  DEFERRED ERROR, key = 0x3
 > >> [ 1433.8827963] st0(ahc0:0:5:0):  Check Condition on CDB: 0x00 00 00 00
 > >>00 00
 > >> [ 1433.9128026]     SENSE KEY:  Media Error
 > >> [ 1433.9128026]      ASC/ASCQ:  ASC 0x80 ASCQ 0x01
 > >> [ 1433.9128026] st0: mount error (sense key=3) - terminating mount session
 > >
 > >that looks like a red flag; did you try with another tape ?
 > 
 > I did, and also with the other drive in the case. I can happily dd(1) data
 > to and from the tape, though, and suspect the above might be the drive
 > getting confused about the aborted erase.
 > 
 > Next, I went all out, and increased the timeout in scsipi_prevent() to 4
 > hours (four). Now the drive is happily running the tape, presumably erasing
 > away.
 > 
 > That's a workaround. But mt is hanging there, whereas st.c says erase is an
 > asynchronous operation? And why would the operation be governed by the
 > scsipi_prevent() timeout? Isn't that a bit overly generic, when it has to
 > accommodate any and all devices?

 Hum. I've not looked at the code but maybe the erase is started and
 the ioctl returns. Then mt exists, causing the device from being
 closed. This causes scsipi_prevent() from being called, and this
 command waits for the erase to complete.

 Baybe you should use /dev/ernst0 instead of rst0 ?

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sun, 31 Jan 2021 00:21:55 +0100

 At 23:43 Uhr +0100 30.01.2021, Manuel Bouyer wrote:
 >Hum. I've not looked at the code but maybe the erase is started and
 >the ioctl returns. Then mt exists,

 "exits", you mean? No, it doesn't, that is what I am puzzled about.

 Naively, I would have expected mt to return, and have a way of checking on
 the tape if I am curious.

 mt sitting there patiently until the drive is finished with the erase is
 synchronous behaviour in my book.

 > causing the device from being
 >closed. This causes scsipi_prevent() from being called, and this
 >command waits for the erase to complete.
 >
 >Baybe you should use /dev/ernst0 instead of rst0 ?

 You mean "enrst0", AKA control mode? Since I have a tape in the drive to be
 erased, what should that buy me?

 I traditionally use nrst?, which is fine as long as the drive has a tape to
 play...

 Cheerio,
 Hauke


 --
 "It's never straight up and down"     (DEVO)


From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sun, 31 Jan 2021 09:48:53 +0100

 On Sun, Jan 31, 2021 at 12:21:55AM +0100, Hauke Fath wrote:
 > At 23:43 Uhr +0100 30.01.2021, Manuel Bouyer wrote:
 > >Hum. I've not looked at the code but maybe the erase is started and
 > >the ioctl returns. Then mt exists,
 > 
 > "exits", you mean?

 Yes.

 > No, it doesn't, that is what I am puzzled about.
 > 
 > Naively, I would have expected mt to return, and have a way of checking on
 > the tape if I am curious.
 > 
 > mt sitting there patiently until the drive is finished with the erase is
 > synchronous behaviour in my book.

 This is because closing the device cause the call to scsipi_prevent(),
 which waits for the erase to complete.
 exit() does an implicit close, if mt doens't do it itself.

 > 
 > > causing the device from being
 > >closed. This causes scsipi_prevent() from being called, and this
 > >command waits for the erase to complete.
 > >
 > >Baybe you should use /dev/ernst0 instead of rst0 ?
 > 
 > You mean "enrst0", AKA control mode? Since I have a tape in the drive to be
 > erased, what should that buy me?

 Avoid the call to scsipi_prevent() on close.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sun, 31 Jan 2021 09:14:52 -0000 (UTC)

 hauke@Espresso.Rhein-Neckar.DE (Hauke Fath) writes:

 > That's a workaround. But mt is hanging there, whereas st.c says erase is an
 > asynchronous operation? And why would the operation be governed by the
 > scsipi_prevent() timeout?

 The scsipi_prevent() is called on close to release the drive and I suspect
 the asynchronous erase just blocks the drive from executing and finishing
 the prevent/allow command.

 Now, that should only present an error (and keep the drive locked),
 but the erase should complete somewhen.

 But if the driver tries a bus reset in the timeout handling which also
 resets the drive, this would abort the erase.

 -- 
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.