NetBSD Problem Report #55965
From hauke@Espresso.Rhein-Neckar.DE Fri Jan 29 17:31:35 2021
Return-Path: <hauke@Espresso.Rhein-Neckar.DE>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id D3EBB1A9239
for <gnats-bugs@gnats.NetBSD.org>; Fri, 29 Jan 2021 17:31:35 +0000 (UTC)
Message-Id: <202101291724.10THOmHP009163@henkelmann.causeuse.org>
Date: Fri, 29 Jan 2021 18:24:48 +0100 (CET)
From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
Reply-To: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: gnats-bugs@NetBSD.org
Cc: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
Subject: st(4) on adaptec fails to 'mt erase'
X-Send-Pr-Version: 3.95
>Number: 55965
>Category: kern
>Synopsis: st(4) on adaptec fails to 'mt erase'
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Jan 29 17:35:00 +0000 2021
>Last-Modified: Sun Jan 31 09:20:02 +0000 2021
>Originator: Hauke Fath
>Release: NetBSD 9.1_STABLE
>Organization:
Falling Raindrops
>Environment:
System: NetBSD henkelmann.causeuse.org 9.1_STABLE NetBSD 9.1_STABLE (HENKELMANN) #0: Wed Jan 27 15:57:38 CET 2021 hauke@pizza.causeuse.org:/var/obj/netbsd-builds/9/macppc/sys/arch/macppc/compile/HENKELMANN macppc
Architecture: powerpc
Machine: macppc
>Description:
An Adaptec 2944 (HVD) scsi adapter
[ 1.000000] ahc0 at pci1 dev 2 function 0: Adaptec 2944 Ultra SCSI adapter
[ 1.000000] genppc_pci_intr_map: pin: 3, line: 23
[ 1.000000] ahc0: interrupting at irq 23
[ 1.000000] ahc0: aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
[ 1.000000] scsibus0 at ahc0: 16 targets, 8 luns per target
with Quantum DLT4000 drives attached
[ 4.319997] st0 at scsibus0 target 5 lun 0: <Quantum, DLT4000, CC1E> tape removable
[ 4.319997] st0: drive empty
[ 4.329997] st1 at scsibus0 target 6 lun 0: <Quantum, DLT4000, CC1E> tape removable
[ 4.329997] st1: density code 25, variable blocks, write-enabled
[ 4.329997] st1: sync (100.00ns offset 15), 8-bit (10.000MB/s) transfers
generally works fine, but an 'mt erase' leads to surprising
verbosity after a few seconds:
[...]
[ 198.050100] ahc0:SCB 0xe - timed out
[ 198.050100] Dump Card State Begins <<<<<<<<<<<<<<<<<
[ 198.050100] ahc0: Dumping Card State while idle, at SEQADDR 0x7
[ 198.050100] Card was paused
[ 198.050100] ACCUM = 0x32, SINDEX = 0x67, DINDEX = 0x27, ARG_2 = 0x3
[ 198.050100] HCNT = 0x0 SCBPTR = 0x0
[ 198.050100] SCSISIGI[0x0] ERROR[0x0] SCSIBUSL[0x0] LASTPHASE[0x1]
[ 198.050100] SCSISEQ[0x12] SBLKCTL[0x2] SCSIRATE[0x0] SEQCTL[0x10]
[ 198.050100] SEQ_FLAGS[0xc0] SSTAT0[0x5] SSTAT1[0xa] SSTAT2[0x0]
[ 198.050100] SSTAT3[0x0] SIMODE0[0x0] SIMODE1[0xa4] SXFRCTL0[0x80]
[ 198.050100] DFCNTRL[0x0] DFSTATUS[0x2d]
[ 198.050100] STACK: 0x0 0x16c 0x19c 0x3
[ 198.050100] SCB count = 16
[ 198.050100] Kernel NEXTQSCB = 15
[ 198.050100] Card NEXTQSCB = 15
[ 198.050100] QINFIFO entries:
[ 198.050100] Waiting Queue entries:
[ 198.050100] Disconnected Queue entries: 0:14
[ 198.050100] QOUTFIFO entries:
[ 198.050100] Sequencer Free SCB List: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[ 198.050100] Sequencer SCB Info:
[ 198.050100] 0 SCB_CONTROL[0x44] SCB_SCSIID[0x67]
[ 198.050100] SCB_LUN[0x0] SCB_TAG[0xe]
[ 198.050100] 1 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
[ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
[ 198.050100] 2 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
[ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
[ 198.050100] 3 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
[ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
[ 198.050100] 4 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
[ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
[ 198.050100] 5 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
[ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
[ 198.050100] 6 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
[ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
[ 198.050100] 7 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
[ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
[ 198.050100] 8 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
[ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
[ 198.050100] 9 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
[ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
[ 198.050100] 10 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
[ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
[ 198.050100] 11 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
[ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
[ 198.050100] 12 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
[ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
[ 198.050100] 13 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
[ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
[ 198.050100] 14 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
[ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
[ 198.050100] 15 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
[ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
[ 198.050100] Pending list:
[ 198.050100] 14 SCB_CONTROL[0x40] SCB_SCSIID[0x67]
[ 198.050100] SCB_LUN[0x0]
[ 198.050100] Kernel Free SCB list: 13 12 11 10 9 8 7 6 5 4 3 2 1 0
[ 198.050100] Untagged Q(6): 14
[ 198.050100]
[ 198.050100] ahc0:Queuing a BDR SCB
[ 198.050100] ahc0:Bus Device Reset Message Sent
[ 198.050100] st1(ahc0:0:6:0): ahc0: no longer in timeout, status = 0
[ 198.050100] ahc0: Bus Device Reset on A:6. 1 SCBs aborted
[...]
and no erase.
I originally encountered this on the machine's netbsd-5
installation, then upgraded since "surely this has long been
fixed". Not so, although -5 lost the scsibus as a result till
the next reboot, whereas -9 will recover, and you can repeat
the exercise.
This PowerMac G3 B/W is my only homeoffice machine with a pci
bus, but I don't think the issue is specific to macppc. This
used to be my home server for a couple of years, and it ran a
DLT4000 off a different interface just fine.
>How-To-Repeat:
Hang a DLT drive off an Adaptec 2944 (HVD), and attempt to
erase a tape.
>Fix:
Yes, please.
My single-ended drive has died, and I need to erase the tapes...
>Audit-Trail:
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Fri, 29 Jan 2021 19:06:43 +0100
Also part of the now closed bug from 2005 - PR/kern 29369.
Seems like the ahc driver did not get any love on timeout processing.
Any takers?
Frank
On 01/29/21 18:35, Hauke Fath wrote:
>> Number: 55965
>> Category: kern
>> Synopsis: st(4) on adaptec fails to 'mt erase'
>> Confidential: no
>> Severity: serious
>> Priority: medium
>> Responsible: kern-bug-people
>> State: open
>> Class: sw-bug
>> Submitter-Id: net
>> Arrival-Date: Fri Jan 29 17:35:00 +0000 2021
>> Originator: Hauke Fath
>> Release: NetBSD 9.1_STABLE
>> Organization:
> Falling Raindrops
>> Environment:
>
>
> System: NetBSD henkelmann.causeuse.org 9.1_STABLE NetBSD 9.1_STABLE (HENKELMANN) #0: Wed Jan 27 15:57:38 CET 2021 hauke@pizza.causeuse.org:/var/obj/netbsd-builds/9/macppc/sys/arch/macppc/compile/HENKELMANN macppc
> Architecture: powerpc
> Machine: macppc
>> Description:
> An Adaptec 2944 (HVD) scsi adapter
>
> [ 1.000000] ahc0 at pci1 dev 2 function 0: Adaptec 2944 Ultra SCSI adapter
> [ 1.000000] genppc_pci_intr_map: pin: 3, line: 23
> [ 1.000000] ahc0: interrupting at irq 23
> [ 1.000000] ahc0: aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
> [ 1.000000] scsibus0 at ahc0: 16 targets, 8 luns per target
>
> with Quantum DLT4000 drives attached
>
> [ 4.319997] st0 at scsibus0 target 5 lun 0: <Quantum, DLT4000, CC1E> tape removable
> [ 4.319997] st0: drive empty
> [ 4.329997] st1 at scsibus0 target 6 lun 0: <Quantum, DLT4000, CC1E> tape removable
> [ 4.329997] st1: density code 25, variable blocks, write-enabled
> [ 4.329997] st1: sync (100.00ns offset 15), 8-bit (10.000MB/s) transfers
>
> generally works fine, but an 'mt erase' leads to surprising
> verbosity after a few seconds:
>
> [...]
> [ 198.050100] ahc0:SCB 0xe - timed out
> [ 198.050100] Dump Card State Begins <<<<<<<<<<<<<<<<<
> [ 198.050100] ahc0: Dumping Card State while idle, at SEQADDR 0x7
> [ 198.050100] Card was paused
> [ 198.050100] ACCUM = 0x32, SINDEX = 0x67, DINDEX = 0x27, ARG_2 = 0x3
> [ 198.050100] HCNT = 0x0 SCBPTR = 0x0
> [ 198.050100] SCSISIGI[0x0] ERROR[0x0] SCSIBUSL[0x0] LASTPHASE[0x1]
> [ 198.050100] SCSISEQ[0x12] SBLKCTL[0x2] SCSIRATE[0x0] SEQCTL[0x10]
> [ 198.050100] SEQ_FLAGS[0xc0] SSTAT0[0x5] SSTAT1[0xa] SSTAT2[0x0]
> [ 198.050100] SSTAT3[0x0] SIMODE0[0x0] SIMODE1[0xa4] SXFRCTL0[0x80]
> [ 198.050100] DFCNTRL[0x0] DFSTATUS[0x2d]
> [ 198.050100] STACK: 0x0 0x16c 0x19c 0x3
> [ 198.050100] SCB count = 16
> [ 198.050100] Kernel NEXTQSCB = 15
> [ 198.050100] Card NEXTQSCB = 15
> [ 198.050100] QINFIFO entries:
> [ 198.050100] Waiting Queue entries:
> [ 198.050100] Disconnected Queue entries: 0:14
> [ 198.050100] QOUTFIFO entries:
> [ 198.050100] Sequencer Free SCB List: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> [ 198.050100] Sequencer SCB Info:
> [ 198.050100] 0 SCB_CONTROL[0x44] SCB_SCSIID[0x67]
> [ 198.050100] SCB_LUN[0x0] SCB_TAG[0xe]
> [ 198.050100] 1 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
> [ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
> [ 198.050100] 2 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
> [ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
> [ 198.050100] 3 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
> [ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
> [ 198.050100] 4 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
> [ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
> [ 198.050100] 5 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
> [ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
> [ 198.050100] 6 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
> [ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
> [ 198.050100] 7 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
> [ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
> [ 198.050100] 8 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
> [ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
> [ 198.050100] 9 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
> [ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
> [ 198.050100] 10 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
> [ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
> [ 198.050100] 11 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
> [ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
> [ 198.050100] 12 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
> [ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
> [ 198.050100] 13 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
> [ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
> [ 198.050100] 14 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
> [ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
> [ 198.050100] 15 SCB_CONTROL[0x0] SCB_SCSIID[0xff]
> [ 198.050100] SCB_LUN[0xff] SCB_TAG[0xff]
> [ 198.050100] Pending list:
> [ 198.050100] 14 SCB_CONTROL[0x40] SCB_SCSIID[0x67]
> [ 198.050100] SCB_LUN[0x0]
> [ 198.050100] Kernel Free SCB list: 13 12 11 10 9 8 7 6 5 4 3 2 1 0
> [ 198.050100] Untagged Q(6): 14
>
> [ 198.050100]
> [ 198.050100] ahc0:Queuing a BDR SCB
> [ 198.050100] ahc0:Bus Device Reset Message Sent
> [ 198.050100] st1(ahc0:0:6:0): ahc0: no longer in timeout, status = 0
> [ 198.050100] ahc0: Bus Device Reset on A:6. 1 SCBs aborted
> [...]
>
> and no erase.
>
> I originally encountered this on the machine's netbsd-5
> installation, then upgraded since "surely this has long been
> fixed". Not so, although -5 lost the scsibus as a result till
> the next reboot, whereas -9 will recover, and you can repeat
> the exercise.
>
> This PowerMac G3 B/W is my only homeoffice machine with a pci
> bus, but I don't think the issue is specific to macppc. This
> used to be my home server for a couple of years, and it ran a
> DLT4000 off a different interface just fine.
>
>
>> How-To-Repeat:
>
>
> Hang a DLT drive off an Adaptec 2944 (HVD), and attempt to
> erase a tape.
>
>> Fix:
>
>
> Yes, please.
>
> My single-ended drive has died, and I need to erase the tapes...
>
>> Unformatted:
>
>
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Fri, 29 Jan 2021 19:55:10 +0100
On Fri, Jan 29, 2021 at 06:10:01PM +0000, Frank Kardel wrote:
> The following reply was made to PR kern/55965; it has been noted by GNATS.
>
> From: Frank Kardel <kardel@netbsd.org>
> To: gnats-bugs@netbsd.org
> Cc:
> Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
> Date: Fri, 29 Jan 2021 19:06:43 +0100
>
> Also part of the now closed bug from 2005 - PR/kern 29369.
>
> Seems like the ahc driver did not get any love on timeout processing.
> Any takers?
Actually it seems that the timeout processing did improve, as
the bus now recovers.
What's strange here is that the timeout seems to happen way too fast
(the command should have a 3h timeout).
The first thing to check is the timeout value, and if it matches what the
driver did ask.
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Fri, 29 Jan 2021 21:48:58 +0100
--Apple-Mail=_D8C0BEDB-E426-4422-9DCD-12B2ADCD6469
Content-Type: multipart/alternative;
boundary="Apple-Mail=_DAD8B8EE-8EE8-4B0E-8C57-CB5D23EF8EB2"
--Apple-Mail=_DAD8B8EE-8EE8-4B0E-8C57-CB5D23EF8EB2
Content-Transfer-Encoding: 7bit
Content-Type: text/plain;
charset=us-ascii
> On 29. Jan 2021, at 19:55, Manuel Bouyer <bouyer@antioche.eu.org> wrote:
<snip>
> What's strange here is that the timeout seems to happen way too fast
> (the command should have a 3h timeout).
For a short erase (the default?) it is 3 min.
--
J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig
--Apple-Mail=_DAD8B8EE-8EE8-4B0E-8C57-CB5D23EF8EB2
Content-Transfer-Encoding: 7bit
Content-Type: text/html;
charset=us-ascii
<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"><blockquote type="cite" class="">On 29. Jan 2021, at 19:55, Manuel Bouyer <<a href="mailto:bouyer@antioche.eu.org" class="">bouyer@antioche.eu.org</a>> wrote:<br class=""></blockquote><snip><div class=""><blockquote type="cite" class="">What's strange here is that the timeout seems to happen way too fast</blockquote><blockquote type="cite" class="">(the command should have a 3h timeout).<br class=""></blockquote><div class=""><br class=""></div>For a short erase (the default?) it is 3 min.<br class=""><br class="">--<br class="">J. Hannken-Illjes - <a href="mailto:hannken@eis.cs.tu-bs.de" class="">hannken@eis.cs.tu-bs.de</a> - TU Braunschweig<br class=""></div></body></html>
--Apple-Mail=_DAD8B8EE-8EE8-4B0E-8C57-CB5D23EF8EB2--
--Apple-Mail=_D8C0BEDB-E426-4422-9DCD-12B2ADCD6469
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename=signature.asc
Content-Type: application/pgp-signature;
name=signature.asc
Content-Description: Message signed with OpenPGP
-----BEGIN PGP SIGNATURE-----
iQEzBAEBCAAdFiEE2BL3ha7Xao4WUZVYKoaVJdNr+uEFAmAUdLoACgkQKoaVJdNr
+uEItgf/Qhc6pUrt4Ytq7NRrKNyQT7snIF98416Yf4ghra/m4wM70P8fSWSKYlb4
RRvWciP7Zg0IHtQECBJtH92viQPPHIrl5wxYsVQi6oJUSqlOLJC3ITLr3hCUb7wJ
HgunZxfriMel4MqqslxSLJskSkdKjt2YpqrbssakjudC9xqScmthNhwTUMdKeg2C
GuDRcajMhgsGeHwZOdHAxTUHWBHTEUkVPs1GGNVuFbaLo8k+fp8uRhpjRiF6BBNc
HxW7pYV8SzCEf3biYDJve6oxVYJ85u6d1otF9z5MvdcnB4sfEWSzjAwiS4/6I5rq
o3gJ/8WA5QoYk7iQKuDRPaqzr0pMxA==
=Bt31
-----END PGP SIGNATURE-----
--Apple-Mail=_D8C0BEDB-E426-4422-9DCD-12B2ADCD6469--
From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Fri, 29 Jan 2021 22:00:34 +0100
On Fri, 29 Jan 2021 19:55:10 +0100, Manuel Bouyer wrote:
> What's strange here is that the timeout seems to happen way too fast
> (the command should have a 3h timeout).
Four hours for the full erase, three minutes for the short one (which=20
works).
The full erase errors out within less than a minute, though.
> The first thing to check is the timeout value, and if it matches what the
> driver did ask.
Where would I look? I got a mental stack overflow trying to trace the=20
scsipi call sequence from an st 'erase' ioctl...
Cheerio,
Hauke
--=20
Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
Linn=E9weg 7
64342 Seeheim-Jugenheim
Germany
From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Fri, 29 Jan 2021 22:07:23 +0100
On Fri, 29 Jan 2021 20:50:01 +0000 (UTC), J. Hannken-Illjes wrote:
> For a short erase (the default?) it is 3 min.
The default is a long erase, if I read mt.c correctly - without a=20
parameter to erase, it substitutes '1' for 'count'.
mt(1) is wrong about ignoring the 'count' parameter.
Cheerio,
Hauke
--=20
Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
Linn=E9weg 7
64342 Seeheim-Jugenheim
Germany
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Fri, 29 Jan 2021 22:43:11 +0100
On Fri, Jan 29, 2021 at 10:00:34PM +0100, Hauke Fath wrote:
> On Fri, 29 Jan 2021 19:55:10 +0100, Manuel Bouyer wrote:
> > What's strange here is that the timeout seems to happen way too fast
> > (the command should have a 3h timeout).
>
> Four hours for the full erase, three minutes for the short one (which
> works).
>
> The full erase errors out within less than a minute, though.
>
> > The first thing to check is the timeout value, and if it matches what the
> > driver did ask.
>
> Where would I look? I got a mental stack overflow trying to trace the
> scsipi call sequence from an st 'erase' ioctl...
I would start from the ahc driver.
In aic7xxx_osm.c, ahc_timeout(), you could print the value of
scb->xs->timeout. If the value looks good, the issue is probably in the ahc
driver and you'll have to find where the callout has been scheduled.
If the value is wrong, then maybe a Debugger() call (to get a stack trace)
when this specific command appears in ADAPTER_REQ_RUN_XFER can help.
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Fri, 29 Jan 2021 22:31:40 +0100
On Fri, 29 Jan 2021 21:10:01 +0000 (UTC), Hauke Fath wrote:
> The default is a long erase, if I read mt.c correctly - without a
> parameter to erase, it substitutes '1' for 'count'.
>
> mt(1) is wrong about ignoring the 'count' parameter.
See also the discussion in <http://gnats.netbsd.org/9840>
;)
From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>, gnats-bugs@NetBSD.org,
kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sat, 30 Jan 2021 13:19:22 +0100
At 22:43 Uhr +0100 29.01.2021, Manuel Bouyer wrote:
>In aic7xxx_osm.c, ahc_timeout(), you could print the value of
>scb->xs->timeout.
Hm. I get '5000', which is pretty much what a 'time mt erase' gives me...
--
"It's never straight up and down" (DEVO)
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sat, 30 Jan 2021 17:54:20 +0100
Well, from the code it should either be
in st.c:st_erase:
cmd.opcode = ERASE;
if (full) {
cmd.byte2 = SE_LONG;
tmo = ST_SPC_TIME;
} else
tmo = ST_IO_TIME;
stvar.h:57:#define ST_IO_TIME (3 * 60 * 1000) /* 3 minutes */
stvar.h:59:#define ST_SPC_TIME (4 * 60 * 60 * 1000) /* 4 hours */
180000ms or 14400000ms (or longer if the drive reported longer timeouts
via MAINTENANCE_IN/REPORT_SUPPORTED_OPCODES for
a SCSI-3 or above device and reasonably new code).
So 5000 is definitely strange. Where does this value get lost? 5 secs is
not even enough for REWIND/SPACE.
Frank
On 01/30/21 13:45, Hauke Fath wrote:
> The following reply was made to PR kern/55965; it has been noted by GNATS.
>
> From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
> To: Manuel Bouyer <bouyer@antioche.eu.org>
> Cc: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>, gnats-bugs@NetBSD.org,
> kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
> Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
> Date: Sat, 30 Jan 2021 13:19:22 +0100
>
> At 22:43 Uhr +0100 29.01.2021, Manuel Bouyer wrote:
> >In aic7xxx_osm.c, ahc_timeout(), you could print the value of
> >scb->xs->timeout.
>
> Hm. I get '5000', which is pretty much what a 'time mt erase' gives me...
>
> --
> "It's never straight up and down" (DEVO)
>
>
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sat, 30 Jan 2021 18:13:44 +0100
On Sat, Jan 30, 2021 at 01:19:22PM +0100, Hauke Fath wrote:
> At 22:43 Uhr +0100 29.01.2021, Manuel Bouyer wrote:
> >In aic7xxx_osm.c, ahc_timeout(), you could print the value of
> >scb->xs->timeout.
>
> Hm. I get '5000', which is pretty much what a 'time mt erase' gives me...
this should be in milliseconds. So it's 5s ....
Now to find where the value comes from.
You could print the first few bytes of xs->cmd to make sure it's a ERASE
command. Then add prints from ADAPTER_REQ_RUN_XFER to check the timeout
value when this command is sent to the adapter.
Eventuelly add a Debugger() call and get a stack trace to see the code
path that gets there and find where the timeout value is altered.
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sat, 30 Jan 2021 21:16:19 +0100
On Sat, 30 Jan 2021 18:13:44 +0100, Manuel Bouyer wrote:
> You could print the first few bytes of xs->cmd to make sure it's a ERASE
> command.
It's 0x1e - SCSI_PREVENT_ALLOW_MEDIUM_REMOVAL, not ERASE.
That shouldn't take seconds?
From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sat, 30 Jan 2021 21:18:54 +0100
On Sat, 30 Jan 2021 21:16:19 +0100, Hauke Fath wrote:
> On Sat, 30 Jan 2021 18:13:44 +0100, Manuel Bouyer wrote:
>> You could print the first few bytes of xs->cmd to make sure it's a ERASE
>> command.
>
> It's 0x1e - SCSI_PREVENT_ALLOW_MEDIUM_REMOVAL, not ERASE.
... which indeed is being issued with a timeout of 5000 msec, so that's
explained?
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sat, 30 Jan 2021 21:26:23 +0100
On Sat, Jan 30, 2021 at 09:18:54PM +0100, Hauke Fath wrote:
> On Sat, 30 Jan 2021 21:16:19 +0100, Hauke Fath wrote:
> > On Sat, 30 Jan 2021 18:13:44 +0100, Manuel Bouyer wrote:
> >> You could print the first few bytes of xs->cmd to make sure it's a ERASE
> >> command.
> >
> > It's 0x1e - SCSI_PREVENT_ALLOW_MEDIUM_REMOVAL, not ERASE.
>
> ... which indeed is being issued with a timeout of 5000 msec, so that's
> explained?
5s sounds enough. But maybe try to bump the timeout to see what happens ?
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>,
Manuel Bouyer <bouyer@antioche.eu.org>
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sat, 30 Jan 2021 22:22:48 +0100
At 20:30 Uhr +0000 30.01.2021, Manuel Bouyer wrote:
> > > It's 0x1e - SCSI_PREVENT_ALLOW_MEDIUM_REMOVAL, not ERASE.
> >
> > ... which indeed is being issued with a timeout of 5000 msec, so that's
> > explained?
>
> 5s sounds enough. But maybe try to bump the timeout to see what happens ?
Funky... With timeout bumped to 60s, an 'mt erase' returns after 35s, which
the drive audibly spent mounting the tape. Then - nothing, no tape running
for erase.
An 'mt status' then leads to a
[ 1433.8827963] st0(ahc0:0:5:0): DEFERRED ERROR, key = 0x3
[ 1433.8827963] st0(ahc0:0:5:0): Check Condition on CDB: 0x00 00 00 00 00 00
[ 1433.9128026] SENSE KEY: Media Error
[ 1433.9128026] ASC/ASCQ: ASC 0x80 ASCQ 0x01
[ 1433.9128026] st0: mount error (sense key=3) - terminating mount session
"Rinse, lather repeat."
--
"It's never straight up and down" (DEVO)
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sat, 30 Jan 2021 21:30:01 -0000 (UTC)
kardel@netbsd.org (Frank Kardel) writes:
> So 5000 is definitely strange. Where does this value get lost? 5 secs is
> not even enough for REWIND/SPACE.
It's the timeout for PREVENT/ALLOW.
--
--
Michael van Elst
Internet: mlelstv@serpens.de
"A potential Snark may lurk in every tree."
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sat, 30 Jan 2021 22:39:11 +0100
On Sat, Jan 30, 2021 at 10:22:48PM +0100, Hauke Fath wrote:
> At 20:30 Uhr +0000 30.01.2021, Manuel Bouyer wrote:
> > > > It's 0x1e - SCSI_PREVENT_ALLOW_MEDIUM_REMOVAL, not ERASE.
> > >
> > > ... which indeed is being issued with a timeout of 5000 msec, so that's
> > > explained?
> >
> > 5s sounds enough. But maybe try to bump the timeout to see what happens ?
>
> Funky... With timeout bumped to 60s, an 'mt erase' returns after 35s, which
And did you get the timeout message from the kernel ?
> the drive audibly spent mounting the tape. Then - nothing, no tape running
> for erase.
>
> An 'mt status' then leads to a
>
> [ 1433.8827963] st0(ahc0:0:5:0): DEFERRED ERROR, key = 0x3
> [ 1433.8827963] st0(ahc0:0:5:0): Check Condition on CDB: 0x00 00 00 00 00 00
> [ 1433.9128026] SENSE KEY: Media Error
> [ 1433.9128026] ASC/ASCQ: ASC 0x80 ASCQ 0x01
> [ 1433.9128026] st0: mount error (sense key=3) - terminating mount session
that looks like a red flag; did you try with another tape ?
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>, gnats-bugs@netbsd.org,
kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sat, 30 Jan 2021 23:30:27 +0100
At 22:39 Uhr +0100 30.01.2021, Manuel Bouyer wrote:
>> Funky... With timeout bumped to 60s, an 'mt erase' returns after 35s, which
>
>And did you get the timeout message from the kernel ?
No, not that time... But the next command (the 'mt status') then errored
out like below.
>> the drive audibly spent mounting the tape. Then - nothing, no tape running
>> for erase.
>>
>> An 'mt status' then leads to a
>>
>> [ 1433.8827963] st0(ahc0:0:5:0): DEFERRED ERROR, key = 0x3
>> [ 1433.8827963] st0(ahc0:0:5:0): Check Condition on CDB: 0x00 00 00 00
>>00 00
>> [ 1433.9128026] SENSE KEY: Media Error
>> [ 1433.9128026] ASC/ASCQ: ASC 0x80 ASCQ 0x01
>> [ 1433.9128026] st0: mount error (sense key=3) - terminating mount session
>
>that looks like a red flag; did you try with another tape ?
I did, and also with the other drive in the case. I can happily dd(1) data
to and from the tape, though, and suspect the above might be the drive
getting confused about the aborted erase.
Next, I went all out, and increased the timeout in scsipi_prevent() to 4
hours (four). Now the drive is happily running the tape, presumably erasing
away.
That's a workaround. But mt is hanging there, whereas st.c says erase is an
asynchronous operation? And why would the operation be governed by the
scsipi_prevent() timeout? Isn't that a bit overly generic, when it has to
accommodate any and all devices?
Cheerio,
Hauke
--
"It's never straight up and down" (DEVO)
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sat, 30 Jan 2021 23:43:50 +0100
On Sat, Jan 30, 2021 at 11:30:27PM +0100, Hauke Fath wrote:
> At 22:39 Uhr +0100 30.01.2021, Manuel Bouyer wrote:
> >> Funky... With timeout bumped to 60s, an 'mt erase' returns after 35s, which
> >
> >And did you get the timeout message from the kernel ?
>
> No, not that time... But the next command (the 'mt status') then errored
> out like below.
>
> >> the drive audibly spent mounting the tape. Then - nothing, no tape running
> >> for erase.
> >>
> >> An 'mt status' then leads to a
> >>
> >> [ 1433.8827963] st0(ahc0:0:5:0): DEFERRED ERROR, key = 0x3
> >> [ 1433.8827963] st0(ahc0:0:5:0): Check Condition on CDB: 0x00 00 00 00
> >>00 00
> >> [ 1433.9128026] SENSE KEY: Media Error
> >> [ 1433.9128026] ASC/ASCQ: ASC 0x80 ASCQ 0x01
> >> [ 1433.9128026] st0: mount error (sense key=3) - terminating mount session
> >
> >that looks like a red flag; did you try with another tape ?
>
> I did, and also with the other drive in the case. I can happily dd(1) data
> to and from the tape, though, and suspect the above might be the drive
> getting confused about the aborted erase.
>
> Next, I went all out, and increased the timeout in scsipi_prevent() to 4
> hours (four). Now the drive is happily running the tape, presumably erasing
> away.
>
> That's a workaround. But mt is hanging there, whereas st.c says erase is an
> asynchronous operation? And why would the operation be governed by the
> scsipi_prevent() timeout? Isn't that a bit overly generic, when it has to
> accommodate any and all devices?
Hum. I've not looked at the code but maybe the erase is started and
the ioctl returns. Then mt exists, causing the device from being
closed. This causes scsipi_prevent() from being called, and this
command waits for the erase to complete.
Baybe you should use /dev/ernst0 instead of rst0 ?
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sun, 31 Jan 2021 00:21:55 +0100
At 23:43 Uhr +0100 30.01.2021, Manuel Bouyer wrote:
>Hum. I've not looked at the code but maybe the erase is started and
>the ioctl returns. Then mt exists,
"exits", you mean? No, it doesn't, that is what I am puzzled about.
Naively, I would have expected mt to return, and have a way of checking on
the tape if I am curious.
mt sitting there patiently until the drive is finished with the erase is
synchronous behaviour in my book.
> causing the device from being
>closed. This causes scsipi_prevent() from being called, and this
>command waits for the erase to complete.
>
>Baybe you should use /dev/ernst0 instead of rst0 ?
You mean "enrst0", AKA control mode? Since I have a tape in the drive to be
erased, what should that buy me?
I traditionally use nrst?, which is fine as long as the drive has a tape to
play...
Cheerio,
Hauke
--
"It's never straight up and down" (DEVO)
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Hauke Fath <hauke@Espresso.Rhein-Neckar.DE>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sun, 31 Jan 2021 09:48:53 +0100
On Sun, Jan 31, 2021 at 12:21:55AM +0100, Hauke Fath wrote:
> At 23:43 Uhr +0100 30.01.2021, Manuel Bouyer wrote:
> >Hum. I've not looked at the code but maybe the erase is started and
> >the ioctl returns. Then mt exists,
>
> "exits", you mean?
Yes.
> No, it doesn't, that is what I am puzzled about.
>
> Naively, I would have expected mt to return, and have a way of checking on
> the tape if I am curious.
>
> mt sitting there patiently until the drive is finished with the erase is
> synchronous behaviour in my book.
This is because closing the device cause the call to scsipi_prevent(),
which waits for the erase to complete.
exit() does an implicit close, if mt doens't do it itself.
>
> > causing the device from being
> >closed. This causes scsipi_prevent() from being called, and this
> >command waits for the erase to complete.
> >
> >Baybe you should use /dev/ernst0 instead of rst0 ?
>
> You mean "enrst0", AKA control mode? Since I have a tape in the drive to be
> erased, what should that buy me?
Avoid the call to scsipi_prevent() on close.
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/55965: st(4) on adaptec fails to 'mt erase'
Date: Sun, 31 Jan 2021 09:14:52 -0000 (UTC)
hauke@Espresso.Rhein-Neckar.DE (Hauke Fath) writes:
> That's a workaround. But mt is hanging there, whereas st.c says erase is an
> asynchronous operation? And why would the operation be governed by the
> scsipi_prevent() timeout?
The scsipi_prevent() is called on close to release the drive and I suspect
the asynchronous erase just blocks the drive from executing and finishing
the prevent/allow command.
Now, that should only present an error (and keep the drive locked),
but the erase should complete somewhen.
But if the driver tries a bus reset in the timeout handling which also
resets the drive, this would abort the erase.
--
--
Michael van Elst
Internet: mlelstv@serpens.de
"A potential Snark may lurk in every tree."
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.