NetBSD Problem Report #9857
Received: (qmail 24033 invoked from network); 10 Apr 2000 23:02:05 -0000
Message-Id: <200004102302.TAA01622@snorklewacker.mit.edu>
Date: Mon, 10 Apr 2000 19:02:03 -0400 (EDT)
From: jhawk@MIT.EDU
Reply-To: jhawk@MIT.EDU
To: gnats-bugs@gnats.netbsd.org
Subject: wddone() omits block numbers from soft errors
X-Send-Pr-Version: 3.95
>Number: 9857
>Category: kern
>Synopsis: wddone() omits block numbers from soft errors
>Confidential: no
>Severity: non-critical
>Priority: medium
>Responsible: bouyer
>State: analyzed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Apr 10 23:03:00 +0000 2000
>Closed-Date:
>Last-Modified: Mon Apr 07 16:15:28 +0000 2003
>Originator: John Hawkinson
>Release: 1.4.2
>Organization:
>Environment:
>Description:
wddone() omits block numbers from soft errors. For hard errors,
diskerr() is called and a block number is reported. For soft errors,
a simple printf() happens and the block number is not included.
I presume that blkdone is also available in the NOERROR case.
>How-To-Repeat:
Inspect the code:
case ERROR:
/* Don't care about media change bits */
if (wd->sc_wdc_bio.r_error != 0 &&
(wd->sc_wdc_bio.r_error & ~(WDCE_MC | WDCE_MCR)) == 0)
goto noerror;
ata_perror(wd->drvp, wd->sc_wdc_bio.r_error, errbuf);
retry: /* Just reset and retry. Can we do more ? */
wdc_reset_channel(wd->drvp);
diskerr(bp, "wd", errbuf, LOG_PRINTF,
wd->sc_wdc_bio.blkdone, wd->sc_dk.dk_label);
if (wd->retries++ < WDIORETRIES) {
printf(", retrying\n");
timeout(wdrestart, wd, RECOVERYTIME);
return;
}
printf("\n");
bp->b_flags |= B_ERROR;
bp->b_error = EIO;
break;
case NOERROR:
noerror: if ((wd->sc_wdc_bio.flags & ATA_CORR) || wd->retries > 0)
printf("%s: soft error (corrected)\n",
wd->sc_dev.dv_xname);
}
disk_unbusy(&wd->sc_dk, (bp->b_bcount - bp->b_resid));
>Fix:
Presumably diskerr() should be called for the NOERR case, as well.
Especially if someone gets around to modifying diskerr() to centrally collect
statistics on disk errors (ala iostat -E under Solaris). I can't help but
wondering if there is some reason it wasn't done this way?
>Release-Note:
>Audit-Trail:
From: Manuel Bouyer <bouyer@antioche.lip6.fr>
To: jhawk@MIT.EDU
Cc: gnats-bugs@gnats.netbsd.org
Subject: Re: kern/9857: wddone() omits block numbers from soft errors
Date: Tue, 11 Apr 2000 16:35:35 +0200
On Mon, Apr 10, 2000 at 07:02:03PM -0400, jhawk@MIT.EDU wrote:
> >Description:
> wddone() omits block numbers from soft errors. For hard errors,
> diskerr() is called and a block number is reported. For soft errors,
> a simple printf() happens and the block number is not included.
> I presume that blkdone is also available in the NOERROR case.
A soft error is a hard error which has been corrected by a retry.
So when we get into the 'soft error...' printf, the transfer is complete
and blkdone doesn't point any more on the block that caused the error.
Anyway, if we have a soft error it's because wd->retries has been incremented;
the only place where it's incremented is in wd.c:wddone, line 514 (just
after the retry: label), the previous line is a call to diskerr().
So the error which caused the soft error has already been logged when
'soft error' is printed.
--
Manuel Bouyer, LIP6, Universite Paris VI. Manuel.Bouyer@lip6.fr
--
State-Changed-From-To: open->closed
State-Changed-By: bouyer
State-Changed-When: Tue Apr 11 07:37:42 PDT 2000
State-Changed-Why:
The block number *is logged, before the xfer is retried. See Audit-trail
for details.
From: John Hawkinson <jhawk@MIT.EDU>
To: Manuel Bouyer <bouyer@antioche.lip6.fr>
Cc: gnats-bugs@gnats.netbsd.org
Subject: Re: kern/9857: wddone() omits block numbers from soft errors
Date: Tue, 11 Apr 2000 10:42:25 -0400
>A soft error is a hard error which has been corrected by a retry.
Ah. I suppose that should have been clear if I'd thought about it
a bit more carefully.
>So when we get into the 'soft error...' printf, the transfer is complete
>and blkdone doesn't point any more on the block that caused the error.
>Anyway, if we have a soft error it's because wd->retries has been incremented;
>the only place where it's incremented is in wd.c:wddone, line 514 (just
>after the retry: label), the previous line is a call to diskerr().
>So the error which caused the soft error has already been logged when
>'soft error' is printed.
OK. I suppose I would worry about logs getting out of sequence,
but that's not a terribly severe concern, and perhaps is not
even relevent.
Thanks!
--jhawk
(who is curious what you'll say on the hard problem, i.e. kern/9856; looks
like the laptop is going to get sent back to Sony this week so if you
want any tests run against it, you should probably request them soonish)
From: John Hawkinson <jhawk@MIT.EDU>
To: Manuel Bouyer <bouyer@antioche.lip6.fr>
Cc: gnats-bugs@gnats.netbsd.org
Subject: Re: kern/9857: wddone() omits block numbers from soft errors
Date: Tue, 11 Apr 2000 11:00:10 -0400
In message <200004111442.KAA03973@mary-kay-commandos.mit.edu>, John Hawkinson w
rites:
>>A soft error is a hard error which has been corrected by a retry.
>
>Ah. I suppose that should have been clear if I'd thought about it
>a bit more carefully.
Hmm. Thinking about this once more, I don't think it is clear enough.
Either the printf() should be made more clear, or the documentation
should be updated. I would sort of favor the first, but I suppose
an argument can be made for ata(4), especially since it wouldn't
require futzing with the code to save blkdone.
What do you think?
--jhawk
From: Manuel Bouyer <bouyer@antioche.lip6.fr>
To: John Hawkinson <jhawk@MIT.EDU>
Cc: gnats-bugs@gnats.netbsd.org
Subject: Re: kern/9857: wddone() omits block numbers from soft errors
Date: Tue, 11 Apr 2000 17:07:58 +0200
On Tue, Apr 11, 2000 at 11:00:10AM -0400, John Hawkinson wrote:
> In message <200004111442.KAA03973@mary-kay-commandos.mit.edu>, John Hawkinson w
> rites:
> >>A soft error is a hard error which has been corrected by a retry.
> >
> >Ah. I suppose that should have been clear if I'd thought about it
> >a bit more carefully.
>
> Hmm. Thinking about this once more, I don't think it is clear enough.
> Either the printf() should be made more clear, or the documentation
> should be updated. I would sort of favor the first, but I suppose
> an argument can be made for ata(4), especially since it wouldn't
> require futzing with the code to save blkdone.
>
> What do you think?
I think that the 'soft error' printf should just go away. This is just
redondant with the message saying 'retrying'. If the kernel says it's
retrying and we don't hear anything after that it's implicit that the
retry worked, isn't it ?
--
Manuel Bouyer, LIP6, Universite Paris VI. Manuel.Bouyer@lip6.fr
--
From: John Hawkinson <jhawk@MIT.EDU>
To: Manuel Bouyer <bouyer@antioche.lip6.fr>
Cc: gnats-bugs@gnats.netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/9857: wddone() omits block numbers from soft errors
Date: Tue, 11 Apr 2000 11:29:41 -0400
[ added netbsd-bugs, left some quoting in because of it ]
>> >>A soft error is a hard error which has been corrected by a retry.
>> Hmm. Thinking about this once more, I don't think it is clear enough.
>> Either the printf() should be made more clear, or the documentation
>> should be updated. I would sort of favor the first, but I suppose
>> an argument can be made for ata(4), especially since it wouldn't
>> require futzing with the code to save blkdone.
>>
>> What do you think?
>
>I think that the 'soft error' printf should just go away. This is just
>redondant with the message saying 'retrying'. If the kernel says it's
>retrying and we don't hear anything after that it's implicit that the
>retry worked, isn't it ?
I don't think this sufficient by any means. Here's an example of
some stuff lying around in my message buffer ;-):
wd0e: uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
wd0e: uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
wd0e: uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
wd0e: uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
wd0e: uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
wd0e: uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49)
wd0e: uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
wd0e: uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
wd0e: uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
wd0e: uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
wd0e: uncorrectable data error reading fsbn 7509894 of 7509824-7509903 (wd0 bn 12954039; cn 13707 tn 14 sn 42), retrying
wd0e: uncorrectable data error reading fsbn 7509894 of 7509824-7509903 (wd0 bn 12954039; cn 13707 tn 14 sn 42)
wd0e: uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
wd0e: uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
wd0e: uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
wd0e: uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
wd0: soft error (corrected)
wd0e: uncorrectable data error reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
wd0e: uncorrectable data error reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
wd0e: (obsolete) reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
wd0e: uncorrectable data error reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
wd0e: uncorrectable data error reading fsbn 7510134 of 7510128-7510143 (wd0 bn 12954279; cn 13708 tn 3 sn 30), retrying
wd0e: (obsolete) reading fsbn 7510134 of 7510128-7510143 (wd0 bn 12954279; cn 13708 tn 3 sn 30)
wd0e: uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
wd0e: uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
wd0e: uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
wd0e: uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
wd0e: uncorrectable data error reading fsbn 7510374 of 7510368-7510415 (wd0 bn 12954519; cn 13708 tn 7 sn 18), retrying
wd0e: uncorrectable data error reading fsbn 7510374 of 7510368-7510415 (wd0 bn 12954519; cn 13708 tn 7 sn 18)
Now, the absence of the "soft error" output there would make block
7509888 indistinguishable from 7490056 or 7510128.
Also, I think it would be nice if the wd driver kept stats on these
errors -- but this part of an overall stats desire that's probably
better addressed seperately.
--jhawk
From: Manuel Bouyer <bouyer@antioche.lip6.fr>
To: John Hawkinson <jhawk@MIT.EDU>
Cc: gnats-bugs@gnats.netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/9857: wddone() omits block numbers from soft errors
Date: Tue, 11 Apr 2000 17:48:28 +0200
On Tue, Apr 11, 2000 at 11:29:41AM -0400, John Hawkinson wrote:
> I don't think this sufficient by any means. Here's an example of
> some stuff lying around in my message buffer ;-):
>
> wd0e: uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
> wd0e: uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
> wd0e: uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
> wd0e: uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
> wd0e: uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
> wd0e: uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49)
> wd0e: uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
> wd0e: uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
> wd0e: uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
> wd0e: uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
> wd0e: uncorrectable data error reading fsbn 7509894 of 7509824-7509903 (wd0 bn 12954039; cn 13707 tn 14 sn 42), retrying
> wd0e: uncorrectable data error reading fsbn 7509894 of 7509824-7509903 (wd0 bn 12954039; cn 13707 tn 14 sn 42)
> wd0e: uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
> wd0e: uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
> wd0e: uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
> wd0e: uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
> wd0: soft error (corrected)
> wd0e: uncorrectable data error reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
> wd0e: uncorrectable data error reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
> wd0e: (obsolete) reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
> wd0e: uncorrectable data error reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
> wd0e: uncorrectable data error reading fsbn 7510134 of 7510128-7510143 (wd0 bn 12954279; cn 13708 tn 3 sn 30), retrying
> wd0e: (obsolete) reading fsbn 7510134 of 7510128-7510143 (wd0 bn 12954279; cn 13708 tn 3 sn 30)
> wd0e: uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
> wd0e: uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
> wd0e: uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
> wd0e: uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
> wd0e: uncorrectable data error reading fsbn 7510374 of 7510368-7510415 (wd0 bn 12954519; cn 13708 tn 7 sn 18), retrying
> wd0e: uncorrectable data error reading fsbn 7510374 of 7510368-7510415 (wd0 bn 12954519; cn 13708 tn 7 sn 18)
>
> Now, the absence of the "soft error" output there would make block
> 7509888 indistinguishable from 7490056 or 7510128.
Ok, so the text should maybe be changed ('retry succeeded', or something
like this). But I don't think it's worth repeating the block number.
>
> Also, I think it would be nice if the wd driver kept stats on these
> errors -- but this part of an overall stats desire that's probably
> better addressed seperately.
Yes. diconnect/reselect and tagged command queueing would make this even
more attractive.
--
Manuel Bouyer, LIP6, Universite Paris VI. Manuel.Bouyer@lip6.fr
--
From: John Hawkinson <jhawk@MIT.EDU>
To: Manuel Bouyer <bouyer@antioche.lip6.fr>
Cc: gnats-bugs@gnats.netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/9857: wddone() omits block numbers from soft errors
Date: Tue, 11 Apr 2000 12:02:54 -0400
>> wd0e: uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0
> bn 12954033; cn 13707 tn 14 sn 36), retrying
>> wd0e: uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0
> bn 12954033; cn 13707 tn 14 sn 36), retrying
>> wd0: soft error (corrected)
>> Now, the absence of the "soft error" output there would make block
>> 7509888 indistinguishable from 7490056 or 7510128.
>
>Ok, so the text should maybe be changed ('retry succeeded', or something
>like this).
Yes, that sounds good to me, it would be more clear and would save
a documentation update.
> But I don't think it's worth repeating the block number.
If you can convince me that they will always be in-order, than
that seems reasonable.
>Yes. diconnect/reselect and tagged command queueing would make this even
>more attractive.
IDE has tagged command queueing? If so, might there not be a problem
with two bad blocks being read simultaneously and not knowing
which one is associated with the successful retry?
--jhawk
State-Changed-From-To: closed->feedback
State-Changed-By: jhawk
State-Changed-When: Thu Apr 13 13:15:45 PDT 2000
State-Changed-Why:
Some dispute about the right fix, but the right fix is not no fix,
and no fix has yet been applied.
State-Changed-From-To: feedback->analyzed
State-Changed-By: jhawk
State-Changed-When: Thu Jul 6 12:45:45 PDT 2000
State-Changed-Why:
feedback implies this is waiting on me, which it is not.
Responsible-Changed-From-To: kern-bug-people->bouyer
Responsible-Changed-By: bouyer
Responsible-Changed-When: Mon Apr 7 09:14:52 PDT 2003
Responsible-Changed-Why:
I look at this, the idea isn't that bad.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.