NetBSD Problem Report #9857

Received: (qmail 24033 invoked from network); 10 Apr 2000 23:02:05 -0000
Message-Id: <200004102302.TAA01622@snorklewacker.mit.edu>
Date: Mon, 10 Apr 2000 19:02:03 -0400 (EDT)
From: jhawk@MIT.EDU
Reply-To: jhawk@MIT.EDU
To: gnats-bugs@gnats.netbsd.org
Subject: wddone() omits block numbers from soft errors
X-Send-Pr-Version: 3.95

>Number:         9857
>Category:       kern
>Synopsis:       wddone() omits block numbers from soft errors
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    bouyer
>State:          analyzed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Apr 10 23:03:00 +0000 2000
>Closed-Date:    
>Last-Modified:  Mon Apr 07 16:15:28 +0000 2003
>Originator:     John Hawkinson
>Release:        1.4.2
>Organization:

>Environment:


>Description:
	wddone() omits block numbers from soft errors. For hard errors,
diskerr() is called and a block number is reported. For soft errors,
a simple printf() happens and the block number is not included.
I presume that blkdone is also available in the NOERROR case.

>How-To-Repeat:
Inspect the code:

        case ERROR:
                /* Don't care about media change bits */
                if (wd->sc_wdc_bio.r_error != 0 &&
                    (wd->sc_wdc_bio.r_error & ~(WDCE_MC | WDCE_MCR)) == 0)
                        goto noerror;
                ata_perror(wd->drvp, wd->sc_wdc_bio.r_error, errbuf);
retry:          /* Just reset and retry. Can we do more ? */
                wdc_reset_channel(wd->drvp);
                diskerr(bp, "wd", errbuf, LOG_PRINTF,
                    wd->sc_wdc_bio.blkdone, wd->sc_dk.dk_label);
                if (wd->retries++ < WDIORETRIES) {
                        printf(", retrying\n");
                        timeout(wdrestart, wd, RECOVERYTIME);
                        return;
                }
                printf("\n");
                bp->b_flags |= B_ERROR;
                bp->b_error = EIO;
                break;
        case NOERROR:
noerror:        if ((wd->sc_wdc_bio.flags & ATA_CORR) || wd->retries > 0)
                        printf("%s: soft error (corrected)\n",
                            wd->sc_dev.dv_xname);
        }
        disk_unbusy(&wd->sc_dk, (bp->b_bcount - bp->b_resid));

>Fix:
	Presumably diskerr() should be called for the NOERR case, as well.
Especially if someone gets around to modifying diskerr() to centrally collect
statistics on disk errors (ala iostat -E under Solaris). I can't help but
wondering if there is some reason it wasn't done this way?
>Release-Note:
>Audit-Trail:

From: Manuel Bouyer <bouyer@antioche.lip6.fr>
To: jhawk@MIT.EDU
Cc: gnats-bugs@gnats.netbsd.org
Subject: Re: kern/9857: wddone() omits block numbers from soft errors
Date: Tue, 11 Apr 2000 16:35:35 +0200

 On Mon, Apr 10, 2000 at 07:02:03PM -0400, jhawk@MIT.EDU wrote:
 > >Description:
 > 	wddone() omits block numbers from soft errors. For hard errors,
 > diskerr() is called and a block number is reported. For soft errors,
 > a simple printf() happens and the block number is not included.
 > I presume that blkdone is also available in the NOERROR case.

 A soft error is a hard error which has been corrected by a retry.
 So when we get into the 'soft error...' printf, the transfer is complete
 and blkdone doesn't point any more on the block that caused the error.
 Anyway, if we have a soft error it's because wd->retries has been incremented;
 the only place where it's incremented is in wd.c:wddone, line 514 (just
 after the retry: label), the previous line is a call to diskerr().
 So the error which caused the soft error has already been logged when
 'soft error' is printed.

 --
 Manuel Bouyer, LIP6, Universite Paris VI.           Manuel.Bouyer@lip6.fr
 --
State-Changed-From-To: open->closed 
State-Changed-By: bouyer 
State-Changed-When: Tue Apr 11 07:37:42 PDT 2000 
State-Changed-Why:  
The block number *is logged, before the xfer is retried. See Audit-trail 
for details. 

From: John Hawkinson <jhawk@MIT.EDU>
To: Manuel Bouyer <bouyer@antioche.lip6.fr>
Cc: gnats-bugs@gnats.netbsd.org
Subject: Re: kern/9857: wddone() omits block numbers from soft errors 
Date: Tue, 11 Apr 2000 10:42:25 -0400

 >A soft error is a hard error which has been corrected by a retry.

 Ah. I suppose that should have been clear if I'd thought about it
 a bit more carefully.

 >So when we get into the 'soft error...' printf, the transfer is complete
 >and blkdone doesn't point any more on the block that caused the error.

 >Anyway, if we have a soft error it's because wd->retries has been incremented;
 >the only place where it's incremented is in wd.c:wddone, line 514 (just
 >after the retry: label), the previous line is a call to diskerr().
 >So the error which caused the soft error has already been logged when
 >'soft error' is printed.

 OK. I suppose I would worry about logs getting out of sequence,
 but that's not a terribly severe concern, and perhaps is not
 even relevent.

 Thanks!

 --jhawk
   (who is curious what you'll say on the hard problem, i.e. kern/9856; looks
    like the laptop is going to get sent back to Sony this week so if you
    want any tests run against it, you should probably request them soonish)

From: John Hawkinson <jhawk@MIT.EDU>
To: Manuel Bouyer <bouyer@antioche.lip6.fr>
Cc: gnats-bugs@gnats.netbsd.org
Subject: Re: kern/9857: wddone() omits block numbers from soft errors 
Date: Tue, 11 Apr 2000 11:00:10 -0400

 In message <200004111442.KAA03973@mary-kay-commandos.mit.edu>, John Hawkinson w
 rites:
 >>A soft error is a hard error which has been corrected by a retry.
 >
 >Ah. I suppose that should have been clear if I'd thought about it
 >a bit more carefully.

 Hmm. Thinking about this once more, I don't think it is clear enough.
 Either the printf() should be made more clear, or the documentation
 should be updated. I would sort of favor the first, but I suppose
 an argument can be made for ata(4), especially since it wouldn't
 require futzing with the code to save blkdone.

 What do you think?

 --jhawk

From: Manuel Bouyer <bouyer@antioche.lip6.fr>
To: John Hawkinson <jhawk@MIT.EDU>
Cc: gnats-bugs@gnats.netbsd.org
Subject: Re: kern/9857: wddone() omits block numbers from soft errors
Date: Tue, 11 Apr 2000 17:07:58 +0200

 On Tue, Apr 11, 2000 at 11:00:10AM -0400, John Hawkinson wrote:
 > In message <200004111442.KAA03973@mary-kay-commandos.mit.edu>, John Hawkinson w
 > rites:
 > >>A soft error is a hard error which has been corrected by a retry.
 > >
 > >Ah. I suppose that should have been clear if I'd thought about it
 > >a bit more carefully.
 > 
 > Hmm. Thinking about this once more, I don't think it is clear enough.
 > Either the printf() should be made more clear, or the documentation
 > should be updated. I would sort of favor the first, but I suppose
 > an argument can be made for ata(4), especially since it wouldn't
 > require futzing with the code to save blkdone.
 > 
 > What do you think?

 I think that the 'soft error' printf should just go away. This is just
 redondant with the message saying 'retrying'. If the kernel says it's
 retrying and we don't hear anything after that it's implicit that the
 retry worked, isn't it ?

 --
 Manuel Bouyer, LIP6, Universite Paris VI.           Manuel.Bouyer@lip6.fr
 --

From: John Hawkinson <jhawk@MIT.EDU>
To: Manuel Bouyer <bouyer@antioche.lip6.fr>
Cc: gnats-bugs@gnats.netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/9857: wddone() omits block numbers from soft errors 
Date: Tue, 11 Apr 2000 11:29:41 -0400

 [ added netbsd-bugs, left some quoting in because of it ]

 >> >>A soft error is a hard error which has been corrected by a retry.

 >> Hmm. Thinking about this once more, I don't think it is clear enough.
 >> Either the printf() should be made more clear, or the documentation
 >> should be updated. I would sort of favor the first, but I suppose
 >> an argument can be made for ata(4), especially since it wouldn't
 >> require futzing with the code to save blkdone.
 >> 
 >> What do you think?
 >
 >I think that the 'soft error' printf should just go away. This is just
 >redondant with the message saying 'retrying'. If the kernel says it's
 >retrying and we don't hear anything after that it's implicit that the
 >retry worked, isn't it ?

 I don't think this sufficient by any means. Here's an example of
 some stuff lying around in my message buffer ;-):

 wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
 wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
 wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
 wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
 wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
 wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49)
 wd0e:  uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
 wd0e:  uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
 wd0e:  uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
 wd0e:  uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
 wd0e:  uncorrectable data error reading fsbn 7509894 of 7509824-7509903 (wd0 bn 12954039; cn 13707 tn 14 sn 42), retrying
 wd0e:  uncorrectable data error reading fsbn 7509894 of 7509824-7509903 (wd0 bn 12954039; cn 13707 tn 14 sn 42)
 wd0e:  uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
 wd0e:  uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
 wd0e:  uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
 wd0e:  uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
 wd0: soft error (corrected)
 wd0e:  uncorrectable data error reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
 wd0e:  uncorrectable data error reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
 wd0e:  (obsolete) reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
 wd0e:  uncorrectable data error reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
 wd0e:  uncorrectable data error reading fsbn 7510134 of 7510128-7510143 (wd0 bn 12954279; cn 13708 tn 3 sn 30), retrying
 wd0e:  (obsolete) reading fsbn 7510134 of 7510128-7510143 (wd0 bn 12954279; cn 13708 tn 3 sn 30)
 wd0e:  uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
 wd0e:  uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
 wd0e:  uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
 wd0e:  uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
 wd0e:  uncorrectable data error reading fsbn 7510374 of 7510368-7510415 (wd0 bn 12954519; cn 13708 tn 7 sn 18), retrying
 wd0e:  uncorrectable data error reading fsbn 7510374 of 7510368-7510415 (wd0 bn 12954519; cn 13708 tn 7 sn 18)

 Now, the absence of the "soft error" output there would make block
 7509888 indistinguishable from 7490056 or 7510128.

 Also, I think it would be nice if the wd driver kept stats on these
 errors -- but this part of an overall stats desire that's probably
 better addressed seperately.

 --jhawk

From: Manuel Bouyer <bouyer@antioche.lip6.fr>
To: John Hawkinson <jhawk@MIT.EDU>
Cc: gnats-bugs@gnats.netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/9857: wddone() omits block numbers from soft errors
Date: Tue, 11 Apr 2000 17:48:28 +0200

 On Tue, Apr 11, 2000 at 11:29:41AM -0400, John Hawkinson wrote:
 > I don't think this sufficient by any means. Here's an example of
 > some stuff lying around in my message buffer ;-):
 > 
 > wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
 > wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
 > wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
 > wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
 > wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
 > wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49)
 > wd0e:  uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
 > wd0e:  uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
 > wd0e:  uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
 > wd0e:  uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
 > wd0e:  uncorrectable data error reading fsbn 7509894 of 7509824-7509903 (wd0 bn 12954039; cn 13707 tn 14 sn 42), retrying
 > wd0e:  uncorrectable data error reading fsbn 7509894 of 7509824-7509903 (wd0 bn 12954039; cn 13707 tn 14 sn 42)
 > wd0e:  uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
 > wd0e:  uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
 > wd0e:  uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
 > wd0e:  uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
 > wd0: soft error (corrected)
 > wd0e:  uncorrectable data error reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
 > wd0e:  uncorrectable data error reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
 > wd0e:  (obsolete) reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
 > wd0e:  uncorrectable data error reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
 > wd0e:  uncorrectable data error reading fsbn 7510134 of 7510128-7510143 (wd0 bn 12954279; cn 13708 tn 3 sn 30), retrying
 > wd0e:  (obsolete) reading fsbn 7510134 of 7510128-7510143 (wd0 bn 12954279; cn 13708 tn 3 sn 30)
 > wd0e:  uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
 > wd0e:  uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
 > wd0e:  uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
 > wd0e:  uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
 > wd0e:  uncorrectable data error reading fsbn 7510374 of 7510368-7510415 (wd0 bn 12954519; cn 13708 tn 7 sn 18), retrying
 > wd0e:  uncorrectable data error reading fsbn 7510374 of 7510368-7510415 (wd0 bn 12954519; cn 13708 tn 7 sn 18)
 > 
 > Now, the absence of the "soft error" output there would make block
 > 7509888 indistinguishable from 7490056 or 7510128.

 Ok, so the text should maybe be changed ('retry succeeded', or something
 like this). But I don't think it's worth repeating the block number.

 > 
 > Also, I think it would be nice if the wd driver kept stats on these
 > errors -- but this part of an overall stats desire that's probably
 > better addressed seperately.

 Yes. diconnect/reselect and tagged command queueing would make this even
 more attractive.

 --
 Manuel Bouyer, LIP6, Universite Paris VI.           Manuel.Bouyer@lip6.fr
 --

From: John Hawkinson <jhawk@MIT.EDU>
To: Manuel Bouyer <bouyer@antioche.lip6.fr>
Cc: gnats-bugs@gnats.netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/9857: wddone() omits block numbers from soft errors 
Date: Tue, 11 Apr 2000 12:02:54 -0400

 >> wd0e:  uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0
 > bn 12954033; cn 13707 tn 14 sn 36), retrying
 >> wd0e:  uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0
 > bn 12954033; cn 13707 tn 14 sn 36), retrying
 >> wd0: soft error (corrected)

 >> Now, the absence of the "soft error" output there would make block
 >> 7509888 indistinguishable from 7490056 or 7510128.
 >
 >Ok, so the text should maybe be changed ('retry succeeded', or something
 >like this).

 Yes, that sounds good to me, it would be more clear and would save
 a documentation update.

 > But I don't think it's worth repeating the block number.

 If you can convince me that they will always be in-order, than
 that seems reasonable.

 >Yes. diconnect/reselect and tagged command queueing would make this even
 >more attractive.

 IDE has tagged command queueing? If so, might there not be a problem
 with two bad blocks being read simultaneously and not knowing
 which one is associated with the successful retry?

 --jhawk
State-Changed-From-To: closed->feedback 
State-Changed-By: jhawk 
State-Changed-When: Thu Apr 13 13:15:45 PDT 2000 
State-Changed-Why:  
Some dispute about the right fix, but the right fix is not no fix, 
and no fix has yet been applied. 
State-Changed-From-To: feedback->analyzed 
State-Changed-By: jhawk 
State-Changed-When: Thu Jul 6 12:45:45 PDT 2000 
State-Changed-Why:  
feedback implies this is waiting on me, which it is not. 
Responsible-Changed-From-To: kern-bug-people->bouyer 
Responsible-Changed-By: bouyer 
Responsible-Changed-When: Mon Apr 7 09:14:52 PDT 2003 
Responsible-Changed-Why:  
I look at this, the idea isn't that bad. 
>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.