NetBSD Problem Report #40569

From tron@zhadum.org.uk  Fri Feb  6 23:02:22 2009
Return-Path: <tron@zhadum.org.uk>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id BEBA063C07F
	for <gnats-bugs@gnats.NetBSD.org>; Fri,  6 Feb 2009 23:02:21 +0000 (UTC)
Message-Id: <200902062302.n16N2HQw001790@colwyn.zhadum.org.uk>
Date: Fri, 6 Feb 2009 23:02:17 GMT
From: tron@zhadum.org.uk
Reply-To: tron@zhadum.org.uk
To: gnats-bugs@gnats.NetBSD.org
Subject: Failed RAIDframe parity rewrite prevents system shutdown
X-Send-Pr-Version: 3.95

>Number:         40569
>Category:       kern
>Synopsis:       Failed RAIDframe parity rewrite prevents system shutdown
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Feb 06 23:05:00 +0000 2009
>Closed-Date:    Thu Feb 19 22:34:50 +0000 2009
>Last-Modified:  Thu Feb 26 08:10:02 +0000 2009
>Originator:     Matthias Scheler
>Release:        NetBSD 5.0_RC1 2009-02-03 sources
>Organization:
Matthias Scheler                                  http://zhadum.org.uk/
>Environment:
System: NetBSD colwyn.zhadum.org.uk 5.0_RC1 NetBSD 5.0_RC1 (COLWYN.64) #0: Fri Feb 6 17:59:15 GMT 2009 tron@colwyn.zhadum.org.uk:/src/sys/compile/COLWYN.64 amd64
Architecture: x86_64
Machine: amd64
>Description:
One of the SATA disks in my server had a few write errors and was ejected
from a RAIDframe RAID 1 a few days ago. When I finally noticed this
morning I initiated a parity rewrite with "raidctl -R /dev/wd2e raid1".
The rebuild failed unfortunately:

raid1: initiating in-place reconstruction on column 0
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
[...]
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15)
wd2: (id not found)
raid1: Recon write failed!
raid1: reconstruction failed.

I retried the parity rewrite but it was rejected by "raidctl" because of
an invalid I/O control. The reconstruction was not tried again. When
I later tried to shut down the system (to check the cabling) the kernel
stopped while unmounting the file systems with this message:

unmounting file systems...raid1: Waiting for reconstruction to stop...

I had to remove the power hard at this point.

>How-To-Repeat:
Use "raidctl -R /dev/<x> raid<y>" and try to shut down the system afterwards.

>Fix:
None known.

>Release-Note:

>Audit-Trail:
From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown 
Date: Fri, 06 Feb 2009 18:47:48 -0600

 tron@zhadum.org.uk writes:
 > >Number:         40569
 > >Category:       kern
 > >Synopsis:       Faild RAIDframe parity rewrite prevents system shutdown
 > >Confidential:   no
 > >Severity:       serious
 > >Priority:       medium
 > >Responsible:    kern-bug-people
 > >State:          open
 > >Class:          sw-bug
 > >Submitter-Id:   net
 > >Arrival-Date:   Fri Feb 06 23:05:00 +0000 2009
 > >Originator:     Matthias Scheler
 > >Release:        NetBSD 5.0_RC1 2009-02-03 sources
 > >Organization:
 > Matthias Scheler                                  http://zhadum.org.uk/
 > >Environment:
 > System: NetBSD colwyn.zhadum.org.uk 5.0_RC1 NetBSD 5.0_RC1 (COLWYN.64) #0: Fri Feb 6 17:59:15 GMT 2009 tron@colwyn.zhadum.org.uk:/src/sys/compile/COLWYN.64 amd64
 > Architecture: x86_64
 > Machine: amd64
 > >Description:
 > One of the SATA disks in my server had a few write errors and was ejected
 > for a RAIDframe RAID 1 a few days ago. When I finally noticed this
 > morning I initiated a parity rewrite with "raidctl -R /dev/wd2e raid1".
 > The rebuild failed unfortunately:
 > 
 > raid1: initiating in-place reconstruction on column 0
 > wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
 > [...]
 > wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15)
 > wd2: (id not found)
 > raid1: Recon write failed!
 > raid1: reconstruction failed.
 > 
 > I retried the parity rewrite but it was rejected by "raidctl" because of
 > an invalid I/O control. 

 Do you have a bit more info on exactly what you tried here and what 
 the error was?  A parity rewrite shouldn't have bumped 
 reconInProgress.

 > The reconstruction was not tried again. When
 > I later tried to shutdown the system (to check the cabling) the kernel
 > stopped while unmounting the file systems with this message:
 > 
 > unmounting file systems...raid1: Waiting for reconstruction to stop...
 > 
 > I had to remove the power hard at this point.
 > 
 > >How-To-Repeat:
 > Use "raidctl -R /dev/<x> raid<y>" and try to shutdown the system afterwards.

 I suspect the reconstruction also needs to fail, and you may need to 
 attempt to do something else again.. but I'm not sure yet... 
 (I can't see how reconInProgress is non-zero in rf_driver.c unless 
 there really is a reconstruction going on... From what you describe 
 here there wasn't an active reconstruction going on, and so I have no 
 clue how it could get into that state... :( )

 Later...

 Greg Oster


From: Matthias Scheler <tron@zhadum.org.uk>
To: Greg Oster <oster@cs.usask.ca>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Sat, 7 Feb 2009 10:33:42 +0000

 On Sat, Feb 07, 2009 at 12:50:03AM +0000, Greg Oster wrote:
 >  > I retried the parity rewrite but it was rejected by "raidctl" because of
 >  > an invalid I/O control. 
 >  
 >  Do you have a bit more info on exactly what you tried here and what 
 >  the error was?

 Not really but I tried another rebuild after powercycling the system (to
 check the cabling) and it failed again:

 raid1: initiating in-place reconstruction on column 0
 wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
 wd3: soft error (corrected)
 wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
 wd2: (id not found)
 ahcisata0 port 2: device present, speed: 1.5Gb/s
 wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
 wd2: (id not found)
 ahcisata0 port 2: device present, speed: 1.5Gb/s
 wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
 wd2: (id not found)
 ahcisata0 port 2: device present, speed: 1.5Gb/s
 wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
 wd2: (id not found)
 ahcisata0 port 2: device present, speed: 1.5Gb/s
 wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
 wd2: (id not found)
 ahcisata0 port 2: device present, speed: 1.5Gb/s
 wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15)
 wd2: (id not found)
 raid1: Recon write failed!
 raid1: reconstruction failed.
 ahcisata0 port 2: device present, speed: 1.5Gb/s
 raid1: Error re-writing parity!

 If you tell me what kind of debugging you would like me to do I can try
 to reproduce the problem by attempting another rebuild.

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: Matthias Scheler <tron@zhadum.org.uk>
To: Greg Oster <oster@cs.usask.ca>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Sat, 7 Feb 2009 10:41:16 +0000

 Another interesting bit of information:

 The rebuild definitely failed ...

 raid1: Error re-writing parity!

 ... but "raidctl -s raid1" still says it is in progress:

 Components:
            /dev/wd2e: reconstructing
            /dev/wd3e: optimal
 No spares.
 /dev/wd2e status is: reconstructing.  Skipping label.
 Component label for /dev/wd3e:
    Row: 0, Column: 1, Num Rows: 1, Num Columns: 2
    Version: 2, Serial Number: 2009011200, Mod Counter: 177
    Clean: No, Status: 0
    sectPerSU: 128, SUsPerPU: 1, SUsPerRU: 1
    Queue size: 100, blocksize: 512, numBlocks: 312581632
    RAID Level: 1
    Autoconfig: Yes
    Root partition: No
    Last configured as: raid1
 Parity status: DIRTY
 Reconstruction is 100% complete.
 Parity Re-write is 100% complete.
 Copyback is 100% complete.

 This looks to me like it hasn't aborted the rebuild completely.

 I'm not sure whether it matters but I was monitoring the rebuild with
 "raidctl -S raid1" until it failed.

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown 
Date: Sat, 07 Feb 2009 12:54:54 -0600

 Matthias Scheler writes:
 > On Sat, Feb 07, 2009 at 12:50:03AM +0000, Greg Oster wrote:
 > >  > I retried the parity rewrite but it was rejected by "raidctl" because of
 > >  > an invalid I/O control. 
 > >  
 > >  Do you have a bit more info on exactly what you tried here and what 
 > >  the error was?
 > 
 > Not really but I tried another rebuild after powercycling the system (to
 > check the cabling) and it failed again:
 > 
 > aid1: initiating in-place reconstruction on column 0
 > wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
 > wd3: soft error (corrected)
 > wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
 [snip]
 > wd2: (id not found)
 > raid1: Recon write failed!
 > raid1: reconstruction failed.
 > ahcisata0 port 2: device present, speed: 1.5Gb/s
 > raid1: Error re-writing parity!

 I don't understand where this last line is coming from... Unless it 
 finished rebuilding parity for raid0, and it's just coincidence that 
 it finished at exactly this spot?

 > If you tell me what kind of debugging you would like me to do I can try
 > to reproduce the problem by attempting another rebuild.

 Hmmmmm....  Around line 857 of src/sys/dev/raidframe/rf_reconstruct.c
 there is a:

   return (1);

 You could try adding a printf() just before that line, and see if 
 that gets printed....  I bet it doesn't... 

 I *think* you're getting hung up in the:

 		if (!write_error) {
 			/* wait for writes to complete */
 			while (raidPtr->reconControl->pending_writes > 0) {

 part of rf_ContinueReconstructFailedDisk().

 It seems that you've had a (corrected) read error on wd3e.. but I'm 
 wondering if that's contributing to the problem here..  The issue, I 
 think, is that there are still pending writes, or that the code 
 thinks there are pending writes...  I know this code was tested on a 
 disk that had real failing writes, but it's unlikely that they were 
 exactly the same as what you're seeing, and so there's room here for 
 bugs... 

 Oh... It just hit me:

 wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
  wd3: soft error (corrected)
  wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
  wd2: (id not found)

 What type of disks are these, and do they have the 'LBA48-quirk' entry 
 to change addressing modes or whatever for block 268435455?  (just hunt 
 for that block number in Google for more info...)  There are other 
 PRs (like 38376) which describe these same sort of symptoms...

 Later...

 Greg Oster


From: Matthias Scheler <tron@zhadum.org.uk>
To: gnats-bugs@NetBSD.org
Cc: Greg Oster <oster@cs.usask.ca>
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Sat, 7 Feb 2009 19:15:29 +0000

 On Sat, Feb 07, 2009 at 06:55:01PM +0000, Greg Oster wrote:
 >  Hmmmmm....  Around line 857 of src/sys/dev/raidframe/rf_reconstruct.c
 >  there is a:
 >  
 >    return (1);
 >  
 >  You could try adding a printf() just before that line, and see if 
 >  that gets printed....  I bet it doesn't... 

 I'll make that change before I reboot the system the next time.

 >  wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
 >   wd3: soft error (corrected)
 >   wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
 >   wd2: (id not found)
 >  
 >  What type of disks are these, and do they have the the 'LBA48-quirk' entry 
 >  to change addressing modes or whatever for block 268435455?

 No, they apparently don't. But that's not the problem. The parity rewrite
 worked in the past despite this bug. And if I attempt to rebuild the
 parity now one of the disks, probably wd2, produces really alarming
 "clonk" noises. I think it is just a case of a broken disk.

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown 
Date: Sat, 07 Feb 2009 13:23:33 -0600

 Matthias Scheler writes:
 > On Sat, Feb 07, 2009 at 06:55:01PM +0000, Greg Oster wrote:
 > >  Hmmmmm....  Around line 857 of src/sys/dev/raidframe/rf_reconstruct.c
 > >  there is a:
 > >  
 > >    return (1);
 > >  
 > >  You could try adding a printf() just before that line, and see if 
 > >  that gets printed....  I bet it doesn't... 
 > 
 > I'll make that change before I reboot the system the next time.

 K.. If I've caught you in time, please add a printf here:

  /* wait for writes to complete */

 in rf_reconstruct.c: rf_ContinueReconstructFailedDisk()
 and in that next while() loop, do something like:

    while (raidPtr->reconControl->pending_writes > 0) {
  	printf("pending writes: %d\n",raidPtr->reconControl->pending_writes);

 to print the value of raidPtr->reconControl->pending_writes.

 I'd basically like to know whether it thinks there have been write 
 errors at that point, and if not, then how many pending writes it's 
 waiting for...

 > >  wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
 > >   wd3: soft error (corrected)
 > >   wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
 > >   wd2: (id not found)
 > >  
 > >  What type of disks are these, and do they have the the 'LBA48-quirk' entry
 > >  to change addressing modes or whatever for block 268435455?
 > 
 > No, they apparently don't. But that's not the problem. The parity rewrite
 > worked in the past despite this bug. 

 Well... parity re-writes only read the disks and do writes if there 
 are errors... so if it didn't need to write to that block before, it 
 wouldn't have been detected...

 > And if I attempt to rebuild the
 > parity now one of the disks, probably wd2, produces really alarming
 > "clonk" noises. I think it is just a case of a broken disk.

 "clonk" noises from disks are never good :-}

 Later...

 Greg Oster


From: Matthias Scheler <tron@zhadum.org.uk>
To: gnats-bugs@NetBSD.org
Cc: Greg Oster <oster@cs.usask.ca>
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Mon, 9 Feb 2009 19:48:22 +0000

 On Sat, Feb 07, 2009 at 07:25:02PM +0000, Greg Oster wrote:
 >  I'd basicall like to know whether it thinks there have been write 
 >  errors at that point, and if not, then how many pending writes it's 
 >  waiting for...

 1.) When I rebooted to install the new kernel the system managed to reboot
     despite the failed re-construction.

 2.) This is what I got when I attempted a re-construction with the
     new kernel:

 RECON: initiating reconstruction on col 0 -> spare at col 2
 wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
 wd3: soft error (corrected)
 wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
 wd2: (id not found)
 ahcisata0 port 2: device present, speed: 1.5Gb/s
 wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
 wd2: (id not found)
 ahcisata0 port 2: device present, speed: 1.5Gb/s
 wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
 wd2: (id not found)
 ahcisata0 port 2: device present, speed: 1.5Gb/s
 wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
 wd2: (id not found)
 ahcisata0 port 2: device present, speed: 1.5Gb/s
 wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
 wd2: (id not found)
 ahcisata0 port 2: device present, speed: 1.5Gb/s
 wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15)
 wd2: (id not found)
 raid1: Recon write failed!
 raid1: reconstruction failed.
 pending writes: 1
 ahcisata0 port 2: device present, speed: 1.5Gb/s

 This time it actually seems to fail because of the LBA48 bug. I now
 remember that re-construct was done in the opposite direction in
 the past. I'll try to add the harddisk to the quirk table.

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: Matthias Scheler <tron@zhadum.org.uk>
To: NetBSD GNATS <gnats-bugs@NetBSD.org>
Cc: Greg Oster <oster@cs.usask.ca>
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Mon, 9 Feb 2009 21:07:13 +0000

 On Mon, Feb 09, 2009 at 07:50:03PM +0000, Matthias Scheler wrote:
 >  This time it actually seems to fail because of the LBA48 bug. I know
 >  remember that re-construct was done in the opposite direction in
 >  the past. I'll try to add the harddisk to the quick table.

 1.) The system wedged again during shutdown:

 ahcisata0 port 2: device present, speed: 1.5Gb/s
 Feb  9 19:56:43 colwyn su: tron to root on /dev/ttyp1
 Feb  9 19:57:56 colwyn shutdown: reboot by tron: Kernel bug fix 
 Feb  9 19:58:11 colwyn syslogd: Exiting on signal 15
 syncing disks... 5 done
 unmounting file systems...raid1: Waiting for reconstruction to stop...

 2.) The kernel with the two hard disks in the quirk table has managed
     to re-construct the RAID 1.

 So it seems this is not a RAIDframe bug after all but rather a problem
 with the drives (and possibly error handling in ahcisata(4)).

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: Matthias Scheler <tron@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/40569 CVS commit: src/sys/dev/ata
Date: Mon,  9 Feb 2009 22:34:23 +0000 (UTC)

 Module Name:	src
 Committed By:	tron
 Date:		Mon Feb  9 22:34:23 UTC 2009

 Modified Files:
 	src/sys/dev/ata: wd.c

 Log Message:
 Add two more entries to the quirk table for hard disks which need the
 LBA 48 work around. The first entry will match the Seagate ST3160815AS
 (and similar models), the second one HP's OEM version of the same drive.

 This avoids the RAID rebuild problems described in PR kern/40569.


 To generate a diff of this commit:
 cvs rdiff -r1.368 -r1.369 src/sys/dev/ata/wd.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown 
Date: Mon, 09 Feb 2009 17:01:45 -0600

 Matthias Scheler writes:
 > The following reply was made to PR kern/40569; it has been noted by GNATS.
 > 
 > From: Matthias Scheler <tron@zhadum.org.uk>
 > To: NetBSD GNATS <gnats-bugs@NetBSD.org>
 > Cc: Greg Oster <oster@cs.usask.ca>
 > Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
 > Date: Mon, 9 Feb 2009 21:07:13 +0000
 > 
 >  On Mon, Feb 09, 2009 at 07:50:03PM +0000, Matthias Scheler wrote:
 >  >  This time it actually seems to fail because of the LBA48 bug. I know
 >  >  remember that re-construct was done in the opposite direction in
 >  >  the past. I'll try to add the harddisk to the quick table.
 >  
 >  1.) The system wedge again during shutdown:
 >  
 >  ahcisata0 port 2: device present, speed: 1.5Gb/s
 >  Feb  9 19:56:43 colwyn su: tron to root on /dev/ttyp1
 >  Feb  9 19:57:56 colwyn shutdown: reboot by tron: Kernel bug fix 
 >  Feb  9 19:58:11 colwyn syslogd: Exiting on signal 15
 >  syncing disks... 5 done
 >  unmounting file systems...raid1: Waiting for reconstruction to stop...

 So it is very likely sleeping in the reconstruct code and waiting for 
 a write that is never going to happen...

 AHHHH... I think I see the bug: there is at least one missing:

  num_writes++;

 in rf_reconstruct:rf_ContinueReconstructFailedDisk().  (it might be 
 that two are missing.. I need to do more analysis...)  Basically 
 writes with errors are still writes that need to be accounted for, 
 and that's not happening properly...  I'll see about getting this 
 fixed for 5.0.  (testing may prove to be a pain... I may have to 
 resurrect an old testing box so that I have some disks with 
 real write errors... and I'm not sure those will even be sufficient 
 to replicate this :-/ )

 >  2.) The kernel with the two hard disks in the quick table has managed
 >      to re-construct the RAID 1.
 >  
 >  So it seems this is not a RAIDframe bug after all but rather a problem
 >  with the drives (and eventually error handling in ahcisata(4)).

 There's a RAIDframe bug in there too... 

 Later...

 Greg Oster
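
A minimal sketch of the accounting problem described above, not the RAIDframe code itself: a waiter blocks until a pending-write counter reaches zero, so every issued write has to be retired exactly once whether it succeeds or fails. The struct and all function names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the reconstruction bookkeeping. */
struct recon_state {
	int pending_writes;	/* writes issued but not yet retired */
	int failed_writes;	/* writes that completed with an error */
};

/* Issue a write: it stays pending until its completion handler runs. */
void issue_write(struct recon_state *rs)
{
	rs->pending_writes++;
}

/* Buggy completion: a failed write is recorded but never retired. */
void complete_write_buggy(struct recon_state *rs, bool ok)
{
	assert(rs->pending_writes > 0);
	if (ok)
		rs->pending_writes--;
	else
		rs->failed_writes++;	/* pending_writes stays above zero */
}

/* Fixed completion: a failed write is still a write that has finished. */
void complete_write_fixed(struct recon_state *rs, bool ok)
{
	assert(rs->pending_writes > 0);
	if (!ok)
		rs->failed_writes++;
	rs->pending_writes--;
}

/* A path that waits for outstanding writes can only proceed when this holds. */
bool can_stop(const struct recon_state *rs)
{
	return rs->pending_writes == 0;
}
```

With the buggy handler, one failed write leaves `can_stop()` false forever, which is consistent with the "pending writes: 1" output and the hang at "Waiting for reconstruction to stop..." reported earlier. How RAIDframe actually counts completions may differ from this model.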


From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org,
        tron@zhadum.org.uk
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system
	shutdown
Date: Tue, 10 Feb 2009 13:11:13 +0100

 On Mon, Feb 09, 2009 at 07:50:03PM +0000, Matthias Scheler wrote:
 >  On Sat, Feb 07, 2009 at 07:25:02PM +0000, Greg Oster wrote:
 >  >  I'd basicall like to know whether it thinks there have been write 
 >  >  errors at that point, and if not, then how many pending writes it's 
 >  >  waiting for...
 >  
 >  1.) When I rebooted to install the new kernel the system managed to reboot
 >      despite the failed re-construction.
 >  
 >  2.) This is what I got when I attempted a re-construction with the
 >      new kernel:
 >  
 >  RECON: initiating reconstruction on col 0 -> spare at col 2
 >  wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
 >  wd3: soft error (corrected)
 >  wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
 >  wd2: (id not found)

 I don't understand why the problem is properly detected and worked around for
 wd3, but not for wd2 ? Are wd2 and wd3 identical devices ?

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Matthias Scheler <tron@zhadum.org.uk>
To: gnats-bugs@NetBSD.org
Cc: Manuel Bouyer <bouyer@antioche.eu.org>
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Tue, 10 Feb 2009 12:16:29 +0000

 On Tue, Feb 10, 2009 at 12:15:04PM +0000, Manuel Bouyer wrote:
 >  I don't understand why the problem is properly detected and worked
 >  around for wd3, but not for wd2 ? Are wd2 and wd3 identical devices?

 Well, almost.

 "wd2" is HP's OEM version of the same Seagate drive as "wd3". But the
 HP drive definitely has a different firmware which e.g. doesn't
 enable the write-cache by default.

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Matthias Scheler <tron@zhadum.org.uk>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system
	shutdown
Date: Tue, 10 Feb 2009 18:25:00 +0100

 On Tue, Feb 10, 2009 at 12:16:29PM +0000, Matthias Scheler wrote:
 > On Tue, Feb 10, 2009 at 12:15:04PM +0000, Manuel Bouyer wrote:
 > >  I don't understand why the problem is properly detected and worked
 > >  around for wd3, but not for wd2 ? Are wd2 and wd3 identical devices?
 > 
 > Well, almost.
 > 
 > "wd2" is HP's OEM version of the same Seagate drive as "wd3". But the
 > HP drive definitely has a different firmware which e.g. doesn't
 > enable the write-cache by default.

 The LBA48 bug should have been properly detected; it also reported an
 "id not found", and with the right address. Maybe it's a read vs write issue ?
 Could you check if a read on wd2 would trigger proper detection (obviously
 you have to remove the quirk entry first) ?
 something like
 dd if=/dev/rwd2e of=/dev/null bs=32k skip=4194302 count=5
 should do it.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --
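
As a worked check of the dd arguments suggested above, the following sketch computes which sectors the command reads, assuming the 512-byte sectors reported elsewhere in this PR. The helper names and the `LBA28_LAST` constant are hypothetical; the values are relative to the start of the dd input (here the partition), which is assumed to be the same frame of reference as the fsbn numbers in the kernel messages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SIZE	512u		/* assumed sector size */
#define LBA28_LAST	0x0fffffffu	/* 268435455, the block named in the logs */

/* First sector, relative to the start of the input, that dd reads. */
uint64_t dd_first_sector(uint64_t bs_bytes, uint64_t skip_blocks)
{
	assert(bs_bytes % SECTOR_SIZE == 0);
	return skip_blocks * (bs_bytes / SECTOR_SIZE);
}

/* Last sector, relative to the start of the input, that dd reads. */
uint64_t dd_last_sector(uint64_t bs_bytes, uint64_t skip_blocks,
    uint64_t count)
{
	assert(count > 0);
	return dd_first_sector(bs_bytes, skip_blocks) +
	    count * (bs_bytes / SECTOR_SIZE) - 1;
}

/* True if sector lies within the inclusive range [first, last]. */
bool range_contains(uint64_t first, uint64_t last, uint64_t sector)
{
	return first <= sector && sector <= last;
}
```

For `bs=32k skip=4194302 count=5` this gives sectors 268435328 through 268435647, a range that contains both block 268435455 and the failing transfer 268435392-268435519 from the logs.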

From: Matthias Scheler <tron@zhadum.org.uk>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Tue, 10 Feb 2009 17:42:49 +0000

 On Tue, Feb 10, 2009 at 06:25:00PM +0100, Manuel Bouyer wrote:
 > Could you check if a read on wd2 would trigger proper detection (obviously
 > you have to remove the quirk entry first) ?

 I'm 99% sure it does because I could construct the RAID previously when
 RAIDframe read from "wd2" and wrote to "wd3". The problems
 started when it had to rewrite parity the other way around (with
 "wd3" as the source).

 > something like
 > dd if=/dev/rwd2e of=/dev/null bs=32k skip=4194302 count=5
 > should do it.

 Is the above good enough? I would really like to avoid rebooting that
 machine again as it is my server.

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: Matthias Scheler <tron@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/40569 CVS commit: src/sys/dev/ata
Date: Tue, 10 Feb 2009 19:45:22 +0000 (UTC)

 Module Name:	src
 Committed By:	tron
 Date:		Tue Feb 10 19:45:22 UTC 2009

 Modified Files:
 	src/sys/dev/ata: wd.c

 Log Message:
 Backout LBA 48 quirk entries which were added to fix one aspect of
 PR kern/40569 because of objections by Manuel Bouyer.


 To generate a diff of this commit:
 cvs rdiff -r1.369 -r1.370 src/sys/dev/ata/wd.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Matthias Scheler <tron@zhadum.org.uk>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Tue, 10 Feb 2009 23:12:38 +0100

 On Tue, Feb 10, 2009 at 05:42:49PM +0000, Matthias Scheler wrote:
 > On Tue, Feb 10, 2009 at 06:25:00PM +0100, Manuel Bouyer wrote:
 > > Could you check if a read on wd2 would trigger proper detection (obviously
 > > you have to remove the quirk entry first) ?
 > 
 > I'm 99% sure it does because I could construct the RAID previously when
 > RAIDframe read from "wd2" and wrote to "wd3" previously. The problems
 > started when it had to rewrite parity the other way around (with
 > "wd3" as the source).
 > 
 > > something like
 > > dd if=/dev/rwd2e of=/dev/null bs=32k skip=4194302 count=5
 > > should do it.
 > 
 > Is the above good enough? I would really like to avoid rebooting that
 > machine again as it is my server.

 I've reproduced it on a system here and I think I've a fix.
 Once again a DIAGNOSTIC kernel would have pointed right to the problem;
 I don't think removing DIAGNOSTIC from GENERIC (at least on -current)
 was a good move.

 Just to confirm; you're using a AHCI controller, right ?

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Matthias Scheler <tron@zhadum.org.uk>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Tue, 10 Feb 2009 22:17:06 +0000

 On Tue, Feb 10, 2009 at 11:12:38PM +0100, Manuel Bouyer wrote:
 > I've reproduced it on a system here and I think I've a fix.

 That's great news.

 > Just to confirm; you're using a AHCI controller, right ?

 Yes, I do:

 atabus4 at ahcisata0 channel 2
 atabus5 at ahcisata0 channel 3
 [...]
 ahcisata0 port 2: device present, speed: 1.5Gb/s
 ahcisata0 port 3: device present, speed: 3.0Gb/s
 [...]
 wd2 at atabus4 drive 0: <FB160C4081>
 wd2: quirks 2<FORCE_LBA48>
 wd2: drive supports 16-sector PIO transfers, LBA48 addressing
 wd2: 149 GB, 310101 cyl, 16 head, 63 sec, 512 bytes/sect x 312581808 sectors
 wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
 wd2(ahcisata0:2:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100) (using DMA)
 wd3 at atabus5 drive 0: <ST3160815AS>
 wd3: quirks 2<FORCE_LBA48>
 wd3: drive supports 16-sector PIO transfers, LBA48 addressing
 wd3: 149 GB, 310101 cyl, 16 head, 63 sec, 512 bytes/sect x 312581808 sectors
 wd3: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
 wd3(ahcisata0:3:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA)

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Matthias Scheler <tron@zhadum.org.uk>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Tue, 10 Feb 2009 23:24:40 +0100

 --ZPt4rx8FFjLCG7dd
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: inline

 On Tue, Feb 10, 2009 at 10:17:06PM +0000, Matthias Scheler wrote:
 > On Tue, Feb 10, 2009 at 11:12:38PM +0100, Manuel Bouyer wrote:
 > > I've reproduced it on a system here and I think I've a fix.
 > 
 > That's great news.
 > 
 > > Just to confirm; you're using a AHCI controller, right ?
 > 
 > Yes, I do:
 > 
 > atabus4 at ahcisata0 channel 2
 > atabus5 at ahcisata0 channel 3
 > [...]
 > ahcisata0 port 2: device present, speed: 1.5Gb/s
 > ahcisata0 port 3: device present, speed: 3.0Gb/s

 OK, so it's probably an issue with the ahci controller: b_resid was set to
 0 even in case of failure; and it's used in the LBA48 workaround detection
 to see if we crossed the boundary ... I think the attached patch fixes it but
 unfortunately my test box didn't reboot after panic so I can't test before
 tomorrow.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

 --ZPt4rx8FFjLCG7dd
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: attachment; filename=diff

 Index: ahcisata_core.c
 ===================================================================
 RCS file: /cvsroot/src/sys/dev/ic/ahcisata_core.c,v
 retrieving revision 1.18
 diff -u -p -u -r1.18 ahcisata_core.c
 --- ahcisata_core.c	3 Oct 2008 13:02:08 -0000	1.18
 +++ ahcisata_core.c	10 Feb 2009 22:22:42 -0000
 @@ -1065,7 +1065,7 @@ ahci_bio_complete(struct ata_channel *ch
  		ata_bio->error = TIMEOUT;
  	} else {
  		callout_stop(&chp->ch_callout);
 -		ata_bio->error = 0;
 +		ata_bio->error = NOERROR;
  	}

  	chp->ch_queue->active_xfer = NULL;
 @@ -1095,7 +1095,14 @@ ahci_bio_complete(struct ata_channel *ch
  	    BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE);
  	AHCIDEBUG_PRINT(("ahci_bio_complete bcount %ld",
  	    ata_bio->bcount), DEBUG_XFERS);
 -	ata_bio->bcount -= le32toh(achp->ahcic_cmdh[slot].cmdh_prdbc);
 +	/* 
 +	 * if it was a write, complete data buffer may have been transfered
 +	 * before error detection; in this case don't use cmdh_prdbc
 +	 * as it won't reflect what was written to media. Assume nothing
 +	 * was transfered and leave bcount as-is.
 +	 */
 +	if ((ata_bio->flags & ATA_READ) || ata_bio->error != NOERROR)
 +		ata_bio->bcount -= le32toh(achp->ahcic_cmdh[slot].cmdh_prdbc);
  	AHCIDEBUG_PRINT((" now %ld\n", ata_bio->bcount), DEBUG_XFERS);
  	(*chp->ch_drive[drive].drv_done)(chp->ch_drive[drive].drv_softc);
  	atastart(chp);

 --ZPt4rx8FFjLCG7dd--

From: Matthias Scheler <tron@zhadum.org.uk>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Failed RAIDframe parity rewrite prevents system shutdown
Date: Wed, 11 Feb 2009 07:26:17 +0000

 On Tue, Feb 10, 2009 at 11:24:40PM +0100, Manuel Bouyer wrote:
 > OK, so it's probably an issue with the ahci controller: b_resid was set to
 > 0 even in case of failure, and it's used in the LBA48 workaround detection
 > to see if we crossed the boundary. I think the attached patch fixes it, but
 > unfortunately my test box didn't reboot after the panic, so I can't test
 > before tomorrow.

 The fix doesn't work on my system:

 raid1: initiating in-place reconstruction on column 0
 wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
 wd3: soft error (corrected)
 wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
 wd2: (id not found)
 ahcisata0 port 2: device present, speed: 1.5Gb/s
 wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
 wd2: (id not found)
 ahcisata0 port 2: device present, speed: 1.5Gb/s
 wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
 wd2: (id not found)
 ahcisata0 port 2: device present, speed: 1.5Gb/s
 wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
 wd2: (id not found)
 ahcisata0 port 2: device present, speed: 1.5Gb/s
 wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
 wd2: (id not found)
 ahcisata0 port 2: device present, speed: 1.5Gb/s
 wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15)
 wd2: (id not found)
 raid1: Recon write failed!
 raid1: reconstruction failed.
 ahcisata0 port 2: device present, speed: 1.5Gb/s

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Matthias Scheler <tron@zhadum.org.uk>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Failed RAIDframe parity rewrite prevents system
	shutdown
Date: Wed, 11 Feb 2009 13:09:30 +0100

 --NzB8fVQJ5HfG6fxh
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: inline

 On Wed, Feb 11, 2009 at 07:26:17AM +0000, Matthias Scheler wrote:
 > On Tue, Feb 10, 2009 at 11:24:40PM +0100, Manuel Bouyer wrote:
 > > OK, so it's probably an issue with the ahci controller: b_resid was set to
 > > 0 even in case of failure, and it's used in the LBA48 workaround detection
 > > to see if we crossed the boundary. I think the attached patch fixes it, but
 > > unfortunately my test box didn't reboot after the panic, so I can't test
 > > before tomorrow.
 > 
 > The fix doesn't work on my system:
 > 
 > raid1: initiating in-place reconstruction on column 0
 > wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
 > wd3: soft error (corrected)
 > wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
 > wd2: (id not found)

 There was an inverted test condition in my patch; the attached one should work
 (it does for me at least; with it, a write at the LBA48 address triggers
 the workaround detection).

 -- 
 Manuel Bouyer, LIP6, Universite Paris VI.           Manuel.Bouyer@lip6.fr
      NetBSD: 26 ans d'experience feront toujours la difference
 --

 --NzB8fVQJ5HfG6fxh
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: attachment; filename=diff

 Index: ahcisata_core.c
 ===================================================================
 RCS file: /cvsroot/src/sys/dev/ic/ahcisata_core.c,v
 retrieving revision 1.18
 diff -u -p -u -r1.18 ahcisata_core.c
 --- ahcisata_core.c	3 Oct 2008 13:02:08 -0000	1.18
 +++ ahcisata_core.c	11 Feb 2009 12:07:01 -0000
 @@ -1065,7 +1065,7 @@ ahci_bio_complete(struct ata_channel *ch
  		ata_bio->error = TIMEOUT;
  	} else {
  		callout_stop(&chp->ch_callout);
 -		ata_bio->error = 0;
 +		ata_bio->error = NOERROR;
  	}

  	chp->ch_queue->active_xfer = NULL;
 @@ -1095,7 +1095,14 @@ ahci_bio_complete(struct ata_channel *ch
  	    BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE);
  	AHCIDEBUG_PRINT(("ahci_bio_complete bcount %ld",
  	    ata_bio->bcount), DEBUG_XFERS);
 -	ata_bio->bcount -= le32toh(achp->ahcic_cmdh[slot].cmdh_prdbc);
 +	/* 
 +	 * if it was a write, complete data buffer may have been transfered
 +	 * before error detection; in this case don't use cmdh_prdbc
 +	 * as it won't reflect what was written to media. Assume nothing
 +	 * was transfered and leave bcount as-is.
 +	 */
 +	if ((ata_bio->flags & ATA_READ) || ata_bio->error == NOERROR)
 +		ata_bio->bcount -= le32toh(achp->ahcic_cmdh[slot].cmdh_prdbc);
  	AHCIDEBUG_PRINT((" now %ld\n", ata_bio->bcount), DEBUG_XFERS);
  	(*chp->ch_drive[drive].drv_done)(chp->ch_drive[drive].drv_softc);
  	atastart(chp);

 --NzB8fVQJ5HfG6fxh--
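
The corrected condition in the second patch can be restated as a small standalone function. This is a sketch under stated assumptions: `residual_after_completion` and its parameters are hypothetical names chosen for illustration, whereas the driver itself operates on `ata_bio->bcount` and the `cmdh_prdbc` field shown in the diff. The clamp against underflow is also an addition of this example, not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical restatement of the patched logic: return the number of
 * bytes still considered outstanding after a command completes.
 *
 * requested        - byte count of the original request
 * controller_count - bytes the controller reports as transferred
 * is_read          - true for reads, false for writes
 * failed           - true if the command completed with an error
 */
static uint32_t
residual_after_completion(uint32_t requested, uint32_t controller_count,
    bool is_read, bool failed)
{
	/*
	 * For a failed write the controller may have fetched the whole
	 * buffer before the error was detected, so its count says nothing
	 * about what reached the media: treat nothing as transferred.
	 */
	if (!is_read && failed)
		return requested;

	/* Assumption of this sketch: clamp rather than underflow. */
	if (controller_count > requested)
		return 0;
	return requested - controller_count;
}
```

A failed 64 KiB write therefore keeps its full residual even if the controller reports all 65536 bytes as moved, which is what allows the LBA48 workaround detection to see the boundary crossing.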

From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/40569: Failed RAIDframe parity rewrite prevents system shutdown
Date: Wed, 11 Feb 2009 10:16:24 -0600

 Matthias Scheler writes:
 > The following reply was made to PR kern/40569; it has been noted by GNATS.
 > 
 > From: Matthias Scheler <tron@zhadum.org.uk>
 > To: Manuel Bouyer <bouyer@antioche.eu.org>
 > Cc: gnats-bugs@NetBSD.org
 > Subject: Re: kern/40569: Failed RAIDframe parity rewrite prevents system shutdown
 > Date: Wed, 11 Feb 2009 07:26:17 +0000
 > 
 >  On Tue, Feb 10, 2009 at 11:24:40PM +0100, Manuel Bouyer wrote:
 >  > OK, so it's probably an issue with the ahci controller: b_resid was set to
 >  > 0 even in case of failure, and it's used in the LBA48 workaround detection
 >  > to see if we crossed the boundary. I think the attached patch fixes it, but
 >  > unfortunately my test box didn't reboot after the panic, so I can't test
 >  > before tomorrow.
 >  
 >  The fix doesn't work on my system:
 >  
 >  raid1: initiating in-place reconstruction on column 0
 >  wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
 >  wd3: soft error (corrected)
 [snip]
 >  raid1: Recon write failed!
 >  raid1: reconstruction failed.
 >  ahcisata0 port 2: device present, speed: 1.5Gb/s
 >  

 Are you planning to do more testing?  If so, I can get a patch for 
 the RAIDframe issue to you as well...  (I think I have a patch that 
 will work, but I haven't validated it yet..)

 Later...

 Greg Oster


From: Matthias Scheler <tron@zhadum.org.uk>
To: gnats-bugs@NetBSD.org
Cc: Greg Oster <oster@cs.usask.ca>
Subject: Re: kern/40569: Failed RAIDframe parity rewrite prevents system shutdown
Date: Wed, 11 Feb 2009 16:52:09 +0000

 On Wed, Feb 11, 2009 at 04:20:03PM +0000, Greg Oster wrote:
 >  Are you planning to do more testing?

 Not really, but I could do.

 >  If so, I can get a patch for 
 >  the RAIDframe issue to you as well...  (I think I have a patch that 
 >  will work, but I haven't validated it yet..)

 I would need that patch first because Manuel's patch is supposed to
 avoid the RAID rebuild issue in the first place.

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: Greg Oster <oster@cs.usask.ca>
To: Matthias Scheler <tron@zhadum.org.uk>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Failed RAIDframe parity rewrite prevents system shutdown
Date: Wed, 11 Feb 2009 15:18:32 -0600

 This is a multipart MIME message.

 --==_Exmh_1234387000_241040
 Content-Type: text/plain; charset=us-ascii

 Matthias Scheler writes:
 > On Wed, Feb 11, 2009 at 04:20:03PM +0000, Greg Oster wrote:
 > >  Are you planning to do more testing?
 > 
 > Not really, but I could do.

 If you could, that would be great... If you can't, no worries -- I 
 can nuke the LBA48 patch for the drives on my test box and attempt to 
 test this myself...

 > >  If so, I can get a patch for 
 > >  the RAIDframe issue to you as well...  (I think I have a patch that 
 > >  will work, but I haven't validated it yet..)
 > 
 > I would need that patch first because Manuel's patch is supposed to
 > avoid the RAID rebuild issue in the first place.

 See attached.  (I think it's actually the last part of the patch that 
 will make the difference in your case, but the other two changes fix 
 issues too...)

 Thanks!

 Later...

 Greg Oster


 --==_Exmh_1234387000_241040
 Content-Type: text/plain ; name="rf_reconstruct.c.diff"; charset=us-ascii
 Content-Description: rf_reconstruct.c.diff

 Index: rf_reconstruct.c
 ===================================================================
 RCS file: /cvsroot/src/sys/dev/raidframe/rf_reconstruct.c,v
 retrieving revision 1.106
 diff -u -r1.106 rf_reconstruct.c
 --- rf_reconstruct.c	20 Dec 2008 17:04:51 -0000	1.106
 +++ rf_reconstruct.c	11 Feb 2009 19:33:43 -0000
 @@ -676,8 +676,10 @@
  				   done dealing with the reads that are
  				   finished, we don't want to wait for any
  				   writes */
 -				if (status == RF_RECON_WRITE_ERROR)
 +				if (status == RF_RECON_WRITE_ERROR) {
  					write_error = 1;
 +					num_writes++;
 +				}

  			} else if (status == RF_RECON_READ_STOPPED) {
  				/* count this component as being "done" */
 @@ -718,12 +720,13 @@
  			status = ProcessReconEvent(raidPtr, event);

  			if (status == RF_RECON_WRITE_ERROR) {
 +				num_writes++;
  				recon_error = 1;
  				raidPtr->reconControl->error = 1;
  				/* an error was encountered at the very end... bail */
  			} else if (status == RF_RECON_WRITE_DONE) {
  				num_writes++;
 -			}
 +			} /* else it's something else, and we don't care */
  		}
  		if (recon_error || 
  		    (raidPtr->reconControl->lastPSID == lastPSID)) {
 @@ -1054,6 +1057,12 @@
  	case RF_REVENT_WRITE_FAILED:
  		retcode = RF_RECON_WRITE_ERROR;

 +		/* This is an error, but it was a pending write.
 +		   Account for it. */
 +		RF_LOCK_MUTEX(raidPtr->reconControl->rb_mutex);
 +		raidPtr->reconControl->pending_writes--;
 +		RF_UNLOCK_MUTEX(raidPtr->reconControl->rb_mutex);
 +
  		rbuf = (RF_ReconBuffer_t *) event->arg;

  		/* cleanup the disk queue data */

 --==_Exmh_1234387000_241040--
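
The accounting problem addressed by the last hunk can be modelled in a few lines. This is a simplified, single-threaded sketch and not RAIDframe code: the type and function names (`recon_state`, `recon_issue_write`, `recon_handle_event`, `recon_writes_drained`) are hypothetical, and the locking that the patch performs around `pending_writes` is deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical event types, loosely mirroring the write completions. */
enum recon_event { EV_WRITE_DONE, EV_WRITE_FAILED, EV_OTHER };

/* Hypothetical, much reduced stand-in for the reconstruction state. */
struct recon_state {
	int  pending_writes;	/* writes issued but not yet completed */
	bool error;		/* set once any write has failed */
};

static void
recon_issue_write(struct recon_state *rs)
{
	rs->pending_writes++;
}

static void
recon_handle_event(struct recon_state *rs, enum recon_event ev)
{
	switch (ev) {
	case EV_WRITE_FAILED:
		rs->error = true;
		/* FALLTHROUGH: a failed write has still completed. */
	case EV_WRITE_DONE:
		rs->pending_writes--;
		break;
	case EV_OTHER:
		break;
	}
}

/* The reconstruction may only finish (or abort) once this is true. */
static bool
recon_writes_drained(const struct recon_state *rs)
{
	return rs->pending_writes == 0;
}
```

If the failure case skipped the decrement, `recon_writes_drained()` would never become true after a write error, which matches the hang at shutdown reported in this PR.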


From: Matthias Scheler <tron@zhadum.org.uk>
To: Greg Oster <oster@cs.usask.ca>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Failed RAIDframe parity rewrite prevents system shutdown
Date: Wed, 11 Feb 2009 22:12:44 +0000

 On Wed, Feb 11, 2009 at 03:18:32PM -0600, Greg Oster wrote:
 > Matthias Scheler writes:
 > > On Wed, Feb 11, 2009 at 04:20:03PM +0000, Greg Oster wrote:
 > > >  Are you planning to do more testing?
 > > 
 > > Not really, but I could do.
 > 
 > If you could, that would be great... If you can't, no worries -- I 
 > can nuke the LBA48 patch for the drives on my test box and attempt to 
 > test this myself...

 I've booted a kernel with your patch and without Manuel's. The currently
 ongoing RAID reconstruction should therefore fail and I can tell whether
 it was aborted properly this time.

 	Kind regards


 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: Matthias Scheler <tron@zhadum.org.uk>
To: Greg Oster <oster@cs.usask.ca>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Failed RAIDframe parity rewrite prevents system shutdown
Date: Wed, 11 Feb 2009 22:51:11 +0000

 On Wed, Feb 11, 2009 at 10:12:44PM +0000, Matthias Scheler wrote:
 > > If you could, that would be great... If you can't, no worries -- I 
 > > can nuke the LBA48 patch for the drives on my test box and attempt to 
 > > test this myself...
 > 
 > I've booted a kernel with your patch and without Manuel's. The currently
 > ongoing RAID reconstruction should therefore fail and I can tell whether
 > it was aborted properly this time.

 It looks like the RAID re-construction failed properly this time:

 raid1: Recon write failed!
 raid1: reconstruction failed.

 tron@colwyn:~#raidctl -s raid1
 Components:
            /dev/wd2e: failed
            /dev/wd3e: optimal
 No spares.
 /dev/wd2e status is: failed.  Skipping label.
 Component label for /dev/wd3e:
    Row: 0, Column: 1, Num Rows: 1, Num Columns: 2
    Version: 2, Serial Number: 2009011200, Mod Counter: 303
    Clean: No, Status: 0
    sectPerSU: 128, SUsPerPU: 1, SUsPerRU: 1
    Queue size: 100, blocksize: 512, numBlocks: 312581632
    RAID Level: 1
    Autoconfig: Yes
    Root partition: No
    Last configured as: raid1
 Parity status: clean
 Reconstruction is 100% complete.
 Parity Re-write is 100% complete.
 Copyback is 100% complete.

 I could reboot the system without problems afterwards. So this patch
 is a clear winner. :-)

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: Matthias Scheler <tron@zhadum.org.uk>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Failed RAIDframe parity rewrite prevents system shutdown
Date: Wed, 11 Feb 2009 22:53:00 +0000

 On Wed, Feb 11, 2009 at 01:09:30PM +0100, Manuel Bouyer wrote:
 > There was an inverted test condition in my patch; the attached one should work
 > (it does for me at least; with it, a write at the LBA48 address triggers
 > the workaround detection).

 My machine is rebuilding the RAID again right now, this time with this
 patch in the kernel.

 I'll let you know tomorrow whether the RAID rebuild succeeded.

 	Thanks a lot for your help

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

From: Greg Oster <oster@cs.usask.ca>
To: Matthias Scheler <tron@zhadum.org.uk>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Failed RAIDframe parity rewrite prevents system shutdown
Date: Wed, 11 Feb 2009 17:01:33 -0600

 Matthias Scheler writes:
 > On Wed, Feb 11, 2009 at 10:12:44PM +0000, Matthias Scheler wrote:
 > > > If you could, that would be great... If you can't, no worries -- I 
 > > > can nuke the LBA48 patch for the drives on my test box and attempt to 
 > > > test this myself...
 > > 
 > > I've booted a kernel with your patch and without Manuel's. The currently
 > > ongoing RAID reconstruction should therefore fail and I can tell whether
 > > it was aborted properly this time.
 > 
 > It looks like the RAID re-construction failed properly this time:
 > 
 > raid1: Recon write failed!
 > raid1: reconstruction failed.
 > 
 > tron@colwyn:~#raidctl -s raid1
 > Components:
 >            /dev/wd2e: failed
 >            /dev/wd3e: optimal
 > No spares.
 > /dev/wd2e status is: failed.  Skipping label.
 > Component label for /dev/wd3e:
 >    Row: 0, Column: 1, Num Rows: 1, Num Columns: 2
 >    Version: 2, Serial Number: 2009011200, Mod Counter: 303
 >    Clean: No, Status: 0
 >    sectPerSU: 128, SUsPerPU: 1, SUsPerRU: 1
 >    Queue size: 100, blocksize: 512, numBlocks: 312581632
 >    RAID Level: 1
 >    Autoconfig: Yes
 >    Root partition: No
 >    Last configured as: raid1
 > Parity status: clean
 > Reconstruction is 100% complete.
 > Parity Re-write is 100% complete.
 > Copyback is 100% complete.
 > 
 > I could reboot the system without problems afterwards. So this patch
 > is a clear winner. :-)

 Ahh.. excellent!! :) 

 Many thanks for the testing..  I'll get it checked in and request  
 pullups this evening...

 Later...

 Greg Oster


From: Greg Oster <oster@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/40569 CVS commit: src/sys/dev/raidframe
Date: Wed, 11 Feb 2009 23:54:11 +0000 (UTC)

 Module Name:	src
 Committed By:	oster
 Date:		Wed Feb 11 23:54:11 UTC 2009

 Modified Files:
 	src/sys/dev/raidframe: rf_reconstruct.c

 Log Message:
 If we see a RF_RECON_WRITE_ERROR event we know a write has finished and
 we need to account for that.  Failure to do so means we can end up
 waiting forever for writes we think are outstanding, but which have
 already completed.

 Addresses the RAIDframe part of PR#40569.  Thanks to Matthias Scheler
 for reporting the issue and verifying the fix.


 To generate a diff of this commit:
 cvs rdiff -r1.106 -r1.107 src/sys/dev/raidframe/rf_reconstruct.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Matthias Scheler <tron@zhadum.org.uk>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Failed RAIDframe parity rewrite prevents system shutdown
Date: Thu, 12 Feb 2009 07:44:34 +0000

 On Wed, Feb 11, 2009 at 10:53:00PM +0000, Matthias Scheler wrote:
 > My machine is rebuilding the RAID again right now, this time with this
 > patch in the kernel.
 > 
 > I'll let you know tomorrow whether the RAID rebuild succeeded.

 And we have another winner:

 raid1: initiating in-place reconstruction on column 0
 wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
 wd3: soft error (corrected)
 wd2e: LBA48 bug writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
 wd2: soft error (corrected)
 raid1: Reconstruction of disk at col 0 completed
 raid1: Recon time was 2670.559041 seconds, accumulated XOR time was 0 us (0.000000)
 raid1:  (start time 1234392687 sec 859972 usec, end time 1234395358 sec 419013 usec)
 raid1: Total head-sep stall count was 0
 raid1: 4773177 recon event waits, 4 recon delays
 raid1: 92642112 max exec ticks

 With your patch, ahcisata(4) handles the LBA48 error properly and
 the RAID rebuild works fine.
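
The block numbers in the log above can be checked with a little arithmetic. The sketch below assumes that "fsbn" values are relative to the partition and that "wd2 bn" is the absolute sector of the first block of the request; that reading is an inference from the log format, and `disk_range` and `to_absolute` are hypothetical names used only for this example.

```c
#include <stdint.h>

/* Last sector addressable with 28-bit LBA: 2^28 - 1 = 268435455. */
#define LBA28_LAST_SECTOR UINT64_C(0x0FFFFFFF)

/* Hypothetical pair of absolute sector numbers (inclusive). */
struct disk_range {
	uint64_t first;
	uint64_t last;
};

/*
 * Convert a partition-relative range to absolute sectors, given the
 * absolute sector that corresponds to the first relative block.
 */
static struct disk_range
to_absolute(uint64_t fsbn_first, uint64_t fsbn_last, uint64_t abs_first)
{
	uint64_t offset = abs_first - fsbn_first;	/* partition offset */
	struct disk_range r = { abs_first, fsbn_last + offset };

	return r;
}
```

For the logged request (fsbn 268435392-268435519, bn 268435455) this gives a partition offset of 63 and an absolute range of 268435455 to 268435582, so the 128-sector transfer starts on the last LBA28-addressable sector and extends past it.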

 	Kind regards

 -- 
 Matthias Scheler                                  http://zhadum.org.uk/

State-Changed-From-To: open->pending-pullups
State-Changed-By: tron@NetBSD.org
State-Changed-When: Thu, 12 Feb 2009 12:18:08 +0000
State-Changed-Why:
The two aspects of this problem have been fixed. Pullups into the
"netbsd-5" branch have been requested:

http://releng.netbsd.org/cgi-bin/req-5.cgi?show=454
http://releng.netbsd.org/cgi-bin/req-5.cgi?show=455


From: Soren Jacobsen <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/40569 CVS commit: [netbsd-5] src/sys/dev/raidframe
Date: Thu, 19 Feb 2009 20:27:08 +0000 (UTC)

 Module Name:	src
 Committed By:	snj
 Date:		Thu Feb 19 20:27:08 UTC 2009

 Modified Files:
 	src/sys/dev/raidframe [netbsd-5]: rf_reconstruct.c

 Log Message:
 Pull up following revision(s) (requested by oster in ticket #454):
 	sys/dev/raidframe/rf_reconstruct.c: revision 1.107
 If we see a RF_RECON_WRITE_ERROR event we know a write has finished and
 we need to account for that.  Failure to do so means we can end up
 waiting forever for writes we think are outstanding, but which have
 already completed.
 Addresses the RAIDframe part of PR#40569.  Thanks to Matthias Scheler
 for reporting the issue and verifying the fix.


 To generate a diff of this commit:
 cvs rdiff -r1.105.4.1 -r1.105.4.2 src/sys/dev/raidframe/rf_reconstruct.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: pending-pullups->closed
State-Changed-By: tron@NetBSD.org
State-Changed-When: Thu, 19 Feb 2009 22:34:50 +0000
State-Changed-Why:
All pullups have been made.


From: Soren Jacobsen <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/40569 CVS commit: [netbsd-4] src/sys/dev/raidframe
Date: Thu, 26 Feb 2009 08:07:06 +0000 (UTC)

 Module Name:	src
 Committed By:	snj
 Date:		Thu Feb 26 08:07:06 UTC 2009

 Modified Files:
 	src/sys/dev/raidframe [netbsd-4]: rf_reconstruct.c

 Log Message:
 Pull up following revision(s) (requested by oster in ticket #1276):
 	sys/dev/raidframe/rf_reconstruct.c: revision 1.107
 If we see a RF_RECON_WRITE_ERROR event we know a write has finished and
 we need to account for that.  Failure to do so means we can end up
 waiting forever for writes we think are outstanding, but which have
 already completed.
 Addresses the RAIDframe part of PR#40569.  Thanks to Matthias Scheler
 for reporting the issue and verifying the fix.


 To generate a diff of this commit:
 cvs rdiff -r1.95.2.4 -r1.95.2.5 src/sys/dev/raidframe/rf_reconstruct.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

>Unformatted:
