NetBSD Problem Report #40569
From tron@zhadum.org.uk Fri Feb 6 23:02:22 2009
Return-Path: <tron@zhadum.org.uk>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by narn.NetBSD.org (Postfix) with ESMTP id BEBA063C07F
for <gnats-bugs@gnats.NetBSD.org>; Fri, 6 Feb 2009 23:02:21 +0000 (UTC)
Message-Id: <200902062302.n16N2HQw001790@colwyn.zhadum.org.uk>
Date: Fri, 6 Feb 2009 23:02:17 GMT
From: tron@zhadum.org.uk
Reply-To: tron@zhadum.org.uk
To: gnats-bugs@gnats.NetBSD.org
Subject: Failed RAIDframe parity rewrite prevents system shutdown
X-Send-Pr-Version: 3.95
>Number: 40569
>Category: kern
>Synopsis: Failed RAIDframe parity rewrite prevents system shutdown
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Feb 06 23:05:00 +0000 2009
>Closed-Date: Thu Feb 19 22:34:50 +0000 2009
>Last-Modified: Thu Feb 26 08:10:02 +0000 2009
>Originator: Matthias Scheler
>Release: NetBSD 5.0_RC1 2009-02-03 sources
>Organization:
Matthias Scheler http://zhadum.org.uk/
>Environment:
System: NetBSD colwyn.zhadum.org.uk 5.0_RC1 NetBSD 5.0_RC1 (COLWYN.64) #0: Fri Feb 6 17:59:15 GMT 2009 tron@colwyn.zhadum.org.uk:/src/sys/compile/COLWYN.64 amd64
Architecture: x86_64
Machine: amd64
>Description:
One of the SATA disks in my server had a few write errors and was ejected
from a RAIDframe RAID 1 a few days ago. When I finally noticed this
morning I initiated a parity rewrite with "raidctl -R /dev/wd2e raid1".
The rebuild failed unfortunately:
raid1: initiating in-place reconstruction on column 0
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
[...]
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15)
wd2: (id not found)
raid1: Recon write failed!
raid1: reconstruction failed.
I retried the parity rewrite but it was rejected by "raidctl" because of
an invalid I/O control. The reconstruction was not tried again. When
I later tried to shut down the system (to check the cabling) the kernel
stopped while unmounting the file systems with this message:
unmounting file systems...raid1: Waiting for reconstruction to stop...
I had to cut the power at this point.
>How-To-Repeat:
Use "raidctl -R /dev/<x> raid<y>" and try to shut down the system afterwards.
>Fix:
None known.
>Release-Note:
>Audit-Trail:
From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Fri, 06 Feb 2009 18:47:48 -0600
tron@zhadum.org.uk writes:
> >Number: 40569
> >Category: kern
> >Synopsis: Faild RAIDframe parity rewrite prevents system shutdown
> >Confidential: no
> >Severity: serious
> >Priority: medium
> >Responsible: kern-bug-people
> >State: open
> >Class: sw-bug
> >Submitter-Id: net
> >Arrival-Date: Fri Feb 06 23:05:00 +0000 2009
> >Originator: Matthias Scheler
> >Release: NetBSD 5.0_RC1 2009-02-03 sources
> >Organization:
> Matthias Scheler http://zhadum.org.uk/
> >Environment:
> System: NetBSD colwyn.zhadum.org.uk 5.0_RC1 NetBSD 5.0_RC1 (COLWYN.64) #0: Fri Feb 6 17:59:15 GMT 2009 tron@colwyn.zhadum.org.uk:/src/sys/compile/COLWYN.64 amd64
> Architecture: x86_64
> Machine: amd64
> >Description:
> One of the SATA disks in my server had a few write errors and was ejected
> for a RAIDframe RAID 1 a few days ago. When I finally noticed this
> morning I initiated a parity rewrite with "raidctl -R /dev/wd2e raid1".
> The rebuild failed unfortunately:
>
> raid1: initiating in-place reconstruction on column 0
> wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
> [...]
> wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15)
> wd2: (id not found)
> raid1: Recon write failed!
> raid1: reconstruction failed.
>
> I retried the parity rewrite but it was rejected by "raidctl" because of
> an invalid I/O control.
Do you have a bit more info on exactly what you tried here and what
the error was? A parity rewrite shouldn't have bumped
reconInProgress.
> The reconstruction was not tried again. When
> I later tried to shutdown the system (to check the cabling) the kernel
> stopped while unmounting the file systems with this message:
>
> unmounting file systems...raid1: Waiting for reconstruction to stop...
>
> I had to remove the power hard at this point.
>
> >How-To-Repeat:
> Use "raidctl -R /dev/<x> raid<y>" and try to shutdown the system afterwards.
I suspect the reconstruction also needs to fail, and you may need to
attempt to do something else again.. but I'm not sure yet...
(I can't see how reconInProgress is non-zero in rf_driver.c unless
there really is a reconstruction going on... From what you describe
here there wasn't an active reconstruction going on, and so I have no
clue how it could get into that state... :( )
Later...
Greg Oster
From: Matthias Scheler <tron@zhadum.org.uk>
To: Greg Oster <oster@cs.usask.ca>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Sat, 7 Feb 2009 10:33:42 +0000
On Sat, Feb 07, 2009 at 12:50:03AM +0000, Greg Oster wrote:
> > I retried the parity rewrite but it was rejected by "raidctl" because of
> > an invalid I/O control.
>
> Do you have a bit more info on exactly what you tried here and what
> the error was?
Not really but I tried another rebuild after powercycling the system (to
check the cabling) and it failed again:
raid1: initiating in-place reconstruction on column 0
wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
wd3: soft error (corrected)
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
wd2: (id not found)
ahcisata0 port 2: device present, speed: 1.5Gb/s
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
wd2: (id not found)
ahcisata0 port 2: device present, speed: 1.5Gb/s
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
wd2: (id not found)
ahcisata0 port 2: device present, speed: 1.5Gb/s
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
wd2: (id not found)
ahcisata0 port 2: device present, speed: 1.5Gb/s
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
wd2: (id not found)
ahcisata0 port 2: device present, speed: 1.5Gb/s
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15)
wd2: (id not found)
raid1: Recon write failed!
raid1: reconstruction failed.
ahcisata0 port 2: device present, speed: 1.5Gb/s
raid1: Error re-writing parity!
If you tell me what kind of debugging you would like me to do I can try
to reproduce the problem by attempting another rebuild.
Kind regards
--
Matthias Scheler http://zhadum.org.uk/
From: Matthias Scheler <tron@zhadum.org.uk>
To: Greg Oster <oster@cs.usask.ca>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Sat, 7 Feb 2009 10:41:16 +0000
Another interesting bit of information:
The rebuild definitely failed ...
raid1: Error re-writing parity!
... but "raidctl -s raid1" still says it is in progress:
Components:
/dev/wd2e: reconstructing
/dev/wd3e: optimal
No spares.
/dev/wd2e status is: reconstructing. Skipping label.
Component label for /dev/wd3e:
Row: 0, Column: 1, Num Rows: 1, Num Columns: 2
Version: 2, Serial Number: 2009011200, Mod Counter: 177
Clean: No, Status: 0
sectPerSU: 128, SUsPerPU: 1, SUsPerRU: 1
Queue size: 100, blocksize: 512, numBlocks: 312581632
RAID Level: 1
Autoconfig: Yes
Root partition: No
Last configured as: raid1
Parity status: DIRTY
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
This looks to me like it hasn't aborted the rebuild completely.
I'm not sure whether it matters but I was monitoring the rebuild with
"raidctl -S raid1" until it failed.
Kind regards
--
Matthias Scheler http://zhadum.org.uk/
From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Sat, 07 Feb 2009 12:54:54 -0600
Matthias Scheler writes:
> On Sat, Feb 07, 2009 at 12:50:03AM +0000, Greg Oster wrote:
> > > I retried the parity rewrite but it was rejected by "raidctl" because of
> > > an invalid I/O control.
> >
> > Do you have a bit more info on exactly what you tried here and what
> > the error was?
>
> Not really but I tried another rebuild after powercycling the system (to
> check the cabling) and it failed again:
>
> aid1: initiating in-place reconstruction on column 0
> wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
> wd3: soft error (corrected)
> wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
[snip]
> wd2: (id not found)
> raid1: Recon write failed!
> raid1: reconstruction failed.
> ahcisata0 port 2: device present, speed: 1.5Gb/s
> raid1: Error re-writing parity!
I don't understand where this last line is coming from... Unless it
finished rebuilding parity for raid0, and it's just coincidence that
it finished at exactly this spot?
> If you tell me what kind of debugging you would like me to do I can try
> to reproduce the problem by attempting another rebuild.
Hmmmmm.... Around line 857 of src/sys/dev/raidframe/rf_reconstruct.c
there is a:
return (1);
You could try adding a printf() just before that line, and see if
that gets printed.... I bet it doesn't...
I *think* you're getting hung up in the:
if (!write_error) {
/* wait for writes to complete */
while (raidPtr->reconControl->pending_writes > 0) {
part of rf_ContinueReconstructFailedDisk().
It seems that you've had a (corrected) read error on wd3e.. but I'm
wondering if that's contributing to the problem here.. The issue, I
think, is that there are still pending writes, or that the code
thinks there are pending writes... I know this code was tested on a
disk that had real failing writes, but it's unlikely that they were
exactly the same as what you're seeing, and so there's room here for
bugs...
Oh... It just hit me:
wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
wd3: soft error (corrected)
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
wd2: (id not found)
What type of disks are these, and do they have the 'LBA48-quirk' entry
to change addressing modes or whatever for block 268435455? (just hunt
for that block number in Google for more info...) There are other
PRs (like 38376) which describe these same sort of symptoms...
Later...
Greg Oster
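The block number 268435455 in these logs is 2^28 - 1, the highest sector that 28-bit LBA can address. The log also reports partition-relative block 268435392 on wd2e as disk block 268435455, so the failing 128-sector transfer begins exactly on that last sector. The following is a minimal sketch of that arithmetic only; the macro and function names are made up for illustration and are not taken from wd.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Highest sector addressable with 28-bit LBA: 2^28 - 1 = 268435455. */
#define LBA28_MAX_SECTOR ((uint64_t)0x0FFFFFFF)

/*
 * Hypothetical helper: returns true if the transfer covering sectors
 * [start, start + count) reaches the last LBA28 sector or goes beyond it.
 */
bool
touches_lba28_limit(uint64_t start, uint32_t count)
{
	if (count == 0)
		return false;
	return start + count - 1 >= LBA28_MAX_SECTOR;
}
```

For the transfer in the log, touches_lba28_limit(268435455, 128) is true, while a 128-sector transfer starting at 268435327 ends on sector 268435454 and stays below the limit.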
From: Matthias Scheler <tron@zhadum.org.uk>
To: gnats-bugs@NetBSD.org
Cc: Greg Oster <oster@cs.usask.ca>
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Sat, 7 Feb 2009 19:15:29 +0000
On Sat, Feb 07, 2009 at 06:55:01PM +0000, Greg Oster wrote:
> Hmmmmm.... Around line 857 of src/sys/dev/raidframe/rf_reconstruct.c
> there is a:
>
> return (1);
>
> You could try adding a printf() just before that line, and see if
> that gets printed.... I bet it doesn't...
I'll make that change before I reboot the system the next time.
> wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
> wd3: soft error (corrected)
> wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
> wd2: (id not found)
>
> What type of disks are these, and do they have the the 'LBA48-quirk' entry
> to change addressing modes or whatever for block 268435455?
No, they apparently don't. But that's not the problem. The parity rewrite
worked in the past despite this bug. And if I attempt to rebuild the
parity now one of the disks, probably wd2, produces really alarming
"clonk" noises. I think it is just a case of a broken disk.
Kind regards
--
Matthias Scheler http://zhadum.org.uk/
From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Sat, 07 Feb 2009 13:23:33 -0600
Matthias Scheler writes:
> On Sat, Feb 07, 2009 at 06:55:01PM +0000, Greg Oster wrote:
> > Hmmmmm.... Around line 857 of src/sys/dev/raidframe/rf_reconstruct.c
> > there is a:
> >
> > return (1);
> >
> > You could try adding a printf() just before that line, and see if
> > that gets printed.... I bet it doesn't...
>
> I'll make that change before I reboot the system the next time.
K.. If I've caught you in time, please add a printf here:
/* wait for writes to complete */
in rf_reconstruct.c: rf_ContinueReconstructFailedDisk()
and in that next while() loop, do something like:
while (raidPtr->reconControl->pending_writes > 0) {
printf("pending writes: %d\n",raidPtr->reconControl->pending_writes);
to print the value of raidPtr->reconControl->pending_writes.
I'd basically like to know whether it thinks there have been write
errors at that point, and if not, then how many pending writes it's
waiting for...
> > wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
> > wd3: soft error (corrected)
> > wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
> > wd2: (id not found)
> >
> > What type of disks are these, and do they have the the 'LBA48-quirk' entry
> > to change addressing modes or whatever for block 268435455?
>
> No, they apparently don't. But that's not the problem. The parity rewrite
> worked in the past despite this bug.
Well... parity re-writes only read the disks and do writes if there
are errors... so if it didn't need to write to that block before, it
wouldn't have been detected...
> And if I attempt to rebuild the
> parity now one of the disks, probably wd2, produces really alarming
> "clonk" noises. I think it is just a case of a broken disk.
"clonk" noises from disks are never good :-}
Later...
Greg Oster
From: Matthias Scheler <tron@zhadum.org.uk>
To: gnats-bugs@NetBSD.org
Cc: Greg Oster <oster@cs.usask.ca>
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Mon, 9 Feb 2009 19:48:22 +0000
On Sat, Feb 07, 2009 at 07:25:02PM +0000, Greg Oster wrote:
> I'd basicall like to know whether it thinks there have been write
> errors at that point, and if not, then how many pending writes it's
> waiting for...
1.) When I rebooted to install the new kernel the system managed to reboot
despite the failed re-construction.
2.) This is what I got when I attempted a re-construction with the
new kernel:
RECON: initiating reconstruction on col 0 -> spare at col 2
wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
wd3: soft error (corrected)
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
wd2: (id not found)
ahcisata0 port 2: device present, speed: 1.5Gb/s
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
wd2: (id not found)
ahcisata0 port 2: device present, speed: 1.5Gb/s
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
wd2: (id not found)
ahcisata0 port 2: device present, speed: 1.5Gb/s
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
wd2: (id not found)
ahcisata0 port 2: device present, speed: 1.5Gb/s
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
wd2: (id not found)
ahcisata0 port 2: device present, speed: 1.5Gb/s
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15)
wd2: (id not found)
raid1: Recon write failed!
raid1: reconstruction failed.
pending writes: 1
ahcisata0 port 2: device present, speed: 1.5Gb/s
This time it actually seems to fail because of the LBA48 bug. I now
remember that the reconstruction was done in the opposite direction in
the past. I'll try to add the hard disk to the quirk table.
Kind regards
--
Matthias Scheler http://zhadum.org.uk/
From: Matthias Scheler <tron@zhadum.org.uk>
To: NetBSD GNATS <gnats-bugs@NetBSD.org>
Cc: Greg Oster <oster@cs.usask.ca>
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Mon, 9 Feb 2009 21:07:13 +0000
On Mon, Feb 09, 2009 at 07:50:03PM +0000, Matthias Scheler wrote:
> This time it actually seems to fail because of the LBA48 bug. I know
> remember that re-construct was done in the opposite direction in
> the past. I'll try to add the harddisk to the quick table.
1.) The system wedged again during shutdown:
ahcisata0 port 2: device present, speed: 1.5Gb/s
Feb 9 19:56:43 colwyn su: tron to root on /dev/ttyp1
Feb 9 19:57:56 colwyn shutdown: reboot by tron: Kernel bug fix
Feb 9 19:58:11 colwyn syslogd: Exiting on signal 15
syncing disks... 5 done
unmounting file systems...raid1: Waiting for reconstruction to stop...
2.) The kernel with the two hard disks in the quirk table has managed
to re-construct the RAID 1.
So it seems this is not a RAIDframe bug after all but rather a problem
with the drives (and eventually error handling in ahcisata(4)).
Kind regards
--
Matthias Scheler http://zhadum.org.uk/
From: Matthias Scheler <tron@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/40569 CVS commit: src/sys/dev/ata
Date: Mon, 9 Feb 2009 22:34:23 +0000 (UTC)
Module Name: src
Committed By: tron
Date: Mon Feb 9 22:34:23 UTC 2009
Modified Files:
src/sys/dev/ata: wd.c
Log Message:
Add two more entries to the quirk table for hard disks which need the
LBA 48 work around. The first entry will match the Seagate ST3160815AS
(and similar models), the second one HP's OEM version of the same drive.
This avoids the RAID rebuild problems described in PR kern/40569.
To generate a diff of this commit:
cvs rdiff -r1.368 -r1.369 src/sys/dev/ata/wd.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Mon, 09 Feb 2009 17:01:45 -0600
Matthias Scheler writes:
> The following reply was made to PR kern/40569; it has been noted by GNATS.
>
> From: Matthias Scheler <tron@zhadum.org.uk>
> To: NetBSD GNATS <gnats-bugs@NetBSD.org>
> Cc: Greg Oster <oster@cs.usask.ca>
> Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
> Date: Mon, 9 Feb 2009 21:07:13 +0000
>
> On Mon, Feb 09, 2009 at 07:50:03PM +0000, Matthias Scheler wrote:
> > This time it actually seems to fail because of the LBA48 bug. I know
> > remember that re-construct was done in the opposite direction in
> > the past. I'll try to add the harddisk to the quick table.
>
> 1.) The system wedge again during shutdown:
>
> ahcisata0 port 2: device present, speed: 1.5Gb/s
> Feb 9 19:56:43 colwyn su: tron to root on /dev/ttyp1
> Feb 9 19:57:56 colwyn shutdown: reboot by tron: Kernel bug fix
> Feb 9 19:58:11 colwyn syslogd: Exiting on signal 15
> syncing disks... 5 done
> unmounting file systems...raid1: Waiting for reconstruction to stop...
So it is very likely sleeping in the reconstruct code and waiting for
a write that is never going to happen...
AHHHH... I think I see the bug: there is at least one missing
num_writes++;
in rf_reconstruct:rf_ContinueReconstructFailedDisk(). (it might be
that two are missing.. I need to do more analysis...) Basically
writes with errors are still writes that need to be accounted for,
and that's not happening properly... I'll see about getting this
fixed for 5.0. (testing may prove to be a pain... I may have to
resurrect an old testing box so that I have some disks with
real write errors... and I'm not sure those will even be sufficient
to replicate this :-/ )
> 2.) The kernel with the two hard disks in the quick table has managed
> to re-construct the RAID 1.
>
> So it seems this is not a RAIDframe bug after all but rather a problem
> with the drives (and eventually error handling in ahcisata(4)).
There's a RAIDframe bug in there too...
Later...
Greg Oster
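The accounting problem described above, where a failed write is still a write that has to be counted, can be illustrated with a toy model. This is only a sketch of that class of bug under simplified assumptions. It is not the RAIDframe code: the struct and function names are hypothetical, and the single counter stands in for whatever bookkeeping rf_ContinueReconstructFailedDisk() really does.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a reconstruction control block. */
struct recon_ctl {
	int pending_writes;	/* writes issued but not yet accounted for */
	bool write_error;	/* set once any write has failed */
};

/* Record that one more write has been handed to the disk. */
void
issue_write(struct recon_ctl *rc)
{
	rc->pending_writes++;
}

/* Buggy variant: a failed write is flagged but never retired. */
void
complete_write_buggy(struct recon_ctl *rc, bool ok)
{
	if (!ok) {
		rc->write_error = true;
		return;		/* counter is left untouched */
	}
	rc->pending_writes--;
}

/* Intended behaviour: every completion retires its write, failed or not. */
void
complete_write_fixed(struct recon_ctl *rc, bool ok)
{
	if (!ok)
		rc->write_error = true;
	rc->pending_writes--;
}

/* A loop waiting for "pending_writes > 0" to clear can only end if this is true. */
bool
drained(const struct recon_ctl *rc)
{
	return rc->pending_writes == 0;
}
```

In this model, one issued write that fails leaves pending_writes at 1 forever with the buggy variant, which is the shape of the "pending writes: 1" output followed by the shutdown hang reported earlier in the thread.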
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org,
tron@zhadum.org.uk
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system
shutdown
Date: Tue, 10 Feb 2009 13:11:13 +0100
On Mon, Feb 09, 2009 at 07:50:03PM +0000, Matthias Scheler wrote:
> On Sat, Feb 07, 2009 at 07:25:02PM +0000, Greg Oster wrote:
> > I'd basicall like to know whether it thinks there have been write
> > errors at that point, and if not, then how many pending writes it's
> > waiting for...
>
> 1.) When I rebooted to install the new kernel the system managed to reboot
> despite the failed re-construction.
>
> 2.) This is what I got when I attempted a re-construction with the
> new kernel:
>
> RECON: initiating reconstruction on col 0 -> spare at col 2
> wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
> wd3: soft error (corrected)
> wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
> wd2: (id not found)
I don't understand why the problem is properly detected and worked around for
wd3, but not for wd2 ? Are wd2 and wd3 identical devices ?
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Matthias Scheler <tron@zhadum.org.uk>
To: gnats-bugs@NetBSD.org
Cc: Manuel Bouyer <bouyer@antioche.eu.org>
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Tue, 10 Feb 2009 12:16:29 +0000
On Tue, Feb 10, 2009 at 12:15:04PM +0000, Manuel Bouyer wrote:
> I don't understand why the problem is properly detected and worked
> around for wd3, but not for wd2 ? Are wd2 and wd3 identical devices?
Well, almost.
"wd2" is HP's OEM version of the same Seagate drive as "wd3". But the
HP drive definitely has a different firmware which e.g. doesn't
enable the write-cache by default.
Kind regards
--
Matthias Scheler http://zhadum.org.uk/
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Matthias Scheler <tron@zhadum.org.uk>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system
shutdown
Date: Tue, 10 Feb 2009 18:25:00 +0100
On Tue, Feb 10, 2009 at 12:16:29PM +0000, Matthias Scheler wrote:
> On Tue, Feb 10, 2009 at 12:15:04PM +0000, Manuel Bouyer wrote:
> > I don't understand why the problem is properly detected and worked
> > around for wd3, but not for wd2 ? Are wd2 and wd3 identical devices?
>
> Well, almost.
>
> "wd2" is HP's OEM version of the same Seagate drive as "wd3". But the
> HP drive definitely has a different firmware which e.g. doesn't
> enable the write-cache by default.
The LBA48 bug should have been properly detected; it is also reported as an
"id not found", and with the right address. Maybe it's a read vs. write issue?
Could you check whether a read on wd2 would trigger proper detection (obviously
you have to remove the quirk entry first)?
something like
dd if=/dev/rwd2e of=/dev/null bs=32k skip=4194302 count=5
should do it.
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
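To see which sectors the suggested dd command reads, the range can be worked out from its arguments. The sketch below assumes 512-byte sectors, as reported for these drives elsewhere in the thread, and uses hypothetical names; the result is relative to the start of the wd2e partition.

```c
#include <stdint.h>

/* Assumption: 512-byte sectors, matching the "512 bytes/sect" drive report. */
#define SECTOR_SIZE 512u

/* First and last sector (inclusive) covered by a dd invocation. */
struct sector_range {
	uint64_t first;
	uint64_t last;
};

/*
 * Hypothetical helper: bs_bytes is dd's bs= value in bytes, skip_blocks its
 * skip= value and count_blocks its count= value (must be at least 1).
 */
struct sector_range
dd_sector_range(uint64_t bs_bytes, uint64_t skip_blocks, uint64_t count_blocks)
{
	uint64_t sectors_per_block = bs_bytes / SECTOR_SIZE;
	struct sector_range r;

	r.first = skip_blocks * sectors_per_block;
	r.last = r.first + count_blocks * sectors_per_block - 1;
	return r;
}
```

With bs=32k (32768 bytes), skip=4194302 and count=5 this gives sectors 268435328 through 268435647, which covers the failing range 268435392-268435519 from the logs.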
From: Matthias Scheler <tron@zhadum.org.uk>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Tue, 10 Feb 2009 17:42:49 +0000
On Tue, Feb 10, 2009 at 06:25:00PM +0100, Manuel Bouyer wrote:
> Could you check if a read on wd2 would trigger proper detection (obviously
> you have to remove the quirk entry first) ?
I'm 99% sure it does because I could construct the RAID previously when
RAIDframe read from "wd2" and wrote to "wd3". The problems
started when it had to rewrite parity the other way around (with
"wd3" as the source).
> something like
> dd if=/dev/rwd2e of=/dev/null bs=32k skip=4194302 count=5
> should do it.
Is the above good enough? I would really like to avoid rebooting that
machine again as it is my server.
Kind regards
--
Matthias Scheler http://zhadum.org.uk/
From: Matthias Scheler <tron@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/40569 CVS commit: src/sys/dev/ata
Date: Tue, 10 Feb 2009 19:45:22 +0000 (UTC)
Module Name: src
Committed By: tron
Date: Tue Feb 10 19:45:22 UTC 2009
Modified Files:
src/sys/dev/ata: wd.c
Log Message:
Back out LBA 48 quirk entries which were added to fix one aspect of
PR kern/40569 because of objections by Manuel Bouyer.
To generate a diff of this commit:
cvs rdiff -r1.369 -r1.370 src/sys/dev/ata/wd.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Matthias Scheler <tron@zhadum.org.uk>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Tue, 10 Feb 2009 23:12:38 +0100
On Tue, Feb 10, 2009 at 05:42:49PM +0000, Matthias Scheler wrote:
> On Tue, Feb 10, 2009 at 06:25:00PM +0100, Manuel Bouyer wrote:
> > Could you check if a read on wd2 would trigger proper detection (obviously
> > you have to remove the quirk entry first) ?
>
> I'm 99% sure it does because I could construct the RAID previously when
> RAIDframe read from "wd2" and wrote to "wd3" previously. The problems
> started when it had to rewrite parity the other way around (with
> "wd3" as the source).
>
> > something like
> > dd if=/dev/rwd2e of=/dev/null bs=32k skip=4194302 count=5
> > should do it.
>
> Is the above good enough? I would really like to avoid rebooting that
> machine again as it is my server.
I've reproduced it on a system here and I think I've a fix.
Once again a DIAGNOSTIC kernel would have pointed right to the problem;
I don't think removing DIAGNOSTIC from GENERIC (at least on -current)
was a good move.
Just to confirm; you're using an AHCI controller, right ?
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Matthias Scheler <tron@zhadum.org.uk>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Tue, 10 Feb 2009 22:17:06 +0000
On Tue, Feb 10, 2009 at 11:12:38PM +0100, Manuel Bouyer wrote:
> I've reproduced it on a system here and I think I've a fix.
That's great news.
> Just to confirm; you're using a AHCI controller, right ?
Yes, I do:
atabus4 at ahcisata0 channel 2
atabus5 at ahcisata0 channel 3
[...]
ahcisata0 port 2: device present, speed: 1.5Gb/s
ahcisata0 port 3: device present, speed: 3.0Gb/s
[...]
wd2 at atabus4 drive 0: <FB160C4081>
wd2: quirks 2<FORCE_LBA48>
wd2: drive supports 16-sector PIO transfers, LBA48 addressing
wd2: 149 GB, 310101 cyl, 16 head, 63 sec, 512 bytes/sect x 312581808 sectors
wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd2(ahcisata0:2:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100) (using DMA)
wd3 at atabus5 drive 0: <ST3160815AS>
wd3: quirks 2<FORCE_LBA48>
wd3: drive supports 16-sector PIO transfers, LBA48 addressing
wd3: 149 GB, 310101 cyl, 16 head, 63 sec, 512 bytes/sect x 312581808 sectors
wd3: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd3(ahcisata0:3:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA)
Kind regards
--
Matthias Scheler http://zhadum.org.uk/
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Matthias Scheler <tron@zhadum.org.uk>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Tue, 10 Feb 2009 23:24:40 +0100
--ZPt4rx8FFjLCG7dd
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
On Tue, Feb 10, 2009 at 10:17:06PM +0000, Matthias Scheler wrote:
> On Tue, Feb 10, 2009 at 11:12:38PM +0100, Manuel Bouyer wrote:
> > I've reproduced it on a system here and I think I've a fix.
>
> That's great news.
>
> > Just to confirm; you're using a AHCI controller, right ?
>
> Yes, I do:
>
> atabus4 at ahcisata0 channel 2
> atabus5 at ahcisata0 channel 3
> [...]
> ahcisata0 port 2: device present, speed: 1.5Gb/s
> ahcisata0 port 3: device present, speed: 3.0Gb/s
OK, so it's probably an issue with the ahci controller: b_resid was set to
0 even in case of failure; and it's used in the LBA48 workaround detection
to see if we crossed the boundary ... I think the attached patch fixes it but
unfortunately my test box didn't reboot after a panic so I can't test before
tomorrow.
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
--ZPt4rx8FFjLCG7dd
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=diff
Index: ahcisata_core.c
===================================================================
RCS file: /cvsroot/src/sys/dev/ic/ahcisata_core.c,v
retrieving revision 1.18
diff -u -p -u -r1.18 ahcisata_core.c
--- ahcisata_core.c 3 Oct 2008 13:02:08 -0000 1.18
+++ ahcisata_core.c 10 Feb 2009 22:22:42 -0000
@@ -1065,7 +1065,7 @@ ahci_bio_complete(struct ata_channel *ch
ata_bio->error = TIMEOUT;
} else {
callout_stop(&chp->ch_callout);
- ata_bio->error = 0;
+ ata_bio->error = NOERROR;
}
chp->ch_queue->active_xfer = NULL;
@@ -1095,7 +1095,14 @@ ahci_bio_complete(struct ata_channel *ch
BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE);
AHCIDEBUG_PRINT(("ahci_bio_complete bcount %ld",
ata_bio->bcount), DEBUG_XFERS);
- ata_bio->bcount -= le32toh(achp->ahcic_cmdh[slot].cmdh_prdbc);
+ /*
+ * if it was a write, complete data buffer may have been transfered
+ * before error detection; in this case don't use cmdh_prdbc
+ * as it won't reflect what was written to media. Assume nothing
+ * was transfered and leave bcount as-is.
+ */
+ if ((ata_bio->flags & ATA_READ) || ata_bio->error != NOERROR)
+ ata_bio->bcount -= le32toh(achp->ahcic_cmdh[slot].cmdh_prdbc);
AHCIDEBUG_PRINT((" now %ld\n", ata_bio->bcount), DEBUG_XFERS);
(*chp->ch_drive[drive].drv_done)(chp->ch_drive[drive].drv_softc);
atastart(chp);
--ZPt4rx8FFjLCG7dd--
From: Matthias Scheler <tron@zhadum.org.uk>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Wed, 11 Feb 2009 07:26:17 +0000
On Tue, Feb 10, 2009 at 11:24:40PM +0100, Manuel Bouyer wrote:
> OK, so it's probably an issue with the ahci controller: b_resid was set to
> 0 even in case of failure; and it's used in the LBA48 workaround detection
> to see if we crossed the boundary ... I think the attached patch fixes it but
> unfortunably my test box didn't reboot after panic to I can't test before
> tomorow.
The fix doesn't work on my system:
raid1: initiating in-place reconstruction on column 0
wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
wd3: soft error (corrected)
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
wd2: (id not found)
ahcisata0 port 2: device present, speed: 1.5Gb/s
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
wd2: (id not found)
ahcisata0 port 2: device present, speed: 1.5Gb/s
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
wd2: (id not found)
ahcisata0 port 2: device present, speed: 1.5Gb/s
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
wd2: (id not found)
ahcisata0 port 2: device present, speed: 1.5Gb/s
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
wd2: (id not found)
ahcisata0 port 2: device present, speed: 1.5Gb/s
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15)
wd2: (id not found)
raid1: Recon write failed!
raid1: reconstruction failed.
ahcisata0 port 2: device present, speed: 1.5Gb/s
Kind regards
--
Matthias Scheler http://zhadum.org.uk/
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Matthias Scheler <tron@zhadum.org.uk>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system
shutdown
Date: Wed, 11 Feb 2009 13:09:30 +0100
--NzB8fVQJ5HfG6fxh
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
On Wed, Feb 11, 2009 at 07:26:17AM +0000, Matthias Scheler wrote:
> On Tue, Feb 10, 2009 at 11:24:40PM +0100, Manuel Bouyer wrote:
> > OK, so it's probably an issue with the ahci controller: b_resid was set to
> > 0 even in case of failure; and it's used in the LBA48 workaround detection
> > to see if we crossed the boundary ... I think the attached patch fixes it but
> > unfortunately my test box didn't reboot after the panic, so I can't test before
> > tomorrow.
>
> The fix doesn't work on my system:
>
> raid1: initiating in-place reconstruction on column 0
> wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
> wd3: soft error (corrected)
> wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
> wd2: (id not found)
There was an inverted test condition in my patch; the attached one should work
(it does for me at least; with it, a write at the LBA48 address triggers
the workaround detection).
--
Manuel Bouyer, LIP6, Universite Paris VI. Manuel.Bouyer@lip6.fr
NetBSD: 26 ans d'experience feront toujours la difference
--
--NzB8fVQJ5HfG6fxh
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=diff
Index: ahcisata_core.c
===================================================================
RCS file: /cvsroot/src/sys/dev/ic/ahcisata_core.c,v
retrieving revision 1.18
diff -u -p -u -r1.18 ahcisata_core.c
--- ahcisata_core.c 3 Oct 2008 13:02:08 -0000 1.18
+++ ahcisata_core.c 11 Feb 2009 12:07:01 -0000
@@ -1065,7 +1065,7 @@ ahci_bio_complete(struct ata_channel *ch
ata_bio->error = TIMEOUT;
} else {
callout_stop(&chp->ch_callout);
- ata_bio->error = 0;
+ ata_bio->error = NOERROR;
}
chp->ch_queue->active_xfer = NULL;
@@ -1095,7 +1095,14 @@ ahci_bio_complete(struct ata_channel *ch
BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE);
AHCIDEBUG_PRINT(("ahci_bio_complete bcount %ld",
ata_bio->bcount), DEBUG_XFERS);
- ata_bio->bcount -= le32toh(achp->ahcic_cmdh[slot].cmdh_prdbc);
+ /*
+ * if it was a write, complete data buffer may have been transfered
+ * before error detection; in this case don't use cmdh_prdbc
+ * as it won't reflect what was written to media. Assume nothing
+ * was transfered and leave bcount as-is.
+ */
+ if ((ata_bio->flags & ATA_READ) || ata_bio->error == NOERROR)
+ ata_bio->bcount -= le32toh(achp->ahcic_cmdh[slot].cmdh_prdbc);
AHCIDEBUG_PRINT((" now %ld\n", ata_bio->bcount), DEBUG_XFERS);
(*chp->ch_drive[drive].drv_done)(chp->ch_drive[drive].drv_softc);
atastart(chp);
--NzB8fVQJ5HfG6fxh--
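The effect of the corrected condition can be checked in isolation. The
following is a minimal user-space sketch, not the driver code: the struct,
the SK_* constants and the helper name sk_update_residual are hypothetical
stand-ins for ata_bio, ATA_READ, NOERROR and the cmdh_prdbc byte count. It
only illustrates that a failed write keeps its full residual count, while
reads and successful writes have the transferred bytes subtracted.

```c
/* Hypothetical stand-ins for the kernel definitions (assumptions). */
#define SK_READ    0x01   /* plays the role of ATA_READ */
#define SK_NOERROR 0      /* plays the role of NOERROR */

struct sk_bio {
	int  flags;   /* SK_READ for reads, 0 for writes */
	int  error;   /* SK_NOERROR or a nonzero error code */
	long bcount;  /* bytes still to transfer (the residual) */
};

/*
 * Subtract the controller's transferred-byte count unless this was a
 * failed write. For a failed write the count says nothing about what
 * reached the media, so the residual is left untouched.
 */
static long
sk_update_residual(struct sk_bio *bio, long prdbc)
{
	if ((bio->flags & SK_READ) || bio->error == SK_NOERROR)
		bio->bcount -= prdbc;
	return bio->bcount;
}
```

With the earlier, inverted test (error != NOERROR) a failed write would have
ended up with a residual of 0, which is the case the corrected patch avoids.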
From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Wed, 11 Feb 2009 10:16:24 -0600
Matthias Scheler writes:
> The following reply was made to PR kern/40569; it has been noted by GNATS.
>
> From: Matthias Scheler <tron@zhadum.org.uk>
> To: Manuel Bouyer <bouyer@antioche.eu.org>
> Cc: gnats-bugs@NetBSD.org
> Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
> Date: Wed, 11 Feb 2009 07:26:17 +0000
>
> On Tue, Feb 10, 2009 at 11:24:40PM +0100, Manuel Bouyer wrote:
> > OK, so it's probably an issue with the ahci controller: b_resid was set to
> > 0 even in case of failure; and it's used in the LBA48 workaround detection
> > to see if we crossed the boundary ... I think the attached patch fixes it but
> > unfortunately my test box didn't reboot after the panic, so I can't test before
> > tomorrow.
>
> The fix doesn't work on my system:
>
> raid1: initiating in-place reconstruction on column 0
> wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
> wd3: soft error (corrected)
[snip]
> raid1: Recon write failed!
> raid1: reconstruction failed.
> ahcisata0 port 2: device present, speed: 1.5Gb/s
>
Are you planning to do more testing? If so, I can get a patch for
the RAIDframe issue to you as well... (I think I have a patch that
will work, but I haven't validated it yet..)
Later...
Greg Oster
From: Matthias Scheler <tron@zhadum.org.uk>
To: gnats-bugs@NetBSD.org
Cc: Greg Oster <oster@cs.usask.ca>
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Wed, 11 Feb 2009 16:52:09 +0000
On Wed, Feb 11, 2009 at 04:20:03PM +0000, Greg Oster wrote:
> Are you planning to do more testing?
Not really, but I could do.
> If so, I can get a patch for
> the RAIDframe issue to you as well... (I think I have a patch that
> will work, but I haven't validated it yet..)
I would need that patch first because Manuel's patch is supposed to
avoid the RAID rebuild issue in the first place.
Kind regards
--
Matthias Scheler http://zhadum.org.uk/
From: Greg Oster <oster@cs.usask.ca>
To: Matthias Scheler <tron@zhadum.org.uk>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Wed, 11 Feb 2009 15:18:32 -0600
This is a multipart MIME message.
--==_Exmh_1234387000_241040
Content-Type: text/plain; charset=us-ascii
Matthias Scheler writes:
> On Wed, Feb 11, 2009 at 04:20:03PM +0000, Greg Oster wrote:
> > Are you planning to do more testing?
>
> Not really, but I could do.
If you could, that would be great... If you can't, no worries -- I
can nuke the LBA48 patch for the drives on my test box and attempt to
test this myself...
> > If so, I can get a patch for
> > the RAIDframe issue to you as well... (I think I have a patch that
> > will work, but I haven't validated it yet..)
>
> I would need that patch first because Manuel's patch is supposed to
> avoid the RAID rebuild issue in the first place.
See attached. (I think it's actually the last part of the patch that
will make the difference in your case, but the other two changes fix
issues too...)
Thanks!
Later...
Greg Oster
--==_Exmh_1234387000_241040
Content-Type: text/plain ; name="rf_reconstruct.c.diff"; charset=us-ascii
Content-Description: rf_reconstruct.c.diff
Index: rf_reconstruct.c
===================================================================
RCS file: /cvsroot/src/sys/dev/raidframe/rf_reconstruct.c,v
retrieving revision 1.106
diff -u -r1.106 rf_reconstruct.c
--- rf_reconstruct.c 20 Dec 2008 17:04:51 -0000 1.106
+++ rf_reconstruct.c 11 Feb 2009 19:33:43 -0000
@@ -676,8 +676,10 @@
done dealing with the reads that are
finished, we don't want to wait for any
writes */
- if (status == RF_RECON_WRITE_ERROR)
+ if (status == RF_RECON_WRITE_ERROR) {
write_error = 1;
+ num_writes++;
+ }
} else if (status == RF_RECON_READ_STOPPED) {
/* count this component as being "done" */
@@ -718,12 +720,13 @@
status = ProcessReconEvent(raidPtr, event);
if (status == RF_RECON_WRITE_ERROR) {
+ num_writes++;
recon_error = 1;
raidPtr->reconControl->error = 1;
/* an error was encountered at the very end... bail */
} else if (status == RF_RECON_WRITE_DONE) {
num_writes++;
- }
+ } /* else it's something else, and we don't care */
}
if (recon_error ||
(raidPtr->reconControl->lastPSID == lastPSID)) {
@@ -1054,6 +1057,12 @@
case RF_REVENT_WRITE_FAILED:
retcode = RF_RECON_WRITE_ERROR;
+ /* This is an error, but it was a pending write.
+ Account for it. */
+ RF_LOCK_MUTEX(raidPtr->reconControl->rb_mutex);
+ raidPtr->reconControl->pending_writes--;
+ RF_UNLOCK_MUTEX(raidPtr->reconControl->rb_mutex);
+
rbuf = (RF_ReconBuffer_t *) event->arg;
/* cleanup the disk queue data */
--==_Exmh_1234387000_241040--
From: Matthias Scheler <tron@zhadum.org.uk>
To: Greg Oster <oster@cs.usask.ca>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Wed, 11 Feb 2009 22:12:44 +0000
On Wed, Feb 11, 2009 at 03:18:32PM -0600, Greg Oster wrote:
> Matthias Scheler writes:
> > On Wed, Feb 11, 2009 at 04:20:03PM +0000, Greg Oster wrote:
> > > Are you planning to do more testing?
> >
> > Not really, but I could do.
>
> If you could, that would be great... If you can't, no worries -- I
> can nuke the LBA48 patch for the drives on my test box and attempt to
> test this myself...
I've booted a kernel with your patch and without Manuel's patch. The
currently ongoing RAID reconstruction should therefore fail, and I can tell
whether it was aborted properly this time.
Kind regards
--
Matthias Scheler http://zhadum.org.uk/
From: Matthias Scheler <tron@zhadum.org.uk>
To: Greg Oster <oster@cs.usask.ca>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Wed, 11 Feb 2009 22:51:11 +0000
On Wed, Feb 11, 2009 at 10:12:44PM +0000, Matthias Scheler wrote:
> > If you could, that would be great... If you can't, no worries -- I
> > can nuke the LBA48 patch for the drives on my test box and attempt to
> > test this myself...
>
> I've booted a kernel with your patch and without Manuel's patch. The
> currently ongoing RAID reconstruction should therefore fail, and I can tell
> whether it was aborted properly this time.
It looks like the RAID re-construction failed properly this time:
raid1: Recon write failed!
raid1: reconstruction failed.
tron@colwyn:~#raidctl -s raid1
Components:
/dev/wd2e: failed
/dev/wd3e: optimal
No spares.
/dev/wd2e status is: failed. Skipping label.
Component label for /dev/wd3e:
Row: 0, Column: 1, Num Rows: 1, Num Columns: 2
Version: 2, Serial Number: 2009011200, Mod Counter: 303
Clean: No, Status: 0
sectPerSU: 128, SUsPerPU: 1, SUsPerRU: 1
Queue size: 100, blocksize: 512, numBlocks: 312581632
RAID Level: 1
Autoconfig: Yes
Root partition: No
Last configured as: raid1
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
I could reboot the system without problems afterwards. So this patch
is a clear winner. :-)
Kind regards
--
Matthias Scheler http://zhadum.org.uk/
From: Matthias Scheler <tron@zhadum.org.uk>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Wed, 11 Feb 2009 22:53:00 +0000
On Wed, Feb 11, 2009 at 01:09:30PM +0100, Manuel Bouyer wrote:
> There was an inverted test condition in my patch; the attached one should work
> (it does for me at least; with it, a write at the LBA48 address triggers
> the workaround detection).
My machine is rebuilding the RAID again right now, this time with this
patch in the kernel.
I'll let you know tomorrow whether the RAID rebuild succeeded.
Thanks a lot for your help
--
Matthias Scheler http://zhadum.org.uk/
From: Greg Oster <oster@cs.usask.ca>
To: Matthias Scheler <tron@zhadum.org.uk>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Wed, 11 Feb 2009 17:01:33 -0600
Matthias Scheler writes:
> On Wed, Feb 11, 2009 at 10:12:44PM +0000, Matthias Scheler wrote:
> > > If you could, that would be great... If you can't, no worries -- I
> > > can nuke the LBA48 patch for the drives on my test box and attempt to
> > > test this myself...
> >
> > I've booted a kernel with your patch and without Manuel's patch. The
> > currently ongoing RAID reconstruction should therefore fail, and I can tell
> > whether it was aborted properly this time.
>
> It looks like the RAID re-construction failed properly this time:
>
> raid1: Recon write failed!
> raid1: reconstruction failed.
>
> tron@colwyn:~#raidctl -s raid1
> Components:
> /dev/wd2e: failed
> /dev/wd3e: optimal
> No spares.
> /dev/wd2e status is: failed. Skipping label.
> Component label for /dev/wd3e:
> Row: 0, Column: 1, Num Rows: 1, Num Columns: 2
> Version: 2, Serial Number: 2009011200, Mod Counter: 303
> Clean: No, Status: 0
> sectPerSU: 128, SUsPerPU: 1, SUsPerRU: 1
> Queue size: 100, blocksize: 512, numBlocks: 312581632
> RAID Level: 1
> Autoconfig: Yes
> Root partition: No
> Last configured as: raid1
> Parity status: clean
> Reconstruction is 100% complete.
> Parity Re-write is 100% complete.
> Copyback is 100% complete.
>
> I could reboot the system without problems afterwards. So this patch
> is a clear winner. :-)
Ahh.. excellent!! :)
Many thanks for the testing.. I'll get it checked in and request
pullups this evening...
Later...
Greg Oster
From: Greg Oster <oster@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/40569 CVS commit: src/sys/dev/raidframe
Date: Wed, 11 Feb 2009 23:54:11 +0000 (UTC)
Module Name: src
Committed By: oster
Date: Wed Feb 11 23:54:11 UTC 2009
Modified Files:
src/sys/dev/raidframe: rf_reconstruct.c
Log Message:
If we see a RF_RECON_WRITE_ERROR event we know a write has finished and
we need to account for that. Failure to do so means we can end up
waiting forever for writes we think are outstanding, but which have
already completed.
Addresses the RAIDframe part of PR#40569. Thanks to Matthias Scheler
for reporting the issue and verifying the fix.
To generate a diff of this commit:
cvs rdiff -r1.106 -r1.107 src/sys/dev/raidframe/rf_reconstruct.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
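The accounting problem described in the log message can be sketched as
follows. This is a simplified illustration, not the RAIDframe code: the
struct, the enum and the function names are hypothetical, and the two
counters the patch touches in different places (pending_writes and
num_writes) are updated in one helper here. The point is that a failed
write is still a finished write, so it must be counted like a successful
one or the waiter never sees the queue drain.

```c
#include <stdbool.h>

/* Hypothetical event outcomes, loosely modelled on the RF_RECON_* codes. */
enum sk_status { SK_WRITE_DONE, SK_WRITE_ERROR, SK_OTHER };

struct sk_recon {
	int  pending_writes;  /* writes issued but not yet completed */
	int  num_writes;      /* completions seen by the waiting loop */
	bool write_error;     /* set once any write has failed */
};

/* Account for one completion event. */
static void
sk_process_event(struct sk_recon *rc, enum sk_status st)
{
	if (st == SK_WRITE_ERROR) {
		rc->write_error = true;
		/* The write failed, but it is no longer outstanding. */
		rc->pending_writes--;
		rc->num_writes++;
	} else if (st == SK_WRITE_DONE) {
		rc->pending_writes--;
		rc->num_writes++;
	} /* else it's something else, and we don't care */
}

/* True once no writes are outstanding and the waiter may stop. */
static bool
sk_drained(const struct sk_recon *rc)
{
	return rc->pending_writes == 0;
}
```

If the SK_WRITE_ERROR branch skipped the two counter updates, sk_drained()
would stay false after a failed write, which matches the "waiting forever"
symptom described above.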
From: Matthias Scheler <tron@zhadum.org.uk>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
Date: Thu, 12 Feb 2009 07:44:34 +0000
On Wed, Feb 11, 2009 at 10:53:00PM +0000, Matthias Scheler wrote:
> My machine is rebuilding the RAID again right now, this time with this
> patched in the kernel.
>
> I'll let you know tomorrow whether the RAID rebuild succeeded.
And we have another winner:
raid1: initiating in-place reconstruction on column 0
wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
wd3: soft error (corrected)
wd2e: LBA48 bug writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
wd2: soft error (corrected)
raid1: Reconstruction of disk at col 0 completed
raid1: Recon time was 2670.559041 seconds, accumulated XOR time was 0 us (0.000000)
raid1: (start time 1234392687 sec 859972 usec, end time 1234395358 sec 419013 usec)
raid1: Total head-sep stall count was 0
raid1: 4773177 recon event waits, 4 recon delays
raid1: 92642112 max exec ticks
With your patch, ahcisata(4) handles the LBA48 error properly and
the RAID rebuild works fine.
Kind regards
--
Matthias Scheler http://zhadum.org.uk/
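The block numbers in the log line up with the 28-bit LBA limit: a 28-bit
address can name sectors 0 through 268435455 (2^28 - 1), and the failing
transfer starts at disk block 268435455 and covers 128 sectors. The
difference between the disk block number and the first fsbn,
268435455 - 268435392 = 63, is presumably the partition offset. The
following is a minimal sketch of that boundary check; the macro and
function names are hypothetical and are not taken from the wd(4) or
ahcisata(4) sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Number of sectors addressable with a 28-bit LBA: 2^28 = 268435456. */
#define SK_LBA28_SECTORS (UINT64_C(1) << 28)

/*
 * True if the transfer [start, start + count) touches any sector
 * beyond the last one a 28-bit LBA command can address.
 */
static bool
sk_needs_lba48(uint64_t start, uint64_t count)
{
	return start + count > SK_LBA28_SECTORS;
}
```

For the transfer in the log, sk_needs_lba48(268435455, 128) is true, while a
128-sector transfer that ends exactly at sector 268435455 would still fit
within 28 bits.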
State-Changed-From-To: open->pending-pullups
State-Changed-By: tron@NetBSD.org
State-Changed-When: Thu, 12 Feb 2009 12:18:08 +0000
State-Changed-Why:
The two aspects of this problem have been fixed. Pullups into the
"netbsd-5" branch have been requested:
http://releng.netbsd.org/cgi-bin/req-5.cgi?show=454
http://releng.netbsd.org/cgi-bin/req-5.cgi?show=455
From: Soren Jacobsen <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/40569 CVS commit: [netbsd-5] src/sys/dev/raidframe
Date: Thu, 19 Feb 2009 20:27:08 +0000 (UTC)
Module Name: src
Committed By: snj
Date: Thu Feb 19 20:27:08 UTC 2009
Modified Files:
src/sys/dev/raidframe [netbsd-5]: rf_reconstruct.c
Log Message:
Pull up following revision(s) (requested by oster in ticket #454):
sys/dev/raidframe/rf_reconstruct.c: revision 1.107
If we see a RF_RECON_WRITE_ERROR event we know a write has finished and
we need to account for that. Failure to do so means we can end up
waiting forever for writes we think are outstanding, but which have
already completed.
Addresses the RAIDframe part of PR#40569. Thanks to Matthias Scheler
for reporting the issue and verifying the fix.
To generate a diff of this commit:
cvs rdiff -r1.105.4.1 -r1.105.4.2 src/sys/dev/raidframe/rf_reconstruct.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: pending-pullups->closed
State-Changed-By: tron@NetBSD.org
State-Changed-When: Thu, 19 Feb 2009 22:34:50 +0000
State-Changed-Why:
All pullups have been made.
From: Soren Jacobsen <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/40569 CVS commit: [netbsd-4] src/sys/dev/raidframe
Date: Thu, 26 Feb 2009 08:07:06 +0000 (UTC)
Module Name: src
Committed By: snj
Date: Thu Feb 26 08:07:06 UTC 2009
Modified Files:
src/sys/dev/raidframe [netbsd-4]: rf_reconstruct.c
Log Message:
Pull up following revision(s) (requested by oster in ticket #1276):
sys/dev/raidframe/rf_reconstruct.c: revision 1.107
If we see a RF_RECON_WRITE_ERROR event we know a write has finished and
we need to account for that. Failure to do so means we can end up
waiting forever for writes we think are outstanding, but which have
already completed.
Addresses the RAIDframe part of PR#40569. Thanks to Matthias Scheler
for reporting the issue and verifying the fix.
To generate a diff of this commit:
cvs rdiff -r1.95.2.4 -r1.95.2.5 src/sys/dev/raidframe/rf_reconstruct.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
>Unformatted:
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.