NetBSD Problem Report #30674
From tron@colwyn.zhadum.de Wed Jul 6 09:51:10 2005
Return-Path: <tron@colwyn.zhadum.de>
Received: from colwyn.zhadum.de (colwyn.zhadum.de [81.187.181.114])
by narn.netbsd.org (Postfix) with ESMTP id D4CDD63B104
for <gnats-bugs@gnats.NetBSD.org>; Wed, 6 Jul 2005 09:51:09 +0000 (UTC)
Message-Id: <200507060951.j669p6rj013594@lyssa.zhadum.de>
Date: Wed, 6 Jul 2005 10:51:07 +0100 (BST)
From: Matthias Scheler <tron@colwyn.zhadum.de>
Reply-To: tron@colwyn.zhadum.de
To: gnats-bugs@netbsd.org
Subject: RAIDframe should be able to create volumes without parity rewrite
X-Send-Pr-Version: 3.95
>Number: 30674
>Category: kern
>Synopsis: RAIDframe should be able to create volumes without parity rewrite
>Confidential: no
>Severity: non-critical
>Priority: medium
>Responsible: oster
>State: open
>Class: change-request
>Submitter-Id: net
>Arrival-Date: Wed Jul 06 09:52:00 +0000 2005
>Last-Modified: Tue Jul 12 16:15:39 +0000 2005
>Originator: Matthias Scheler
>Release: NetBSD 3.99.7
>Organization:
Matthias Scheler http://scheler.de/~matthias/
>Environment:
System: NetBSD lyssa.zhadum.de 3.99.7 NetBSD 3.99.7 (LYSSA) #0: Mon Jul 4 10:16:28 BST 2005 tron@lyssa.zhadum.de:/src/sys/compile/LYSSA i386
Architecture: i386
Machine: i386
>Description:
Setting up a RAIDframe volume requires an initial parity rewrite which
can take a long time. This is completely pointless because the volume
doesn't contain any data yet. The Solaris Volume Manager, for example,
allows you to create a mirror with all disks attached without a resync,
which is perfectly fine if the next thing you do is run "newfs" on it.
>How-To-Repeat:
Create a RAID 1 volume with RAIDframe which requires using
"raidctl -i raidXZY" to rewrite parity.
>Fix:
None provided.
>Release-Note:
>Audit-Trail:
From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@netbsd.org, Matthias Scheler <tron@colwyn.zhadum.de>
Cc:
Subject: Re: kern/30674: RAIDframe should be able to create volumes without parity rewrite
Date: Wed, 06 Jul 2005 09:19:19 -0600
Matthias Scheler writes:
> >Number: 30674
> >Category: kern
> >Synopsis: RAIDframe should be able to create volumes without parity rewrite
> >Confidential: no
> >Severity: non-critical
> >Priority: medium
> >Responsible: kern-bug-people
> >State: open
> >Class: change-request
> >Submitter-Id: net
> >Arrival-Date: Wed Jul 06 09:52:00 +0000 2005
> >Originator: Matthias Scheler
> >Release: NetBSD 3.99.7
> >Organization:
> Matthias Scheler http://scheler.de/~matthias/
> >Environment:
> System: NetBSD lyssa.zhadum.de 3.99.7 NetBSD 3.99.7 (LYSSA) #0: Mon Jul 4 10:16:28 BST 2005 tron@lyssa.zhadum.de:/src/sys/compile/LYSSA i386
> Architecture: i386
> Machine: i386
> >Description:
> Setting up a RAIDframe volume requires an initial parity rewrite which
> can take a long time. This is completely pointless because the volume
> doesn't contain any data yet.
Let's address the RAID 1 case first:
If you're just going to build a FFS on it, then one can get away with
marking the parity as "good" because data will never be read until
after it has been written. Fine. If the machine crashes or
otherwise goes down without marking the parity as "good", then you are
back to square one -- you *HAVE* to do the parity rebuild at that
point, since you have no guarantee that there were no writes in
progress, or that, for a given sector, the primary and the mirror
are in sync. So the only thing you've saved is the initial rebuild
(and there's nothing saying you can't do that initial rebuild in the
background sometime after you're using the partition).
There is, however, also a violation of the Principle of Least Astonishment.
If, for example, the components had random data on them before the
RAID 1 set was created, and one runs "dd if=/dev/rraid0d | md5" twice
with the parity marked as "good" (but not actually synced!) then the two
runs might well yield different results. One certainly does not expect a
"disk device" to return different data on subsequent reads! (RAIDframe
will pick either the master or the mirror to read from -- in cases where
data is already written, this won't be a problem. In cases where data
has not been written to that sector, but we are still claiming that
the parity is good, it will violate the PoLA.)
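The RAID 1 read behaviour described above can be illustrated with a small
toy model. This is only a sketch: the class name, the random choice of
component, and the sector values are all assumptions made for illustration
and are not taken from RAIDframe's actual read policy or code.

```python
import random


class UnsyncedMirror:
    """Toy RAID 1 set whose parity is marked good without a rewrite."""

    def __init__(self, master, mirror, seed=0):
        # The two components start out with unrelated leftover contents.
        self.master = list(master)
        self.mirror = list(mirror)
        self.rng = random.Random(seed)

    def write(self, sector, value):
        # A write goes to both components, which brings that sector in sync.
        self.master[sector] = value
        self.mirror[sector] = value

    def read(self, sector):
        # Hypothetical read policy: serve the read from either component.
        component = self.rng.choice((self.master, self.mirror))
        return component[sector]


raid = UnsyncedMirror(master=[0xAA] * 4, mirror=[0x55] * 4)
raid.write(1, 0x07)

# Sector 1 was written, so every read returns the same value.
written_reads = {raid.read(1) for _ in range(100)}

# Sector 0 was never written, so a read may return either component's data.
unwritten_reads = {raid.read(0) for _ in range(100)}
```

In this model a filesystem that only reads what it has written never sees
the difference, while a raw read of the whole device can.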
Let's now look at the RAID 5 case: Consider a stripe made up of
component blocks A, B, C, D, and E. Let A be the block being updated,
and E be the parity for the stripe. Let E not be the XOR of A+B+C+D,
which will be the case if the parity rewrite is not done.
To do a write of A, the old contents of A will be read, the current
contents of E will be read, a new E will be computed, and the new A
and new E will be written. In the event that A fails, there is now
no way of reconstructing the contents of A, since B, C, and D were
never in sync with E, and thus are useless in recomputing A. For a
RAID 5 case, one *MUST* rebuild the parity before live data is put on
the RAID set, as otherwise there will be no way of reconstructing
data in the event of a component failure.
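The RAID 5 argument above can be checked with a few lines of arithmetic.
The following is a minimal sketch with single-byte "blocks"; the function
names and the block values are made up for illustration and do not
correspond to anything in RAIDframe itself.

```python
from functools import reduce
from operator import xor


def rmw_write(stripe, parity, index, new_value):
    """Small write: new parity = old parity XOR old data XOR new data."""
    new_parity = parity ^ stripe[index] ^ new_value
    new_stripe = list(stripe)
    new_stripe[index] = new_value
    return new_stripe, new_parity


def reconstruct(stripe, parity, lost):
    """Rebuild the lost block from the surviving blocks and the parity."""
    survivors = [b for i, b in enumerate(stripe) if i != lost]
    return reduce(xor, survivors, parity)


data = [0x11, 0x22, 0x33, 0x44]   # blocks A, B, C, D
good_parity = reduce(xor, data)   # E after a parity rewrite
stale_parity = 0x5A               # E when no parity rewrite was done

new_a = 0x99
synced_stripe, synced_parity = rmw_write(data, good_parity, 0, new_a)
stale_stripe, stale_parity_after = rmw_write(data, stale_parity, 0, new_a)

# With correct initial parity, a failed A can be rebuilt from B, C, D and E.
recovered_ok = reconstruct(synced_stripe, synced_parity, 0) == new_a

# With stale initial parity, the rebuilt value is not the data written to A.
recovered_stale = reconstruct(stale_stripe, stale_parity_after, 0) == new_a
```

The read-modify-write update only preserves whatever relationship already
held between E and A+B+C+D, so an inconsistent stripe stays inconsistent
after the write.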
I've heard the argument a couple of times, but I don't see it buying
anything other than removing one parity rebuild...
Further comments? As you can guess, I'm not seeing any real advantage to
creating volumes without parity rewrites, even for RAID 1 sets.
Later...
Greg Oster
From: Matthias Scheler <tron@zhadum.de>
To: Greg Oster <oster@cs.usask.ca>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/30674: RAIDframe should be able to create volumes without parity rewrite
Date: Wed, 6 Jul 2005 16:47:30 +0100
On Wed, Jul 06, 2005 at 09:19:19AM -0600, Greg Oster wrote:
> Let's address the RAID 1 case first:
> If you're just going to build a FFS on it, then one can get away with
> marking the parity as "good" because data will never be read until
> after it has been written. Fine.
Exactly.
> If the machine crashes or otherwise goes down without marking the
> parity as "good", then you are back to square one -- you *HAVE* to
> do the parity rebuild at that point,
That is actually another disadvantage of RAIDframe. SVM doesn't manage
"parity good" with a single bit. It uses a database which tracks it
on a per "SVM meta cluster" basis. The result is that Solaris only needs
to sync a few MB after a crash and not the whole volume.
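The idea of tracking "parity good" per region rather than per volume can
be sketched as follows. This is a hypothetical illustration only: the
class, its method names, and the region size are assumptions, and it does
not describe how SVM actually stores or updates its state database.

```python
class DirtyRegionLog:
    """Toy per-region dirty tracking for a mirrored volume."""

    def __init__(self, volume_size, region_size):
        self.region_size = region_size
        # Round up so that a partial last region is still covered.
        self.region_count = -(-volume_size // region_size)
        self.dirty = set()

    def before_write(self, offset, length):
        # Mark every region touched by the write as dirty before it starts.
        first = offset // self.region_size
        last = (offset + length - 1) // self.region_size
        self.dirty.update(range(first, last + 1))

    def after_sync(self, region):
        # Called once both components are known to agree on this region.
        self.dirty.discard(region)

    def bytes_to_resync(self):
        # After a crash only the dirty regions need to be copied.
        return len(self.dirty) * self.region_size


MIB = 2 ** 20
log = DirtyRegionLog(volume_size=100 * 1024 * MIB, region_size=MIB)
log.before_write(offset=5 * MIB + 10, length=MIB)  # spans regions 5 and 6
resync_bytes = log.bytes_to_resync()
```

With a single volume-wide flag the same crash would require resyncing all
100 GiB in this example, whereas the log limits it to the two regions
that had writes in flight.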
> There is, however, also a violation of the Principle of Least Astonishment.
I'm not asking for this to be turned on by default. The Solaris manual
page doesn't even recommend it. But it is nice to have that option if
you know what you are doing.
> If, for example, the components had random data on them before the
> RAID 1 set was created, and one runs "dd if=/dev/rraid0d | md5" twice
> with the parity marked as "good" (but not actually synced!) then the two
> runs might well yield different results.
That is a very artificial case. The *really* interesting information is
the checksum of the filesystem data on the RAID volume. And that will
always match even if the mirror was created without an initial
parity rewrite.
> Let's now look at the RAID 5 case:
I already guessed that it is different for RAID 5. So we can just leave
that case out of the discussion.
> I've heard the argument a couple of times, but I don't see it buying
> anything other than removing one parity rebuild...
Which might save you hours of waiting and/or slow system performance.
Kind regards
--
Matthias Scheler http://scheler.de/~matthias/
From: Greg Oster <oster@cs.usask.ca>
To: Matthias Scheler <tron@zhadum.de>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/30674: RAIDframe should be able to create volumes without parity rewrite
Date: Wed, 06 Jul 2005 10:02:35 -0600
Matthias Scheler writes:
> On Wed, Jul 06, 2005 at 09:19:19AM -0600, Greg Oster wrote:
> > Let's address the RAID 1 case first:
> > If you're just going to build a FFS on it, then one can get away with
> > marking the parity as "good" because data will never be read until
> > after it has been written. Fine.
>
> Exactly.
>
> > If the machine crashes or otherwise goes down without marking the
> > parity as "good", then you are back to square one -- you *HAVE* to
> > do the parity rebuild at that point,
>
> That is actually another disadvantage of RAIDframe. SVM doesn't manage
> "parity good" with a single bit. It uses a database which tracks it
> on a per "SVM meta cluster" basis. The result is that Solaris only needs
> to sync a few MB after a crash and not the whole volume.
Right. Something like this is on my TODO list, but I've not gotten
to it.
> > There is, however, also a violation of the Principle of Least Astonishment.
>
> I'm not asking for this to be turned on by default. The Solaris manual
> page doesn't even recommend it. But it is nice to have that option if
> you know what you are doing.
>
> > If, for example, the components had random data on them before the
> > RAID 1 set was created, and one runs "dd if=/dev/rraid0d | md5" twice
> > with the parity marked as "good" (but not actually synced!) then the two
> > runs might well yield different results.
>
> That is a very artificial case.
Somewhat artificial, yes :)
> The *really* interesting information is
> the checksum of the filesystem data on the RAID volume. And that will
> always match even if the mirror was created without an initial
> parity rewrite.
Yes.
> > Let's now look at the RAID 5 case:
>
> I already guessed that it is different for RAID 5. So we can just leave
> that case out of the discussion.
>
> > I've heard the argument a couple of times, but I don't see it buying
> > anything other than removing one parity rebuild...
>
> Which might save you hours of waiting and/or slow system performance.
But only the first time... after the first crash, you're back to waiting
again... And the longer the system is up (or the more rebuilds that are
done) the less expensive that one rebuild really is in the life of the
system...
Later...
Greg Oster
Responsible-Changed-From-To: kern-bug-people->oster
Responsible-Changed-By: oster@netbsd.org
Responsible-Changed-When: Tue, 12 Jul 2005 16:15:39 +0000
Responsible-Changed-Why:
I look after fixing RAIDframe lossage.
>Unformatted: