NetBSD Problem Report #30674

From tron@colwyn.zhadum.de  Wed Jul  6 09:51:10 2005
Return-Path: <tron@colwyn.zhadum.de>
Received: from colwyn.zhadum.de (colwyn.zhadum.de [81.187.181.114])
	by narn.netbsd.org (Postfix) with ESMTP id D4CDD63B104
	for <gnats-bugs@gnats.NetBSD.org>; Wed,  6 Jul 2005 09:51:09 +0000 (UTC)
Message-Id: <200507060951.j669p6rj013594@lyssa.zhadum.de>
Date: Wed, 6 Jul 2005 10:51:07 +0100 (BST)
From: Matthias Scheler <tron@colwyn.zhadum.de>
Reply-To: tron@colwyn.zhadum.de
To: gnats-bugs@netbsd.org
Subject: RAIDframe should be able to create volumes without parity rewrite
X-Send-Pr-Version: 3.95

>Number:         30674
>Category:       kern
>Synopsis:       RAIDframe should be able to create volumes without parity rewrite
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    oster
>State:          open
>Class:          change-request
>Submitter-Id:   net
>Arrival-Date:   Wed Jul 06 09:52:00 +0000 2005
>Last-Modified:  Tue Jul 12 16:15:39 +0000 2005
>Originator:     Matthias Scheler
>Release:        NetBSD 3.99.7
>Organization:
Matthias Scheler                                  http://scheler.de/~matthias/
>Environment:
System: NetBSD lyssa.zhadum.de 3.99.7 NetBSD 3.99.7 (LYSSA) #0: Mon Jul 4 10:16:28 BST 2005 tron@lyssa.zhadum.de:/src/sys/compile/LYSSA i386
Architecture: i386
Machine: i386
>Description:
Setting up a RAIDframe volume requires an initial parity rewrite, which
can take a long time. This is completely pointless because the volume
doesn't contain any data yet. The Solaris Volume Manager, for example,
allows you to create a mirror with all disks attached without a resync,
which is perfectly fine if the next thing you do is run "newfs" on it.

>How-To-Repeat:
Create a RAID 1 volume with RAIDframe, which requires running
"raidctl -i raidXYZ" to rewrite the parity.

>Fix:
None provided.

>Release-Note:

>Audit-Trail:
From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@netbsd.org, Matthias Scheler <tron@colwyn.zhadum.de>
Cc: 
Subject: Re: kern/30674: RAIDframe should be able to create volumes without parity rewrite 
Date: Wed, 06 Jul 2005 09:19:19 -0600

 Matthias Scheler writes:
 > >Number:         30674
 > >Category:       kern
 > >Synopsis:       RAIDframe should be able to create volumes without parity rewrite
 > >Confidential:   no
 > >Severity:       non-critical
 > >Priority:       medium
 > >Responsible:    kern-bug-people
 > >State:          open
 > >Class:          change-request
 > >Submitter-Id:   net
 > >Arrival-Date:   Wed Jul 06 09:52:00 +0000 2005
 > >Originator:     Matthias Scheler
 > >Release:        NetBSD 3.99.7
 > >Organization:
 > Matthias Scheler                                  http://scheler.de/~matthias/
 > >Environment:
 > System: NetBSD lyssa.zhadum.de 3.99.7 NetBSD 3.99.7 (LYSSA) #0: Mon Jul 4 10:16:28 BST 2005 tron@lyssa.zhadum.de:/src/sys/compile/LYSSA i386
 > Architecture: i386
 > Machine: i386
 > >Description:
 > Setting up a RAIDframe volume requires an initial parity rewrite which
 > can take a long time. This is completely pointless because the volume
 > doesn't contain any data yet.

 Let's address the RAID 1 case first:
 If you're just going to build a FFS on it, then one can get away with 
 marking the parity as "good" because data will never be read until 
 after it has been written.  Fine.  If the machine crashes or 
 otherwise goes down without marking the parity as "good", then you are
 back to square one -- you *HAVE* to do the parity rebuild at that 
 point, since you have no guarantee that there were no writes in 
 progress, or that, for a given sector, the primary and the mirror 
 are in sync.  So the only thing you've saved is the initial rebuild 
 (and there's nothing saying you can't do that initial rebuild in the 
 background sometime after you've started using the partition).

 There is, however, also a violation of the Principle of Least Astonishment.
 If, for example, the components had random data on them before the 
 RAID 1 set was created, and one runs "dd if=/dev/rraid0d | md5" twice 
 with the parity marked as "good" (but not actually synced!), then the 
 two runs might well yield different results.  One certainly does not expect a 
 "disk device" to return different data on subsequent reads!  (RAIDframe 
 will pick either the master or the mirror to read from -- in cases where 
 data is already written, this won't be a problem.  In cases where data
 has not been written to that sector, but we are still claiming that 
 the parity is good, it will violate the PoLA.)
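
 To make the effect concrete, here is a purely illustrative C sketch
 (not RAIDframe's actual read path): reads simply alternate between the
 two components, so a sector that was never written and never synced
 can come back with different contents on consecutive reads, while a
 written sector always reads back consistently.

 #include <stdio.h>

 #define NSECT 4

 /* Two mirror components that still hold unrelated leftover data. */
 static unsigned char master[NSECT] = { 0xaa, 0xbb, 0xcc, 0xdd };
 static unsigned char mirror[NSECT] = { 0x01, 0x02, 0x03, 0x04 };
 static unsigned int io_count;

 static void
 raid1_write(int sector, unsigned char v)
 {
         master[sector] = v;     /* a mirror write goes to both components */
         mirror[sector] = v;
 }

 static unsigned char
 raid1_read(int sector)
 {
         /* pick one component to service the read, e.g. round-robin */
         return (io_count++ & 1) ? mirror[sector] : master[sector];
 }

 int
 main(void)
 {
         raid1_write(0, 0x42);   /* freshly written data: consistent */

         printf("sector 0: 0x%02x 0x%02x\n", raid1_read(0), raid1_read(0));
         printf("sector 1: 0x%02x 0x%02x\n", raid1_read(1), raid1_read(1));
         return 0;
 }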

 Let's now look at the RAID 5 case: Consider a stripe made up of 
 component blocks A, B, C, D, and E.  Let A be the block being updated, 
 and E be the parity for the stripe.  Let E not be the XOR of A+B+C+D, 
 which will be the case if the parity rewrite is not done.
 To do a write of A, the old contents of A will be read, the current 
 contents of E will be read, a new E will be computed, and the new A 
 and new E will be written.  In the event that A fails, there is now 
 no way of reconstructing the contents of A, since B, C, and D were 
 never in sync with E, and thus are useless in recomputing A.  For a 
 RAID 5 case, one *MUST* rebuild the parity before live data is put on 
 the RAID set, as otherwise there will be no way of reconstructing 
 data in the event of a component failure.
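
 The arithmetic can be made concrete with a tiny standalone C sketch
 (illustrative only, not RAIDframe code), shrinking each block to a
 single byte and starting with stale parity:

 #include <assert.h>
 #include <stdio.h>

 int
 main(void)
 {
         unsigned char A = 0x11, B = 0x22, C = 0x33, D = 0x44;
         unsigned char E = 0x99;         /* stale: not equal to A^B^C^D */

         /* Read-modify-write of A, as described above:
          * new E = old E ^ old A ^ new A */
         unsigned char newA = 0x55;
         E = E ^ A ^ newA;
         A = newA;

         /* The component holding A fails; rebuild A from B, C, D and E. */
         unsigned char rebuiltA = B ^ C ^ D ^ E;

         printf("A = 0x%02x, rebuilt A = 0x%02x\n", A, rebuiltA);
         assert(rebuiltA != A);  /* lost: E never was the stripe's XOR */

         /* Had E been initialised to A^B^C^D (i.e. had the parity been
          * rewritten first), rebuiltA would equal A. */
         return 0;
 }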

 I've heard the argument a couple of times, but I don't see it buying 
 anything other than removing one parity rebuild...

 Further comments?  As you can guess, I'm not seeing any real advantage to 
 creating volumes without parity rewrites, even for RAID 1 sets.

 Later...

 Greg Oster


From: Matthias Scheler <tron@zhadum.de>
To: Greg Oster <oster@cs.usask.ca>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/30674: RAIDframe should be able to create volumes without parity rewrite
Date: Wed, 6 Jul 2005 16:47:30 +0100

 On Wed, Jul 06, 2005 at 09:19:19AM -0600, Greg Oster wrote:
 > Let's address the RAID 1 case first:
 > If you're just going to build a FFS on it, then one can get away with 
 > marking the parity as "good" because data will never be read until 
 > after it has been written.  Fine.

 Exactly.

 > If the machine crashes or otherwise goes down without marking the
 > parity as "good", then you are back to square one -- you *HAVE* to
 > do the parity rebuild at that point,

 That is actually another disadvantage of RAIDframe.  SVM doesn't track 
 "parity good" with a single bit; it uses a database which tracks it 
 on a per-"SVM meta cluster" basis.  The result is that Solaris only 
 needs to resync a few MB after a crash, not the whole volume.
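
 (The general idea is a write-intent log over fixed-size regions; the
 C sketch below only illustrates the concept and is not SVM's or
 RAIDframe's actual data structure.)

 #include <stdint.h>
 #include <stdio.h>

 #define SECTOR_SIZE     512
 #define REGION_SIZE     (64 * 1024)     /* example region size: 64 KB */
 #define NREGIONS        1024            /* example: a small 64 MB volume */

 static uint8_t dirty[NREGIONS / 8];     /* kept in the metadata database */

 /* Mark the region containing 'sector' dirty before writing to it; the
  * bitmap update reaches stable storage before the data write does.
  * (A matching region_mark_clean() would clear the bit lazily once the
  * region has no writes in flight.) */
 static void
 region_mark_dirty(uint64_t sector)
 {
         uint64_t r = sector * SECTOR_SIZE / REGION_SIZE;
         dirty[r / 8] |= (uint8_t)(1u << (r % 8));
 }

 int
 main(void)
 {
         /* Two writes were in flight when the machine went down. */
         region_mark_dirty(0);
         region_mark_dirty(100000);

         /* Crash recovery: resync only the regions still marked dirty. */
         unsigned int to_resync = 0;
         for (unsigned int r = 0; r < NREGIONS; r++)
                 if (dirty[r / 8] & (1u << (r % 8)))
                         to_resync++;
         printf("regions to resync: %u of %u\n", to_resync, NREGIONS);
         return 0;
 }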

 > There is, however, also a violation of the Principle of Least Astonishment.

 I'm not asking for this to be turned on by default.  Solaris' manual 
 page doesn't even recommend it.  But it is nice to have the option if 
 you know what you are doing.

 > If, for example, the components had random data on them before the 
 > RAID 1 set was created, and one runs "dd if=/dev/rraid0d | md5" twice 
 > with the parity marked as "good" (but not actually synced!), then the
 > two runs might well yield different results.

 That is a very artificial case. The *really* interesting information is
 the checksum of the filesystem data on the RAID volume. And that will
 always match even if the mirror was created without an initial
 parity rewrite.

 > Let's now look at the RAID 5 case:

 I already guessed that it is different for RAID 5. So we can just leave
 that case out of the discussion.

 > I've heard the argument a couple of times, but I don't see it buying 
 > anything other than removing one parity rebuild...

 Which might save you hours of waiting and/or degraded system performance.

 	Kind regards

 -- 
 Matthias Scheler                                  http://scheler.de/~matthias/

From: Greg Oster <oster@cs.usask.ca>
To: Matthias Scheler <tron@zhadum.de>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/30674: RAIDframe should be able to create volumes without parity rewrite 
Date: Wed, 06 Jul 2005 10:02:35 -0600

 Matthias Scheler writes:
 > On Wed, Jul 06, 2005 at 09:19:19AM -0600, Greg Oster wrote:
 > > Let's address the RAID 1 case first:
 > > If you're just going to build a FFS on it, then one can get away with 
 > > marking the parity as "good" because data will never be read until 
 > > after it has been written.  Fine.
 > 
 > Exactly.
 > 
 > > If the machine crashes or otherwise goes down without marking the
 > > parity as "good", then you are back to square one -- you *HAVE* to
 > > do the parity rebuild at that point,
 > 
 > That is actually another disadvantage of RAIDframe.  SVM doesn't track 
 > "parity good" with a single bit; it uses a database which tracks it 
 > on a per-"SVM meta cluster" basis.  The result is that Solaris only 
 > needs to resync a few MB after a crash, not the whole volume.

 Right.  Something like this is on my TODO list, but I've not gotten 
 to it.. 

 > > There is, however, also a violation of the Principle of Least Astonishment.
 > 
 > I'm not asking for this to be turned on by default.  Solaris' manual 
 > page doesn't even recommend it.  But it is nice to have the option if 
 > you know what you are doing.
 > 
 > > If, for example, the components had random data on them before the 
 > > RAID 1 set was created, and one runs "dd if=/dev/rraid0d | md5" twice 
 > > with the parity marked as "good" (but not actually synced!), then the
 > > two runs might well yield different results.
 > 
 > That is a very artificial case.

 Somewhat artificial, yes :)

 > The *really* interesting information is
 > the checksum of the filesystem data on the RAID volume. And that will
 > always match even if the mirror was created without an initial
 > parity rewrite.

 Yes.

 > > Let's now look at the RAID 5 case:
 > 
 > I already guessed that it is different for RAID 5. So we can just leave
 > that case out of the discussion.
 > 
 > > I've heard the argument a couple of times, but I don't see it buying 
 > > anything other than removing one parity rebuild...
 > 
 > Which might save you hours of waiting and/or degraded system performance.

 But only the first time... after the first crash, you're back to waiting
 again... And the longer the system is up (or the more rebuilds that are 
 done) the less expensive that one rebuild really is in the life of the 
 system...

 Later...

 Greg Oster


Responsible-Changed-From-To: kern-bug-people->oster
Responsible-Changed-By: oster@netbsd.org
Responsible-Changed-When: Tue, 12 Jul 2005 16:15:39 +0000
Responsible-Changed-Why:
I look after fixing RAIDframe lossage.


>Unformatted:
