Redundant NFS storage setup (part 3): The disappointing PERC 5/E (solved?)

Harald_Jensas at Dell.com Harald_Jensas at Dell.com
Fri Jan 4 10:14:05 CST 2008


> -----Original Message-----
> From: linux-poweredge-bounces at dell.com [mailto:linux-poweredge-
> bounces at dell.com] On Behalf Of Harald_Jensas at dell.com
> Sent: 04 January 2008 12:38
> To:
> thias at spam.spam.spam.spam.spam.spam.spam.egg.and.spam.freshrpms.net;
> linux-poweredge-Lists
> Subject: RE: Redundant NFS storage setup (part 3)
> :ThedisappointingPERC5/E(solved?)
> 
> > -----Original Message-----
> > From: linux-poweredge-bounces at dell.com [mailto:linux-poweredge-
> > bounces at dell.com] On Behalf Of Matthias Saou
> > Sent: 04 January 2008 11:42
> > To: linux-poweredge-Lists
> > Subject: Re: Redundant NFS storage setup (part 3) :
> > ThedisappointingPERC5/E(solved?)
> >
> > Harald_Jensas at dell.com wrote :
> >
> > > Kevin, Did you ever try to create your Hardware RAID striped
> > > partitions aligned?
> > > This will cause less stripe crossings and thus less parity to
> > > calculate for RAID 5 writes. It should also improve read
> > > performance.
> > >
> > > In this article they report up to 30% performance increase, in
> > > their tests, with properly aligned partitions.
> > > http://insights.oetiker.ch/linux/raidoptimization.html
> > >
> > > Assuming block size is 512 bytes.
> > > If Stripe Size is (64KB/512 byte = 128 Blocks) align the partition
> > > to block 128.
> > > If Stripe Size is (128KB/512 byte = 256 Blocks) align the
> > > partition to block 256.
> > >
> > > 1. Enter fdisk /dev/sd<x> where <x> is the device suffix.
> > > 2. Determine if any partitions already exist.
> > > 3. Type n to create a new partition.
> > > 4. Type p to create a primary partition.
> > > 5. Type 1 to create partition No. 1.
> > > 6. Select the defaults to use the complete disk.
> > > 7. Type t to set the partition's system ID.
> > > 8. Type in the code for the partition type you want.
> > > 9. Type x to go into expert mode.
> > > 10. Type b to adjust the starting block number.
> > > 11. Type 1 to choose partition 1.
> > > 12. Type 128 to set it to 128 (the array's stripe element size).
> > > 13. Type w to write label and partition information to disk.
> >
> > This is quite interesting, but I'm a little confused as to how to
> > achieve this when using a gpt partition table.
> >
> > Here's what parted shows me :
> >
> > Model: DELL PERC 5/E Adapter (scsi)
> > Disk /dev/sdb: 13.0TB
> > Sector size (logical/physical): 512B/512B
> > Partition Table: gpt
> >
> > Number  Start   End     Size    File system  Name  Flags
> >  1      17.4kB  13.0TB  13.0TB  xfs          MD1
> >
> > The partition was created with "mkpart MD1 0 100%", and I don't see
> > how to access any "expert" features with parted. Here's what fdisk
> > shows me (FWIW, since it doesn't support gpt) :
> >
> > Disk /dev/sdb: 12995.4 GB, 12995497295872 bytes
> > 255 heads, 63 sectors/track, 1579945 cylinders
> > Units = cylinders of 16065 * 512 = 8225280 bytes
> >
> >    Device Boot      Start         End      Blocks   Id  System
> > /dev/sdb1               1      267350  2147483647+  ee  EFI GPT
> >
> > As you can see, I have the "63 sectors/track" value which the
> > article considers sub-optimal. Is there any way to change this?
> >
> > Another thing I'll try is to use LVM on the block device directly,
> > basically replacing the partition, and see how that performs.
> >
> > Matthias
> >
> > --
> 
> It is not the 63 sectors/track value that is sub-optimal; it is the
> fact that the partition is normally aligned based on the 63
> sectors/track value. Back in the day the ## sectors/track value
> actually told you something about the physical layout of the disk, so
> it made sense to align based on it. Nowadays it is all hidden from
> you. It is better to think of the disk as a sequence of blocks, and in
> a striped RAID those blocks are striped over several drives. Thus we
> want to align a striped RAID partition so that a single I/O operation
> is not spread across multiple stripes.
> 
> 
> AFAIK GPT partitions do not have this problem.
> 
> 

I have had a look at the GPT partition layout on the disk, and I believe
it also has alignment problems.

I will try to explain my reasoning (I hope the drawing stays intact in
your mail clients):

LBA          0  1  2  3         34
GPT          |--|--|--|----------|-----------------Partition 1---------
PhysDisks    |--------------Stripe 1-----------------|-----Stripe 2----
Filesystem                       |----I/O-------|----I/O-------|---I/O-

I got the information about the GPT layout from the Wikipedia entry:
http://en.wikipedia.org/wiki/GUID_Partition_Table

Each LBA is 512 bytes in size.
LBA 0 = Protective MBR
LBA 1 = Primary GPT Header
LBA 2 - 33 = Partition entries
LBA 34 = This is where the first partition will be created.

LBA 34 is (34 * 512 bytes) = 17408 bytes into the disk. This is quite
close to what parted reports as the start of the xfs partition Matthias
created.

Now if the RAID stripe size is 64KB (65536 bytes), the remaining space
on the first stripe is (65536 - 17408) = 48128 bytes.
If the filesystem element size is 32KB (32768 bytes), the first
filesystem block will fit on the first stripe, but the next 32KB
filesystem block will be split between stripe 1 and stripe 2, and so on.
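
To make the arithmetic above easier to re-check, here is a minimal Python
sketch. It is only an illustration under simplifying assumptions: 512-byte
LBAs, and a filesystem that issues back-to-back I/O of a fixed size starting
at the first byte of the partition. The function name crossing_blocks is
made up for this example.

```python
SECTOR = 512  # bytes per LBA, as assumed in this mail


def crossing_blocks(start_lba, stripe_bytes, io_bytes, count):
    """Return the indices of the first `count` I/O blocks that span two stripes."""
    start = start_lba * SECTOR  # byte offset of the partition start
    crossing = []
    for i in range(count):
        first = start + i * io_bytes   # first byte of block i
        last = first + io_bytes - 1    # last byte of block i
        # A block crosses a boundary if its first and last byte
        # fall into different stripes.
        if first // stripe_bytes != last // stripe_bytes:
            crossing.append(i)
    return crossing


# Partition starting at LBA 34, 64KB stripe, 32KB I/O:
print(crossing_blocks(34, 64 * 1024, 32 * 1024, 8))   # [1, 3, 5, 7]
# Partition starting at LBA 128 (one full stripe in):
print(crossing_blocks(128, 64 * 1024, 32 * 1024, 8))  # []
```

With a start at LBA 34 every second 32KB block crosses a stripe boundary;
with a start at LBA 128 none of them do.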

I believe you can get the partition aligned by specifying the start and
end of the partition in megabytes when you create the partition.

mkpart part-type [fs-type] start end


So for a striped RAID volume with Stripe Size set to 64K in the
controller, do the following to create a partition approximately 10 TB
in size:

mkpart part-type [fs-type] 0.0625 10485760

The result should be something like this:
PhysDisks    |--------------Stripe 1-----------------|-----Stripe 2----
Filesystem                                           |----I/O-------|--
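
The start value of 0.0625 comes from expressing one stripe in megabytes.
A small Python sketch of that conversion follows, for anyone who wants to
plug in a different stripe size. It assumes 512-byte sectors and that a
megabyte means 2^20 bytes; if the partitioning tool reads "MB" as 10^6
bytes, 0.0625 would land at 62500 bytes rather than 65536, so giving the
start in sectors may be safer where the tool allows it. The function name
aligned_start is hypothetical.

```python
def aligned_start(stripe_kb, sector_bytes=512):
    """Return the first stripe boundary as a sector count and in MiB."""
    stripe_bytes = stripe_kb * 1024
    return {
        "sectors": stripe_bytes // sector_bytes,  # start LBA to aim for
        "mib": stripe_bytes / (1024 * 1024),      # same offset in 2^20-byte units
    }


for kb in (64, 128):
    print(kb, aligned_start(kb))
# 64 {'sectors': 128, 'mib': 0.0625}
# 128 {'sectors': 256, 'mib': 0.125}
```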


The only way I can think of to verify alignment is to use a sector
analyzer and check that the start LBA in the partition entry is
divisible by 128 for a 64KB stripe size array...
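
Once the start LBA is known, the divisibility check itself is easy to
script. The Python sketch below assumes 512-byte sectors and takes the
start LBA as a plain number; how you obtain it (sector analyzer, or
possibly a file such as /sys/block/sdb/sdb1/start on Linux) is left open
and is an assumption here, not something tested on a PERC 5/E. The
function name is_aligned is made up for this example.

```python
def is_aligned(start_lba, stripe_kb, sector_bytes=512):
    """True if the partition's start LBA falls exactly on a stripe boundary."""
    sectors_per_stripe = (stripe_kb * 1024) // sector_bytes
    return start_lba % sectors_per_stripe == 0


print(is_aligned(34, 64))    # False: default GPT start, 64KB stripe
print(is_aligned(128, 64))   # True:  start moved to one full stripe
print(is_aligned(128, 128))  # False: 128KB stripe needs a multiple of 256
```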


Please do the math over again; I might have made a mistake. And remember
to use values that fit your setup: the stripe size configured in the HW
RAID controller, etc.
 


--
Harald


