performance bottleneck in Linux MD RAID-1

John LLOYD jal at mdacorporation.com
Thu Jul 15 14:12:26 CDT 2010


> Subject: Re: performance bottleneck in Linux MD RAID-1
> To: "Paul M. Dyer" <pmdyer at ctgcentral2.com>
> Cc: linux-poweredge <linux-poweredge at dell.com>
> Message-ID: <1279209138.30702.5.camel at tokyo.bbky.org>
> Content-Type: text/plain; charset="UTF-8"
> 
> Thanks for the suggestions. Yes, we are aware of those other
> parameters, but we now know the bottleneck is in the MD RAID-1 layer.
> This is RHEL 5.5 w/ the latest updated kernel (don't have the version
> with me right now).
> 
> We've tried all schedulers, a variety of read ahead buffers, etc. The
> only thing that has allowed us to break the 200MB/s seq. write limit is
> when we get rid of the MD RAID-1 layer.
> 
> Even if we don't use the file system (XFS in this case), if we build
> the MD RAID-1 with a missing half and then add the 2nd half to allow
> it to re-sync, the fastest the re-sync will go (with all else pretty
> much idle) is about 200MB/s. So, this is the MD RAID-1 layer doing its
> own block copying with no LVM2 or XFS or anything else involved.
> 
> -Bond
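
One thing worth checking on that re-sync test specifically: MD throttles reconstruction with its own sysctls, and the stock ceiling happens to be 200000 KB/sec, suspiciously close to your 200MB/s. A quick sketch, assuming the array is /dev/md0 as in your setup:

# current re-sync throttle, in KB/s (defaults are min 1000, max 200000)
cat /proc/sys/dev/raid/speed_limit_min
cat /proc/sys/dev/raid/speed_limit_max

# raise the ceiling for the duration of the re-sync, then watch it
echo 400000 > /proc/sys/dev/raid/speed_limit_max
cat /proc/mdstat

If the re-sync still flat-lines at 200MB/s with the limit raised, at least the throttle is ruled out and it really is the RAID-1 write path.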


We use the MD layer on JBODs behind a PERC 6 and routinely get 350 Mbytes/sec write and 650 Mbytes/sec read (bonnie++ test).  Hardware is an R900; the OS was SLES10 SP2.

(The "JBODs" are individual disks, 450GB x 15krpm, set up as RAID "0" vdisks, although each vdisk has only one pdisk.)
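
If you want to compare numbers directly, a bonnie++ run along these lines would do it (mount point and size are only examples; make -s at least twice RAM so the page cache doesn't flatter the result):

# sequential write/rewrite/read throughput; -n 0 skips the small-file tests
bonnie++ -d /mnt/test -s 64g -n 0 -u root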

You are getting more throughput without MD, so I have to ask: why bother with MD?  If it is the bottleneck, do without it.  I have to say, though, that your expectations

>We were expecting with 7x effective
>spindles on the RAID-10, to get about ~350MBytes/sec sustained writes
>for sequential access.

are not very realistic.  You have SATA disks, but you want speed.

While individual spindles can do 50 to 70 Mbytes/sec (SATA-2, Seagate disks), once you put them behind several layers of Linux software and a PERC card with only-Dell-knows-what CPU speed, firmware version and memory bandwidth, plus numerous buffering and command decode/schedule/execute/reply loops, you might be lucky to get 50% of the drives' raw throughput out of the assembled system.  Seven effective spindles at 50-70 Mbytes/sec is 350-490 Mbytes/sec on paper; half of that is 175-245 Mbytes/sec, which is roughly what you are seeing.
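
If you want to see where the throughput actually goes, it is worth measuring each layer the same way as you peel them off.  A rough sketch with dd (device names are just the ones from your description; oflag=direct bypasses the page cache, and writing to these devices will destroy whatever is on them):

# one PERC vdisk (a single MD1000 RAID-10) on its own
dd if=/dev/zero of=/dev/sdc bs=1M count=16384 oflag=direct

# the MD RAID-1 built on top of both vdisks
dd if=/dev/zero of=/dev/md0 bs=1M count=16384 oflag=direct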

The only configuration you have not mentioned is MD or LVM across all physical spindles -- rather than layering LVM upon MD upon PERC RAID-10 for a, uh, RAID-110 configuration, use just MD (or LVM) and skip the PERC's implementation of RAID anything.
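
Concretely, that would mean exporting every drive from the PERC as a single-disk vdisk and letting MD do all of the striping and mirroring itself.  A sketch only, with device names purely as an example for one MD1000's worth of disks:

# RAID-10 across 14 single-disk vdisks (two copies of every chunk)
mdadm --create /dev/md0 --level=10 --raid-devices=14 /dev/sd[c-p]
cat /proc/mdstat

That takes the PERC's RAID logic out of the striping path entirely, so at least the comparison against plain MD is apples to apples.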

You don't mention why RAID-1 upon RAID-10.  Is this intended for high data availability / system reliability?  


--John

> 
> On Thu, 2010-07-15 at 10:34 -0500, Paul M. Dyer wrote:
> > Hi,
> >
> > Which IO elevator are you using? Are you using RHEL4 or RHEL5?
> >
> > In RHEL5, you could try the deadline or noop elevator to see if that
> > works better. Implement it using this example for sda, changing it
> > for your particular device:
> >
> > cat /sys/block/sda/queue/scheduler
> >
> > echo "deadline" > /sys/block/sda/queue/scheduler
> >
> > or use noop:
> > echo "noop" > /sys/block/sda/queue/scheduler
> >
> > Here is a link from RHEL4 days about the schedulers.
> > http://www.redhat.com/magazine/008jun05/features/schedulers/
> >
> > Paul
> >
> >
> > ----- Original Message -----
> > From: "Bond Masuda" <bond.masuda at jlbond.com>
> > To: "linux-poweredge" <linux-poweredge at dell.com>
> > Sent: Wednesday, July 14, 2010 10:32:57 PM
> > Subject: performance bottleneck in Linux MD RAID-1
> >
> > Hi Everyone,
> >
> > I'm wondering if some of the gurus around here might be able to help
> > me out. We have a PE2970 with two PERC 6/E; each PERC 6/E is
> > connected via a single SAS cable to an MD1000 with 15x 1TB Hitachi
> > SATA 7.2K drives. We have each MD1000 set up in RAID-10 with 14
> > drives and 1 hot spare. Within Linux, we mirror the two MD1000s with
> > Linux MD RAID-1 as /dev/md0. On top of /dev/md0, we have LVM2 and
> > then XFS on the LV. The reason for the LVM2 is to take snapshots (we
> > reserve about 10% of space in the VG for it).
> >
> > We're seeing a performance bottleneck of about 200MBytes/sec
> > sequential writes when testing with iozone. We were expecting, with
> > 7x effective spindles on the RAID-10, to get about ~350MBytes/sec
> > sustained writes for sequential access.
> >
> > After trying out several combinations of things, we found that if we
> > remove the Linux MD software RAID layer and put just LVM2 on top of
> > /dev/sdc (the vdisk as presented by the PERC 6/E RAID-10), we get
> > about 340MBytes/sec sequential writes. If we put XFS directly on top
> > of /dev/sdc1, we get about the same 340MBytes/sec. So, we can get our
> > anticipated performance of about 350MB/s only when we don't use the
> > MD RAID-1.
> >
> > Since both MD1000s are connected via separate PERC 6/E cards, we
> > didn't think the MD RAID-1 would cause a >40% performance loss...
> >
> > We even tried to degrade the MD RAID-1 to see if writing to only one
> > of the mirrors would improve performance. It did NOT... still
> > 200MB/s. It almost seems like the Linux MD layer has a performance
> > cap at around 200MB/s.
> >
> > Has anyone encountered this, and do you have suggestions to remove
> > this bottleneck? Any advice would be appreciated.
> >
> > Thanks,
> > -Bond
> >


