performance bottleneck in Linux MD RAID-1

Andrew Sharp at
Thu Jul 15 12:55:43 CDT 2010

On Thu, 15 Jul 2010 11:00:03 -0600 Bond Masuda <bond.masuda at> wrote:

> Thanks for the suggestions. Yes, we are aware of those other
> parameters, but we now know the bottleneck is in the MD RAID-1 layer.
> This is RHEL 5.5 w/ latest updated kernel (don't have the version with
> me right now)
> We've tried all schedulers, a variety of read ahead buffers, etc. The
> only thing that has allowed us to break the 200MB/s seq. write limit
> is when we get rid of the MD RAID-1 layer.
> Even if we don't use the file system (XFS in this case), if we build
> the MD RAID-1 with a missing half, and then add the 2nd half to allow
> it to re-sync, the fastest the re-sync will go (with all else pretty
> much idle) is about 200MB/s. So, this is MD RAID-1 layer doing its
> own block copying with no LVM2 or XFS or anything else involved.
> -Bond
> On Thu, 2010-07-15 at 10:34 -0500, Paul M. Dyer wrote:

> > 
> > We're seeing a performance bottleneck of about 200MBytes/sec
> > sequential writes when testing with iozone. We were expecting with
> > 7x effective spindles on the RAID-10, to get about ~350MBytes/sec
> > sustained writes for sequential access.
> > 
> > After trying out several combinations of things, we found that if we
> > remove the Linux MD software RAID layer, and just put LVM2 on top of
> > /dev/sdc (the vdisk as presented by the PERC 6/E RAID-10), we
> > get about 340MBytes/sec sequential writes. If we put XFS directly
> > on top of /dev/sdc1, we get about the same 340MBytes/sec. So, we
> > can get our anticipated performance of about 350MB/s only when we
> > don't use the MD RAID-1.
> > 
> > Since both MD1000s are connected via separate PERC 6/E, we didn't
> > think the MD RAID-1 would cause >40% performance loss...
> > 
> > We even tried to degrade the MD RAID-1 and see if writing only to
> > one of the mirrors would improve performance. It did NOT... still
> > 200MB/s. It almost seems like the Linux MD layer has a performance cap
> > at around 200MB/s.

Call me crazy, but I'm guessing you never let it finish syncing in the
first place.  When syncing, it caps its own throughput at some
arbitrary amount the code deems sensible.  Sounds like it never
un-capped, suggesting you never let it complete the sync process before
benching it.  The fact that both numbers are so remarkably the same
is further evidence of that.  Depending on the size of the
dataset, it could take a while to finish.  I could swear that I recall
there being some parameter which allows you to create a mirror but skip
the sync step.  If you do that, then basically you're swearing on the
storage bible that both volumes were zeroed before the mirror creation,
and the first thing you did after creating the volume was mkfs.
Personally I've never tried it, and searching the documentation, I
can't find it now, so perhaps that was just a bad dream I had.
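
One way to test the guess above is to look at what md itself reports.
The following is a minimal sketch, not a definitive tool: it assumes the
stock md interfaces /proc/mdstat (progress line of the form
"resync = N% ... speed=NK/sec") and the throttle files speed_limit_min
and speed_limit_max under /proc/sys/dev/raid, with values in KB/s.  The
function names sync_status and read_limit are made up for illustration.

```python
import re
from pathlib import Path

# Progress line md prints in /proc/mdstat while an array is syncing, e.g.
#   [==>......]  resync = 12.5% (125/1000) finish=1.0min speed=200000K/sec
SYNC_RE = re.compile(
    r"\b(resync|recovery|check|repair|reshape)\s*=\s*([\d.]+)%.*?speed=(\d+)K/sec"
)

def sync_status(mdstat_text):
    """Return {array: (action, percent_done, speed_KB_per_s)} for arrays still syncing."""
    status = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)  # start of a new array stanza
            continue
        progress = SYNC_RE.search(line)
        if progress and current:
            status[current] = (
                progress.group(1),
                float(progress.group(2)),
                int(progress.group(3)),
            )
    return status

def read_limit(name):
    """Read an md throttle value in KB/s, or None if the file is not readable."""
    try:
        return int((Path("/proc/sys/dev/raid") / name).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None

if __name__ == "__main__":
    try:
        text = Path("/proc/mdstat").read_text()
    except OSError:
        text = ""  # not a Linux box, or md not loaded
    syncing = sync_status(text)
    if syncing:
        for array, (action, pct, speed) in syncing.items():
            print(f"{array}: {action} {pct}% done at {speed} KB/s")
    else:
        print("no array reports a sync in progress")
    for name in ("speed_limit_min", "speed_limit_max"):
        print(f"{name} = {read_limit(name)} KB/s")
```

If an array shows up as still syncing, any benchmark run at that point
is competing with the resync.  The kernel default for speed_limit_max is
200000 KB/s, which would line up with the 200MB/s resync ceiling
reported above; raising it is the usual way to let a resync run faster.
The skip-the-sync parameter half-remembered above is most likely mdadm's
--assume-clean option to --create, which carries exactly the caveat
described: it is only safe if both halves really do hold identical data.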



More information about the Linux-PowerEdge mailing list