MD1220 + H800 nfs performance
jal at mdacorporation.com
Wed Sep 22 12:48:31 CDT 2010
> Date: Wed, 22 Sep 2010 14:45:16 +0100
> From: Robert Horton <r.horton at qmul.ac.uk>
> Subject: MD1220 + H800 nfs performance
> To: Dell poweredge Mailling-liste <linux-poweredge at dell.com>
> Message-ID: <1285163116.1780.80.camel at moelwyn>
> Content-Type: text/plain; charset="UTF-8"
> I'm having some problems getting decent nfs performance from some
> MD1220s connected to an R710. Here's a summary of the setup:
> 3 x MD1220 each with 24 x 500GB 7.2k SAS disk
> All connected to a Perc H800 in an R710.
> At present I have a single RAID60 volume with three spans of 23 disks,
> so each array holds one span plus a hot standby. Stripe element size is
> I'm testing the performance with:
> iozone -l 1 -u 1 -r 4k -s 10g -e
> and getting write performance of:
> Direct to filesystem: 1076 MB/s
> nfs via loopback interface: 217 MB/s
> nfs via IPoIB: 38 MB/s
> nfs via Ethernet: 24 MB/s
Which version of NFS?
NFS is normally synchronous: every operation is acknowledged before the next one is issued. That synchrony costs time. Your loopback measurement already tells you to expect about 1/5 of the direct filesystem speed, even on a "network" with zero latency and memory-speed bandwidth. Slow the network down slightly (IB) or add latency (microseconds with either) and your throughput takes another big hit.
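As a rough sanity check on the figures quoted above, the per-request cost can be backed out of the measured throughput. This is only a sketch: it assumes every write is a 32 KB NFS request (the maximum block size mentioned in the original post) and that requests are strictly serialized, which a real client only approximates.

```python
# Measured write throughput from the original post, in MB/s.
MEASURED_MB_S = {
    "direct to filesystem": 1076,
    "nfs via loopback": 217,
    "nfs via IPoIB": 38,
    "nfs via Ethernet": 24,
}

# Assumption: each NFS write carries 32 KB and the next one is not sent
# until the previous one has been acknowledged.
REQUEST_BYTES = 32 * 1024


def microseconds_per_request(mb_per_s, request_bytes=REQUEST_BYTES):
    """Time one fully serialized request must take to yield this throughput."""
    bytes_per_second = mb_per_s * 1_000_000
    return request_bytes / bytes_per_second * 1_000_000


def fraction_of_direct(mb_per_s):
    """Throughput relative to writing straight to the filesystem."""
    return mb_per_s / MEASURED_MB_S["direct to filesystem"]


if __name__ == "__main__":
    for path, rate in MEASURED_MB_S.items():
        print(f"{path:22s} {rate:5d} MB/s  "
              f"{microseconds_per_request(rate):7.0f} us/request  "
              f"{fraction_of_direct(rate):6.1%} of direct")
```

Under those assumptions the Ethernet figure works out to well over a millisecond per request, which is several times the wire time of 32 KB on a GigE link. That points at per-request waiting rather than raw bandwidth.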
To get fast write speed, don't use NFS (!). Switch to FTP. If you must use NFS, turn on some writeback caching (the async option in /etc/exports). Ensure NFS uses TCP. Verify that your network connections are direct, full-duplex, and error-free, and not something odd.
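To see which exports are still synchronous, something like the following could be run over the contents of /etc/exports. It is a minimal sketch: the paths /export/data and /export/home and the network 192.168.1.0/24 are invented for illustration, and the parser only handles the simple "path client(options)" form, not quoting or line continuations.

```python
import re

# Hypothetical sample in /etc/exports syntax; paths and network are invented.
SAMPLE_EXPORTS = """\
# path        client(options)
/export/data  192.168.1.0/24(rw,async,no_subtree_check)
/export/home  192.168.1.0/24(rw,sync,no_subtree_check)
"""

# Matches one "client(opt1,opt2,...)" entry.
CLIENT_RE = re.compile(r"(?P<client>[^\s()]*)\((?P<opts>[^)]*)\)")


def export_write_modes(text):
    """Return (path, client, mode) tuples, where mode is 'async' or 'sync'."""
    results = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        parts = line.split(None, 1)
        path = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        for match in CLIENT_RE.finditer(rest):
            opts = {opt.strip() for opt in match.group("opts").split(",")}
            # sync is the default in current nfs-utils, so anything not
            # explicitly marked async is reported as sync.
            mode = "async" if "async" in opts else "sync"
            results.append((path, match.group("client"), mode))
    return results


if __name__ == "__main__":
    for path, client, mode in export_write_modes(SAMPLE_EXPORTS):
        print(f"{path} -> {client}: {mode}")
```

Bear in mind that async trades safety for speed: if the server crashes, writes the client believes were committed can be lost. On the client side, the transport is selected with the proto=tcp mount option.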
> Based on testing other systems I would expect the nfs over Ethernet to
> be around 100MB/s (ie saturating the GigE link) and the nfs over IPoIB
> to be higher than that. I've tested the network links with nttcp and
> there don't appear to be any problems.
> I've tried various filesystems (ext3, ext4, xfs) but this didn't have a
> significant effect.
> I'm wondering:
> 1) Should the stripe size be smaller? Given that the nfs max block size
> is 32KB each write is going to be less than one stripe..?
Stripe size makes a big difference if your filesystem is not aligned to the stripe. Allocate partitions on stripe boundaries, use XFS, and tell it what the stripe size and stripe width are. Remember what RAID-5 has to do to write 64 kbytes on a 21-disk stripe: read the old data and the old parity, recompute the parity, then write both back. You want XFS to optimize this.
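The geometry arithmetic can be sketched as follows. The 64 KB stripe element and the 21 data disks are taken from the example in the paragraph above, not from the actual array (whose element size is not given), and how a RAID 60 volume presents its spans to the filesystem depends on the controller, so treat the resulting su/sw values as placeholders to be checked against the real configuration.

```python
def stripe_geometry(stripe_element_kib, data_disks):
    """Full-stripe size in KiB and the matching mkfs.xfs geometry option."""
    full_stripe_kib = stripe_element_kib * data_disks
    mkfs_option = f"-d su={stripe_element_kib}k,sw={data_disks}"
    return full_stripe_kib, mkfs_option


def is_aligned(partition_start_sector, full_stripe_kib, sector_bytes=512):
    """True if the partition begins on a full-stripe boundary."""
    start_bytes = partition_start_sector * sector_bytes
    return start_bytes % (full_stripe_kib * 1024) == 0


def small_write_ios(parity_disks):
    """Disk reads and writes for a read-modify-write of one stripe element.

    The old data and each old parity block are read, then the new data and
    each new parity block are written.
    """
    reads = 1 + parity_disks
    writes = 1 + parity_disks
    return reads, writes


if __name__ == "__main__":
    full_kib, option = stripe_geometry(64, 21)  # assumed example values
    print(f"full stripe: {full_kib} KiB, mkfs.xfs {option}")
    print(f"a 32 KiB write covers {32 / full_kib:.1%} of a full stripe")
    print(f"partition at sector 63 aligned: {is_aligned(63, full_kib)}")
    print(f"single parity (reads, writes): {small_write_ios(1)}")
    print(f"dual parity (reads, writes): {small_write_ios(2)}")
```

With these example values a 32 KB NFS write touches only a small fraction of a full stripe, so nearly every write becomes a read-modify-write unless the filesystem and cache can gather writes into full stripes.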
> 2) Is there a better way of arranging the disks? Given that I want the
> dual parity I'm more or less stuck with some form of RAID 6, but I
> have more spans or create separate volumes and stripe them with LVM.
You'll probably run into H800 bandwidth limits. 1GB/sec is pretty amazing throughput for a cheap^h^h^h^h^h inexpensive RAID controller. Your issue lies elsewhere, since the combination of network and software is taking away performance by the bucketful.
> I'm happy to test different configurations but given the time needed to
> reinitialise the array it would be good to get some pointers first...
> Any thoughts would be appreciated.
For that many disks, RAID-10 is indicated. With 24 disks you can expect one or two failures per year per MD1220, so you run a real risk of double failures killing your data.
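The failure arithmetic behind that estimate can be sketched like this. The 5% annual failure rate and the 24-hour rebuild window are assumptions picked purely for illustration (5% happens to give a bit over one failure per year for 24 disks); substitute figures from your own drives and controller. The model also assumes independent failures at a constant rate, and real failures are often correlated, so take the result as a rough guide only.

```python
HOURS_PER_YEAR = 365 * 24


def expected_failures_per_year(disks, annual_failure_rate):
    """Average number of disk failures per year in one enclosure."""
    return disks * annual_failure_rate


def prob_failure_during_rebuild(remaining_disks, annual_failure_rate,
                                rebuild_hours):
    """Chance that at least one remaining disk fails during the rebuild.

    Assumes independent failures at a constant rate.
    """
    per_disk = annual_failure_rate * rebuild_hours / HOURS_PER_YEAR
    return 1 - (1 - per_disk) ** remaining_disks


if __name__ == "__main__":
    afr = 0.05           # assumed annual failure rate per disk
    rebuild_hours = 24   # assumed rebuild time
    print(f"expected failures per year (24 disks): "
          f"{expected_failures_per_year(24, afr):.1f}")
    print(f"chance of another failure during one rebuild: "
          f"{prob_failure_during_rebuild(23, afr, rebuild_hours):.2%}")
```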
Also, the filesystem is now huge, and you are relying on the disks reading and writing perfectly, on the RAID controller being perfect, and on Linux and XFS never corrupting the filesystem data or metadata.
When a disk breaks, do you know what to do to replace it? You could practise replacing disks now, when you have no data at risk.