Time-honoured problem: RHEL3 iowait performance

Brendan Heading brendanheading at clara.co.uk
Sat Feb 17 16:27:15 CST 2007


All,

I have a PowerEdge 2850 containing a PERC4 RAID controller with six 
10,000rpm SCSI disks: three 146GB disks and three 300GB disks, each set 
of three configured as its own RAID5 array. It has 4GB RAM and a pair 
of 3.6GHz Xeon CPUs with hyperthreading enabled. The box and the 
clients are all running Red Hat Enterprise Linux 3, patched up to the 
latest update (Update 8).

The OS boot, swap, and other system partitions all sit on the smaller 
RAID array. The remainder of that array, together with the whole of the 
larger array, is joined into a single LVM volume group.
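
For reference, the layout can be confirmed with the usual LVM tools; 
nothing below depends on my particular volume names:

  # show which physical volumes (i.e. which array) back the volume
  # group, and how each logical volume is laid out across them
  vgdisplay -v

  # map the mounted filesystems back to the logical volumes
  df -h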

The machine's main purpose in life is to serve as a file server for a 
bunch of NFS and (less so) Samba clients. All of the NFS clients (there 
are three) are connected to the server via a gigabit ethernet link using 
jumbo frames.
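
For completeness, this is how the jumbo-frame path can be 
double-checked (the interface and client names are placeholders):

  # confirm the MTU on the server and on each client
  ifconfig eth0 | grep MTU

  # confirm a full 9000-byte frame crosses the link without fragmenting
  # (8972 = 9000 minus 20 bytes IP header and 8 bytes ICMP header;
  # if this ping lacks -M, the MTU check above still helps)
  ping -M do -s 8972 -c 3 nfsclient1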

I seem to be having the same problem that a lot of other people have 
reported. When two or three users on remote clients start doing heavy 
I/O over NFS, the server grinds almost to a complete halt, with iowait 
in the high 90% range. /sbin/iostat reports relatively low throughput 
while one of the two RAID arrays sits at 100% utilization.
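
For the record, this is roughly how I'm watching it while the clients 
are busy (assuming the installed sysstat supports extended stats):

  # per-device extended statistics every 5 seconds; %util, await and
  # avgqu-sz are the interesting columns on the saturated array
  iostat -x 5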

I've found that it seems to be simultaneous activity that causes the 
grinding. If I ask my users to stop and then have one user at a time 
run whatever operation they need, iowait is lower (though still around 
50%) and each operation completes relatively quickly compared with 
several users working in parallel. That suggests to me that the issue 
has something to do either with the randomness of the access pattern or 
with caching.
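
One thing I can try to narrow this down is to take NFS out of the 
picture and generate the same kind of parallel load locally on the 
server (the file names below are placeholders for existing large files 
on the busy array):

  # serial baseline: one large sequential read, note throughput/iowait
  dd if=/data/big1 of=/dev/null bs=1M

  # then three readers in parallel, watching iostat -x elsewhere
  dd if=/data/big1 of=/dev/null bs=1M &
  dd if=/data/big2 of=/dev/null bs=1M &
  dd if=/data/big3 of=/dev/null bs=1M &
  wait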

To fix this I tried:

- messing with elvtune. This seemed to help a little, but not much. 
Presently it's at 2048 read, 8192 write, with max_bomb_segments at 0; I 
believe that's similar to the Red Hat defaults.

- setting the sysctl parameters (in /proc/sys/vm) in accordance with 
previous messages posted on this list. Again, these seem to have helped 
a bit, but not a lot. (Both changes are illustrated below.)
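
For the record, here is roughly how both of the above were applied. The 
device name is just an example (each block device is tuned separately 
on the 2.4 kernel), and the bdflush values are placeholders to show the 
mechanism rather than a recommendation:

  # elevator settings for the device backing the busy array
  elvtune /dev/sda
  elvtune -r 2048 -w 8192 -b 0 /dev/sda

  # VM tunables live under /proc/sys/vm; check what this kernel exposes
  ls /proc/sys/vm
  sysctl vm.bdflush

  # make a change persistent across reboots
  echo 'vm.bdflush = 30 500 0 0 500 3000 60 20 0' >> /etc/sysctl.conf
  sysctl -p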

I am wondering if I am asking too much of the box and the OS, but I 
don't think so: I've got six-year-old Sun Enterprise 220R boxes running 
Solaris 8 that handle NFS workloads better than this. Does anyone have 
any suggestions about what other options I can try? The options 
available to me are:

- upgrading to RHEL4.
- upgrading the RAID controller's firmware. It's still running the 
factory original.
- purchasing more disks, putting them in a PowerVault 220S, and using 
RAID10 or RAID50 instead of RAID5.
- putting more memory in the box, in case this is a caching matter (a 
quick check is sketched below).

However, I do not want to go to the trouble and expense of any of the 
above unless I can be sure it will improve things.
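
Before spending money on RAM, I plan to sanity-check how the existing 
4GB is actually being used during one of the slow periods:

  # how much of the 4GB is sitting in buffers / page cache
  free -m

  # si/so show whether the box is genuinely short of memory; bi/bo show
  # the actual block throughput while it is stalled
  vmstat 5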

Any pointers or tips would be greatly appreciated.

Regards

Brendan



