RHEL 4U6 x86_64 freezed by heavy I/O on PE2950

Ram_Sevak at Dell.com
Mon Apr 14 04:25:28 CDT 2008


Rick, 
You can manipulate the /proc/sys/vm parameters by echoing values to
them. Two relevant parameters in this case may be:
echo <xx> > /proc/sys/vm/dirty_ratio
echo <xx> > /proc/sys/vm/dirty_background_ratio

Or use sysctl -w vm.dirty_background_ratio=<xx> and sysctl -w
vm.dirty_ratio=<xx>
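
As a minimal sketch of the echo approach above, the following shell
functions read and set both thresholds. The names show_dirty, set_dirty
and VM_DIR are made up for illustration, and the check that the
background value is lower than dirty_ratio is a sanity check assumed by
this sketch, not something taken from this thread.

```shell
#!/bin/sh
# VM_DIR is a hypothetical override so the functions can be tried against
# a scratch directory; by default they act on the real /proc/sys/vm.
VM_DIR="${VM_DIR:-/proc/sys/vm}"

# Print the current values of both writeback thresholds.
show_dirty() {
    for name in dirty_ratio dirty_background_ratio; do
        printf '%s = %s\n' "$name" "$(cat "$VM_DIR/$name")"
    done
}

# Set both thresholds. Arguments: <dirty_ratio> <dirty_background_ratio>.
# Needs root when VM_DIR points at the real /proc/sys/vm.
set_dirty() {
    [ "$#" -eq 2 ] || { echo "usage: set_dirty <ratio> <background_ratio>" >&2; return 1; }
    # Assumption of this sketch: refuse a background value that is not
    # below the foreground one.
    [ "$2" -lt "$1" ] || { echo "background_ratio should be lower than dirty_ratio" >&2; return 1; }
    echo "$1" > "$VM_DIR/dirty_ratio" &&
    echo "$2" > "$VM_DIR/dirty_background_ratio"
}
```

Running show_dirty before a change records the values to go back to.
Values written this way do not survive a reboot, so it is a cheap way to
experiment before settling on anything.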

For me, lowering these parameters produced somewhat positive results,
though their exact values should depend on your intended end use, as
they affect file system cache hits. You may be able to arrive at numbers
that *just* work for you.

Also, did you find time to update the firmware and test again?

Thanks
Ram



-----Original Message-----
From: Sevak, Ram 
Sent: Wednesday, April 02, 2008 5:15 PM
To: 'roderick.castillo at metanomics.de'; linux-poweredge-Lists
Subject: RE: RHEL 4U6 x86_64 freezed by heavy I/O on PE2950

Rick,
I was able to reproduce this problem with the sync mount option on a
PE2950 with 4GB RAM and RAID1 on a PERC5/I card. I copied some 300-odd
files (900MB in size) onto an ext3 file system. It took some 5-6 minutes,
which is long. I will test some /proc/sys/vm parameter tuning and check
whether the problem gets resolved.
I didn't see this issue with the async option, though.

One thing I notice is that your PERC5/I firmware is not up to date. The
latest version is 5.2.1-0067, available on support.dell.com. You might
want to update the firmware and try again.

Thanks
Ram

-----Original Message-----
From: linux-poweredge-bounces at dell.com
[mailto:linux-poweredge-bounces at dell.com] On Behalf Of
roderick.castillo at metanomics.de
Sent: Tuesday, April 01, 2008 8:02 PM
To: linux-poweredge-Lists
Subject: RHEL 4U6 x86_64 freezed by heavy I/O on PE2950

Just copying a large number of files (about 2500, 800 MB) reproducibly
renders the server unusable for a period of time (about 12 min here);
load goes up above 12, but the operation does run to completion. Only
after the load goes down does the server become responsive again. I/O
wait states can be seen to rise (while top still runs, just at the early
stage of the copy operation), but otherwise the CPUs are idle.

But at times, the operation finishes in 2 seconds! Only trying a simple
"ls -l" afterwards puts the server in the mentioned "frozen" state. This
is because at times the I/O operations are completely buffered, and
apparently the issue arises only when actually flushing to disk. This
server has 16 GB memory.
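
One way to watch this buffering while it happens is to look at how much
dirty data is still waiting to be written. A minimal sketch, assuming
the kernel reports Dirty and Writeback lines in /proc/meminfo (2.6
kernels generally do); dirty_kb and MEMINFO are hypothetical names.

```shell
#!/bin/sh
# MEMINFO is a hypothetical override for trying the function on a saved
# copy of the file; by default the live /proc/meminfo is read.
MEMINFO="${MEMINFO:-/proc/meminfo}"

# Print the amount of dirty page-cache data and of data currently being
# written back, in kB, as reported by the kernel.
dirty_kb() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { printf "%s %s kB\n", $1, $2 }' "$MEMINFO"
}
```

Calling it in a loop in a second terminal, for example
"while sleep 1; do dirty_kb; done", during the copy should show whether
the fast runs simply leave the data in memory.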

When remounting the file system with the option "sync" (default is
"async"), the server remains responsive, but the operation takes an
unreasonably long time to finish (I could not even wait for it to
finish). This is not specific to the cp command. For example, unpacking
a large gz file behaves similarly.

Additionally, internal disks in a RAID 1 configuration suffer more than
those in RAID 5 (also internal disks), as far as I could test.

When copying to a file system residing on an external LUN of a SAN
connected via a Qlogic HBA, the operation always takes 2 seconds to
finish, and a sync afterwards takes 18 seconds. Compare that to 12 min
(async) or perhaps hours (sync) when using the internal disks. The disks
in the SAN are not particularly fast per se.
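
The copy-then-sync measurement above can be repeated in a comparable way
on each target with a small helper. This is only a sketch: timed_copy is
a hypothetical name, and it assumes a date command that supports +%s
(GNU date does), which gives whole-second resolution.

```shell
#!/bin/sh
# Copy a directory tree, then flush to disk, and report how long each
# phase took. Arguments: <source dir> <destination path>.
timed_copy() {
    src=$1
    dst=$2
    start=$(date +%s)
    cp -r "$src" "$dst" || return 1
    copied=$(date +%s)
    sync                      # force buffered writes out to disk
    flushed=$(date +%s)
    echo "copy: $((copied - start)) s, sync: $((flushed - copied)) s"
}
```

Note that sync flushes every mounted file system, not only the target,
so the second number is only meaningful on an otherwise quiet machine.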

This issue applies to the latest kernel available for this system,
2.6.9-67.0.4.ELsmp, and also to version 2.6.9-55.ELsmp, and only to file
systems (ext3) on the internal disks, which are controlled by the PERC5i.
Both the PERC5i firmware (V. 5.1.1-0040) and the megaraid_sas driver
software (V. 3.16-1) were updated to the latest versions. The latest
kernel released by RedHat, 2.6.9-67-0.7ELsmp, won't even boot after the
initrd phase; it does not find the root file system and says:

error 6 mounting ext3
error 2 mounting none
switchroot: mount failed: 22
kernel panic

so this kernel can't be tested at all. A Service Request has been
registered with RedHat Support, but I haven't received a solution yet.
Changing the I/O scheduler to elevator=deadline did not help a bit.
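
When testing a boot parameter such as elevator=deadline, it can be worth
confirming that the running kernel was actually booted with it. A
minimal sketch that reads the kernel command line; boot_elevator and
CMDLINE are hypothetical names, and an empty result only means that no
elevator= option was passed at boot.

```shell
#!/bin/sh
# CMDLINE is a hypothetical override for testing; by default the command
# line of the running kernel is read from /proc/cmdline.
CMDLINE="${CMDLINE:-/proc/cmdline}"

# Print the value of the elevator= boot option, or nothing if the
# option is absent.
boot_elevator() {
    tr ' ' '\n' < "$CMDLINE" | sed -n 's/^elevator=//p'
}
```

Some kernels also expose the active scheduler per device under
/sys/block/<device>/queue/scheduler; whether this particular kernel does
is not confirmed here.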

I don't know how to experiment with vm parameters in this version of
RedHat (Nahant). With earlier versions of RedHat one could change the
values of vm.bdflush using the sysctl command or by echoing the values
to the /proc file system. Anyway, I am afraid that it would not help,
since it seems that there is a real brake here and not just unsuitable
parameters.

The server itself is in a quite pristine state; it has not been put into
production yet. Is anyone having the same issue? Has anyone solved it
already? This is an urgent issue; any hint will be appreciated. Given
that all components are supported or certified, this is a quite serious
issue, in my opinion.

Thanks in advance

Rick

_______________________________________________
Linux-PowerEdge mailing list
Linux-PowerEdge at dell.com
http://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq


