RHEL 4U6 x86_64 frozen by heavy I/O on PE2950

roderick.castillo at metanomics.de
Tue Apr 1 09:32:02 CDT 2008


Just copying a large number of files (about 2500 files, 800 MB in total)
reproducibly renders the server unusable for a period of time (about 12
minutes here). The load rises above 12, but the operation does run to
completion. Only after the load goes down does the server become
responsive again. I/O wait can be seen to rise (while top still runs, in
the early stage of the copy operation), but otherwise the CPUs are idle.

At times, however, the operation finishes in 2 seconds! Then just trying
a simple "ls -l" afterwards puts the server into the "frozen" state
described above. This is because the I/O operations are sometimes
completely buffered, and the issue apparently arises only when the data
is actually flushed to disk. This server has 16 GB of memory.
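
One way to check the buffering explanation is to watch how much dirty
data the kernel is holding while the copy runs. A minimal sketch,
assuming /proc/meminfo on this kernel exposes the Dirty and Writeback
counters (as 2.6 kernels generally do); the function name dirty_kb is
only illustrative:

```shell
# Print the Dirty and Writeback counters (in kB) from a meminfo-style file.
# Defaults to /proc/meminfo; the optional argument exists only so the
# function can be tried against a saved copy.
dirty_kb() {
    awk '$1 == "Dirty:"     { d = $2 }
         $1 == "Writeback:" { w = $2 }
         END { printf "Dirty=%s kB Writeback=%s kB\n", d, w }' \
        "${1:-/proc/meminfo}"
}

# Example: sample once per second during the copy (Ctrl-C to stop).
# while sleep 1; do dirty_kb; done
```

If Dirty grows to hundreds of MB during the "2 second" copy and only
drains while the server is unresponsive, that would be consistent with
the stall happening at flush time.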

When the file system is remounted with the option "sync" (the default is
"async"), the server remains responsive, but the operation takes an
unreasonably long time (I could not even wait for it to finish). This is
not specific to the cp command. For example, unpacking a large gz file
behaves similarly.
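
To confirm which mode a file system is actually in before and after a
remount, the options can be read back from /proc/mounts. A sketch, with
/data as a placeholder mount point and mount_opts as a made-up helper
name:

```shell
# Print the mount options recorded for a mount point in a
# /proc/mounts-style file (fields: device, mount point, type, options, ...).
# Prints one line per matching entry; "sync" appears in the list only when
# the file system is mounted synchronously.
mount_opts() {
    awk -v mp="$1" '$2 == mp { print $4 }' "${2:-/proc/mounts}"
}

# Example (as root; /data is a placeholder):
# mount -o remount,sync /data  && mount_opts /data
# mount -o remount,async /data && mount_opts /data
```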

Additionally, internal disks in a RAID 1 configuration suffer more than
those in RAID 5 (also internal disks), as far as I could test.

When copying to a file system residing on an external LUN of a SAN
connected via a Qlogic HBA, the operation always takes 2 seconds to
finish, and a sync afterwards takes 18 seconds. Compare that to 12
minutes (async) or perhaps hours (sync) when using the internal disks.
The disks in the SAN are not particularly fast per se.
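
For anyone who wants to repeat the comparison, the copy and the flush can
be timed separately. This is only a sketch: timed_copy is a hypothetical
helper, the paths are placeholders, and the resolution is whole seconds.

```shell
# Copy a directory tree, then flush dirty data to disk, and report the
# wall-clock time of each phase in seconds.
# Usage: timed_copy SRC_DIR DST_DIR   (DST_DIR should not exist yet)
timed_copy() {
    src=$1
    dst=$2
    t0=$(date +%s)
    cp -a "$src" "$dst" || return 1
    t1=$(date +%s)
    sync                      # waits for buffered writes to reach the disks
    t2=$(date +%s)
    echo "copy: $((t1 - t0)) s, sync: $((t2 - t1)) s"
}

# Example (placeholder paths):
# timed_copy /srv/testdata /mnt/internal/testdata
# timed_copy /srv/testdata /mnt/san/testdata
```

Note that sync flushes all file systems, so the second figure is only
meaningful on an otherwise idle machine.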

This issue applies to the latest kernel available for this system,
2.6.9-67.0.4.ELsmp, and also for version  and 2.6.9-55.ELsmp, and only
to file systems (ext3) on the internal disks, which are controlled by a
PERC5i. Both the PERC5i firmware (V. 5.1.1-0040) and the megaraid_sas
driver (V. 3.16-1) were updated to the latest versions. The latest
kernel released by RedHat, 2.6.9-67-0.7ELsmp, won't even boot: after the
initrd phase it does not find the root file system and says:

error 6 mounting ext3
error 2 mounting none
switchroot: mount failed: 22
kernel panic

so this kernel can't be tested at all. A Service Request has been opened
with RedHat Support, but I haven't received a solution yet. Changing the
I/O scheduler with elevator=deadline did not help at all.
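
Since elevator= is a boot parameter, it is worth verifying that the
running kernel was really booted with it. A small sketch that reads it
back from /proc/cmdline (boot_elevator is a hypothetical helper name):

```shell
# Print the value of the elevator= boot parameter, if one was given.
# Reads /proc/cmdline by default; prints nothing when the parameter is
# absent (the kernel then uses its compiled-in default scheduler).
boot_elevator() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | sed -n 's/^elevator=//p'
}

# Example:
# boot_elevator
```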

I don't know how to experiment with vm parameters in this version of
RedHat (Nahant). With earlier versions of RedHat one could change the
values of vm.bdflush using the sysctl command or by echoing the values
into the /proc file system. Anyway, I am afraid that it would not help,
since there seems to be a real brake here and not just unsuitable
parameters.
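
In 2.6 kernels the vm.bdflush knob was replaced by a set of vm.dirty_*
sysctls (dirty_ratio, dirty_background_ratio, dirty_expire_centisecs,
dirty_writeback_centisecs). A sketch for listing whatever this kernel
exposes, assuming they live under /proc/sys/vm as in mainline 2.6; the
helper name is illustrative and the value in the comment is a
placeholder, not a recommendation:

```shell
# List the dirty-page writeback tunables found in a directory
# (default /proc/sys/vm), one "name = value" pair per line.
show_dirty_tunables() {
    dir=${1:-/proc/sys/vm}
    for f in "$dir"/dirty_*; do
        if [ -r "$f" ]; then
            echo "$(basename "$f") = $(cat "$f")"
        fi
    done
}

# Example: show current values, then (as root) start background writeback
# earlier. The number 1 is purely illustrative.
# show_dirty_tunables
# sysctl -w vm.dirty_background_ratio=1
```

Whether lowering these thresholds changes the behaviour described here is
untested; it would only spread the writeback out, not make the internal
disks faster.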

The server itself is in a quite pristine state; it has not been put into
production yet. Is anyone having the same issue? Has anyone solved it
already? This is an urgent issue, and any hint will be appreciated.
Given that all components are supported or certified, this is quite a
serious issue, in my opinion.

Thanks in advance

Rick



More information about the Linux-PowerEdge mailing list