RHEL 4U6 x86_64 frozen by heavy I/O on PE2950
roderick.castillo at metanomics.de
Tue Apr 1 09:32:02 CDT 2008
Just copying a large number of files (about 2500, 800 MB) reproducibly
renders the server unusable for a period of time (about 12 min here).
The load goes up above 12, but the operation does run to completion.
Only after the load goes down does the server become responsive again.
I/O wait can be seen rising (while top still runs, just at the early
stage of the copy operation), but otherwise the CPUs are idle.
But at times, the operation finishes in 2 seconds! Then merely running a
simple "ls -l" afterwards puts the server into the "frozen" state
described above. This is because at times the I/O operations are
completely buffered, and the issue apparently arises only when the data
is actually flushed to disk. This server has 16 GB of memory.
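The split between the fast, fully buffered copy and the later stall can
be made visible by timing the write and the flush separately. The
following is only a sketch, assuming a Python 3 interpreter is at hand;
the function name, the 8 MB test size and the temporary file location
are placeholders chosen for illustration, not values from the tests
above.

```python
import os
import tempfile
import time


def timed_write_and_flush(path, total_bytes, chunk_size=1 << 20):
    """Write total_bytes of zeros to path, then fsync it.

    Returns (write_seconds, fsync_seconds).
    """
    chunk = b"\0" * chunk_size
    remaining = total_bytes
    start = time.monotonic()
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(chunk[:n])
            remaining -= n
        f.flush()             # hand Python's buffer to the kernel page cache
        written = time.monotonic()
        os.fsync(f.fileno())  # ask the kernel to write the dirty pages to disk
        synced = time.monotonic()
    return written - start, synced - written


if __name__ == "__main__":
    # Placeholder location and size: point this at a file system on the
    # disks under test and raise the size to get a meaningful reading.
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_s, fsync_s = timed_write_and_flush(path, 8 * 1024 * 1024)
        print("buffered write: %.2f s, fsync: %.2f s" % (write_s, fsync_s))
    finally:
        os.remove(path)
```

If the write phase returns almost immediately while the fsync phase
takes very long, that would be consistent with the stall happening at
flush time rather than during the copy itself.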
When the file system is remounted with the option "sync" (the default is
"async"), the server remains responsive, but the operation takes an
unreasonably long time to finish (I could not even wait for it to
finish). This is not specific to the cp command. For example, unpacking
a large gz file behaves similarly.
Additionally, internal disks in a RAID 1 configuration suffer more than
those in RAID 5 (also internal disks), as far as I could test.
When copying to a file system residing on an external LUN of a SAN
connected via a Qlogic HBA, the operation always takes 2 seconds to
finish, and a sync afterwards takes 18 seconds. Compare that to 12 min
(async) or perhaps hours (sync) when using the internal disks. The disks
in the SAN are not particularly fast per se.
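To put these timings side by side, the effective throughput can be
worked out from the figures above. This is a rough back-of-the-envelope
sketch: it assumes the roughly 12 minutes of unresponsiveness is about
the time the internal disks need for the whole 800 MB, and the function
name is made up for illustration.

```python
def throughput_mb_per_s(size_mb, seconds):
    """Average throughput in MB/s for size_mb moved in the given time."""
    return size_mb / seconds


SIZE_MB = 800                # approximate size of the copied file set

internal_s = 12 * 60         # about 12 min on the internal disks (async)
san_s = 2 + 18               # 2 s copy plus 18 s sync on the SAN LUN

internal = throughput_mb_per_s(SIZE_MB, internal_s)
san = throughput_mb_per_s(SIZE_MB, san_s)

print("internal disks: %.1f MB/s" % internal)
print("SAN LUN:        %.1f MB/s" % san)
print("time ratio:     %d to 1" % (internal_s // san_s))
```

Under those assumptions the internal disks come out at roughly 1.1 MB/s
against 40 MB/s for the SAN, a factor of 36.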
This issue applies to the latest kernel available for this system,
2.6.9-67.0.4.ELsmp, and also to version 2.6.9-55.ELsmp, and only to file
systems (ext3) on the internal disks, which are controlled by a PERC5i.
Both the PERC5i firmware (V. 5.1.1-0040) and the megaraid_sas driver
(V. 3.16-1) were updated to the latest versions. The latest kernel
released by Red Hat, 2.6.9-67.0.7.ELsmp, won't even boot past the initrd
phase; it does not find the root file system and says:
error 6 mounting ext3
error 2 mounting none
switchroot: mount failed: 22
kernel panic
so this kernel can't be tested at all. A Service Request has been opened
with Red Hat Support, but I haven't got a solution yet. Changing the I/O
scheduler to elevator=deadline did not help a bit.
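Since elevator=deadline is passed as a boot parameter, it is worth
confirming that the running kernel was really booted with it. A minimal
sketch, assuming /proc/cmdline is readable; the function name is
hypothetical, and keeping the last occurrence when the parameter is
repeated is a choice of this sketch.

```python
import os


def elevator_from_cmdline(cmdline):
    """Return the value of elevator= on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        if token.startswith("elevator="):
            value = token.split("=", 1)[1]  # this sketch keeps the last one
    return value


if __name__ == "__main__":
    path = "/proc/cmdline"
    if os.path.exists(path):
        with open(path) as f:
            found = elevator_from_cmdline(f.read())
        print(found or "no elevator= parameter on the command line")
    else:
        print("%s not available on this system" % path)
```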
I don't know how to experiment with vm parameters in this version of
Red Hat (Nahant). With earlier versions of Red Hat one could change the
values of vm.bdflush using the sysctl command or by echoing the values
into the /proc file system. Anyway, I am afraid that it would not help,
since it seems that there is a real bottleneck here and not just
unsuitable parameters.
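On 2.6 kernels the writeback behaviour is governed by the vm.dirty_*
settings under /proc/sys/vm rather than by vm.bdflush. The sketch below
only reads the current values so they can be included in a report;
whether changing any of them helps with this problem is untested. The
function name and the selection of settings are choices of this sketch,
and a setting missing on a given kernel is reported as None.

```python
import os

VM_DIR = "/proc/sys/vm"
KNOBS = [
    "dirty_background_ratio",
    "dirty_ratio",
    "dirty_expire_centisecs",
    "dirty_writeback_centisecs",
]


def read_vm_knobs(vm_dir=VM_DIR, names=KNOBS):
    """Return {name: int value, or None if missing or unreadable}."""
    values = {}
    for name in names:
        path = os.path.join(vm_dir, name)
        try:
            with open(path) as f:
                values[name] = int(f.read().split()[0])
        except (OSError, ValueError, IndexError):
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in sorted(read_vm_knobs().items()):
        print("vm.%s = %s" % (name, value))
```

The same settings can be changed with "sysctl -w" or by echoing a value
into the corresponding file under /proc/sys/vm, in the same way as
vm.bdflush before.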
The server itself is in a quite pristine state; it has not been put into
production yet. Is anyone having the same issue? Has anyone solved it
already? This is an urgent issue, and any hint will be appreciated.
Given that all components are supported or certified, this is a quite
serious issue, in my opinion.
Thanks in advance
Rick