RHEL 4U6 x86_64 freezed by heavy I/O on PE2950
roderick.castillo at metanomics.de
Wed Apr 2 09:18:06 CDT 2008
All right, I will give the new version a try. Thanks a lot.
Be aware that mounting the file system sync/async won't contribute to
any solution; I just mentioned it to show that the "frozen" period has
to do with flushing the buffer to disk.
In both cases, the effective transfer speed is too slow. Something is
wrong with the combination RHEL/PERC5i/megaraid. The results of iostat
show that the CPU is mostly stuck in iowait while nothing happens
(%util=100).
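For anyone who wants to put a number on the iowait share without
iostat, here is a minimal sketch in Python. It assumes the 2.6-style
"cpu" summary line in /proc/stat, where the fifth counter is iowait.
The function names are made up for illustration, and this is only a
rough cross-check, not a replacement for iostat.

```python
def parse_cpu_line(line):
    """Return the counters of a '/proc/stat' summary line as integers."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Keep user, nice, system, idle, iowait, irq, softirq (and steal if present).
    return [int(value) for value in fields[1:9]]


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two 'cpu' line samples."""
    first = parse_cpu_line(before)
    second = parse_cpu_line(after)
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    # Index 4 is the iowait counter (user, nice, system, idle, iowait, ...).
    return 100.0 * deltas[4] / total


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line; the path is a parameter for testing."""
    with open(path) as stat_file:
        return stat_file.readline()
```

To use it, call read_cpu_line() once before and once during the copy,
a few seconds apart, and pass both lines to iowait_percent().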
Bye
Rick
From: <Ram_Sevak at Dell.com>
Date: 02.04.2008 13:45
To: <roderick.castillo at metanomics.de>, <linux-poweredge at lists.us.dell.com>
Cc:
Subject: RE: RHEL 4U6 x86_64 freezed by heavy I/O on PE2950
Rick,
I was able to reproduce this problem with the sync mount option on a
PE2950 with 4GB RAM and RAID1 on a PERC5/I card. I copied some 300-odd
files (900MB in size) onto an ext3 file system. It took some 5-6
minutes, which is a long time. I will test some /proc/sys/vm tuning
parameters and check whether that resolves the problem.
I didn't see this issue with the async option, though.
One thing I notice is that your PERC5/I firmware is not up to date. The
latest version is 5.2.1-0067, available on support.dell.com. You might
want to update the firmware and try again.
Thanks
Ram
-----Original Message-----
From: linux-poweredge-bounces at dell.com
[mailto:linux-poweredge-bounces at dell.com] On Behalf Of
roderick.castillo at metanomics.de
Sent: Tuesday, April 01, 2008 8:02 PM
To: linux-poweredge-Lists
Subject: RHEL 4U6 x86_64 freezed by heavy I/O on PE2950
Just copying a large number of files (about 2500 files, 800 MB)
reproducibly renders the server unusable for a period of time (about
12 min here). The load goes above 12, but the operation does run to
completion. Only after the load goes down does the server become
responsive again. I/O wait states can be seen to rise (while top still
runs, at the early stage of the copy operation), but otherwise the CPUs
are idle.
But at times, the operation finishes in 2 seconds! Then just trying a
simple "ls -l" afterwards puts the server in the mentioned "frozen"
state. This is because at times the I/O operations are completely
buffered, and apparently the issue arises only when actually flushing
to disk. This server has 16 GB memory.
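The split between the buffered phase and the flush can be measured
separately. The following is a minimal sketch in Python, not the
procedure used above: it writes a test file, then times the fsync on
its own. The function name, the chunk size and the test file path are
made up for illustration.

```python
import os
import time


def timed_write_and_flush(path, total_bytes, chunk_size=1024 * 1024):
    """Write total_bytes to path; return (buffered_seconds, fsync_seconds)."""
    chunk = b"\0" * chunk_size
    written = 0
    start = time.time()
    with open(path, "wb") as out:
        while written < total_bytes:
            size = min(chunk_size, total_bytes - written)
            out.write(chunk[:size])
            written += size
        # Push Python's own buffer to the kernel; the data may still sit
        # in the page cache at this point.
        out.flush()
        buffered_done = time.time()
        # Force the kernel to write the file's data to the device.
        os.fsync(out.fileno())
        flushed = time.time()
    return buffered_done - start, flushed - buffered_done
```

A short buffered time followed by a very long fsync time on the
internal disks would match the behaviour described above. Note that
fsync covers only this one file, not everything pending in the cache.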
When remounting the file system with the option "sync" (default is
"async"), the server remains responsive, but the operation takes an
unreasonably long time to finish (I could not even wait for it to
finish). This is not specific to the cp command. For example,
unpacking a large gz file behaves similarly.
Additionally, internal disks in a RAID 1 configuration suffer more than
those in RAID 5 (also internal disks), as far as I could test.
When copying to a file system residing on an external LUN of a SAN
connected via a Qlogic HBA, the operation always takes 2 seconds to
finish, and a sync afterwards takes 18 seconds. Compare that to 12 min
(async) or perhaps hours (sync) when using the internal disks. The
disks in the SAN are not particularly fast per se.
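To make the comparison concrete, here is the arithmetic as a small
Python sketch. It uses only the figures quoted above (800 MB, 12 min on
the internal disks, 2 s plus an 18 s sync on the SAN); the function
name is made up for illustration, and the results are rough averages.

```python
def throughput_mb_per_s(size_mb, seconds):
    """Average transfer rate in MB/s for size_mb moved in the given time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return float(size_mb) / seconds


# Internal disks, async mount: about 800 MB in about 12 minutes.
internal = throughput_mb_per_s(800, 12 * 60)
# SAN LUN: 2 seconds for the copy plus 18 seconds for the sync.
san = throughput_mb_per_s(800, 2 + 18)

print("internal disks: %.2f MB/s" % internal)
print("SAN LUN:        %.2f MB/s" % san)
print("ratio:          %.0fx" % (san / internal))
```

Even counting the sync, the SAN path comes out well over an order of
magnitude faster than the internal disks behind the PERC5i.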
This issue applies to the latest kernel available for this system,
2.6.9-67.0.4.ELsmp, and also to version 2.6.9-55.ELsmp, and only to
file systems (ext3) on the internal disks, which are controlled by the
PERC5i. Both the PERC5i firmware (V. 5.1.1-0040) and the megaraid_sas
driver software (V. 3.16-1) were updated to the latest versions. The
latest kernel released by RedHat, 2.6.9-67-0.7ELsmp, won't even boot
past the initrd phase; it does not find the root file system and says:
error 6 mounting ext3
error 2 mounting none
switchroot: mount failed: 22
kernel panic
so this kernel can't be tested at all. A Service Request has been
registered with RedHat Support, but I haven't got any solution yet.
Changing the I/O scheduler to
elevator=deadline
did not help a bit.
I don't know how to experiment with vm parameters in this version of
RedHat (Nahant). With earlier versions of RedHat one could change the
values of vm.bdflush using the sysctl command or by echoing the values
to the /proc file system. Anyway, I am afraid that it would not help,
since it seems that there is a real bottleneck here and not just
unsuitable parameters.
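On the vm parameters: as far as I know, 2.6 kernels no longer have
vm.bdflush and instead expose dirty_* tunables under /proc/sys/vm
(dirty_ratio, dirty_background_ratio, dirty_expire_centisecs,
dirty_writeback_centisecs). Below is a minimal read-only sketch in
Python for dumping them. The function name and the base-directory
parameter are made up for illustration, and whether tuning these values
helps in this case is untested.

```python
import os

# Writeback tunables assumed to exist on 2.6 kernels; missing ones are skipped.
DIRTY_TUNABLES = (
    "dirty_ratio",
    "dirty_background_ratio",
    "dirty_expire_centisecs",
    "dirty_writeback_centisecs",
)


def read_vm_dirty_settings(base="/proc/sys/vm"):
    """Return a dict of the dirty_* tunables found under base."""
    settings = {}
    for name in DIRTY_TUNABLES:
        path = os.path.join(base, name)
        if not os.path.isfile(path):
            continue  # not present on this kernel
        with open(path) as tunable:
            settings[name] = int(tunable.read().strip())
    return settings


def format_settings(settings):
    """Render the settings as 'vm.name = value' lines, sorted by name."""
    return "\n".join(
        "vm.%s = %d" % (name, settings[name]) for name in sorted(settings)
    )
```

Calling print(format_settings(read_vm_dirty_settings())) on the server
shows the current values; changing them would still go through sysctl
or echoing into /proc as root.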
The server itself was in a quite pristine state; it has not been put
into production yet.
Is anyone having the same issue? Has anyone solved it already? This is
an urgent issue, and any hint will be appreciated. Given that all
components are supported or certified, I consider this quite a serious
issue.
Thanks in advance
Rick
_______________________________________________
Linux-PowerEdge mailing list
Linux-PowerEdge at dell.com
http://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq