RedHat 9 aacraid - system fails under extreme disk IO - Reproducible test case
Paul.Anderson at priorityhealthcare.com
Tue Oct 7 12:36:00 CDT 2003
We had this same issue with our 2650s running AS 2.1. I don't know that this is the best answer, but it is the one that worked for us: replace the onboard adapter with a PERC 3/DC (LSI) adapter. Make sure that you put it on its own bus; we used slot three. In two of our 2650s we are even running this alongside the HBAs for SAN connectivity. That said, our solution is only about 2 weeks old, though I did run similar tests on the systems for 8 days after the new install and was unable to make them crash.
From: Andrew Mann [mailto:amann at mythicentertainment.com]
Sent: Tuesday, October 07, 2003 12:47 PM
To: linux-poweredge at dell.com
Cc: Matt Domsch; deanna_bonds at adaptec.com; alan at redhat.com
Subject: RedHat 9 aacraid - system fails under extreme disk IO - Reproducible test case
This has been brought up on the Dell Linux Poweredge list previously,
but it doesn't appear that a definitive solution or a reproducible
test case has been presented. It also seems that the previous reports
involved both heavy disk IO and heavy network traffic, and so the
NIC driver was suspect.
Since we have a number of 2550s and 2650s using the onboard PERC3/Di
raid controller (aacraid driver), this issue concerns us.
The following script was run with 6 instances at once on two 2550s and
one 2650:
2550:
2 x P3 1.2 GHz kernel: 2.4.20-20.9smp #1 SMP
1GB of RAM, 2GB of swap, 2 x 18 GB drives in a RAID 1 configuration
2650:
2 x Xeon 2.2 GHz kernel: 2.4.20-20.9smp #1 SMP
2GB of RAM, 2GB of swap, 2 x 18 GB drives in a RAID 1 configuration
The 2550s fail within 30 minutes of starting the tests each time (tests
were run 6 times in a row). The 2650 failed prior to 2.5 days (only 1
test run due to duration before failure). In some cases the 2550
displayed a null pointer dereference in the kernel. I'll copy down
details next time I can catch it on screen. It does not get logged to
disk, which doesn't surprise me in this situation. In most cases the
screen was blank (due to APM I'd guess?).
The systems still respond to pings, but do not respond to keyboard
actions and do not complete any TCP connections. These systems do not
have a graphical desktop installed, and in fact have only a fairly
minimal set of packages installed.
I don't know why the 2550 would consistently fail in such a brief
period while the 2650 would take a much longer time before failure.
I've been running the same tests on a 1750 (PERC4/Di - Megaraid based)
for some days now without a failure.
I plan on testing a non-SMP kernel on the 2550 next - not because we
can run things that way, but to maybe give some more clues.
The following script creates a 300 MB file, then rm's it, then does it
all over again. For my tests I ran 6 of these concurrently. Don't
expect the system to respond to much while these are running, though I
was able to get decent updates from top.
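Alter the script as you see fit; I'm no guru with bash scripting!
#!/bin/bash
# Repeatedly write a file of $MEGS megabytes to /test, then delete it.
MEGS=300
TOTAL=0
while [ "1" != "0" ]; do
    # dd reports its statistics on stderr; discard them.
    dd ibs=1048576 count=$MEGS if=/dev/zero of=/test/diskgrind.$$ 2>&1 | cat >/dev/null
    rm -f /test/diskgrind.$$
    TOTAL=`expr $TOTAL + $MEGS`
    echo "[$$] Completed $TOTAL megs."
done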
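
To start several copies of the loop above at once, something like the following could be used. This is only a sketch: the function names launch_instances and stop_instances and the file name ./diskgrind.sh are made up for illustration and are not part of the original test setup.

```shell
#!/bin/bash
# Hypothetical helper: start N background copies of a command and
# remember their PIDs so they can all be stopped together.
PIDS=""

launch_instances() {
    count=$1
    shift
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" &                 # run the given command in the background
        PIDS="$PIDS $!"        # $! is the PID of the job just started
        i=`expr $i + 1`
    done
}

stop_instances() {
    for pid in $PIDS; do
        kill "$pid" 2>/dev/null || true
    done
    wait                       # reap all background children
    PIDS=""
}

# Example (left commented out so nothing starts by accident), assuming
# the loop above was saved as ./diskgrind.sh (hypothetical name):
#   launch_instances 6 ./diskgrind.sh
#   trap stop_instances INT TERM
#   wait
```

Note that stopping a loop this way only kills the shell running it; a dd that is already in progress will finish on its own and may leave a /test/diskgrind.* file behind.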