RedHat 9 aacraid - system fails under extreme disk IO - Reproducible test case

Paul Anderson Paul.Anderson at
Tue Oct 7 12:36:00 CDT 2003

We had this same issue with our 2650s running AS 2.1.  I don't know that this is the best answer, but it is the one that worked for us: replace the onboard adapter with a PERC 3/DC (LSI) adapter.  Make sure that you put it on its own bus; we used slot three.  In 2 of our 2650s we are even running this alongside the HBAs for SAN connectivity.  That said, our solution is only about 2 weeks old, though I did run similar tests on the systems for 8 days after the new install and was unable to make them crash.


-----Original Message-----
From: Andrew Mann [mailto:amann at]
Sent: Tuesday, October 07, 2003 12:47 PM
To: linux-poweredge at
Cc: Matt Domsch; deanna_bonds at; alan at
Subject: RedHat 9 aacraid - system fails under extreme disk IO -
Reproducible test case

	This has been brought up on the Dell Linux Poweredge list previously, 
but it doesn't appear that a definitive solution or reproducible 
situation has been presented.  It also seems like the previous reports 
involved both heavy disk IO and heavy network traffic, and so the 
NIC driver was suspect.
	Since we have a number of 2550s and 2650s using the onboard PERC3/Di 
RAID controller (aacraid driver), this issue concerns us.

	The following script was run with 6 instances at once on two 2550s and 
one 2650.

2550 configuration
2 x P3 1.2 GHz  kernel: 2.4.20-20.9smp #1 SMP
1 GB of RAM, 2 GB of swap, 2 x 18 GB drives in a RAID 1 configuration

2650 configuration
2 x Xeon 2.2 GHz   kernel: 2.4.20-20.9smp #1 SMP
2 GB of RAM, 2 GB of swap, 2 x 18 GB drives in a RAID 1 configuration
Hyperthreading enabled

	The 2550s failed within 30 minutes of starting the tests each time (tests 
were run 6 times in a row).  The 2650 failed within 2.5 days (only 1 
test run, due to the time it took to fail).  In some cases the 2550 
displayed a null pointer dereference in the kernel.  I'll copy down 
details next time I can catch it on screen.  It does not get logged to 
disk, which doesn't surprise me in this situation.  In most cases the 
screen was blank (due to APM, I'd guess?).
	The systems still respond to pings, but do not respond to keyboard 
input and do not complete any TCP connections.  These systems do not 
have a graphical desktop installed, and in fact have only a fairly 
minimal set of packages installed.
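
	The failure signature described above (pings answered, TCP connections 
never completing) can be told apart from a fully dead box by a small 
watchdog run from another machine.  The following is only a sketch: 
classify_host is a hypothetical helper, and the address, port and probe 
commands in the comments are assumptions rather than anything used in 
the original tests.

```shell
#!/bin/sh
# Sketch only: classify a host from the exit statuses of two probes.
# classify_host is a hypothetical helper, not part of the original tests.
#   $1 = exit status of a ping probe (0 means a reply was received)
#   $2 = exit status of a TCP connect probe (0 means the connection completed)
classify_host() {
    if [ "$1" -ne 0 ]; then
        # No ping reply at all: the host is down or unreachable.
        echo "down"
    elif [ "$2" -ne 0 ]; then
        # Ping replies but TCP does not complete: the hang described above.
        echo "hung"
    else
        echo "ok"
    fi
}

# Example use from a second machine (address, port and probe commands are
# assumptions; adjust them to the tools available on your system):
#   ping -c 1 192.0.2.10 >/dev/null 2>&1; p=$?
#   nc -z -w 5 192.0.2.10 22 >/dev/null 2>&1; t=$?
#   echo "$(date) $(classify_host $p $t)"
```

	Logging the result once a minute would give a rough time of failure 
even when nothing reaches the disk on the machine under test.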
	I don't know why the 2550 would consistently fail in such a brief 
period while the 2650 would take a much longer time before failure. 
I've been running the same tests on a 1750 (PERC4/Di - MegaRAID based) 
for some days now without a failure.
	I plan on testing a non-SMP kernel on the 2550 next - not because we 
can run things that way, but to maybe give some more clues.

	The following script creates a 300 MB file, then rm's it, then does it 
all over again.  For my tests I ran 6 of these concurrently.  Don't 
expect the system to respond to much while these are running, though I 
was able to get decent updates from top.
	Alter the script as you see fit; I'm no guru with bash scripting!



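#!/bin/bash
# MEGS and TOTAL were not set in the script as posted; 300 matches the
# 300 MB file described above, and starting TOTAL at 0 is an assumption.
MEGS=300
TOTAL=0
while [ "1" != "0" ]; do
         # Write $MEGS 1 MB blocks of zeros; dd's status output is discarded.
         dd ibs=1048576 count=$MEGS if=/dev/zero of=/test/diskgrind.$$ 2>&1 | cat >/dev/null
         rm -f /test/diskgrind.$$
         TOTAL=`expr $TOTAL + $MEGS`
         echo "[$$] Completed $TOTAL megs."
done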

./ &
./ &
./ &
./ &
./ &
./ &
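
	The six background launches above can also be written as a loop.  This 
is a minimal sketch, not the launcher used in the original tests: 
launch_workers and WORKER_PIDS are hypothetical names, and the script 
path in the usage comment is an assumption, since the file name is not 
given above.

```shell
#!/bin/sh
# Sketch only: start COUNT background copies of a command and remember
# their PIDs.  launch_workers and WORKER_PIDS are hypothetical names.
#   $1  = number of copies to start
#   $2+ = command (and arguments) to run in each copy
launch_workers() {
    count=$1
    shift
    WORKER_PIDS=""
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" &                            # run one copy in the background
        WORKER_PIDS="$WORKER_PIDS $!"     # $! is the PID of that copy
        i=$((i + 1))
    done
}

# Example use (the script path is an assumption):
#   launch_workers 6 ./diskgrind.sh
#   echo "started:$WORKER_PIDS"
#   kill $WORKER_PIDS     # stop all copies when the test is over
```

	Keeping the PIDs makes it easy to stop every copy at once, which 
matters here because each copy loops forever.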


Andrew Mann
Systems Administrator
Mythic Entertainment
