RedHat 9 aacraid - system fails under extreme disk IO - Reproducible test case

Andrew Mann amann at mythicentertainment.com
Tue Oct 7 11:50:00 CDT 2003


	This has been brought up on the Dell Linux PowerEdge list previously, 
but it doesn't appear that a definitive solution or a reproducible 
test case has been presented.  It also seems that the previous reports 
involved both heavy disk IO and heavy network traffic, so the 
NIC driver was suspected.
	Since we have a number of 2550s and 2650s using the onboard PERC3/Di 
RAID controller (aacraid driver), this issue concerns us.

	The following script was run with 6 instances at once on two 2550s and 
one 2650.

2550 configuration
2 x P3 1.2 GHz  kernel: 2.4.20-20.9smp #1 SMP
1 GB of RAM, 2 GB of swap, 2 x 18 GB drives in a RAID 1 configuration

2650 configuration
2 x Xeon 2.2 GHz   kernel: 2.4.20-20.9smp #1 SMP
2 GB of RAM, 2 GB of swap, 2 x 18 GB drives in a RAID 1 configuration
Hyperthreading enabled


	The 2550s failed within 30 minutes of starting the tests every time 
(the tests were run 6 times in a row).  The 2650 failed in under 2.5 
days (only 1 test run, because of the time it took to fail).  In some 
cases the 2550 displayed a null pointer dereference in the kernel.  I'll 
copy down the details next time I can catch it on screen.  It does not 
get logged to disk, which doesn't surprise me in this situation.  In 
most cases the screen was blank (due to APM, I'd guess?).
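
If the blank screen is ordinary console blanking rather than APM (an assumption; the cause was not confirmed), turning blanking off before a run should keep a kernel oops visible.  A minimal sketch using setterm from util-linux; the function name keep_console_visible and the default of /dev/tty1 are choices made for this example only:

```shell
#!/bin/sh
# Sketch only. Assumes the text console is a Linux virtual terminal and
# that setterm (util-linux) is installed; run as root.
# keep_console_visible is a hypothetical helper name, not part of any package.
keep_console_visible() {
    tty=${1:-/dev/tty1}
    if ! command -v setterm >/dev/null 2>&1; then
        echo "setterm not found" >&2
        return 1
    fi
    # -blank 0 disables the screen-blanking timer; -powersave off disables
    # VESA power saving.  setterm works by printing escape sequences, so
    # its output has to be sent to the console that should stay lit.
    TERM=linux setterm -blank 0 -powersave off >"$tty"
}
```

Calling keep_console_visible with no argument targets the first virtual console; pass another tty device if the oops is expected elsewhere.
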
	The systems still respond to pings, but do not respond to keyboard 
input and do not complete any TCP connections.  These systems do not 
have a graphical desktop installed, and in fact have only a fairly 
minimal set of packages installed.
	I don't know why the 2550 would consistently fail in such a brief 
period while the 2650 would take much longer to fail. 
I've been running the same tests on a 1750 (PERC4/Di, megaraid based) 
for some days now without a failure.
	I plan on testing a non-SMP kernel on the 2550 next, not because we 
can run things that way, but because it may give some more clues.

	The following script creates a 300 MB file, then rm's it, then does it 
all over again.  For my tests I ran 6 of these concurrently.  Don't 
expect the system to respond to much while these are running, though I 
was able to get decent updates from top.
	Alter the script as you see fit; I'm no guru with bash scripting!

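cat diskgrind.sh
#!/bin/sh

# Size of each test file in megabytes, and running total written so far.
MEGS=300
TOTAL=0

# Loop forever: write the file, delete it, report progress.
while true; do
         # dd prints its statistics on stderr; discard all of its output.
         dd ibs=1048576 count=$MEGS if=/dev/zero of=/test/diskgrind.$$ >/dev/null 2>&1
         rm -f /test/diskgrind.$$
         TOTAL=$((TOTAL + MEGS))
         echo "[$$] Completed $TOTAL megs."
done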


./diskgrind.sh &
./diskgrind.sh &
./diskgrind.sh &
./diskgrind.sh &
./diskgrind.sh &
./diskgrind.sh &
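
The six background invocations above could also be started from one small launcher.  What follows is a sketch under assumptions, not the script used in the tests: the function names grind and run_grinders, the argument order, and the convention that 0 passes means "loop forever" are all invented for this example.

```shell
#!/bin/sh
# Sketch only: a parameterised variant of diskgrind.sh.  Function names,
# argument order and the "0 = loop forever" convention are assumptions
# made for this example.

# grind ID DIR MEGS PASSES
# Write a MEGS-megabyte file of zeros into DIR and delete it, PASSES times.
grind() {
    id=$1; dir=$2; megs=$3; passes=$4
    # $$ is the launcher's PID in every background copy, so the instance
    # number is appended to keep the file names distinct.
    file="$dir/diskgrind.$$.$id"
    total=0
    n=0
    while [ "$passes" -eq 0 ] || [ "$n" -lt "$passes" ]; do
        dd ibs=1048576 count="$megs" if=/dev/zero of="$file" >/dev/null 2>&1 \
            || { rm -f "$file"; return 1; }
        rm -f "$file"
        total=$((total + megs))
        n=$((n + 1))
        echo "[$id] Completed $total megs."
    done
    return 0
}

# run_grinders INSTANCES DIR MEGS PASSES
# Start INSTANCES copies of grind in the background and wait for them all.
run_grinders() {
    instances=$1; dir=$2; megs=$3; passes=$4
    pids=
    i=1
    while [ "$i" -le "$instances" ]; do
        grind "$i" "$dir" "$megs" "$passes" &
        pids="$pids $!"
        i=$((i + 1))
    done
    # Report failure if any instance exited with an error.
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}
```

With these definitions, "run_grinders 6 /test 300 0" corresponds to the six instances above and never returns on its own.  A non-zero last argument, for example "run_grinders 2 /tmp 1 3", gives a short bounded run for checking the script itself before pointing it at the RAID volume.
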



Andrew

-- 
Andrew Mann
Systems Administrator
Mythic Entertainment




