RedHat 9 aacraid - system fails under extreme disk IO - Reproducible test case

Thomas Petersen tomp at securityminded.net
Tue Oct 7 19:55:01 CDT 2003


I am pretty disappointed in Dell for failing to follow up on this and
resolve the issue once and for all.  This is not a new problem, and it is
Dell's responsibility to rectify it because they -certify- Redhat on the
2650.  Regardless of whether it's a hardware or software issue, Dell is
responsible to their customers.

If this were an issue on the Microsoft platform, you can bet Dell would have
worked with Microsoft and issued a patch/update long before it became a
widespread problem.  I have always been a huge fan of Dell equipment, but
their failure in this instance to support what they sell is very troubling.

Don't get me wrong, I will probably purchase Dell servers again in the future
(though not the 2650).  But can anyone name one problem of this magnitude,
affecting the Microsoft platform and related to Dell hardware, that went
unresolved for as long as this one has?  System lockups are -totally-
unacceptable.

I guess when people start choosing with their checkbooks Dell might wake up.

Thomas Petersen
SecurityMinded Technologies 

>>-----Original Message-----
>>From: Andrew Mann [mailto:amann at mythicentertainment.com] 
>>Sent: Tuesday, October 07, 2003 6:20 PM
>>To: linux-poweredge at dell.com
>>Cc: mark_salyzyn at adaptec.com
>>Subject: Re: RedHat 9 aacraid - system fails under extreme disk IO - Reproducible test case
>>
>>
>>	Unfortunately we've got a good number of 2550s and 2650s in use, and
>>replacing the RAID cards isn't ideal.  Mostly we don't have enough load
>>to cause this problem, but every now and then we do get an unexplained
>>lockup that pulls someone out of bed at 2 AM.
>>	I searched back through the reports of this and found some posts from
>>Mark Salyzyn referencing AAC_NUM_FIB and AAC_NUM_IO_FIB settings.  The
>>last comment I see is on 9/9/2003:
>>"I am suggesting that this value be (AAC_NUM_IO_FIB+64), and limited to
>>below 512 (the maximum number of hardware FIBS the Firmware can absorb).
>>I will begin testing the stability and side effects of this input."
>>	However, I don't see any followup, nor does the latest patchset to the
>>2.4 series seem to contain any modifications in this area (or 2.5 or 2.6
>>since June 2003).
>>	Additionally, I've just rebuilt the aacraid module here from the RedHat
>>SRPM of 2.4.20-20.9 with AAC_NUM_FIB=512 and AAC_NUM_IO_FIB=448, rebuilt
>>the rdimage and such and got another crash within 5 minutes of starting
>>the test.
>>
>>	I also see a note from Mark on 8/27/2003:
>>-----
>>There is code that does the following in the driver:
>>
>>	scsicmd->result = DID_OK << 16 | COMMAND_COMPLETE << 8 | SAM_STAT_TASK_SET_FULL;
>>	aac_io_done(scsicmd);
>>	return -1;
>>
>>This is *wrong*, because the non-zero return causes the system to hold
>>the command in the queue due to the use of the new error handler, yet we
>>have also completed the command as `BUSY' *and*, as a result of the
>>constraints of the aac_io_done call which relocks (on io_request_lock),
>>the caller had to unlock, leaving a hole that SMP machines fill. By
>>dropping the result and done calls in these situations, and holding the
>>locks in the caller of such routines, I believe we will close this hole.
>>
>>....
>>
>>I will report back on my tests of these changes, but will need a 
>>volunteer with kernel compile experience to report on the success in 
>>resolving this issue in the field *please*.
>>-----
>>
>>	I'm not familiar enough with the aacraid driver or scsi in general to
>>gather the code changes necessary.  There also don't appear to be any
>>followups.
>>
>>	Mark, do you have any updates on this?  I can make code changes,
>>recompile, and run a test case that reliably reveals the problem here if
>>that's helpful.
>>
>>
>>I can't see the full panic message, but the parts I can see are 
>>basically (copied by hand):
>>
>>CPU 1
>>EFLAGS: 00010086
>>
>>EIP is at rmqueue [kernel] 0x127  (2.4.20-20.9smp)
>>eax: c0343400    ebx: c03445dc    ecx: 00000000
>>edx: b6d7ca63    esi: 00000000    edi: c03445d0
>>ebp: 00038000    esp: ee643e80     ds: 0068
>>es: 0068  ss: 0068
>>
>>Process dd (pid: 956, stack page = ee643000)
>>
>>Call trace:   wakeup_kswapd   0xfb (0xee643e90)
>>               __alloc_pages_limit  0x57
>>               __alloc_pages        0x101
>>               generic_file_write   0x394
>>               ext3_file_write      0x39
>>               sys_write            0x97
>>               system_call          0x33
>>
>>	Although aacraid isn't directly implicated here, I can reproduce this
>>on the 2550s and 2650s (aacraid) but not 1750s (megaraid).
>>
>>Andrew
>>
>>Paul Anderson wrote:
>>
>>> We had this same issue with our 2650s running AS 2.1.  I don't know
>>> that this is the best answer, but it is the one that worked for
>>> us: replace the onboard adapter with a PERC 3/DC (LSI) adapter.
>>> Make sure that you put it on its own bus; we used slot three.  In 2 of
>>> our 2650s we are even running this with the HBAs for SAN
>>> connectivity.  That said, our solution is about 2 weeks old, though I
>>> did run similar tests on the systems after the new install for 8 days
>>> and was unable to make them crash.
>>> 
>>> Paul
>>> 
>>> -----Original Message-----
>>> From: Andrew Mann [mailto:amann at mythicentertainment.com]
>>> Sent: Tuesday, October 07, 2003 12:47 PM
>>> To: linux-poweredge at dell.com
>>> Cc: Matt Domsch; deanna_bonds at adaptec.com; alan at redhat.com
>>> Subject: RedHat 9 aacraid - system fails under extreme disk IO - Reproducible test case
>>> 
>>> 
>>> 	This has been brought up on the Dell Linux Poweredge list previously,
>>> but it doesn't appear that a definitive solution or reproducible
>>> situation has been presented.  It also seems like the previous reports
>>> involved both heavy disk IO as well as heavy network traffic, and so the
>>> NIC driver was suspect.
>>> 	Since we have a number of 2550s and 2650s using the onboard PERC3/Di
>>> raid controller (aacraid driver), this issue concerns us.
>>> 
>>> 	The following script was run with 6 instances at once on two 2550s
>>> and one 2650.
>>> 
>>> 2550 configuration
>>> 2 x P3 1.2 GHz  kernel: 2.4.20-20.9smp #1 SMP
>>> 1GB of RAM, 2GB of swap, 2 x 18 GB drives in a raid 1 configuration
>>> 
>>> 2650 configuration
>>> 2 x Xeon 2.2 GHz   kernel: 2.4.20-20.9smp #1 SMP
>>> 2GB of RAM, 2GB of swap, 2 x 18 GB drives in a raid 1 configuration
>>> Hyperthreading enabled
>>> 
>>> 
>>> 	The 2550s fail within 30 minutes of starting the tests each time
>>> (tests were run 6 times in a row).  The 2650 failed in under 2.5 days
>>> (only 1 test run due to duration before failure).  In some cases the 2550
>>> displayed a null pointer dereference in the kernel.  I'll copy down
>>> details next time I can catch it on screen.  It does not get logged to
>>> disk, which doesn't surprise me in this situation.  In most cases the
>>> screen was blank (due to APM I'd guess?).
>>> 	The systems still respond to pings, but do not respond to keyboard
>>> actions and do not complete any TCP connections.  These systems do not
>>> have a graphical desktop installed, and in fact have a fairly minimal
>>> set of packages installed at all.
>>> 	I don't know why the 2550 would consistently fail in such a brief
>>> period while the 2650 would take a much longer time before failure.
>>> I've been running the same tests on a 1750 (PERC4/Di - Megaraid based)
>>> for some days now without a failure.
>>> 	I plan on testing a non-SMP kernel on the 2550 next - not because we
>>> can run things that way, but to maybe give some more clues.
>>> 
>>> 	The following script creates a 300 MB file, then rm's it, then does it
>>> all over again.  For my tests I ran 6 of these concurrently.  Don't
>>> expect the system to respond to much while these are running, though I
>>> was able to get decent updates from top.
>>> 	Alter the script as you see fit, I'm no guru with bash scripting!
>>> 
>>> cat diskgrind.sh
>>> #!/bin/sh
>>> 
>>> 
>>> MEGS=300
>>> TOTAL=0
>>> 
>>> while [ "1" != "0" ]; do
>>>          dd ibs=1048576 count=$MEGS if=/dev/zero of=/test/diskgrind.$$ \
>>>                  2>&1 | cat >/dev/null
>>>          rm -f /test/diskgrind.$$
>>>          TOTAL=`expr $TOTAL + $MEGS`
>>>          echo "[$$] Completed $TOTAL megs."
>>> done
>>> 
>>> 
>>> ./diskgrind.sh &
>>> ./diskgrind.sh &
>>> ./diskgrind.sh &
>>> ./diskgrind.sh &
>>> ./diskgrind.sh &
>>> ./diskgrind.sh &
>>> 
>>> 
>>> 
>>> Andrew
>>> 
>>
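
For anyone who wants to try the quoted diskgrind.sh loop without leaving
it running forever, a bounded variant might look like the sketch below.
This is not the script used in the tests reported above.  The function
name grind, its instance-id argument, and the variables GRIND_DIR,
GRIND_MEGS and GRIND_PASSES are hypothetical names introduced here for
illustration.  It keeps the same dd and rm steps, but stops after a fixed
number of passes and reports a dd failure instead of discarding it.

```shell
#!/bin/sh
# Sketch only: a bounded variant of the quoted diskgrind.sh loop.
# grind, GRIND_DIR, GRIND_MEGS and GRIND_PASSES are hypothetical names;
# they do not appear in the original script.

grind() {
    id=${1:-0}                         # instance id, keeps file names unique
    dir=${GRIND_DIR:-/test}            # directory on the array under test
    megs=${GRIND_MEGS:-300}            # size of each file in MB
    passes=${GRIND_PASSES:-10}         # number of write/delete passes
    file="$dir/diskgrind.$$.$id"
    total=0
    pass=0

    while [ "$pass" -lt "$passes" ]; do
        # Same dd invocation as the original: read 1 MB input blocks of zeros.
        if ! dd ibs=1048576 count="$megs" if=/dev/zero of="$file" 2>/dev/null; then
            echo "[$$.$id] dd failed on pass $pass" >&2
            rm -f "$file"
            return 1
        fi
        rm -f "$file"
        total=$((total + megs))
        pass=$((pass + 1))
        echo "[$$.$id] Completed $total megs."
    done
}
```

To approximate the six concurrent instances described above, one could
call the function six times in the background from the same shell, for
example: for i in 1 2 3 4 5 6; do grind "$i" & done; wait.  The instance
id matters here because $$ is the same for every background call made
from one shell, unlike six separate ./diskgrind.sh processes.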
>>-- 
>>Andrew Mann
>>Systems Administrator
>>Mythic Entertainment
>>703-934-0446 x 224