RedHat 9 aacraid - system fails under extreme disk IO - Reproducible test case

Andrew Mann amann at mythicentertainment.com
Wed Oct 8 10:28:01 CDT 2003


	I upgraded the 3/Di firmware to 2.7-1 build 3571.  Still running RedHat
9 (2.4.20-20.9smp), using aacraid driver 1.1.4-2166.  The problem remains:
kernel panic at 11 minutes.

Andrew

McDougall, Marshall (FSH) wrote:

> I ran 14 iterations of Andrew's script on one of my 2550's for about 20
> hours before I stopped it.  I ran it on a newly installed RHES2.1 with the
> 2.4.9-e.27smp kernel.  I have the 3/DI controller V2.7-1 build 3571 with
> mirrored 18 GB drives.
> 
> Regards, Marshall
> 
> -----Original Message-----
> From: Salyzyn, Mark [mailto:mark_salyzyn at adaptec.com] 
> Sent: Wednesday, October 08, 2003 8:57 AM
> To: 'tomp at securityminded.net'; 'Andrew Mann'
> Cc: linux-poweredge at dell.com
> Subject: RE: RedHat 9 aacraid - system fails under extreme disk IO -
> Reproducible test case
> 
> 
> I have not been able to duplicate this issue, so I am somewhat of a JAFO,
> and am *not* a definitive resource.
> 
> This issue is not just one problem.  The noapic kernel option and turning
> off HyperThreading have resolved some of the reported issues.  Driver
> changes thus far cannot eliminate the problem, but can delay the
> inevitable.  Build 3157 of the Firmware appears to work fine; Build 3170
> fails, but only with certain Seagate 15K rpm U320 drives.
> 
> I may be wrong ... any corrections to my assumptions above would be greatly
> appreciated.
> 
> Sincerely -- Mark Salyzyn
> 
> -----Original Message-----
> From: Thomas Petersen [mailto:tomp at securityminded.net]
> Sent: Tuesday, October 07, 2003 8:52 PM
> To: 'Andrew Mann'
> Cc: linux-poweredge at dell.com; Salyzyn, Mark
> Subject: RE: RedHat 9 aacraid - system fails under extreme disk IO -
> Reproducible test case
> 
> 
> I am pretty disappointed in Dell for failing to follow up on this and
> resolve the issue once and for all.  This is not a new problem, and it is
> Dell's responsibility to rectify it, as they -certify- Redhat on the 2650 --
> regardless of whether it's a hardware or software issue, Dell is
> responsible to their customers.
> 
> If this were an issue on the Microsoft platform, you can bet Dell would
> have worked with Microsoft and issued a patch/update long before it became
> a widespread problem.  I have always been a huge fan of Dell equipment,
> but their failure in this instance to support what they sell is very
> troubling.
> 
> Don't get me wrong, I will probably purchase Dell servers again in the
> future (though not the 2650), but can anyone name one problem of this
> magnitude, related to Dell hardware on the Microsoft platform, that went
> unresolved for as long as this one has?  System lockups are -totally-
> unacceptable.
> 
> I guess when people start choosing with their checkbooks Dell might wake up.
> 
> Thomas Petersen
> SecurityMinded Technologies 
> 
> 
>>>-----Original Message-----
>>>From: Andrew Mann [mailto:amann at mythicentertainment.com] 
>>>Sent: Tuesday, October 07, 2003 6:20 PM
>>>To: linux-poweredge at dell.com
>>>Cc: mark_salyzyn at adaptec.com
>>>Subject: Re: RedHat 9 aacraid - system fails under extreme 
>>>disk IO - Reproducible test case
>>>
>>>
>>>	Unfortunately we've got a good number of 2550s and 2650s in use, and
>>>replacing the RAID cards isn't ideal.  Mostly we don't have enough load
>>>to cause this problem, but every now and then we do get an unexplained
>>>lockup that pulls someone out of bed at 2 AM.
>>>	I searched back through the reports of this and found some posts
>>>from Mark Salyzyn referencing AAC_NUM_FIB and AAC_NUM_IO_FIB settings.
>>>The last comment I see is on 9/9/2003:
>>>"I am suggesting that this value be (AAC_NUM_IO_FIB+64), and limited to
>>>below 512 (the maximum number of hardware FIBS the Firmware can absorb).
>>>I will begin testing the stability and side effects of this input."
>>>	However, I don't see any followup, nor does the latest patchset to
>>>the 2.4 series seem to contain any modifications in this area (or 2.5
>>>or 2.6 since June 2003).
>>>	Additionally, I've just rebuilt the aacraid module here from the
>>>RedHat SRPM of 2.4.20-20.9 with AAC_NUM_FIB=512 and AAC_NUM_IO_FIB=448,
>>>rebuilt the ramdisk image and such, and got another crash within 5
>>>minutes of starting the test.
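
For anyone repeating that rebuild, the change amounts to retuning two
macros in the driver header.  A minimal sketch, assuming the 2.4-era
aacraid.h where AAC_NUM_FIB and AAC_NUM_IO_FIB are defined -- the values
are the ones tested above, and the (AAC_NUM_IO_FIB + 64) relationship and
the 512-FIB ceiling come from Mark's quoted suggestion, so treat this as
illustrative, not a confirmed patch:

        /* Hypothetical retuning, per the figures quoted above */
        #define AAC_NUM_IO_FIB  448                     /* FIBs reserved for I/O */
        #define AAC_NUM_FIB     (AAC_NUM_IO_FIB + 64)   /* = 512 */

        #if AAC_NUM_FIB > 512
        #error "AAC_NUM_FIB exceeds the 512 hardware FIBs the firmware can absorb"
        #endif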
>>>
>>>	I also see a note from Mark on 8/27/2003:
>>>-----
>>>There is code that does the following in the driver:
>>>
>>>	scsicmd->result = DID_OK << 16 | COMMAND_COMPLETE << 8 
>>>| SAM_STAT_TASK_SET_FULL;
>>>	aac_io_done(scsicmd);
>>>	return -1;
>>>
>>>This is *wrong*: the non-zero return causes the system to hold the
>>>command in the queue (due to the use of the new error handler), yet we
>>>have also completed the command as `BUSY'.  In addition, the aac_io_done
>>>call relocks io_request_lock, which the caller had to unlock, leaving a
>>>hole that SMP machines fill.  By dropping the result and done calls in
>>>these situations, and holding the locks in the caller of such routines,
>>>I believe we will close this hole.
>>>
>>>....
>>>
>>>I will report back on my tests of these changes, but will need a 
>>>volunteer with kernel compile experience to report on the success in 
>>>resolving this issue in the field *please*.
>>>-----
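
To make that note concrete, here is a minimal sketch of the pattern being
described -- not Mark's actual patch.  The idea is that the command is
either completed or left queued for retry, never both, with
io_request_lock held by the caller throughout:

        /* Buggy path (quoted above): the command is completed as BUSY
         * *and* the non-zero return tells the mid-layer to hold it in
         * the queue; aac_io_done() also re-takes io_request_lock, which
         * the caller had to drop first, opening a window on SMP. */
        scsicmd->result = DID_OK << 16 | COMMAND_COMPLETE << 8
                        | SAM_STAT_TASK_SET_FULL;
        aac_io_done(scsicmd);
        return -1;

        /* Sketched fix, per the note: no result, no done call -- just
         * return non-zero so the mid-layer retries the still-queued
         * command, and keep io_request_lock held in the caller across
         * this whole path. */
        return -1;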
>>>
>>>	I'm not familiar enough with the aacraid driver or SCSI in general
>>>to gather the code changes necessary.  There also don't appear to be
>>>any followups.
>>>
>>>	Mark, do you have any updates on this?  I can make code changes,
>>>recompile, and run a test case that reliably reveals the problem here
>>>if that's helpful.
>>>
>>>
>>>I can't see the full panic message, but the parts I can see are 
>>>basically (copied by hand):
>>>
>>>CPU 1
>>>EFLAGS: 00010086
>>>
>>>EIP is at rmqueue [kernel] 0x127  (2.4.20-20.9smp)
>>>eax: c0343400    ebx: c03445dc    ecx: 00000000
>>>edx: b6d7ca63    esi: 00000000    edi: c03445d0
>>>ebp: 00038000    esp: ee643e80     ds: 0068
>>>es: 0068  ss: 0068
>>>
>>>Process dd (pid: 956, stack page = ee643000)
>>>
>>>Call trace:   wakeup_kswapd   0xfb (0xee643e90)
>>>              __alloc_pages_limit  0x57
>>>              __alloc_pages        0x101
>>>              generic_file_write   0x394
>>>              ext3_file_write      0x39
>>>              sys_write            0x97
>>>              system_call          0x33
>>>
>>>	Although aacraid isn't directly implicated here, I can reproduce
>>>this on the 2550s and 2650s (aacraid) but not on 1750s (megaraid).
>>>
>>>Andrew
>>>
>>>Paul Anderson wrote:
>>>
>>>
>>>>We had this same issue with our 2650's running AS 2.1.  Don't know
>>>>that this is the best answer, but it is the one that worked for
>>>>us...replace the on-board adapter with a PERC 3/DC (LSI) adapter.
>>>>Make sure that you put it on its own bus; we used slot three.  In 2
>>>>of our 2650's we are even running this with the HBA's for SAN
>>>>connectivity.  That said, our solution is about 2 weeks old, though
>>>>I did run similar tests on the systems after the new install for 8
>>>>days and was unable to make them crash.
>>>>
>>>>Paul
>>>>
>>>>-----Original Message-----
>>>>From: Andrew Mann [mailto:amann at mythicentertainment.com]
>>>>Sent: Tuesday, October 07, 2003 12:47 PM
>>>>To: linux-poweredge at dell.com
>>>>Cc: Matt Domsch; deanna_bonds at adaptec.com; alan at redhat.com
>>>>Subject: RedHat 9 aacraid - system fails under extreme disk IO - 
>>>>Reproducible test case
>>>>
>>>>
>>>>	This has been brought up on the Dell Linux PowerEdge list
>>>>previously, but it doesn't appear that a definitive solution or
>>>>reproducible situation has been presented.  It also seems like the
>>>>previous reports involved both heavy disk IO and heavy network
>>>>traffic, and so the NIC driver was suspect.
>>>>	Since we have a number of 2550s and 2650s using the onboard
>>>>PERC3/Di RAID controller (aacraid driver), this issue concerns us.
>>>>
>>>>	The following script was run with 6 instances at once on two
>>>>2550s and one 2650.
>>>>
>>>>2550 configuration
>>>>2 x P3 1.2 GHz, kernel 2.4.20-20.9smp #1 SMP
>>>>1 GB of RAM, 2 GB of swap, 2 x 18 GB drives in a RAID 1 configuration
>>>>
>>>>2650 configuration
>>>>2 x Xeon 2.2 GHz, kernel 2.4.20-20.9smp #1 SMP
>>>>2 GB of RAM, 2 GB of swap, 2 x 18 GB drives in a RAID 1 configuration
>>>>Hyperthreading enabled
>>>>
>>>>
>>>>	The 2550s fail within 30 minutes of starting the tests each time
>>>>(tests were run 6 times in a row).  The 2650 failed in under 2.5 days
>>>>(only 1 test run due to duration before failure).  In some cases the
>>>>2550 displayed a null pointer dereference in the kernel.  I'll copy
>>>>down details next time I can catch it on screen.  It does not get
>>>>logged to disk, which doesn't surprise me in this situation.  In most
>>>>cases the screen was blank (due to APM I'd guess?).
>>>>	The systems still respond to pings, but do not respond to keyboard
>>>>actions and do not complete any TCP connections.  These systems do
>>>>not have a graphical desktop installed, and in fact have a fairly
>>>>minimal set of packages installed at all.
>>>>	I don't know why the 2550 would consistently fail in such a brief
>>>>period while the 2650 would take a much longer time before failure.
>>>>I've been running the same tests on a 1750 (PERC4/Di - Megaraid
>>>>based) for some days now without a failure.
>>>>	I plan on testing a non-SMP kernel on the 2550 next - not because
>>>>we can run things that way, but to maybe give some more clues.
>>>>
>>>>	The following script creates a 300 MB file, then rm's it, then
>>>>does it all over again.  For my tests I ran 6 of these concurrently.
>>>>Don't expect the system to respond to much while these are running,
>>>>though I was able to get decent updates from top.
>>>>	Alter the script as you see fit; I'm no guru with bash scripting!
>>>>
>>>>cat diskgrind.sh
>>>>#!/bin/sh
>>>>
>>>>MEGS=300
>>>>TOTAL=0
>>>>
>>>>while [ "1" != "0" ]; do
>>>>        dd ibs=1048576 count=$MEGS if=/dev/zero of=/test/diskgrind.$$ 2>&1 | cat >/dev/null
>>>>        rm -f /test/diskgrind.$$
>>>>        TOTAL=`expr $TOTAL + $MEGS`
>>>>        echo "[$$] Completed $TOTAL megs."
>>>>done
>>>>
>>>>
>>>>./diskgrind.sh &
>>>>./diskgrind.sh &
>>>>./diskgrind.sh &
>>>>./diskgrind.sh &
>>>>./diskgrind.sh &
>>>>./diskgrind.sh &
>>>>
>>>>
>>>>
>>>>Andrew
>>>>
>>>
>>>-- 
>>>Andrew Mann
>>>Systems Administrator
>>>Mythic Entertainment
>>>703-934-0446 x 224
>>>
> 

-- 
Andrew Mann
Systems Administrator
Mythic Entertainment
703-934-0446 x 224



