RedHat 9 aacraid - system fails under extreme disk IO - Reproducible test case

Andrew Mann amann at mythicentertainment.com
Wed Oct 8 11:48:01 CDT 2003


	I'm using two 2550 test systems, and one 2650 test system (and a 1750, 
but that's not aacraid and doesn't crash).  The 2650 crashes less 
frequently.  These systems have all been running fine, this isn't a 
"normal use" crash for us, but rather an intentional attempt to crash 
the system under high load to reproduce the problems described on the list.
	We have one 2550 in service as a high load web server that has crashed 
infrequently (1-2 times per month) with similar symptoms.
	I'm not sure hardware isn't the cause, but if it is then it's a 
hardware problem that has persisted from the 2550 to the 2650 line - 
which probably rules out a bad run/shipment of components.
	OMSA is not installed.  Nothing outside of the RedHat 9 distro is 
installed, and most of that isn't even installed.
	Unfortunately this doesn't appear to be a clear problem with a single 
element.  While the presence of an aacraid based controller (or at least 
the PERC 3/Di) seems necessary, it could just be that some aspect of 
the way the driver functions under high load brings out a problem 
elsewhere in the kernel (in the RedHat 9 series kernels at least).
	Since Suse 8.X and RedHat AS 2.1 don't appear to suffer from the same 
problem, this implies that a change of software can at least work around 
the problem - if the problem isn't with the software itself.
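	As an aside, the stress script from further down the thread can be 
written a bit more defensively.  This is only a sketch, not the exact 
script we ran: the TESTDIR, MEGS, and LOOPS knobs and the small bounded 
defaults are additions here (the original wrote 300 MB files to /test 
forever), so it can be dry-run safely before being pointed at a real array:

```shell
#!/bin/sh
# Tidier take on diskgrind.sh -- a sketch, not the exact script used in
# the tests below.  TESTDIR, MEGS, and LOOPS are knobs added here; the
# original wrote 300 MB files to /test in an infinite loop.  The small
# bounded defaults make a safe dry run; raise them for a real stress run.
TESTDIR=${TESTDIR:-/tmp}
MEGS=${MEGS:-4}
LOOPS=${LOOPS:-3}

TOTAL=0
i=0
while [ "$i" -lt "$LOOPS" ]; do
        # Write $MEGS one-megabyte blocks of zeroes, then delete the file.
        dd bs=1048576 count="$MEGS" if=/dev/zero of="$TESTDIR/diskgrind.$$" 2>/dev/null
        rm -f "$TESTDIR/diskgrind.$$"
        TOTAL=$((TOTAL + MEGS))
        i=$((i + 1))
        echo "[$$] Completed $TOTAL megs."
done
```

As with the original, start several copies in the background to generate 
the load.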

Andrew

McDougall, Marshall (FSH) wrote:

> Have you ruled out h/w as a cause?  I see you say you have many 2550's; has
> it happened on every one of them at some point in time?  I would reseat all
> the disks, cards, ram, and p/s in the current test box just to rule out a
> flakey connection.  Are all the fans running properly?  Did you take
> delivery of a bunch of Monday servers?  Do you have Open Manage installed?
> I ask these questions because there are some that have these problems and
> some that don't.  To my way of thinking, if it were broken software, it
> would be more prevalent.
> 
> My $.02
> 
> Regards, Marshall
> 
> -----Original Message-----
> From: Andrew Mann [mailto:amann at mythicentertainment.com] 
> Sent: Wednesday, October 08, 2003 10:25 AM
> To: McDougall, Marshall (FSH)
> Cc: 'Salyzyn, Mark'; linux-poweredge at dell.com
> Subject: Re: RedHat 9 aacraid - system fails under extreme disk IO -
> Reproducible test case
> 
> 
> 	I upgraded the 3/Di firmware to 2.7-1 build 3571.  Still running
> RedHat 
> 9 (2.4.20-20.9smp), using aacraid driver 1.1.4-2166.  Problem remains. 
> Kernel panic @ 11 minutes.
> 
> Andrew
> 
> McDougall, Marshall (FSH) wrote:
> 
> 
>>I ran 14 iterations of Andrew's script on one of my 2550's for about 20
>>hours before I stopped it.  I ran it on a newly installed RHES2.1 with the
>>2.4.9-e.27smp kernel.  I have the 3/DI controller V2.7-1 build 3571 with
>>mirrored 18 GB drives.
>>
>>Regards, Marshall
>>
>>-----Original Message-----
>>From: Salyzyn, Mark [mailto:mark_salyzyn at adaptec.com] 
>>Sent: Wednesday, October 08, 2003 8:57 AM
>>To: 'tomp at securityminded.net'; 'Andrew Mann'
>>Cc: linux-poweredge at dell.com
>>Subject: RE: RedHat 9 aacraid - system fails under extreme disk IO -
>>Reproducible test case
>>
>>
>>I have not been able to duplicate this issue, so I am somewhat of a JAFO,
>>and am *not* a definitive resource.
>>
>>This issue is not just one problem. The noapic kernel option and turning off
>>HyperThreading have resolved some of the reported issues. Driver changes
>>thus far cannot eliminate the problem, but can delay the inevitable. Build
>>3157 of the Firmware appears to work fine; Build 3170 fails, but only with
>>certain Seagate 15K rpm U320 drives. 
>>
>>I may be wrong ... any corrections to my assumptions above would be greatly
>>appreciated.
>>
>>Sincerely -- Mark Salyzyn
>>
>>-----Original Message-----
>>From: Thomas Petersen [mailto:tomp at securityminded.net]
>>Sent: Tuesday, October 07, 2003 8:52 PM
>>To: 'Andrew Mann'
>>Cc: linux-poweredge at dell.com; Salyzyn, Mark
>>Subject: RE: RedHat 9 aacraid - system fails under extreme disk IO -
>>Reproducible test case
>>
>>
>>I am pretty disappointed in Dell for failing to follow up on this and
>>resolve the issue once and for all.  This is not a new problem, but it is
>>Dell's responsibility to rectify it, as they -certify- Redhat on the 2650 --
>>regardless of whether it's a hardware or software issue, Dell is responsible
>>to their customers.  
>>
>>If this was an issue on the Microsoft platform, you can bet Dell would have
>>worked with Microsoft and issued a patch/update long before it became a
>>widespread problem.  I have always been a huge fan of Dell equipment, but
>>their failure in this instance to support what they sell is very troubling. 
>>
>>Don't get me wrong, I will probably purchase Dell servers again in the future
>>(though not the 2650), but can anyone name one problem affecting the
>>Microsoft platform, related to Dell hardware, of this magnitude, that went
>>unresolved for as long as this one has?  System lockups are -totally-
>>unacceptable.  
>>
>>I guess when people start choosing with their checkbooks Dell might wake up.
> 
>>Thomas Petersen
>>SecurityMinded Technologies 
>>
>>
>>
>>>>-----Original Message-----
>>>>From: Andrew Mann [mailto:amann at mythicentertainment.com] 
>>>>Sent: Tuesday, October 07, 2003 6:20 PM
>>>>To: linux-poweredge at dell.com
>>>>Cc: mark_salyzyn at adaptec.com
>>>>Subject: Re: RedHat 9 aacraid - system fails under extreme disk IO -
>>>>Reproducible test case
>>>>
>>>>
>>>>	Unfortunately we've got a good number of 2550s and 2650s in use, and
>>>>replacing the RAID cards isn't ideal.  Mostly we don't have enough load
>>>>to cause this problem, but every now and then we do get an unexplained
>>>>lockup that pulls someone out of bed at 2 AM.
>>>>	I searched back through the reports of this and found some posts from
>>>>Mark Salyzyn referencing AAC_NUM_FIB and AAC_NUM_IO_FIB settings.  The
>>>>last comment I see is on 9/9/2003:
>>>>"I am suggesting that this value be (AAC_NUM_IO_FIB+64), and limited to
>>>>below 512 (the maximum number of hardware FIBS the Firmware can absorb).
>>>>I will begin testing the stability and side effects of this input."
>>>>	However, I don't see any followup, nor does the latest patchset to the
>>>>2.4 series seem to contain any modifications in this area (or 2.5 or 2.6
>>>>since June 2003).
>>>>	Additionally, I've just rebuilt the aacraid module here from the RedHat
>>>>SRPM of 2.4.20-20.9 with AAC_NUM_FIB=512 and AAC_NUM_IO_FIB=448, rebuilt
>>>>the rdimage and such and got another crash within 5 minutes of starting
>>>>the test.
>>>>
>>>>	I also see a note from Mark on 8/27/2003:
>>>>-----
>>>>There is code that does the following in the driver:
>>>>
>>>>	scsicmd->result = DID_OK << 16 | COMMAND_COMPLETE << 8 | SAM_STAT_TASK_SET_FULL;
>>>>	aac_io_done(scsicmd);
>>>>	return -1;
>>>>
>>>>This is *wrong*, because the non-zero return causes the system to hold
>>>>the command in the queue due to the use of the new error handler, yet we
>>>>have also completed the command as `BUSY' *and*, as a result of the
>>>>constraints of the aac_io_done call, which relocks (on io_request_lock),
>>>>the caller had to unlock, leaving a hole that SMP machines fill. By
>>>>dropping the result and done calls in these situations, and holding the
>>>>locks in the caller of such routines, I believe we will close this hole.
>>>>
>>>>....
>>>>
>>>>I will report back on my tests of these changes, but will need a 
>>>>volunteer with kernel compile experience to report on the success in 
>>>>resolving this issue in the field *please*.
>>>>-----
>>>>
>>>>	I'm not familiar enough with the aacraid driver or scsi in general to
>>>>gather the code changes necessary.  There also don't appear to be any
>>>>followups.
>>>>
>>>>	Mark, do you have any updates on this?  I can make code changes,
>>>>recompile, and run a test case that reliably reveals the problem here if
>>>>that's helpful.
>>>>
>>>>
>>>>I can't see the full panic message, but the parts I can see are 
>>>>basically (copied by hand):
>>>>
>>>>CPU 1
>>>>EFLAGS: 00010086
>>>>
>>>>EIP is at rmqueue [kernel] 0x127  (2.4.20-20.9smp)
>>>>eax: c0343400    ebx: c03445dc    ecx: 00000000
>>>>edx: b6d7ca63    esi: 00000000    edi: c03445d0
>>>>ebp: 00038000    esp: ee643e80     ds: 0068
>>>>es: 0068  ss: 0068
>>>>
>>>>Process dd (pid: 956, stack page = ee643000)
>>>>
>>>>Call trace:   wakeup_kswapd   0xfb (0xee643e90)
>>>>             __alloc_pages_limit  0x57
>>>>             __alloc_pages        0x101
>>>>             generic_file_write   0x394
>>>>             ext3_file_write      0x39
>>>>             sys_write            0x97
>>>>             system_call          0x33
>>>>
>>>>	Although aacraid isn't directly implicated here, I can reproduce this
>>>>on the 2550s and 2650s (aacraid) but not 1750s (megaraid).
>>>>
>>>>Andrew
>>>>
>>>>Paul Anderson wrote:
>>>>
>>>>
>>>>
>>>>>We had this same issue with our 2650's running AS 2.1.  Don't know 
>>>>>that this is the best answer, but it is the one that worked for 
>>>>>us...Replace the on board adapter with a PERC 3/DC (LSI) adapter.  
>>>>>Make sure that you put it on its own bus, we used slot three.  In 2 of 
>>>>>our 2650's we are even running this with the HBA's for SAN 
>>>>>connectivity.  That said, our solution is about 2 weeks old, though I 
>>>>>did run similar tests on the systems after the new install for 8 days 
>>>>>and was unable to make them crash.
>>>>>
>>>>>Paul
>>>>>
>>>>>-----Original Message-----
>>>>>From: Andrew Mann [mailto:amann at mythicentertainment.com]
>>>>>Sent: Tuesday, October 07, 2003 12:47 PM
>>>>>To: linux-poweredge at dell.com
>>>>>Cc: Matt Domsch; deanna_bonds at adaptec.com; alan at redhat.com
>>>>>Subject: RedHat 9 aacraid - system fails under extreme disk IO - 
>>>>>Reproducible test case
>>>>>
>>>>>
>>>>>	This has been brought up on the Dell Linux Poweredge list previously,
>>>>>but it doesn't appear that a definitive solution or reproducible
>>>>>situation has been presented.  It also seems like the previous reports
>>>>>involved both heavy disk IO as well as heavy network traffic, and so the
>>>>>NIC driver was suspect.
>>>>>	Since we have a number of 2550s and 2650s using the onboard PERC3/Di
>>>>>raid controller (aacraid driver), this issue concerns us.
>>>>>
>>>>>	The following script was run with 6 instances at once on two 2550s
>>>>>and one 2650.
>>>>>
>>>>>2550 configuration
>>>>>2 x P3 1.2 Ghz  kernel: 2.4.20-20.9smp #1 SMP
>>>>>1GB of ram, 2GB of swap, 2 x 18 GB drives in a raid 1 configuration
>>>>>
>>>>>2650 configuration
>>>>>2 x Xeon 2.2 Ghz   kernel: 2.4.20-20.9smp #1 SMP
>>>>>2GB of ram, 2GB of swap, 2 x 18 GB drives in a raid 1 configuration 
>>>>>Hyperthreading enabled
>>>>>
>>>>>
>>>>>	The 2550s fail within 30 minutes of starting the tests each time (tests
>>>>>were run 6 times in a row).  The 2650 failed prior to 2.5 days (only 1
>>>>>test run due to duration before failure).  In some cases the 2550
>>>>>displayed a null pointer dereference in the kernel.  I'll copy down
>>>>>details next time I can catch it on screen.  It does not get logged to
>>>>>disk, which doesn't surprise me in this situation.  In most cases the
>>>>>screen was blank (due to APM I'd guess?).
>>>>>	The systems still respond to pings, but do not respond to keyboard
>>>>>actions and do not complete any tcp connections.  These systems do not
>>>>>have a graphical desktop installed, and in fact have a fairly minimal
>>>>>set of packages installed at all.
>>>>>	I don't know why the 2550 would consistently fail in such a brief
>>>>>period while the 2650 would take a much longer time before failure. 
>>>>>I've been running the same tests on a 1750 (PERC4/Di - Megaraid based)
>>>>>for some days now without a failure.
>>>>>	I plan on testing a non-SMP kernel on the 2550 next - not because we
>>>>>can run things that way, but to maybe give some more clues.
>>>>>
>>>>>	The following script creates a 300 MB file, then rm's it, then does it
>>>>>all over again.  For my tests I ran 6 of these concurrently.  Don't
>>>>>expect the system to respond to much while these are running, though I
>>>>>was able to get decent updates from top.
>>>>>	Alter the script as you see fit, I'm no guru with bash scripting!
>>>>
>>>>
>>>>>cat diskgrind.sh
>>>>>#!/bin/sh
>>>>>
>>>>>MEGS=300
>>>>>TOTAL=0
>>>>>
>>>>>while [ "1" != "0" ]; do
>>>>>        dd ibs=1048576 count=$MEGS if=/dev/zero of=/test/diskgrind.$$ 2>&1 | cat >/dev/null
>>>>>        rm -f /test/diskgrind.$$
>>>>>        TOTAL=`expr $TOTAL + $MEGS`
>>>>>        echo "[$$] Completed $TOTAL megs."
>>>>>done
>>>>>
>>>>>
>>>>>./diskgrind.sh &
>>>>>./diskgrind.sh &
>>>>>./diskgrind.sh &
>>>>>./diskgrind.sh &
>>>>>./diskgrind.sh &
>>>>>./diskgrind.sh &
>>>>>
>>>>>
>>>>>
>>>>>Andrew
>>>>>
>>>>
>>>>-- 
>>>>Andrew Mann
>>>>Systems Administrator
>>>>Mythic Entertainment
>>>>703-934-0446 x 224
>>>>
>>>>_______________________________________________
>>>>Linux-PowerEdge mailing list
>>>>Linux-PowerEdge at dell.com
>>>>http://lists.us.dell.com/mailman/listinfo/linux-poweredge
>>>>Please read the FAQ at http://lists.us.dell.com/faq or search the list
>>>>archives at http://lists.us.dell.com/htdig/
>>
>>
> 
> 

-- 
Andrew Mann
Systems Administrator
Mythic Entertainment
703-934-0446 x 224



