RedHat 9 aacraid - system fails under extreme disk IO - Reproducible test case

Andrew Mann amann at mythicentertainment.com
Wed Oct 8 15:44:01 CDT 2003


	Confirmed.  Crashes remain.

Andrew

Salyzyn, Mark wrote:

> The aac_list* code was removed as it was deemed unnecessary and
> complicating, some time ago. The initial release of the version 1.1.2 driver
> had this aac_list_* code, but by the time the version 1.1.2 cleared into
> Alan Cox's kernel we were `back' to the kernel based list code (and instead
> moved the fib list handler into the fib element itself)
> 
> This (latest) code does not solve the problem, but as I've stated I do not
> believe this issue is a pristine bug. Some may be reporting issues
> surrounding this alternate list handling (?)
> 
> I have enclosed my `latest/greatest' code to *confirm* this; it can be
> `dropped' into the .../drivers/scsi/aacraid directory.
> 
> Sincerely -- Mark Salyzyn
> 
> -----Original Message-----
> From: Andrew Mann [mailto:amann at mythicentertainment.com]
> Sent: Wednesday, October 08, 2003 1:36 PM
> To: Salyzyn, Mark
> Cc: 'tomp at securityminded.net'; linux-poweredge at dell.com
> Subject: Re: RedHat 9 aacraid - system fails under extreme disk IO -
> Reproducible test case
> 
> 
> Hi Mark,
> 	I've got some potentially interesting results on this.
> I downloaded kernel-source-2.4.20.SuSE-62.src.rpm from the SuSE ftp
> site.  It's laid out very nicely for this purpose.  It's separated into
> the stock 2.4.20 kernel with patches for each arch (and a common patch
> set).  Inside the common patch set are patches to the aacraid driver.
> I applied these patches to the stock 2.4.20 kernel and copied the
> resulting /drivers/scsi/aacraid/ directory into the RedHat source tree
> for 2.4.20-20.9.  After a dep and clean the build complained of a
> missing compat.h.  I didn't look to see if it was really used or just a
> Makefile dependency - instead I just copied compat.h from the aacraid
> build 2166 directory.  It built fine.
> 	I'm now up and running for 35 minutes - longer than any test yet.
> 	I've looked at the patches vs the mainline kernel, and while the FIBS
> change from 578 to 512 is the only hardware related change, the linked
> list handling has been completely replaced.  The mainline driver and the
> RedHat driver both use the kernel implementation of a doubly linked
> list.  The new version uses a simple singly linked list.  If you're not
> protecting access to this list correctly, a doubly linked list will give
> you at least 2x greater chance (usually more) of a really bad situation.
> I believe it's possible in a singly linked list to get away with a
> number of operations without locking access to the list, especially if
> they end up being atomic ops.  I don't think it's possible at all on a
> doubly linked list.
> 	So, I'd search in this direction.  I'll send an update this evening if
> things are still running fine, and I'll send one immediately if things
> blow up again.
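
For anyone following the list-handling argument above, here is a minimal
userspace sketch of the difference, assuming C11 atomics.  It is not taken
from the aacraid driver or the SuSE patches; the names snode, dnode,
slist_push and dlist_insert_after are made up for illustration.  Pushing
onto a singly linked list can be published with one compare-and-swap on
the head pointer, while linking a node into a doubly linked list takes
several separate stores, so the caller has to hold a lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node types for illustration; not the aacraid structures. */
struct snode {
    struct snode *next;
    int id;
};

typedef _Atomic(struct snode *) slist_head;

/* Push onto a singly linked list.  The node stays private until a single
 * compare-and-swap on the head publishes it, so concurrent pushers never
 * see a half-linked list. */
static void slist_push(slist_head *head, struct snode *n)
{
    struct snode *old;

    assert(head != NULL && n != NULL);
    old = atomic_load(head);
    do {
        n->next = old;  /* old is refreshed by a failed compare-and-swap */
    } while (!atomic_compare_exchange_weak(head, &old, n));
}

struct dnode {
    struct dnode *next, *prev;
    int id;
};

/* Insert n after pos in a circular doubly linked list.  This takes four
 * separate stores; between the first and the last, another CPU walking
 * the list can find next and prev pointers that disagree.  The caller
 * must hold whatever lock protects the list. */
static void dlist_insert_after(struct dnode *pos, struct dnode *n)
{
    assert(pos != NULL && n != NULL);
    n->next = pos->next;
    n->prev = pos;
    pos->next->prev = n;
    pos->next = n;
}
```

Only the push side is shown.  Removing nodes without a lock is
considerably harder (the ABA problem), so this should not be read as a
claim that a singly linked list is safe without locking in general.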
> 
> Andrew
> 
> Salyzyn, Mark wrote:
> 
> 
>>I have not been able to duplicate this issue, so I am somewhat of a JAFO,
>>and am *not* a definitive resource.
>>
>>This issue is not just one problem. noapic kernel option and turning off
>>HyperThreading have resolved some of the reported issues. Driver changes
>>thus far can not eliminate the problem, but can delay the inevitable. Build
>>3157 of the Firmware appears to work fine, Build 3170 fails, but only with
>>certain Seagate 15K rpm U320 drives.
>>
>>I may be wrong ... any corrections to my assumptions above would be greatly
>>appreciated.
>>
>>Sincerely -- Mark Salyzyn
>>
>>-----Original Message-----
>>From: Thomas Petersen [mailto:tomp at securityminded.net]
>>Sent: Tuesday, October 07, 2003 8:52 PM
>>To: 'Andrew Mann'
>>Cc: linux-poweredge at dell.com; Salyzyn, Mark
>>Subject: RE: RedHat 9 aacraid - system fails under extreme disk IO -
>>Reproducible test case
>>
>>
>>I am pretty disappointed in Dell for failing to follow up on this and
>>resolve the issue once and for all.  This is not a new problem but it is
>>Dell's responsibility to rectify it as they -certify- Redhat on the 2650 --
>>regardless of whether it's a hardware or software issue, Dell is responsible
>>to their customers.
>>
>>If this was an issue on the Microsoft platform you can bet Dell would have
>>worked with Microsoft and issued a patch/update long before it became a
>>widespread problem.  I have always been a huge fan of Dell equipment but their
>>failure in this instance to support what they sell is very troubling.
>>
>>Don't get me wrong, I will probably purchase Dell servers again in the future
>>(though not the 2650), but can anyone name one problem of this magnitude,
>>affecting the Microsoft platform and related to Dell hardware, that went
>>unresolved for as long as this one has?  System lockups are -totally-
>>unacceptable.
>>
>>I guess when people start choosing with their checkbooks Dell might wake up.
>>
>>Thomas Petersen
>>SecurityMinded Technologies 
>>
>>
>>
>>>>-----Original Message-----
>>>>From: Andrew Mann [mailto:amann at mythicentertainment.com] 
>>>>Sent: Tuesday, October 07, 2003 6:20 PM
>>>>To: linux-poweredge at dell.com
>>>>Cc: mark_salyzyn at adaptec.com
>>>>Subject: Re: RedHat 9 aacraid - system fails under extreme
>>>>disk IO - Reproducible test case
>>>>
>>>>
>>>>	Unfortunately we've got a good number of 2550s and 2650s in use, and
>>>>replacing the RAID cards isn't ideal.  Mostly we don't have enough load
>>>>to cause this problem, but every now and then we do get an unexplained
>>>>lockup that pulls someone out of bed at 2 AM.
>>>>	I searched back through the reports of this and found some posts from
>>>>Mark Salyzyn referencing AAC_NUM_FIB and AAC_NUM_IO_FIB settings.  The
>>>>last comment I see is on 9/9/2003:
>>>>"I am suggesting that this value be (AAC_NUM_IO_FIB+64), and limited to
>>>>below 512 (the maximum number of hardware FIBS the Firmware can absorb).
>>>>I will begin testing the stability and side effects of this input."
>>>>	However, I don't see any followup, nor does the latest patchset to the
>>>>2.4 series seem to contain any modifications in this area (or 2.5 or 2.6
>>>>since June 2003).
>>>>	Additionally, I've just rebuilt the aacraid module here from the RedHat
>>>>SRPM of 2.4.20-20.9 with AAC_NUM_FIB=512 and AAC_NUM_IO_FIB=448, rebuilt
>>>>the rdimage and such and got another crash within 5 minutes of starting
>>>>the test.
>>>>
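
The sizing rule quoted above amounts to a clamp.  A small sketch, assuming
"limited to below 512" means "at most 512" (which matches the
AAC_NUM_FIB=512 / AAC_NUM_IO_FIB=448 pair mentioned above).  The macro
names AAC_MAX_HW_FIBS and AAC_FIB_HEADROOM and the function name are
hypothetical and do not come from the driver headers:

```c
#include <assert.h>

/* Hypothetical names; the values 512 and 64 are the ones quoted in the thread. */
#define AAC_MAX_HW_FIBS  512u  /* most FIBs the firmware is said to absorb */
#define AAC_FIB_HEADROOM 64u   /* extra FIBs on top of the I/O FIBs */

/* Total FIB count for a given number of I/O FIBs: num_io_fib + 64,
 * capped at 512.  The comparison is arranged so the addition cannot
 * wrap around for very large inputs. */
static unsigned int aac_num_fib_for(unsigned int num_io_fib)
{
    unsigned int n;

    if (num_io_fib >= AAC_MAX_HW_FIBS - AAC_FIB_HEADROOM)
        return AAC_MAX_HW_FIBS;

    n = num_io_fib + AAC_FIB_HEADROOM;
    assert(n <= AAC_MAX_HW_FIBS);  /* sanity check on the clamp */
    return n;
}
```

With AAC_NUM_IO_FIB=448 this gives 448+64 = 512, exactly at the ceiling.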
>>>>	I also see a note from Mark on 8/27/2003:
>>>>-----
>>>>There is code that does the following in the driver:
>>>>
>>>>	scsicmd->result = DID_OK << 16 | COMMAND_COMPLETE << 8 
>>>>| SAM_STAT_TASK_SET_FULL;
>>>>	aac_io_done(scsicmd);
>>>>	return -1;
>>>>
>>>>This is *wrong*, because the nonzero return causes the system to hold
>>>>the command in the queue due to the use of the new error handler, yet we
>>>>have also completed the command as `BUSY'.  *And*, because the
>>>>aac_io_done call relocks (on io_request_lock), the caller had to unlock
>>>>first, leaving a hole that SMP machines fill.  By dropping the result and
>>>>done calls in these situations, and holding the locks in the caller of
>>>>such routines, I believe we will close this hole.
>>>>
>>>>....
>>>>
>>>>I will report back on my tests of these changes, but will need a 
>>>>volunteer with kernel compile experience to report on the success in 
>>>>resolving this issue in the field *please*.
>>>>-----
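
To make the status line in the quoted note easier to read: the result
word packs three one-byte fields.  A minimal sketch, assuming the usual
values from the Linux SCSI headers (DID_OK and COMMAND_COMPLETE are 0,
SAM_STAT_TASK_SET_FULL is 0x28), reproduced here so the block stands on
its own.  The helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, copied here for illustration only. */
#define DID_OK                 0x00  /* host byte: no host-side error */
#define COMMAND_COMPLETE       0x00  /* message byte */
#define SAM_STAT_TASK_SET_FULL 0x28  /* status byte: target queue is full */

/* Pack host, message and status bytes the way the quoted line does:
 * host << 16 | msg << 8 | status. */
static uint32_t scsi_result(unsigned int host, unsigned int msg,
                            unsigned int status)
{
    assert(host <= 0xff && msg <= 0xff && status <= 0xff);
    return ((uint32_t)host << 16) | ((uint32_t)msg << 8) | (uint32_t)status;
}

/* Unpack the individual fields again. */
static unsigned int result_host(uint32_t r)   { return (r >> 16) & 0xff; }
static unsigned int result_msg(uint32_t r)    { return (r >> 8) & 0xff; }
static unsigned int result_status(uint32_t r) { return r & 0xff; }
```

Under those assumed values the line in the note sets result to 0x28: the
host and message bytes report success while the status byte reports a
full task set.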
>>>>
>>>>	I'm not familiar enough with the aacraid driver or scsi in general to
>>>>gather the code changes necessary.  There also don't appear to be any
>>>>followups.
>>>>
>>>>	Mark, do you have any updates on this?  I can make code changes,
>>>>recompile, and run a test case that reliably reveals the problem here if
>>>>that's helpful.
>>>>
>>>>
>>>>I can't see the full panic message, but the parts I can see are 
>>>>basically (copied by hand):
>>>>
>>>>CPU 1
>>>>EFLAGS: 00010086
>>>>
>>>>EIP is at rmqueue [kernel] 0x127  (2.4.20-20.9smp)
>>>>eax: c0343400    ebx: c03445dc    ecx: 00000000
>>>>edx: b6d7ca63    esi: 00000000    edi: c03445d0
>>>>ebp: 00038000    esp: ee643e80     ds: 0068
>>>>es: 0068  ss: 0068
>>>>
>>>>Process dd (pid: 956, stack page = ee643000)
>>>>
>>>>Call trace:   wakeup_kswapd   0xfb (0xee643e90)
>>>>             __alloc_pages_limit  0x57
>>>>             __alloc_pages        0x101
>>>>             generic_file_write   0x394
>>>>             ext3_file_write      0x39
>>>>             sys_write            0x97
>>>>             system_call          0x33
>>>>
>>>>	Although aacraid isn't directly implicated here, I can reproduce this
>>>>on the 2550s and 2650s (aacraid) but not 1750s (megaraid).
>>>>
>>>>Andrew
>>>>
>>>>Paul Anderson wrote:
>>>>
>>>>
>>>>
>>>>>We had this same issue with our 2650's running AS 2.1.  Don't know
>>>>>that this is the best answer, but it is the one that worked for
>>>>>us...Replace the on board adapter with a PERC 3/DC (LSI) adapter.
>>>>>Make sure that you put it on its own bus, we used slot three.  In 2 of
>>>>>our 2650's we are even running this with the HBA's for SAN
>>>>>connectivity.  That said, our solution is about 2 weeks old, though I
>>>>>did run similar tests on the systems after the new install for 8 days
>>>>>and was unable to make them crash.
>>>>>
>>>>>Paul
>>>>>
>>>>>-----Original Message-----
>>>>>From: Andrew Mann [mailto:amann at mythicentertainment.com]
>>>>>Sent: Tuesday, October 07, 2003 12:47 PM
>>>>>To: linux-poweredge at dell.com
>>>>>Cc: Matt Domsch; deanna_bonds at adaptec.com; alan at redhat.com
>>>>>Subject: RedHat 9 aacraid - system fails under extreme disk IO -
>>>>>Reproducible test case
>>>>>
>>>>>
>>>>>	This has been brought up on the Dell Linux Poweredge list previously,
>>>>>but it doesn't appear that a definitive solution or reproducible
>>>>>situation has been presented.  It also seems like the previous reports
>>>>>involved both heavy disk IO as well as heavy network traffic, and so the
>>>>>NIC driver was suspect.
>>>>>	Since we have a number of 2550s and 2650s using the onboard PERC3/Di
>>>>>raid controller (aacraid driver), this issue concerns us.
>>>>>
>>>>>	The following script was run with 6 instances at once on two 2550s
>>>>>and one 2650.
>>>>>
>>>>>2550 configuration
>>>>>2 x P3 1.2 GHz  kernel: 2.4.20-20.9smp #1 SMP
>>>>>1GB of RAM, 2GB of swap, 2 x 18 GB drives in a RAID 1 configuration
>>>>>
>>>>>2650 configuration
>>>>>2 x Xeon 2.2 GHz   kernel: 2.4.20-20.9smp #1 SMP
>>>>>2GB of RAM, 2GB of swap, 2 x 18 GB drives in a RAID 1 configuration
>>>>>Hyperthreading enabled
>>>>>
>>>>>
>>>>>	The 2550s fail within 30 minutes of starting the tests each time
>>>>>(tests were run 6 times in a row).  The 2650 failed prior to 2.5 days
>>>>>(only 1 test run due to duration before failure).  In some cases the 2550
>>>>>displayed a null pointer dereference in the kernel.  I'll copy down
>>>>>details next time I can catch it on screen.  It does not get logged to
>>>>>disk, which doesn't surprise me in this situation.  In most cases the
>>>>>screen was blank (due to APM I'd guess?).
>>>>>	The systems still respond to pings, but do not respond to keyboard
>>>>>actions and do not complete any tcp connections.  These systems do not
>>>>>have a graphical desktop installed, and in fact have a fairly minimal
>>>>>set of packages installed at all.
>>>>>	I don't know why the 2550 would consistently fail in such a brief
>>>>>period while the 2650 would take a much longer time before failure.
>>>>>I've been running the same tests on a 1750 (PERC4/Di - Megaraid based)
>>>>>for some days now without a failure.
>>>>>	I plan on testing a non-SMP kernel on the 2550 next - not because we
>>>>>can run things that way, but to maybe give some more clues.
>>>>>
>>>>>	The following script creates a 300 MB file, then rm's it, then does it
>>>>>all over again.  For my tests I ran 6 of these concurrently.  Don't
>>>>>expect the system to respond to much while these are running, though I
>>>>>was able to get decent updates from top.
>>>>>	Alter the script as you see fit, I'm no guru with bash scripting!
>>>>>
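>>>>>cat diskgrind.sh
>>>>>#!/bin/sh
>>>>>
>>>>># Size of each test file in megabytes, and a running total written.
>>>>>MEGS=300
>>>>>TOTAL=0
>>>>>
>>>>># Loop forever: write the file, delete it, report progress.
>>>>>while [ "1" != "0" ]; do
>>>>>        # Read zeros in 1 MB blocks; dd's own output is discarded.
>>>>>        dd ibs=1048576 count=$MEGS if=/dev/zero of=/test/diskgrind.$$ \
>>>>>                2>&1 | cat >/dev/null
>>>>>        rm -f /test/diskgrind.$$
>>>>>        TOTAL=`expr $TOTAL + $MEGS`
>>>>>        echo "[$$] Completed $TOTAL megs."
>>>>>done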
>>>>>
>>>>>
>>>>>./diskgrind.sh &
>>>>>./diskgrind.sh &
>>>>>./diskgrind.sh &
>>>>>./diskgrind.sh &
>>>>>./diskgrind.sh &
>>>>>./diskgrind.sh &
>>>>>
>>>>>
>>>>>
>>>>>Andrew
>>>>>
>>>>
>>>>-- 
>>>>Andrew Mann
>>>>Systems Administrator
>>>>Mythic Entertainment
>>>>703-934-0446 x 224
>>>>

-- 
Andrew Mann
Systems Administrator
Mythic Entertainment
703-934-0446 x 224



