RedHat 9 aacraid - system fails under extreme disk IO - Reproducible test case

Stefano Turolla sturolla at eso.org
Wed Oct 8 05:56:00 CDT 2003


Tom, I completely agree with you.
We tried to involve Dell several times, and after talking to at
least 4 different people, each time starting over to explain what the
problem was, the final answer I got from the 'responsible for System
Consulting within Dell Computer Germany' was:

"Dear Mr. Turolla,

i just got the Information from RedHat, that the best solution is to
upgrade to RHEL-version,
because there is more intensive support for this version.
Would this be a possible solution for you ?

Rgds
Markus Wammel"

I don't think this is a solution we can accept for a machine that Dell
certified to work under Linux, and we also have the same problem with
1650's, so I am pretty sure the problem is in the interaction between the
aacraid driver and the PERC 3/Di RAID controller.

There is also a Bugzilla bug still open, and it is a bit scary to read
the comment from Alan Cox saying that there is a patch but no
solution.
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=92129


I think Dell should solve the problem (ongoing for at least 4 months)
even if the driver is developed by RedHat and Adaptec.
I don't think we'll buy more Dell servers if they won't solve the
problem.

ciao
stefano 

On Wed, 2003-10-08 at 02:51, Thomas Petersen wrote:
> I am pretty disappointed in Dell for failing to follow up on this and
> resolve the issue once and for all.  This is not a new problem, but it is
> Dell's responsibility to rectify it, as they -certify- RedHat on the 2650 --
> regardless of whether it's a hardware or software issue, Dell is responsible
> to their customers.
> 
> If this were an issue on the Microsoft platform, you can bet Dell would have
> worked with Microsoft and issued a patch/update long before it became a
> widespread problem.  I have always been a huge fan of Dell equipment, but
> their failure in this instance to support what they sell is very troubling.
> 
> Don't get me wrong, I will probably purchase Dell servers again in the future
> (though not the 2650), but can anyone name one problem affecting the
> Microsoft platform, related to Dell hardware, of this magnitude, that went
> unresolved for as long as this one has?  System lockups are -totally-
> unacceptable.
> 
> I guess when people start choosing with their checkbooks, Dell might wake up.
> 
> Thomas Petersen
> SecurityMinded Technologies 
> 
> >>-----Original Message-----
> >>From: Andrew Mann [mailto:amann at mythicentertainment.com] 
> >>Sent: Tuesday, October 07, 2003 6:20 PM
> >>To: linux-poweredge at dell.com
> >>Cc: mark_salyzyn at adaptec.com
> >>Subject: Re: RedHat 9 aacraid - system fails under extreme disk IO - Reproducible test case
> >>
> >>
> >>	Unfortunately we've got a good number of 2550s and 2650s in use,
> >>and replacing the RAID cards isn't ideal.  Mostly we don't have enough
> >>load to cause this problem, but every now and then we do get an
> >>unexplained lockup that pulls someone out of bed at 2 AM.
> >>	I searched back through the reports of this and found some posts
> >>from Mark Salyzyn referencing the AAC_NUM_FIB and AAC_NUM_IO_FIB
> >>settings.  The last comment I see is on 9/9/2003:
> >>"I am suggesting that this value be (AAC_NUM_IO_FIB+64), and limited
> >>to below 512 (the maximum number of hardware FIBs the firmware can
> >>absorb).  I will begin testing the stability and side effects of this
> >>input."
> >>	However, I don't see any followup, nor does the latest patchset
> >>to the 2.4 series seem to contain any modifications in this area (nor
> >>in 2.5 or 2.6 since June 2003).
> >>	Additionally, I've just rebuilt the aacraid module here from the
> >>RedHat SRPM of 2.4.20-20.9 with AAC_NUM_FIB=512 and AAC_NUM_IO_FIB=448,
> >>rebuilt the ramdisk image and such, and got another crash within 5
> >>minutes of starting the test.
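> >>
> >>	(For reference, that rebuild amounts to roughly the following
> >>change before recompiling the module -- a sketch only, and I'm
> >>assuming the defines live in drivers/scsi/aacraid/aacraid.h as in
> >>the stock 2.4 sources:)
> >>
> >>/* drivers/scsi/aacraid/aacraid.h -- header location assumed */
> >>#define AAC_NUM_FIB	512	/* max hardware FIBs the firmware can absorb */
> >>#define AAC_NUM_IO_FIB	448	/* so that AAC_NUM_FIB == AAC_NUM_IO_FIB + 64 */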
> >>
> >>	I also see a note from Mark on 8/27/2003:
> >>-----
> >>There is code that does the following in the driver:
> >>
> >>	scsicmd->result = DID_OK << 16 | COMMAND_COMPLETE << 8
> >>		| SAM_STAT_TASK_SET_FULL;
> >>	aac_io_done(scsicmd);
> >>	return -1;
> >>
> >>This is *wrong*, because the nonzero return causes the system to hold
> >>the command in the queue due to the use of the new error handler, yet
> >>we have also completed the command as `BUSY'.  *And*, because of the
> >>constraints of the aac_io_done call, which relocks io_request_lock,
> >>the caller had to unlock it, leaving a hole that SMP machines fill.
> >>By dropping the result and done calls in these situations, and
> >>holding the locks in the caller of such routines, I believe we will
> >>close this hole.
> >>
> >>....
> >>
> >>I will report back on my tests of these changes, but will need a 
> >>volunteer with kernel compile experience to report on the success in 
> >>resolving this issue in the field *please*.
> >>-----
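> >>
> >>	If I'm reading that right, the fix Mark describes would take
> >>roughly this shape (a sketch only, not his actual patch; the
> >>queue-full condition name here is made up):
> >>
> >>	/* On a full FIB queue: do NOT set scsicmd->result and do NOT
> >>	 * call aac_io_done() -- the command must be completed once, or
> >>	 * requeued, never both.  Return nonzero while the caller still
> >>	 * holds io_request_lock, so the midlayer's new error handler
> >>	 * requeues the command with no unlocked window for another CPU
> >>	 * to slip into. */
> >>	if (fib_queue_full)		/* hypothetical condition */
> >>		return -1;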
> >>
> >>	I'm not familiar enough with the aacraid driver or SCSI in
> >>general to gather the code changes necessary.  There also don't
> >>appear to be any followups.
> >>
> >>	Mark, do you have any updates on this?  I can make code changes,
> >>recompile, and run a test case that reliably reveals the problem here
> >>if that's helpful.
> >>
> >>
> >>I can't see the full panic message, but the parts I can see are 
> >>basically (copied by hand):
> >>
> >>CPU 1
> >>EFLAGS: 00010086
> >>
> >>EIP is at rmqueue [kernel] 0x127  (2.4.20-20.9smp)
> >>eax: c0343400    ebx: c03445dc    ecx: 00000000
> >>edx: b6d7ca63    esi: 00000000    edi: c03445d0
> >>ebp: 00038000    esp: ee643e80     ds: 0068
> >>es: 0068  ss: 0068
> >>
> >>Process dd (pid: 956, stack page = ee643000)
> >>
> >>Call trace:   wakeup_kswapd   0xfb (0xee643e90)
> >>               __alloc_pages_limit  0x57
> >>               __alloc_pages        0x101
> >>               generic_file_write   0x394
> >>               ext3_file_write      0x39
> >>               sys_write            0x97
> >>               system_call          0x33
> >>
> >>	Although aacraid isn't directly implicated here, I can reproduce
> >>this on the 2550s and 2650s (aacraid) but not 1750s (megaraid).
> >>
> >>Andrew
> >>
> >>Paul Anderson wrote:
> >>
> >>> We had this same issue with our 2650's running AS 2.1.  Don't know
> >>> that this is the best answer, but it is the one that worked for
> >>> us... Replace the onboard adapter with a PERC 3/DC (LSI) adapter.
> >>> Make sure that you put it on its own bus; we used slot three.  In 2
> >>> of our 2650's we are even running this with the HBAs for SAN
> >>> connectivity.  That said, our solution is about 2 weeks old, though
> >>> I did run similar tests on the systems after the new install for 8
> >>> days and was unable to make them crash.
> >>> 
> >>> Paul
> >>> 
> >>> -----Original Message-----
> >>> From: Andrew Mann [mailto:amann at mythicentertainment.com]
> >>> Sent: Tuesday, October 07, 2003 12:47 PM
> >>> To: linux-poweredge at dell.com
> >>> Cc: Matt Domsch; deanna_bonds at adaptec.com; alan at redhat.com
> >>> Subject: RedHat 9 aacraid - system fails under extreme disk IO - Reproducible test case
> >>> 
> >>> 
> >>> 	This has been brought up on the Dell Linux PowerEdge list
> >>> previously, but it doesn't appear that a definitive solution or
> >>> reproducible situation has been presented.  It also seems like the
> >>> previous reports involved both heavy disk IO and heavy network
> >>> traffic, and so the NIC driver was suspect.
> >>> 	Since we have a number of 2550s and 2650s using the onboard
> >>> PERC3/Di RAID controller (aacraid driver), this issue concerns us.
> >>> 
> >>> 	The following script was run with 6 instances at once on two
> >>> 2550s and one 2650.
> >>> 
> >>> 2550 configuration
> >>> 2 x P3 1.2 GHz  kernel: 2.4.20-20.9smp #1 SMP
> >>> 1 GB of RAM, 2 GB of swap, 2 x 18 GB drives in a RAID 1 configuration
> >>> 
> >>> 2650 configuration
> >>> 2 x Xeon 2.2 GHz   kernel: 2.4.20-20.9smp #1 SMP
> >>> 2 GB of RAM, 2 GB of swap, 2 x 18 GB drives in a RAID 1 configuration
> >>> Hyperthreading enabled
> >>> 
> >>> 
> >>> 	The 2550s fail within 30 minutes of starting the tests each time
> >>> (tests were run 6 times in a row).  The 2650 failed prior to 2.5 days
> >>> (only 1 test run due to duration before failure).  In some cases the
> >>> 2550 displayed a null pointer dereference in the kernel.  I'll copy
> >>> down details next time I can catch it on screen.  It does not get
> >>> logged to disk, which doesn't surprise me in this situation.  In most
> >>> cases the screen was blank (due to APM, I'd guess?).
> >>> 	The systems still respond to pings, but do not respond to
> >>> keyboard actions and do not complete any TCP connections.  These
> >>> systems do not have a graphical desktop installed, and in fact have a
> >>> fairly minimal set of packages installed at all.
> >>> 	I don't know why the 2550 would consistently fail in such a brief
> >>> period while the 2650 would take a much longer time before failure.
> >>> I've been running the same tests on a 1750 (PERC4/Di - Megaraid
> >>> based) for some days now without a failure.
> >>> 	I plan on testing a non-SMP kernel on the 2550 next - not because
> >>> we can run things that way, but to maybe give some more clues.
> >>> 
> >>> 	The following script creates a 300 MB file, then rm's it, then
> >>> does it all over again.  For my tests I ran 6 of these concurrently.
> >>> Don't expect the system to respond to much while these are running,
> >>> though I was able to get decent updates from top.
> >>> 	Alter the script as you see fit; I'm no guru with bash scripting!
> >>> 
> >>> cat diskgrind.sh
> >>> #!/bin/sh
> >>> 
> >>> # Megabytes written per pass, and a running total for progress output.
> >>> MEGS=300
> >>> TOTAL=0
> >>> 
> >>> # Loop forever: write a $MEGS MB file of zeros, delete it, repeat.
> >>> # dd's transfer stats on stderr are discarded via the pipe to cat.
> >>> while [ "1" != "0" ]; do
> >>>          dd ibs=1048576 count=$MEGS if=/dev/zero of=/test/diskgrind.$$ 2>&1 | cat >/dev/null
> >>>          rm -f /test/diskgrind.$$
> >>>          TOTAL=`expr $TOTAL + $MEGS`
> >>>          echo "[$$] Completed $TOTAL megs."
> >>> done
> >>> 
> >>> 
> >>> ./diskgrind.sh &
> >>> ./diskgrind.sh &
> >>> ./diskgrind.sh &
> >>> ./diskgrind.sh &
> >>> ./diskgrind.sh &
> >>> ./diskgrind.sh &
> >>> 
> >>> 
> >>> 
> >>> Andrew
> >>> 
> >>
> >>-- 
> >>Andrew Mann
> >>Systems Administrator
> >>Mythic Entertainment
> >>703-934-0446 x 224
> >>
-- 
+------+---------+--------+--------+--------+---------+--------+-------+
| Stefano Turolla                             Phone : +49 89 32006537  |
| UNIX System Manager                         Fax   : +49 89 32006380  |
| European Southern Observatory (ESO):        E-Mail: sturolla at eso.org |
| Karl-Schwarzschild-strasse 2 D-85748 Garching bei Muenchen           |
+------+---------+--------+--------+--------+---------+--------+-------+
Computers are like air conditioners,
they stop working properly if you open WINDOWS




