RedHat 9 aacraid - system fails under extreme disk IO - Reproducible test case

Randy Palmer Randy_Palmer at avid.com
Mon Oct 27 09:41:22 CST 2003


Where can I find instructions on implementing this "aac_eh_reset" workaround to prevent aacraid lockups on my PE2650?

  -RP
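
Mark Salyzyn's reply further down this digest describes the workaround:
have the driver's aac_eh_reset handler wait up to an extra 60 seconds for
all outstanding commands to drain before returning to SCSI error recovery.
As a rough sketch of that idea against a 2.4-era driver (hypothetical:
active_cmd_count() is a placeholder, not a real aacraid symbol, and this
is not the actual Adaptec patch):

static int aac_eh_reset(Scsi_Cmnd *cmd)
{
	int seconds;

	printk(KERN_WARNING "aacraid: Host adapter reset request. SCSI hang ?\n");

	/* Poll for up to 60 extra seconds so the firmware can finish its
	 * cache flush and complete the outstanding commands. */
	for (seconds = 0; seconds < 60; seconds++) {
		if (active_cmd_count(cmd->host) == 0)	/* placeholder helper */
			break;
		set_current_state(TASK_UNINTERRUPTIBLE);
		schedule_timeout(HZ);			/* sleep ~1 second */
	}

	/* With the firmware caught up, the SCSI layer's follow-up test
	 * unit ready should get a timely response and the device(s) stay
	 * online. */
	return SUCCESS;
}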



Subject: RE: RedHat 9 aacraid - system fails under extreme disk IO -
      Reproducible test case
From: Russell Stuart <rstuart at lubemobile.com.au>
To: "Salyzyn, Mark" <mark_salyzyn at adaptec.com>
Cc: linux-poweredge at dell.com, linux-aacraid-devel at dell.com,
   "'Mark Haverkamp'" <markh at osdl.org>
Organization: Lube Mobile
Date: 27 Oct 2003 08:15:47 +1000
 
Just to confirm, your workaround works.  I have been thrashing it for a
week.  No failures.  It would have failed by now under all other tests I
have done.  SMP, SMT, and RAID write caching are enabled.  There is one
odd message in /var/log/messages:
 
Oct 25 10:03:17 mephisto kernel: aacraid: Host adapter reset request.
SCSI hang ?
Oct 25 10:03:17 mephisto kernel: aacraid: Outstanding commands on
(0,0,0,0):
Oct 25 10:03:17 mephisto kernel:    0 C  2a 00 00 00 00 6f 00 00 08 00
Oct 25 10:03:17 mephisto kernel:    1 A* 2a 00 03 20 31 77 00 00 60 00
Oct 25 10:03:17 mephisto kernel:    2 C  2a 00 00 08 00 6f 00 00 08 00
Oct 25 10:03:17 mephisto kernel:    3 C  2a 00 00 00 00 47 00 00 08 00
Oct 25 10:03:17 mephisto kernel:    4 A  2a 00 02 46 e6 a7 00 00 80 00
Oct 25 10:03:17 mephisto kernel:    5 A  28 00 01 10 2a bf 00 00 08 00
Oct 25 10:03:17 mephisto kernel:    6 C  2a 00 00 10 00 87 00 00 10 00
Oct 25 10:03:17 mephisto kernel:    7 A  2a 00 02 46 e7 27 00 00 20 00
Oct 25 10:03:17 mephisto kernel:    8 C  2a 00 00 10 00 6f 00 00 10 00
Oct 25 10:03:17 mephisto kernel:    9 C  2a 00 02 c0 00 87 00 00 10 00
 
So thanks.  Best of luck with the firmware bug.
 
On Mon, 2003-10-20 at 23:33, Salyzyn, Mark wrote:
> This also deals with the long-standing thread (since April):
> 
> Subject: kernel: aacraid: Host adapter reset request. SCSI hang ?
> 
> We have a driver workaround!!!
> 
> The root cause is traced to `something' triggering the firmware to flush
> its cache at too high a priority, causing the adapter to hold off new
> commands until the flush has completed. That something could be management
> applications, device misbehavior, bus conditions, or the position of the
> moon. I have clocked a worst case of 73 seconds where the adapter is too
> busy. A firmware fix is in the works, but given the longer lead times for
> acceptance of new firmware and driver packaging, I am providing a driver
> source workaround for those who are experiencing this problem and have the
> savvy to build their own driver modules. Not all firmware variants have
> this problem, so taking this driver is optional. This driver has only been
> unit tested on a handful of systems.
> 
> The workaround is to wait in the aac_eh_reset handler up to an additional
> 60 seconds until all commands are complete, effectively waiting for the
> firmware to finish the cache flush. Upon return, the error recovery code
> in the SCSI layer can then issue its test unit ready and get a timely
> enough response that it does not take the device(s) offline.
> 
> Sincerely -- Mark Salyzyn
> 
> -----Original Message-----
> From: Russell Stuart [mailto:rstuart at lubemobile.com.au]
> Sent: Wednesday, October 08, 2003 9:19 PM
> To: Salyzyn, Mark
> Cc: linux-poweredge at dell.com
> Subject: RE: RedHat 9 aacraid - system fails under extreme disk IO -
> Reproducible test case
> 
> 
> You describe my setup exactly.  I have two machines, one fails within 4
> hours and the other works perfectly.
> 
> Are you saying then that if I revert the firmware to Build 3157 the
> problem will go away?  Is there some reason I should not do this (like
> other bugs in 3157)?
> 
> On Wed, 2003-10-08 at 23:56, Salyzyn, Mark wrote:
> > I have not been able to duplicate this issue, so I am somewhat of a
> > JAFO, and am *not* a definitive resource.
> > 
> > This issue is not just one problem. The noapic kernel option and turning
> > off HyperThreading have resolved some of the reported issues. Driver
> > changes thus far cannot eliminate the problem, but can delay the
> > inevitable. Build 3157 of the firmware appears to work fine; Build 3170
> > fails, but only with certain Seagate 15K rpm U320 drives.
> > 
> > I may be wrong ... any corrections to my assumptions above would be
> > greatly appreciated.
> > 
> > Sincerely -- Mark Salyzyn
> > 
> > -----Original Message-----
> > From: Thomas Petersen [mailto:tomp at securityminded.net]
> > Sent: Tuesday, October 07, 2003 8:52 PM
> > To: 'Andrew Mann'
> > Cc: linux-poweredge at dell.com; Salyzyn, Mark
> > Subject: RE: RedHat 9 aacraid - system fails under extreme disk IO -
> > Reproducible test case
> > 
> > 
> > I am pretty disappointed in Dell for failing to follow up on this and
> > resolve the issue once and for all.  This is not a new problem, but it
> > is Dell's responsibility to rectify it, as they -certify- Red Hat on the
> > 2650 -- regardless of whether it's a hardware or software issue, Dell
> > is responsible to their customers.
> > 
> > If this were an issue on the Microsoft platform, you can bet Dell would
> > have worked with Microsoft and issued a patch/update long before it
> > became a widespread problem.  I have always been a huge fan of Dell
> > equipment, but their failure in this instance to support what they sell
> > is very troubling.
> > 
> > Don't get me wrong, I will probably purchase Dell servers again in the
> > future (though not the 2650), but can anyone name one problem affecting
> > the Microsoft platform, related to Dell hardware, of this magnitude,
> > that went unresolved for as long as this one has?  System lockups are
> > -totally- unacceptable.
> > 
> > I guess when people start choosing with their checkbooks Dell might
> > wake up.
> > 
> > Thomas Petersen
> > SecurityMinded Technologies
> > 
> > >>-----Original Message-----
> > >>From: Andrew Mann [mailto:amann at mythicentertainment.com] 
> > >>Sent: Tuesday, October 07, 2003 6:20 PM
> > >>To: linux-poweredge at dell.com
> > >>Cc: mark_salyzyn at adaptec.com
> > >>Subject: Re: RedHat 9 aacraid - system fails under extreme 
> > >>disk IO - Reproducible test case
> > >>
> > >>
> > >>      Unfortunately we've got a good number of 2550s and 2650s in use,
> > >>and replacing the RAID cards isn't ideal.  Mostly we don't have enough
> > >>load to cause this problem, but every now and then we do get an
> > >>unexplained lockup that pulls someone out of bed at 2 AM.
> > >>      I searched back through the reports of this and found some posts
> > >>from Mark Salyzyn referencing AAC_NUM_FIB and AAC_NUM_IO_FIB settings.
> > >>The last comment I see is on 9/9/2003:
> > >>"I am suggesting that this value be (AAC_NUM_IO_FIB+64), and limited
> > >>to below 512 (the maximum number of hardware FIBs the Firmware can
> > >>absorb).  I will begin testing the stability and side effects of this
> > >>input."
> > >>      However, I don't see any followup, nor does the latest patchset
> > >>to the 2.4 series seem to contain any modifications in this area (or
> > >>2.5 or 2.6 since June 2003).
> > >>      Additionally, I've just rebuilt the aacraid module here from the
> > >>RedHat SRPM of 2.4.20-20.9 with AAC_NUM_FIB=512 and AAC_NUM_IO_FIB=448,
> > >>rebuilt the ramdisk image and such, and got another crash within 5
> > >>minutes of starting the test.
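
As a hypothetical illustration of the tuning Andrew describes above (the
actual macro locations in the 2.4 aacraid source may differ):

/* Illustrative defines, per Mark's 9/9 suggestion quoted above: set
 * AAC_NUM_FIB to AAC_NUM_IO_FIB + 64, capped at 512, the maximum number
 * of hardware FIBs the firmware can absorb. */
#define AAC_NUM_IO_FIB	448
#define AAC_NUM_FIB	(AAC_NUM_IO_FIB + 64)	/* = 512 */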
> > >>
> > >>      I also see a note from Mark on 8/27/2003:
> > >>-----
> > >>There is code that does the following in the driver:
> > >>
> > >>      scsicmd->result = DID_OK << 16 | COMMAND_COMPLETE << 8
> > >>                      | SAM_STAT_TASK_SET_FULL;
> > >>      aac_io_done(scsicmd);
> > >>      return -1;
> > >>
> > >>This is *wrong*, because the non-zero return causes the system to hold
> > >>the command in the queue due to the use of the new error handler, yet
> > >>we have also completed the command as `BUSY'; *and*, because of the
> > >>constraints of the aac_io_done call, which relocks io_request_lock,
> > >>the caller had to unlock, leaving a hole that SMP machines fill.  By
> > >>dropping the result and done calls in these situations, and holding
> > >>the locks in the caller of such routines, I believe we will close
> > >>this hole.
> > >>
> > >>....
> > >>
> > >>I will report back on my tests of these changes, but will need a
> > >>volunteer with kernel compile experience to report on the success in
> > >>resolving this issue in the field *please*.
> > >>-----
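
To make the quoted change concrete, here is a hypothetical before/after
of the pattern Mark describes (the names follow the 2.4 aacraid style,
but this is illustrative, not the actual patch):

/* Before: the command is completed as BUSY *and* -1 is returned, so the
 * SCSI midlayer also holds it in the queue; worse, aac_io_done() retakes
 * io_request_lock, so the caller had to drop it first, opening a window
 * that SMP machines fall into. */
scsicmd->result = DID_OK << 16 | COMMAND_COMPLETE << 8 | SAM_STAT_TASK_SET_FULL;
aac_io_done(scsicmd);
return -1;

/* After: do not complete the command at all; the non-zero return alone
 * makes the midlayer retry it, and the caller keeps io_request_lock held
 * throughout, closing the race. */
return -1;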
> > >>
> > >>      I'm not familiar enough with the aacraid driver or SCSI in
> > >>general to gather the code changes necessary.  There also don't appear
> > >>to be any followups.
> > >>
> > >>      Mark, do you have any updates on this?  I can make code changes,
> > >>recompile, and run a test case that reliably reveals the problem here
> > >>if that's helpful.
> > >>
> > >>
> > >>I can't see the full panic message, but the parts I can see are 
> > >>basically (copied by hand):
> > >>
> > >>CPU 1
> > >>EFLAGS: 00010086
> > >>
> > >>EIP is at rmqueue [kernel] 0x127  (2.4.20-20.9smp)
> > >>eax: c0343400    ebx: c03445dc    ecx: 00000000
> > >>edx: b6d7ca63    esi: 00000000    edi: c03445d0
> > >>ebp: 00038000    esp: ee643e80     ds: 0068
> > >>es: 0068  ss: 0068
> > >>
> > >>Process dd (pid: 956, stack page = ee643000)
> > >>
> > >>Call trace:   wakeup_kswapd   0xfb (0xee643e90)
> > >>               __alloc_pages_limit  0x57
> > >>               __alloc_pages        0x101
> > >>               generic_file_write   0x394
> > >>               ext3_file_write      0x39
> > >>               sys_write            0x97
> > >>               system_call          0x33
> > >>
> > >>      Although aacraid isn't directly implicated here, I can reproduce
> > >>this on the 2550s and 2650s (aacraid) but not 1750s (megaraid).
> > >>
> > >>Andrew
> > >>
> > >>Paul Anderson wrote:
> > >>
> > >>> We had this same issue with our 2650s running AS 2.1.  I don't know
> > >>> that this is the best answer, but it is the one that worked for us:
> > >>> replace the onboard adapter with a PERC 3/DC (LSI) adapter.  Make
> > >>> sure that you put it on its own bus; we used slot three.  In 2 of
> > >>> our 2650s we are even running this with the HBAs for SAN
> > >>> connectivity.  That said, our solution is about 2 weeks old, though
> > >>> I did run similar tests on the systems after the new install for 8
> > >>> days and was unable to make them crash.
> > >>> 
> > >>> Paul
> > >>> 
> > >>> -----Original Message-----
> > >>> From: Andrew Mann [mailto:amann at mythicentertainment.com]
> > >>> Sent: Tuesday, October 07, 2003 12:47 PM
> > >>> To: linux-poweredge at dell.com
> > >>> Cc: Matt Domsch; deanna_bonds at adaptec.com; alan at redhat.com
> > >>> Subject: RedHat 9 aacraid - system fails under extreme disk IO - 
> > >>> Reproducible test case
> > >>> 
> > >>> 
> > >>>     This has been brought up on the Dell Linux PowerEdge list
> > >>> previously, but it doesn't appear that a definitive solution or
> > >>> reproducible situation has been presented.  It also seems like the
> > >>> previous reports involved both heavy disk IO as well as heavy
> > >>> network traffic, and so the NIC driver was suspect.
> > >>>     Since we have a number of 2550s and 2650s using the onboard
> > >>> PERC3/Di RAID controller (aacraid driver), this issue concerns us.
> > >>> 
> > >>>     The following script was run with 6 instances at once on two
> > >>> 2550s and one 2650.
> > >>> 
> > >>> 2550 configuration
> > >>> 2 x P3 1.2 GHz  kernel: 2.4.20-20.9smp #1 SMP
> > >>> 1 GB of RAM, 2 GB of swap, 2 x 18 GB drives in a RAID 1 configuration
> > >>> 
> > >>> 2650 configuration
> > >>> 2 x Xeon 2.2 GHz   kernel: 2.4.20-20.9smp #1 SMP
> > >>> 2 GB of RAM, 2 GB of swap, 2 x 18 GB drives in a RAID 1 configuration
> > >>> Hyperthreading enabled
> > >>> 
> > >>> 
> > >>>     The 2550s fail within 30 minutes of starting the tests each time
> > >>> (tests were run 6 times in a row).  The 2650 failed prior to 2.5
> > >>> days (only 1 test run due to duration before failure).  In some
> > >>> cases the 2550 displayed a null pointer dereference in the kernel.
> > >>> I'll copy down details next time I can catch it on screen.  It does
> > >>> not get logged to disk, which doesn't surprise me in this situation.
> > >>> In most cases the screen was blank (due to APM I'd guess?).
> > >>>     The systems still respond to pings, but do not respond to
> > >>> keyboard actions and do not complete any TCP connections.  These
> > >>> systems do not have a graphical desktop installed, and in fact have
> > >>> a fairly minimal set of packages installed at all.
> > >>>     I don't know why the 2550 would consistently fail in such a
> > >>> brief period while the 2650 would take a much longer time before
> > >>> failure.  I've been running the same tests on a 1750 (PERC4/Di -
> > >>> Megaraid based) for some days now without a failure.
> > >>>     I plan on testing a non-SMP kernel on the 2550 next - not
> > >>> because we can run things that way, but to maybe give some more
> > >>> clues.
> > >>> 
> > >>>     The following script creates a 300 MB file, then rm's it, then
> > >>> does it all over again.  For my tests I ran 6 of these concurrently.
> > >>> Don't expect the system to respond to much while these are running,
> > >>> though I was able to get decent updates from top.
> > >>>     Alter the script as you see fit; I'm no guru with bash scripting!
> > >>> 
> > >>> cat diskgrind.sh
> > >>> #!/bin/sh
> > >>> 
> > >>> 
> > >>> MEGS=300
> > >>> TOTAL=0
> > >>> 
> > >>> while [ "1" != "0" ]; do
> > >>>          dd ibs=1048576 count=$MEGS if=/dev/zero of=/test/diskgrind.$$ 2>&1 | cat >/dev/null
> > >>>          rm -f /test/diskgrind.$$
> > >>>          TOTAL=`expr $TOTAL + $MEGS`
> > >>>          echo "[$$] Completed $TOTAL megs."
> > >>> done
> > >>> 
> > >>> 
> > >>> ./diskgrind.sh &
> > >>> ./diskgrind.sh &
> > >>> ./diskgrind.sh &
> > >>> ./diskgrind.sh &
> > >>> ./diskgrind.sh &
> > >>> ./diskgrind.sh &
> > >>> 
> > >>> 
> > >>> 
> > >>> Andrew
> > >>> 
> > >>
> > >>-- 
> > >>Andrew Mann
> > >>Systems Administrator
> > >>Mythic Entertainment
> > >>703-934-0446 x 224
> > >>
> > 
> > 
> 
 
 
--__--__--
 
Message: 3
Date: Mon, 27 Oct 2003 18:36:58 +0900
From: Joe Stevens <joe at spin.ad.jp>
To: linux-poweredge at dell.com
Cc: linux-aacraid-devel at dell.com
Subject: Re: RedHat 9 aacraid - system fails under extreme disk IO -
 Reproducible test case
 
Did we ever confirm whether there is a big problem with just running
firmware Build 3157 until the fix is released?
 
 
Russell Stuart wrote:
 
> Just to confirm, your workaround works.  I have been thrashing it for a
> week.  No failures.  It would have failed by now under all other tests I
> have done.  SMP, SMT, and RAID write caching are enabled.
> 
> [snip]
 
_______________________________________________
Linux-PowerEdge mailing list
Linux-PowerEdge at dell.com
http://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq or search the list
archives at http://lists.us.dell.com/htdig/
 
End of Linux-PowerEdge Digest
 



