megaraid_sas waiting for command and then offline

David Gwynne dlg at itee.uq.edu.au
Wed Dec 13 02:00:59 CST 2006


On 12/12/2006, at 3:24 PM, Brett G. Durrett wrote:
>
> I am still seeing this and we have between 2 and 5 failures per week
> (across almost 20 machines).  I am seeing it on ext3 (we migrated all
> of the machines from XFS) and with ReadAhead disabled.

As I said before, I'm seeing this on a completely different operating
system, which leads me to believe that this isn't a driver issue.

However, it would be nice to have it reproducible on a supported  
operating system so this issue could get some proper attention.

> You mention a firmware update but I don't see any new PERC 5 firmware
> packages on Dell's site... can you give me a pointer to the firmware
> update?

I upgraded to the latest LSI firmware, version 5.0.1-0061, and still
hit the problem. I also tried limiting the number of openings (the
command queue depth) on the device from the default of 1008 down to 16
and still hit it.
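
For anyone who wants to try the same experiment on the Linux side, here
is a minimal sketch, assuming the nearest equivalent of limiting
openings is the SCSI queue_depth attribute in sysfs; the device name
and depth are placeholders rather than anything taken from this thread,
and it needs root:

#!/usr/bin/env python3
"""Minimal sketch: cap a block device's SCSI queue depth via sysfs.

Assumptions (not from this thread): the device exposes
/sys/block/<dev>/device/queue_depth, and this runs as root.
"""
import sys
from pathlib import Path

def set_queue_depth(dev: str, depth: int) -> None:
    # Write the requested depth, then read it back to confirm.
    qd = Path(f"/sys/block/{dev}/device/queue_depth")
    qd.write_text(f"{depth}\n")
    print(f"{dev}: queue_depth now {qd.read_text().strip()}")

if __name__ == "__main__":
    # Example: ./queue_depth.py sda 16
    set_queue_depth(sys.argv[1], int(sys.argv[2]))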

>
> Also, has anybody had this problem on RHEL?  Dell does not support
> Linux unless it is RHEL... I would be surprised if somehow RHEL did
> not have this problem.

I agree.

>
> B-
>
>
>
> Joe Malicki wrote:
>>>  I have the same or a similar issue running 2.6.17 SMP x86_64 - the
>>> megaraid_sas driver hangs waiting for commands and then the
>>> filesystem unmounts, leaving the machine in an unusable state until
>>> there is a hard reboot (the machine is responsive but any access,
>>> shell or otherwise, is impossible without the filesystem). While I
>>> do not have much debugging information available, this happens to me
>>> about once every 6-7 days in my pool of seven machines, so I can
>>> probably get debugging info. Since the disk is offline and I can't
>>> get remote console, I don't have any details except something
>>> similar to Dave Lloyd's post, below.
>>>
>>
>> Brett, is this still happening to you?  We're seeing this very
>> sporadically, but it does concern us.  We've seen driver updates in
>> 2.6.19 (v00.00.03.05) and a new Dell PERC 5/i firmware:
>>
>> Package Version - 5.0.2-0003
>> Firmware Version - 1.00.01-0157
>> SASBIOS Version - MT23
>> Ctrl-R Version - 1.02-007
>> MPT Version - 00.06.71.00-IT
>>
>> and haven't been able to reproduce it, but we can't find a test case
>> that reliably reproduces the problem, so we don't know whether
>> anything was actually fixed.  This is out of 31 identically
>> configured Dell 2950's with the PERC 5/i RAID controller, each
>> configured with six 300GB SAS drives in a RAID 5, most (all?) of them
>> Maxtor Atlas 10k, with no hot spare.  Our 2950s do have 16GB of RAM
>> each, so the firmware update (which mentions that it fixes DMA beyond
>> 8GB) sounds promising, but I would think that if that were the
>> problem we were experiencing, we would reproduce it much more often.
>> We are certainly using the RAM for cache and application memory; it's
>> not as if we've never touched anything beyond 8GB.
>>
>> Does anyone have a test case to reproduce this problem reliably, or a
>> detailed description of what actually happens (at a low level) when
>> this problem occurs that could help us build a test?  We are more
>> interested in making this reproducible now than in finding a
>> workaround... if anyone has any tips on how to make this *more*
>> likely to happen, we'd like to know (so far, I know to try XFS and to
>> enable ReadAhead).
>>
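
Purely as a starting point for a synthetic test, here is a rough sketch
of a write-mostly load along the lines of what the failing slaves see;
nothing in it comes from this thread - the target path, file size and
block size are all placeholders:

#!/usr/bin/env python3
"""Rough, illustrative write-mostly load generator (not a known repro).

Placeholders: TARGET should live on the suspect RAID volume; the sizes
are arbitrary.  The idea is sustained sequential writes with periodic
fsync, loosely mimicking a replication-only database slave.
"""
import os

TARGET = "/mnt/test/stress.dat"   # placeholder path on the PERC volume
BLOCK = 1 << 20                   # 1 MiB per write
FILE_SIZE = 4 << 30               # rewrite a 4 GiB file each pass

def one_pass() -> None:
    buf = os.urandom(BLOCK)
    fd = os.open(TARGET, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        for _ in range(FILE_SIZE // BLOCK):
            os.write(fd, buf)
        os.fsync(fd)              # push everything through the controller
    finally:
        os.close(fd)

if __name__ == "__main__":
    while True:                   # run until the hang shows up (or doesn't)
        one_pass()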
>> We have seen this correlated with Patrol Reads going on at the same
>> time, but aren't sure if this is a red herring, and haven't been able
>> to force the issue to happen by enabling Patrol Reads.
>>
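
If it helps to test the Patrol Read (and ReadAhead) correlation on
demand, here is a small sketch that shells out to MegaCli; the install
path is a placeholder and the exact option spellings vary between
MegaCli releases, so check your MegaCli's own help output before
relying on them:

#!/usr/bin/env python3
"""Sketch: kick off a Patrol Read / enable ReadAhead while a load runs.

Assumptions: a MegaCli binary is installed at MEGACLI (placeholder
path), and the -AdpPR / -LDSetProp options behave as documented for the
versions I've seen; verify against your MegaCli's help text first.
"""
import subprocess

MEGACLI = "/opt/MegaRAID/MegaCli/MegaCli64"   # placeholder install path

def start_patrol_read() -> None:
    # Start a Patrol Read on all adapters right now.
    subprocess.run([MEGACLI, "-AdpPR", "-Start", "-aALL"], check=True)

def enable_read_ahead() -> None:
    # Turn ReadAhead on for all logical drives on all adapters.
    subprocess.run([MEGACLI, "-LDSetProp", "RA", "-LAll", "-aALL"],
                   check=True)

def patrol_read_status() -> str:
    out = subprocess.run([MEGACLI, "-AdpPR", "-Info", "-aALL"],
                         check=True, capture_output=True, text=True)
    return out.stdout

if __name__ == "__main__":
    enable_read_ahead()
    start_patrol_read()
    print(patrol_read_status())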
>> We've only ever seen these on two machines - one machine reproduces
>> the problem in a little over a week, and the other has reproduced it
>> a small number of times.  The machines that reproduce it run an
>> experimental demo workload, but we have not found a test case so far
>> to reproduce the problem on demand in order to find or verify
>> solutions.  We're currently swapping out machines to verify that
>> there are no hardware problems, but the machines diagnose themselves
>> cleanly, and their workload is different enough from our others that
>> we suspect the cause is something about that workload which we can't
>> yet synthesize into a test case.
>>
>> Thank you!
>> Joe Malicki
>> Software Engineer
>> Metacarta, Inc.
>> email: jmalicki at metacarta.com
>>
>>
>>> The only thing that the machines with these failures seem to have in
>>> common is that their workloads are almost exclusively writes - they
>>> are slave database machines with large memory and pretty much just
>>> replicate.  The read/write machines seem to have fewer failures.
>>>
>>> I am happy to help provide debugging information in any reasonable
>>> way.  In the meantime, if there are any known suggestions or
>>> workarounds for the problem, I would be grateful for the guidance.
>>>
>>> Here are the details on the controller.  If you want additional
>>> info, let me know exactly what you need and I will do what I can to
>>> get it to you:
>>>
>>> Product Name : PERC 5/i Integrated
>>> Serial No : 12345
>>> FW Package Build: 5.0.1-0030
>>> FW Version : 1.00.01-0088
>>> BIOS Version : MT23
>>> Ctrl-R Version : 1.02-007
>>>
>>> B-
>>>
>>
>>
>


