megaraid_sas waiting for command and then offline
dlg at itee.uq.edu.au
Wed Dec 13 02:00:59 CST 2006
On 12/12/2006, at 3:24 PM, Brett G. Durrett wrote:
> I am still seeing this and we have between 2 and 5 failures per week
> (across almost 20 machines). I am seeing it on ext3 (we migrated all of
> the machines from XFS) and with ReadAhead disabled.
As said before, I'm seeing this on a completely different operating
system, which leads me to believe that this isn't a driver issue.
However, it would be nice to have it reproducible on a supported
operating system so this issue could get some proper attention.
> You mention a firmware update but I don't see any new PERC 5 firmware
> packages on Dell's site... can you give me a pointer to the
> firmware update?
I upgraded to the latest LSI firmware, version 5.0.1-0061, and still
hit the problem. I also tried limiting the number of openings on the
device from the default (1008) down to 16 and still hit it.
> Also, has anybody had this problem on RHEL? Dell does not provide support
> unless it is RHEL... I would be surprised if somehow RHEL did not have
> this problem.
> Joe Malicki wrote:
>>> I have the same or a similar issue running 2.6.17 SMP x86_64 - the
>>> megaraid_sas driver hangs waiting for commands and then the filesystem
>>> unmounts, leaving the machine in an unusable state until there is a hard
>>> reboot (the machine is responsive but any access, shell or otherwise, is
>>> impossible without the filesystem). While I do not have much
>>> information available, this happens to me about once every 6-7 days in
>>> my pool of seven machines, so I can probably get debugging info. Since
>>> the disk is offline and I can't get a remote console, I don't have any
>>> details except something similar to Dave Lloyd's post, below.
>> Brett, is this still happening to you? We're seeing this very
>> sporadically, but it does concern us. We've seen driver updates in
>> 2.6.19 (v00.00.03.05) and a new Dell PERC 5/i firmware:
>> Package Version - 5.0.2-0003
>> Firmware Version - 1.00.01-0157
>> SASBIOS Version - MT23
>> Ctrl-R Version - 1.02-007
>> MPT Version - 00.06.71.00-IT
>> and haven't been able to reproduce it, but we can't find a test case to
>> reliably reproduce the problem, so we can't know that anything was fixed
>> (out of 31 identically configured Dell 2950s with the PERC 5/i RAID,
>> each configured with six 300GB SAS drives in a RAID 5, most (all?) of them
>> Maxtor Atlas 10k, no hot spare). Our 2950s do have 16GB of RAM,
>> so the firmware update (which mentions that it fixes DMA beyond 8GB)
>> sounds promising, but I would think that if that was the problem we were
>> experiencing, we would reproduce this much more often. We are
>> using the RAM for cache and memory; it's not like we've never touched
>> beyond 8GB.
>> Does anyone have a test case to reproduce this problem reliably, or a
>> detailed description of what actually happens (at a low level) when the
>> problem occurs that can help to make a test? We are more interested in
>> making this reproducible now than in finding a workaround... if anyone
>> has any tips on how to make this *more* likely to happen we'd like to
>> know (so far, I know to try to use XFS and enable ReadAhead).
>> We have seen this correlated with Patrol Reads going on at the same
>> time, but aren't sure if this is a red herring, and haven't been able to
>> force the issue to happen by enabling Patrol Reads.
>> We've only ever seen these on two machines - one machine reproduces the
>> problem in a little over a week, and the other has reproduced it a
>> number of times. The machines that reproduce it run an experimental
>> demo workload, but we have not found a test case so far to reproduce the
>> problem on demand to find or verify solutions. We're currently swapping
>> out machines to verify that there are no hardware problems, but the
>> machines diagnose themselves cleanly, and the workload they run is
>> different enough that we suspect something about the workload that we
>> can't yet synthesize into a test case is the problem.
>> Thank you!
>> Joe Malicki
>> Software Engineer
>> Metacarta, Inc.
>> email: jmalicki at metacarta.com
>>> The only thing that the machines with these failures seem to have in
>>> common is the fact that their workload is almost exclusively writes - they
>>> are slave database machines with large memory and pretty much just
>>> replicate. The read/write machines seem to have fewer failures.
>>> I am happy to help provide debugging information in any reasonable way.
>>> In the meantime, if there are any known suggestions or workarounds for
>>> the problem, I would be grateful for the guidance.
>>> Here are what details I have on the controller. If you want additional
>>> information, let me know exactly what you need and I will do what I can
>>> to get it to you.
>>> Product Name     : PERC 5/i Integrated
>>> Serial No        : 12345
>>> FW Package Build : 5.0.1-0030
>>> FW Version       : 1.00.01-0088
>>> BIOS Version     : MT23
>>> Ctrl-R Version   : 1.02-007
> Linux-PowerEdge mailing list
> Linux-PowerEdge at dell.com
> Please read the FAQ at http://lists.us.dell.com/faq