megasas failures at MWT2

Aaron van Meerten aaron at hep.uchicago.edu
Wed Mar 31 15:27:19 CDT 2010


Hi Dell Linux Folks,

Our site recently purchased a number of Dell R710s, each with 2 Perc6/E controllers and 3 MD1000s SAS-connected to each controller, for 6 MD1000s per R710.

We are running Scientific Linux 5.3, which is very close to RHEL5.

We have seen two critical failures recently on different nodes, with the megasas driver and Perc6/E controllers becoming entirely unresponsive and all volumes associated with the controller going offline.  I'm curious whether anyone on this list has seen anything of this variety before and has any suggestions for a fix.  So far we've been able to duplicate it by applying a large number of writes over the network into this storage.  The only way out of this state is a power cycle, as the kernel won't even shut down cleanly once it reaches this state.


A more complete description of the problem is below.


At some point during write load, the RAID controller or enclosure falls into a state where it fails to communicate correctly with the OS.  This causes at least one virtual disk device (e.g. /dev/sde), and usually all other virtual disk devices on the controller, to fail.  The failure is complete: no data can be read or written, and an ls -l returns a 'no available memory' error.

Once we fall into this state, there is no remedy except to reboot the whole server.  Once the server reboots, the stack is reset and the machine works as expected.

Looking into the details of the driver and firmware version stack I find the following:

We are running the latest firmware on these MD1000s: A.04
We are running the latest firmware on the Perc6/E: 6.2.0-0013
We are running a slightly older megasas driver than Dell recommends on their site: 00.00.04.08-RH2
Dell suggests 00.00.04.17.  We have upgraded 2 nodes to this newer driver version and are testing with that configuration as well.
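As an aside, a quick way to check whether a node's driver is behind the recommended version is a GNU sort -V comparison.  This is just a sketch: on a live node the running version would come from /sys/module/megaraid_sas/version, and it is hard-coded here for illustration.

```shell
#!/bin/sh
# On a live node this would come from:
#   cat /sys/module/megaraid_sas/version
running="00.00.04.08-RH2"
recommended="00.00.04.17"

# Strip the vendor suffix (-RH2) so the numeric fields compare cleanly,
# then let GNU sort -V decide which version string is older.
running_num=${running%%-*}
oldest=$(printf '%s\n%s\n' "$running_num" "$recommended" | sort -V | head -n1)

if [ "$oldest" = "$running_num" ] && [ "$running_num" != "$recommended" ]; then
    echo "driver $running is older than recommended $recommended"
else
    echo "driver $running meets the recommended minimum"
fi
```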

So far we have seen 2 of these failures, so more testing will be needed before we have a good idea of how often this problem can be expected to occur.

Anyone else have any ideas or questions?


Details of the symptoms of the failure on uct2-s8

The issue starts with this message:
sd 2:2:2:0: megasas: RESET -12716530 cmd=8a retries=0
Basically we're trying to send a RESET across the SAS channel.  This is a relatively common activity for SAS when dealing with a physical device timeout, but in this case it's failing.  Once we see this message:
megasas: failed to do reset
we know we've failed and that we have larger issues.
sd 2:2:2:0: megasas: RESET -12716530 cmd=8a retries=0
megasas: cannot recover from previous reset failures
sd 2:2:2:0: megasas: RESET -12716530 cmd=8a retries=0
megasas: cannot recover from previous reset failures
sd 2:2:2:0: scsi: Device offlined - not ready after error recovery
sd 2:2:2:0: scsi: Device offlined - not ready after error recovery

Once this happens, we are in the filesystem state known as "entirely broken":
sd 2:2:2:0: timing out command, waited 360s
sd 2:2:2:0: SCSI error: return code = 0x06000000
end_request: I/O error, dev sdg, sector 34909388801
sd 2:2:2:0: timing out command, waited 360s
sd 2:2:2:0: SCSI error: return code = 0x06000000
end_request: I/O error, dev sdg, sector 34909388808
sd 2:2:2:0: rejecting I/O to offline device
sd 2:2:2:0: rejecting I/O to offline device
sd 2:2:2:0: rejecting I/O to offline device
Device sdg, XFS metadata write error block 0x820c30001 in sdg
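Since the controller announces the failure in the kernel log before I/O stops entirely, we've been considering a simple watch for the fatal signature strings so we can alert before the filesystem dies.  A minimal sketch, using a captured sample of the messages above (on a live node the input would be /var/log/messages or dmesg output instead):

```shell
#!/bin/sh
# Sample of the kernel messages seen during a failure.  On a live node
# this text would be read from /var/log/messages or `dmesg`.
log='sd 2:2:2:0: megasas: RESET -12716530 cmd=8a retries=0
megasas: failed to do reset
megasas: cannot recover from previous reset failures
sd 2:2:2:0: scsi: Device offlined - not ready after error recovery'

# Count only the messages that mean the controller is past recovery;
# plain RESET attempts are routine and not counted.
fatal=$(printf '%s\n' "$log" | \
    grep -c -e 'failed to do reset' \
            -e 'cannot recover from previous reset failures')

if [ "$fatal" -gt 0 ]; then
    echo "ALERT: megasas controller unrecoverable ($fatal fatal messages)"
fi
```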

We have a call trace which indicates some kind of low-level IRQ problem:

Call Trace:
<IRQ>  [<ffffffff800bae01>] __report_bad_irq+0x30/0x7d
[<ffffffff800bb034>] note_interrupt+0x1e6/0x227
[<ffffffff800ba530>] __do_IRQ+0xbd/0x103
[<ffffffff80012348>] __do_softirq+0x89/0x133
[<ffffffff8006c9bf>] do_IRQ+0xe7/0xf5
[<ffffffff8005d615>] ret_from_intr+0x0/0xa
<EOI>  [<ffffffff801983e7>] acpi_processor_idle_simple+0x17d/0x30e
[<ffffffff801983e7>] acpi_processor_idle_simple+0x17d/0x30e
[<ffffffff801983e7>] acpi_processor_idle_simple+0x17d/0x30e
[<ffffffff80197b2d>] acpi_safe_halt+0x25/0x36
[<ffffffff8019834a>] acpi_processor_idle_simple+0xe0/0x30e
[<ffffffff8006b126>] __exit_idle+0x1c/0x2a
[<ffffffff8019826a>] acpi_processor_idle_simple+0x0/0x30e
[<ffffffff800494cc>] cpu_idle+0x95/0xb8
[<ffffffff803fd7fd>] start_kernel+0x220/0x225
[<ffffffff803fd22f>] _sinittext+0x22f/0x236

handlers:
[<ffffffff880b65ec>] (megasas_isr+0x0/0x45 [megaraid_sas])
Disabling IRQ #122
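Since the kernel disables the IRQ once it decides the handler is misbehaving, one thing we can watch is whether the megasas line in /proc/interrupts stops counting.  A sketch using a captured single-column sample (real /proc/interrupts has one count column per CPU; on a live node the awk would read the file directly, and IRQ 122 is taken from the trace above):

```shell
#!/bin/sh
# Sample /proc/interrupts line for the controller.  On a live node:
#   awk '/megasas/ ...' /proc/interrupts
sample=' 122:   48215637   IO-APIC-level  megasas'

# Pull the IRQ number (stripping the trailing colon) and the delivery
# count for the megasas handler.
irq=$(printf '%s\n' "$sample" | awk '/megasas/ {sub(":", "", $1); print $1}')
count=$(printf '%s\n' "$sample" | awk '/megasas/ {print $2}')

echo "megasas is on IRQ $irq with $count interrupts delivered"
# Comparing $count across two readings a few seconds apart would show
# whether the controller has gone silent after "Disabling IRQ".
```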

