PE1950 PERC6/E crashes

Stephen Dowdy sdowdy at ucar.edu
Thu Sep 3 18:53:52 CDT 2009


Howard,

Howard Powell wrote, On 09/03/2009 05:17 PM:

> I just updated the PE1950 BIOS to version: 2.6.1
> I updated the PE1950 BMC to 2.37
> I just updated the PERC 6/E firmware to: 6.2.0-0013
> The PERC 6/E driver is the "in box" driver provided by the OS:
> 00.00.04.01-RH1
> The MD1000 firmware is version A.04
> 
> From what I can tell, everything is up to date.

Can't vouch for that, but I'd certainly suggest double-checking
those items.  Also, check the MD1000 *drive* firmware versions
(see below).

> Under heavy I/O load, such as when moving several hundred GB of files
> over a few hours the megasas controller will panic and the MD1000 RAID
> will go offline.  I've appended to this email the klog output when this
> happens.  This problem has been occurring for months under various
> kernels and various (older) firmware and drivers.

I'd grab the MegaSAS tools from LSI and use

    megacli fwtermlog dsply a0

to see if anything aberrant is going on inside the PERC6.  (have fun
entering the world of the psychotic embedded device programmer!)

Also, grab megasasctl from sourceforge and run:

    megasasctl -svv a0

(These presume adapter 0; use a1, a2... as needed.)
megasasctl will show you the drive model and firmware.
There are known issues with some drives that may cause
the PERC6 to "get lost" while handling them, and I've often
seen the Linux kernel "give up" if it doesn't get a response
and mark the filesystem "offline".
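
If there is more than one controller, the two commands above can be
generated per adapter.  This is only a rough sketch: the function name
adapter_cmds is made up for illustration, and it just prints the same
two invocations shown above (it does not run them), so they can be
reviewed first.

```shell
# adapter_cmds COUNT
# Print (do not run) the two diagnostic commands from this mail
# for adapters a0 .. a(COUNT-1).  "adapter_cmds" is a hypothetical name.
adapter_cmds() {
    count="$1"
    n=0
    while [ "$n" -lt "$count" ]; do
        echo "megacli fwtermlog dsply a$n"
        echo "megasasctl -svv a$n"
        n=$((n + 1))
    done
}
```

Once the list looks right, something like "adapter_cmds 2 | sh" would
run them for a0 and a1, assuming both tools are in the PATH.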

Disk firmware can usually be found at support.dell.com under
the specific device (PowerEdge 1950, or possibly an entry for
the MD1000).  You usually need to burn a CD and update that
way.


> Sep  3 18:17:29 halo kernel: irq 169: nobody cared (try booting with the
> "irqpoll" option)

Good Old "nobody cared".

    grep 169: /proc/interrupts

to see what's assigned to that IRQ.
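
A plain grep for "169:" would also match a line for some other IRQ whose
number merely ends in 169, so an exact match on the first column is a
little safer.  A minimal sketch, assuming the usual /proc/interrupts
layout where the first field is the IRQ number followed by a colon (the
function name irq_handlers is made up for illustration):

```shell
# irq_handlers IRQ [FILE]
# Print the /proc/interrupts line for exactly one IRQ.
# FILE defaults to /proc/interrupts; a saved copy works too.
irq_handlers() {
    irq="$1"
    file="${2:-/proc/interrupts}"
    # awk strips leading blanks, so the first field is e.g. "169:".
    # Compare it as a whole string so that 16 does not match 169.
    awk -v irq="$irq" '$1 == (irq ":") { print }' "$file"
}
```

"irq_handlers 169" then prints the interrupt counts and the driver
name(s) sharing that line, or nothing if the IRQ is not listed.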

Do you have an nVidia graphics card on that system?  If so, what's in

    cat /proc/driver/nvidia/version

Also try booting with "processor.max_cstate=1"
(the ACPI processor idling in the call trace is suspicious).
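
After rebooting, it is worth confirming that the option really made it
onto the kernel command line.  A small sketch, assuming the running
kernel exposes its boot arguments in /proc/cmdline (the function name
has_boot_param is made up for illustration):

```shell
# has_boot_param PARAM [FILE]
# Succeed if PARAM appears as a whole word on the kernel command line.
# FILE defaults to /proc/cmdline.
has_boot_param() {
    param="$1"
    file="${2:-/proc/cmdline}"
    # Put one argument per line, then require an exact fixed-string
    # match so max_cstate=1 does not also match max_cstate=10.
    tr ' ' '\n' < "$file" | grep -Fqx -- "$param"
}
```

For example, "has_boot_param processor.max_cstate=1 && echo set" prints
"set" only when the option is present; the same check works for
"irqpoll".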

Dell's been having some BIOS IRQ-routing issues that I've been
running into lately, though I haven't seen them on 9th-gen systems.

--stephen


> Sep  3 18:30:32 halo kernel:  [<ffffffff800b7c61>] __do_IRQ+0xbd/0x103
> Sep  3 18:30:32 halo kernel:  [<ffffffff80011fc4>] __do_softirq+0x89/0x133
> Sep  3 18:30:32 halo kernel:  [<ffffffff8006c95d>] do_IRQ+0xe7/0xf5
> Sep  3 18:30:32 halo kernel:  [<ffffffff8018d056>]
> acpi_processor_idle+0x0/0x440
> Sep  3 18:30:32 halo kernel:  [<ffffffff8005d615>] ret_from_intr+0x0/0xa
> Sep  3 18:30:32 halo kernel:  <EOI>  [<ffffffff8018cfe7>]
> acpi_safe_halt+0x25/0x36
> Sep  3 18:30:32 halo kernel:  [<ffffffff8018d1dd>]
> acpi_processor_idle+0x187/0x440
> Sep  3 18:30:32 halo kernel:  [<ffffffff8018d056>]
> acpi_processor_idle+0x0/0x440




-- 
Stephen Dowdy  -  Systems Administrator  -  NCAR/RAL
303.497.2869   -  sdowdy at ucar.edu        -  http://www.ral.ucar.edu/~sdowdy/


