PERC 4, RAID 5 problem on PE 2850

linux-poweredge at mjo.tc linux-poweredge at mjo.tc
Wed Sep 6 06:04:52 CDT 2006


Hi,

We had a problem with a PE 2850 at the weekend: the disks appeared to
have essentially disappeared! When I logged into the console via the DRAC
card, I found myself in a BusyBox shell with no disks visible at all.
Somewhat irritated, I cycled the power and all appeared to be well.

At the time it crashed the system wasn't significantly loaded, and
there doesn't appear to be anything useful in the kernel
logs. However, the PERC card does have some info: it appears that one
of the drives timed out, then was removed from the system, and the
hot-spare was then used to rebuild the array.

Some details: the machine has a DRAC 4 with 3 Seagate drives in RAID
5, plus a hot spare. It runs Ubuntu with a locally compiled Xen
kernel. I've been using the LSI megarc utility to query the card,
because we don't have OMSA.

    ------------------------------------------------------------
    Exec: -pdFailInfo -a0 -chAll -idAll
    ------------------------------------------------------------
    Extended Phys Drv Failure Log:
    Adp-0 Ch 0 Id 01 [SEAGATE ST373207LC      D704]
    MM-DD-YY hh:mm:ss Ch Id   Reason/Reason String
    09-02-06 03:41:04 0  01   Select timeout
    
    ------------------------------------------------------------
    Exec: -getNVRAMLog -a0
    ------------------------------------------------------------
    ==================
    Maximum entries supported          : 64
    Total entries in use               : 1
    Sequence number of the first entry : 0
    Sequence number of the last entry  : 0
    NVRAM Logged:
    SenseData:
    00 00 09 00 00 00 00 00 00 00 00 00 80 24 00 00 00 00 
    CdbData:
    2a 00 05 a5 0b c0 00 00 40 00 00 00 00 00 00 00 
    SeqNo=0 ctl=0 chn=0 tgt=1 event= 36:PHYSDEV_REMOVED_DEAD
    Logged at: Sep 02 03:41:04 2006
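
For anyone reading the NVRAM entry above, the CdbData bytes can be
decoded by hand. This is a minimal sketch, assuming the controller logs
an ordinary 10-byte SCSI CDB padded with zeros to 16 bytes, and that it
is the command that was in flight when the drive was dropped. Neither
assumption is confirmed by the megarc output. The names decode_cdb10 and
OPCODES_10 are made up for illustration.

```python
import struct

# CdbData bytes copied verbatim from the NVRAM log entry.
CDB_HEX = "2a 00 05 a5 0b c0 00 00 40 00 00 00 00 00 00 00"

# Only two 10-byte opcodes are listed; this is not a complete table.
OPCODES_10 = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def decode_cdb10(cdb_hex):
    """Decode the first 10 bytes as a 10-byte read/write CDB (assumed layout)."""
    raw = bytes.fromhex(cdb_hex.replace(" ", ""))
    # Big-endian fields: opcode, flags, 32-bit LBA, group number,
    # 16-bit transfer length, control byte.
    opcode, _flags, lba, _group, length, _control = struct.unpack(
        ">BBIBHB", raw[:10]
    )
    return {
        "op": OPCODES_10.get(opcode, hex(opcode)),
        "lba": lba,
        "blocks": length,
    }


info = decode_cdb10(CDB_HEX)
print(info)
# {'op': 'WRITE(10)', 'lba': 94702528, 'blocks': 64}
```

Read this way, the entry would correspond to a 64-block write at LBA
0x05a50bc0 on that physical drive, which looks like routine array
traffic rather than anything unusual.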

It seems to me that the PERC card failed here: I'm surprised that a
select timeout is enough to mark the drive as bad, but even if it is,
then I think the system should have kept working.
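
The two log excerpts above can also be compared mechanically. The
following is a rough sketch that parses the failure-log line (using the
MM-DD-YY hh:mm:ss layout given in its header) and the NVRAM "Logged at"
stamp, then checks whether they refer to the same moment. The function
names are hypothetical, and the month abbreviation parsing assumes an
English locale.

```python
from datetime import datetime

# Lines copied from the megarc output quoted in this message.
FAIL_LOG_LINE = "09-02-06 03:41:04 0  01   Select timeout"
NVRAM_LOGGED_AT = "Logged at: Sep 02 03:41:04 2006"


def parse_fail_log(line):
    """Split a -pdFailInfo row into (timestamp, channel, target id, reason)."""
    date, time, chan, target, reason = line.split(None, 4)
    when = datetime.strptime(date + " " + time, "%m-%d-%y %H:%M:%S")
    return when, int(chan), int(target), reason


def parse_nvram_stamp(line):
    """Parse the 'Logged at:' line of an NVRAM log entry."""
    stamp = line.split(":", 1)[1].strip()
    return datetime.strptime(stamp, "%b %d %H:%M:%S %Y")


fail_time, chan, target, reason = parse_fail_log(FAIL_LOG_LINE)
nvram_time = parse_nvram_stamp(NVRAM_LOGGED_AT)

# True when both logs carry an identical timestamp.
same_second = fail_time == nvram_time
print(chan, target, reason, same_second)
```

Both entries name channel 0, target 1 and carry the same timestamp,
which is consistent with the select timeout and the
PHYSDEV_REMOVED_DEAD event being one incident rather than two.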

Some questions:

1. Am I right to think that this is definitely a problem with the PERC
   or the drives, and it's not a problem with my kernel ?

2. Should the card have kept working ? If so, have I misconfigured
   something (I enclose more details below) ? If so, how should I fix
   this ?

3. I'd like to read the SMART log from the failed drive. Can I do that
   with the PERC or will I have to plug it into another machine ? If
   so, is there a recommended SCSI PCI card ?

4. Has anyone seen this sort of thing before ? Is it likely that the
   whole batch of drives is dodgy, or is it more likely that the PERC
   is broken/over-zealous/misconfigured ?

Other data:
    ------------------------------------------------------------
    Exec: -ctlrInfo -a0
    ------------------------------------------------------------
    Information of Adapter-0 (#Adapter(s) on system: 1)
    Firmware Version : 521X    BIOS Version : H430
    Logical Drives : 01    DRAM : 256MB
    Rebuild Rate : 30%
    Flush Interval : 4 secs
    Number Of Chnls : 2    Bios Status : Enabled
    Alarm State : Absent    Auto Rebuild : Enabled
    FW : SPAN-8, 40-LD    BIOS Config AutoSelection : USER
    BIOS Echos Mesg : ON    BIOS Stops On Error : ON
    Initiator Id : 7 (Clustered Firmware)
    Board SN: 33686018

All drives are Seagate ST373207LC rev. D704, serial numbers 3KT4Q7KS,
3KT4Q6LV, 3KT4Q4X4, and 3KT4QADX.
 
Cheers,
-- 
Martin Oldfield
AdamsNames Limited


