Dell 2550 Perc 3/di hardware failure (?)

Brant Faircloth faircloth at gmail.com
Wed Apr 1 12:51:55 CDT 2009


Hi everyone,

I have an issue with one of our 2550s and a container that keeps dying.

Long story short: problems arose about two weeks ago, when the machine
kicked over into read-only mode following a container failure. Given
that the drives in the machine were very old, we purchased new drives,
set them back up in RAID 1, and I updated to Debian 5.0. After about 5
days of uptime, the container died yet again. After some searching
around on the web, I came across information indicating that more
recent kernels may have problems with the Perc 3/di and aacraid (i.e.,
issues with aacraid.dacmode: http://bugzilla.kernel.org/show_bug.cgi?id=9133).
Hoping that this was the issue, I low-level formatted the drives and
reinstalled Debian 4.0 (reverting to an older kernel, 2.6.18; we
have an identical machine with RAID 5 that is running well under
Debian 4). The install(s) have always gone without a hitch, but after
~15 hours of uptime, the problems have returned.
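
For anyone wanting to confirm which dacmode value a loaded driver is actually using, here is a minimal sketch. It assumes the aacraid build exposes the parameter under /sys/module/aacraid/parameters/dacmode, which may not be the case on older kernels such as 2.6.18. The function name and the default path are illustrative, not taken from any tool mentioned here.

```python
from pathlib import Path

# Assumed sysfs location; only present if the loaded aacraid build
# exposes the dacmode parameter (this is an assumption, not verified
# against the 2.6.18 kernel discussed in this post).
DACMODE_PATH = Path("/sys/module/aacraid/parameters/dacmode")


def read_dacmode(path=DACMODE_PATH):
    """Return the current dacmode value as a string, or None if unavailable."""
    try:
        return path.read_text().strip()
    except OSError:
        # Module not loaded, parameter not exposed, or not readable.
        return None


if __name__ == "__main__":
    value = read_dacmode()
    if value is None:
        print("aacraid dacmode parameter is not exposed on this system")
    else:
        print(f"aacraid dacmode = {value}")
```

If the file is missing, that only shows the parameter is not exposed, not that the bug is absent.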

The machine is currently up, but in read-only mode. I am pondering my
options, which include replacing the Perc 3/di with a different
controller or switching over to SCSI-only (i.e., no RAID). I'm not
entirely sure that the latter option is going to help, depending on
the source of the problems.

Any suggestions are appreciated and thanks in advance.

cheers,
-brant

The issue kicks off in syslog with:

Apr  1 03:14:03 charybdis kernel: AAC:ID(0:00:0) Timeout detected on cmd[0x2a]
Apr  1 03:14:03 charybdis kernel: AAC:ID(0:01:0) Timeout detected on cmd[0x2a]
Apr  1 03:14:03 charybdis kernel: AAC:SCSI Channel[0]: Timeout Detected On 10 Command(s)
Apr  1 03:14:03 charybdis kernel: AAC:HIM_EVENT_HA_FAILED:SCSI bus reset issued on channel 0
Apr  1 03:14:13 charybdis kernel: AAC: <...repeats 1 more times>
Apr  1 03:14:13 charybdis kernel: AAC:SCSI Channel[0]: Timeout Detected On 12 Command(s)
Apr  1 03:14:13 charybdis kernel: AAC:HIM_EVENT_HA_FAILED:SCSI bus reset issued on channel 0
Apr  1 03:14:23 charybdis kernel: AAC: <...repeats 1 more times>
Apr  1 03:14:13 charybdis kernel: AAC:HIM_EVENT_HA_FAILED:SCSI bus reset issued on channel 0
Apr  1 03:14:23 charybdis kernel: AAC: <...repeats 1 more times>
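
To see at a glance which devices time out and how often the bus is reset, a log excerpt like the one above can be tallied with a short script. This is a rough sketch: the regular expressions are written against the message shapes shown in this post only, the sample lines have the mail client's line wraps rejoined, and the function name is illustrative. Note that 0x2a is the SCSI WRITE(10) opcode, so these timeouts are on writes.

```python
import re
from collections import Counter

# Sample lines taken from the syslog excerpt in this post, with wrapped
# lines rejoined. Replace with the contents of your own syslog.
SAMPLE = """\
Apr  1 03:14:03 charybdis kernel: AAC:ID(0:00:0) Timeout detected on cmd[0x2a]
Apr  1 03:14:03 charybdis kernel: AAC:ID(0:01:0) Timeout detected on cmd[0x2a]
Apr  1 03:14:03 charybdis kernel: AAC:SCSI Channel[0]: Timeout Detected On 10 Command(s)
Apr  1 03:14:03 charybdis kernel: AAC:HIM_EVENT_HA_FAILED:SCSI bus reset issued on channel 0
"""

# Patterns matched to the messages shown above; other firmware or driver
# versions may word these differently.
DEVICE_TIMEOUT = re.compile(
    r"AAC:ID\((\d+:\d+:\d+)\) Timeout detected on cmd\[(0x[0-9a-fA-F]+)\]"
)
BUS_RESET = re.compile(
    r"AAC:HIM_EVENT_HA_FAILED:SCSI bus reset issued on channel (\d+)"
)


def summarize(text):
    """Count per-device command timeouts and per-channel bus resets."""
    timeouts = Counter()
    resets = Counter()
    for line in text.splitlines():
        m = DEVICE_TIMEOUT.search(line)
        if m:
            timeouts[(m.group(1), m.group(2))] += 1
            continue
        m = BUS_RESET.search(line)
        if m:
            resets[int(m.group(1))] += 1
    return timeouts, resets


if __name__ == "__main__":
    timeouts, resets = summarize(SAMPLE)
    for (dev, cmd), n in sorted(timeouts.items()):
        print(f"device {dev}: {n} timeout(s) on cmd {cmd}")
    for channel, n in sorted(resets.items()):
        print(f"channel {channel}: {n} bus reset(s)")
```

Lines of the form "<...repeats N more times>" are not expanded here, so the counts are lower bounds.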

Various additional info:

charybdis:/home/bcf# uname -a
Linux charybdis 2.6.18-6-686 #1 SMP Sat Dec 27 09:31:05 UTC 2008 i686 GNU/Linux

charybdis:/home/bcf# modinfo aacraid
filename:       /lib/modules/2.6.18-6-686/kernel/drivers/scsi/aacraid/aacraid.ko
author:         Red Hat Inc and Adaptec
description:    Dell PERC2, 2/Si, 3/Si, 3/Di, Adaptec Advanced Raid Products, HP NetRAID-4M, IBM ServeRAID & ICP SCSI driver
license:        GPL
version:        1.1-5[2409]-mh2
vermagic:       2.6.18-6-686 SMP mod_unload 686 REGPARM gcc-4.1

AFA0> controller details
Executing: controller details
Controller Information
----------------------
              Device Name: AFA0
          Controller Type: PERC 3/Di
              Access Mode: READ-WRITE
Controller Serial Number: Last Six Digits = 9C21D2
          Number of Buses: 2
          Devices per Bus: 15
           Controller CPU: i960 R series
     Controller CPU Speed: 100 Mhz
        Controller Memory: 128 Mbytes
            Battery State: Ok

Component Revisions
-------------------
                 CLI: 2.8-0 (Build #6076)
                 API: 2.8-0 (Build #6076)
     Miniport Driver: 1.1-5 (Build #2409)
Controller Software: 2.8-1 (Build #6098)
     Controller BIOS: 2.8-1 (Build #6098)
Controller Firmware: (Build #6098)

AFA0> container list
Executing: container list
Num          Total  Oth Chunk          Scsi   Partition
Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size
----- ------ ------ --- ------ ------- ------ -------------
  0    Mirror 68.4GB            Valid   0:00:0 64.0KB!68.4GB
  /dev/sda             R1Mirror0        0:01:0 64.0KB!68.4GB

AFA0> disk list
Executing: disk list

B:ID:L  Device Type     Blocks    Bytes/Block Usage            Shared
------  --------------  --------- ----------- ---------------- ------
0:01:0   Disk            0         0           Offline         NO

AFA0> disk show space
Executing: disk show space

Scsi B:ID:L Usage      Size
----------- ---------- -------------
   0:00:0     Dead      64.0KB:68.4GB
   0:01:0     Dead      64.0KB:68.4GB


AFA0> enclosure show status
Executing: enclosure show status
Command Error: <The command or requested operation to the disk enclosure failed.>


AFA0> disk show smart
Executing: disk show smart

      Smart    Method of         Enable
      Capable  Informational     Exception  Performance  Error
B:ID:L  Device   Exceptions(MRIE)  Control    Enabled      Count
------  -------  ----------------  ---------  -----------  ------
0:01:0     N

AFA0> diagnostic show history

Executing: diagnostic show history
No switches specified, defaulting to "/current".

  *** HISTORY BUFFER FROM CURRENT CONTROLLER RUN ***

[00]: MpdEvent event bus 0 = 1 (HIM_EVENT_IO_CHANNEL_RESET).
[01]: SCSI bus reset detected on channel 0
[02]: neither side of mirror exists, can't write data
[03]:  <...repeats 50 more times>
[04]: MpdEvent event bus 0 = 8 (HIM_EVENT_TRANSPORT_MODE_CHANGE).
[05]:
[06]: HIM_EVENT_TRANSPORT_MODE_CHANGE-SCSI bus reset issued on ch
[07]: annel 0
[08]: neither side of mirror exists, can't write data
[09]:  <...repeats 1031 more times>
[10]: MpdEvent event bus 0 = 8 (HIM_EVENT_TRANSPORT_MODE_CHANGE).
[11]:
[12]: HIM_EVENT_TRANSPORT_MODE_CHANGE-SCSI bus reset issued on ch
[13]: annel 0
[14]: neither side of mirror exists, can't write data
[15]:  <...repeats 235999 more times>
[16]: MpdEvent event bus 0 = 1 (HIM_EVENT_IO_CHANNEL_RESET).
[17]: SCSI bus reset detected on channel 0
[18]: neither side of mirror exists, can't write data
[19]:  <...repeats 125 more times>
[20]: MpdEvent event bus 0 = 8 (HIM_EVENT_TRANSPORT_MODE_CHANGE).
[21]:
[22]: HIM_EVENT_TRANSPORT_MODE_CHANGE-SCSI bus reset issued on ch
[23]: annel 0
[24]: neither side of mirror exists, can't write data
[25]:  <...repeats 571453 more times>
[26]: MpdEvent event bus 0 = 8 (HIM_EVENT_TRANSPORT_MODE_CHANGE).
[27]:
[28]: HIM_EVENT_TRANSPORT_MODE_CHANGE-SCSI bus reset issued on ch
[29]: annel 0
[30]: neither side of mirror exists, can't write data
[31]:  <...repeats 1303 more times>
[32]: MpdEvent event bus 0 = 8 (HIM_EVENT_TRANSPORT_MODE_CHANGE).
[33]:
[34]: HIM_EVENT_TRANSPORT_MODE_CHANGE-SCSI bus reset issued on ch
[35]: annel 0
[36]: neither side of mirror exists, can't write data
[37]:  <...repeats 31686 more times>
[38]: MpdEvent event bus 0 = 8 (HIM_EVENT_TRANSPORT_MODE_CHANGE).
[39]:
[40]: HIM_EVENT_TRANSPORT_MODE_CHANGE-SCSI bus reset issued on ch
[41]: annel 0
[42]: neither side of mirror exists, can't write data
[43]:  <...repeats 128 more times>
[44]: MpdEvent event bus 0 = 8 (HIM_EVENT_TRANSPORT_MODE_CHANGE).
[45]:
[46]: HIM_EVENT_TRANSPORT_MODE_CHANGE-SCSI bus reset issued on ch
[47]: annel 0
[48]: neither side of mirror exists, can't write data
[49]:  <...repeats 5099 more times>
[50]: MpdEvent event bus 0 = 8 (HIM_EVENT_TRANSPORT_MODE_CHANGE).
[51]:
[52]: HIM_EVENT_TRANSPORT_MODE_CHANGE-SCSI bus reset issued on ch
[53]: annel 0
[54]: MpdEvent event bus 0 = 1 (HIM_EVENT_IO_CHANNEL_RESET).
[55]: SCSI bus reset detected on channel 0
[56]: neither side of mirror exists, can't write data
[57]:  <...repeats 142 more times>
[58]: MpdEvent event bus 0 = 1 (HIM_EVENT_IO_CHANNEL_RESET).
[59]: SCSI bus reset detected on channel 0
[60]: neither side of mirror exists, can't write data
[61]:  <...repeats 56 more times>
[62]: MpdEvent event bus 0 = 8 (HIM_EVENT_TRANSPORT_MODE_CHANGE).
[63]:
[64]: HIM_EVENT_TRANSPORT_MODE_CHANGE-SCSI bus reset issued on ch
[65]: annel 0
[66]: neither side of mirror exists, can't write data
[67]:  <...repeats 3299 more times>
[68]: MpdEvent event bus 0 = 8 (HIM_EVENT_TRANSPORT_MODE_CHANGE).
[69]:
[70]: HIM_EVENT_TRANSPORT_MODE_CHANGE-SCSI bus reset issued on ch
[71]: annel 0
[72]: neither side of mirror exists, can't write data
[73]:  <...repeats 4512299 more times>
[74]: ID(0:01:0); Simulating selection timeout due to NEXUS_ERROR
[75]:  [command:0x28]
[76]: ID(0:01:0) Cmd[0x28] Fail: Block Range 594332 : 594333 at 4
[77]: 8104 sec
[78]: failing read io
[79]: neither side of mirror exists, can't write data
[80]:  <...repeats 1586099 more times>
[81]: ID(0:01:0); Simulating selection timeout due to NEXUS_ERROR
[82]:  [command:0x28]
[83]: ID(0:01:0) Cmd[0x28] Fail: Block Range 601574 : 601575 at 4
[84]: 9720 sec
[85]: failing read io
[86]: neither side of mirror exists, can't write data
[87]:  <...repeats 9357402 more times>
[88]: 2 can't read mbr dev_t:0
[89]: 2 can't read mbr dev_t:1
[90]: 2 can't read mbr dev_t:0
[91]: 2 can't read mbr dev_t:1
[92]: neither side of mirror exists, can't write data
[93]:  <...repeats 24399 more times>
[94]: 2 can't read mbr dev_t:1
[95]: neither side of mirror exists, can't write data
[96]:  <...repeats 596299 more times>
[97]: 2 can't read mbr dev_t:1
[98]: neither side of mirror exists, can't write data
[99]:

========================
History Output Complete.
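
Because the history buffer compresses repeated messages into "<...repeats N more times>" lines, it is hard to see by eye which message dominates. The sketch below adds each repeat count to the message immediately before it. It assumes the "[NN]: text" layout shown above; the sample is a short subset of the real buffer, and the function name is illustrative. Entries the controller wrapped across two slots (for example "ch" / "annel 0") are counted as separate fragments.

```python
import re
from collections import Counter

# A few entries copied from the history buffer above; paste the full
# output here to tally the whole buffer.
SAMPLE = """\
[00]: MpdEvent event bus 0 = 1 (HIM_EVENT_IO_CHANNEL_RESET).
[01]: SCSI bus reset detected on channel 0
[02]: neither side of mirror exists, can't write data
[03]:  <...repeats 50 more times>
[08]: neither side of mirror exists, can't write data
[09]:  <...repeats 1031 more times>
"""

ENTRY = re.compile(r"^\[(\d+)\]:\s?(.*)$")
REPEAT = re.compile(r"<\.\.\.repeats (\d+) more times>")


def tally(text):
    """Return a Counter of messages, expanding 'repeats N more times' lines."""
    counts = Counter()
    previous = None
    for line in text.splitlines():
        m = ENTRY.match(line.strip())
        if not m:
            continue
        body = m.group(2).strip()
        if not body:
            continue  # empty slot such as "[05]:"
        r = REPEAT.search(body)
        if r:
            # Credit the repeats to the message that preceded this line.
            if previous is not None:
                counts[previous] += int(r.group(1))
            continue
        counts[body] += 1
        previous = body
    return counts


if __name__ == "__main__":
    for message, n in tally(SAMPLE).most_common():
        print(f"{n:>10}  {message}")
```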

-------------------
Brant Faircloth

< * )
  (_ \\
  _ ||






More information about the Linux-PowerEdge mailing list