2550 Perc3/Di Enclosure "offline"?! Help! (long)

David Kinnvall david.kinnvall at alertir.com
Thu Aug 15 10:45:00 CDT 2002


Hi all!

This is a pretty long one, since I wanted to make sure to include as
many relevant details as possible. Please bear with me.

Early this morning, about 00:37 CEST, there was a short burst of
log activity from the scsi/aacraid layers on one of the 2550's in
our rack. Excerpt from /var/log/messages:

Aug 15 00:37:02 db1 kernel: AAC:ID(0:00:0) Timeout detected on cmd[0x28]
Aug 15 00:37:02 db1 kernel: AAC:ID(0:01:0) Timeout detected on cmd[0x28]
Aug 15 00:37:02 db1 kernel: AAC:ID(0:02:0) Timeout detected on cmd[0x28]
Aug 15 00:37:02 db1 kernel: AAC:SCSI Channel[0]: Timeout Detected On 19 Command(s)
Aug 15 00:37:07 db1 kernel: AAC:SCSI Channel[0]: Timeout Detected On 5 Command(s)
Aug 15 00:37:12 db1 kernel: AAC:ID(0:01:0); Abort Timeout. Resetting Bus 0
Aug 15 00:37:13 db1 kernel: AAC:SCSI Channel[0]: Timeout Detected On 7 Command(s)
Aug 15 00:38:02 db1 kernel: AAC:ID(0:00:0); Error Event [command:0x28]
Aug 15 00:38:02 db1 kernel: AAC:ID(0:00:0); Unit Attention [k:0x6,c:0x29,q:0x2]
Aug 15 00:38:02 db1 kernel: AAC:ID(0:01:0); Error Event [command:0x28]
Aug 15 00:38:02 db1 kernel: AAC:ID(0:01:0); Unit Attention [k:0x6,c:0x29,q:0x2]
Aug 15 00:38:02 db1 kernel: AAC:ID(0:02:0); Error Event [command:0x28]
Aug 15 00:38:02 db1 kernel: AAC:ID(0:02:0); Unit Attention [k:0x6,c:0x29,q:0x2]
Aug 15 00:38:02 db1 kernel: AAC:ID(0:00:0) Timeout detected on cmd[0x2a]
Aug 15 00:38:02 db1 kernel: AAC:ID(0:01:0) Timeout detected on cmd[0x2a]
Aug 15 00:38:02 db1 kernel: AAC:ID(0:02:0) Timeout detected on cmd[0x2a]
Aug 15 00:38:02 db1 kernel: AAC:SCSI Channel[0]: Timeout Detected On 14 Command(s)
Aug 15 00:38:02 db1 kernel: AAC:ID(0:00:0); Aborted Command [command:0x2a]
Aug 15 00:38:02 db1 kernel: AAC:ID(0:01:0); Aborted Command [command:0x2a]
Aug 15 00:38:02 db1 kernel: AAC:ID(0:02:0); Aborted Command [command:0x2a]
Aug 15 00:38:02 db1 kernel: AAC:SCSI Channel[0]: Timeout Detected On 7 Command(s)
Aug 15 00:38:02 db1 kernel: scsi : aborting command due to timeout : pid 0, scsi0, channel 0, id 0, lun 0 Write (10) 00 01 37 62 30 00 00 10 00
Aug 15 00:38:02 db1 kernel: aacraid:0 ABORT
Aug 15 00:38:02 db1 kernel: scsi : aborting command due to timeout : pid 0, scsi0, channel 0, id 0, lun 0 Write (10) 00 01 51 3a 28 00 00 10 00
Aug 15 00:38:02 db1 kernel: aacraid:0 ABORT
Aug 15 00:38:02 db1 kernel: aacraid:0 ABORT
Aug 15 00:38:02 db1 kernel: AAC:ID(0:00:0); Abort Timeout. Resetting Bus 0
Aug 15 00:38:02 db1 kernel: AAC:Enclosure 0:6:0 offline
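For quick reference, the hex codes in those lines decode straightforwardly
per the standard SCSI command and sense-code tables (a minimal sketch; the
tables below are trimmed to just the values seen in this excerpt):

```python
# Decode the opcode, sense and CDB bytes from the aacraid log lines above.
# Tables trimmed to the codes that actually appear (per SCSI-2/SPC).

OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
SENSE_KEYS = {0x6: "UNIT ATTENTION"}
ASC_ASCQ = {(0x29, 0x02): "SCSI bus reset occurred"}

def decode_sense(cmd, key, asc, ascq):
    """Turn e.g. cmd[0x28] plus [k:0x6,c:0x29,q:0x2] into readable text."""
    return "%s failed: %s (%s)" % (
        OPCODES.get(cmd, "opcode 0x%02x" % cmd),
        SENSE_KEYS.get(key, "sense key 0x%x" % key),
        ASC_ASCQ.get((asc, ascq), "ASC/ASCQ 0x%02x/0x%02x" % (asc, ascq)))

def parse_rw10(cdb_bytes_hex):
    """Pull the LBA and block count out of the nine bytes printed after
    'Write (10)' in the abort messages: flags, 4-byte LBA, group,
    2-byte transfer length, control."""
    b = [int(x, 16) for x in cdb_bytes_hex.split()]
    lba = (b[1] << 24) | (b[2] << 16) | (b[3] << 8) | b[4]
    nblocks = (b[6] << 8) | b[7]
    return lba, nblocks

print(decode_sense(0x28, 0x6, 0x29, 0x2))
# -> READ(10) failed: UNIT ATTENTION (SCSI bus reset occurred)
print(parse_rw10("00 01 37 62 30 00 00 10 00"))
# -> (20406832, 16)
```

In other words: the timed-out commands are ordinary reads and writes of a
few KB each, well within the disks, and the Unit Attention from each drive
says the bus was reset, which matches the controller's own "Resetting
Bus 0" lines.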

Whoops, "enclosure offline"?!

Hmm. That wasn't what I expected after 251 days of uptime without any
major hiccups. Getting into afacli to check the controller's view of
things, I find that the enclosure *is* actually offline, as the log
indicates, meaning I can no longer access any information from it at
all:

AFA0> enclosure list
Executing: enclosure list
Enclosure 0 not found.
Command Error: <The operation failed because the specified enclosure is offline.>

Of course, since the enclosure manages the drives physically(?), I can
no longer manipulate the drives for hot-swaps and the like, nor can I
monitor backplane status (fans, temperatures, etc.). I have been in
contact with Dell support in Sweden (one phone call and one mail
exchange), but neither a cause nor a solution has been identified yet.

My primary hope for a hint at a solution now rests with this list. (praise!)

More background:
- I have two identically configured machines running as db servers:
   2550's with 1GB RAM, (3+1)*18GB disk in RAID5, 2*1GHz P3 (plain).
   Red Hat Linux 7.2 with _old_ 2.4.7-10smp kernel, configured that way
   because of problems with the aacraid and e100 drivers in the updated
   kernels of that time. It has been *very* stable so far.
- Primary app is PostgreSQL 7.2 handling a real-time stock-exchange
   feed (20-30 db updates/sec, 24/7) and webserver queries against it.
- The machine is still up and the RAID array is still handling the
   data, both reads and writes. It is just that the enclosure seems to
   have popped out of existence as far as the controller is concerned.
- There are no grown defects on any of the drives, the cache memory is
   there, the battery is charged and everything else is fine as well.
- There have been no other errors/problems related to aacraid/scsi in
   the logs for a *long* time.
- The load is typically around 0.2-0.3, with occasional spikes mostly
   depending on peculiarities in the stock-exchange feed.
- There have been no hardware or software upgrades/changes on the
   machine, except for openssl/openssh updates for security reasons.
- It is located behind a firewall with no public access allowed.

My most urgent concerns:
- Is the controller/enclosure/array *actually* faulty?
- Is it just a cosmic glitch that is solved by a reboot, making
   the enclosure known in controller space again?
- Do I risk losing data/hw by rebooting (sigh, the uptime...)?
- Can I get the enclosure back online *without* rebooting?
- Has the time come for the machine to be upgraded to RH7.3?
- Is RH7.3 with all associated drivers solid on a 2550 with dual
   1GHz P3, bcm5700, two e100's, and a Perc3/Di in a RAID5 setup? (I
   could reliably hang the machine with most 7.2-errata kernels under
   enough load, which is why I'm still stuck with 2.4.7-10smp...)

Appended below is some output from afacli to provide more insight,
including the diagnostic history at the bottom. It looks interesting
to me and may well provide a clue to someone with more knowledge.

With the very best regards,

David Kinnvall
Alert Investor Relations

========

AFA0> controller details
Executing: controller details
Controller Information
----------------------
          Remote Computer: .
              Device Name: AFA0
          Controller Type: PERC 3/Di
              Access Mode: READ-WRITE
Controller Serial Number: Last Six Digits = 1410D2
          Number of Buses: 1
          Devices per Bus: 15
           Controller CPU: i960 R series
     Controller CPU Speed: 100 Mhz
        Controller Memory: 126 Mbytes
            Battery State: Ok

Component Revisions
-------------------
                 CLI: 3.0-0 (Build #4880)
                 API: 3.0-0 (Build #4880)
     Miniport Driver: 3.0-0 (Build #5125)
Controller Software: 2.6-0 (Build #3512)
     Controller BIOS: 2.6-0 (Build #3512)
Controller Firmware: (Build #3512)

AFA0> disk list
Executing: disk list

B:ID:L  Device Type     Blocks    Bytes/Block Usage            Shared Rate
------  --------------  --------- ----------- ---------------- ------ ----
0:00:0   Disk            35566478  512         Initialized      NO     160
0:01:0   Disk            35566478  512         Initialized      NO     160
0:02:0   Disk            35566478  512         Initialized      NO     160
0:03:0   Disk            35566478  512         Initialized      NO     160

AFA0> container list
Executing: container list
Num          Total  Oth Chunk          Scsi   Partition
Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size
----- ------ ------ --- ------ ------- ------ -------------
  0    RAID-5 33.8GB       32KB Open    0:00:0 64.0KB:16.9GB
  /dev/sda                              0:01:0 64.0KB:16.9GB
                                        0:02:0 64.0KB:16.9GB

AFA0> enclosure list
Executing: enclosure list
Enclosure 0 not found.
Command Error: <The operation failed because the specified enclosure is offline.>

AFA0> enclosure show status
Executing: enclosure show status

!!Attempt to communicate with enclosure number 0 failed!!

AFA0> task list
Executing: task list

Controller Tasks

TaskId Function  Done%  Container State Specific1 Specific2
------ -------- ------- --------- ----- --------- ---------

No tasks currently running on controller

AFA0> diagnostic show history
Executing: diagnostic show history
No switches specified, defaulting to "/current".



  *** HISTORY BUFFER FROM CURRENT CONTROLLER RUN ***

[00]: NameServe: bogus mount index requested [10]
[01]: NameServe: bogus mount index requested [11]
[02]: NameServe: bogus mount index requested [12]
[03]: NameServe: bogus mount index requested [13]
[04]: NameServe: bogus mount index requested [14]
[05]: NameServe: bogus mount index requested [15]
[06]: Enclosure 0 - Temperature 235, over threshold 120
[07]: Enclosure 0 - Temperature 190, over threshold 120
[08]: Enclosure 0 - Temperature 168, over threshold 120
[09]: ID(0:00:0) Timeout detected on cmd[0x28]
[10]: ID(0:00:0): Timeout detected on cmd[0x28]
[11]:  <...repeats 5 more times>
[12]: ID(0:01:0) Timeout detected on cmd[0x28]
[13]: ID(0:01:0): Timeout detected on cmd[0x28]
[14]:  <...repeats 5 more times>
[15]: ID(0:02:0) Timeout detected on cmd[0x28]
[16]: ID(0:02:0): Timeout detected on cmd[0x28]
[17]:  <...repeats 3 more times>
[18]: SCSI Channel[0]: Timeout Detected On 19 Command(s)
[19]: ID(0:00:0): Timeout detected on cmd[0x28]
[20]: ID(0:01:0): Timeout detected on cmd[0x28]
[21]:  <...repeats 1 more times>
[22]: ID(0:02:0): Timeout detected on cmd[0x28]
[23]:  <...repeats 1 more times>
[24]: SCSI Channel[0]: Timeout Detected On 5 Command(s)
[25]: ID(0:00:0): Timeout detected on cmd[0x28]
[26]:  <...repeats 6 more times>
[27]: ID(0:01:0); Abort Timeout. Resetting Bus 0
[28]:  --- SrbFlags=8000, Nexus Flags=211B3
[29]: ==> Deferring all io on bus: 0
[30]: SCSI Channel[0]: Timeout Detected On 7 Command(s)
[31]:   >> SP_IsolateError: Error isolation initiated on bus: 0
[32]:           >> SP_IsolateError: 0:00:0 Enable Output
[33]:                   >> SP_IsolateError: 0:00:0 Waiting for completion: 8
[34]:  <...repeats 1 more times>
[35]: ID(0:00:0); Error Event [command:0x28]
[36]: ID(0:00:0); Unit Attention [k:0x6,c:0x29,q:0x2]
[37]: ID(0:00:0); Power On, Reset, or Bus Device Reset
[38]:

[39]:           >> SP_IsolateError: 0:00:0 Re-Lock
[40]:                   >> SP_IsolateError: Start Drain 0:00:0 (3 outstanding) a
[41]: t 16488136 ticks
[42]:                   >> SP_IsolateError: Draining 0:00:0 (3 outstanding)
[43]:                   >> SP_IsolateError: Draining 0:00:0 Done
[44]:           >> SP_IsolateError: 0:01:0 Enable Output
[45]:                   >> SP_IsolateError: 0:01:0 Waiting for completion: 9
[46]: ID(0:01:0); Error Event [command:0x28]
[47]: ID(0:01:0); Unit Attention [k:0x6,c:0x29,q:0x2]
[48]: ID(0:01:0); Power On, Reset, or Bus Device Reset
[49]:

[50]:           >> SP_IsolateError: 0:01:0 Re-Lock
[51]:                   >> SP_IsolateError: Start Drain 0:01:0 (6 outstanding) a
[52]: t 16488142 ticks
[53]:                   >> SP_IsolateError: Draining 0:01:0 Done
[54]:           >> SP_IsolateError: 0:02:0 Enable Output
[55]:                   >> SP_IsolateError: 0:02:0 Waiting for completion: 7
[56]: ID(0:02:0); Error Event [command:0x28]
[57]: ID(0:02:0); Unit Attention [k:0x6,c:0x29,q:0x2]
[58]: ID(0:02:0); Power On, Reset, or Bus Device Reset
[59]:

[60]:           >> SP_IsolateError: 0:02:0 Re-Lock
[61]:                   >> SP_IsolateError: Start Drain 0:02:0 (6 outstanding) a
[62]: t 16488149 ticks
[63]:                   >> SP_IsolateError: Draining 0:02:0 Done
[64]:           >> SP_IsolateError: 0:06:0 Enable Output
[65]:                   >> SP_IsolateError: 0:06:0 Waiting for completion: 1
[66]:  <...repeats 26 more times>
[67]:                   >> SP_IsolateError: Device 0:06:0 FAILED (timed out)
[68]: >> SP_Isolate releasing all drives!
[69]: [SP_Task] Bus 0: Not Started: 13, FlushFlags: 0x0
[70]: , calling scheduler
[71]:

[72]: ID(0:00:0) Timeout detected on cmd[0x2a]
[73]: ID(0:00:0): Timeout detected on cmd[0x2a]
[74]:  <...repeats 6 more times>
[75]: ID(0:01:0) Timeout detected on cmd[0x2a]
[76]: ID(0:01:0): Timeout detected on cmd[0x2a]
[77]:  <...repeats 2 more times>
[78]: ID(0:01:0): Timeout detected on cmd[0x28]
[79]: ID(0:02:0) Timeout detected on cmd[0x2a]
[80]: SCSI Channel[0]: Timeout Detected On 14 Command(s)
[81]: ID(0:00:0); Aborted Command [command:0x2a]
[82]: ID(0:01:0); Aborted Command [command:0x2a]
[83]: ID(0:02:0); Aborted Command [command:0x2a]
[84]: ID(0:00:0): Timeout detected on cmd[0x28]
[85]: ID(0:00:0): Timeout detected on cmd[0x2a]
[86]: ID(0:01:0): Timeout detected on cmd[0x28]
[87]:  <...repeats 1 more times>
[88]: ID(0:01:0): Timeout detected on cmd[0x2a]
[89]: ID(0:02:0): Timeout detected on cmd[0x28]
[90]: ID(0:02:0): Timeout detected on cmd[0x2a]
[91]: SCSI Channel[0]: Timeout Detected On 7 Command(s)
[92]: ID(0:00:0); Abort Timeout. Resetting Bus 0
[93]:  --- SrbFlags=8000, Nexus Flags=211B3
[94]: ==> Deferring all io on bus: 0
[95]: >> SP_Isolate releasing all drives!
[96]: [SP_Task] Bus 0: Not Started: 0, FlushFlags: 0x0
[97]:

[98]: Enclosure 0:6:0 offline
[99]:

========================
History Output Complete.




More information about the Linux-PowerEdge mailing list