Exact afacli syntax/procedure to replace/hotswap drive (2550-Perc3/Di) ?

David Kinnvall david.kinnvall at alertir.com
Wed Apr 7 10:19:00 CDT 2004


Hi, all!

Summary: I need to replace a disk in one of our 2550 machines
having a RAID-5 array controlled by a Perc3/Di, and I want to
be absolutely positively certain that I get the steps and the
afacli syntax correct. I beg the list for help, since after
searching for and reading all the docs, mails, howtos, FAQs and
what-not I am still not quite convinced that there are no
possible hiccups that I have missed. See below for details.

Log entries from yesterday morning:

Apr  6 01:46:09 db2 kernel: AAC:ID(0:02:0); Error Event [command:0x28]
Apr  6 01:46:09 db2 kernel: AAC:ID(0:02:0); Medium Error, Block Range 35538209 : 35538224
Apr  6 01:46:09 db2 kernel: AAC:ID(0:02:0); Read Retries Exhausted
Apr  6 01:46:10 db2 kernel: AAC:RAID5 Container 0 Drive 0:2:0 Failure
Apr  6 01:46:10 db2 kernel: AAC:Container 0 started REBUILD task on drive 0:3:0
Apr  6 05:22:35 db2 kernel: AAC:Container 0 completed REBUILD task:

The Error Event, Medium Error and Read Retries Exhausted
entries started occurring on this machine late last year, at
first with long intervals in between, then more and more often
up until now, when the drive finally failed. It was always the
same drive (0:2:0), but NOT the same Block Range, and always at
night, during a rather gruesome nightly database update that
causes quite some disk load.
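
To check the "same drive, different block range" pattern over several months of logs, a small script can tally the Medium Error entries per drive. This is only a sketch: the regular expression is modelled on the single log excerpt above and is an assumption about the general message format, and the function name is made up for illustration.

```python
import re
from collections import defaultdict

# Sample lines copied from the log excerpt above.
LOG = """\
Apr  6 01:46:09 db2 kernel: AAC:ID(0:02:0); Error Event [command:0x28]
Apr  6 01:46:09 db2 kernel: AAC:ID(0:02:0); Medium Error, Block Range 35538209 : 35538224
Apr  6 01:46:09 db2 kernel: AAC:ID(0:02:0); Read Retries Exhausted
Apr  6 01:46:10 db2 kernel: AAC:RAID5 Container 0 Drive 0:2:0 Failure
"""

# Assumed message layout, taken from the one example line above.
MEDIUM_ERROR = re.compile(
    r"AAC:ID\((\d+):(\d+):(\d+)\); Medium Error, Block Range (\d+) : (\d+)"
)


def medium_errors_by_drive(text):
    """Return {(bus, id, lun): [(first_block, last_block), ...]}."""
    found = defaultdict(list)
    for line in text.splitlines():
        m = MEDIUM_ERROR.search(line)
        if m:
            bus, scsi_id, lun, first, last = map(int, m.groups())
            found[(bus, scsi_id, lun)].append((first, last))
    return dict(found)


errors = medium_errors_by_drive(LOG)
for drive, ranges in sorted(errors.items()):
    print(drive, len(ranges), ranges)
```

Feeding it the full /var/log/messages history instead of the sample would show whether any drive other than 0:2:0 ever reported a medium error.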

afacli data (as of now):

AFA0> controller details
Executing: controller details
Controller Information
----------------------
         Remote Computer: .
             Device Name: AFA0
         Controller Type: PERC 3/Di
             Access Mode: READ-WRITE
Controller Serial Number: Last Six Digits = 1C10D2
         Number of Buses: 1
         Devices per Bus: 15
          Controller CPU: i960 R series
    Controller CPU Speed: 100 Mhz
       Controller Memory: 126 Mbytes
           Battery State: Ok

Component Revisions
-------------------
                CLI: 3.0-0 (Build #4880)
                API: 3.0-0 (Build #4880)
    Miniport Driver: 3.0-0 (Build #5125)
Controller Software: 2.6-0 (Build #3512)
    Controller BIOS: 2.6-0 (Build #3512)
Controller Firmware: (Build #3512)

AFA0> container list
Executing: container list
Num          Total  Oth Chunk          Scsi   Partition
Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size
----- ------ ------ --- ------ ------- ------ -------------
 0    RAID-5 33.8GB       32KB Open    0:00:0 64.0KB:16.9GB 
 /dev/sda                              0:01:0 64.0KB:16.9GB 
                                       0:03:0 64.0KB:16.9GB 

Hmm: I would have expected a failed drive to still show up
in the container listing, but as a failed drive. It looks to
me as if it has been more or less thrown out of the container
and is no longer considered part of it by the controller. Is
this expected behavior?

AFA0> disk list
Executing: disk list

B:ID:L  Device Type     Blocks    Bytes/Block Usage            Shared Rate
------  --------------  --------- ----------- ---------------- ------ ----
0:00:0   Disk            35566478  512         Initialized      NO     160 
0:01:0   Disk            35566478  512         Initialized      NO     160 
0:02:0   Disk            35566478  512         Initialized      NO     160 
0:03:0   Disk            35566478  512         Initialized      NO     160 

Still there.
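
The comparison I am doing by eye here (which disks does the controller see, and which of them belong to the container) can be sketched as below. It is a rough illustration only: the two patterns are guesses based on the exact output pasted above, not on any documented afacli output format.

```python
import re

# Data rows copied from the "container list" output above.
CONTAINER_LIST = """\
 0    RAID-5 33.8GB       32KB Open    0:00:0 64.0KB:16.9GB
 /dev/sda                              0:01:0 64.0KB:16.9GB
                                       0:03:0 64.0KB:16.9GB
"""

# Data rows copied from the "disk list" output above.
DISK_LIST = """\
0:00:0   Disk            35566478  512         Initialized      NO     160
0:01:0   Disk            35566478  512         Initialized      NO     160
0:02:0   Disk            35566478  512         Initialized      NO     160
0:03:0   Disk            35566478  512         Initialized      NO     160
"""

# A container member is a B:ID:L followed by an Offset:Size pair.
MEMBER = re.compile(r"\b(\d+:\d+:\d+)\s+\S+KB:\S+GB")
# A disk row starts with B:ID:L, then "Disk", block count, bytes per block.
DISK = re.compile(r"^(\d+:\d+:\d+)\s+Disk\s+(\d+)\s+(\d+)", re.MULTILINE)

members = set(MEMBER.findall(CONTAINER_LIST))
disks = {addr: int(blocks) * int(size)
         for addr, blocks, size in DISK.findall(DISK_LIST)}

# Disks the controller knows about that are in no container.
unassigned = sorted(set(disks) - members)
print("not in any container:", unassigned)

# Raw capacity per disk in GiB, for comparison with the 16.9GB partitions.
for addr, nbytes in sorted(disks.items()):
    print(addr, round(nbytes / 2**30, 2), "GiB")
```

With the output above, the only disk outside the container is 0:02:0, which matches what I see by hand.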

AFA0> enclosure show slot
Executing: enclosure show slot

Enclosure
ID (B:ID:L) Slot scsiId Insert  Status
----------- ---- ------ ------- ------------------------------------------
 0  0:06:0   0   0:00:0     1   OK ACTIVATE 
 0  0:06:0   1   0:01:0     1   OK ACTIVATE 
 0  0:06:0   2   0:02:0     1   OK UNCONFIG ACTIVATE 
 0  0:06:0   3   0:03:0     1   OK HOTSPARE ACTIVATE 

Hmm: The above surprises me, since I did NOT expect the "failed"
drive (0:02:0) to have a status of "OK UNCONFIG ACTIVATE", but
rather a status involving at least "... FAILED ...". The 0:03:0
drive is the hotspare that kicked in, correctly it seems.
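
For anyone who wants to check the slot table mechanically, here is a small sketch that splits each row into its columns and status flags. It assumes the column layout shown in the header above (enclosure number, enclosure B:ID:L, slot, drive scsiId, insert count, status) and makes no claim about what the individual flags mean; it only reports them.

```python
import re

# Data rows copied from the "enclosure show slot" output above.
SLOT_TABLE = """\
 0  0:06:0   0   0:00:0     1   OK ACTIVATE
 0  0:06:0   1   0:01:0     1   OK ACTIVATE
 0  0:06:0   2   0:02:0     1   OK UNCONFIG ACTIVATE
 0  0:06:0   3   0:03:0     1   OK HOTSPARE ACTIVATE
"""

# Columns: enclosure no., enclosure B:ID:L, slot, drive scsiId, insert, status.
ROW = re.compile(
    r"^(\d+)\s+(\d+:\d+:\d+)\s+(\d+)\s+(\d+:\d+:\d+)\s+(\d+)\s+(.+)$"
)


def parse_slots(text):
    """Return {slot_number: {"enclosure", "scsi_id", "flags"}}."""
    slots = {}
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue  # header, separator or blank line
        _, enclosure, slot, scsi_id, _, status = m.groups()
        slots[int(slot)] = {
            "enclosure": enclosure,
            "scsi_id": scsi_id,
            "flags": status.split(),
        }
    return slots


slots = parse_slots(SLOT_TABLE)
for number, info in sorted(slots.items()):
    # Report any slot whose status is more than the plain "OK ACTIVATE".
    extra = [f for f in info["flags"] if f not in ("OK", "ACTIVATE")]
    if extra:
        print("slot", number, info["scsi_id"], extra)
```

Note that the second column is the enclosure's own address and is the same on every row; the drive address is the scsiId column.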

The increasing frequency of medium errors and exhausted read
retries has prompted me to order a replacement disk anyway, and
it is scheduled to arrive tomorrow morning, local Swedish time.
When I talked to Dell's technical support in Sweden about the
exact procedure for replacing the drive I was less than impressed,
to put it gently... So, as I said, my hope now goes to the list:

- Has the drive actually failed? Why does the status text not
  reflect this? What does UNCONFIG in this scenario really mean?

- What could be the likely cause of the errors I have seen for
  the drive during the last months? Do they justify replacement?

- Will a replacement drive have to be coerced back into the
  container? What about assigning hotspare/failover status to
  the new disk, or to the previous hotspare that has now kicked
  in? Is that necessary/recommended?

- When replacing the drive, whether actually failed or not, what
  is the *correct* sequence of afacli commands and physical steps
  to take, in order to get the drive replaced with NO risk to the
  data and/or machine uptime? My best guess, as gleaned from the
  various sources I have been able to find on my own so far, is:

// begin (using // for my inline comments only)
# afacli
...
FASTCMD> open afa0
...
AFA0> container list
Executing: container list
Num          Total  Oth Chunk          Scsi   Partition
Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size
----- ------ ------ --- ------ ------- ------ -------------
 0    RAID-5 33.8GB       32KB Open    0:00:0 64.0KB:16.9GB 
 /dev/sda                              0:01:0 64.0KB:16.9GB 
                                       0:03:0 64.0KB:16.9GB 

// 0:02:0 missing from the container, that's the target

AFA0> disk list
Executing: disk list

B:ID:L  Device Type     Blocks    Bytes/Block Usage            Shared Rate
------  --------------  --------- ----------- ---------------- ------ ----
0:00:0   Disk            35566478  512         Initialized      NO     160 
0:01:0   Disk            35566478  512         Initialized      NO     160 
0:02:0   Disk            35566478  512         Initialized      NO     160 
0:03:0   Disk            35566478  512         Initialized      NO     160 

// ok, it's there, so known to the controller

AFA0> enclosure show slot
Executing: enclosure show slot

Enclosure
ID (B:ID:L) Slot scsiId Insert  Status
----------- ---- ------ ------- ------------------------------------------
 0  0:06:0   0   0:00:0     1   OK ACTIVATE 
 0  0:06:0   1   0:01:0     1   OK ACTIVATE 
 0  0:06:0   2   0:02:0     1   OK UNCONFIG ACTIVATE 
 0  0:06:0   3   0:03:0     1   OK HOTSPARE ACTIVATE 

// not currently actively working, it seems
// so: prepare it for replacement

AFA0> enclosure prepare slot 2
AFA0> enclosure show slot 2
// to check status after prepare
//   what should I expect/want here?
//   what to do if that does not happen?
// if in wanted state, physically remove drive, and then
AFA0> enclosure prepare slot 2
// to force the controller to detect the now missing drive
//   what should I expect/want here?
//   what to do if not so?
// if in wanted state, ping controller
AFA0> controller rescan
// what should happen here? new status? what, if not?
// if in wanted state, insert replacement drive
// what should happen here? what to do if not?
AFA0> task list
// to monitor progress

// end

Any steps missed? Unnecessary? Risky? What can go wrong? Etc...

I'm really nervous about doing this, so any help/insight from
the list would be MUCH appreciated.

Best regards,

David Kinnvall




More information about the Linux-PowerEdge mailing list