Exact afacli syntax/procedure to replace/hotswap drive (2550-Perc3/Di) ?
David Kinnvall
david.kinnvall at alertir.com
Wed Apr 7 10:19:00 CDT 2004
Hi, all!
Summary: I need to replace a disk in one of our 2550 machines,
which has a RAID-5 array controlled by a Perc3/Di, and I want to
be absolutely positively certain that I get the steps and the
afacli syntax correct. I beg the list for help, since even after
searching for and reading all the docs, mails, howtos, FAQs and
what-not I am still not convinced that there are no possible
hiccups that I have missed. See below for details.
Log entries from yesterday morning:
Apr 6 01:46:09 db2 kernel: AAC:ID(0:02:0); Error Event [command:0x28]
Apr 6 01:46:09 db2 kernel: AAC:ID(0:02:0); Medium Error, Block Range 35538209 : 35538224
Apr 6 01:46:09 db2 kernel: AAC:ID(0:02:0); Read Retries Exhausted
Apr 6 01:46:10 db2 kernel: AAC:RAID5 Container 0 Drive 0:2:0 Failure
Apr 6 01:46:10 db2 kernel: AAC:Container 0 started REBUILD task on drive 0:3:0
Apr 6 05:22:35 db2 kernel: AAC:Container 0 completed REBUILD task:
The Error Event, Medium Error and Read Retries Exhausted entries
started occurring on this machine late last year, at first at long
intervals, then more and more frequently up until now, when the
drive finally failed. It is always the same drive (0:2:0), but NOT
the same Block Range, and always at night, during a rather gruesome
nightly database update that causes quite some disk load.
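To put numbers on that history, the AAC lines can be tallied per drive straight from syslog. This is a minimal Python sketch, assuming the kernel message format is exactly as in the excerpt above; the name summarize_aac_events is made up for illustration and is not part of any Dell or Adaptec tool.

```python
import re
from collections import defaultdict

# Matches the per-device AAC event lines shown in the log excerpt, e.g.
# "... kernel: AAC:ID(0:02:0); Medium Error, Block Range 35538209 : 35538224"
AAC_EVENT = re.compile(r"AAC:ID\((\d+):(\d+):(\d+)\);\s*(.+)$")
BLOCK_RANGE = re.compile(r"Block Range (\d+) : (\d+)")


def summarize_aac_events(lines):
    """Count AAC:ID events per drive and collect the block ranges reported."""
    summary = defaultdict(lambda: {"events": 0, "block_ranges": set()})
    for line in lines:
        match = AAC_EVENT.search(line)
        if not match:
            continue  # container-level lines (Failure, REBUILD) are skipped
        bus, scsi_id, lun, message = match.groups()
        drive = f"{int(bus)}:{int(scsi_id):02d}:{int(lun)}"
        summary[drive]["events"] += 1
        block_range = BLOCK_RANGE.search(message)
        if block_range:
            start, end = (int(value) for value in block_range.groups())
            summary[drive]["block_ranges"].add((start, end))
    return dict(summary)


# Sample input: the first four log lines quoted above.
sample = [
    "Apr 6 01:46:09 db2 kernel: AAC:ID(0:02:0); Error Event [command:0x28]",
    "Apr 6 01:46:09 db2 kernel: AAC:ID(0:02:0); Medium Error, Block Range 35538209 : 35538224",
    "Apr 6 01:46:09 db2 kernel: AAC:ID(0:02:0); Read Retries Exhausted",
    "Apr 6 01:46:10 db2 kernel: AAC:RAID5 Container 0 Drive 0:2:0 Failure",
]

if __name__ == "__main__":
    for drive, info in summarize_aac_events(sample).items():
        print(drive, info["events"], sorted(info["block_ranges"]))
```

Fed the whole syslog file instead of the sample, the same function would show whether the reported block ranges cluster in one area of the disk or are spread out.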
afacli data (as of now):
AFA0> controller details
Executing: controller details
Controller Information
----------------------
Remote Computer: .
Device Name: AFA0
Controller Type: PERC 3/Di
Access Mode: READ-WRITE
Controller Serial Number: Last Six Digits = 1C10D2
Number of Buses: 1
Devices per Bus: 15
Controller CPU: i960 R series
Controller CPU Speed: 100 Mhz
Controller Memory: 126 Mbytes
Battery State: Ok
Component Revisions
-------------------
CLI: 3.0-0 (Build #4880)
API: 3.0-0 (Build #4880)
Miniport Driver: 3.0-0 (Build #5125)
Controller Software: 2.6-0 (Build #3512)
Controller BIOS: 2.6-0 (Build #3512)
Controller Firmware: (Build #3512)
AFA0> container list
Executing: container list
Num          Total  Oth Chunk          Scsi   Partition
Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size
----- ------ ------ --- ------ ------- ------ -------------
0     RAID-5 33.8GB     32KB   Open    0:00:0 64.0KB:16.9GB
/dev/sda                               0:01:0 64.0KB:16.9GB
                                       0:03:0 64.0KB:16.9GB
Hmm: I would have expected a failed drive to still show up
in the container listing, but marked as failed. It looks to
me as if it has been more or less thrown out of the container
and is no longer considered part of it by the controller. Is
this expected behavior?
AFA0> disk list
Executing: disk list
B:ID:L Device Type Blocks Bytes/Block Usage Shared Rate
------ -------------- --------- ----------- ---------------- ------ ----
0:00:0 Disk 35566478 512 Initialized NO 160
0:01:0 Disk 35566478 512 Initialized NO 160
0:02:0 Disk 35566478 512 Initialized NO 160
0:03:0 Disk 35566478 512 Initialized NO 160
Still there.
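The comparison I am doing by eye here (which disks the controller knows about versus which ones are members of a container) can also be done mechanically. This is a rough Python sketch that only pattern-matches B:ID:L values in the pasted afacli output; the function names are hypothetical, and it assumes the two-digit ID format shown above.

```python
import re

# B:ID:L values as printed by afacli above, e.g. "0:02:0".
# Partition fields such as "64.0KB:16.9GB" do not match this pattern.
SCSI_ID = re.compile(r"\b\d+:\d{2}:\d+\b")


def scsi_ids(cli_output):
    """Return the set of B:ID:L values found in a block of CLI output."""
    return set(SCSI_ID.findall(cli_output))


def disks_outside_containers(container_list_output, disk_list_output):
    """Disks listed by 'disk list' that appear in no 'container list' row."""
    return sorted(scsi_ids(disk_list_output) - scsi_ids(container_list_output))


# Sample input: the data rows of the two listings pasted above.
container_list = """
0 RAID-5 33.8GB 32KB Open 0:00:0 64.0KB:16.9GB
/dev/sda 0:01:0 64.0KB:16.9GB
0:03:0 64.0KB:16.9GB
"""

disk_list = """
0:00:0 Disk 35566478 512 Initialized NO 160
0:01:0 Disk 35566478 512 Initialized NO 160
0:02:0 Disk 35566478 512 Initialized NO 160
0:03:0 Disk 35566478 512 Initialized NO 160
"""

if __name__ == "__main__":
    print(disks_outside_containers(container_list, disk_list))
```

The sketch says nothing about why a disk is outside the container, only that it is.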
AFA0> enclosure show slot
Executing: enclosure show slot
Enclosure
ID (B:ID:L) Slot scsiId Insert  Status
----------- ---- ------ ------- ------------------------------------------
0  0:06:0   0    0:00:0 1       OK ACTIVATE
0  0:06:0   1    0:01:0 1       OK ACTIVATE
0  0:06:0   2    0:02:0 1       OK UNCONFIG ACTIVATE
0  0:06:0   3    0:03:0 1       OK HOTSPARE ACTIVATE
Hmm: The above surprises me, since I did NOT expect the "failed"
drive (0:02:0) to have a status of "OK UNCONFIG ACTIVATE", but
rather a status involving at least "... FAILED ...". 0:03:0 is the
hotspare that kicked in, correctly it seems.
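For checking the slot status before and after each step of a replacement, the rows of "enclosure show slot" can be split into fields. This is a minimal Python sketch, assuming the row layout is exactly as pasted above; parse_slots is a made-up name, and the sketch only reports the status words without interpreting them, since their exact meaning is what I am asking about.

```python
import re

# One data row of "enclosure show slot" as pasted above:
# enclosure ID, enclosure B:ID:L, slot, disk scsiId, insert count, status words
SLOT_ROW = re.compile(
    r"^\s*(\d+)\s+(\d+:\d+:\d+)\s+(\d+)\s+(\d+:\d+:\d+)\s+(\d+)\s+(.*\S)\s*$"
)


def parse_slots(cli_output):
    """Return one dict per slot row; header and separator lines are ignored."""
    slots = []
    for line in cli_output.splitlines():
        match = SLOT_ROW.match(line)
        if match:
            slots.append({
                "enclosure": match.group(2),
                "slot": int(match.group(3)),
                "scsi_id": match.group(4),
                "insert_count": int(match.group(5)),
                "flags": match.group(6).split(),
            })
    return slots


# Sample input: the listing pasted above.
enclosure_output = """
Enclosure
ID (B:ID:L) Slot scsiId Insert Status
----------- ---- ------ ------- ------------------------------------------
0 0:06:0 0 0:00:0 1 OK ACTIVATE
0 0:06:0 1 0:01:0 1 OK ACTIVATE
0 0:06:0 2 0:02:0 1 OK UNCONFIG ACTIVATE
0 0:06:0 3 0:03:0 1 OK HOTSPARE ACTIVATE
"""

if __name__ == "__main__":
    for slot in parse_slots(enclosure_output):
        print(slot["slot"], slot["scsi_id"], " ".join(slot["flags"]))
```

Saving the parsed result before a step and comparing it with the result afterwards would show exactly which slot changed and how.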
The increasing frequency of medium errors and exhausted read
retries has prompted me to order a replacement disk anyway, and it
is scheduled to arrive tomorrow morning, local Swedish time. When
talking to Dell's technical support in Sweden about the exact
procedure for replacing the drive I have been less than impressed,
to put it gently... So, as I said, my hope now rests with the list:
- Has the drive actually failed? Why does the status text not
reflect this? What does UNCONFIG really mean in this scenario?
- What is the likely cause of the errors I have seen for the
drive during the last months? Do they justify replacement?
- Will a replacement drive have to be coerced back into the
container? What about assigning hotspare/failover status to
the new disk or the previous one that has now kicked in? Is
that necessary/recommended?
- When replacing the drive, whether actually failed or not, what
is the *correct* sequence of afacli commands and physical steps
to take, in order to get the drive replaced with NO risk to the
data and/or machine uptime? My best guess, as gleaned from the
various sources I have been able to find on my own so far, is:
// begin (using // for my inline comments only)
# afacli
...
FASTCMD> open afa0
...
AFA0> container list
Executing: container list
Num          Total  Oth Chunk          Scsi   Partition
Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size
----- ------ ------ --- ------ ------- ------ -------------
0     RAID-5 33.8GB     32KB   Open    0:00:0 64.0KB:16.9GB
/dev/sda                               0:01:0 64.0KB:16.9GB
                                       0:03:0 64.0KB:16.9GB
// 0:02:0 missing from the container, that's the target
AFA0> disk list
Executing: disk list
B:ID:L Device Type Blocks Bytes/Block Usage Shared Rate
------ -------------- --------- ----------- ---------------- ------ ----
0:00:0 Disk 35566478 512 Initialized NO 160
0:01:0 Disk 35566478 512 Initialized NO 160
0:02:0 Disk 35566478 512 Initialized NO 160
0:03:0 Disk 35566478 512 Initialized NO 160
// ok, it's there, so known to the controller
AFA0> enclosure show slot
Executing: enclosure show slot
Enclosure
ID (B:ID:L) Slot scsiId Insert  Status
----------- ---- ------ ------- ------------------------------------------
0  0:06:0   0    0:00:0 1       OK ACTIVATE
0  0:06:0   1    0:01:0 1       OK ACTIVATE
0  0:06:0   2    0:02:0 1       OK UNCONFIG ACTIVATE
0  0:06:0   3    0:03:0 1       OK HOTSPARE ACTIVATE
// not currently actively working, it seems
// so: prepare it for replacement
AFA0> enclosure prepare slot 2
AFA0> enclosure show slot 2
// to check status after prepare
// what should I expect/want here?
// what to do if that does not happen?
// if in wanted state, physically remove drive, and then
AFA0> enclosure prepare slot 2
// to force the controller to detect the now missing drive
// what should I expect/want here?
// what to do if not so?
// if in wanted state, ping controller
AFA0> controller rescan
// what should happen here? new status? what, if not?
// if in wanted state, insert replacement drive
// what should happen here? what to do if not?
AFA0> task list
// to monitor progress
// end
Any steps missed? Unnecessary? Risky? What can go wrong? Etc...
I'm really nervous about doing this, so any help/insight from
the list would be MUCH appreciated.
Best regards,
David Kinnvall
More information about the Linux-PowerEdge
mailing list