Exact afacli syntax/procedure to replace/hotswap drive (2550-Perc3/Di) ?

Paul paul at kbs.net.au
Wed Apr 7 11:18:31 CDT 2004


Hi David,

I will try to answer some of these based on my own experience.

----- Original Message ----- 
From: "David Kinnvall" <david.kinnvall at alertir.com>
To: <linux-poweredge at dell.com>
Sent: Thursday, April 08, 2004 1:18 AM
Subject: Exact afacli syntax/procedure to replace/hotswap drive
(2550-Perc3/Di) ?


> Hi, all!
>
> Summary: I need to replace a disk in one of our 2550 machines
> having a RAID-5 array controlled by a Perc3/Di, and I want to
> be absolutely positively certain that I get the steps and the
> afacli syntax correct. I beg the list for help, since after I
> have searched for and read all docs, mails, howtos, FAQs and
> what-not, I am still not quite convinced that there are no
> possible hiccups that I have missed. See below for details.
>
> Log entries from yesterday morning:
>
> Apr  6 01:46:09 db2 kernel: AAC:ID(0:02:0); Error Event [command:0x28]
> Apr  6 01:46:09 db2 kernel: AAC:ID(0:02:0); Medium Error, Block Range 35538209 : 35538224
> Apr  6 01:46:09 db2 kernel: AAC:ID(0:02:0); Read Retries Exhausted
> Apr  6 01:46:10 db2 kernel: AAC:RAID5 Container 0 Drive 0:2:0 Failure
> Apr  6 01:46:10 db2 kernel: AAC:Container 0 started REBUILD task on drive 0:3:0
> Apr  6 05:22:35 db2 kernel: AAC:Container 0 completed REBUILD task:
>
> The Error Event, the Medium Error and Read Retries Exhausted
> entries started occurring on this machine late last year, at
> first with long intervals in between, then more and more frequently
> up until now, when it finally failed. Always the same drive (0:2:0),
> but NOT the same Block Range, and always during the night, during a
> rather gruesome nightly database update that causes quite some disk
> load.
>
> afacli data (as of now):
>
> AFA0> controller details
> Executing: controller details
> Controller Information
> ----------------------
>          Remote Computer: .
>              Device Name: AFA0
>          Controller Type: PERC 3/Di
>              Access Mode: READ-WRITE
> Controller Serial Number: Last Six Digits = 1C10D2
>          Number of Buses: 1
>          Devices per Bus: 15
>           Controller CPU: i960 R series
>     Controller CPU Speed: 100 Mhz
>        Controller Memory: 126 Mbytes
>            Battery State: Ok
>
> Component Revisions
> -------------------
>                 CLI: 3.0-0 (Build #4880)
>                 API: 3.0-0 (Build #4880)
>     Miniport Driver: 3.0-0 (Build #5125)
> Controller Software: 2.6-0 (Build #3512)
>     Controller BIOS: 2.6-0 (Build #3512)
> Controller Firmware: (Build #3512)
>
> AFA0> container list
> Executing: container list
> Num          Total  Oth Chunk          Scsi   Partition
> Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size
> ----- ------ ------ --- ------ ------- ------ -------------
>  0    RAID-5 33.8GB       32KB Open    0:00:0 64.0KB:16.9GB
>  /dev/sda                              0:01:0 64.0KB:16.9GB
>                                        0:03:0 64.0KB:16.9GB
>
> Hmm: I would have expected a failed drive to still show up
> in the container listing, but as a failed drive. It looks to
> me as if it has been more or less thrown out of the container
> and is no longer considered part of it by the controller. Is
> this expected behavior?
>
> AFA0> disk list
> Executing: disk list
>
> B:ID:L  Device Type     Blocks    Bytes/Block Usage            Shared Rate
> ------  --------------  --------- ----------- ---------------- ------ ----
> 0:00:0   Disk            35566478  512         Initialized      NO     160
> 0:01:0   Disk            35566478  512         Initialized      NO     160
> 0:02:0   Disk            35566478  512         Initialized      NO     160
> 0:03:0   Disk            35566478  512         Initialized      NO     160
>
> Still there.
>
> AFA0> enclosure show slot
> Executing: enclosure show slot
>
> Enclosure
> ID (B:ID:L) Slot scsiId Insert  Status
> ----------- ---- ------ ------- ------------------------------------------
>  0  0:06:0   0   0:00:0     1   OK ACTIVATE
>  0  0:06:0   1   0:01:0     1   OK ACTIVATE
>  0  0:06:0   2   0:02:0     1   OK UNCONFIG ACTIVATE
>  0  0:06:0   3   0:03:0     1   OK HOTSPARE ACTIVATE
>
> Hmm: The above surprises me, since I did NOT expect the "failed"
> drive (0:02:0) to have a status of "OK UNCONFIG ACTIVATE", rather
> a status involving at least "... FAILED ...". The 0:03:0 is the
> hotspare that kicked in, correctly it seems.
>
> The increasing frequency of medium errors and exhausted read retries
> has prompted me to order a replacement disk anyway, and it is
> scheduled to arrive tomorrow morning, local Swedish time. When
> talking to Dell's technical support in Sweden regarding the exact
> procedure for replacing the drive I have been less than impressed,
> to put it gently... So, as I said, my hope now goes to the list:
>
> - Has the drive actually failed? Why does the status text not
>   reflect this? What does UNCONFIG in this scenario really mean?

Yes, it appears the controller tried to read the same block a number of times
and gave up. Its only course of action at that point is to mark the disk as
failed. Maybe it couldn't spin the disk properly, seek to the location, or
read a block of data. Your logs suggest it can't read a portion of the disk
correctly.
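
If you want a feel for how long this has been building up, those AAC lines
come from the kernel, so they should be in syslog. Something like the
following gives a quick count (assuming your kernel messages land in
/var/log/messages, the usual spot on Red Hat-style systems):

$ grep 'AAC:ID(0:02:0)' /var/log/messages* | wc -l
// total error events logged against that drive; drop the "| wc -l"
// and eyeball the timestamps to see the frequency increasing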

>
> - What could be the likely cause of the errors I have seen for
>   the drive during the last months? Does it motivate replacement?
>

Just disk parts showing wear and tear, one would assume.
You can do "disk show defects 0" and see how many defects it has and how
many have grown. Matt D explained to me that grown defects are errors that
have occurred on the disk since it was manufactured.
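
For example (I'm assuming the trailing argument is the SCSI ID, so 2 for
your failing 0:02:0 drive; the interpretation in the comments is a rule of
thumb, not a documented Dell limit):

AFA0> disk show defects 2
// a factory (primary) defect list is normal and nothing to worry about;
// a large or steadily growing grown-defect count means the media is
// deteriorating and the drive is a good candidate for replacement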

> - Will a replacement drive have to be coerced back into the
>   container? What about assigning hotspare/failover status to
>   the new disk or the previous one that has now kicked in? Is
>   that necessary/recommended?
>

To be honest I've never used a hotspare. When a disk fails we simply rip it
out and throw in a new one, then let the controller do the autorebuild once
the new disk spins up correctly. No action is required from us via afacli
etc...
That said, assigning a hotspare is good practice and should be done, I agree.
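
Since your old hotspare (0:03:0) is now an active member of the container,
you will probably want to make the replacement drive the new hotspare once
it's in. The command family for that is "container set failover"; the sketch
below is from memory and the exact argument format varies between afacli
builds, so run "help container set failover" on yours first:

AFA0> container set failover 0 (0,2,0)
// assign device 0:02:0 as failover/hotspare for container 0 (syntax assumed)
AFA0> enclosure show slot
// slot 2 should now show a HOTSPARE status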

> - When replacing the drive, whether actually failed or not, what
>   is the *correct* sequence of afacli commands and physical steps
>   to take, in order to get the drive replaced with NO risk to the
>   data and/or machine uptime? My best guess, as gleaned from the
>   various sources I have been able to find on my own so far, is:
>
> // begin (using // for my inline comments only)
> # afacli
> ...
> FASTCMD> open afa0
> ...
> AFA0> container list
> Executing: container list
> Num          Total  Oth Chunk          Scsi   Partition
> Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size
> ----- ------ ------ --- ------ ------- ------ -------------
>  0    RAID-5 33.8GB       32KB Open    0:00:0 64.0KB:16.9GB
>  /dev/sda                              0:01:0 64.0KB:16.9GB
>                                        0:03:0 64.0KB:16.9GB
>
> // 0:02:0 missing from the container, that's the target
>
> AFA0> disk list
> Executing: disk list
>
> B:ID:L  Device Type     Blocks    Bytes/Block Usage            Shared Rate
> ------  --------------  --------- ----------- ---------------- ------ ----
> 0:00:0   Disk            35566478  512         Initialized      NO     160
> 0:01:0   Disk            35566478  512         Initialized      NO     160
> 0:02:0   Disk            35566478  512         Initialized      NO     160
> 0:03:0   Disk            35566478  512         Initialized      NO     160
>
> // ok, it's there, so known to the controller
>
> AFA0> enclosure show slot
> Executing: enclosure show slot
>
> Enclosure
> ID (B:ID:L) Slot scsiId Insert  Status
> ----------- ---- ------ ------- ------------------------------------------
>  0  0:06:0   0   0:00:0     1   OK ACTIVATE
>  0  0:06:0   1   0:01:0     1   OK ACTIVATE
>  0  0:06:0   2   0:02:0     1   OK UNCONFIG ACTIVATE
>  0  0:06:0   3   0:03:0     1   OK HOTSPARE ACTIVATE
>
> // not currently actively working, it seems
> // so: prepare it for replacement
>
> AFA0> enclosure prepare slot 2
> AFA0> enclosure show slot 2
> // to check status after prepare
> //   what should I expect/want here?
> //   what to do if that does not happen?
> // if in wanted state, physically remove drive, and then
> AFA0> enclosure prepare slot 2
> // to force the controller to detect the now missing drive
> //   what should I expect/want here?
> //   what to do if not so?
> // if in wanted state, ping controller
> AFA0> controller rescan
> // what should happen here? new status? what, if not?
> // if in wanted state, insert replacement drive
> // what should happen here? what to do if not?
> AFA0> task list
> // to monitor progress
>
> // end
>
> Any steps missed? Unnecessary? Risky? What can go wrong? Etc...
>

Those commands look like the common sequence of steps to replace a drive and
rebuild a failed container. I have NOT run them myself, but I do recall a
similar sequence being used to rebuild a container.
The PERC controllers are pretty friendly and won't just randomly start
writing sectors and destroy data. The config of the RAID container is stored
on each disk, so it's pretty failsafe against things going wrong.

Dell PE Support should be able to assist you with afacli and with the steps
to rebuild the array.
No reboot or shutdown should be required. I've removed a disk, replaced it
with a new one and let the array auto-rebuild itself on a running Linux
system. I didn't need to log in, use afacli, change runlevels or anything.
That was on a PE2550.
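
Putting that together with your proposed sequence, the hands-off version
would look roughly like this (a sketch, not a tested transcript; note that
in your case the hotspare has already rebuilt into the container, so the
new drive will come up unconfigured rather than rejoining container 0):

# afacli
FASTCMD> open afa0
AFA0> enclosure prepare slot 2
// ready the failed drive for removal, then physically pull it,
// wait for it to spin down, and seat the replacement
AFA0> controller rescan
// make the controller notice the swap
AFA0> disk list
// the new drive should appear at 0:02:0
AFA0> task list
// if a rebuild kicks off, its progress shows up here
AFA0> enclosure show slot
// confirm the slot 2 status before assigning it as the new hotspare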

What version of Linux are you using? If it's a supported one (e.g. Red Hat),
they should be able to give you step-by-step instructions to fix the broken
container.
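
When you call them it helps to have the distribution and kernel version
handy; the standard places to look are:

$ cat /etc/redhat-release
// distribution release string (Red Hat family)
$ uname -r
// running kernel version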

Let me know if you get stuck.

> I'm really nervous about doing this, so any help/insight from
> the list would be MUCH appreciated.
>
> Best regards,
>
> David Kinnvall
>



