the 'right' way to rebuild a container

Steve_Boley@Dell.com Steve_Boley at Dell.com
Sat Oct 23 07:51:00 CDT 2004


Dell(tm) PowerEdge(tm) Expandable RAID Controller 2, 2/si, 3/si, and
3/di Drive Rebuild Guide

HOW TO REBUILD A FAILED DRIVE WITH PERC2, 2/si, 3/si, and 3/di RAID
CONTROLLERS.

The first command to run before rebuilding any failed drive is the
container list command. A failed drive will show either as a missing
member or with an exclamation mark next to its drive id. All SCSI ids
use the syntax (bus[channel]:scsi id:lun[always zero]). The end state
a drive must be in before it can rebuild is MISSING MEMBER (remember
this).
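As a rough illustration of the id syntax, here is a small Python sketch
that splits a bus:id:lun string and formats it as the (bus,id,lun)
triplet used in the commands further down. The function names are
hypothetical and are not part of the controller CLI; the input string
"0:3:0" is only an assumed example of how an id might be written.

```python
# Hypothetical helpers for illustration only; not part of the controller CLI.

def parse_scsi_id(text):
    """Split a 'bus:id:lun' string such as '0:3:0' into three integers."""
    bus, scsi_id, lun = (int(part) for part in text.strip().split(":"))
    if lun != 0:
        # The guide states that the lun is always zero.
        raise ValueError("lun is expected to be zero")
    return bus, scsi_id, lun


def as_cli_triplet(bus, scsi_id, lun=0):
    """Format the id as the (bus,id,lun) triplet the commands take."""
    return f"({bus},{scsi_id},{lun})"


print(as_cli_triplet(*parse_scsi_id("0:3:0")))  # prints (0,3,0)
```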

Original Drive Rebuilding

a. Quicker method (but more difficult):

If the drive is already a missing member, skip step 1. If instead there
is an exclamation mark next to the drive's SCSI id:

1. Use disk remove dead_partitions (bus, id, lun) with the id of the
drive identified during the container list command. NEVER pull a drive
that is not showing as a MISSING MEMBER. Further down I give the
instructions for preparing and removing a drive that is in an array and
not failed or missing (i.e. SMART ALERTS). Even if the drive has failed,
and whether or not the container shows as degraded or critical, the
drive is still a member of the container. If it is PULLED or
INITIALIZED it will DROP the CONTAINER and DATA LOSS will occur.

2. Next, run controller rescan and then do another container list.

3. If the drive is not showing as a missing member, repeat the
controller rescan.

4. Next, run container set failover x (bb,ii,ll), where x is the
container number found in container list and (bb,ii,ll) is the bus, id,
and lun of the drive. If the drive is part of more than one container,
start with the lowest-numbered container and proceed numerically up
through all the containers the drive is part of.

For example, for SCSI id 3 in container 0: container set failover 0 (0,3,0)

You should hear the drive array being hammered. The command to check
the status of the rebuild is task list.
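The ordering rule in step 4 can be sketched in Python as follows. This
only builds the command strings in lowest-container-first order; it does
not talk to the controller, and the function name is hypothetical.

```python
# Hypothetical helper for illustration only; it builds command text and
# does not run anything against the controller.

def failover_commands(container_numbers, bus, scsi_id, lun=0):
    """Return 'container set failover' commands, lowest container first."""
    target = f"({bus},{scsi_id},{lun})"
    return [
        f"container set failover {number} {target}"
        for number in sorted(set(container_numbers))
    ]


# A drive at SCSI id 3 on bus 0 that belongs to containers 2, 0 and 1.
for command in failover_commands([2, 0, 1], bus=0, scsi_id=3):
    print(command)
```

For a drive in a single container this reduces to the one-line example
given in the guide.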

b. Easier method but requires reboot:

You must have 2.X firmware on the controller; 2.5 or higher is
preferred. Do NOT flash firmware on the controller while a drive is
failed!

To check whether the autorebuild feature of the controller is enabled,
run 'controller show automatic_failover'. If it is disabled, run
'controller set automatic_failover' to turn autorebuild on.

1. Run container list and identify which drive has failed (it is marked
with an exclamation mark).

2. Reboot the system and, while it is going through POST again and
before the RAID controller initializes, pull the failed drive.

3. The drive will then come up as a missing member when the RAID
controller initializes. Once the system has booted, insert the drive;
the autorebuild will kick in, reinitialize the drive, and start the
rebuild. The autorebuild only works when the drive is in missing member
status.

4. task list -- will give status of rebuild

New Drive Rebuilding

1. Follow steps 1 through 3 of section a until the drive shows as a
missing member. Alternatively, follow the procedure in section b, which
is the easier solution.

2. Once the old drive shows as a missing member, insert the new drive
into the system. The RAID controller should scan the bus and spin up
the drive automatically, and the autorebuild function will kick in on
the controller.

3. Use the task list command to monitor the rebuild progress.
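The rule running through these procedures (a drive must reach MISSING
MEMBER before it is replaced) can be summarized with a small Python
sketch. The state labels and the function name are hypothetical; the
controller does not report these exact strings.

```python
# Hypothetical state labels for illustration only.
MISSING = "missing member"
FLAGGED = "exclamation mark"
NOT_FAILED = "not failed"


def next_step(state):
    """Map a drive's state to the procedure the guide describes for it."""
    if state == MISSING:
        return "insert the replacement drive, then monitor with task list"
    if state == FLAGGED:
        return ("get the drive to missing member first (section a steps "
                "1-3, or section b); do not hot-pull it")
    if state == NOT_FAILED:
        return "use the enclosure prepare slot procedure before removal"
    raise ValueError(f"unknown state: {state!r}")


print(next_step(FLAGGED))
```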

Non-Failed Drive (SMART Alert)

1. container list to see what drive is failed with exclamation mark.

2. enclosure show slot -- to show slot versus scsi id

3. enclosure prepare slot X (x is number of slot)

4. enclosure show slot X again to see if slot is deactivated

5. Remove the drive and run enclosure prepare slot again to reactivate
the slot; the drive will then show as a missing member.

6. controller rescan

7. Insert the new drive; it should initialize automatically and start
the rebuild.

8. task list to follow progress of rebuild
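The eight steps above can be laid out as an ordered checklist in Python.
This is only a sketch: the function name is hypothetical, manual actions
are shown in parentheses, and passing the slot number to the second
enclosure prepare slot is an assumption, since the guide only says to
run it again.

```python
# Hypothetical checklist builder for illustration only; it prints text
# and does not run anything against the controller.

def smart_alert_steps(slot):
    """Return the SMART-alert replacement steps in order for one slot."""
    return [
        "container list",
        "enclosure show slot",
        f"enclosure prepare slot {slot}",
        f"enclosure show slot {slot}",
        "(remove the drive by hand)",
        f"enclosure prepare slot {slot}",  # assumed to take the slot again
        "controller rescan",
        "(insert the new drive by hand)",
        "task list",
    ]


for number, step in enumerate(smart_alert_steps(4), start=1):
    print(f"{number}. {step}")
```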





Steve Boley
Advanced SCSI Solutions Team
Dell Incorporated

 

-----Original Message-----
From: linux-poweredge-admin-Lists On Behalf Of Jeff Potter
Sent: Friday, October 22, 2004 4:44 PM
To: linux-poweredge-Lists; Andrew Mann
Subject: Re: the 'right' way to rebuild a container


>    Pulling the drive while hot is probably more dangerous to the drive
> than the controller/array, but I think the drive might be spun down 
> when it's moved to a fail state anyway, and if not, 80 pin drives 
> should be able to handle it decently - especially if it's a bad drive 
> that you're not going to use again.  I don't see any reason a 
> controller burp is more likely to destroy a container during an 
> insert/pull than during normal operation.  I haven't seen any 
> container loss reports on this list that have been attributed to a 
> drive replacement or rebuild - all that I can recall seem to be 
> related to resizing containers.

I actually lost a raid-5 container during a rebuild on a perc 3/di last
summer -- issued the prepare slot, waited, yanked the failed drive,
popped a new one in, it started rebuilding automatically, and then... 
then, well... the new drive itself failed during the rebuild, which caused
the raid controller to mark the entire container invalid. (All the data,
of course, was still sitting on the other drives, so it should have
still worked!) It was a very, very painful day!

I would strongly recommend popping your replacement drives in for
rebuilds at least during low usage periods so that in the event
something does go wrong-- even if it's less than 1% -- it's not at the
worst possible time. I expect that a lot of the Dell techs have gotten
very wary of suggesting a hot rebuild simply because of the pain it
causes if the container is lost.

best,
Jeff

_______________________________________________
Linux-PowerEdge mailing list
Linux-PowerEdge at dell.com
http://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq




