the 'right' way to rebuild a container

Andrew Mann amann at mythicentertainment.com
Fri Oct 15 21:12:01 CDT 2004


    Pulling the drive while hot is probably more dangerous to the drive 
than to the controller/array, but I think the drive may already be spun 
down once it's moved to a fail state anyway, and if not, 80-pin drives 
should be able to handle it decently - especially if it's a bad drive 
that you're not going to use again.  I don't see any reason a controller 
burp is more likely to destroy a container during an insert/pull than 
during normal operation.  I haven't seen any container-loss reports on 
this list attributed to a drive replacement or rebuild - all that I can 
recall were related to resizing containers.
    My personal experience with the PERC3/Di is that you're much more 
likely to have the RAID controller lock up due to conditions beyond your 
control (firmware bugs, driver bugs, automatic bad-block remapping, a 
misbehaving SCSI disk).  As for container losses, I've dealt with 3 of 
them, and all 3 were caused by an undetected error on one drive, 
followed some time later by a detected error on a second drive in the 
array.  When the array attempted to rebuild, it hit the previously 
undetected error on the first disk and could not complete the rebuild.  
I'd rank this as the most likely candidate for data loss.
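    The way to catch that failure mode early is to scan for latent 
errors before a rebuild needs every block to be readable.  A minimal 
sketch of what I mean, assuming the drives are individually visible to 
smartmontools as /dev/sdX - behind a PERC they often aren't, so treat 
the device paths as hypothetical:

#!/usr/bin/env python3
# Sketch: ask smartctl (smartmontools) for each drive's overall health
# so a latent error on one disk gets noticed before a second disk fails
# and forces a rebuild.  The device paths are hypothetical; drives
# behind a hardware RAID controller may not be reachable this way.
import subprocess

DRIVES = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]  # assumed

for drive in DRIVES:
    # "smartctl -H" prints an overall health assessment and returns a
    # non-zero exit status when the drive reports a problem.
    result = subprocess.run(["smartctl", "-H", drive],
                            capture_output=True, text=True)
    status = "OK" if result.returncode == 0 else "CHECK FAILED"
    print("%s: %s" % (drive, status))
    if result.returncode != 0:
        print(result.stdout)

Run something like that from cron and mail yourself anything that isn't 
OK.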
    As for hardware downtime over the last year: #1 by far is scheduled 
maintenance (firmware updates, fan replacements, etc.) at about two 
hours per server total; #2 is the aacraid freeze under certain load 
patterns, at less than 1 minute per server aggregate; and #3 is the 
above-mentioned two-drive failure in an array, at less than one minute 
of aggregate downtime.  These three account for all of the 
hardware-related downtime.  I'd say I've swapped around 35-40 bad drives 
in the last year.
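    To put those numbers in perspective, a quick back-of-the-envelope 
availability calculation, using the figures from the paragraph above 
(note that only about two minutes of the total was unscheduled):

# Rough availability from the downtime figures above (minutes per
# server per year).
scheduled = 120.0        # scheduled maintenance, ~2 hours
aacraid_freeze = 1.0     # aacraid freeze, aggregate
double_failure = 1.0     # two-drive failure, aggregate

minutes_per_year = 365 * 24 * 60
down = scheduled + aacraid_freeze + double_failure
print("%.4f%% uptime" % ((1 - down / minutes_per_year) * 100))
# -> 99.9768% uptime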

    Following from all that, my take is:
1) If you need absolutely 100% uptime, you need to look into 
clustering/failover systems.  A server will go down at some point no 
matter how careful you are.  More preparation can reduce the downtime 
but not eliminate it.
2) Data loss happens, even with RAID.  Make backups on a schedule that 
strikes an acceptable tradeoff between the data you will lose when you 
roll back and the time consumed making the backups.
3) Decide whether to rebuild live or offline by weighing the cost of a 
fraction of a percent increase in unscheduled downtime against the 
ease/difficulty of scheduling downtime and the cost of doing so (a 
rough sketch of that comparison follows below).
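    Here's that comparison sketched with made-up numbers - every figure 
below is an assumption you'd replace with your own costs; only the 
4-hour rebuild time comes from this thread:

# Hypothetical cost comparison for point 3: live vs. offline rebuild.
p_second_fault = 0.005      # assumed chance of a second fault during
                            # the live rebuild window
unscheduled_hours = 8.0     # assumed recovery time if that happens
cost_unscheduled = 2000.0   # assumed cost per hour of surprise downtime
cost_scheduled = 200.0      # assumed cost per hour of planned downtime
rebuild_hours = 4.0         # the 4-hour rebuild reported in this thread

expected_cost_live = p_second_fault * unscheduled_hours * cost_unscheduled
cost_offline = rebuild_hours * cost_scheduled

print("expected cost, live rebuild: $%.2f" % expected_cost_live)  # $80.00
print("cost, offline rebuild:       $%.2f" % cost_offline)        # $800.00

The point isn't the exact numbers; it's that the decision has a price on 
both sides, and the right answer depends on what your downtime costs.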

Andrew

Todd Santi wrote:

>
>You didn't mention what controller is in these machines, so I can't speak
>specifically to your environment.  Although I have, in the past, just
>swapped out drives on other Dell systems, there is a more graceful way to
>do it.  With the correct management tools installed, you can manage the
>RAID/drives from the OS.  I recall something about "preparing the slot"
>before removing a disk.  This is the preferred/safer way to do it.  You can
>also monitor the rebuild progress, set failovers, etc. from the management
>utils.  Could it be that your "2nd" guy didn't have the RAID mgmt utils
>installed on the system?  If you don't have the RAID mgmt utils installed,
>no Dell support person is probably going to tell you to just yank the bad
>drive and pop in the new one.  Bottom line, it is risky to do so.  One
>burp from the controller, and your whole container could be gone, and
>you're starting from scratch.  The 2nd scenario you described is about the
>safest way to swap out a failed drive.  Another thought: if these
>machines get a lot of traffic and pretty heavy use, performance can be
>severely degraded during the rebuild process on RAID 5.  I wouldn't want
>to hammer a machine that was attempting a rebuild, or if it was a db
>server, and risk corrupting your db.  But I'm cautious that way.
>
>If you want as much up-time as possible, then you really should have the
>RAID mgmt utils installed.  Then read the documentation; pretty much all
>you need to know should be there.  Just swapping out drives can be risky.
>Even if the risk is slight, you're still rolling the dice with production
>machines.
>
>
>Todd Santi
>Systems Administrator
>Sybex, Inc.
>
>
>
>linux-poweredge-admin at dell.com wrote on 10/15/2004 01:55:31 PM:
>
>>Errr, haven't had to do this with a Dell (Adaptec/LSI) RAID yet, but for
>>me that's mostly the point of RAID: you don't go down when you lose a
>>disk.  I'd hope you can safely rebuild on a live system....
>>
>>Greg
>>
>>On Fri, 2004-10-15 at 16:27, Glenn L. Wentworth wrote:
>>
>>>I am sure this has probably been answered before, but I don't remember
>>>seeing it, so I'll just ask it again.  Also, the answer may end a
>>>discussion we are having internally about the problem.
>>>
>>>We have just installed some new 2650s.  The systems have 4 drives set
>>>up in a RAID-5.  The systems are in disparate locations, so they are
>>>managed by different people.
>>>
>>>Two of the machines (1 in each location) lost a drive.  At one site the
>>>admin got the new drive, popped the failed drive out, put the new drive
>>>in, and walked away, letting the system run and rebuild at the same
>>>time.  That system seems OK, continues to run, and best of all was 'up'
>>>the whole time.
>>>
>>>The other group called Dell and followed the directions of a Dell tech.
>>>The process was: take the system down, replace the drive, come up into
>>>the Ctrl-A RAID manager, and rebuild the container.  Then bring the
>>>system back up.
>>>
>>>Both methods seem to work - except, of course, that the second system
>>>was off-line for some 4 hours while the container was rebuilt.
>>>
>>>If there is not a downside to hot-swapping a failed drive while the
>>>system is running, why does Dell have the support techs tell customers
>>>to rebuild the RAID array with the machine off-line?  And other than
>>>being off-line for 4 hours, are there other pros and cons to the two
>>>ways of fixing a RAID array with a failed drive?
>>>
>>>glw
>>--
>>Greg Dickie
>>just a guy
>>Maximum Throughput



