the 'right' way to rebuild a container

Andrew Mann amann at mythicentertainment.com
Fri Oct 15 17:58:00 CDT 2004


	In my experience the Perc3/Di controllers rarely (less than 1% of the 
time) bomb in some fashion during a rebuild.  I've seen this only once 
in a large number of rebuilds, and that was with older firmware on the 
controller.  In contrast, I've seen automatic bad block remapping take 
the Perc3/Di down about 3 times.  I suspect these very rare Perc3/Di 
failures, and the unexpected downtime they cause, have led Dell support 
to take the stance that scheduled downtime is preferable to potential 
unexpected downtime.  EXT3 recovered fine in all of these instances.  
For us, the difficulty and delay of scheduling downtime make it a 
choice between a very slim chance of failure doing a "hot" rebuild and 
a small chance of losing the whole system because the other drive fails 
in the time it would take to schedule downtime.
	Further, the older Perc3/Di firmware, when rebooted with 1 failed 
drive in a RAID-1, would sometimes come up and automatically degrade the 
container from a mirror to a volume, at which point the command line 
tool 'afacli' seemed to be the only way to restore it to the proper state.
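
	As a rough illustration of the trade-off described above (hot rebuild 
now vs. waiting for scheduled downtime), here is a minimal Python sketch.  
Only the "less than 1%" figure comes from this post; the annual drive 
failure rate, the scheduling delay, and the constant-failure-rate model 
are hypothetical assumptions chosen for the example, not measurements.

```python
import math


def second_drive_failure_probability(annual_failure_rate: float,
                                     delay_days: float) -> float:
    """Chance the surviving drive fails within delay_days.

    Assumes a constant failure rate over time (an assumption of this
    sketch, not something stated in the post).
    """
    # Convert the annual failure probability to a constant hazard rate per year.
    hazard_per_year = -math.log(1.0 - annual_failure_rate)
    return 1.0 - math.exp(-hazard_per_year * delay_days / 365.0)


# Upper bound from the post: a hot rebuild bombs less than 1% of the time.
HOT_REBUILD_FAILURE_BOUND = 0.01

# Hypothetical inputs, not from the post.
ASSUMED_ANNUAL_FAILURE_RATE = 0.03
ASSUMED_DELAY_DAYS = 7

p_wait = second_drive_failure_probability(ASSUMED_ANNUAL_FAILURE_RATE,
                                          ASSUMED_DELAY_DAYS)
print(f"hot rebuild failure (upper bound): {HOT_REBUILD_FAILURE_BOUND:.4%}")
print(f"second drive fails while waiting:  {p_wait:.4%}")
```

	The two numbers are not directly comparable on their own: in the cases 
above EXT3 recovered after the controller failed, while a second drive 
failure during the wait can lose the whole system, so the outcomes should 
be weighed by severity as well as likelihood.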

	Some other quirks I've seen with the Perc3/Di and rebuilding:
- Sometimes the controller won't accept any new drive you put in for 
rebuild - it'll fail the rebuild process on every one.  Power cycling 
(cold boot) seems to be required to fix this.
- More often the controller won't see a new drive that's inserted.  A 
'controller rescan' from afacli will find the drive.


	So far I haven't had problems doing some things that aren't exactly on 
the Dell list of best practices:
- Pulling a drive from one mirror and putting it into another system, 
then rebuilding the container on both systems to duplicate the data.
- Building system images on a container in one chassis, powering down, 
and then inserting the disks into another chassis (same system type, 
firmware revision, etc).
- Installing an OS while containers are scrubbing.


Andrew


Greg Dickie wrote:
> Errr haven't had to do this with a Dell (Adaptec/LSI) RAID yet but for
> me that's mostly the point of RAID, you don't go down when you lose a
> disk. I'd hope you can safely rebuild on a live system....
> 
> Greg
> 
> 
> On Fri, 2004-10-15 at 16:27, Glenn L. Wentworth wrote:
> 
>>I am sure this has probably been answered before but I don't remember
>>seeing it so I'll just ask it again.  Also the answer may end a
>>discussion we are having internally about the problem.
>>
>>We have just installed some new 2650s. The systems have 4 drives set up
>>in a raid-5.  The systems are in disparate locations so they are managed
>>by different people.  
>>
>>Two of the machines (1 in each location) lost a drive.  At one site the
>>admin got the new drive, popped the failed drive out, put the new drive
>>in and walked away letting the system run and rebuild at the same time. 
>>That system seems OK, continues to run and best of all was 'up' the
>>whole time.  
>>
>>The other group called Dell and followed the directions of a Dell tech.
>>The process was: take the system down, replace the drive, come up into
>>the ctrl-a raid manager and rebuild the container.  Then bring the
>>system back up.  
>>
>>Both methods seem to work.  Except of course the second system was
>>off-line for some 4 hours while the container was rebuilt.
>>
>>If there is no downside to hot swapping a failed drive while the
>>system is running, why does Dell have the support techs tell customers to
>>rebuild the raid array with the machine off-line?  And other than being
>>off-line for 4 hours are there other pros and cons to the two ways of
>>fixing a raid array with a failed drive?
>>
>>
>>glw

-- 
Andrew Mann
Systems Administrator
Mythic Entertainment




More information about the Linux-PowerEdge mailing list