> The result: the entire RAID volume/array would be _gone_.
> Is there a fail-proof way/document to recover from H/W RAID problems?

Nothing is foolproof unless you are prepared to spend an amazing amount
of money, and even then..  RAID is a method of _risk mitigation_.  You are
only reducing the risk of data loss, not eliminating it, unfortunately.
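
To put rough numbers on "reducing, not eliminating", here is a minimal
sketch.  It assumes disks fail independently with the same probability
over some period; the 3% figure is made up purely for illustration, and
the function name is mine.  Real arrays tend to do worse than this model,
since drives from the same batch, a dying controller, or a second failure
during a rebuild are not independent events.

```python
from math import comb


def p_array_loss(n_disks: int, p_disk: float, tolerated: int = 1) -> float:
    """Probability that more than `tolerated` of `n_disks` fail.

    Assumes independent failures with identical probability `p_disk`
    (an optimistic simplification).
    """
    # Probability that the number of failed disks is within tolerance.
    p_survive = sum(
        comb(n_disks, k) * p_disk**k * (1 - p_disk) ** (n_disks - k)
        for k in range(tolerated + 1)
    )
    return 1.0 - p_survive


p = 0.03  # assumed per-disk failure probability, illustrative only

# A lone disk tolerates zero failures, so its loss probability is just p.
print(f"single disk : {p_array_loss(1, p, tolerated=0):.4f}")
# A 5-disk RAID5 survives one failure but not two.
print(f"5-disk RAID5: {p_array_loss(5, p):.4f}")
```

The RAID5 figure comes out smaller than the single-disk figure, but it
is never zero, which is the whole point.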

Yes, I too have had the management spiel: "what do you mean we lost data?
it was on a RAID [chants magic mantra word], that shouldn't be possible!"

> The last technique seems to be (without hot-swap) more-or-less:
> 1. Shut the system down.
> 2. Pull the good drives out, but leave the one we think is bad.
> 3. Boot the system, and go into the BIOS to see what might be going on, if a prompt comes up for "accepting changes", say "yes".
> 4. Find the drive is bad, bring the system back down
> 5. Replace the bad drive.
> 6. Turn the system back on, and similar to #3, when asked to "accept the changes", say "yes".
> The system will rebuild the logical volume.  Assuming here, for example, that we are dealing with a RAID5 array.
> Any input would be appreciated?

I wouldn't do it that way.  Pulling good drives out of a system is more
likely to cause problems.

The first thing is to have a data storage and recovery policy.  This will
determine how you are going to back up this data so you can actually recover
when the RAID volume fails.  (Backup might be an online duplication to another
server or a tape archive.)
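
One way to gain some confidence that a backup copy is actually usable is
to compare checksums against the live data.  This is only a sketch,
assuming both trees are mounted as ordinary directories; the function
names are mine, and it does not replace a real test restore.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source: Path, backup: Path) -> list[str]:
    """Return a list of files that are missing from, or differ in, the backup."""
    problems = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source)
        dst = backup / rel
        if not dst.is_file():
            problems.append(f"missing: {rel}")
        elif sha256_of(src) != sha256_of(dst):
            problems.append(f"differs: {rel}")
    return problems
```

An empty list means every file under the source has an identical copy in
the backup.  Files that changed after the backup ran will show up as
"differs", so run it against a quiet filesystem or a snapshot.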

Once you know you have good backups, my methodology with Dell hardware RAID is:

o log a support call with Dell and get the replacement drive
o check that the replacement drive from Dell is the correct one
o identify the failed disk using afacli or megamon
o check again to make sure
o turn off power to the slot if it is an onboard disk (ref: afacli)
o remove the failed drive and replace it with the new drive
o verify the new disk is visible and the volume is rebuilding.

I have never yet had to power cycle a system to swap a failed drive with Dell kit.



