PERC 4e/DC in 2850 - lost 1 disk, RAID5 array failed

Greg Dickie greg at max-t.com
Fri Jul 7 10:56:54 CDT 2006



As others have pointed out, this is completely incorrect. With one
failed disk the array should be degraded: it should beep, light up, and
warn you that losing another disk means losing data, but the whole
point is that it's still working. It can still serve data by using the
parity information. Only if there were some data integrity issue beyond
the one failed drive would you want the controller to shut down. Adding
a spare and rebuilding just returns the array to a normal redundant
state; having a hot spare simply lets the controller start the rebuild
immediately. IMHO, if a support person doesn't know the correct answer,
they should ask someone and not make something up.
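
To make the parity point concrete, here is a minimal sketch of how
RAID5-style XOR parity lets a missing block be recomputed from the
surviving ones. It is only an illustration, not how the PERC firmware
is actually written: the four-disk layout, the block contents, and the
helper name xor_blocks are all made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical 4-disk RAID5 set: three data blocks plus parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# The disk holding data[1] fails. Its block is the XOR of everything
# that survived: the other data blocks and the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

assert rebuilt == data[1]
```

A degraded read does this on the fly for every stripe that touches the
failed disk; a rebuild does the same thing once per stripe and writes
the result to the replacement disk.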

Greg

On Fri, 2006-07-07 at 10:09 -0500, Fran Fabrizio wrote:
> I heard back from my Dell tech contact, and this is what he had to say:
> 
> "If no Hot Spare was available, the PERC has to wait on you.  When a 
> disk in a RAID set fails, the controller looks for a Hot Spare and 
> begins rebuilding the RAID set.  If we think this through, the 
> 'in-action' on the PERC is actually protecting your data.  If it were to 
> continue processing with a failed drive and no Hot Spare you could 
> corrupt the RAID set.  With a Hot Spare you will continue to run but in 
> a degraded state while the rebuild continues."
> 
> If this is truly the case, that seems disappointing.  I argue that the 
> RAID5 should continue serving data whether or not a hot spare is 
> available.  If no hot spare is available and the PERC decides to shut 
> down, then I have no opportunity to even attempt to retrieve my data.
> 
> I'm also not sure this is correct because I -was- able to get several GB 
> worth of data off of the array post-disk-failure, just not all of the 
> data before the host OS started getting flaky.
> 
> The other hole in this argument is that it takes several hours to 
> rebuild a disk.  So why would the PERC decide it's ok to run with no 
> safety net for 10 hours if a hot spare is present, but not allow me to 
> run without a safety net for say 30 minutes to save my data elsewhere?
> 
> Anyone else have thoughts on this explanation?
> 
-- 
Greg Dickie
just a guy
Maximum Throughput



More information about the Linux-PowerEdge mailing list