PERC 4e/DC in 2850 - lost 1 disk, RAID5 array failed

Cody_Sparks at Dell.com
Fri Jul 7 10:37:06 CDT 2006


Fran,

That's incorrect.  Regardless of whether you have a hot spare, a RAID5
array SHOULD remain accessible with only one failed drive.  The
controller can still write data to the array.
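
To illustrate why a single missing disk doesn't have to block reads or
writes, here is a minimal Python sketch of XOR parity on one stripe.  It
is a toy model, not PERC firmware behavior: the function names are made
up for the example, parity is fixed on the last disk (real RAID5 rotates
it), and a failed disk is simply represented as None.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def make_stripe(data_blocks):
    """Return the data blocks plus one parity block (parity stored last)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def read_block(stripe, index):
    """Read a block; if its disk is missing (None), rebuild it from the rest."""
    if stripe[index] is not None:
        return stripe[index]
    survivors = [b for i, b in enumerate(stripe) if i != index]
    if any(b is None for b in survivors):
        raise IOError("two disks missing: stripe cannot be reconstructed")
    return xor_blocks(survivors)

def write_block(stripe, index, new_data):
    """Write a data block while degraded, keeping parity consistent."""
    parity_idx = len(stripe) - 1
    # Recover the logical contents of every data block, apply the write,
    # then recompute parity over the updated data.
    data = [read_block(stripe, i) for i in range(parity_idx)]
    data[index] = new_data
    parity = xor_blocks(data)
    # Only disks that are still present are physically updated.
    for i in range(parity_idx):
        if stripe[i] is not None:
            stripe[i] = data[i]
    if stripe[parity_idx] is not None:
        stripe[parity_idx] = parity

stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])
stripe[1] = None                      # simulate one failed disk
recovered = read_block(stripe, 1)     # rebuilt from the survivors
write_block(stripe, 1, b"ZZZZ")       # write aimed at the failed disk
```

In this model the write aimed at the failed disk lands in the parity
block instead, so a later read of that block still returns the new data.
A second missing disk is what makes the stripe unrecoverable.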

If your system went down and the array was inaccessible, something else
is going on beyond a single failed disk.  You could have a faulty PERC,
issues with another drive in the array, or data corruption (do you run
frequent consistency checks?).  It could also be an issue with the PERC
firmware revision (is that current?), the drive firmware on each drive,
the OS/kernel, etc.

I don't know what the cause of your issue is, but a single disk can fail
in a RAID5 and the array should just be "degraded", which means
"accessible, but no redundancy: one more disk failure will destroy the
array".  You may want to call tech support back to troubleshoot further,
but that explanation is incorrect.  The PERC doesn't shut down a RAID5
after a single disk failure to prevent data loss, and it doesn't care
whether you have a hot spare.  A hot spare just speeds up the process of
rebuilding the array back to an "optimal" (read "redundant") state.
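
Put as a small state model, the behavior I'm describing looks roughly
like the Python sketch below.  This is only an illustration of the
expected states, not how the PERC is implemented: the class and method
names are invented for the example, and "failed" is just my label for
the state after a second disk is lost.

```python
from dataclasses import dataclass

@dataclass
class Raid5Array:
    failed_disks: int = 0
    hot_spares: int = 0
    rebuilding: bool = False

    @property
    def state(self):
        if self.failed_disks == 0:
            return "optimal"      # fully redundant
        if self.failed_disks == 1:
            return "degraded"     # accessible, but no redundancy
        return "failed"           # second failure destroys the array

    @property
    def accessible(self):
        # Accessibility depends only on how many disks are lost,
        # never on whether a hot spare exists.
        return self.state != "failed"

    def fail_disk(self):
        self.failed_disks += 1
        if self.state == "degraded" and self.hot_spares > 0:
            # A hot spare only means the rebuild can start right away.
            self.hot_spares -= 1
            self.rebuilding = True
        elif self.state == "failed":
            self.rebuilding = False

    def finish_rebuild(self):
        if self.rebuilding:
            self.failed_disks -= 1
            self.rebuilding = False

no_spare = Raid5Array()
no_spare.fail_disk()      # degraded, still accessible, waiting for a disk

with_spare = Raid5Array(hot_spares=1)
with_spare.fail_disk()    # degraded, still accessible, rebuild under way
```

Both arrays end up in the same "degraded" and accessible state after one
failure.  The only difference is that the one with a hot spare is
already rebuilding.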

--Cody

-----Original Message-----
From: linux-poweredge-bounces at dell.com
[mailto:linux-poweredge-bounces at dell.com] On Behalf Of eclark
Sent: Friday, July 07, 2006 10:24 AM
To: linux-poweredge-Lists
Cc: Fran Fabrizio
Subject: Re: PERC 4e/DC in 2850 - lost 1 disk, RAID5 array failed

Because running with a hot spare is exactly the same during a failure
as running without a hot spare. As soon as a drive fails, data should
become unavailable until after a rebuild, because the data is
interleaved across all drives in the array. If it didn't degrade the
array, you could potentially write data that's bound for the dying
disk. It's not just PERC cards that degrade arrays. Lots and lots of
them do it. That's why taking a proactive stance with regard to drive
failures is what matters, not the effort taken to fix the issue after
the fact.

On Friday 07 July 2006 11:09 am, Fran Fabrizio wrote:
> I heard back from my Dell tech contact, and this is what he had to
> say:
>
> "If no Hot Spare was available, the PERC has to wait on you.  When a 
> disk in a RAID set fails, the controller looks for a Hot Spare and 
> begins rebuilding the RAID set.  If we think this through, the 
> 'in-action' on the PERC is actually protecting your data.  If it were 
> to continue processing with a failed drive and no Hot Spare you could 
> corrupt the RAID set.  With a Hot Spare you will continue to run but 
> in a degraded state while the rebuild continues."
>
> If this is truly the case, that seems disappointing.  I argue that the
> RAID5 should continue serving data whether or not a hot spare is 
> available.  If no hot spare is available and the PERC decides to shut 
> down, then I have no opportunity to even attempt to retrieve my data.
>
> I'm also not sure this is correct because I -was- able to get several 
> GB worth of data off of the array post-disk-failure, just not all of 
> the data before the host OS started getting flaky.
>
> The other hole in this argument is that it takes several hours to 
> rebuild a disk.  So why would the PERC decide it's ok to run with no 
> safety net for 10 hours if a hot spare is present, but not allow me to
> run without a safety net for say 30 minutes to save my data elsewhere?
>
> Anyone else have thoughts on this explanation?

_______________________________________________
Linux-PowerEdge mailing list
Linux-PowerEdge at dell.com
http://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq


