Question on how RAID works
jason at rtfmconsult.com
Wed Oct 8 17:54:00 CDT 2003
On Wed, 8 Oct 2003, Karl Zander wrote:
> We have a PE 1650 running RH 7.3. It has 3, 36 GB drives in RAID 5.
> One drive in the container went off-line. The server crashed: kernel
> panic, I/O problems etc. Dell's phone support engineers were very helpful
> and we got the drive back on-line and the container rebuilt. Physically
> the drive seems to be OK.
> My question is about RAID 5. If one drive does fail, aren't the remaining
> two drives supposed to be able to carry on? I realize nothing is
> perfect. Certainly my own case shows there are times when a single drive
> failure can take down the entire server. For my own general edification,
> how reasonable is it to expect RAID 5 to carry on with a single drive
> failure? 50-50? 80-20? Just looking for some general guidelines so I
> set proper expectations with the powers-that-be.
your expectation is correct. we have had a number of disk failures with RAID1
and RAID5 configs with no interruption to service - that's the design/operational
goal of going with RAID in the first place (so i imagine you must be somewhat
disappointed). to address this specific problem first:
o have you put the latest firmware onto your
- system bios
- system backplane/esm
- scsi raid controller
o have you updated to a recent errata kernel for your OS and installed any
other patches that may be required/recommended.
o do you have any meaningful errors in your system logs which may indicate
a drive was going bad or some other problem
o have you installed afacli - can you look at things like the number of defects
on each drive and see if any of them have grown defects. this would be a
leading indicator of drive failure
o is your system on a UPS which provides filtered power? i have seen a number
  of instances where flaky power (sometimes a flaky power supply) causes some
  drives to drop off-line.
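on the log-checking point above, a quick way to pre-screen the system logs is to scan for the kinds of SCSI messages that tend to precede a drive dropping off-line. a minimal sketch (the patterns and sample lines here are illustrative assumptions, not an exhaustive list; in practice you would feed it the contents of /var/log/messages):

```python
import re

# Message fragments that often show up before a SCSI drive drops
# off-line (illustrative, not exhaustive).
SUSPECT = re.compile(
    r"(scsi.*(error|timeout|reset)|I/O error|medium error|aacraid)",
    re.IGNORECASE,
)

def suspect_lines(lines):
    """Return log lines that look like early signs of drive trouble."""
    return [line for line in lines if SUSPECT.search(line)]

# Hypothetical sample; in practice read open("/var/log/messages").
sample = [
    "Oct  8 03:12:01 pe1650 kernel: scsi0: Medium Error on sdb",
    "Oct  8 03:12:05 pe1650 kernel: eth0: link up",
    "Oct  8 03:13:40 pe1650 kernel: SCSI bus reset detected",
]
for line in suspect_lines(sample):
    print(line)
```

anything this flags in the days before the crash is worth correlating with the controller's own event log.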
WRT RAID5 drive failure it is very reasonable to expect the system to carry
on and IMHO there is a bug if your system has crashed as a result. yes, even
scsi drives sometimes do bad things on the bus when going down that cause
the controller/other drives problems which can be severe enough to take down
a system, but these should be rare/isolated instances (unlike IDE raid where
you can expect this to be more likely though mitigated if each device is on
its own bus).
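the reason two of three drives can carry on is that RAID5 stores, per stripe, a parity block equal to the XOR of the data blocks, so any one missing block can be rebuilt from the survivors. a toy sketch (the block contents are made up):

```python
# Toy RAID5 stripe with 3 drives: two data blocks plus a parity
# block that is their byte-wise XOR.
def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

data0 = b"hello wo"              # block on drive 0
data1 = b"rld!!!!!"              # block on drive 1
parity = xor_blocks(data0, data1)  # block on drive 2

# Drive 1 fails: its block is recovered by XOR-ing the survivors,
# since data0 ^ (data0 ^ data1) == data1.
rebuilt = xor_blocks(data0, parity)
print(rebuilt == data1)  # True
```

the same property is why performance drops while degraded: every read of the failed drive's blocks costs a read of all surviving drives plus the XOR.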
More information about the Linux-PowerEdge mailing list