Question on how RAID works

jason andrade jason at rtfmconsult.com
Wed Oct 8 17:54:00 CDT 2003


On Wed, 8 Oct 2003, Karl Zander wrote:

> We have a PE 1650 running RH 7.3.  It has 3, 36 GB drives in RAID 5.
>
> One drive in the container went off-line.  The server crashed: kernel
> panic, I/O problems etc.  Dell's phone support engineers were very helpful
> and we got the drive back on-line and the container rebuilt.   Physically
> the drive seems to be OK.

ugh.

> My question is about RAID 5.  If one drive does fail, aren't the remaining
> two drives supposed to be able to carry on?  I realize nothing is
> perfect.  Certainly my own case shows there are times when a single drive
> failure can take down the entire server.   For my own general edification,
> how reasonable is it to expect RAID 5 to carry on with a single drive
> failure?  50-50?  80-20?   Just looking for some general guidelines so I
> set proper expectations with the powers-that-be.

your expectation is correct.  we have had a number of disk failures with RAID1
and RAID5 configs with no interruption to service - that's the design/operational
goal of going with RAID in the first place (so i imagine you must be somewhat
annoyed..)
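
for the "how can two drives carry on" part, here is a minimal sketch of the
parity idea behind RAID 5, assuming a single stripe on a 3-drive array with
two data blocks and one XOR parity block.  the function name and the sample
data are made up for illustration; a real controller works on whole stripes
in firmware and rotates parity across the drives, none of which is shown here.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# one stripe on a 3-drive array: two data blocks plus one parity block
# (illustrative data only)
d0 = b"RAID"
d1 = b"five"
parity = xor_blocks(d0, d1)

# the drive holding d1 goes off-line: rebuild its block from the two survivors
rebuilt = xor_blocks(d0, parity)
```

because XOR is its own inverse, any one missing block in the stripe can be
recomputed from the other two, which is why a single drive failure should not
stop reads or writes.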

to address this specific problem first:

o have you put the latest firmware onto your
	- system bios
	- system backplane/esm
	- scsi raid controller

o have you updated to a recent errata kernel for your OS and installed any
  other patches that may be required/recommended?

o do you have any meaningful errors in your system logs which may indicate
  a drive was going bad or some other problem?

o have you installed afacli?  if so, can you look at the number of defects
  on each drive and see if any of them have grown defects?  this would be a
  leading indicator of drive failure.

o is your system on a UPS which provides filtered power ? i have seen a number
  of instances where flaky power (sometimes a flaky power supply) causes some
  issues.
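
for the log check in the list above, a rough sketch of how you might pull
suspicious lines out of a syslog file.  the keyword list is my assumption,
not an authoritative set of messages for any particular driver or controller,
so tune it to whatever your system actually logs.

```python
import re
from typing import Iterable, List

# assumed keywords only; adjust for your controller and driver
SUSPECT = re.compile(r"i/o error|scsi|medium error|parity", re.IGNORECASE)


def suspect_lines(lines: Iterable[str], pattern=SUSPECT) -> List[str]:
    """Return the lines that match the pattern, with trailing newlines removed."""
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


# usage, assuming the default syslog file on a Red Hat system:
#     with open("/var/log/messages", errors="replace") as log:
#         for hit in suspect_lines(log):
#             print(hit)
```

anything it turns up in the days before the crash is worth lining up against
the time the drive went off-line.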

WRT RAID5 drive failure: it is very reasonable to expect the system to carry
on, and IMHO there is a bug if your system crashed as a result.  yes, even
scsi drives sometimes do bad things on the bus when going down, and that can
cause the controller/other drives problems severe enough to take down a
system.. but these should be rare/isolated instances (unlike IDE raid, where
you can expect this to be more likely, though it is mitigated if each device
is on its own bus).


regards,

-jason



