PERC 4e/DC in 2850 - lost 1 disk, RAID5 array failed

Fran Fabrizio fran at cis.uab.edu
Thu Jul 6 22:06:08 CDT 2006


I had a disturbing experience with a RAID5 array on a PERC 4e/DC in a 
2850 today.  I was sitting in my office when I heard the PERC's alarm go 
off, so I went into the server room to discover that one of the 5 drives 
in the array was blinking amber.  This server runs VMware ESX Server, and 
I have an identical one as well, so I calmly went about shutting down 
the virtual machines one by one and copying their disk files over to the 
other host, figuring the degraded array would at least keep serving data 
in the meantime.

The first couple of virtual machine disk files went fine, but when I got 
to the third and fourth ones, it would not let me copy them, reporting 
Device or Resource Busy.  The virtual machines corresponding to those 
disk files were completely dead - could not access any services on them, 
could not log in, and pulling up the console showed a blank screen.

Then the filesystem on the VMware host itself started acting up, with 
commands hanging midway through.

In short, the RAID5 did not work as advertised.  My understanding is 
that it should survive one disk failing and continue to serve data from 
this degraded state.  In fact, this is one of the major reasons I chose 
RAID5.  Am I misunderstanding something here, or did my PERC 4e/DC 
completely fail to do its job?
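
The expectation above, that a RAID5 array keeps serving data with one 
disk gone, rests on XOR parity: each stripe stores one parity block, and 
any single missing block is the XOR of the blocks that survive.  The 
following is only a minimal sketch of that idea, not how the PERC 
firmware works.  The block contents, the xor_blocks and read_degraded 
names, and the fixed parity position are all assumptions made for 
illustration (real RAID5 rotates parity across the disks).

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical 5-disk set: 4 data blocks plus 1 parity block.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)
stripe = data + [parity]


def read_degraded(stripe, failed_index):
    """Rebuild the block of the failed disk from the four surviving blocks."""
    survivors = [blk for i, blk in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


# Pretend disk 2 has failed; its block is recomputed from the others.
recovered = read_degraded(stripe, 2)
```

The sketch also shows why a degraded array is fragile: rebuilding any 
block needs every surviving disk to return its block, so a second 
unreadable block in the same stripe leaves nothing to reconstruct from.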

I eventually had to hard reboot the server, and upon reboot, the PERC 
complained that one disk had failed and that the array was in a degraded 
state.  Since it did not want to serve up the data, I'm now trying to 
rebuild that disk from the BIOS, but I thought all of this could be done 
online, while still serving data, and not having 12 hours of downtime 
while the disk rebuilds!

Am I not understanding this, or did this PERC completely fail?

Thanks,
Fran


