Preventing I/O starvation on MD1000s triggered by a failed disk.

Bond Masuda bond.masuda at jlbond.com
Mon Aug 23 22:46:03 CDT 2010


Do you know whether you hit a URE (unrecoverable read error)? I've been
running into UREs on large RAID-5 arrays built from 500GB, 750GB, and
1TB drives. Basically, RAID-5 may survive a single disk failure, but the
rebuild then fails when it hits a URE on one of the surviving disks. In
that situation you can usually force the disk that reported the URE back
"online" and still read data in the degraded state, as long as you avoid
the block that has the URE. If UREs are the issue, then you should start
considering RAID-6.
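
To put a rough number on the rebuild risk, here is a back-of-the-envelope
sketch. It assumes a URE rate of 1 per 10^14 bits read, a figure often
quoted for desktop-class SATA drives; that rate is my assumption and not
something taken from your drives' datasheet, so substitute the real value.
It also treats bit errors as independent, which is a simplification. The
function name is made up for illustration.

```python
import math

def p_ure_during_rebuild(surviving_disks, disk_bytes, ure_per_bit=1e-14):
    """Probability of at least one URE while reading every surviving disk.

    ure_per_bit is an ASSUMED spec value (1 error per 1e14 bits read);
    check the datasheet for the actual drives.
    """
    bits_read = surviving_disks * disk_bytes * 8
    # 1 - (1 - p)^n, computed with log1p/expm1 to avoid rounding loss
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))

if __name__ == "__main__":
    # A 5-disk RAID-5 of 1TB drives: the rebuild reads the 4 survivors.
    p = p_ure_during_rebuild(surviving_disks=4, disk_bytes=1e12)
    print("P(URE during rebuild) ~ %.0f%%" % (p * 100))
```

Under those assumptions a 5-disk group of 1TB drives has roughly a
one-in-four chance of hitting a URE during the rebuild, which is why
RAID-6's second parity starts to look attractive at these drive sizes.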

Were there any kernel messages when the I/O stopped? Have you dumped the
log from the controller? If not, that's where I would start looking.
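
For the kernel side, a small filter along these lines can pull the
interesting lines out of the syslog. This is only a sketch: the patterns
are my guesses at what might show up, not a list of what the megaraid
driver actually prints, and the function name is made up. Adjust both to
your system.

```python
import re

# ASSUMED patterns; tune these to what your kernel and driver really log.
PATTERNS = [r"I/O error", r"megaraid", r"medium error", r"remount.*read-only"]

def suspicious_lines(lines, patterns=PATTERNS):
    """Return the log lines that match any of the given regex patterns."""
    rx = re.compile("|".join(patterns), re.IGNORECASE)
    return [ln.rstrip("\n") for ln in lines if rx.search(ln)]

if __name__ == "__main__":
    # /var/log/messages is the default syslog file on RHEL 5.
    with open("/var/log/messages") as f:
        for line in suspicious_lines(f):
            print(line)
```

Anything it turns up around the time the NFS requests dropped to zero
is worth comparing against the controller's own event log.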

-Bond

On Tue, 2010-08-24 at 12:23 +1000, Jeff Ewing wrote:
> I had a 1TB SATA disk fail on an NFS server running RHEL5.2. A rebuild onto a global hot spare was triggered. One hour later, when the rebuild was 42% complete, serviced NFS requests dropped from 20 per second to zero. CPU2 went to 100% utilization, in an I/O wait state. Soon after, the internal drives became read only and the server needed to be power reset through the DRAC (server was not configured to take crash dumps).
> 
> This hardware configuration had been in production and stable for many months.
> 
> How could this be prevented in future?
> 
> ============================================================
> Server Configuration
> --------------------
> Dell PowerEdge 2950
> Two Quad core E5440 CPUs
> 16 GB RAM
> Red Hat Enterprise Linux Version 5.2
> Kernel  2.6.18-92.1.6.el5  (x86_64)
> PERC Driver :  00.00.03.21
> PERC Firmware : 6.2.0-0013
> 
> Dell Support (Server/MD1000) Pro Support for IT 
> 
> Storage configuration:
> ---------------------
> 2 * PERC6E with two MD1000s attached to each 
> 
> Controller 1:
> MD1000 with SAS 400GB 10K RPM 
> MD1000 with SATA 1 TB 7.2K RPM
> 
> Controller 2:
> MD1000 with SATA 750GB 7.2K RPM
> MD1000 with SATA 2 TB 7.2K RPM
> 
> PERC6E Controller Configurations:
> -------------------------
> Controller Rebuild Rate : 30%
> Three RAID 5 Virtual Disks on each MD1000 
>   (5 disks / 5 disks / 4 disks + 1 Hot Spare)
> Read Policy              : No Read Ahead
> Write Policy             : Write Back
> ============================================================
> 
> 
> 
> Jeff Ewing
> 



