Preventing I/O starvation on MD1000s triggered by a failed disk.

Jeff Ewing jewing at aconex.com
Mon Aug 23 21:23:03 CDT 2010


I had a 1 TB SATA disk fail on an NFS server running RHEL 5.2, which triggered a rebuild onto a global hot spare. One hour later, when the rebuild was 42% complete, serviced NFS requests dropped from 20 per second to zero. CPU2 went to 100% utilization in an I/O wait state. Soon after, the internal drives became read-only and the server had to be power-cycled through the DRAC (the server was not configured to take crash dumps).
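
A minimal watchdog sketch of the kind that might have flagged this before the drives went read-only: it samples per-CPU iowait from /proc/stat and warns when a CPU is pinned in I/O wait, the symptom CPU2 showed here. The 90% threshold and 5-second interval are assumptions, not tuned values; Python 2 as shipped with RHEL 5.

    #!/usr/bin/env python
    # Hypothetical watchdog sketch: sample per-CPU iowait from /proc/stat
    # and flag any CPU that spends nearly all of its time in I/O wait.
    import time

    INTERVAL = 5      # seconds between samples (assumption)
    THRESHOLD = 0.9   # iowait fraction treated as "stuck" (assumption)

    def cpu_ticks():
        # Per-CPU lines in /proc/stat look like:
        #   cpu2 user nice system idle iowait irq softirq ...
        stats = {}
        for line in open('/proc/stat'):
            fields = line.split()
            if fields[0].startswith('cpu') and fields[0] != 'cpu':
                ticks = [int(x) for x in fields[1:]]
                stats[fields[0]] = (ticks[4], sum(ticks))  # (iowait, total)
        return stats

    while True:
        before = cpu_ticks()
        time.sleep(INTERVAL)
        after = cpu_ticks()
        for cpu in sorted(after):
            io0, total0 = before[cpu]
            io1, total1 = after[cpu]
            elapsed = total1 - total0
            if elapsed > 0 and float(io1 - io0) / elapsed > THRESHOLD:
                print '%s spent %.0f%% of the last %d seconds in iowait' % (
                    cpu, 100.0 * (io1 - io0) / elapsed, INTERVAL)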

This hardware configuration had been in production and stable for many months.

How could this be prevented in future?

============================================================
Server Configuration
--------------------
Dell PowerEdge 2950
Two quad-core Intel Xeon E5440 CPUs
16 GB RAM
Red Hat Enterprise Linux Version 5.2
Kernel  2.6.18-92.1.6.el5  (x86_64)
PERC Driver :  00.00.03.21
PERC Firmware : 6.2.0-0013

Dell support contract (server and MD1000s): ProSupport for IT

Storage configuration:
---------------------
Two PERC6E controllers, with two MD1000s attached to each

Controller 1:
MD1000 with SAS 400GB 10K RPM 
MD1000 with SATA 1 TB 7.2K RPM

Controller 2:
MD1000 with SATA 750GB 7.2K RPM
MD1000 with SATA 2 TB 7.2K RPM

PERC6E Controller Configuration:
--------------------------------
Controller Rebuild Rate : 30%  (see the sketch after this block)
Three RAID 5 Virtual Disks on each MD1000 
  (5 disks / 5 disks / 4 disks + 1 hot spare)
Read Policy              : No Read Ahead
Write Policy             : Write Back
============================================================
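
For what it's worth, the rebuild rate above can be changed at run time through OpenManage Server Administrator's omconfig CLI. A minimal sketch, assuming OMSA is installed and the affected PERC is controller 0 (both assumptions; the 10% target rate is for illustration only):

    #!/usr/bin/env python
    # Sketch: lower the PERC rebuild rate so host I/O keeps priority
    # during a rebuild, via OMSA's documented syntax:
    #   omconfig storage controller action=setrebuildrate controller=<id> rate=<0-100>
    import subprocess

    def set_rebuild_rate(controller, rate):
        rc = subprocess.call(['omconfig', 'storage', 'controller',
                              'action=setrebuildrate',
                              'controller=%d' % controller,
                              'rate=%d' % rate])
        if rc != 0:
            raise RuntimeError('omconfig exited with status %d' % rc)

    # Controller 0 and a 10% rate are assumptions, not recommendations.
    set_rebuild_rate(0, 10)

The trade-off is a longer rebuild window, so a lower rate leaves the array degraded for longer.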



Jeff Ewing


