Preventing I/O starvation on MD1000s triggered by a failed disk.

Kevin Davidson kevin at indigospring.co.uk
Tue Aug 24 18:04:53 CDT 2010


On 24 Aug 2010, at 22:25, Stroller <stroller at stellar.eclipse.co.uk> wrote:

> 
> Are you seriously telling me that if I go out and buy a 2TB external  
> drive from PC World, fill it up with movies, it's sure to fail before  
> I've used it 6 full times? Because that's what your "1 per every 12  
> TB" claim seems to imply. I don't think manufacturers would release  
> drives with such poor reliability, because I don't think consumers  
> would stand for it.

You should search for reports of CERN's studies on silent data corruption. They really care about this, as each run of the LHC generates terabytes of data that they cannot afford to have corrupted before they analyse it. Their findings turned out to be pretty scary: data reported as correctly written and correctly read back did not match a known good copy, at frequencies that will turn up in regular use of terabyte and larger disks. It looks like a combination of controller errors and media failures. RAID doesn't help you at all; the best it can do is tell you there's a problem. ZFS is about the only technology that can combat this.
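To see what that looks like in practice: ZFS keeps a checksum for every block, stored apart from the block itself, and a scrub re-reads the whole pool and verifies every block against its checksum. The pool name below is just a placeholder:

    # re-read every block in the pool and verify it against its checksum
    zpool scrub tank

    # watch scrub progress and list any checksum errors it found
    zpool status -v tank

Where the pool has redundancy (mirror or raidz), ZFS will also repair a block that fails its checksum from a good copy, rather than just reporting it.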

The problem is not drive failure, or even bad blocks; it's much more insidious and cannot be detected without storing checksums elsewhere and periodically checking that the data and checksums still agree. This problem has always been there; it's an issue now because we are dealing with more data and hitting it more often.
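If you want a feel for the mechanics outside ZFS, here's a crude sketch in Python of the same idea: hash everything, keep the hashes on separate media, and periodically re-read and compare. The paths, manifest name and hash algorithm here are arbitrary choices for illustration, not anything CERN or ZFS actually uses:

    import hashlib
    import json
    import os

    def checksum(path, bufsize=1 << 20):
        """Hash a file in chunks so large files need not fit in RAM."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(bufsize), b""):
                h.update(chunk)
        return h.hexdigest()

    def build_manifest(root, manifest="manifest.json"):
        """Record a checksum for every file under root.
        Keep the manifest on *different* storage to the data."""
        sums = {}
        for dirpath, _, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                sums[path] = checksum(path)
        with open(manifest, "w") as f:
            json.dump(sums, f)

    def scrub(manifest="manifest.json"):
        """Re-read every file and compare against its stored checksum."""
        with open(manifest) as f:
            sums = json.load(f)
        for path, expected in sums.items():
            if checksum(path) != expected:
                print("silent corruption:", path)

Run build_manifest once after the data is written, then cron the scrub; any file whose contents have drifted will fail the comparison, even though every read along the way reported success.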

-- 
Kevin Davidson
Apple Certified System Administrator
Sent from my iPad

indigospring: Making Sense of IT
w http://www.indigospring.co.uk/
t 0870 745 4001



