how to get rid of bad blocks in a file on PERC 5/I?

Tim Small tim at seoss.co.uk
Fri Apr 30 03:00:09 CDT 2010


Adam Nielsen wrote:
> I believe that when hard disks discover they have a bad sector 
> they attempt to remap it themselves, but it may not always happen right 
> away.  So it's possible that by the time you rebuilt the array the 
> sectors had been relocated.
>   

I believe the standard behaviour is:

. Read and apply simple (fast/hardware-implemented AKA "online") error
correction

    . If that fails, try to use more complex (slow/firmware-implemented
AKA "offline") ECC - retry this a (usually configurable) number of times.

        . In the case of successful correction (we have the user data),
write the data back to the sector, and then read-check it to see if it
was written successfully.

            . If the re-read-verify is OK, then continue as normal (maybe
increment one of the SMART counters)

            . If the re-read-verify fails, then reallocate the sector
(use a "spare" hidden reserved sector elsewhere on the disk).  Increment
the SMART "reallocated sector" count.

        . If the "offline" ECC fails, then we've really lost data, so
return a read error to the disk controller and mark the sector as
"pending".  Attempting to read the sector again may restart the
"offline" correction attempts.
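
As a rough illustration only, the decision tree above can be written as
a small Python model.  Nothing here talks to a real drive: the names
DriveState and handle_read are made up for this sketch, the three
booleans stand in for what the firmware would find out for itself, and
the "offline" ECC retries are collapsed into a single yes/no.

```python
from dataclasses import dataclass


@dataclass
class DriveState:
    reallocated: int = 0  # stands in for the SMART "reallocated sector" count
    pending: int = 0      # stands in for the number of "pending" sectors


def handle_read(state, online_ok, offline_ok, verify_ok):
    """Model one sector read; returns a label for what the drive did."""
    if online_ok:
        return "ok"                # fast "online" correction was enough
    if not offline_ok:
        state.pending += 1         # data really lost: mark sector pending
        return "read-error"        # and report the error to the controller
    # "Offline" ECC recovered the data: write it back, then re-read it.
    if verify_ok:
        return "ok-rewritten"      # sector holds data again, carry on
    state.reallocated += 1         # re-read failed: remap to a spare sector
    return "ok-reallocated"
```

Calling handle_read(DriveState(), False, False, False) follows the last
branch of the list: the caller gets a read error and the pending count
goes up by one.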




If the controller later tries to WRITE to that sector instead of reading
it, then the drive will do the "write, and verify" step again as above
with the new data (i.e. see if the data can then be read, and if not,
reallocate it).
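
A minimal sketch of that write path, in the same spirit: handle_write
and its arguments are hypothetical names, and the idea that the sector
stops being "pending" once it has been overwritten (whether or not it
then gets reallocated) is an assumption of this sketch rather than
something stated above.

```python
def handle_write(sector, pending, reallocated, verify_ok):
    """Model a write to `sector`; returns the new reallocated count.

    pending     -- set of sector numbers currently marked "pending"
    reallocated -- current reallocated-sector count
    verify_ok   -- whether the read-back after the write succeeded
    """
    # Assumption: the new data replaces the lost contents, so the sector
    # no longer needs to be tracked as pending.
    pending.discard(sector)
    if not verify_ok:
        reallocated += 1   # read-back failed: remap to a spare sector
    return reallocated
```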

In the case of a RAID controller, standard practice is for the
controller to reconstruct the data from the other drives, and then issue
the write instruction back to the original drive.  The better RAID
implementations (e.g. Linux software RAID) will actually REPORT THIS TO
YOU when it happens, so that you know the drive may be unwell.  To make
matters worse, you can't even reliably check the SMART data yourself with
some of the Dell/LSI controllers - and LSI/Dell don't seem to care
enough to fix this...

https://bugzilla.kernel.org/show_bug.cgi?id=14831
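
To make the "reconstruct the data from the other drives" step concrete,
here is a minimal sketch that assumes a RAID 5 style layout, where the
parity block is the XOR of the data blocks in a stripe (for RAID 1 the
controller would simply copy from the mirror).  reconstruct_block is a
made-up name and the byte strings are toy data, not anything a PERC
exposes.

```python
from functools import reduce


def reconstruct_block(surviving_blocks):
    """XOR equal-sized blocks together, byte by byte.

    Given every block of a stripe except one (data and parity alike),
    the result is the contents of the missing block.
    """
    return bytes(
        reduce(lambda a, b: a ^ b, column)
        for column in zip(*surviving_blocks)
    )


d0 = b"\x01\x02\x03\x04"
d1 = b"\x10\x20\x30\x40"
parity = reconstruct_block([d0, d1])       # parity = d0 XOR d1
rebuilt = reconstruct_block([d0, parity])  # pretend d1 was unreadable
```

The controller would then write the rebuilt block back to the drive
that returned the read error, which gives that drive the chance to
rewrite or reallocate the sector.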

> However given the subsequent failures I would think that the drive may 
> actually be fine - maybe you can run a self test on it without going 
> through a RAID controller.
>   

Using smartctl to check what's gone on with the drive itself would
be the best thing to do, I think...  Recent smartctl has support for
communicating with drives behind PERCs.
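
For example, smartctl's "-d megaraid,N" device type addresses physical
disk N behind an LSI MegaRAID-based controller.  The sketch below only
builds the command line; it assumes the PERC is handled by the megaraid
driver, that the array shows up as /dev/sda, and that the disk numbers
start at 0, all of which you would need to check on your own system.
smartctl_command is a hypothetical helper name.

```python
def smartctl_command(disk_number, block_device="/dev/sda"):
    """Build a smartctl argv for one physical disk behind the controller.

    disk_number  -- the N in "-d megaraid,N" (assumed; adjust per system)
    block_device -- the logical device exported by the controller (assumed)
    """
    return [
        "smartctl",
        "-a",                             # print all SMART information
        "-d", f"megaraid,{disk_number}",  # pass through to physical disk N
        block_device,
    ]


# Candidate commands for the first four physical disks (numbers are guesses).
commands = [smartctl_command(n) for n in range(4)]
```

Each list can be handed to subprocess.run() as root; on ATA drives the
reallocated and pending sector counts discussed earlier should appear
in the attribute table of the output.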


> I don't know whether the situation has improved in recent years, the 
> experiences were enough to persuade me to switch to software RAID which 
> I have stuck with ever since.
>   

ACK.  My conclusion is also to use AHCI, and software RAID.  It's more
reliable generally, and if you do find a bug, the maintainers are
responsive (or you can even fix it yourself, or pay someone else to -
this is Open Source, right?  Presumably that's why people use Linux in
the first place?).  Oh, and it's cheaper too.

Tim.


-- 
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53  http://seoss.co.uk/ +44-(0)1273-808309


