how to get rid of bad blocks in a file on PERC 5/I?

Bond Masuda bond.masuda at jlbond.com
Thu Apr 29 19:13:39 CDT 2010


Hi everyone,

I could use some help trying to get rid of some bad blocks on a RAID-5 on
PERC 5/I controller.

Let me start by describing the setup:

- 8x SATA 500GB HDD on PERC 5/I with Dell firmware 5.2.2-0072
- The 8 drives are set up as a single RAID-5 virtual disk, which shows up in
RHEL 5.4 as /dev/sdc
- /dev/sdc is formatted with XFS (XFS support from the CentOS repository)
- /dev/sdc is mounted at /data
- for this discussion, let's call the disks 0:0, 0:1, 0:2, 0:3, 1:4, 1:5,
1:6, and 1:7

So, disk 1:6 all of a sudden is marked as "failed" and /dev/sdc becomes
degraded. Just in case another disk goes bad, we decided to take a backup at
that moment; that would give us a copy of the latest data instead of
something a day or more old. Our backup methodology is just a simple rsync of
/data to an external drive.
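
For reference, the backup is essentially just the following (the destination
path here is only illustrative):

    # -a archive mode (preserve perms/times/ownership), -v verbose, -H keep hard links
    rsync -avH /data/ /mnt/external/data-backup/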

During the backup run, we noticed that there was one 4GB file that did not
copy correctly. So, we used dd_rescue to make a copy of it but found that
there are 8 blocks that are not readable (blocks 5608072-5608079; the point
is that they fall in one contiguous range, file-system-wise).
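
The dd_rescue run was along these lines (paths are placeholders; dd_rescue
carries on past read errors and reports the positions it could not read):

    # -A writes zeroed blocks where reads fail instead of skipping them, -v is verbose
    dd_rescue -A -v /data/path/to/4GB-file /mnt/external/4GB-file.copy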

So, at this point, we're glad that we just did a backup of everything, since
now we're concerned that the rebuild of 1:6 might not succeed if there are
unreadable sectors somewhere. Just to see what might happen, and since 1:6
seemed to still be spinning, we decided to force it to rebuild without
replacing it. To our surprise, the rebuild of 1:6 actually succeeded!?!??!
We then ran 'omconfig storage vdisk action=checkconsistency controller=0
vdisk=0' and it completed successfully! Does checkconsistency make hard
drives remap bad sectors?
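
For reference, the relevant OMSA commands are roughly the following (the
omreport line is just to watch the vdisk state and progress):

    # start a consistency check on virtual disk 0
    omconfig storage vdisk action=checkconsistency controller=0 vdisk=0
    # show vdisk state/progress on controller 0
    omreport storage vdisk controller=0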

Now /dev/sdc is once again in good health, or so we think, but we're
suspicious of 1:6. At this point we unmounted /data and ran xfs_check on it.
It reported "block ?/? type unknown not expected", so we ran xfs_repair on
/data (the exact command sequence is sketched after the error output below).
Everything seemed to complete okay and we cleanly mounted /data again. We
then went back to examining the 4GB file. We tried another dd_rescue on it
but got the exact same results; 8 blocks in the exact same range as before
did not read. When we use rsync to copy this file, we get the following
console error messages:

end_request: I/O error, dev sdc, sector 6012984362
end_request: I/O error, dev sdc, sector 6012984362
end_request: I/O error, dev sdc, sector 6012984874
end_request: I/O error, dev sdc, sector 6012984874
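
(For completeness, the check/repair sequence mentioned above was roughly the
following; xfs_repair works on the unmounted device:)

    umount /data
    xfs_check /dev/sdc
    xfs_repair /dev/sdc
    mount /data    # assumes /data is listed in /etc/fstab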

Those end_request I/O errors make me think that there are bad sectors on one
or more of the disks (would you agree?). What I don't get is this: we were
already having this problem in degraded mode, when 1:6 was in the "failed"
state, so how could the controller rebuild 1:6 if there are read errors on
some disk other than 1:6?
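
To double-check that the failures really are at those LBAs, something like
the following should reproduce the read error directly against /dev/sdc and
show which extents of the 4GB file they fall in (sector number taken from the
messages above; the file path is a placeholder):

    # try to read 8 sectors starting at the first reported LBA, bypassing the page cache
    dd if=/dev/sdc of=/dev/null bs=512 skip=6012984362 count=8 iflag=direct
    # list the file's extents (units of 512-byte blocks) to see where the bad range sits
    xfs_bmap -v /data/path/to/4GB-file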

In the middle of investigating all this, disk 1:6 again went into the
"failed" state. When we ran 'omreport storage pdisk controller=0', some of
the fields for 1:6 were filled with garbage. This time we decided 1:6 really
is toast and replaced it with a spare we have. We put in the spare drive and
it began to rebuild. Again, we weren't sure it would be able to rebuild
successfully, since we think there are read errors somewhere else. But to our
surprise, the new 1:6 disk rebuilt successfully and /dev/sdc is once again in
Status=Ok, State=Ready.

We went back to investigating the 4GB file with bad blocks. We tried another
dd_rescue to copy it, but this time we got 16 bad blocks! The first 8 are
exactly as before (5608072-5608079), but there were also blocks
5608584-5608591; the second range of 8 blocks is itself contiguous but
separate from the first 8. Is the problem getting worse?

On the one hand, since the rebuild of 1:6 succeeded twice, this makes us
think this is NOT a hard disk issue but maybe an XFS issue? But after the
xfs_repair, xfs_check says /data is in good condition. And on the other hand,
the "I/O error, dev sdc, sector XXXX" messages make us think it could be a
hard disk issue.
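
One thing that might settle it, if the smartmontools build on the box can
talk through the PERC (the PERC 5/I is LSI MegaRAID based), is pulling SMART
data from each physical disk behind the controller, something like:

    # read SMART data from the physical disk with device ID 6 behind the controller;
    # the ",6" is just an example index; check each disk in turn
    smartctl -a -d megaraid,6 /dev/sdc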

Some thoughts? Advice?
-Bond






