Possible disk corruption - help and advice appreciated

Jefferson Ogata poweredge at antibozo.net
Sun Sep 19 09:43:59 CDT 2010


On 2010-09-19 13:41, Faris Raouf wrote:
> One of our R200s seems to be having some disk problems but I don't
> understand what's happening. Any info and advice would be appreciated.
> 
> Repairing or fault-finding Linux filesystems is all new stuff to me - in 10
> years or so I've never had to worry about it until now - so please be
> gentle with me.
> 
> The system in question has two 500GB SATA drives connected to a SAS 6/iR
> hardware RAID controller as a RAID-1 mirrored pair.
> 
> OMSA reports no errors but I'm seeing rather a lot of this kind of thing in
> my logs:
> 
> EXT3-fs error (device sda3): ext3_lookup: unlinked inode 35753000 in dir
> #35752458
> 
> The same thing happened a few weeks ago, and on rebooting I was horrified to
> find fsck reporting "Duplicate or bad block in use". I then found myself in a
> recovery console (I think?) and quickly had to learn a few things about
> fsck to get it to repair things. It was reporting things like
> "multiply-linked blocks in inode" but after a lot of pressing "y" I was able
> to reboot. There was no apparent data loss, all seemed to be fine and there
> were no more of those errors...until a week or so ago.
> 
> That's when I started getting these "unlinked inode" errors again, and I
> expect I'm going to have to go through a reboot and fsck hell again shortly.
> 
> The system runs CentOS 5.5 but with a Virtuozzo (same as OpenVZ)
> 2.6.18-028stab070.2 kernel.
> 
> I honestly don't know where to begin on this one. If there are bad blocks on
> a disk, surely OMSA would report a problem?
> 
> What *useful* and ideally not heart stopping things should I be looking to
> do at the next reboot to try to get to the bottom of this? I can't begin to
> describe the horrors I went through the first time -- it was at 2am on a
> Saturday and all I had for help was Google and my co-lo company's duty
> engineer who did his absolute best to help but wasn't a Linux expert -- and
> I'd like to try to avoid that situation this time.
> 
> The worrying thing is that I'm currently unable to back up one particularly
> vital Container (VE) on the server in question. The backup fails but doesn't
> give me any indication as to why. I would not be surprised if the two things
> were related. But it puts me in a chicken and egg situation which doesn't
> help my stress levels.

My guess would be that the mirror is inconsistent and that you're
getting weird results depending on which disk is consulted for a given
block.

Not an easy thing to recover from, unfortunately.

I think the safest recovery strategy would be to power the system down
completely, image the two drives on another system that doesn't have a
RAID controller (so you get the RAID metadata as well), then pick one of
them to be the "correct" one and boot on that drive alone. Do your fsck
and see how bad the filesystem looks.
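
For the imaging step, here is a minimal sketch in Python 3 of a
byte-for-byte copy that also records a SHA-256 digest, so you can later
confirm an image hasn't changed. The function name and the example paths
are my own placeholders, not anything specific to your setup, and it
assumes the drive reads cleanly: it makes no attempt to skip or retry
unreadable sectors, so if a drive is throwing read errors a purpose-built
tool such as GNU ddrescue is the better choice.

```python
import hashlib


def image_device(src_path, dst_path, chunk_size=4 * 1024 * 1024):
    """Copy src_path to dst_path byte for byte.

    Returns (bytes_copied, sha256_hex). Any read error simply raises;
    there is no bad-sector handling in this sketch.
    """
    digest = hashlib.sha256()
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # empty read means end of device/file
                break
            dst.write(chunk)
            digest.update(chunk)
            copied += len(chunk)
    return copied, digest.hexdigest()


# Hypothetical usage on the imaging host (run as root, and double-check
# which device is which before you start):
#   size, sha = image_device("/dev/sdb", "/backup/disk0.img")
#   print(size, sha)
```

Read the whole device rather than a partition, since the point is to
capture everything on the drive, including whatever the controller
wrote for itself.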

Then shut down, swap disks, and boot/fsck again to see if the other disk
seems more consistent.
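
If you want a look at each disk that isn't heart-stopping, e2fsck can be
run with -n (open read-only, answer "no" to every question) and -f
(check even if the filesystem is marked clean). Its exit status is a
sum of the flag values documented in the e2fsck man page. The sketch
below, with names of my own choosing, runs that check and decodes the
status; the device path in the usage comment is only an assumption
taken from the sda3 in your logs.

```python
import subprocess

# Exit status flags as documented in the e2fsck(8) man page.
E2FSCK_FLAGS = {
    1: "filesystem errors corrected",
    2: "system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "canceled by user request",
    128: "shared library error",
}


def decode_e2fsck_status(status):
    """Turn an e2fsck exit status into a list of human-readable flags."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in sorted(E2FSCK_FLAGS.items()) if status & bit]


def check_read_only(device):
    """Run a forced, read-only e2fsck on device and decode the result."""
    status = subprocess.call(["e2fsck", "-n", "-f", device])
    return decode_e2fsck_status(status)


# Hypothetical usage, from a rescue environment with the filesystem
# unmounted:
#   print(check_read_only("/dev/sda3"))
```

Since -n never writes, a status including "filesystem errors left
uncorrected" just means there is damage to repair; comparing how much
e2fsck complains about on each disk is what tells you which one to
keep.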

Choose the disk that's in better shape, boot from that into the RAID
BIOS, then add the other disk back in and rebuild the mirror onto the
less consistent disk, so that the better copy overwrites it.

If anything goes wrong, you can restore the images you took on the other
system.
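
The images also give you a way to test my guess about the mirror
without touching the live system: compare them chunk by chunk and see
where they differ. This is a rough Python 3 sketch under the assumption
that both images are whole-disk copies taken with the system powered
off; the function name and paths are placeholders of mine. Expect some
differences in whatever area the controller uses for its own metadata,
so it's differences inside the filesystem that would point to an
inconsistent mirror.

```python
def differing_ranges(path_a, path_b, chunk_size=1024 * 1024):
    """Return (start, end) byte ranges where the two files differ.

    Ranges are reported at chunk granularity and adjacent differing
    chunks are merged into one range.
    """
    ranges = []
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if not chunk_a and not chunk_b:  # both files exhausted
                break
            length = max(len(chunk_a), len(chunk_b))
            if chunk_a != chunk_b:
                if ranges and ranges[-1][1] == offset:
                    # Extend the previous range instead of starting a new one.
                    ranges[-1] = (ranges[-1][0], offset + length)
                else:
                    ranges.append((offset, offset + length))
            offset += length
    return ranges


# Hypothetical usage:
#   for start, end in differing_ranges("/backup/disk0.img", "/backup/disk1.img"):
#       print(start, end)
```

An empty result would mean the two members are identical at the byte
level, in which case the problem lies somewhere other than a diverged
mirror.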



More information about the Linux-PowerEdge mailing list