Possible disk corruption - help and advice appreciated

Faris Raouf asterisk at raouf.net
Sun Sep 19 08:41:37 CDT 2010


Dear all,

One of our R200s seems to be having some disk problems but I don't
understand what's happening. Any info and advice would be appreciated.

Repairing or fault-finding Linux filesystems is all new to me; in 10
years or so I've never had to worry about it until now, so please be
gentle with me.

The system in question has two 500GB SATA drives connected to a SAS 6/iR
hardware RAID controller as a RAID-1 mirrored pair.

OMSA reports no errors but I'm seeing rather a lot of this kind of thing in
my logs:

EXT3-fs error (device sda3): ext3_lookup: unlinked inode 35753000 in dir
#35752458
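
For anyone wanting to tally these rather than eyeball them, here is a rough
Python sketch that counts such messages per device and kernel function. The
pattern is based only on the single line quoted above, so other ext3 messages
may not match, and the function name summarize_ext3_errors is just my own
illustrative choice.

```python
import re
from collections import Counter

# Pattern derived from the one log line quoted above (an assumption):
# "EXT3-fs error (device sda3): ext3_lookup: unlinked inode ... in dir #..."
EXT3_ERROR = re.compile(
    r"EXT3-fs error \(device (?P<device>\w+)\): (?P<func>\w+): (?P<msg>.*)"
)


def summarize_ext3_errors(lines):
    """Count EXT3-fs error lines per (device, kernel function) pair."""
    counts = Counter()
    for line in lines:
        match = EXT3_ERROR.search(line)
        if match:
            counts[(match.group("device"), match.group("func"))] += 1
    return counts
```

It takes any iterable of lines, so an open log file can be passed in directly.
Which file the kernel messages end up in depends on the syslog setup.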

The same thing happened a few weeks ago. On rebooting I was horrified to
find fsck reporting "Duplicate or bad block in use", and I then found myself
in what I think was a recovery console, quickly having to learn enough about
fsck to get it to repair things. It was reporting things like
"multiply-linked blocks in inode", but after a lot of pressing "y" I was able
to reboot. There was no apparent data loss, all seemed to be fine and there
were no more of those errors...until a week or so ago.

That's when I started getting these "unlinked inode" errors again, and I
expect I'm going to have to go through a reboot and fsck hell again shortly.

The system runs CentOS 5.5 but with a Virtuozzo (same as OpenVZ)
2.6.18-028stab070.2 kernel.

I honestly don't know where to begin on this one. If there are bad blocks on
a disk, surely OMSA would report a problem?

What *useful* and ideally not heart stopping things should I be looking to
do at the next reboot to try to get to the bottom of this? I can't begin to
describe the horrors I went through the first time -- it was at 2am on a
Saturday and all I had for help was Google and my co-lo company's duty
engineer who did his absolute best to help but wasn't a Linux expert -- and
I'd like to try to avoid that situation this time.

The worrying thing is that I'm currently unable to back up one particularly
vital Container (VE) on the server in question. The backup fails but doesn't
give me any indication as to why. I would not be surprised if the two things
were related. But it puts me in a chicken-and-egg situation, which doesn't
help my stress levels.

Thanks,

Faris.



More information about the Linux-PowerEdge mailing list