Error messages and data corruption
Tony Molloy
tony.molloy at ul.ie
Mon May 8 09:12:45 CDT 2006
On Monday 08 May 2006 15:00, Marcus Franke wrote:
> Hello all,
>
> I've googled a bit and finally found this list, along with
> references to my problem in it.
>
> My story goes like this:
>
> Two years ago in December we bought a new and shiny 2850 with four
> 15k disks in a RAID 10 configuration and installed RHEL3 on it.
> This machine did well until December of last year, when we started
> to have problems with the database (MySQL) mysteriously crashing
> with errors in the InnoDB file and in the transaction files.
>
> It became very annoying when we had to restore the database from
> a replication slave because the filesystem the database was
> located on was totally destroyed. Even an offline scandisk could
> not bring the ext3 partition back to life.
>
> After the second total crash of the database we moved the database
> off the RAID10 onto a second set of disks that is now running in
> the machine.
>
> Dell support replaced the RAID controller, cache memory, backplane
> and the four disks the RAID10 consists of. Together with the
> controller exchange we updated to the latest firmware available on
> Dell's web servers.
>
> The error is still present, now in a form that I have seen
> described and experienced by some of you:
>
> Suddenly the box was offline and could no longer be accessed
> via ssh/www, although it seemed the server was still running and
> could process some requests, but not fully.
>
> Looking at the console we saw a flood of lines about I/O errors
> on different sectors, and only a hard reset could resurrect the
> server. Sadly, we had no logs after the first crash because the
> messages logfile could not be written.
>
> I configured syslog to send kernel messages to a second server,
> and a week later we experienced the next major crash. This time
> I have logs:
>
> May 2 08:16:21 192.168.1.49 syslogd: /var/log/messages: Read-only file
snip...
>
> Before the firmware update I had some I/O errors in the kernel log but
> none of the others. I asked on the SCSI kernel mailing list where I
> can find more information about this ominous dev 08:10 thing and how
> it could be mapped to a physical device in my RAID10 config.
>
> All disks in the config have already been exchanged. Are the new
> disks in the server damaged again? And how can a hardware RAID
> screw up the filesystem?
>
> I read suggestions that these errors happen when the machine is
> under heavy load, but in the two major crashes the server went down
> in flames at around 4 o'clock in the morning the first time and close
> to 8 o'clock the second time. These are not times of very heavy load:
> just a few visitors on the webserver and a few updates in the
> database.
>
> But searching the web I did not find any real solutions.
>
> Just hints like:
>
> - disable patrol read (why? If I understand the docs right, patrol
>   read only runs in times of lesser load)
> - enable write-through and do not use the cache (at a performance loss)
> - update the firmware of the controller
> - update the BIOS of the server
>
>
> I'm feeling a bit lost, like a detective trying to solve a major
> puzzle. The most annoying fact is that the server ran very well for a
> whole year with nearly no downtime, just reboots for kernel updates.
> But in the last 4-5 months it has been a constant source of extra
> work and worries, and it is starting to become a never-ending story.
>
> I guess I have run every diagnostic tool Dell has on its web pages.
>
> What the heck is going on with these PERC 4e controllers?
>
> regards,
> Marcus
>
Don't know if this will help, but I seem to remember reading on some
mailing list, possibly CentOS, that RAID10 in the 2850 is unreliable! Or
it could be that RAID10 on the PERC 4 is unreliable. I've got several
2850s with PERC 4 configured as RAID5 and they have given me no trouble
whatsoever.
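
On the dev 08:10 question, in case it helps: as far as I know, 2.4
kernels such as the one in RHEL3 print the device as major:minor in
hexadecimal, so 08:10 would be major 8, minor 16. Major 8 is the SCSI
disk driver with 16 minors per disk, which would make it the second
SCSI disk, sdb. Here is a rough sketch of that decoding in Python. The
function names decode_kdev and sd_name are made up for illustration,
and the hex reading is my assumption, not something taken from your
logs:

```python
def decode_kdev(text):
    """Split an 'MM:mm' device string into (major, minor).

    Assumption: both fields are hexadecimal, as printed by 2.4 kernels.
    """
    major_hex, minor_hex = text.strip().split(":")
    return int(major_hex, 16), int(minor_hex, 16)


def sd_name(major, minor):
    """Map a device number in SCSI disk major 8 to a name like 'sdb' or 'sdb1'."""
    if major != 8:
        raise ValueError("only SCSI disk major 8 is handled in this sketch")
    # Major 8 reserves 16 minors per disk: 0 is the whole disk,
    # 1 to 15 are its partitions.
    disk, part = divmod(minor, 16)
    name = "sd" + chr(ord("a") + disk)
    return name if part == 0 else name + str(part)


major, minor = decode_kdev("08:10")
print(major, minor, sd_name(major, minor))
```

Note that behind a PERC, sdb would be a logical drive exported by the
controller, so this would only tell you which array is affected, not
which physical disk.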
Regards,
Tony
--
Tony Molloy.
Dept. of Comp. Sci.
University of Limerick