Error messages and data corruption

Marcus Franke mfranke at evendi.de
Mon May 8 09:00:32 CDT 2006


Hello all,

I've googled a bit and finally found this list, along with references
to my problem in it.

My story goes like this:

Two years ago in December we bought a shiny new 2850 with four
15k disks in a RAID 10 configuration and installed RHEL3 on it.
The machine did well until last December, when we started
to have problems with the database (MySQL) mysteriously crashing
with errors in the InnoDB file and in the transaction files.

It became very annoying when we had to restore the database
from the replication slave because the filesystem the database
was located on was totally destroyed. Even an offline filesystem
check could not bring the ext3 partition back to life.

After the second total crash of the database we moved the database
off the RAID10 to a second set of disks that is now running in the
machine.

Dell support replaced the RAID controller, cache memory, backplane
and the four disks the RAID10 consists of. Together with the
controller replacement we updated to the latest firmware found on
Dell's web servers.

The error is still present, now in a form that I have found
described and experienced by some of you:

Suddenly the box was offline and could no longer be accessed
via ssh/www, although it seemed the server was still running and
could process some requests, but not fully.

Looking at the console we saw a lot of lines spamming about I/O
errors on different sectors, and only a hard reset could resurrect
the server. Sadly, we had no logs after the first crash because
the messages logfile could not be written.

I configured syslog to send kernel messages to a second server,
and a week later we experienced the next major crash. This time
I have logs:

May  2 08:16:21 192.168.1.49 syslogd: /var/log/messages: Read-only file system 
May  2 08:16:21 192.168.1.49 syslogd: /var/log/maillog: Read-only file system 
May  2 08:16:21 192.168.1.49 kernel: megaraid: aborting-397499622 cmd=2a <c=0 t=0 l=0> 
May  2 08:16:21 192.168.1.49 kernel: megaraid: aborting-397499622 cmd=2a <c=0 t=0 l=0> 
May  2 08:16:21 192.168.1.49 kernel: megaraid: aborting-397499624 cmd=2a <c=0 t=0 l=0> 
May  2 08:16:21 192.168.1.49 kernel: megaraid: aborting-397499624 cmd=2a <c=0 t=0 l=0> 
May  2 08:16:21 192.168.1.49 kernel: megaraid: aborting-397499610 cmd=2a <c=0 t=0 l=0> 
May  2 08:16:21 192.168.1.49 kernel: megaraid: aborting-397499610 cmd=2a <c=0 t=0 l=0> 
May  2 08:16:21 192.168.1.49 kernel: megaraid: aborting-397499604 cmd=2a <c=0 t=0 l=0> 
May  2 08:16:21 192.168.1.49 kernel: megaraid: aborting-397499604 cmd=2a <c=0 t=0 l=0> 
May  2 08:16:21 192.168.1.49 kernel: megaraid: aborting-397499615 cmd=2a <c=0 t=0 l=0> 
May  2 08:16:21 192.168.1.49 kernel: megaraid: aborting-397499615 cmd=2a <c=0 t=0 l=0> 
May  2 08:16:21 192.168.1.49 kernel: megaraid: aborting-397499612 cmd=2a <c=0 t=0 l=0> 
May  2 08:16:21 192.168.1.49 kernel: megaraid: aborting-397499612 cmd=2a <c=0 t=0 l=0> 
May  2 08:16:21 192.168.1.49 kernel: megaraid: aborting-397499619 cmd=2a <c=0 t=0 l=0> 
May  2 08:16:22 192.168.1.49 kernel: megaraid: aborting-397499567 cmd=2a <c=0 t=1 l=0> 
May  2 08:16:22 192.168.1.49 kernel: megaraid: aborting-397499567 cmd=2a <c=0 t=1 l=0> 
May  2 08:16:22 192.168.1.49 syslogd: /var/log/cron: Read-only file system 
May  2 08:16:22 192.168.1.49 kernel: megaraid: aborting-397499575 cmd=2a <c=0 t=1 l=0> 
May  2 08:16:22 192.168.1.49 kernel: megaraid: aborting-397499575 cmd=2a <c=0 t=1 l=0> 
May  2 08:16:22 192.168.1.49 kernel: megaraid: reset-397499570 cmd=2a <c=0 t=1 l=0> 
May  2 08:16:22 192.168.1.49 kernel: megaraid: reset-397499570 cmd=2a <c=0 t=1 l=0> 
May  2 08:16:22 192.168.1.49 kernel: megaraid: reset-397499627 cmd=2a <c=0 t=0 l=0> 
May  2 08:16:22 192.168.1.49 kernel: megaraid: reset-397499627 cmd=2a <c=0 t=0 l=0> 
May  2 08:16:22 192.168.1.49 kernel: megaraid: hw error, cannot reset 
May  2 08:16:22 192.168.1.49 kernel: megaraid: hw error, cannot reset 
May  2 08:16:22 192.168.1.49 kernel: megaraid: reset-397499605 cmd=2a <c=0 t=0 l=0> 
May  2 08:16:22 192.168.1.49 kernel: megaraid: reset-397499605 cmd=2a <c=0 t=0 l=0> 
May  2 08:16:22 192.168.1.49 kernel: megaraid: hw error, cannot reset 
May  2 08:16:22 192.168.1.49 kernel: megaraid: reset-397499621 cmd=2a <c=0 t=0 l=0> 
May  2 08:16:22 192.168.1.49 kernel: megaraid: reset-397499621 cmd=2a <c=0 t=0 l=0> 
May  2 08:16:22 192.168.1.49 kernel: megaraid: hw error, cannot reset 
May  2 08:16:22 192.168.1.49 kernel: megaraid: hw error, cannot reset 
May  2 08:16:22 192.168.1.49 kernel: megaraid: reset-397499623 cmd=2a <c=0 t=0 l=0> 
May  2 08:16:22 192.168.1.49 kernel: megaraid: hw error, cannot reset 
May  2 08:16:22 192.168.1.49 kernel: megaraid: hw error, cannot reset 
May  2 08:16:22 192.168.1.49 kernel: megaraid: reset-397499548 cmd=2a <c=0 t=1 l=0> 
May  2 08:16:22 192.168.1.49 kernel: megaraid: reset-397499548 cmd=2a <c=0 t=1 l=0> 
May  2 08:16:22 192.168.1.49 kernel: megaraid: reset-397499569 cmd=2a <c=0 t=1 l=0> 
May  2 08:16:22 192.168.1.49 kernel: megaraid: reset-397499569 cmd=2a <c=0 t=1 l=0> 
May  2 08:16:22 192.168.1.49 kernel:  I/O error: dev 08:10, sector 173099720 
May  2 08:16:22 192.168.1.49 kernel:  I/O error: dev 08:10, sector 173099720 
May  2 08:16:22 192.168.1.49 kernel:  I/O error: dev 08:10, sector 173102760 
May  2 08:16:22 192.168.1.49 kernel:  I/O error: dev 08:10, sector 173102760 
May  2 08:16:22 192.168.1.49 kernel:  I/O error: dev 08:10, sector 173102776 
May  2 08:16:22 192.168.1.49 kernel:  I/O error: dev 08:10, sector 173102776 
May  2 08:16:22 192.168.1.49 kernel:  I/O error: dev 08:10, sector 173102784 
May  2 08:16:22 192.168.1.49 kernel:  I/O error: dev 08:10, sector 173102784 
May  2 08:16:22 192.168.1.49 kernel:  I/O error: dev 08:10, sector 173102808 
May  2 08:16:22 192.168.1.49 kernel:  I/O error: dev 08:10, sector 173102808 


I cut some lines, but this is a complete overview of the various error messages
logged by the kernel.
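
To get a quick overview of an excerpt like the one above, the lines can be tallied by message type. The following is only a minimal sketch: the function name `summarize` is made up for illustration, and reading `c`, `t` and `l` as channel, target and LUN is an assumption about the megaraid driver's output, not something the log itself states. If `cmd` is the SCSI opcode in hex, `2a` would be WRITE(10), which would fit the filesystems going read-only on failed writes.

```python
import re
from collections import Counter

# Matches lines such as:
#   kernel: megaraid: aborting-397499622 cmd=2a <c=0 t=0 l=0>
# Assumption: c/t/l are channel/target/LUN.
MEGARAID_RE = re.compile(
    r"megaraid: (?P<action>aborting|reset)-(?P<serial>[0-9a-f]+) "
    r"cmd=(?P<cmd>[0-9a-f]+) <c=(?P<c>\d+) t=(?P<t>\d+) l=(?P<l>\d+)>"
)

# Matches lines such as:
#   kernel:  I/O error: dev 08:10, sector 173099720
IO_ERROR_RE = re.compile(
    r"I/O error: dev (?P<dev>[0-9a-f]{2}:[0-9a-f]{2}), sector (?P<sector>\d+)"
)

HW_ERROR_TEXT = "megaraid: hw error, cannot reset"


def summarize(lines):
    """Count megaraid aborts/resets per (action, c, t, l), I/O errors per
    device, and 'hw error' lines. Unrelated lines are ignored."""
    actions = Counter()
    io_errors = Counter()
    hw_errors = 0
    for line in lines:
        m = MEGARAID_RE.search(line)
        if m:
            key = (m.group("action"), int(m.group("c")),
                   int(m.group("t")), int(m.group("l")))
            actions[key] += 1
            continue
        m = IO_ERROR_RE.search(line)
        if m:
            io_errors[m.group("dev")] += 1
            continue
        if HW_ERROR_TEXT in line:
            hw_errors += 1
    return actions, io_errors, hw_errors
```

Feeding it the lines collected on the remote syslog server, for example via `summarize(open(path))` with `path` pointing at wherever that server stores them, would show whether the aborts and resets cluster on one target or are spread across both.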

Before the firmware update I had some I/O errors in the kernel log but none
of the others. I asked on the SCSI kernel mailing list for any information about
where I can find more about this ominous dev 08:10 thing and how it could
be mapped to a physical device in my RAID10 config.
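
As a rough aid for reading a device like 08:10, here is a small sketch that decodes it. It rests on two assumptions: that the kernel prints the device as major:minor in hexadecimal, and that major 8 follows the classic SCSI disk scheme of 16 minors per disk. The helper name `decode_dev` is made up for illustration. Note that this only names the block device as the kernel sees it, which behind a hardware RAID controller is a logical drive; it says nothing about which physical disk in the array is involved.

```python
def decode_dev(dev, minors_per_disk=16):
    """Decode a 'major:minor' string (assumed hex) into numbers and a
    /dev/sdX name. Only the first SCSI disk major (8) is handled."""
    major_text, minor_text = dev.split(":")
    major = int(major_text, 16)
    minor = int(minor_text, 16)
    if major != 8:
        raise ValueError("only SCSI disk major 8 is handled in this sketch")
    # Each disk owns a block of minors: 0 is the whole disk, 1..15 partitions.
    disk_index, partition = divmod(minor, minors_per_disk)
    name = "sd" + chr(ord("a") + disk_index)
    if partition:
        name += str(partition)
    return major, minor, "/dev/" + name


print(decode_dev("08:10"))
```

Under those assumptions, 08:10 is major 8, minor 16, which would be the second SCSI disk as a whole device rather than one of its partitions.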

All disks in the config have already been exchanged. Could the new disks in
the server be damaged again already? And how can the hardware RAID
screw up the filesystem?

I read suggestions that these errors happen when the machine is under heavy
load, but in those two major crashes the server went down in flames the first
time at around 4 o'clock in the morning and the second time close to 8 o'clock.
These are not times of very heavy load: just a few visitors on the webserver and
just a few updates in the database.

But searching the web I did not find any real solutions.

Just hints like:

- disable patrol read (why? If I understand the docs right, patrol read only
	runs in times of lower load)
- enable write-through and do not use the cache (at a performance loss)
- update the firmware of the controller
- update the BIOS of the server


I'm feeling a bit lost, like a detective trying to solve a major puzzle.
The most annoying fact is that the server ran very well for a whole year with
nearly no downtime and just reboots for kernel updates. But in the last 4-5
months this has been a constant source of extra work and worries, and it is
starting to become a never-ending story.

I guess I ran every diagnostic tool Dell has on its web pages.

What the heck is going on with these PERC 4e controllers?

regards,
Marcus



More information about the Linux-PowerEdge mailing list