PE 1850 - LSI Logic MegaRAID (PERC 4e/Si) hw problems

maurice.croes at roxpro.be maurice.croes at roxpro.be
Thu Apr 12 19:29:44 CDT 2007


Hi all,

We've been using Dell PEs for a couple of years now, but lately we've
been having problems with our PE 1850.

It keeps failing at random, even without a lot of disk activity.

Some details,

Server: PE 1850
RAID controller: LSI Logic MegaRAID 522A, Dell PowerEdge Expandable RAID controller 4e/Si
OS: Linux (Debian), kernel 2.4.32, megaraid: v2.10.10.1

The controller originally ran an older (default) firmware, but since we noticed that an urgent upgrade was available, we flashed it.

But we are still having some problems.


As far as I can tell, before the flash we saw a lot of lines like these on the console,

 > megaraid: aborting-29762 cmd=2a <c=2 t=0 l=0>
 > megaraid abort: 29762:21[255:128], fw owner

and Linux was completely unusable (a lot of bus errors when executing commands, a partition that was remounted read-only, etc.)

After the flash, we saw the following lines on the console,

 > I/O error, dev sda3, sector ..

But the problems were the same (bus errors, read-only filesystem, ...).
It is possible that the system was still generating "abort" errors, but none were visible on the console while we were in the datacenter.
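
To check whether the "abort" messages kept appearing after the flash, a saved console or kernel log could be scanned for both message types. This is only a minimal sketch: the patterns are derived from the two sample lines quoted above, real driver output may differ, and the function name `summarize` and the log path in the usage comment are hypothetical.

```python
import re
from collections import Counter

# Patterns based only on the console lines quoted in this post;
# other driver or kernel versions may print different formats.
ABORT_RE = re.compile(
    r"megaraid: aborting-(\d+) cmd=([0-9a-f]+) <c=(\d+) t=(\d+) l=(\d+)>"
)
IO_ERR_RE = re.compile(r"I/O error, dev (\w+), sector")


def summarize(lines):
    """Count megaraid abort messages per channel/target/LUN and
    I/O errors per device."""
    counts = Counter()
    for line in lines:
        m = ABORT_RE.search(line)
        if m:
            key = "c=%s t=%s l=%s" % m.group(3, 4, 5)
            counts[("abort", key)] += 1
            continue
        m = IO_ERR_RE.search(line)
        if m:
            counts[("io_error", m.group(1))] += 1
    return counts


# Hypothetical usage; the path is an assumption, not taken from the post:
# with open("/var/log/kern.log") as fh:
#     for key, n in sorted(summarize(fh).items()):
#         print(key, n)
```

Grouping by channel/target/LUN and by device would show whether all the errors point at the same logical drive.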


Besides running e2fsck -c on the partition, I went looking for more information.


I used the linttylog tool to dump a log file,

http://www.roxpro.be/dell/tty.log


I know this looks bad, particularly,

line 104: T12: rebuildResume checksum is bad - initializing NVRAM structure

line 106: T12: RMW: NVRAM structure invalid - initializing

line 147+: ECC Error: Multi-Bit Read error from ATU, addr=c6dfcde0, syndrome=66 [bit=255]
ECC Error: Single-Bit Read error from ATU, addr=c6dfcdf0, syndrome=c4 [bit=2]
Multi-bit or overflow encountered (mcisr=3)...shutting down
Total ecc errors encountered this boot=3

lines 270 - 872 look weird
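
To see how many ECC errors the dump contains and at which addresses, the log text could be searched for the ECC lines. Again a rough sketch only: the pattern is taken from the two ECC lines quoted above, the function name `ecc_errors` is hypothetical, and it assumes the log has been saved locally as text.

```python
import re

# Pattern based on the ECC lines quoted above. \s* after the comma
# lets a line-wrapped "syndrome=" field still match.
ECC_RE = re.compile(
    r"ECC Error: (Single|Multi)-Bit Read error from ATU, "
    r"addr=([0-9a-f]+),\s*syndrome=([0-9a-f]+)"
)


def ecc_errors(text):
    """Return (kind, address, syndrome) for each ECC error in the log text."""
    return [
        (kind, int(addr, 16), int(syndrome, 16))
        for kind, addr, syndrome in ECC_RE.findall(text)
    ]


# Hypothetical usage, assuming the dump was saved as tty.log:
# with open("tty.log") as fh:
#     for kind, addr, syndrome in ecc_errors(fh.read()):
#         print("%-6s addr=%08x syndrome=%02x" % (kind, addr, syndrome))
```

For the two lines quoted above, this gives addresses c6dfcde0 and c6dfcdf0, which are 16 bytes apart.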


After that we disabled almost all of the daemons, just to see whether
the system would crash without much read/write access to the disks. It
has been up and running since 2007/04/11 18:39, if you can call that
"running".


Can anyone tell me what the possible solutions are?
Or is it just plain faulty hardware that is acting up after a year in production with no problems whatsoever until this week?

Thanks in advance for the help.

Regards,
Maurice 


