PE 1850 - LSI Logic MegaRAID (PERC 4e/Si) hw problems

Adam Williams awilliam at mdah.state.ms.us
Fri Apr 13 16:06:53 CDT 2007


From the log, it looks like either the system memory or the memory on the 
PERC card is bad.  Have you run memtest86 (memtest86.com) to check whether 
the system memory is bad?  Dell has a utility to check the card; on the Dell 
FTP site I think it's BR61001.exe, but I don't remember if it supports the 4e/Si.
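
For a quick tally of the ECC errors in the linttylog dump, a rough Python
sketch along these lines might help. It assumes the log lines look like the
"ECC Error: ... from ATU" lines quoted below in this thread. The function
name summarize_ecc and the regular expression are illustrative assumptions,
not part of any Dell or LSI tool, and the real tty.log layout may differ.

```python
import re
from collections import Counter

# Pattern modeled on the ECC lines quoted in this thread (an assumption;
# adjust it if the actual tty.log uses a different layout). \s+ also
# matches newlines, so entries wrapped across two lines are still found.
ECC_RE = re.compile(
    r"ECC Error:\s+(Single|Multi)-Bit\s+Read error from ATU,\s+"
    r"addr=([0-9a-fA-F]+),\s+syndrome=([0-9a-fA-F]+)\s+\[bit=(\d+)\]"
)


def summarize_ecc(log_text):
    """Return (counts per error kind, list of (kind, addr, syndrome, bit))."""
    events = [
        (
            m.group(1).lower(),   # "single" or "multi"
            int(m.group(2), 16),  # address, logged as hex
            int(m.group(3), 16),  # syndrome, logged as hex
            int(m.group(4)),      # bit number, logged as decimal
        )
        for m in ECC_RE.finditer(log_text)
    ]
    counts = Counter(kind for kind, _, _, _ in events)
    return counts, events
```

To use it, read the dumped file into a string (for example with
open("tty.log", errors="replace").read()) and pass it to summarize_ecc.
Any multi-bit entries in the result are the uncorrectable ones.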

maurice.croes at roxpro.be wrote:
> Hi all,
>
> We've been using Dell PEs for a couple of years now, but lately we've
> been having problems with our PE 1850.
>
> It fails at random, even without a lot of disk activity.
>
> Some details:
>
> Server: PE 1850
> RAID controller: LSI Logic MegaRAID 522A, Dell PowerEdge Expandable RAID controller 4e/Si
> OS: Linux (Debian), kernel 2.4.32, megaraid: v2.10.10.1
>
> The firmware used to be an older (default) version, but since we noticed that an urgent upgrade was available, we flashed it.
>
> But we are still having some problems.
>
>
> As far as I can tell, before the flash we noticed a lot of lines in the console like these:
>
>  > megaraid: aborting-29762 cmd=2a <c=2 t=0 l=0>
>  > megaraid abort: 29762:21[255:128], fw owner
>
> and Linux was completely unusable (a lot of bus errors when executing commands, a partition remounted read-only, etc.).
>
> After the flash, we noticed the following lines in the console:
>
>  > I/O error, dev sda3, sector ..
>
> but with the same problems (bus errors, read-only filesystem, ...).
> It is possible that the system was still generating "abort" errors, but they were not visible in the console while we were in the datacenter.
>
>
> Besides running e2fsck -c on the partition, I have been looking for more information.
>
>
> I used the linttylog tool to dump a log file:
>
> http://www.roxpro.be/dell/tty.log
>
>
> I know this looks bad, particularly:
>
> line 104: T12: rebuildResume checksum is bad - initializing NVRAM structure
>
> line 106: T12: RMW: NVRAM structure invalid - initializing
>
> line 147+: ECC Error: Multi-Bit Read error from ATU, addr=c6dfcde0,
> syndrome=66 [bit=255]
> ECC Error: Single-Bit Read error from ATU, addr=c6dfcdf0, syndrome=c4
> [bit=2]
> Multi-bit or overflow encountered (mcisr=3)...shutting down
> Total ecc errors encountered this boot=3
>
> lines 270 - 872 look weird
>
>
> After that we disabled almost all of the daemons, etc., just to see whether
> the system would crash without much read/write access to the disks, and
> it has been up and running since 2007/04/11 18:39.
> If you can call that "running".
>
>
> Can anyone let me know what the possible solutions are?
> Or is it just plain faulty hardware that's acting up after 1 year in production with no problems whatsoever until this week?
>
> Thanks in advance for the help.
>
> Regards,
> Maurice 
>


