PE 1850 - LSI Logic MegaRAID (PERC 4e/Si) hw problems

Linux PowerEdge linux.poweredge at gmail.com
Sat Apr 14 08:11:13 CDT 2007


Maurice,

The ECC errors in your tty.log indicate that the PERC's cache DIMM is
bad.  If the system is still under warranty, you should call Dell
support with the Service Tag and tty.log in hand.  While replacing just
the DIMM should resolve it, because of the way the parts are sometimes
kitted, Support may end up replacing the entire RAID controller and its
components.

Depending on the level of support on the system (and the technician or
group you reach), you might initially get a little push-back because of
the OS you are running, but the tty.log is a direct history of the
controller's operation and is an OS-independent log.
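
If it helps to have the ECC entries pulled out of the log before you
call, here is a minimal Python sketch.  The regular expression is an
assumption based only on the ECC lines quoted from your log in this
thread; other firmware revisions may format these entries differently,
and the names find_ecc_errors, summarize and SAMPLE are made up for
illustration.

```python
import re

# Pattern inferred from the ECC lines quoted in this thread (assumption);
# it is not taken from any LSI or Dell documentation.
ECC_RE = re.compile(
    r"ECC Error:\s+(?P<kind>Multi-Bit|Single-Bit)\s+Read error from ATU,"
    r"\s+addr=(?P<addr>[0-9a-fA-F]+),\s+syndrome=(?P<syndrome>[0-9a-fA-F]+)"
    r"\s+\[bit=(?P<bit>\d+)\]"
)


def find_ecc_errors(log_text):
    """Return one dict per ECC error entry found in the log text."""
    # \s+ also matches newlines, so entries wrapped across lines still match.
    return [m.groupdict() for m in ECC_RE.finditer(log_text)]


def summarize(log_text):
    """Count multi-bit and single-bit ECC entries in the log text."""
    errors = find_ecc_errors(log_text)
    multi = sum(1 for e in errors if e["kind"] == "Multi-Bit")
    single = sum(1 for e in errors if e["kind"] == "Single-Bit")
    return {"multi_bit": multi, "single_bit": single, "total": len(errors)}


# Sample text copied from the lines quoted in this thread, unwrapped.
SAMPLE = """\
ECC Error: Multi-Bit Read error from ATU, addr=c6dfcde0, syndrome=66 [bit=255]
ECC Error: Single-Bit Read error from ATU, addr=c6dfcdf0, syndrome=c4 [bit=2]
Multi-bit or overflow encountered (mcisr=3)...shutting down
Total ecc errors encountered this boot=3
"""

print(summarize(SAMPLE))
```

To run it against a real dump, read the file into a string and pass it
to summarize().  Any multi-bit entry is worth mentioning to Support,
since the controller shuts its cache down when it hits one.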



On 4/12/07, maurice.croes at roxpro.be <maurice.croes at roxpro.be> wrote:
> Hi all,
>
> We've been using Dell PE's for a couple of years now, but lately we've
> been having problems with our PE 1850.
>
> It has been failing at random, even without a lot of disk activity.
>
> Some details,
>
> Server: PE 1850
> RAID controller: LSI Logic MegaRAID 522A, Dell PowerEdge Expandable RAID controller 4e/Si
> OS: Linux, Debian distribution, kernel 2.4.32, megaraid: v2.10.10.1
>
> The firmware used to be an older (default) version, but since we noticed that there was an urgent upgrade available, we flashed that.
>
> But we are still having some problems.
>
>
> As far as I can tell, before the flash, we noticed a lot of lines in the console looking like,
>
>  > megaraid: aborting-29762 cmd=2a <c=2 t=0 l=0>
>  > megaraid abort: 29762:21[255:128], fw owner
>
> and Linux was completely unusable (a lot of bus errors when executing commands, a partition that was remounted read-only, etc.)
>
> After the flash, we noticed the following lines in the console,
>
>  > I/O error, dev sda3, sector ..
>
> But with the same problems (bus errors, read-only filesystem, ..)
> It is possible that the system was still generating "abort" errors, but these were not visible in the console at the time we were in the datacenter.
>
>
> Besides running e2fsck -c on the partition, I was searching for more information.
>
>
> Now, I used the linttylog tool to dump a logfile,
>
> http://www.roxpro.be/dell/tty.log
>
>
> I know this looks bad, particularly,
>
> line 104: T12: rebuildResume checksum is bad - initializing NVRAM structure
>
> line 106: T12: RMW: NVRAM structure invalid - initializing
>
> line 147+: ECC Error: Multi-Bit Read error from ATU, addr=c6dfcde0, syndrome=66 [bit=255]
> ECC Error: Single-Bit Read error from ATU, addr=c6dfcdf0, syndrome=c4 [bit=2]
> Multi-bit or overflow encountered (mcisr=3)...shutting down
> Total ecc errors encountered this boot=3
>
> lines 270 - 872 look weird
>
>
> After that we disabled almost all of the daemons, etc., just to see if
> the system would crash without much read/write access to the disks, and
> it has been up & running since 2007/04/11 18:39.
> If you can call that "running".
>
>
> Can anyone let me know what the possible solutions are?
> Or is it just plain faulty hardware that is acting up after 1 year in production with no problems whatsoever until this week?
>
> Thanks in advance for the help.
>
> Regards,
> Maurice


