PowerEdge 1950 PCI Errors

Benjamin_Gordy at Dell.com
Wed Oct 20 11:12:47 CDT 2010


-----Original Message-----
From: linux-poweredge-bounces-Lists On Behalf Of Mark Watts
Sent: Wednesday, October 20, 2010 9:38 AM
To: linux-poweredge-Lists
Subject: PowerEdge 1950 PCI Errors



I have a PCI-X Intel PRO/1000 MT Quad Port Server Adapter in Slot 2 on a
PowerEdge 1950. OS is CentOS 5.4.

Shortly after enabling one of the ports for use on a 100Mbit network,
NFS data transfer across that link stalled.
All traffic through this interface seems to have ceased - even pings to
machines that were previously pingable are timing out.


The following log entries are observed through OMSA:

Status: OK              Wed Oct 20 12:01:51 2010        Err Reg Pointer:
Link Tuning sensor, OEM Diagnostic data event was asserted

Status: Critical        Wed Oct 20 12:01:51 2010        PCIE Fatal Err:
Critical Event sensor, bus fatal error (Bus 0 Device 2 Function 0) was
asserted

Status: Critical        Wed Oct 20 12:01:51 2010        PCI Parity Err:
Critical Event sensor, PCI PERR (Slot 2) was asserted


Similarly, the following errors are seen in dmesg/syslog:

Uhhuh. NMI received for unknown reason 30 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue

EDAC MC0: UE row 0, channel-a= 0 channel-b= 1 labels "-": (Branch=0
DRAM-Bank=0 RDWR=Read RAS=0 CAS=0, UE Err=0x20 (Non-Aliased
Uncorrectable Non-Mirrored Demand Data ECC))

EDAC MC0: UE row 0, channel-a= 0 channel-b= 1 labels "-": (Branch=0
DRAM-Bank=0 RDWR=Read RAS=0 CAS=0, UE Err=0x100 (Non-Aliased
Uncorrectable Patrol Data ECC))


Can anyone enlighten me as to what's happened here?
Do I have bad RAM, a bad Quad-Card, both or neither?

Cheers,

Mark.

-- 
Mark Watts BSc RHCE MBCS
Senior Systems Engineer, IPR Secure Managed Hosting
www.QinetiQ.com
QinetiQ - Delivering customer-focused solutions
GPG Key: http://www.linux-corner.info/mwatts.gpg


Hi Mark,

Try reseating the PCI NIC or removing it for testing purposes.  Make sure your firmware is up to date.  The EDAC error message is not related to the PCIE Fatal Err message.  EDAC should be disabled.  

The blacklist on RHEL5 should be under /etc/modprobe.d/; alternatively, you may add the following to /etc/modprobe.conf:
    alias   i5000_edac   /dev/null
    alias   edac_mc   /dev/null
    options edac_mc panic_on_ue=0
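
As a rough sketch only, the three lines above could be added without creating duplicates on repeated runs. The function names append_once and disable_edac_at_boot are made up for illustration, and the default path is the /etc/modprobe.conf mentioned above; run as root and check the file afterwards.

```shell
#!/bin/sh
# Hypothetical helper: append a line to a file only if that exact
# line is not already present in it.
append_once() {
    file=$1
    line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Hypothetical helper: add the three settings from this message to the
# given modprobe config file (defaults to /etc/modprobe.conf).
disable_edac_at_boot() {
    conf=${1:-/etc/modprobe.conf}
    append_once "$conf" 'alias   i5000_edac   /dev/null'
    append_once "$conf" 'alias   edac_mc   /dev/null'
    append_once "$conf" 'options edac_mc panic_on_ue=0'
}
```

Calling 'disable_edac_at_boot' with no argument edits /etc/modprobe.conf; passing a different path first is a safe way to try it on a scratch file.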
 
 
For a system already running with the edac module loaded:
- run 'lsmod | grep -i edac'; it should list 'i5000_edac' and 'edac_mc'
- run 'modprobe -r <modules>', where <modules> are the edac modules listed by the lsmod command
- once the modules have been removed from the kernel, edac is disabled (for this boot only)
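
The steps above could be scripted roughly as follows. This is a sketch, not a tested Dell procedure: the function names and the DRY_RUN variable are made up for illustration, and it assumes the usual lsmod output with the module name in the first column.

```shell
#!/bin/sh
# Hypothetical helper: print the names of modules containing "edac"
# from lsmod-style input on stdin (module name is the first column).
list_edac_modules() {
    awk 'tolower($1) ~ /edac/ { print $1 }'
}

# Hypothetical helper: unload every loaded EDAC module (run as root).
# Setting DRY_RUN=1 (an assumption of this sketch) only prints the commands.
unload_edac_modules() {
    for mod in $(lsmod | list_edac_modules); do
        # An earlier removal may already have unloaded this module
        # as an unused dependency, so re-check before removing it.
        grep -q "^$mod " /proc/modules || continue
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "modprobe -r $mod"
        else
            modprobe -r "$mod"
        fi
    done
}
```

Running 'DRY_RUN=1 unload_edac_modules' first shows what would be removed; afterwards 'lsmod | grep -i edac' should print nothing.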
 
  
EDAC is a kernel-level driver that talks directly to the chipset, reads its error registers, and dumps out the raw register values.  These registers are read-once: when EDAC accesses them they are cleared, so the Dell ESM never collects or logs the information.  Without that information, the Dell ESM will never raise any [LCD -- hardware level] alerts when a warning or failure threshold is reached.  Also, there are no 'screens' available that clearly identify the component logged by EDAC, whereas the Dell ESM can already log and identify a "problematic" component.  Additionally, EDAC is primarily an ECC memory reporting module, so things like fans, temperatures, voltages, ... will not necessarily be caught and properly reported by EDAC (although it does report on some PCI bus parity events).
 
EDAC was designed primarily for systems that do NOT have an "event managing" BIOS/BMC pair (as Dell servers do) but whose chipset can still report errors such as SBE.  At this time EDAC is not supported or validated by Dell, so it is recommended NOT to use EDAC.
 
More information on EDAC:
edac.txt
http://lwn.net/Articles/168975/
 
EDAC Project
http://bluesmoke.sourceforge.net/



Ben Gordy



