[Linux-PowerEdge] C6145 ECC Error, how to find bad DIMM?
john.hanks at usu.edu
Thu Jan 31 11:32:23 CST 2013
In the web GUI of the BMC I found a table of SMI Handler events which
does list the DIMM slot along with the error, but like you I am drawing a
blank on how to pull this information from IPMI. The 0x60 turned out to
be (I think) the number of the overall memory sensor that reports any
memory errors. And when I look at individual events I see the same thing
you have documented. In case it helps, for me the event data maps as:
a10002 -> CPU A, DIMM 1
a10000 -> CPU A, DIMM 2
a10300 -> CPU D, DIMM 6
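
For anyone who wants to script this in the meantime, here is a minimal
sketch of a lookup table built from the three observations above. The
names (OBSERVED_C6145, locate_dimm, describe) are made up for illustration,
and the table is purely empirical for this one C6145; it is not a documented
encoding and should not be expected to hold on other models.

```python
# Empirical mapping seen on one C6145, taken from the BMC web GUI's
# SMI Handler event table. Assumption: other chassis/BIOS versions differ.
OBSERVED_C6145 = {
    "a10002": ("A", 1),
    "a10000": ("A", 2),
    "a10300": ("D", 6),
}


def locate_dimm(event_data, table=OBSERVED_C6145):
    """Return (cpu, dimm) for event data seen before, or None if unknown."""
    return table.get(event_data.strip().lower())


def describe(event_data):
    """Format a one-line description of where the event data points."""
    hit = locate_dimm(event_data)
    if hit is None:
        return f"{event_data}: no observation yet, check the SMI Handler event log"
    cpu, dimm = hit
    return f"{event_data}: CPU {cpu}, DIMM {dimm}"
```

Unknown event data deliberately returns None rather than a guess, since
extrapolating a bit layout from three data points would be unreliable.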
As a first troubleshooting step I reseated these DIMMs and rotated
everything one slot, so if they really are bad and I see additional errors,
I'll have more data points. This is still less than satisfying as a way
to pull such information from IPMI; something that doesn't require
empirically testing each hardware phenotype would be much better.
But for the short term finding the SMI Handler event log solves my
immediate problem of locating the bad DIMM(s) without having to test
potentially all 32.
Also, Vincent's suggestion to look at edac-utils was a great pointer to a
really useful-looking tool.
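
As a rough illustration of the kind of data edac-utils works from, the
sketch below reads the per-csrow corrected (ce_count) and uncorrected
(ue_count) error counters that the kernel EDAC drivers expose in sysfs.
It assumes an EDAC driver for the platform is loaded and that the csrow
layout is present under /sys/devices/system/edac/mc; the function name
read_edac_counts is made up here. Mapping a csrow back to a physical slot
is still board-specific.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {"mcN/csrowM": {"ce_count": int|None, "ue_count": int|None}}."""
    counts = {}
    for csrow in sorted(Path(root).glob("mc*/csrow*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((csrow / name).read_text().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable on this kernel/driver.
                entry[name] = None
        counts[f"{csrow.parent.name}/{csrow.name}"] = entry
    return counts
```

Calling read_edac_counts() and printing only the entries with a nonzero
count gives a quick view of which memory controller and csrow are logging
errors.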
On Tue, Jan 29, 2013 at 4:10 AM, Sven Ulland <sveniu at opera.com> wrote:
> On 01/28/2013 10:15 PM, John Hanks wrote:
> > 77 | 01/28/2013 | 13:03:37 | Memory #0x60 | Uncorrectable ECC | Asserted
> > Does anyone know how I can map #0x60 back to a specific DIMM slot or
> > even to a specific bank/CPU? I'm really not looking forward to
> > searching through 32 DIMMs, swapping them one at a time and waiting
> > to see if I get another ECC error.
> I asked something similar here:
> ..pointing to this:
> While I say that something about event data 'a19001' can be mapped to
> point to DIMM_A1, this turns out to be an oversimplification or
> simply a lucky shot. I don't really have a clear answer.
> Here I've correlated the ipmi sel raw event data (as described in
> Fred's mail) with Dell's iDRAC6 system event log (web UI) on some M610
> blades. As you can see, it does not really match up (Fred's prediction
> in parentheses):
> Event data a19001 => DIMM_A1 (group A slot 1. OK)
> Event data a19002 => DIMM_A2 (group A slot 2. OK)
> Event data a29101 => DIMM_B3 (group B slot 1. ?)
> Event data a19101 => DIMM_B3 (group B slot 1. ?)
> Event data a19040 => DIMM_B1 (group A slot 0. ?)
> Event data a19108 => DIMM_B6 (group B slot 8. ?)
> Event data a19102 => DIMM_B4 (group B slot 2. ?)
> I wish Dell could bring some clarity to this, as they can clearly do
> the mapping perfectly fine.