[Linux-PowerEdge] C6145 ECC Error, how to find bad DIMM?

Sven Ulland sveniu at opera.com
Tue Jan 29 05:10:58 CST 2013


On 01/28/2013 10:15 PM, John Hanks wrote:
> 77 | 01/28/2013 | 13:03:37 | Memory #0x60 | Uncorrectable ECC | Asserted
>
> Does anyone know how I can map #0x60 back to a specific DIMM slot or
> even to a specific bank/CPU? I'm really not looking forward to
> searching through 32 DIMMs, swapping them one at a time and waiting
> to see if I get another ECC error.

I asked something similar here:
http://lists.us.dell.com/pipermail/linux-poweredge/2012-July/046605.html
..pointing to this:
http://lists.us.dell.com/pipermail/linux-poweredge/2006-October/027701.html

While I said there that event data 'a19001' can be mapped to
DIMM_A1, this turns out to be an oversimplification or simply a
lucky shot. I don't really have a clear answer.

Here I've correlated the ipmi sel raw event data (as described in
Fred's mail) with Dell's iDRAC6 system event log (web UI) on some M610
blades. As you can see, it does not really match up (Fred's prediction
in parentheses):

   Event data a19001 => DIMM_A1 (group A slot 1. OK)
   Event data a19002 => DIMM_A2 (group A slot 2. OK)
   Event data a29101 => DIMM_B3 (group B slot 1. ?)
   Event data a19101 => DIMM_B3 (group B slot 1. ?)
   Event data a19040 => DIMM_B1 (group A slot 0. ?)
   Event data a19108 => DIMM_B6 (group B slot 8. ?)
   Event data a19102 => DIMM_B4 (group B slot 2. ?)
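
To make the comparison easier to eyeball, here is a minimal Python
sketch that splits each event data value from the list above into its
three bytes and shows which bits are set in the last byte. The names
OBSERVED, split_event_data and set_bits are made up for illustration,
and reading the last byte as a bitmask is only an assumption on my
part. Nothing here is a documented Dell or IPMI decoding.

```python
# Pairs copied from the list above:
# (SEL event data, DIMM label shown in the iDRAC6 system event log).
OBSERVED = [
    ("a19001", "DIMM_A1"),
    ("a19002", "DIMM_A2"),
    ("a29101", "DIMM_B3"),
    ("a19101", "DIMM_B3"),
    ("a19040", "DIMM_B1"),
    ("a19108", "DIMM_B6"),
    ("a19102", "DIMM_B4"),
]


def split_event_data(hex_str):
    """Split a 3-byte event data hex string into three integer byte values."""
    raw = bytes.fromhex(hex_str)
    if len(raw) != 3:
        raise ValueError("expected exactly 3 bytes of event data")
    return tuple(raw)


def set_bits(value):
    """Return the 0-based positions of the bits set in a byte value."""
    return [i for i in range(8) if (value >> i) & 1]


# Print one row per observation so patterns (or the lack of them) stand out.
# Treating byte 3 as a bitmask is an assumption, not a known encoding.
for event_data, dimm in OBSERVED:
    b1, b2, b3 = split_event_data(event_data)
    print(f"{event_data}  byte1=0x{b1:02x}  byte2=0x{b2:02x}  "
          f"byte3=0x{b3:02x}  bits={set_bits(b3)}  ->  {dimm}")
```

This only lays the fields side by side. It does not tell you which
DIMM a new event points to.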

I wish Dell could bring some clarity to this, as they can clearly do
the mapping perfectly fine.

Sven


