[Linux-PowerEdge] T7820 one bad DIMM among 8: which one?

Mauricio Tavares raubvogel at gmail.com
Tue Feb 12 22:38:18 CST 2019



On Tue, Feb 12, 2019 at 4:50 PM Tru Huynh <tru at pasteur.fr> wrote:
>
>
> Hello
>
> One of our T7820 running CentOS-7 x86_64 3.10.0-957.5.1.el7.x86_64
> latest bios 1.9.2 (01/24/2019) is logging:
>
> dmesg:
> [15108.602969] mce: [Hardware Error]: Machine check events logged
> [15108.603012] EDAC skx MC3: HANDLING MCE MEMORY ERROR
> [15108.603015] EDAC skx MC3: CPU 8: Machine Check Event: 0 Bank 18: 8c000040000800c2
> [15108.603016] EDAC skx MC3: TSC 0
> [15108.603018] EDAC skx MC3: ADDR 134fef57c0
> [15108.603019] EDAC skx MC3: MISC 900040004000086
> [15108.603021] EDAC skx MC3: PROCESSOR 0:50654 TIME 1549992804 SOCKET 1 APIC 10
> [15108.603030] EDAC MC3: 1 CE memory scrubbing error on CPU_SrcID#1_MC#1_Chan#2_DIMM#0 (channel:2 slot:0 page:0x134fef5 offset:0x7c0 grain:32 syndrome:0x0 -  err_code:0008:00c2 socket:1 imc:1 rank:0 bg:1 ba:3 row:29ff col:358)
>
> /var/log/messages:
> Feb 12 18:33:24 ibet kernel: mce: [Hardware Error]: Machine check events logged
> Feb 12 18:33:24 ibet kernel: EDAC MC3: 1 CE memory scrubbing error on CPU_SrcID#1_MC#1_Chan#2_DIMM#0 (channel:2 slot:0 page:0x134fef5 offset:0x7c0 grain:32 syndrome:0x0 -  err_code:0008:00c2 socket:1 imc:1 rank:0 bg:1 ba:3 row:29ff col:358)
> Feb 12 18:33:24 ibet mcelog: Hardware event. This is not a software error.
> Feb 12 18:33:24 ibet mcelog: MCE 0
> Feb 12 18:33:24 ibet mcelog: CPU 8 BANK 18
> Feb 12 18:33:24 ibet mcelog: MISC 900040004000086 ADDR 134fef57c0
> Feb 12 18:33:24 ibet mcelog: TIME 1549992804 Tue Feb 12 18:33:24 2019
> Feb 12 18:33:24 ibet mcelog: MCG status:
> Feb 12 18:33:24 ibet mcelog: MCi status:
> Feb 12 18:33:24 ibet mcelog: Corrected error
> Feb 12 18:33:24 ibet mcelog: MCi_MISC register valid
> Feb 12 18:33:24 ibet mcelog: MCi_ADDR register valid
> Feb 12 18:33:24 ibet mcelog: MCA: MEMORY CONTROLLER MS_CHANNEL2_ERR
> Feb 12 18:33:24 ibet mcelog: Transaction: Memory scrubbing error
> Feb 12 18:33:24 ibet mcelog: MemCtrl: Corrected patrol scrub error
> Feb 12 18:33:24 ibet mcelog: STATUS 8c000040000800c2 MCGSTATUS 0
> Feb 12 18:33:24 ibet mcelog: MCGCAP 7000c14 APICID 10 SOCKETID 1
> Feb 12 18:33:24 ibet mcelog: PPIN fdf60614f277367e
> Feb 12 18:33:24 ibet mcelog: MICROCODE 200004d
> Feb 12 18:33:24 ibet mcelog: CPUID Vendor Intel Family 6 Model 85
>
> The Dell embedded basic diagnostic tests (F12 on boot) do not show any errors, but that is expected
> since the issue is corrected, as stated by "MemCtrl: Corrected patrol scrub error".
>
> The error doesn't show immediately after boot; this time it occurred ~2h after a cold boot.
>
> There are 8x 16GB DIMMS on that machine, and Dell support is only willing to ship one stick
> and let me find which one is unhealthy... Is there a tool available to identify the bad one?
>
> dmidecode lets me identify DIMM[1-6]_CPU[0-1], but which one is "CPU 8 BANK 18"?
>
> Cheers
>
> Tru
>
> --
> Dr Tru Huynh | mailto:tru at pasteur.fr | tel +33 1 45 68 87 37
> https://research.pasteur.fr/en/team/structural-bioinformatics/
> Institut Pasteur, 25-28 rue du Docteur Roux, 75724 Paris CEDEX 15 France
>
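The EDAC line keeps saying "channel:2 slot:0" (and "Chan#2_DIMM#0" in
the label), along with "socket:1 imc:1", so you can associate that with
the dmidecode output. The "CPU 8 BANK 18" part is the machine-check
register bank that reported the event, not a memory slot.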
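
If you want to pull those fields out of the log without squinting,
here is a minimal Python sketch. It only assumes the "key:number"
layout seen in the EDAC line quoted above; the function name
edac_location is made up for illustration, and the script does not
know anything about how Dell labels the slots.

```python
import re

# One EDAC line copied from the dmesg output quoted above.
EDAC_LINE = (
    "EDAC MC3: 1 CE memory scrubbing error on "
    "CPU_SrcID#1_MC#1_Chan#2_DIMM#0 (channel:2 slot:0 page:0x134fef5 "
    "offset:0x7c0 grain:32 syndrome:0x0 -  err_code:0008:00c2 "
    "socket:1 imc:1 rank:0 bg:1 ba:3 row:29ff col:358)"
)

# Only the fields that describe where the DIMM sits; all are decimal here.
FIELD_RE = re.compile(r"\b(socket|imc|channel|slot|rank):(\d+)")


def edac_location(line):
    """Return the socket/imc/channel/slot/rank fields of an EDAC line as ints."""
    return {key: int(value) for key, value in FIELD_RE.findall(line)}


if __name__ == "__main__":
    loc = edac_location(EDAC_LINE)
    print(
        "socket {socket}, memory controller {imc}, "
        "channel {channel}, slot {slot}".format(**loc)
    )
```

You still have to map socket/imc/channel/slot to the names printed on
the board or reported by dmidecode by hand.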

Pop the cover open and take a look inside. Chances are you will see
a table listing all the memory slots and what they are called; it might
be on a sticker under the cover. What are they called? I mean,
channel:2 slot:0 sounds like the first memory slot on the third
channel (counting from 0), presumably on the second CPU (socket:1).
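
The kernel's EDAC driver also exposes per-DIMM information under
/sys/devices/system/edac/mc/. Below is a rough Python sketch that
lists each DIMM's label, location and corrected-error count. It
assumes your kernel provides the dimm_label, dimm_location and
dimm_ce_count files (older kernels may lack dimm_ce_count, in which
case the count is reported as 0); the function names are mine, not
part of any tool.

```python
import glob
import os


def read_text(path):
    """Return the stripped contents of a sysfs file, or '' if unreadable."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return ""


def dimm_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return a list of (label, location, corrected_error_count) per DIMM."""
    results = []
    pattern = os.path.join(edac_root, "mc*", "dimm*")
    for dimm_dir in sorted(glob.glob(pattern)):
        count = read_text(os.path.join(dimm_dir, "dimm_ce_count"))
        results.append((
            read_text(os.path.join(dimm_dir, "dimm_label")),
            read_text(os.path.join(dimm_dir, "dimm_location")),
            int(count) if count.isdigit() else 0,  # missing file counts as 0
        ))
    return results


if __name__ == "__main__":
    for label, location, count in dimm_ce_counts():
        print("{:40s} {:20s} CE={}".format(label, location, count))
```

A DIMM whose count keeps climbing between runs is the likely
candidate, but you still need the board's own slot names to know
which stick to pull.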

Have you asked the DRAC what's up?

Also, I have seen (older) machines which had an LED for each memory
slot. Bad memory would cause the LED to be sad.


> _______________________________________________
> Linux-PowerEdge mailing list
> Linux-PowerEdge at dell.com
> https://lists.us.dell.com/mailman/listinfo/linux-poweredge
