[Linux-PowerEdge] T7820 one bad DIMM among 8: which one?

Tru Huynh tru at pasteur.fr
Tue Feb 12 15:49:57 CST 2019


Hello

One of our T7820 machines, running CentOS 7 x86_64 (kernel 3.10.0-957.5.1.el7.x86_64)
with the latest BIOS, 1.9.2 (01/24/2019), is logging:

dmesg:
[15108.602969] mce: [Hardware Error]: Machine check events logged
[15108.603012] EDAC skx MC3: HANDLING MCE MEMORY ERROR
[15108.603015] EDAC skx MC3: CPU 8: Machine Check Event: 0 Bank 18: 8c000040000800c2
[15108.603016] EDAC skx MC3: TSC 0 
[15108.603018] EDAC skx MC3: ADDR 134fef57c0 
[15108.603019] EDAC skx MC3: MISC 900040004000086 
[15108.603021] EDAC skx MC3: PROCESSOR 0:50654 TIME 1549992804 SOCKET 1 APIC 10
[15108.603030] EDAC MC3: 1 CE memory scrubbing error on CPU_SrcID#1_MC#1_Chan#2_DIMM#0 (channel:2 slot:0 page:0x134fef5 offset:0x7c0 grain:32 syndrome:0x0 -  err_code:0008:00c2 socket:1 imc:1 rank:0 bg:1 ba:3 row:29ff col:358)

/var/log/messages:
Feb 12 18:33:24 ibet kernel: mce: [Hardware Error]: Machine check events logged
Feb 12 18:33:24 ibet kernel: EDAC MC3: 1 CE memory scrubbing error on CPU_SrcID#1_MC#1_Chan#2_DIMM#0 (channel:2 slot:0 page:0x134fef5 offset:0x7c0 grain:32 syndrome:0x0 -  err_code:0008:00c2 socket:1 imc:1 rank:0 bg:1 ba:3 row:29ff col:358)
Feb 12 18:33:24 ibet mcelog: Hardware event. This is not a software error.
Feb 12 18:33:24 ibet mcelog: MCE 0
Feb 12 18:33:24 ibet mcelog: CPU 8 BANK 18
Feb 12 18:33:24 ibet mcelog: MISC 900040004000086 ADDR 134fef57c0
Feb 12 18:33:24 ibet mcelog: TIME 1549992804 Tue Feb 12 18:33:24 2019
Feb 12 18:33:24 ibet mcelog: MCG status:
Feb 12 18:33:24 ibet mcelog: MCi status:
Feb 12 18:33:24 ibet mcelog: Corrected error
Feb 12 18:33:24 ibet mcelog: MCi_MISC register valid
Feb 12 18:33:24 ibet mcelog: MCi_ADDR register valid
Feb 12 18:33:24 ibet mcelog: MCA: MEMORY CONTROLLER MS_CHANNEL2_ERR
Feb 12 18:33:24 ibet mcelog: Transaction: Memory scrubbing error
Feb 12 18:33:24 ibet mcelog: MemCtrl: Corrected patrol scrub error
Feb 12 18:33:24 ibet mcelog: STATUS 8c000040000800c2 MCGSTATUS 0
Feb 12 18:33:24 ibet mcelog: MCGCAP 7000c14 APICID 10 SOCKETID 1
Feb 12 18:33:24 ibet mcelog: PPIN fdf60614f277367e
Feb 12 18:33:24 ibet mcelog: MICROCODE 200004d
Feb 12 18:33:24 ibet mcelog: CPUID Vendor Intel Family 6 Model 85

The Dell embedded basic diagnostic tests (F12 on boot) do not show any errors, but that is expected
since the error is corrected, as stated by "MemCtrl: Corrected patrol scrub error".

The error doesn't show up immediately after boot; this time it occurred ~2h after a cold boot.

There are 8x 16GB DIMMs in that machine, and Dell support is only willing to ship one stick
and let me find out which one is unhealthy... Is there a tool available to identify the bad one?

dmidecode lets me identify DIMM[1-6]_CPU[0-1], but which one is "CPU 8 BANK 18"?
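
The EDAC line in the log above already carries a location label
(CPU_SrcID#1_MC#1_Chan#2_DIMM#0), so one way to keep track of repeated events is
to pull the socket, memory controller, channel and DIMM numbers out of each such
line. Below is a minimal sketch, assuming the label format shown in this log;
the function name parse_edac_location is made up for illustration. It does not
map the result to a physical slot name such as the ones dmidecode reports; that
would still need the board's slot layout.

```python
import re

# Sample line copied from the dmesg output above (timestamp stripped).
LINE = (
    "EDAC MC3: 1 CE memory scrubbing error on "
    "CPU_SrcID#1_MC#1_Chan#2_DIMM#0 (channel:2 slot:0 page:0x134fef5 "
    "offset:0x7c0 grain:32 syndrome:0x0 -  err_code:0008:00c2 socket:1 "
    "imc:1 rank:0 bg:1 ba:3 row:29ff col:358)"
)

# Assumption: the label always has the form seen in this log.
LABEL_RE = re.compile(r"CPU_SrcID#(\d+)_MC#(\d+)_Chan#(\d+)_DIMM#(\d+)")


def parse_edac_location(line):
    """Return socket/imc/channel/dimm from an EDAC label, or None if absent."""
    match = LABEL_RE.search(line)
    if match is None:
        return None
    socket, imc, channel, dimm = (int(group) for group in match.groups())
    return {"socket": socket, "imc": imc, "channel": channel, "dimm": dimm}


if __name__ == "__main__":
    print(parse_edac_location(LINE))
```

For the line logged here this gives socket 1, memory controller 1, channel 2,
DIMM 0, which agrees with the "socket:1 imc:1" and "channel:2 slot:0" fields
in the same message.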

Cheers

Tru

-- 
Dr Tru Huynh | mailto:tru at pasteur.fr | tel +33 1 45 68 87 37
https://research.pasteur.fr/en/team/structural-bioinformatics/
Institut Pasteur, 25-28 rue du Docteur Roux, 75724 Paris CEDEX 15 France  


