Re: Machine check exception (Debian Etch)

Morten P.D. Stevens mstevens at win-professional.com
Wed Feb 4 07:08:03 CST 2009


> How sure are you that the above report indicates a *CPU* fault - those are, in my experience, *incredibly* rare.

I'm not 100% sure, but I think the mcelog message is well-defined.
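
As a rough cross-check of that reading, here is a minimal Python sketch that decodes the STATUS value from the quoted mcelog output (9c034480011c017b). The flag positions (VAL, UC, MISCV and so on) follow the generic x86 machine-check status layout. The syndrome field at bits 54:47 and the helper names decode_status / decode_cache_error are my own assumptions for this K8 bank, not something taken from mcelog itself.

```python
# Sketch only: decode an MCi_STATUS value as printed by mcelog.
# Flag bit positions follow the generic x86 machine-check layout.
# ASSUMPTION: on this K8 bank the ECC syndrome sits in bits 54:47 and
# bit 46 is "corrected ECC error" (the quoted log names bit46 that way).

FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was NOT corrected by hardware
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MISC register valid
    58: "ADDRV",  # ADDR register valid
    57: "PCC",    # processor context corrupt
    46: "CECC",   # corrected ECC error (assumed K8-specific)
}

# Field encodings of a memory/cache error code: 0000 0001 RRRR TTLL
REQUEST = {0: "generic", 1: "read", 2: "write", 3: "data read",
           4: "data write", 5: "instruction fetch", 6: "prefetch",
           7: "evict", 8: "snoop"}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}


def decode_status(status):
    """Split a 64-bit status value into flags, syndrome and error code."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    return {
        "flags": flags,
        "uncorrected": bool((status >> 61) & 1),
        "syndrome": (status >> 47) & 0xFF,
        "error_code": status & 0xFFFF,
    }


def decode_cache_error(code):
    """Return (request, transaction, level), or None if not a cache error."""
    if (code >> 8) != 0x01:
        return None
    return (REQUEST.get((code >> 4) & 0xF, "unknown"),
            TRANSACTION.get((code >> 2) & 0x3, "unknown"),
            LEVEL.get(code & 0x3, "unknown"))


info = decode_status(0x9C034480011C017B)
print(info)
print(decode_cache_error(info["error_code"]))
```

If those assumptions hold, the value decodes to syndrome 6 with the UC bit clear, and the low 16 bits decode to an evict / generic / generic cache error. That agrees with what mcelog printed: a corrected ECC error, not an uncorrected one.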

To be certain, just run the Dell diagnostic utility and memtest:
http://support.dell.com/support/downloads/format.aspx?c=us&l=en&s=gen&deviceid=196&libid=13&releaseid=R197222&vercnt=3&formatcnt=0&SystemID=PWE_R905&servicetag=&os=WX64&osl=en&catid=-1&impid=-1
http://www.memtest.org/

Use the Dell download to create bootable diagnostic media under Windows or Linux.

After running the complete Dell diagnostics, you'll see whether it's a memory-related or a CPU-related problem.

Best regards,

Morten Stevens

-----Original Message-----
From: linux-poweredge-bounces at dell.com [mailto:linux-poweredge-bounces at dell.com] On Behalf Of Dave Ewart
Sent: Wednesday, 4 February 2009 10:17
To: linux-poweredge at dell.com
Subject: Re: Machine check exception (Debian Etch)

On Tuesday, 03.02.2009 at 16:43 +0100, Morten P.D. Stevens wrote:

> >     $ sudo mcelog --ascii --k8
> >      MCE 0
> >      HARDWARE ERROR. This is *NOT* a software problem!
> >      Please contact your hardware vendor
> >      CPU 3 BANK 4 TSC 32559b687fb05 
> >      MISC e00c0ffe01000000 ADDR 1019aa6cc4 
> >      STATUS 9c034480011c017b MCGSTATUS 0
> > 
> >      MCE 0
> >      HARDWARE ERROR. This is *NOT* a software problem!
> >      Please contact your hardware vendor
> >      HARDWARE ERROR. This is *NOT* a software problem!
> >      Please contact your hardware vendor
> >      CPU 3 0 data cache TSC 32559b687fb05
> >      MISC e00c0ffe01000000 
> >       Data cache ECC error (syndrome 6)
> >            bit39 = res7
> >            bit42 = res10
> >            bit46 = corrected ecc error
> >            bit59 = misc error valid
> >       memory/cache error 'evict mem transaction, generic transaction, level generic'
> >      STATUS 9c034480011c017b MCGSTATUS 0
>
> I think it's a CPU-related problem with the CPU data cache.

Well, I can read that it says "CPU ... data cache" too, but given that I've had other errors previously reported by CPUs which suggest *RAM* issues, I was wondering if anyone had more specific pointers.

How sure are you that the above report indicates a *CPU* fault - those are, in my experience, *incredibly* rare.

> When you run cat /proc/cpuinfo, CPU 3 means the first physical CPU, core 4.

Yes, that's what the reports normally mean, but "CPU 3 0" could mean CPU number 3 (i.e. the fourth), core 0 (i.e. the first).  How sure are you about the terminology used specifically in these MCE logs?
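
One way to check the numbering on the machine itself, rather than guess, is to compare each logical processor in /proc/cpuinfo against its "physical id" and "core id" fields. The following is a minimal Python sketch of that lookup. The function names cpu_topology and read_topology are made up for illustration, and it assumes the CPU number in the MCE log is the kernel's logical processor number.

```python
# Sketch only: map logical CPU numbers to (physical id, core id)
# using the "processor", "physical id" and "core id" fields of
# /proc/cpuinfo. Function names here are hypothetical.

def cpu_topology(cpuinfo_text):
    """Return {logical_cpu: (physical_id, core_id)} from cpuinfo text."""
    topology = {}
    # /proc/cpuinfo separates processors with a blank line.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" in fields:
            # Some kernels omit the id fields; those come back as None.
            topology[int(fields["processor"])] = (
                fields.get("physical id"),
                fields.get("core id"),
            )
    return topology


def read_topology(path="/proc/cpuinfo"):
    """Read the live file and parse it (Linux only)."""
    with open(path) as handle:
        return cpu_topology(handle.read())


# Usage on the affected machine:
#   for cpu, (package, core) in sorted(read_topology().items()):
#       print(f"logical CPU {cpu}: physical id {package}, core id {core}")
```

If logical CPU 3 shows physical id 0, then "CPU 3" is a core of the first package; if it shows a different physical id, the other reading applies.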

Dave.

--
Dave Ewart
davee at ceu.ox.ac.uk
Computing Manager, Cancer Epidemiology Unit University of Oxford / Cancer Research UK
PGP: CC70 1883 BD92 E665 B840 118B 6E94 2CFD 694D E370 Get key from http://www.ceu.ox.ac.uk/~davee/davee-ceu-ox-ac-uk.asc
N 51.7518, W 1.2016


