Re: Machine check exception (Debian Etch)
Morten P.D. Stevens
mstevens at win-professional.com
Wed Feb 4 07:08:03 CST 2009
> How sure are you that the above report indicates a *CPU* fault - those are, in my experience, *incredibly* rare.
I'm not 100% sure, but I think the mcelog message is clear enough.
To be certain, just run the Dell diagnostic utility and memtest.
http://support.dell.com/support/downloads/format.aspx?c=us&l=en&s=gen&deviceid=196&libid=13&releaseid=R197222&vercnt=3&formatcnt=0&SystemID=PWE_R905&servicetag=&os=WX64&osl=en&catid=-1&impid=-1
http://www.memtest.org/
Use the Dell download above to create bootable diagnostic media from either Windows or Linux.
After running the complete Dell diagnostics, you'll see whether it's a memory-related or CPU-related problem.
Best regards,
Morten Stevens
-----Original Message-----
From: linux-poweredge-bounces at dell.com [mailto:linux-poweredge-bounces at dell.com] On Behalf Of Dave Ewart
Sent: Wednesday, 4 February 2009 10:17
To: linux-poweredge at dell.com
Subject: Re: Machine check exception (Debian Etch)
On Tuesday, 03.02.2009 at 16:43 +0100, Morten P.D. Stevens wrote:
> > $ sudo mcelog --ascii --k8
> > MCE 0
> > HARDWARE ERROR. This is *NOT* a software problem!
> > Please contact your hardware vendor
> > CPU 3 BANK 4 TSC 32559b687fb05
> > MISC e00c0ffe01000000 ADDR 1019aa6cc4
> > STATUS 9c034480011c017b MCGSTATUS 0
> >
> > MCE 0
> > HARDWARE ERROR. This is *NOT* a software problem!
> > Please contact your hardware vendor
> > HARDWARE ERROR. This is *NOT* a software problem!
> > Please contact your hardware vendor
> > CPU 3 0 data cache TSC 32559b687fb05
> > MISC e00c0ffe01000000
> > Data cache ECC error (syndrome 6)
> > bit39 = res7
> > bit42 = res10
> > bit46 = corrected ecc error
> > bit59 = misc error valid
> > memory/cache error 'evict mem transaction, generic transaction, level generic'
> > STATUS 9c034480011c017b MCGSTATUS 0
>
> I think it's a CPU-related problem with the CPU data cache.
Well, I can read that it says "CPU ... data cache" too, but given that I've had other errors previously reported by CPUs which suggest *RAM* issues, I was wondering if anyone had more specific pointers.
How sure are you that the above report indicates a *CPU* fault - those are, in my experience, *incredibly* rare.
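One way to cross-check what mcelog printed is to decode the STATUS value by hand. The following is a minimal sketch, not mcelog's own code: the VAL/OVER/UC/EN/MISCV/ADDRV/PCC bit positions are the architectural MCi_STATUS flags, while the CECC/UECC bits and the syndrome field at bits 54:47 assume the AMD K8 layout implied by the --k8 switch. The function and table names are made up for illustration.

```python
# Sketch: decode a few fields of an MCi_STATUS value as printed by mcelog.
# Assumption: AMD K8 layout for CECC (bit 46), UECC (bit 45) and the
# ECC syndrome (bits 54:47); other CPU families may differ.

FLAG_BITS = {
    63: "VAL (error valid)",
    62: "OVER (overflow)",
    61: "UC (uncorrected)",
    60: "EN (reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
    46: "CECC (corrected ECC error)",
    45: "UECC (uncorrected ECC error)",
}


def decode_status(status):
    """Return the set flags, ECC syndrome and low 16-bit error code."""
    flags = [
        name
        for bit, name in sorted(FLAG_BITS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "syndrome": (status >> 47) & 0xFF,  # assumed K8 field position
        "error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    info = decode_status(0x9C034480011C017B)  # STATUS from the log above
    for flag in info["flags"]:
        print(flag)
    print("syndrome:", info["syndrome"])
    print("error code: 0x%04x" % info["error_code"])
```

Under these assumptions the STATUS above decodes to a valid, corrected (UC clear, CECC set) ECC error with syndrome 6, which agrees with the mcelog text. It says the error was seen and corrected in the data cache path; it does not by itself settle whether the CPU or the RAM is at fault.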
> If you check with cat /proc/cpuinfo, CPU 3 means the fourth core of the first physical CPU.
Yes, that's what the reports normally mean, but "CPU 3 0" could mean CPU number 3 (i.e. the fourth), core 0 (i.e. the first). How sure are you about the terminology used specifically in these MCE logs?
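To see how the kernel's logical CPU numbers map onto sockets and cores on a given box, the "processor", "physical id" and "core id" fields of /proc/cpuinfo can be tabulated. This is a rough sketch with a hypothetical helper name, and it assumes the number mcelog prints after "CPU" is the kernel's logical CPU number, which this thread does not confirm.

```python
# Sketch: map logical CPU numbers to (physical id, core id) using the
# text of /proc/cpuinfo. cpu_topology is a hypothetical helper name.

def cpu_topology(cpuinfo_text):
    """Return {logical cpu: (physical id, core id)}; -1 if a field is absent."""
    topology = {}
    # /proc/cpuinfo separates the per-CPU stanzas with a blank line.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" in fields:
            topology[int(fields["processor"])] = (
                int(fields.get("physical id", -1)),
                int(fields.get("core id", -1)),
            )
    return topology
```

On the affected machine, something like cpu_topology(open("/proc/cpuinfo").read())[3] would show which physical package and core the kernel calls CPU 3.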
Dave.
--
Dave Ewart
davee at ceu.ox.ac.uk
Computing Manager, Cancer Epidemiology Unit University of Oxford / Cancer Research UK
PGP: CC70 1883 BD92 E665 B840 118B 6E94 2CFD 694D E370 Get key from http://www.ceu.ox.ac.uk/~davee/davee-ceu-ox-ac-uk.asc
N 51.7518, W 1.2016