PowerEdge 1750 SMP instability on CentOS 4.1

Peter Kjellström cap at nsc.liu.se
Tue Jul 5 02:44:55 CDT 2005


Hello,

I'd bet my .02 euros on a bad 2nd CPU, especially since you see this:
 System software event - CPU Internal Error detected

But if you want one more test that won't cost you too much time, you could
always try booting Knoppix, which is based on Debian (download, burn, boot).
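
Once the live CD is up, it is worth confirming that the kernel it booted
actually sees both CPUs, otherwise the test says nothing about SMP. A minimal
sketch, assuming Python is available on the CD and that /proc/cpuinfo has the
usual "key : value" layout; the function names parse_cpuinfo and steppings are
made up for this example:

```python
def parse_cpuinfo(text):
    """Return one dict of fields per logical CPU from /proc/cpuinfo-style text."""
    cpus = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank or malformed line
        key = key.strip()
        if key == "processor":
            cpus.append({})  # a "processor" line starts a new CPU entry
        if cpus:
            cpus[-1][key] = value.strip()
    return cpus


def steppings(cpus):
    """Return the set of distinct 'stepping' values across all CPUs."""
    return {c["stepping"] for c in cpus if "stepping" in c}


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        cpus = parse_cpuinfo(f.read())
    print("logical CPUs seen:", len(cpus))
    print("distinct steppings:", sorted(steppings(cpus)))
```

If only one logical CPU shows up, the live kernel is probably not SMP and a
stable run proves little. With Hyper-Threading enabled the count will be
higher than the number of sockets.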

/Peter

On Monday 04 July 2005 22.35, tman_dell at trejan.com wrote:
> I've got a 2.8GHz PowerEdge 1750 with 4GB RAM running RAID5 over 3 73GB
> Maxtor drives plus the RAC card installed.  It's running with the latest
> firmware for the BIOS, ESM, backplane, SDR, PERC and RAC.  The OS is
> CentOS 4.1 with all the latest updates.  I've installed OSMA and it
> works perfectly.
>
> My problem is that if I use an SMP kernel (2.6.9-11.ELsmp) then it is
> highly unstable.  If you don't actually do anything then it is usually
> fine but once you actually use the system then it will randomly crash.
> The time period from boot up to a crash is totally random.  I've had it
> crash anywhere from just after bootup to 1-2 hours after bootup.
>
> It kernel panics and just dies.  I can't get the whole panic written
> down because it scrolls off the top of the RAC remote console before I
> get a chance to look at it.
>
> Both CPUs have the same stepping from what I can tell and they're
> exactly the same speed.
>
> The bits of the panic I can capture are:
>
> Code:  Bad EIP value.
>  <0>Fatal exception: panic in 5 seconds
> Unable to handle kernel NULL pointer dereference at virtual address
> 00000070  printing eip: c011a54d *pde = 033e2001
> Oops: 0000 [#2]
> SMP
> Modules linked in: i2c_dev i2c_core md5 ipv6 ppp_async ppp_generic slhc
> crc-ccitt dcdipm(U) dcdbas(U) ipt_REJECT ipt_state ip_conntrack
> iptable_filter ip_tables button battery ac ohci_hcd tg3 floppy sg ext3
> jbd dm_mod megaraid_mbox megaraid_mm sd_mod scsi_mod
> CPU:    5
> EIP     0060:[<c011a54d>]    Tainted: P      VLI
> EFLAGS: 00010046   (2.6.9-11.ELsmp)
> EIP is at do_page_fault+0xa0/0x5b6
> eax: f7e01000   ebx: c3318d80  ecx: f7e01054   edx: f6e01110
> esi: 00000000   edi: c011a4ad  ebp: 00000000   esp: f7e01044
> ds: 007b   es: 007b    ss: 0068
> Process  (pid: 0, threadinfo=f7e00000 task=c30fe100)
> Stack: 00000000 00000070 00000000 00000000 f6e01110 c02d85f2 00000000
> 0000000e
>        0000000b 00000000 00000000 00000000 00000000 00000000 00030001
> 00000000
>        00000000 00000000 00000000 00000000 00000000 00000000 00000000
>
> If I use a UP kernel (2.6.9-11.EL) then it is stable and no amount of
> CPU or IO activity will cause it to crash.  2.6.9-5.0.3.ELsmp also
> crashes with the same symptoms.
>
> On one of the crashes, the SEL had one line logged to it.  All other
> crashes haven't produced anything.
>
> 00700 01-jul-2005 21:41:58 System software event - CPU Internal Error
> detected
>
> Anybody got any ideas?  Is this a known problem or do I have a bad CPU?
>
>  - Trevor

-- 
------------------------------------------------------------
  Peter Kjellström               |
  National Supercomputer Centre  |
  Sweden                         | http://www.nsc.liu.se


More information about the Linux-PowerEdge mailing list