Thermal issues with SC1435 servers??

Howard, Chris HowardC at
Mon Apr 19 14:05:52 CDT 2010

I assume you have ruled out some kind of marginal power
situation in the machine room?
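
One rough way to test the power hypothesis, assuming the SEL timestamp of each shutdown has been copied by hand into a list and that the node-to-circuit mapping is known: count the events per circuit and look for shutdowns on different nodes that land close together in time. The host names, timestamps, and PDU labels below are made up for illustration, and the helper names are hypothetical; this is a sketch, not a diagnostic tool.

```python
from collections import defaultdict
from datetime import datetime


def group_by_circuit(events, circuit_of):
    """Count shutdown events per power circuit."""
    counts = defaultdict(int)
    for host, _when in events:
        counts[circuit_of.get(host, "unknown")] += 1
    return dict(counts)


def coincident_events(events, window_minutes=10):
    """Return pairs of different hosts whose shutdowns fall within the window."""
    ordered = sorted(events, key=lambda e: e[1])
    pairs = []
    for i, (host_a, t_a) in enumerate(ordered):
        for host_b, t_b in ordered[i + 1:]:
            # The list is sorted by time, so later events are only further away.
            if (t_b - t_a).total_seconds() > window_minutes * 60:
                break
            if host_a != host_b:
                pairs.append((host_a, host_b))
    return pairs


# Hypothetical data: (host, time of shutdown as recorded in the SEL).
events = [
    ("node07", datetime(2010, 4, 12, 3, 10)),
    ("node21", datetime(2010, 4, 12, 3, 14)),
    ("node33", datetime(2010, 4, 15, 2, 0)),
]
# Hypothetical mapping of nodes to the power circuit feeding them.
circuit_of = {"node07": "PDU-A", "node21": "PDU-A", "node33": "PDU-B"}

print(group_by_circuit(events, circuit_of))
print(coincident_events(events))
```

If most events pile up on one circuit, or several nodes drop within minutes of each other, that points toward power rather than nine separately bad boards. Events spread evenly across circuits and times would make a power problem less likely.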

> -----Original Message-----
> From: Cris Rhea [mailto:crhea at]
> Sent: Monday, April 19, 2010 12:56 PM
> To: linux-poweredge at
> Subject: Thermal issues with SC1435 servers??
>
> I have a bunch (~50) of SC1435 servers as part of an HPC cluster.
>
> Over the last several weeks, I'll come to work in the morning to find
> one of them dead from either a "CPUx thermal tripped" or "CPUx voltage
> sensor" problem. I'll have to power them back on (or sometimes unplug
> them before the power button will work) and view the SEL to see what
> happened (nothing in the Linux system logs). Once powered back on,
> they boot/run normally.
>
> I've had this happen across 9 different machines, so I'm thinking this
> is not just a simple case of flaky hardware.
>
> Running CentOS 5 as part of an HPC environment. The cluster jobs push
> the CPUs, so these machines run hot. These failures are getting old as
> they crash the jobs on them at the time of the BIOS-induced
> "power off".
>
> I've asked my technical/sales guy to look into this to see if there is
> perhaps a bad batch of boards, but he can't find anything.
>
> I emailed a ticket to Dell, but they want me to call their HPC group
> (not thrilled with the prospect of staying on the phone for hours
> while someone tells me to load/run "dset" on all my nodes...).
>
> Does this issue ring a bell with anybody?
>
> --- Cris
> --
>  Cristopher J. Rhea
>  Mayo Clinic - Research Computing Facility
>  200 First St SW, Rochester, MN 55905
>  crhea at Mayo.EDU
>  (507) 284-0587
