Thermal issues with SC1435 servers??

Cris Rhea crhea at mayo.edu
Mon Apr 19 13:56:21 CDT 2010


I have a bunch (~50) SC1435 servers as part of an HPC cluster.

Over the last several weeks, I've been coming to work in the morning to find
one of them dead from either a "CPUx thermal tripped" or "CPUx voltage sensor"
problem.  I have to power the node back on (sometimes unplugging it first,
before the power button will respond) and view the SEL to see what happened;
nothing shows up in the Linux system logs.  Once powered back on, the nodes
boot and run normally.

I've had this happen across 9 different machines, so I'm thinking this
is not just a simple case of flaky hardware.

The nodes run CentOS 5. The cluster jobs push the CPUs hard, so these
machines run hot. These failures are getting old, as they crash whatever
jobs are running on a node at the time of the BIOS-induced "power off".

I've asked my technical/sales guy to look into whether there was perhaps
a bad batch of boards, but he can't find anything.
I emailed a ticket to Dell, but they want me to call their HPC group
(not thrilled with the prospect of staying on the phone for hours while
someone tells me to load/run "dset" on all my nodes...).

Does this issue ring a bell with anybody? 

--- Cris

-- 
 Cristopher J. Rhea                     
 Mayo Clinic - Research Computing Facility
 200 First St SW, Rochester, MN 55905
 crhea at Mayo.EDU
 (507) 284-0587
