Thermal issues with SC1435 servers??
HowardC at prpa.org
Mon Apr 19 14:05:52 CDT 2010
I assume you have ruled out some kind of marginal power
situation in the machine room?
> -----Original Message-----
> From: Cris Rhea [mailto:crhea at mayo.edu]
> Sent: Monday, April 19, 2010 12:56 PM
> To: linux-poweredge at dell.com
> Subject: Thermal issues with SC1435 servers??
> I have a bunch (~50) SC1435 servers as part of an HPC cluster.
> Over the last several weeks, I'll come to work in the morning to find some
> of them dead from either a "CPUx thermal tripped" or "CPUx voltage
> problem. I'll have to power them back on (or sometimes, unplug them before
> the power button will work) and view the SEL to see what happened (nothing
> in the Linux system logs). Once powered back on, they boot/run fine.
> I've had this happen across 9 different machines, so I'm thinking this
> is not just a simple case of flakey hardware.
> Running CentOS 5 as part of an HPC environment. The cluster jobs push
> the CPUs, so these machines run hot. These failures are getting old as
> they crash the jobs on them at the time of the BIOS-induced "power off".
> I've asked my technical/sales guy to look into this to see if there is
> perhaps a bad batch of boards, but he can't find anything.
> I emailed a ticket to Dell, but they want me to call their HPC group
> (not thrilled with the prospect of staying on the phone for hours while
> someone tells me to load/run "dset" on all my nodes...)
> Does this issue ring a bell with anybody?
> --- Cris
> Cristopher J. Rhea
> Mayo Clinic - Research Computing Facility
> 200 First St SW, Rochester, MN 55905
> crhea at Mayo.EDU
> (507) 284-0587