Thermal issues with SC1435 servers??
crhea at mayo.edu
Mon Apr 19 15:55:34 CDT 2010
On Mon, Apr 19, 2010 at 02:53:21PM -0500, Wayne_Weilnau at Dell.com wrote:
> I've seen other HPC customers who have had thermal issues,
> especially with systems in the top of the rack. If you have any
> leakage of air from the hot aisle into the cold aisle, it would be
> possible the inlet (ambient) temperature for a system could be higher
> than you realize. I don't have a 1435 and don't have the specs in
> front of me, but I would think 71F is within the operating range of the
> system. If not, it is barely outside the operating range. I believe
> the 1435 has a Baseboard Management Controller (BMC) that records
> hardware events into the System Event Log (SEL). You should be able
> to view the SEL during POST by pressing CTRL-E. You can also view the
> SEL through IPMI Tool or OMSA. I would check the SEL for any
> events, especially for thermal sensors.
> Wayne Weilnau
> Systems Management Technologist
> Dell | OpenManage Software Development
The affected systems are spread from the bottom to the top of the rack.
Yes, our hot/cold aisle separation is a bit sloppy (we don't have
under-floor cold air), but I figured I'd see a pattern like the one you
suggest (e.g., failures concentrated at the top of the rack) if that
were the cause. The temperature reading comes from a low-tech
thermometer on the front door of the rack at eye level.
The only place I see these errors is in the SEL. After powering the
machines back up, I press CTRL-E during POST and look at the SEL. I get
simple messages like "CPUx thermal tripped asserted".
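For anyone who would rather pull these entries from a running system than
from the CTRL-E screen, here is a rough Python sketch that filters saved SEL
text for thermal-looking events. It assumes the pipe-separated layout that
`ipmitool sel list` typically prints. The sample lines, sensor numbers, and
the `thermal_events` helper are made up for illustration and are not taken
from the machines discussed here. Field layout can differ between BMC
firmware versions, so treat the parsing as an assumption to check.

```python
import re

# Hypothetical sample in the pipe-separated layout assumed for
# `ipmitool sel list` output; these are not real log entries.
SAMPLE_SEL = """\
1 | 04/19/2010 | 13:02:11 | Processor #0x61 | Thermal Trip | Asserted
2 | 04/19/2010 | 13:05:40 | Power Supply #0x64 | Presence detected | Asserted
3 | 04/19/2010 | 14:47:09 | Temperature #0x30 | Upper Critical going high | Asserted
"""

# Match "thermal", "thermtrip", "temperature", etc., case-insensitively.
THERMAL = re.compile(r"therm|temperature", re.IGNORECASE)


def thermal_events(sel_text):
    """Return (date, time, sensor, event) tuples for thermal-looking entries."""
    events = []
    for line in sel_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        _record_id, date, time, sensor, event = fields[:5]
        if THERMAL.search(sensor) or THERMAL.search(event):
            events.append((date, time, sensor, event))
    return events


if __name__ == "__main__":
    for entry in thermal_events(SAMPLE_SEL):
        print(" ".join(entry))
```

Saving the SEL from each node to a file and running it through a filter like
this makes it easier to compare timestamps and sensors across machines, which
may help show whether the trips cluster by rack position or by time of day.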
I've taken one system apart and reapplied the thermal paste between the
CPU and heatsink. That didn't help. I then replaced the motherboard, and
that system has behaved since. Perhaps, if this isn't a common problem,
I really do just have 8 more systems with bad thermal sensors on the
motherboard.
Cristopher J. Rhea
Mayo Clinic - Research Computing Facility
200 First St SW, Rochester, MN 55905
crhea at Mayo.EDU