serious stability issues with Dell C6145 and C410x

Mark Nipper nipsy at mail.utexas.edu
Fri Jul 29 10:16:51 CDT 2011


	So we're having some serious stability issues with a brand
new C6145 and an attached C410x with two nVidia Tesla M2070s.

	The first problems I ran into were with a stock RHEL 6.1
installation.  Right off the bat there were issues with CPUs
locking up for no apparent reason:
---
Jul 28 22:23:03 snuffles kernel: [38293.200101] BUG: soft lockup - CPU#10 stuck for 67s! [python:9507]
---

The offending process varies, but the lockup seems to happen any
time I run something which accesses the GPGPUs.  I've even had
simple things like nvidia-smi hang.
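
	To check whether the lockups cluster on particular CPUs or
processes, something like the following rough sketch could tally
them from the kernel log.  It assumes every message follows the
format of the single line quoted above.  The names SOFT_LOCKUP,
parse_soft_lockups, and tally are made up for illustration and do
not come from any existing tool.

```python
import re
from collections import Counter

# Assumed format, taken from the one log line quoted in this post:
#   BUG: soft lockup - CPU#10 stuck for 67s! [python:9507]
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(lines):
    """Return one dict per soft-lockup message found in an iterable of lines."""
    events = []
    for line in lines:
        m = SOFT_LOCKUP.search(line)
        if m:
            events.append({
                "cpu": int(m.group("cpu")),
                "seconds": int(m.group("secs")),
                "comm": m.group("comm"),
                "pid": int(m.group("pid")),
            })
    return events


def tally(events):
    """Count lockups per process name and per CPU number."""
    by_comm = Counter(e["comm"] for e in events)
    by_cpu = Counter(e["cpu"] for e in events)
    return by_comm, by_cpu


# Example using the line from this post; any iterable of strings works.
sample = [
    "Jul 28 22:23:03 snuffles kernel: [38293.200101] BUG: soft lockup - "
    "CPU#10 stuck for 67s! [python:9507]",
]
by_comm, by_cpu = tally(parse_soft_lockups(sample))
print("by process:", dict(by_comm))
print("by CPU:", dict(by_cpu))
```

Since parse_soft_lockups accepts any iterable of lines, an open
log file object can be passed to it directly.  The path of the
kernel log varies by distribution.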

	This behavior usually leaves the system stuck.  I can
typically still do certain things at this point, but a lot of
commands hang indefinitely, and even trying to reboot the
machine fails.  Ultimately I end up holding down the power
button to get it to shut off completely.

	So I thought maybe this was just a problem with "older"
software.  Fast forward to now, where I'm running Debian testing
with a 2.6.39 kernel and one of the latest nVidia drivers,
280.11.  I'm seeing exactly the same problems I was seeing under
RHEL.  Of course, the software selection is better out of the box
for a lot of the OpenCL related stuff (the lack of a decent GPGPU
related repository for RHEL was surprising).  But it doesn't
really matter, since the hardware itself still isn't stable.

	Does anyone else have any experience with this hardware yet?
I should mention the C6145 has 4x AMD Opteron(tm) Processor 6164
HEs, so 48 cores, and 256GB of RAM.  I'm sure this only
complicates things even further.

-- 
Mark Nipper
nipsy at mail.utexas.edu
+1 512 471 3483 - office
+1 979 575 3193 - cell
-
He hoped and prayed that there wasn't an afterlife. Then he
realized there was a contradiction involved here and merely
hoped that there wasn't an afterlife.
 -- Douglas Adams



More information about the Linux-PowerEdge mailing list