serious stability issues with Dell C6145 and C410x
Stijn De Weirdt
stijn.deweirdt at ugent.be
Fri Jul 29 10:39:29 CDT 2011
we have some C6145s here (but no C410x), with only 4*8 cores and 64GB ram
(and we've only had them for about 1 month).
we run a recompiled 2.6.32-131.4 kernel and saw that this really mattered
a lot wrt compute times. the main changes were to disable no_hz and set
the kernel timer frequency (HZ) to 100Hz. we also stripped a lot of
unnecessary stuff from the default kernels (these are compute nodes
after all).
(the bios settings are performance, so no power saving features enabled)
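
a quick way to confirm that a rebuilt kernel really carries these two changes is to read its config file. the sketch below is only an illustration: it assumes the 100Hz setting refers to the kernel timer frequency (CONFIG_HZ) and that no_hz corresponds to CONFIG_NO_HZ, which is the option name on 2.6.32-era kernels. the function name and the sample text are made up for the example.

```python
import re

def timer_settings(config_text):
    """Return the tick-related options found in a kernel .config text."""
    # Assumption: these are the two options the kernel changes map to.
    settings = {"CONFIG_NO_HZ": None, "CONFIG_HZ": None}
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as "# CONFIG_FOO is not set".
        m = re.match(r"# (CONFIG_\w+) is not set$", line)
        if m and m.group(1) in settings:
            settings[m.group(1)] = "n"
            continue
        # Enabled options appear as "CONFIG_FOO=value".
        m = re.match(r"(CONFIG_\w+)=(.*)$", line)
        if m and m.group(1) in settings:
            settings[m.group(1)] = m.group(2)
    return settings

# Hand-written sample, not taken from a real node.
sample = """\
# CONFIG_NO_HZ is not set
CONFIG_HZ_100=y
CONFIG_HZ=100
"""
print(timer_settings(sample))
```

on a running node the text would typically come from /boot/config-$(uname -r), though the location depends on the distribution.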
we also had performance issues with the raid0 of the SAS2008 cards we
have. new firmware fixed that, but it was not standard (we got help from
dell support though)
for now things are starting to look good. my only remaining issue with
the boxes is that i can't get the pcie max payload higher than 128 bytes
on our IB cards (something also important for your setup i assume).
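
to see what payload size a card supports versus what it is actually running with, the output of `lspci -vv` can be compared per device. the following is a minimal sketch, assuming the usual layout where DevCap reports the supported MaxPayload and DevCtl the configured one. the function name and the sample text are hypothetical, and the exact lspci layout can differ between pciutils versions.

```python
import re

def max_payload(lspci_text):
    """Map each PCI address to its supported and configured MaxPayload."""
    result = {}
    device = None
    section = None
    for line in lspci_text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new device entry.
            device = line.split()[0]
            section = None
            continue
        # Indented "Tag:" lines open a section; untagged lines continue it.
        tag = re.match(r"\s+(\w+):", line)
        if tag:
            section = tag.group(1)
        found = re.search(r"MaxPayload (\d+) bytes", line)
        if found and device and section in ("DevCap", "DevCtl"):
            key = "supported" if section == "DevCap" else "configured"
            result.setdefault(device, {})[key] = int(found.group(1))
    return result

# Hand-written sample in the style of `lspci -vv`, not real output.
sample = """\
04:00.0 InfiniBand: example adapter
    Capabilities: [60] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
            MaxPayload 128 bytes, MaxReadReq 512 bytes
"""
print(max_payload(sample))
```

a device whose configured value is lower than its supported value is a candidate for the 128 byte limit described above. lspci usually needs root to show the capability details.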
> So we're having some serious issues with a brand new
> C6145 and attached C410x with two nVidia Tesla M2070's.
> The first problems I ran into were with a stock RHEL 6.1
> installation. Right off the bat there were issues with CPU's
> locking up for no apparent reason:
> Jul 28 22:23:03 snuffles kernel: [38293.200101] BUG: soft lockup - CPU#10 stuck for 67s! [python:9507]
> The offending process always varies, but seems to be anytime I
> try to run something which accesses the GPGPU's. I've even had
> simple things like nvidia-smi hang.
> This behavior usually results in a stuck system. I can
> typically still do certain things at this point. But a lot of
> things will result in hanging indefinitely and even trying to
> reboot the machine fails. Ultimately I end up holding down the
> power to get it to shut off completely.
> So I thought, maybe this is just a problem with "older"
> software. Fast forward to now where I'm running Debian testing
> using a 2.6.39 kernel and some of the latest nVidia 280.11
> drivers. I'm seeing the exact same problems I was seeing under
> RHEL. Of course, the software selection is better out of the box
> for a lot of the OpenCL related stuff (the lack of a decent GPGPU
> related repository for RHEL was surprising). But, it doesn't
> matter really since the hardware itself still isn't stable.
> Anyone else have any experience with this hardware yet?
> I should mention the C6145 has 4x AMD Opteron(tm) Processor 6164
> HE's, so 48 cores, and 256GB of RAM. I'm sure this only
> complicates things even further.
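
since the offending process varies between lockups, it can help to tally the soft lockup messages from the kernel log by cpu and process. the sketch below parses lines in the format quoted above. the function name is made up for the example, and the pattern is an assumption based on that single quoted line.

```python
import re
from collections import Counter

# Matches e.g. "BUG: soft lockup - CPU#10 stuck for 67s! [python:9507]"
LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+):(\d+)\]"
)

def parse_lockups(log_text):
    """Extract cpu, duration, process name and pid from soft lockup lines."""
    events = []
    for line in log_text.splitlines():
        m = LOCKUP.search(line)
        if m:
            events.append({
                "cpu": int(m.group(1)),
                "seconds": int(m.group(2)),
                "comm": m.group(3),
                "pid": int(m.group(4)),
            })
    return events

# The log line quoted in the original message.
sample = (
    "Jul 28 22:23:03 snuffles kernel: [38293.200101] "
    "BUG: soft lockup - CPU#10 stuck for 67s! [python:9507]"
)
events = parse_lockups(sample)
# Count lockups per process name to see whether one program dominates.
by_process = Counter(e["comm"] for e in events)
print(events)
print(by_process)
```

run over a full /var/log/messages or dmesg dump, the per-process counts show whether the lockups cluster around programs that touch the GPUs.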