R620/12G servers w/CentOS 5.7 and low-latency?

Matt Garman matthew.garman at gmail.com
Tue Jul 3 12:35:48 CDT 2012

I posted about this a while ago; follow-up below...

On Fri, May 18, 2012 at 12:50 PM, Matt Garman <matthew.garman at gmail.com> wrote:
> I was just wondering if anyone else on this list happens to be using
> the new 12G servers (R620 in our case) with CentOS 5.7 in an ultra-low
> latency environment?  We have configured the standard BIOS settings
> appropriately (disable C-states/C1E, max performance, etc.---a
> familiar drill from previous-gen servers).
> However, what we are seeing in our simulations and testing is
> inconsistent performance results.  Even doing the same exact test, one
> right after the other, the results vary substantially (up AND down).
> A full-on simulation one day to the next can vary by 50%.
> With the previous-gen (11G) servers, I had a similar performance issue
> that was ultimately fixed by a BIOS update (but six months after the
> 11G release).  I'm hoping to not see a repeat of this kind of issue.
> I'm working the formal channels with Dell on this, but just thought
> I'd throw a feeler out there to see if anyone else is seeing similar
> issues.

I've been able to isolate at least one facet of the problem.  It seems
to me that contended pthread mutexes and condition variable signaling
are slower on Sandy Bridge than on previous-generation CPUs.

What follows is a message I posted to the Linux Kernel Mailing List; I
thought I'd reproduce it here for anyone who might be
interested/curious.  I include a link to a sample program which
demonstrates the issue.

LKML post:

I have been looking at the performance of two servers:
    - dual Xeon X5550 2.67GHz (Nehalem, Dell R610)
    - dual Xeon E5-2690 2.90 GHz (Sandy Bridge, Dell R620 & HP dl360g8p)

For my particular (proprietary) application, the Sandy Bridge systems
are significantly slower.  At least one facet of this problem has to
do with:
    - pthread condition variable signaling
    - pthread mutex lock contention

I wrote a simple C program (under 300 lines) that demonstrates this:

The program has two tests:
    - "lc", a lock contention test, where two threads "fight" over
incrementing and decrementing an integer, arbitrated with a
    - "cv", a condition variable signaling test, where two threads
"politely" take turns incrementing and decrementing an integer,
signaling each other with a condition variable

The program uses pthread_setaffinity_np() to pin each thread to its
own CPU core.

I would expect the SNB-based servers to be faster, since they have both
a clock speed and an architecture advantage.

Results on the dual X5550 @ 2.67 GHz server under CentOS 5.7:
# ./snb_slow_demo -c 3 -C 5 -t cv -n 50000000
runtime, seconds ........ 143.339958
# ./snb_slow_demo -c 3 -C 5 -t lc -n 500000000
runtime, seconds ........ 58.278671

Results on the Dell E5-2690 @ 2.90 GHz server under CentOS 5.7:
# ./snb_slow_demo -c 2 -C 4 -t cv -n 50000000
runtime, seconds ........ 179.272697
# ./snb_slow_demo -c 2 -C 4 -t lc -n 500000000
runtime, seconds ........ 103.437226

I upgraded the E5-2690 server to CentOS 6.2, then tried both the
current kernel.org release kernel (3.4.4) and 3.5.0-rc5.  The "lc"
test results are about the same, but the "cv" results are worse yet:
the same test takes about 229 seconds to run.

Also noteworthy: the HP generally performs better than the Dell, but
even the HP E5-2690 is still slower than the X5550.

In all cases, for all servers, I disabled power-saving features (CPU
frequency scaling, C-states, C1E).  I verified with i7z that all CPUs
spend 100% of their time in state C0.

Is this simply a corner case where Sandy Bridge is worse than its
predecessor?  Or is there an implementation problem?

More information about the Linux-PowerEdge mailing list