R620/12G servers w/CentOS 5.7 and low-latency?
matthew.garman at gmail.com
Tue Jul 3 12:35:48 CDT 2012
I posted about this a while ago; see my follow-up below...
On Fri, May 18, 2012 at 12:50 PM, Matt Garman <matthew.garman at gmail.com> wrote:
> I was just wondering if anyone else on this list happens to be using
> the new 12G servers (R620 in our case) with CentOS 5.7 in an ultra-low
> latency environment? We have configured the standard BIOS settings
> appropriately (disable C-states/C1E, maximum performance, etc., a
> familiar drill from previous-gen servers).
> However, what we are seeing in our simulations and testing is
> inconsistent performance. Running the exact same test twice in a
> row, the results vary substantially (both up and down). A full
> simulation run can vary by 50% from one day to the next.
> With the previous-gen (11G) servers, I had a similar performance issue
> that was ultimately fixed by a BIOS update (but six months after the
> 11G release). I'm hoping not to see a repeat of this kind of issue.
> I'm working the formal channels with Dell on this, but just thought
> I'd throw a feeler out there to see if anyone else is seeing similar
> issues.
I've been able to isolate at least one facet of the problem. It seems
to me that contended pthread mutexes and condition variable signaling
are slower on Sandy Bridge than on previous-generation CPUs.
What follows is a message I posted to the Linux Kernel Mailing List; I
thought I'd reproduce it here for anyone who might be
interested/curious. I include a link to a sample program which
demonstrates the issue.
I have been looking at the performance of two servers:
- dual Xeon X5550 2.67 GHz (Nehalem, Dell R610)
- dual Xeon E5-2690 2.90 GHz (Sandy Bridge, Dell R620 & HP dl360g8p)
For my particular (proprietary) application, the Sandy Bridge systems
are significantly slower. At least one facet of this problem has to
do with:
- pthread condition variable signaling
- pthread mutex lock contention
I wrote a simple C program (under 300 lines) that demonstrates this.
The program has two tests (a rough sketch follows the list):
- "lc", a lock contention test, where two threads "fight" over
incrementing and decrementing an integer, arbitrated with a
pthread mutex
- "cv", a condition variable signaling test, where two threads
"politely" take turns incrementing and decrementing an integer,
signaling each other with a condition variable
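For anyone curious, here is a minimal sketch of what the two test
loops look like. This is my own reconstruction for illustration, not
the actual snb_slow_demo source; the function names, the fixed
iteration count, and main() running only the "cv" variant are all
assumptions (compile with gcc -O2 -pthread -std=gnu99):

    /* Sketch of the two tests; NOT the original snb_slow_demo source. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t mtx  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static long counter = 0;
    static long iters = 50000000;    /* like -n in the runs below */

    /* "lc": both threads run this loop and simply fight over the mutex. */
    static void *lc_thread(void *arg)
    {
        long delta = (long)arg;      /* +1 in one thread, -1 in the other */
        for (long i = 0; i < iters; i++) {
            pthread_mutex_lock(&mtx);
            counter += delta;
            pthread_mutex_unlock(&mtx);
        }
        return NULL;
    }

    /* "cv": each thread waits for its turn, flips the state, and
     * signals the peer, so every iteration wakes the other thread. */
    static void *cv_thread(void *arg)
    {
        long want = (long)arg;       /* 0 for one thread, 1 for the other */
        for (long i = 0; i < iters; i++) {
            pthread_mutex_lock(&mtx);
            while (counter != want)
                pthread_cond_wait(&cond, &mtx);
            counter = !want;         /* hand the turn to the peer */
            pthread_cond_signal(&cond);
            pthread_mutex_unlock(&mtx);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        /* runs the "cv" variant; swap in lc_thread for the "lc" test */
        pthread_create(&a, NULL, cv_thread, (void *)0L);
        pthread_create(&b, NULL, cv_thread, (void *)1L);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("final counter = %ld\n", counter);
        return 0;
    }

The "lc" loop mostly exercises the mutex slow path under contention,
while every "cv" iteration additionally forces the kernel to wake the
sleeping peer, which may be why the two tests behave differently
across CPU generations.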
The program uses pthread_setaffinity_np() to pin each thread to its
own CPU core.
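The pinning itself would look roughly like this (pin_to_core is a
hypothetical helper, not a name from the original program; note that
pthread_setaffinity_np() is glibc-specific and needs _GNU_SOURCE):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Bind a thread to exactly one core, e.g. the -c/-C arguments below. */
    static int pin_to_core(pthread_t thread, int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(thread, sizeof(set), &set);
    }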
I would expect the SNB-based servers to be faster, since they have
both a clock speed and an architecture advantage.
Results for the X5550 @ 2.67 GHz server under CentOS 5.7:
# ./snb_slow_demo -c 3 -C 5 -t cv -n 50000000
runtime, seconds ........ 143.339958
# ./snb_slow_demo -c 3 -C 5 -t lc -n 500000000
runtime, seconds ........ 58.278671
Results for the Dell E5-2690 @ 2.90 GHz server under CentOS 5.7:
# ./snb_slow_demo -c 2 -C 4 -t cv -n 50000000
runtime, seconds ........ 179.272697
# ./snb_slow_demo -c 2 -C 4 -t lc -n 500000000
runtime, seconds ........ 103.437226
I upgraded the E5-2690 server to CentOS 6.2, then tried both the
current kernel.org release kernel, version 3.4.4, and also 3.5.0-rc5.
The "lc" test results are about the same, but the "cv" results are
worse still: the same test takes about 229 seconds to run.
Also noteworthy: the HP generally performs better than the Dell, but
even the HP E5-2690 is still slower than the X5550.
In all cases, for all servers, I disabled the power-saving features
(CPU frequency scaling, C-states, C1E). I verified with i7z that all
CPUs spend 100% of their time in state C0.
Is this simply a corner case where Sandy Bridge is worse than its
predecessor? Or is there an implementation problem?