Memory performance on PE R610 (Adv ECC vs Optimizer)

Stephen Dowdy sdowdy at ucar.edu
Fri May 29 13:49:59 CDT 2009


JACOB_LIBERMAN at dell.com wrote, On 05/29/2009 11:49 AM:
> If it's memory bandwidth you're after, populate 1 DIMM per channel per socket across both sockets.  (6 DIMMs total)
> 
> On an R610 with 6 1333 MHz UDIMMs you should expect stream bandwidth of ~36 GB/s. (BIOS 1.0.4 and 1.1.4, 8 threads)
> 
> Copy	36577
> Scale	36212
> Add	34232
> Triad	35240

> I have heaps of performance data if you need design recommendations for a particular application.

Jacob,

Send it all! ;)
This isn't my projects' system, so i'm not sure what they're doing
with it.  My projects would be doing WRF.  So, btw, do you have
performance data on the R610 vs R410 for HPCC WRF applications?
Seems there should be no performance penalty if using the same
processors and a single row of DIMMs optimized to the specific
processor module in each scenario, right?

For comparison...
I only get about 8GB/s using the OpenMP threaded version of STREAM, but
with 1 proc module and 1066MHz DIMMs. Should i be seeing better
than this, and if so, why am i not? From what you show above, i
expect to see roughly:
   0.5 (one socket) * 35000 MB/s (yields ~17.5GB/sec) \
     * 0.8 (1066/1333 scale memspeed) => ~14GB/sec
Should tri-channel (your setup) versus dual-channel (mine)
have that much impact?
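
The back-of-envelope estimate above can be written out as a short Python sketch. It assumes bandwidth scales linearly with socket count, DIMM speed, and (for the second figure) the number of populated channels; that linear model is only an assumption, not something measured here. The helper name scaled_estimate is made up for illustration.

```python
# Back-of-envelope scaling of the quoted two-socket STREAM figure.
# Inputs come from the numbers in this thread; the linear scaling model
# is an ASSUMPTION, not a measured result.

def scaled_estimate(ref_mb_s, socket_fraction, dimm_mhz, ref_dimm_mhz):
    """Scale a reference bandwidth by socket fraction and DIMM speed ratio."""
    return ref_mb_s * socket_fraction * (dimm_mhz / ref_dimm_mhz)

ref = 35000.0  # ~MB/s quoted for two sockets, 1333MHz DIMMs, 3 channels each

# One socket, 1066MHz DIMMs, still three channels.
est = scaled_estimate(ref, 0.5, 1066, 1333)
print(f"one socket, 1066MHz, 3 channels: ~{est / 1000:.1f} GB/s")

# ASSUMPTION: a further linear cut for 2 populated channels instead of 3.
est_2ch = est * 2 / 3
print(f"one socket, 1066MHz, 2 channels: ~{est_2ch / 1000:.1f} GB/s")
```

Under those assumptions the three-channel figure comes out near 14 GB/s and the two-channel figure near 9.3 GB/s, which would put the measured ~8 GB/s much closer to what a dual-channel layout might deliver.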


This system's Configuration:
    R610
    single E5530 module
    4x 2GB 1066MHz DIMMs

# grep 'model name' /proc/cpuinfo | head -1
model name      : Intel(R) Xeon(R) CPU           E5530  @ 2.40GHz

# /root/ssi.sh | grep ssi_cpu
ssi_cpu_cap_ht=1    * can do hyperthreading *
ssi_cpu_cap_nx=1    * can do NoExecute *
ssi_cpu_cap_pae=1   * can do PAE *
ssi_cpu_cap_vmx=1   * Vanderpool VM *
ssi_cpu_clock=2393.999
ssi_cpu_core_count=4   * total cores *
ssi_cpu_cores_per_chip=4   * cores / socket *
ssi_cpu_count=8     * total "cpus" *
ssi_cpu_is_ht=1     * hyperthread/SMT enabled *
ssi_cpu_is_mc=1     * is multicore *
ssi_cpu_siblings=8  * 8 "processors" *
ssi_cpu_sockets=1   * 1 socket occupied *

# dmidecode -t memory | agrep -d'^Handle' -v 'Size: No Module' | egrep '^[[:space:]]*(Size|Locator|Speed)' | paste - - -
        Size: 2048 MB           Locator: DIMM_A1                Speed: 1066 MHz (0.9 ns)
        Size: 2048 MB           Locator: DIMM_A2                Speed: 1066 MHz (0.9 ns)
        Size: 2048 MB           Locator: DIMM_A4                Speed: 1066 MHz (0.9 ns)
        Size: 2048 MB           Locator: DIMM_A5                Speed: 1066 MHz (0.9 ns)
(this is the Optimizer Mode configuration)
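
To see how many memory channels those four DIMMs occupy, the Locator fields can be mapped to channels with a few lines of Python. This is only a sketch: it assumes the R610 numbers its slots so that A1-A3 are the first slot of channels 0-2 and A4-A6 are the second slot of the same channels. That mapping is an assumption to check against the hardware manual, and populated_channels is a hypothetical helper name.

```python
import re

# Locator lines copied from the dmidecode output in this post.
DMIDECODE_LINES = """\
Size: 2048 MB  Locator: DIMM_A1  Speed: 1066 MHz (0.9 ns)
Size: 2048 MB  Locator: DIMM_A2  Speed: 1066 MHz (0.9 ns)
Size: 2048 MB  Locator: DIMM_A4  Speed: 1066 MHz (0.9 ns)
Size: 2048 MB  Locator: DIMM_A5  Speed: 1066 MHz (0.9 ns)
"""

# ASSUMPTION: slots 1-3 are the first DIMM of channels 0-2 and
# slots 4-6 are the second DIMM of channels 0-2 (per socket).
CHANNELS_PER_SOCKET = 3

def populated_channels(text):
    """Return {(socket, channel): [slot numbers]} for each populated DIMM."""
    channels = {}
    for m in re.finditer(r"Locator:\s*DIMM_([A-Z])(\d+)", text):
        socket, slot = m.group(1), int(m.group(2))
        channel = (slot - 1) % CHANNELS_PER_SOCKET
        channels.setdefault((socket, channel), []).append(slot)
    return channels

chans = populated_channels(DMIDECODE_LINES)
for (socket, channel), slots in sorted(chans.items()):
    print(f"socket {socket} channel {channel}: slots {slots}")
print(f"{len(chans)} of {CHANNELS_PER_SOCKET} channels populated")
```

If the assumed slot numbering holds, this layout puts two DIMMs on each of two channels and leaves the third channel empty.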

Stream test with N=32000000
thrust:stream# OMP_NUM_THREADS=4 ./stream_c_omp.exe
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 32000000, Offset = 0
Total memory required = 732.4 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 4
-------------------------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 45081 microseconds.
   (= 45081 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        7964.6757       0.0645       0.0643       0.0647
Scale:       7783.5016       0.0660       0.0658       0.0662
Add:         8723.6035       0.0884       0.0880       0.0890
Triad:       8835.4905       0.0875       0.0869       0.0881
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
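
As a sanity check on the table above, the rates can be recomputed from the array size and the best (Min) times. The sketch below assumes the usual STREAM accounting: Copy and Scale touch two arrays per iteration, Add and Triad touch three, and MB means 10^6 bytes. Because the Min times are rounded to four decimals in the printout, the results should only agree with the reported rates to within about 0.1%.

```python
# Recompute STREAM rates from the array size and best ("Min") times above.
N = 32_000_000   # array elements, from the "Array size" line
WORD = 8         # bytes per double, as reported by STREAM

# Arrays touched per kernel (ASSUMED standard STREAM byte counting):
# Copy/Scale read one array and write one; Add/Triad read two and write one.
ARRAYS = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}
MIN_TIME = {"Copy": 0.0643, "Scale": 0.0658, "Add": 0.0880, "Triad": 0.0869}

rates = {}
for name, arrays in ARRAYS.items():
    bytes_moved = arrays * WORD * N
    rates[name] = 1e-6 * bytes_moved / MIN_TIME[name]  # MB/s, MB = 10^6 bytes
    print(f"{name:6s} {rates[name]:10.1f} MB/s")

# Footprint of the three arrays, in units of 2^20 bytes as STREAM prints it.
total_mib = 3 * WORD * N / 2**20
print(f"three arrays: {total_mib:.1f} MB")
```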

thanks,
--stephen
-- 
Stephen Dowdy  -  Systems Administrator  -  NCAR/RAL
303.497.2869   -  sdowdy at ucar.edu        -  http://www.ral.ucar.edu/~sdowdy/


