[mvapich-discuss] Shared Memory Performance
Christopher Co
cco2 at cray.com
Fri Jun 26 15:01:09 EDT 2009
I have found the source of the shared memory latency problem seen with
the IMB Ping Pong test. After a lot of digging, I found that IMB's
default initialization enables MPI_THREAD_MULTIPLE, and that in the
Ping Pong source code the "source" argument of the MPI_Send/MPI_Recv
calls uses MPI_ANY_SOURCE. These two factors were skewing all the
results except Intel MPI's. After changing IMB to initialize with
MPI_THREAD_SINGLE and setting the source rank to the correct value, I
produced similar numbers (using cores 5 and 7 for further increased
performance) across IMB Ping Pong, OSU Latency, and my own basic Ping
Pong timing. The numbers are below. There is still an unknown issue
where the 0 and 1 byte latencies are off (and it looks like the OSU
Latency numbers are the correct ones here). From my testing, I noticed
that the first part of the 1000 repetitions IMB ran for the 0 byte
latency produced extremely high values, even though IMB does an
MPI_Barrier before it starts to ensure that the sends/receives start
together.
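To make the two changes concrete, here is a minimal two-rank ping-pong
sketch of my own (an illustration, not the actual IMB source; names and
message size are placeholders). It initializes with MPI_THREAD_SINGLE
and receives from an explicit peer rank rather than MPI_ANY_SOURCE.
Build with mpicc and run with exactly 2 ranks:

```c
/* Minimal ping-pong sketch illustrating the two fixes described above:
 * (1) request only MPI_THREAD_SINGLE at initialization, and
 * (2) receive from a known peer rank instead of MPI_ANY_SOURCE. */
#include <mpi.h>
#include <stdio.h>

#define REPS   1000
#define NBYTES 8

int main(int argc, char **argv)
{
    int provided, rank, size, peer, i;
    char buf[NBYTES] = {0};
    double t0, t1;

    /* Fix 1: ask only for MPI_THREAD_SINGLE (plain MPI_Init also works). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SINGLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    peer = 1 - rank;               /* explicit partner rank: 0 <-> 1 */

    MPI_Barrier(MPI_COMM_WORLD);   /* start both sides together */
    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, NBYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            /* Fix 2: receive from the known peer, not MPI_ANY_SOURCE. */
            MPI_Recv(buf, NBYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, NBYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, NBYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();
    if (rank == 0)   /* one-way latency: total time / (2 * repetitions) */
        printf("%d bytes: %.2f usec\n", NBYTES,
               (t1 - t0) * 1e6 / (2.0 * REPS));
    MPI_Finalize();
    return 0;
}
```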
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V3.0, MPI-1 part
#---------------------------------------------------
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
    #bytes  #repetitions    t[usec]  Mbytes/sec
         0          1000       0.37        0.00
         1          1000       0.43        2.19
         2          1000       0.40        4.82
         4          1000       0.46        8.21
         8          1000       0.44       17.16
        16          1000       0.45       34.10
        32          1000       0.47       65.42
        64          1000       0.48      127.55
       128          1000       0.51      237.48
       256          1000       0.56      435.19
       512          1000       0.66      742.70
      1024          1000       0.83     1171.62
      2048          1000       1.17     1669.28
      4096          1000       1.84     2124.07
      8192          1000       3.24     2409.85
     16384          1000       6.25     2501.23
     32768          1000      10.77     2901.97
     65536           640      16.68     3747.09
    131072           320      25.72     4860.57
    262144           160      43.62     5730.71
    524288            80      81.07     6167.53
   1048576            40     173.55     5762.10
   2097152            20    1165.35     1716.23
   4194304            10    2689.10     1487.49
# OSU MPI Latency Test v3.1.1
# Size     Latency (us)
      0            0.30
      1            0.38
      2            0.39
      4            0.46
      8            0.44
     16            0.44
     32            0.46
     64            0.47
    128            0.49
    256            0.53
    512            0.63
   1024            0.79
   2048            1.11
   4096            1.80
   8192            3.24
  16384            6.36
  32768           10.99
  65536           16.34
 131072           24.75
 262144           41.51
 524288           75.74
1048576          157.31
2097152         1159.87
4194304         2696.29
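For reference, the core pinning used for the runs above can be
requested on the command line with MV2_CPU_MAPPING. A sketch, assuming
mpirun_rsh and placeholder hostnames:

```shell
# Pin the two ranks to cores 5 and 7 on one node
# (node01 is a placeholder hostname):
MV2_CPU_MAPPING=5:7 mpirun_rsh -np 2 node01 node01 ./osu_latency
```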
Christopher Co wrote:
> I have found that the CX-1 I am running on has two Intel Xeon E5472 3
> GHz processors (Harpertown). Your test results were on Nehalem
> processors. Once the correct CPU mapping was applied, I got roughly
> 0.8 usec to Ping Pong 8 bytes. I wonder if this can account for the
> discrepancy. Anyways, I'll investigate this further and get more data,
> but I wanted to throw this information out there in case it can be
> helpful.
>
>
> Chris
>
> Christopher Co wrote:
>
>> Those specifications are correct. I am seeing that the MV2_CPU_MAPPING
>> option does not have an effect on which cores are chosen so when I
>> launch a Ping-Pong, 2 cores are arbitrarily chosen by mpirun_rsh. One
>> thing that might be hindering PLPA support is that I do not have
>> sudo/root access on the machine. I installed everything into my home
>> directory. Could this be the issue?
>>
>>
>> Chris
>>
>> Dhabaleswar Panda wrote:
>>
>>
>>> Could you let us know what issues you are seeing when using
>>> MV2_CPU_MAPPING? The PLPA support is embedded in the MVAPICH2 code. It does
>>> not require any additional configure/install. I am assuming that you are
>>> using the Gen2 (OFED) interface with mpirun_rsh and your systems are
>>> Linux-based.
>>>
>>> Thanks,
>>>
>>> DK
>>>
>>>
>>> On Tue, 16 Jun 2009, Christopher Co wrote:
>>>
>>>> I am having issues with running processes on the cores I specify using
>>>> MV2_CPU_MAPPING. Is the PLPA support for mapping MPI processes to cores
>>>> embedded in MVAPICH2 or does it link to an existing PLPA on
>>>> configure/install? Also, I want to confirm that no extra configure
>>>> options are needed to enable this feature.
>>>>
>>>>
>>>> Thanks,
>>>> Chris
>>>>
>>>> Dhabaleswar Panda wrote:
>>>>
>>>>> Thanks for letting us know that you are using MVAPICH2 1.4. I believe you
>>>>> are taking numbers on Intel systems. Please note that on Intel systems,
>>>>> two cores next to each other within the same chip are numbered as 0 and 4
>>>>> (not 0 and 1). Thus, the default setting (with processes 0 and 1) runs
>>>>> across the chips, and that is why you are seeing worse performance. Please run
>>>>> your tests across cores 0 and 4 and you should be able to see better
>>>>> performance. Depending on which pairs of processes you use, you may see
>>>>> some differences in performance for short and large messages (depends on
>>>>> whether these cores are within the same chip, same socket or across
>>>>> sockets). I am attaching some numbers below on our Nehalem system with
>>>>> these two CPU mappings and you can see the performance difference.
>>>>>
>>>>> MVAPICH2 provides flexible mapping of MPI processes to cores within a
>>>>> node. You can try out performance across various pairs and you will see
>>>>> performance difference. More details on such mapping are available from
>>>>> here:
>>>>>
>>>>> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4rc1.html#x1-360006.8
>>>>>
>>>>> Also, starting from MVAPICH2 1.4, a new single-copy kernel-based
>>>>> shared-memory scheme (LiMIC2) is introduced. This is `off' by default.
>>>>> You can use it to get better performance for larger message sizes. You
>>>>> need to configure with enable-limic2 and you also need to use
>>>>> MV2_SMP_USE_LIMIC2=1. More details are available from here:
>>>>>
>>>>> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4rc1.html#x1-370006.9
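For reference, the LiMIC2 steps described in the quoted message amount
to a configure-time option plus a run-time variable. A sketch, using
the flag spelling given above and placeholder hostnames:

```shell
# Build MVAPICH2 with LiMIC2 support, then enable it at run time
# (node01 is a placeholder hostname):
./configure --enable-limic2 && make && make install
MV2_SMP_USE_LIMIC2=1 mpirun_rsh -np 2 node01 node01 ./osu_latency
```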
>>>>>
>>>>> Here are some performance numbers with different CPU mappings.
>>>>>
>>>>> OSU MPI latency with Default CPU mapping (LiMIC2 is off)
>>>>> --------------------------------------------------------
>>>>>
>>>>> # OSU MPI Latency Test v3.1.1
>>>>> # Size Latency (us)
>>>>> 0 0.77
>>>>> 1 0.95
>>>>> 2 0.95
>>>>> 4 0.94
>>>>> 8 0.94
>>>>> 16 0.94
>>>>> 32 0.96
>>>>> 64 0.99
>>>>> 128 1.09
>>>>> 256 1.22
>>>>> 512 1.37
>>>>> 1024 1.61
>>>>> 2048 1.79
>>>>> 4096 2.43
>>>>> 8192 5.42
>>>>> 16384 6.73
>>>>> 32768 9.57
>>>>> 65536 15.34
>>>>> 131072 28.71
>>>>> 262144 53.13
>>>>> 524288 100.24
>>>>> 1048576 199.98
>>>>> 2097152 387.28
>>>>> 4194304 991.68
>>>>>
>>>>> OSU MPI latency with CPU mapping 0:4 (LiMIC2 is off)
>>>>> ----------------------------------------------------
>>>>>
>>>>> # OSU MPI Latency Test v3.1.1
>>>>> # Size Latency (us)
>>>>> 0 0.34
>>>>> 1 0.40
>>>>> 2 0.40
>>>>> 4 0.40
>>>>> 8 0.40
>>>>> 16 0.40
>>>>> 32 0.42
>>>>> 64 0.42
>>>>> 128 0.45
>>>>> 256 0.50
>>>>> 512 0.55
>>>>> 1024 0.67
>>>>> 2048 0.91
>>>>> 4096 1.35
>>>>> 8192 3.66
>>>>> 16384 5.01
>>>>> 32768 7.41
>>>>> 65536 12.90
>>>>> 131072 25.21
>>>>> 262144 49.71
>>>>> 524288 97.17
>>>>> 1048576 187.50
>>>>> 2097152 465.57
>>>>> 4194304 1196.31
>>>>>
>>>>> Let us know if you get better performance with appropriate CPU mapping.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> DK
>>>>>
>>>>>
>>>>> On Mon, 15 Jun 2009, Christopher Co wrote:
>>>>>
>>>>>> I am using MVAPICH2 1.4 with the default configuration (since the CX-1
>>>>>> uses Mellanox InfiniBand). I am fairly certain my CPU mapping was
>>>>>> on-node for both cases (curiously, is there a way for MVAPICH2 to print
>>>>>> out the nodes/cores it is running on?). I have the numbers for Ping Pong
>>>>>> for the off-node case; I should have included them in my earlier message:
>>>>>> # Processes = 2
>>>>>> #repetitions   #bytes   Intel MPI time (usec)   MVAPICH2 time (usec)
>>>>>>         1000        0                    4.16                   3.4
>>>>>>         1000        1                    4.67                   3.56
>>>>>>         1000        2                    4.21                   3.56
>>>>>>         1000        4                    4.23                   3.62
>>>>>>         1000        8                    4.33                   3.63
>>>>>>         1000       16                    4.33                   3.64
>>>>>>         1000       32                    4.38                   3.73
>>>>>>         1000       64                    4.44                   3.92
>>>>>>         1000      128                    5.61                   4.71
>>>>>>         1000      256                    5.92                   5.23
>>>>>>         1000      512                    6.52                   5.79
>>>>>>         1000     1024                    7.68                   7.06
>>>>>>         1000     2048                    9.97                   9.36
>>>>>>         1000     4096                   12.39                  11.97
>>>>>>         1000     8192                   17.86                  22.53
>>>>>>         1000    16384                   27.44                  28.27
>>>>>>         1000    32768                   40.32                  39.82
>>>>>>          640    65536                   63.61                  62.97
>>>>>>          320   131072                  109.69                 110.01
>>>>>>          160   262144                  204.71                 206.9
>>>>>>           80   524288                  400.72                 397.1
>>>>>>           40  1048576                  775.64                 776.45
>>>>>>           20  2097152                 1523.95                1535.65
>>>>>>           10  4194304                 3018.84                3054.89
>>>>>>
>>>>>>
>>>>>>
>>>>>> Chris
>>>>>>
>>>>>>
>>>>>> Dhabaleswar Panda wrote:
>>>>>>
>>>>>>> Can you tell us which version of MVAPICH2 you are using and which
>>>>>>> option(s) are configured? Are you using correct CPU mapping in both
>>>>>>> cases?
>>>>>>>
>>>>>>> DK
>>>>>>>
>>>>>>> On Mon, 15 Jun 2009, Christopher Co wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am doing performance analysis on a Cray CX1 machine. I have run the
>>>>>>>> Pallas MPI benchmark and have noticed a considerable performance
>>>>>>>> difference between MVAPICH2 and Intel MPI on all the tests when shared
>>>>>>>> memory is used. I have also run the benchmark for non-shared memory and
>>>>>>>> the two performed nearly the same (MVAPICH2 was slightly faster). Is
>>>>>>>> this slowdown on shared memory a known issue and/or are there fixes or
>>>>>>>> switches I can enable or disable to get more speed?
>>>>>>>>
>>>>>>>> To give an idea of what I'm seeing, for the simple Ping Pong test for
>>>>>>>> two processes on the same chip, the numbers look like:
>>>>>>>>
>>>>>>>> # Processes = 2
>>>>>>>> #repetitions   #bytes   Intel MPI time (usec)   MVAPICH2 time (usec)
>>>>>>>>         1000        0                    0.35                   0.94
>>>>>>>>         1000        1                    0.44                   1.24
>>>>>>>>         1000        2                    0.45                   1.17
>>>>>>>>         1000        4                    0.45                   1.08
>>>>>>>>         1000        8                    0.45                   1.11
>>>>>>>>         1000       16                    0.44                   1.13
>>>>>>>>         1000       32                    0.45                   1.21
>>>>>>>>         1000       64                    0.47                   1.35
>>>>>>>>         1000      128                    0.48                   1.75
>>>>>>>>         1000      256                    0.51                   2.92
>>>>>>>>         1000      512                    0.57                   3.41
>>>>>>>>         1000     1024                    0.76                   3.85
>>>>>>>>         1000     2048                    0.98                   4.27
>>>>>>>>         1000     4096                    1.53                   5.14
>>>>>>>>         1000     8192                    2.59                   8.04
>>>>>>>>         1000    16384                    4.86                  14.34
>>>>>>>>         1000    32768                    7.17                  33.92
>>>>>>>>          640    65536                   11.65                  43.27
>>>>>>>>          320   131072                   20.97                  66.98
>>>>>>>>          160   262144                   39.64                 118.58
>>>>>>>>           80   524288                   84.91                 224.40
>>>>>>>>           40  1048576                  212.76                 461.80
>>>>>>>>           20  2097152                  458.55                1053.67
>>>>>>>>           10  4194304                 1738.30                2649.30
>>>>>>>>
>>>>>>>>
>>>>>>>> Hopefully the table came out clear. MVAPICH2 always lags behind by a
>>>>>>>> considerable amount. Any insight is much appreciated. Thanks!
>>>>>>>>
>>>>>>>>
>>>>>>>> Chris Co
>>>>>>>> _______________________________________________
>>>>>>>> mvapich-discuss mailing list
>>>>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>>>>
>>>>>
>>>
>