[mvapich-discuss] Shared Memory Performance
Christopher Co
cco2 at cray.com
Wed Jun 17 17:21:07 EDT 2009
I have found that the CX-1 I am running on has two Intel Xeon E5472 3
GHz processors (Harpertown), whereas your test results were taken on
Nehalem processors. With the correct CPU mapping applied, I get roughly
0.8 usec for an 8-byte Ping-Pong. I wonder whether this difference in
processors can account for the discrepancy. Anyway, I'll investigate
this further and gather more data, but I wanted to share this
information in case it is helpful.
Chris
Christopher Co wrote:
> Those specifications are correct. I am seeing that the MV2_CPU_MAPPING
> option has no effect on which cores are chosen: when I launch a
> Ping-Pong, two cores are picked arbitrarily by mpirun_rsh. One thing
> that might be hindering PLPA support is that I do not have sudo/root
> access on the machine; I installed everything into my home directory.
> Could this be the issue?
>
>
> Chris
>
> Dhabaleswar Panda wrote:
>
>> Could you let us know what issues you are seeing when using
>> MV2_CPU_MAPPING? The PLPA support is embedded in the MVAPICH2 code; it
>> does not require any additional configure/install step. I am assuming
>> that you are using the Gen2 (OFED) interface with mpirun_rsh and that
>> your systems are Linux-based.
>>
>> Thanks,
>>
>> DK
>>
>>
>> On Tue, 16 Jun 2009, Christopher Co wrote:
>>
>>> I am having issues with running processes on the cores I specify using
>>> MV2_CPU_MAPPING. Is the PLPA support for mapping MPI processes to cores
>>> embedded in MVAPICH2 or does it link to an existing PLPA on
>>> configure/install? Also, I want to confirm that no extra configure
>>> options are needed to enable this feature.
>>>
>>>
>>> Thanks,
>>> Chris
>>>
>>> Dhabaleswar Panda wrote:
>>>
>>>
>>>> Thanks for letting us know that you are using MVAPICH2 1.4. I believe you
>>>> are taking numbers on Intel systems. Please note that on Intel systems,
>>>> two cores next to each other within the same chip are numbered 0 and 4
>>>> (not 0 and 1). Thus, with the default setting (processes 0 and 1), the
>>>> processes run across the chips, and that is why you are seeing worse
>>>> performance. Please run your tests on cores 0 and 4 and you should see
>>>> better performance. Depending on which pair of cores you use, you may
>>>> see some performance differences for short and large messages (depending
>>>> on whether the cores are within the same chip, the same socket, or
>>>> across sockets). I am attaching some numbers below from our Nehalem
>>>> system with these two CPU mappings so you can see the difference.
>>>>
>>>> MVAPICH2 provides flexible mapping of MPI processes to cores within a
>>>> node. You can measure performance across various core pairs and you
>>>> will see performance differences. More details on such mapping are
>>>> available here:
>>>>
>>>> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4rc1.html#x1-360006.8
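As a concrete sketch of such a run (the hostname and benchmark path below are placeholders, not from this thread), the 0:4 mapping can be passed on the mpirun_rsh command line:

```shell
# Sketch: pin rank 0 to core 0 and rank 1 to core 4 (same chip, given the
# Intel core numbering described above). "node01" and ./osu_latency are
# placeholders for your own host and benchmark binary.
LAUNCH="mpirun_rsh -np 2 node01 node01 MV2_CPU_MAPPING=0:4 ./osu_latency"
echo "$LAUNCH"
```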
>>>>
>>>> Also, starting with MVAPICH2 1.4, a new single-copy kernel-based
>>>> shared-memory scheme (LiMIC2) is introduced. It is `off' by default.
>>>> You can use it to get better performance for larger message sizes: you
>>>> need to configure with enable-limic2 and also set
>>>> MV2_SMP_USE_LIMIC2=1 at run time. More details are available here:
>>>>
>>>> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4rc1.html#x1-370006.9
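A sketch of what that would look like for a from-source build in a home directory (the prefix path and hostname are placeholders; check the exact configure flag spelling against the user guide for your version):

```shell
# Build-time: enable LiMIC2 support (flag spelling as given in this thread).
./configure --enable-limic2 --prefix=$HOME/mvapich2-1.4
make && make install

# Run-time: turn LiMIC2 on for this job ("node01" is a placeholder host).
mpirun_rsh -np 2 node01 node01 MV2_SMP_USE_LIMIC2=1 ./osu_latency
```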
>>>>
>>>> Here are some performance numbers with different CPU mappings.
>>>>
>>>> OSU MPI latency with Default CPU mapping (LiMIC2 is off)
>>>> --------------------------------------------------------
>>>>
>>>> # OSU MPI Latency Test v3.1.1
>>>> # Size Latency (us)
>>>> 0 0.77
>>>> 1 0.95
>>>> 2 0.95
>>>> 4 0.94
>>>> 8 0.94
>>>> 16 0.94
>>>> 32 0.96
>>>> 64 0.99
>>>> 128 1.09
>>>> 256 1.22
>>>> 512 1.37
>>>> 1024 1.61
>>>> 2048 1.79
>>>> 4096 2.43
>>>> 8192 5.42
>>>> 16384 6.73
>>>> 32768 9.57
>>>> 65536 15.34
>>>> 131072 28.71
>>>> 262144 53.13
>>>> 524288 100.24
>>>> 1048576 199.98
>>>> 2097152 387.28
>>>> 4194304 991.68
>>>>
>>>> OSU MPI latency with CPU mapping 0:4 (LiMIC2 is off)
>>>> ----------------------------------------------------
>>>>
>>>> # OSU MPI Latency Test v3.1.1
>>>> # Size Latency (us)
>>>> 0 0.34
>>>> 1 0.40
>>>> 2 0.40
>>>> 4 0.40
>>>> 8 0.40
>>>> 16 0.40
>>>> 32 0.42
>>>> 64 0.42
>>>> 128 0.45
>>>> 256 0.50
>>>> 512 0.55
>>>> 1024 0.67
>>>> 2048 0.91
>>>> 4096 1.35
>>>> 8192 3.66
>>>> 16384 5.01
>>>> 32768 7.41
>>>> 65536 12.90
>>>> 131072 25.21
>>>> 262144 49.71
>>>> 524288 97.17
>>>> 1048576 187.50
>>>> 2097152 465.57
>>>> 4194304 1196.31
>>>>
>>>> Let us know if you get better performance with appropriate CPU mapping.
>>>>
>>>> Thanks,
>>>>
>>>> DK
>>>>
>>>>
>>>> On Mon, 15 Jun 2009, Christopher Co wrote:
>>>>
>>>>
>>>>> I am using MVAPICH2 1.4 with the default configuration (since the CX-1
>>>>> uses Mellanox InfiniBand). I am fairly certain my CPU mapping was
>>>>> on-node in both cases (incidentally, is there a way for MVAPICH2 to
>>>>> print out which nodes/cores are being used?). I have the numbers for
>>>>> Ping-Pong in the off-node case; I should have included them in my
>>>>> earlier message:
>>>>> Processes  # repetitions  #bytes    Intel MPI time (usec)  MVAPICH2 time (usec)
>>>>> 2          1000           0         4.16                   3.40
>>>>>            1000           1         4.67                   3.56
>>>>>            1000           2         4.21                   3.56
>>>>>            1000           4         4.23                   3.62
>>>>>            1000           8         4.33                   3.63
>>>>>            1000           16        4.33                   3.64
>>>>>            1000           32        4.38                   3.73
>>>>>            1000           64        4.44                   3.92
>>>>>            1000           128       5.61                   4.71
>>>>>            1000           256       5.92                   5.23
>>>>>            1000           512       6.52                   5.79
>>>>>            1000           1024      7.68                   7.06
>>>>>            1000           2048      9.97                   9.36
>>>>>            1000           4096      12.39                  11.97
>>>>>            1000           8192      17.86                  22.53
>>>>>            1000           16384     27.44                  28.27
>>>>>            1000           32768     40.32                  39.82
>>>>>            640            65536     63.61                  62.97
>>>>>            320            131072    109.69                 110.01
>>>>>            160            262144    204.71                 206.90
>>>>>            80             524288    400.72                 397.10
>>>>>            40             1048576   775.64                 776.45
>>>>>            20             2097152   1523.95                1535.65
>>>>>            10             4194304   3018.84                3054.89
>>>>>
>>>>>
>>>>>
>>>>> Chris
>>>>>
>>>>>
>>>>> Dhabaleswar Panda wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Can you tell us which version of MVAPICH2 you are using and which
>>>>>> option(s) it was configured with? Are you using the correct CPU
>>>>>> mapping in both cases?
>>>>>>
>>>>>> DK
>>>>>>
>>>>>> On Mon, 15 Jun 2009, Christopher Co wrote:
>>>>>>
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am doing performance analysis on a Cray CX1 machine. I have run the
>>>>>>> Pallas MPI benchmark and noticed a considerable performance difference
>>>>>>> between MVAPICH2 and Intel MPI on all the tests when shared memory is
>>>>>>> used. I have also run the benchmark in the non-shared-memory case, and
>>>>>>> the two performed nearly the same (MVAPICH2 was slightly faster). Is
>>>>>>> this slowdown on shared memory a known issue, and/or are there fixes
>>>>>>> or switches I can enable or disable to get more speed?
>>>>>>>
>>>>>>> To give an idea of what I'm seeing, for the simple Ping-Pong test with
>>>>>>> two processes on the same chip, the numbers look like:
>>>>>>>
>>>>>>> Processes  # repetitions  #bytes    Intel MPI time (usec)  MVAPICH2 time (usec)
>>>>>>> 2          1000           0         0.35                   0.94
>>>>>>>            1000           1         0.44                   1.24
>>>>>>>            1000           2         0.45                   1.17
>>>>>>>            1000           4         0.45                   1.08
>>>>>>>            1000           8         0.45                   1.11
>>>>>>>            1000           16        0.44                   1.13
>>>>>>>            1000           32        0.45                   1.21
>>>>>>>            1000           64        0.47                   1.35
>>>>>>>            1000           128       0.48                   1.75
>>>>>>>            1000           256       0.51                   2.92
>>>>>>>            1000           512       0.57                   3.41
>>>>>>>            1000           1024      0.76                   3.85
>>>>>>>            1000           2048      0.98                   4.27
>>>>>>>            1000           4096      1.53                   5.14
>>>>>>>            1000           8192      2.59                   8.04
>>>>>>>            1000           16384     4.86                   14.34
>>>>>>>            1000           32768     7.17                   33.92
>>>>>>>            640            65536     11.65                  43.27
>>>>>>>            320            131072    20.97                  66.98
>>>>>>>            160            262144    39.64                  118.58
>>>>>>>            80             524288    84.91                  224.40
>>>>>>>            40             1048576   212.76                 461.80
>>>>>>>            20             2097152   458.55                 1053.67
>>>>>>>            10             4194304   1738.30                2649.30
>>>>>>>
>>>>>>>
>>>>>>> Hopefully the table came out clearly. MVAPICH2 always lags behind by
>>>>>>> a considerable margin. Any insight is much appreciated. Thanks!
>>>>>>>
>>>>>>>
>>>>>>> Chris Co
>>>>>>> _______________________________________________
>>>>>>> mvapich-discuss mailing list
>>>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>>>
>>>>
>>>>
>>
>>
>
>