[mvapich-discuss] Shared Memory Performance

Christopher Co cco2 at cray.com
Wed Jun 17 17:21:07 EDT 2009


I have found that the CX-1 I am running on has two Intel Xeon E5472 3
GHz processors (Harpertown), while your test results were on Nehalem
processors.  With the correct CPU mapping, I've gotten roughly 0.8 usec
to ping-pong 8 bytes.  I wonder if this can account for the
discrepancy.  Anyway, I'll investigate this further and get more data,
but I wanted to throw this information out there in case it is
helpful.
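
For reference, here is a minimal sketch (not from the original thread)
that makes each MPI rank report the host and logical CPU it actually
runs on, so the effect of MV2_CPU_MAPPING can be checked directly; it
assumes Linux/glibc for sched_getcpu(), and the file name is arbitrary:

    /* rank_placement.c -- minimal sketch, assuming Linux/glibc for
     * sched_getcpu().  Each rank reports the host it runs on and the
     * logical CPU it is currently scheduled on, so the effect of
     * MV2_CPU_MAPPING can be verified. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &len);
        printf("rank %d: host %s, cpu %d\n", rank, host, sched_getcpu());
        MPI_Finalize();
        return 0;
    }

Built with mpicc and launched with two ranks under different
MV2_CPU_MAPPING settings, it shows whether the requested binding is
actually being honored (and also answers the earlier question about
printing the nodes/cores in use).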


Chris

Christopher Co wrote:
> Those specifications are correct.  I am seeing that the MV2_CPU_MAPPING
> option has no effect on which cores are chosen, so when I launch a
> Ping-Pong, two cores are arbitrarily chosen by mpirun_rsh.  One thing
> that might be hindering PLPA support is that I do not have sudo/root
> access on the machine; I installed everything into my home directory.
> Could this be the issue?
>
>
> Chris
>
> Dhabaleswar Panda wrote:
>   
>> Could you let us know what issues you are seeing when using
>> MV2_CPU_MAPPING?  The PLPA support is embedded in the MVAPICH2 code; it
>> does not require any additional configure/install step.  I am assuming
>> that you are using the Gen2 (OFED) interface with mpirun_rsh and that
>> your systems are Linux-based.
>>
>> Thanks,
>>
>> DK
>>
>>
>> On Tue, 16 Jun 2009, Christopher Co wrote:
>>
>>> I am having issues with running processes on the cores I specify using
>>> MV2_CPU_MAPPING.  Is the PLPA support for mapping MPI processes to cores
>>> embedded in MVAPICH2, or does it link against an existing PLPA
>>> installation at configure/install time?  Also, I want to confirm that no
>>> extra configure options are needed to enable this feature.
>>>
>>>
>>> Thanks,
>>> Chris
>>>
>>> Dhabaleswar Panda wrote:
>>>> Thanks for letting us know that you are using MVAPICH2 1.4.  I believe you
>>>> are taking numbers on Intel systems.  Please note that on Intel systems,
>>>> two cores next to each other within the same chip are numbered 0 and 4
>>>> (not 0 and 1).  Thus, the default setting (with processes on cores 0
>>>> and 1) runs across the chips, which is why you are seeing worse
>>>> performance.  Please run your tests across cores 0 and 4 and you should
>>>> see better performance.  Depending on which pair of cores you use, you
>>>> may see some differences in performance for short and large messages
>>>> (depending on whether the cores are within the same chip, the same
>>>> socket, or across sockets).  I am attaching some numbers below from our
>>>> Nehalem system with these two CPU mappings so you can see the
>>>> performance difference.
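
A minimal sketch (not part of the original exchange, Linux only) of how
one can check which logical CPU numbers share a chip before choosing an
MV2_CPU_MAPPING value; the file name is arbitrary:

    /* cpu_pairs.c -- minimal sketch, Linux only: parse /proc/cpuinfo and
     * print each logical CPU's physical id (socket) and core id, to see
     * which logical CPU numbers actually share a chip. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/cpuinfo", "r");
        char line[256];
        int cpu = -1, socket = -1, core = -1;

        if (!f) { perror("/proc/cpuinfo"); return 1; }
        while (fgets(line, sizeof line, f)) {
            if (sscanf(line, "processor : %d", &cpu) == 1) continue;
            if (sscanf(line, "physical id : %d", &socket) == 1) continue;
            if (sscanf(line, "core id : %d", &core) == 1)
                printf("cpu %2d -> socket %d, core %d\n", cpu, socket, core);
        }
        fclose(f);
        return 0;
    }

The output lists each logical CPU with its socket (physical id) and core
id, which makes the 0-and-4 pairing described above easy to confirm on a
given box.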
>>>>
>>>> MVAPICH2 provides flexible mapping of MPI processes to cores within a
>>>> node.  You can try out performance across various core pairs and you
>>>> will see performance differences.  More details on such mapping are
>>>> available here:
>>>>
>>>> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4rc1.html#x1-360006.8
>>>>
>>>> Also, starting with MVAPICH2 1.4, a new single-copy, kernel-based
>>>> shared-memory scheme (LiMIC2) has been introduced.  It is `off' by
>>>> default.  You can use it to get better performance for larger message
>>>> sizes.  You need to configure with enable-limic2 and also set
>>>> MV2_SMP_USE_LIMIC2=1 at run time.  More details are available here:
>>>>
>>>> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4rc1.html#x1-370006.9
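
A minimal sketch (not from the original thread) of an intra-node
large-message ping-pong that can be run once with and once without
MV2_SMP_USE_LIMIC2=1 to see the difference; the 1 MB message size,
repetition count, and file name are arbitrary choices:

    /* big_pingpong.c -- minimal sketch of a large-message intra-node
     * ping-pong for comparing runs with and without MV2_SMP_USE_LIMIC2=1.
     * Run with exactly two ranks on the same node. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NBYTES (1 << 20)   /* 1 MB: a size where LiMIC2 is expected to help */
    #define REPS   100

    int main(int argc, char **argv)
    {
        int rank, i;
        char *buf = malloc(NBYTES);
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, 0, NBYTES);

        MPI_Barrier(MPI_COMM_WORLD);              /* sync before timing */
        t0 = MPI_Wtime();
        for (i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)   /* one-way latency: total time / (2 * round trips) */
            printf("%d bytes: %.2f usec one-way\n", NBYTES,
                   (t1 - t0) * 1e6 / (2.0 * REPS));

        free(buf);
        MPI_Finalize();
        return 0;
    }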
>>>>
>>>> Here are some performance numbers with different CPU mappings.
>>>>
>>>> OSU MPI latency with Default CPU mapping (LiMIC2 is off)
>>>> --------------------------------------------------------
>>>>
>>>> # OSU MPI Latency Test v3.1.1
>>>> # Size            Latency (us)
>>>> 0                         0.77
>>>> 1                         0.95
>>>> 2                         0.95
>>>> 4                         0.94
>>>> 8                         0.94
>>>> 16                        0.94
>>>> 32                        0.96
>>>> 64                        0.99
>>>> 128                       1.09
>>>> 256                       1.22
>>>> 512                       1.37
>>>> 1024                      1.61
>>>> 2048                      1.79
>>>> 4096                      2.43
>>>> 8192                      5.42
>>>> 16384                     6.73
>>>> 32768                     9.57
>>>> 65536                    15.34
>>>> 131072                   28.71
>>>> 262144                   53.13
>>>> 524288                  100.24
>>>> 1048576                 199.98
>>>> 2097152                 387.28
>>>> 4194304                 991.68
>>>>
>>>> OSU MPI latency with CPU mapping 0:4 (LiMIC2 is off)
>>>> ----------------------------------------------------
>>>>
>>>> # OSU MPI Latency Test v3.1.1
>>>> # Size            Latency (us)
>>>> 0                         0.34
>>>> 1                         0.40
>>>> 2                         0.40
>>>> 4                         0.40
>>>> 8                         0.40
>>>> 16                        0.40
>>>> 32                        0.42
>>>> 64                        0.42
>>>> 128                       0.45
>>>> 256                       0.50
>>>> 512                       0.55
>>>> 1024                      0.67
>>>> 2048                      0.91
>>>> 4096                      1.35
>>>> 8192                      3.66
>>>> 16384                     5.01
>>>> 32768                     7.41
>>>> 65536                    12.90
>>>> 131072                   25.21
>>>> 262144                   49.71
>>>> 524288                   97.17
>>>> 1048576                 187.50
>>>> 2097152                 465.57
>>>> 4194304                1196.31
>>>>
>>>> Let us know if you get better performance with appropriate CPU mapping.
>>>>
>>>> Thanks,
>>>>
>>>> DK
>>>>
>>>>
>>>> On Mon, 15 Jun 2009, Christopher Co wrote:
>>>>
>>>>> I am using MVAPICH2 1.4 with the default configuration (since the CX-1
>>>>> uses Mellanox InfiniBand).  I am fairly certain my CPU mapping was
>>>>> on-node for both cases (curiously, is there a way for MVAPICH2 to print
>>>>> out which nodes/cores each process is running on?).  I have the numbers
>>>>> for Ping Pong for the off-node case; I should have included them in my
>>>>> earlier message:
>>>>> Processes   # repetitions   #bytes     Intel MPI time (usec)   MVAPICH2 time (usec)
>>>>> 2           1000            0          4.16                    3.4
>>>>>             1000            1          4.67                    3.56
>>>>>             1000            2          4.21                    3.56
>>>>>             1000            4          4.23                    3.62
>>>>>             1000            8          4.33                    3.63
>>>>>             1000            16         4.33                    3.64
>>>>>             1000            32         4.38                    3.73
>>>>>             1000            64         4.44                    3.92
>>>>>             1000            128        5.61                    4.71
>>>>>             1000            256        5.92                    5.23
>>>>>             1000            512        6.52                    5.79
>>>>>             1000            1024       7.68                    7.06
>>>>>             1000            2048       9.97                    9.36
>>>>>             1000            4096       12.39                   11.97
>>>>>             1000            8192       17.86                   22.53
>>>>>             1000            16384      27.44                   28.27
>>>>>             1000            32768      40.32                   39.82
>>>>>             640             65536      63.61                   62.97
>>>>>             320             131072     109.69                  110.01
>>>>>             160             262144     204.71                  206.9
>>>>>             80              524288     400.72                  397.1
>>>>>             40              1048576    775.64                  776.45
>>>>>             20              2097152    1523.95                 1535.65
>>>>>             10              4194304    3018.84                 3054.89
>>>>>
>>>>> Chris
>>>>>
>>>>>
>>>>> Dhabaleswar Panda wrote:
>>>>>
>>>>>> Can you tell us which version of MVAPICH2 you are using and which
>>>>>> option(s) are configured? Are you using correct CPU mapping in both
>>>>>> cases?
>>>>>>
>>>>>> DK
>>>>>>
>>>>>> On Mon, 15 Jun 2009, Christopher Co wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am doing performance analysis on a Cray CX1 machine.  I have run the
>>>>>>> Pallas MPI benchmark and have noticed a considerable performance
>>>>>>> difference between MVAPICH2 and Intel MPI on all the tests when shared
>>>>>>> memory is used.  I have also run the benchmark for the non-shared-memory
>>>>>>> case, and the two performed nearly the same (MVAPICH2 was slightly
>>>>>>> faster).  Is this slowdown with shared memory a known issue, and/or are
>>>>>>> there fixes or switches I can enable or disable to get more speed?
>>>>>>>
>>>>>>> To give an idea of what I'm seeing, for the simple Ping Pong test with
>>>>>>> two processes on the same chip, the numbers look like:
>>>>>>>
>>>>>>> Processes   # repetitions   #bytes     Intel MPI time (usec)   MVAPICH2 time (usec)
>>>>>>> 2           1000            0          0.35                    0.94
>>>>>>>             1000            1          0.44                    1.24
>>>>>>>             1000            2          0.45                    1.17
>>>>>>>             1000            4          0.45                    1.08
>>>>>>>             1000            8          0.45                    1.11
>>>>>>>             1000            16         0.44                    1.13
>>>>>>>             1000            32         0.45                    1.21
>>>>>>>             1000            64         0.47                    1.35
>>>>>>>             1000            128        0.48                    1.75
>>>>>>>             1000            256        0.51                    2.92
>>>>>>>             1000            512        0.57                    3.41
>>>>>>>             1000            1024       0.76                    3.85
>>>>>>>             1000            2048       0.98                    4.27
>>>>>>>             1000            4096       1.53                    5.14
>>>>>>>             1000            8192       2.59                    8.04
>>>>>>>             1000            16384      4.86                    14.34
>>>>>>>             1000            32768      7.17                    33.92
>>>>>>>             640             65536      11.65                   43.27
>>>>>>>             320             131072     20.97                   66.98
>>>>>>>             160             262144     39.64                   118.58
>>>>>>>             80              524288     84.91                   224.40
>>>>>>>             40              1048576    212.76                  461.80
>>>>>>>             20              2097152    458.55                  1053.67
>>>>>>>             10              4194304    1738.30                 2649.30
>>>>>>>
>>>>>>> Hopefully the table came out clearly.  MVAPICH2 consistently lags
>>>>>>> behind by a considerable amount.  Any insight is much appreciated.  Thanks!
>>>>>>>
>>>>>>>
>>>>>>> Chris Co
>


