[mvapich-discuss] Shared Memory Performance
Christopher Co
cco2 at cray.com
Fri Jun 26 15:01:09 EDT 2009
I have found the source of the shared memory latency problem seen with
the IMB Ping Pong test. After a lot of digging, I found that IMB's
default initialization enables MPI_THREAD_MULTIPLE, and that in the
Ping Pong source code the "source" argument of the MPI_Send/MPI_Recv
calls uses MPI_ANY_SOURCE. These two factors were skewing all the
results except Intel MPI's. After changing IMB to initialize with
MPI_THREAD_SINGLE and setting the source rank to the correct value, I
produced similar numbers (using cores 5 and 7 for further increased
performance) across IMB Ping Pong, OSU Latency, and my own basic Ping
Pong timing. The numbers are below. There is still an unknown issue
where the 0 and 1 byte latencies are off (and it looks like the OSU
Latency numbers are the correct ones here). From my testing, I noticed
that the first part of the 1000 repetitions IMB ran for the 0 byte
latency produced extremely high values, even though IMB does an
MPI_Barrier before it starts to ensure that the sends/receives start
together.
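To make the two changes concrete, here is a minimal two-rank ping-pong
sketch of my own (an illustration, not the actual IMB source; names and
message size are placeholders). It initializes with MPI_THREAD_SINGLE
and receives from an explicit peer rank rather than MPI_ANY_SOURCE.
Build with mpicc and run with exactly 2 ranks:

```c
/* Minimal ping-pong sketch illustrating the two fixes described above:
 * (1) request only MPI_THREAD_SINGLE at initialization, and
 * (2) receive from a known peer rank instead of MPI_ANY_SOURCE. */
#include <mpi.h>
#include <stdio.h>

#define REPS   1000
#define NBYTES 8

int main(int argc, char **argv)
{
    int provided, rank, size, peer, i;
    char buf[NBYTES] = {0};
    double t0, t1;

    /* Fix 1: ask only for MPI_THREAD_SINGLE (plain MPI_Init also works). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SINGLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    peer = 1 - rank;               /* explicit partner rank: 0 <-> 1 */

    MPI_Barrier(MPI_COMM_WORLD);   /* start both sides together */
    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, NBYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            /* Fix 2: receive from the known peer, not MPI_ANY_SOURCE. */
            MPI_Recv(buf, NBYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, NBYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, NBYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();
    if (rank == 0)   /* one-way latency: total time / (2 * repetitions) */
        printf("%d bytes: %.2f usec\n", NBYTES,
               (t1 - t0) * 1e6 / (2.0 * REPS));
    MPI_Finalize();
    return 0;
}
```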
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V3.0, MPI-1 part
#---------------------------------------------------
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
    #bytes  #repetitions    t[usec]  Mbytes/sec
         0          1000       0.37        0.00
         1          1000       0.43        2.19
         2          1000       0.40        4.82
         4          1000       0.46        8.21
         8          1000       0.44       17.16
        16          1000       0.45       34.10
        32          1000       0.47       65.42
        64          1000       0.48      127.55
       128          1000       0.51      237.48
       256          1000       0.56      435.19
       512          1000       0.66      742.70
      1024          1000       0.83     1171.62
      2048          1000       1.17     1669.28
      4096          1000       1.84     2124.07
      8192          1000       3.24     2409.85
     16384          1000       6.25     2501.23
     32768          1000      10.77     2901.97
     65536           640      16.68     3747.09
    131072           320      25.72     4860.57
    262144           160      43.62     5730.71
    524288            80      81.07     6167.53
   1048576            40     173.55     5762.10
   2097152            20    1165.35     1716.23
   4194304            10    2689.10     1487.49
# OSU MPI Latency Test v3.1.1
# Size     Latency (us)
      0            0.30
      1            0.38
      2            0.39
      4            0.46
      8            0.44
     16            0.44
     32            0.46
     64            0.47
    128            0.49
    256            0.53
    512            0.63
   1024            0.79
   2048            1.11
   4096            1.80
   8192            3.24
  16384            6.36
  32768           10.99
  65536           16.34
 131072           24.75
 262144           41.51
 524288           75.74
1048576          157.31
2097152         1159.87
4194304         2696.29
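For reference, the core pinning used for the runs above can be
requested on the command line with MV2_CPU_MAPPING. A sketch, assuming
mpirun_rsh and placeholder hostnames:

```shell
# Pin the two ranks to cores 5 and 7 on one node
# (node01 is a placeholder hostname):
MV2_CPU_MAPPING=5:7 mpirun_rsh -np 2 node01 node01 ./osu_latency
```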
Christopher Co wrote:
> I have found that the CX-1 I am running on has two Intel Xeon E5472 3
> GHz processors (Harpertown). Your test results were on Nehalem
> processors. Once the correct CPU mapping was applied, I got roughly
> 0.8 usec to Ping Pong 8 bytes. I wonder if this can account for the
> discrepancy. Anyways, I'll investigate this further and get more data,
> but I wanted to throw this information out there in case it can be
> helpful.
>
>
> Chris
>
> Christopher Co wrote:
>
>> Those specifications are correct. I am seeing that the MV2_CPU_MAPPING
>> option does not have an effect on which cores are chosen so when I
>> launch a Ping-Pong, 2 cores are arbitrarily chosen by mpirun_rsh. One
>> thing that might be hindering PLPA support is that I do not have
>> sudo/root access on the machine. I installed everything into my home
>> directory. Could this be the issue?
>>
>>
>> Chris
>>
>> Dhabaleswar Panda wrote:
>>
>>
>>> Could you let us know what issues you are seeing when using
>>> MV2_CPU_MAPPING? The PLPA support is embedded in the MVAPICH2 code. It does
>>> not require any additional configure/install. I am assuming that you are
>>> using the Gen2 (OFED) interface with mpirun_rsh and your systems are
>>> Linux-based.
>>>
>>> Thanks,
>>>
>>> DK
>>>
>>>
>>> On Tue, 16 Jun 2009, Christopher Co wrote:
>>>
>>>> I am having issues with running processes on the cores I specify using
>>>> MV2_CPU_MAPPING. Is the PLPA support for mapping MPI processes to cores
>>>> embedded in MVAPICH2 or does it link to an existing PLPA on
>>>> configure/install? Also, I want to confirm that no extra configure
>>>> options are needed to enable this feature.
>>>>
>>>>
>>>> Thanks,
>>>> Chris
>>>>
>>>> Dhabaleswar Panda wrote:
>>>>
>>>>> Thanks for letting us know that you are using MVAPICH2 1.4. I believe you
>>>>> are taking numbers on Intel systems. Please note that on Intel systems,
>>>>> two cores next to each other within the same chip are numbered as 0 and 4
>>>>> (not 0 and 1). Thus, the default setting (with processes 0 and 1) runs
>>>>> across the chips, and that is why you are seeing worse performance. Please run
>>>>> your tests across cores 0 and 4 and you should be able to see better
>>>>> performance. Depending on which pairs of processes you use, you may see
>>>>> some differences in performance for short and large messages (depends on
>>>>> whether these cores are within the same chip, same socket or across
>>>>> sockets). I am attaching some numbers below on our Nehalem system with
>>>>> these two CPU mappings and you can see the performance difference.
>>>>>
>>>>> MVAPICH2 provides flexible mapping of MPI processes to cores within a
>>>>> node. You can try out performance across various pairs and you will see
>>>>> performance difference. More details on such mapping are available from
>>>>> here:
>>>>>
>>>>> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4rc1.html#x1-360006.8
>>>>>
>>>>> Also, starting from MVAPICH2 1.4, a new single-copy kernel-based
>>>>> shared-memory scheme (LiMIC2) is introduced. This is `off' by default.
>>>>> You can use it to get better performance for larger message sizes. You
>>>>> need to configure with enable-limic2 and you also need to use
>>>>> MV2_SMP_USE_LIMIC2=1. More details are available from here:
>>>>>
>>>>> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4rc1.html#x1-370006.9
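For reference, the LiMIC2 steps described in the quoted message amount
to a configure-time option plus a run-time variable. A sketch, using
the flag spelling given above and placeholder hostnames:

```shell
# Build MVAPICH2 with LiMIC2 support, then enable it at run time
# (node01 is a placeholder hostname):
./configure --enable-limic2 && make && make install
MV2_SMP_USE_LIMIC2=1 mpirun_rsh -np 2 node01 node01 ./osu_latency
```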
>>>>>
>>>>> Here are some performance numbers with different CPU mappings.
>>>>>
>>>>> OSU MPI latency with Default CPU mapping (LiMIC2 is off)
>>>>> --------------------------------------------------------
>>>>>
>>>>> # OSU MPI Latency Test v3.1.1
>>>>> # Size Latency (us)
>>>>> 0 0.77
>>>>> 1 0.95
>>>>> 2 0.95
>>>>> 4 0.94
>>>>> 8 0.94
>>>>> 16 0.94
>>>>> 32 0.96
>>>>> 64 0.99
>>>>> 128 1.09
>>>>> 256 1.22
>>>>> 512 1.37
>>>>> 1024 1.61
>>>>> 2048 1.79
>>>>> 4096 2.43
>>>>> 8192 5.42
>>>>> 16384 6.73
>>>>> 32768 9.57
>>>>> 65536 15.34
>>>>> 131072 28.71
>>>>> 262144 53.13
>>>>> 524288 100.24
>>>>> 1048576 199.98
>>>>> 2097152 387.28
>>>>> 4194304 991.68
>>>>>
>>>>> OSU MPI latency with CPU mapping 0:4 (LiMIC2 is off)
>>>>> ----------------------------------------------------
>>>>>
>>>>> # OSU MPI Latency Test v3.1.1
>>>>> # Size Latency (us)
>>>>> 0 0.34
>>>>> 1 0.40
>>>>> 2 0.40
>>>>> 4 0.40
>>>>> 8 0.40
>>>>> 16 0.40
>>>>> 32 0.42
>>>>> 64 0.42
>>>>> 128 0.45
>>>>> 256 0.50
>>>>> 512 0.55
>>>>> 1024 0.67
>>>>> 2048 0.91
>>>>> 4096 1.35
>>>>> 8192 3.66
>>>>> 16384 5.01
>>>>> 32768 7.41
>>>>> 65536 12.90
>>>>> 131072 25.21
>>>>> 262144 49.71
>>>>> 524288 97.17
>>>>> 1048576 187.50
>>>>> 2097152 465.57
>>>>> 4194304 1196.31
>>>>>
>>>>> Let us know if you get better performance with appropriate CPU mapping.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> DK
>>>>>
>>>>>
>>>>> On Mon, 15 Jun 2009, Christopher Co wrote:
>>>>>
>>>>>> I am using MVAPICH2 1.4 with the default configuration (since the CX-1
>>>>>> uses Mellanox InfiniBand). I am fairly certain my CPU mapping was
>>>>>> on-node for both cases (curiously, is there a way for MVAPICH2 to print
>>>>>> out the nodes/cores it is running on?). I have the numbers for Ping Pong
>>>>>> for the off-node case; I should have included them in my earlier message:
>>>>>> # Processes = 2
>>>>>> #repetitions   #bytes   Intel MPI time (usec)   MVAPICH2 time (usec)
>>>>>>         1000        0                    4.16                   3.4
>>>>>>         1000        1                    4.67                   3.56
>>>>>>         1000        2                    4.21                   3.56
>>>>>>         1000        4                    4.23                   3.62
>>>>>>         1000        8                    4.33                   3.63
>>>>>>         1000       16                    4.33                   3.64
>>>>>>         1000       32                    4.38                   3.73
>>>>>>         1000       64                    4.44                   3.92
>>>>>>         1000      128                    5.61                   4.71
>>>>>>         1000      256                    5.92                   5.23
>>>>>>         1000      512                    6.52                   5.79
>>>>>>         1000     1024                    7.68                   7.06
>>>>>>         1000     2048                    9.97                   9.36
>>>>>>         1000     4096                   12.39                  11.97
>>>>>>         1000     8192                   17.86                  22.53
>>>>>>         1000    16384                   27.44                  28.27
>>>>>>         1000    32768                   40.32                  39.82
>>>>>>          640    65536                   63.61                  62.97
>>>>>>          320   131072                  109.69                 110.01
>>>>>>          160   262144                  204.71                 206.9
>>>>>>           80   524288                  400.72                 397.1
>>>>>>           40  1048576                  775.64                 776.45
>>>>>>           20  2097152                 1523.95                1535.65
>>>>>>           10  4194304                 3018.84                3054.89
>>>>>>
>>>>>>
>>>>>>
>>>>>> Chris
>>>>>>
>>>>>>
>>>>>> Dhabaleswar Panda wrote:
>>>>>>
>>>>>>> Can you tell us which version of MVAPICH2 you are using and which
>>>>>>> option(s) are configured? Are you using correct CPU mapping in both
>>>>>>> cases?
>>>>>>>
>>>>>>> DK
>>>>>>>
>>>>>>> On Mon, 15 Jun 2009, Christopher Co wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am doing performance analysis on a Cray CX1 machine. I have run the
>>>>>>>> Pallas MPI benchmark and have noticed a considerable performance
>>>>>>>> difference between MVAPICH2 and Intel MPI on all the tests when shared
>>>>>>>> memory is used. I have also run the benchmark for non-shared memory and
>>>>>>>> the two performed nearly the same (MVAPICH2 was slightly faster). Is
>>>>>>>> this slowdown on shared memory a known issue and/or are there fixes or
>>>>>>>> switches I can enable or disable to get more speed?
>>>>>>>>
>>>>>>>> To give an idea of what I'm seeing, for the simple Ping Pong test for
>>>>>>>> two processes on the same chip, the numbers look like:
>>>>>>>>
>>>>>>>> # Processes = 2
>>>>>>>> #repetitions   #bytes   Intel MPI time (usec)   MVAPICH2 time (usec)
>>>>>>>>         1000        0                    0.35                   0.94
>>>>>>>>         1000        1                    0.44                   1.24
>>>>>>>>         1000        2                    0.45                   1.17
>>>>>>>>         1000        4                    0.45                   1.08
>>>>>>>>         1000        8                    0.45                   1.11
>>>>>>>>         1000       16                    0.44                   1.13
>>>>>>>>         1000       32                    0.45                   1.21
>>>>>>>>         1000       64                    0.47                   1.35
>>>>>>>>         1000      128                    0.48                   1.75
>>>>>>>>         1000      256                    0.51                   2.92
>>>>>>>>         1000      512                    0.57                   3.41
>>>>>>>>         1000     1024                    0.76                   3.85
>>>>>>>>         1000     2048                    0.98                   4.27
>>>>>>>>         1000     4096                    1.53                   5.14
>>>>>>>>         1000     8192                    2.59                   8.04
>>>>>>>>         1000    16384                    4.86                  14.34
>>>>>>>>         1000    32768                    7.17                  33.92
>>>>>>>>          640    65536                   11.65                  43.27
>>>>>>>>          320   131072                   20.97                  66.98
>>>>>>>>          160   262144                   39.64                 118.58
>>>>>>>>           80   524288                   84.91                 224.40
>>>>>>>>           40  1048576                  212.76                 461.80
>>>>>>>>           20  2097152                  458.55                1053.67
>>>>>>>>           10  4194304                 1738.30                2649.30
>>>>>>>>
>>>>>>>>
>>>>>>>> Hopefully the table came out clear. MVAPICH2 always lags behind by a
>>>>>>>> considerable amount. Any insight is much appreciated. Thanks!
>>>>>>>>
>>>>>>>>
>>>>>>>> Chris Co
>>>>>>>> _______________________________________________
>>>>>>>> mvapich-discuss mailing list
>>>>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>>>>
>>>>>
>>>
>