[mvapich-discuss] Shared Memory Performance

Dhabaleswar Panda panda at cse.ohio-state.edu
Fri Jun 26 16:50:29 EDT 2009


Chris - Thanks for the detailed investigation of the issues and insights
here.  Glad to know that you are getting the desired performance with
MVAPICH2 now. For the 0- and 1-byte sizes, the MPI_Barrier used in IMB might be
skewing the results; the IMB folks can provide more insight here.
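
One way to test that hypothesis (an illustrative sketch only, not IMB's
actual code; the message pattern and repetition count are just for
illustration) is to time each repetition of a 0-byte ping-pong separately
and compare the first few round trips after the barrier with the best
one; any skew in leaving the barrier, plus one-time setup costs, shows up
in those first iterations:

#include <mpi.h>
#include <stdio.h>

#define REPS 1000

int main(int argc, char **argv)
{
    int rank;
    double t[REPS], best = 1e9;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);            /* same synchronization IMB uses */
    for (int i = 0; i < REPS; i++) {
        double t0 = MPI_Wtime();
        if (rank == 0) {
            MPI_Send(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
        t[i] = (MPI_Wtime() - t0) * 1e6;    /* round-trip time in usec */
        if (t[i] < best)
            best = t[i];
    }
    if (rank == 0)
        printf("first 3 round trips: %.2f %.2f %.2f usec, best: %.2f usec\n",
               t[0], t[1], t[2], best);
    MPI_Finalize();
    return 0;
}

If the first handful of values dwarf the rest, averaging over all 1000
repetitions will inflate the reported 0- and 1-byte latencies, which would
explain the discrepancy against the OSU numbers.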

Thanks,

DK

On Fri, 26 Jun 2009, Christopher Co wrote:

> I have found the source of the problem with the shared-memory latency
> in the IMB Ping Pong test.  After a lot of digging, I found that IMB's
> default initialization enables MPI_THREAD_MULTIPLE and that, in the
> Ping Pong source code, the receives use MPI_ANY_SOURCE rather than the
> peer rank.  These two factors were skewing all the results except for
> Intel's MPI.  After changing IMB to initialize with MPI_THREAD_SINGLE
> and replacing MPI_ANY_SOURCE with the correct source rank, I produced
> similar numbers (using cores 5 and 7 for further increased performance)
> across IMB Ping Pong, OSU Latency, and my own basic Ping Pong timing.
> The numbers are below.  There is still an unexplained issue where the
> 0- and 1-byte latencies are off (and it looks like the OSU Latency
> numbers are correct here).  From my testing, I noticed that the first
> part of the 1000 repetitions IMB ran for the 0-byte case produced
> extremely high values.  IMB does do an MPI_Barrier before it starts, to
> ensure that the send and receive sides start together.
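>
> As an untested sketch of what the corrected benchmark boils down to (this
> is not the actual IMB source; the 8-byte size and repetition count are
> just for illustration), the two changes are to request only
> MPI_THREAD_SINGLE at initialization and to receive from the known peer
> rank instead of MPI_ANY_SOURCE:
>
> #include <mpi.h>
> #include <stdio.h>
>
> int main(int argc, char **argv)
> {
>     int provided, rank, peer;
>     char buf[8] = {0};
>
>     /* Request MPI_THREAD_SINGLE rather than MPI_THREAD_MULTIPLE, so the
>      * library does not have to pay for thread safety on the fast path. */
>     MPI_Init_thread(&argc, &argv, MPI_THREAD_SINGLE, &provided);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     peer = 1 - rank;                      /* assumes exactly two ranks */
>
>     double t0 = MPI_Wtime();
>     for (int i = 0; i < 1000; i++) {
>         if (rank == 0) {
>             MPI_Send(buf, sizeof buf, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
>             /* Receive from the known peer, not MPI_ANY_SOURCE. */
>             MPI_Recv(buf, sizeof buf, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
>                      MPI_STATUS_IGNORE);
>         } else {
>             MPI_Recv(buf, sizeof buf, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
>                      MPI_STATUS_IGNORE);
>             MPI_Send(buf, sizeof buf, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
>         }
>     }
>     if (rank == 0)
>         printf("8-byte one-way latency: %.2f usec\n",
>                (MPI_Wtime() - t0) * 1e6 / (2.0 * 1000));
>
>     MPI_Finalize();
>     return 0;
> }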
>
>
> #---------------------------------------------------
> #    Intel (R) MPI Benchmark Suite V3.0, MPI-1 part
> #---------------------------------------------------
> #---------------------------------------------------
> # Benchmarking PingPong
> # #processes = 2
> #---------------------------------------------------
>        #bytes #repetitions      t[usec]   Mbytes/sec
>             0         1000         0.37         0.00
>             1         1000         0.43         2.19
>             2         1000         0.40         4.82
>             4         1000         0.46         8.21
>             8         1000         0.44        17.16
>            16         1000         0.45        34.10
>            32         1000         0.47        65.42
>            64         1000         0.48       127.55
>           128         1000         0.51       237.48
>           256         1000         0.56       435.19
>           512         1000         0.66       742.70
>          1024         1000         0.83      1171.62
>          2048         1000         1.17      1669.28
>          4096         1000         1.84      2124.07
>          8192         1000         3.24      2409.85
>         16384         1000         6.25      2501.23
>         32768         1000        10.77      2901.97
>         65536          640        16.68      3747.09
>        131072          320        25.72      4860.57
>        262144          160        43.62      5730.71
>        524288           80        81.07      6167.53
>       1048576           40       173.55      5762.10
>       2097152           20      1165.35      1716.23
>       4194304           10      2689.10      1487.49
>
> # OSU MPI Latency Test v3.1.1
> # Size            Latency (us)
> 0                         0.30
> 1                         0.38
> 2                         0.39
> 4                         0.46
> 8                         0.44
> 16                        0.44
> 32                        0.46
> 64                        0.47
> 128                       0.49
> 256                       0.53
> 512                       0.63
> 1024                      0.79
> 2048                      1.11
> 4096                      1.80
> 8192                      3.24
> 16384                     6.36
> 32768                    10.99
> 65536                    16.34
> 131072                   24.75
> 262144                   41.51
> 524288                   75.74
> 1048576                 157.31
> 2097152                1159.87
> 4194304                2696.29
>
>
>
>
> Christopher Co wrote:
> > I have found that the CX-1 I am running on has two Intel Xeon E5472 3
> > GHz processors (Harpertown).  Your test results were on Nehalem
> > processors.  When I do get the correct CPU mapping, I see roughly
> > 0.8 usec for an 8-byte Ping Pong.  I wonder if this can account for
> > the discrepancy.  Anyway, I'll investigate this further and get more
> > data, but I wanted to throw this information out there in case it is
> > helpful.
> >
> >
> > Chris
> >
> > Christopher Co wrote:
> >
> >> Those specifications are correct.  I am seeing that the MV2_CPU_MAPPING
> >> option has no effect on which cores are chosen, so when I launch a
> >> Ping Pong, two cores are arbitrarily chosen by mpirun_rsh.  One thing
> >> that might be hindering PLPA support is that I do not have sudo/root
> >> access on the machine; I installed everything into my home directory.
> >> Could this be the issue?
> >>
> >>
> >> Chris
> >>
> >> Dhabaleswar Panda wrote:
> >>
> >>
> >>> Could you let us know what issues you are seeing when using
> >>> MV2_CPU_MAPPING?  The PLPA support is embedded in the MVAPICH2 code; it
> >>> does not require any additional configure/install step.  I am assuming
> >>> that you are using the Gen2 (OFED) interface with mpirun_rsh and that
> >>> your systems are Linux-based.
> >>>
> >>> Thanks,
> >>>
> >>> DK
> >>>
> >>>
> >>> On Tue, 16 Jun 2009, Christopher Co wrote:
> >>>
> >>>
> >>>
> >>>
> >>>> I am having issues with running processes on the cores I specify using
> >>>> MV2_CPU_MAPPING. Is the PLPA support for mapping MPI processes to cores
> >>>> embedded in MVAPICH2 or does it link to an existing PLPA on
> >>>> configure/install? Also, I want to confirm that no extra configure
> >>>> options are needed to enable this feature.
> >>>>
> >>>>
> >>>> Thanks,
> >>>> Chris
> >>>>
> >>>> Dhabaleswar Panda wrote:
> >>>>
> >>>>
> >>>>
> >>>>> Thanks for letting us know that you are using MVAPICH2 1.4.  I believe you
> >>>>> are taking numbers on Intel systems. Please note that on Intel systems,
> >>>>> two cores next to each other within the same chip are numbered as 0 and 4
> >>>>> (not 0 and 1). Thus the default mapping (cores 0 and 1) places the two
> >>>>> processes on different chips, which is why you are seeing worse
> >>>>> performance. Please run your tests across cores 0 and 4 and you should
> >>>>> see better performance. Depending on which pair of cores you use, you
> >>>>> may see some differences in performance for short and large messages
> >>>>> (depending on whether the cores are within the same chip, the same
> >>>>> socket, or across sockets). I am attaching some numbers below from our
> >>>>> Nehalem system with these two CPU mappings so you can see the
> >>>>> performance difference.
> >>>>>
> >>>>> MVAPICH2 provides flexible mapping of MPI processes to cores within a
> >>>>> node. You can try out various pairs and you will see the performance
> >>>>> difference. More details on such mapping are available here:
> >>>>>
> >>>>> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4rc1.html#x1-360006.8
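> >>>>>
> >>>>> As a quick sanity check on the mapping (this is just an illustrative
> >>>>> helper, not part of MVAPICH2 or the OSU benchmarks; it assumes Linux
> >>>>> with glibc's sched_getcpu), a tiny MPI program can report which core
> >>>>> each rank actually lands on when launched with and without
> >>>>> MV2_CPU_MAPPING:
> >>>>>
> >>>>> #define _GNU_SOURCE
> >>>>> #include <sched.h>      /* sched_getcpu() (Linux/glibc) */
> >>>>> #include <unistd.h>     /* gethostname() */
> >>>>> #include <stdio.h>
> >>>>> #include <mpi.h>
> >>>>>
> >>>>> int main(int argc, char **argv)
> >>>>> {
> >>>>>     int rank;
> >>>>>     char host[64];
> >>>>>
> >>>>>     MPI_Init(&argc, &argv);
> >>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >>>>>     gethostname(host, sizeof host);
> >>>>>     /* With pinning in effect, the reported core should stay fixed
> >>>>>      * and match the MV2_CPU_MAPPING setting. */
> >>>>>     printf("rank %d on %s, core %d\n", rank, host, sched_getcpu());
> >>>>>     MPI_Finalize();
> >>>>>     return 0;
> >>>>> }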
> >>>>>
> >>>>> Also, starting with MVAPICH2 1.4, a new single-copy kernel-based
> >>>>> shared-memory scheme (LiMIC2) has been introduced. It is `off' by
> >>>>> default; you can use it to get better performance for larger message
> >>>>> sizes. You need to configure with enable-limic2 and also set
> >>>>> MV2_SMP_USE_LIMIC2=1 at run time.  More details are available here:
> >>>>>
> >>>>> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4rc1.html#x1-370006.9
> >>>>>
> >>>>> Here are some performance numbers with different CPU mappings.
> >>>>>
> >>>>> OSU MPI latency with Default CPU mapping (LiMIC2 is off)
> >>>>> --------------------------------------------------------
> >>>>>
> >>>>> # OSU MPI Latency Test v3.1.1
> >>>>> # Size            Latency (us)
> >>>>> 0                         0.77
> >>>>> 1                         0.95
> >>>>> 2                         0.95
> >>>>> 4                         0.94
> >>>>> 8                         0.94
> >>>>> 16                        0.94
> >>>>> 32                        0.96
> >>>>> 64                        0.99
> >>>>> 128                       1.09
> >>>>> 256                       1.22
> >>>>> 512                       1.37
> >>>>> 1024                      1.61
> >>>>> 2048                      1.79
> >>>>> 4096                      2.43
> >>>>> 8192                      5.42
> >>>>> 16384                     6.73
> >>>>> 32768                     9.57
> >>>>> 65536                    15.34
> >>>>> 131072                   28.71
> >>>>> 262144                   53.13
> >>>>> 524288                  100.24
> >>>>> 1048576                 199.98
> >>>>> 2097152                 387.28
> >>>>> 4194304                 991.68
> >>>>>
> >>>>> OSU MPI latency with CPU mapping 0:4 (LiMIC2 is off)
> >>>>> ----------------------------------------------------
> >>>>>
> >>>>> # OSU MPI Latency Test v3.1.1
> >>>>> # Size            Latency (us)
> >>>>> 0                         0.34
> >>>>> 1                         0.40
> >>>>> 2                         0.40
> >>>>> 4                         0.40
> >>>>> 8                         0.40
> >>>>> 16                        0.40
> >>>>> 32                        0.42
> >>>>> 64                        0.42
> >>>>> 128                       0.45
> >>>>> 256                       0.50
> >>>>> 512                       0.55
> >>>>> 1024                      0.67
> >>>>> 2048                      0.91
> >>>>> 4096                      1.35
> >>>>> 8192                      3.66
> >>>>> 16384                     5.01
> >>>>> 32768                     7.41
> >>>>> 65536                    12.90
> >>>>> 131072                   25.21
> >>>>> 262144                   49.71
> >>>>> 524288                   97.17
> >>>>> 1048576                 187.50
> >>>>> 2097152                 465.57
> >>>>> 4194304                1196.31
> >>>>>
> >>>>> Let us know if you get better performance with appropriate CPU mapping.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> DK
> >>>>>
> >>>>>
> >>>>> On Mon, 15 Jun 2009, Christopher Co wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>> I am using MVAPICH2 1.4 with the default configuration (since the CX-1
> >>>>>> uses Mellanox Infiniband).  I am fairly certain my CPU mapping was
> >>>>>> on-node for both cases (as an aside, is there a way to have MVAPICH2
> >>>>>> print out which nodes/cores each process is running on?).  I have the
> >>>>>> Ping Pong numbers for the off-node case; I should have included them
> >>>>>> in my earlier message:
> >>>>>> Ping Pong, 2 processes:
> >>>>>>
> >>>>>> #repetitions    #bytes  Intel MPI (usec)  MVAPICH2 (usec)
> >>>>>>         1000         0              4.16              3.4
> >>>>>>         1000         1              4.67             3.56
> >>>>>>         1000         2              4.21             3.56
> >>>>>>         1000         4              4.23             3.62
> >>>>>>         1000         8              4.33             3.63
> >>>>>>         1000        16              4.33             3.64
> >>>>>>         1000        32              4.38             3.73
> >>>>>>         1000        64              4.44             3.92
> >>>>>>         1000       128              5.61             4.71
> >>>>>>         1000       256              5.92             5.23
> >>>>>>         1000       512              6.52             5.79
> >>>>>>         1000      1024              7.68             7.06
> >>>>>>         1000      2048              9.97             9.36
> >>>>>>         1000      4096             12.39            11.97
> >>>>>>         1000      8192             17.86            22.53
> >>>>>>         1000     16384             27.44            28.27
> >>>>>>         1000     32768             40.32            39.82
> >>>>>>          640     65536             63.61            62.97
> >>>>>>          320    131072            109.69           110.01
> >>>>>>          160    262144            204.71            206.9
> >>>>>>           80    524288            400.72            397.1
> >>>>>>           40   1048576            775.64           776.45
> >>>>>>           20   2097152           1523.95          1535.65
> >>>>>>           10   4194304           3018.84          3054.89
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Chris
> >>>>>>
> >>>>>>
> >>>>>> Dhabaleswar Panda wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> Can you tell us which version of MVAPICH2 you are using and which
> >>>>>>> option(s) are configured? Are you using correct CPU mapping in both
> >>>>>>> cases?
> >>>>>>>
> >>>>>>> DK
> >>>>>>>
> >>>>>>> On Mon, 15 Jun 2009, Christopher Co wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I am doing performance analysis on a Cray CX1 machine.  I have run the
> >>>>>>>> Pallas MPI benchmark and have noticed a considerable performance
> >>>>>>>> difference between MVAPICH2 and Intel MPI on all the tests when shared
> >>>>>>>> memory is used.  I have also run the benchmark for non-shared memory and
> >>>>>>>> the two performed nearly the same (MVAPICH2 was slightly faster).  Is
> >>>>>>>> this slowdown on shared memory a known issue and/or are there fixes or
> >>>>>>>> switches I can enable or disable to get more speed?
> >>>>>>>>
> >>>>>>>> To give an idea of what I'm seeing, for the simple Ping Pong test for
> >>>>>>>> two processes on the same chip, the numbers look like this:
> >>>>>>>>
> >>>>>>>> #repetitions    #bytes  Intel MPI (usec)  MVAPICH2 (usec)
> >>>>>>>>         1000         0              0.35             0.94
> >>>>>>>>         1000         1              0.44             1.24
> >>>>>>>>         1000         2              0.45             1.17
> >>>>>>>>         1000         4              0.45             1.08
> >>>>>>>>         1000         8              0.45             1.11
> >>>>>>>>         1000        16              0.44             1.13
> >>>>>>>>         1000        32              0.45             1.21
> >>>>>>>>         1000        64              0.47             1.35
> >>>>>>>>         1000       128              0.48             1.75
> >>>>>>>>         1000       256              0.51             2.92
> >>>>>>>>         1000       512              0.57             3.41
> >>>>>>>>         1000      1024              0.76             3.85
> >>>>>>>>         1000      2048              0.98             4.27
> >>>>>>>>         1000      4096              1.53             5.14
> >>>>>>>>         1000      8192              2.59             8.04
> >>>>>>>>         1000     16384              4.86            14.34
> >>>>>>>>         1000     32768              7.17            33.92
> >>>>>>>>          640     65536             11.65            43.27
> >>>>>>>>          320    131072             20.97            66.98
> >>>>>>>>          160    262144             39.64           118.58
> >>>>>>>>           80    524288             84.91           224.40
> >>>>>>>>           40   1048576            212.76           461.80
> >>>>>>>>           20   2097152            458.55          1053.67
> >>>>>>>>           10   4194304           1738.30          2649.30
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hopefully the table came through clearly.  MVAPICH2 always lags behind by a
> >>>>>>>> considerable amount.  Any insight is much appreciated.  Thanks!
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Chris Co
>



More information about the mvapich-discuss mailing list