[mvapich-discuss] Shared Memory Performance

Dhabaleswar Panda panda at cse.ohio-state.edu
Mon Jun 15 20:25:13 EDT 2009


Thanks for letting us know that you are using MVAPICH2 1.4.  I believe you
are taking these numbers on Intel systems. Please note that on Intel
systems, two cores next to each other within the same chip are numbered
0 and 4 (not 0 and 1). Thus, the default mapping (processes on cores 0
and 1) runs across the chips, which is why you are seeing worse
performance. Please run your tests across cores 0 and 4 and you should
see better performance. Depending on which pair of cores you use, you may
also see some differences in performance for short and large messages
(depending on whether the cores are within the same chip, the same
socket, or across sockets). I am attaching some numbers below from our
Nehalem system with these two CPU mappings so that you can see the
performance difference.
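
If you want to double-check which logical cores share a chip on your own
system, the `physical id' field in /proc/cpuinfo shows the socket each
core belongs to. This is just a quick Linux-side check, independent of
MVAPICH2:

  # list each logical core together with the socket (chip) it sits on
  grep -E 'processor|physical id' /proc/cpuinfo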

MVAPICH2 provides flexible mapping of MPI processes to cores within a
node. You can try out various pairs of cores and you will see the
performance differences. More details on such mapping are available
here:

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4rc1.html#x1-360006.8
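
For example, to pin the two processes to cores 0 and 4 using the
MV2_CPU_MAPPING variable described in that section, you can launch the
OSU latency test along these lines (the hostname and the path to the
benchmark binary are placeholders for your own setup):

  # run two ranks on one node, bound to cores 0 and 4
  mpirun_rsh -np 2 node1 node1 MV2_CPU_MAPPING=0:4 ./osu_latency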

Also, starting with MVAPICH2 1.4, a new single-copy kernel-based
shared-memory scheme (LiMIC2) has been introduced. It is `off' by
default. You can use it to get better performance for larger message
sizes. You need to configure with --enable-limic2 and also set
MV2_SMP_USE_LIMIC2=1 at run time.  More details are available here:

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4rc1.html#x1-370006.9
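
As a rough sketch of the two steps (the install prefix and hostnames are
placeholders, and LiMIC2 itself must already be installed with its kernel
module loaded):

  # build MVAPICH2 with LiMIC2 support
  ./configure --prefix=/opt/mvapich2-1.4 --enable-limic2
  make && make install

  # enable the single-copy path at run time
  mpirun_rsh -np 2 node1 node1 MV2_SMP_USE_LIMIC2=1 ./osu_latency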

Here are some performance numbers with different CPU mappings.

OSU MPI latency with Default CPU mapping (LiMIC2 is off)
--------------------------------------------------------

# OSU MPI Latency Test v3.1.1
# Size            Latency (us)
0                         0.77
1                         0.95
2                         0.95
4                         0.94
8                         0.94
16                        0.94
32                        0.96
64                        0.99
128                       1.09
256                       1.22
512                       1.37
1024                      1.61
2048                      1.79
4096                      2.43
8192                      5.42
16384                     6.73
32768                     9.57
65536                    15.34
131072                   28.71
262144                   53.13
524288                  100.24
1048576                 199.98
2097152                 387.28
4194304                 991.68

OSU MPI latency with CPU mapping 0:4 (LiMIC2 is off)
----------------------------------------------------

# OSU MPI Latency Test v3.1.1
# Size            Latency (us)
0                         0.34
1                         0.40
2                         0.40
4                         0.40
8                         0.40
16                        0.40
32                        0.42
64                        0.42
128                       0.45
256                       0.50
512                       0.55
1024                      0.67
2048                      0.91
4096                      1.35
8192                      3.66
16384                     5.01
32768                     7.41
65536                    12.90
131072                   25.21
262144                   49.71
524288                   97.17
1048576                 187.50
2097152                 465.57
4194304                1196.31

Let us know if you get better performance with the appropriate CPU
mapping.

Thanks,

DK


On Mon, 15 Jun 2009, Christopher Co wrote:

> I am using MVAPICH2 1.4 with the default configuration (since the CX-1
> uses Mellanox InfiniBand).  I am fairly certain my CPU mapping was
> on-node for both cases (out of curiosity, is there a way for MVAPICH2 to
> print out which nodes/cores the processes are running on?).  I have the
> numbers for Ping Pong for the off-node case; I should have included
> these in my earlier message:
> Processes  # repetitions  #bytes    Intel MPI time (usec)  MVAPICH2 time (usec)
> 2          1000           0         4.16                   3.4
>            1000           1         4.67                   3.56
>            1000           2         4.21                   3.56
>            1000           4         4.23                   3.62
>            1000           8         4.33                   3.63
>            1000           16        4.33                   3.64
>            1000           32        4.38                   3.73
>            1000           64        4.44                   3.92
>            1000           128       5.61                   4.71
>            1000           256       5.92                   5.23
>            1000           512       6.52                   5.79
>            1000           1024      7.68                   7.06
>            1000           2048      9.97                   9.36
>            1000           4096      12.39                  11.97
>            1000           8192      17.86                  22.53
>            1000           16384     27.44                  28.27
>            1000           32768     40.32                  39.82
>            640            65536     63.61                  62.97
>            320            131072    109.69                 110.01
>            160            262144    204.71                 206.9
>            80             524288    400.72                 397.1
>            40             1048576   775.64                 776.45
>            20             2097152   1523.95                1535.65
>            10             4194304   3018.84                3054.89
>
>
>
> Chris
>
>
> Dhabaleswar Panda wrote:
> > Can you tell us which version of MVAPICH2 you are using and which
> > option(s) are configured? Are you using correct CPU mapping in both
> > cases?
> >
> > DK
> >
> > On Mon, 15 Jun 2009, Christopher Co wrote:
> >
> >
> >> Hi,
> >>
> >> I am doing performance analysis on a Cray CX1 machine.  I have run the
> >> Pallas MPI benchmark and have noticed a considerable performance
> >> difference between MVAPICH2 and Intel MPI on all the tests when shared
> >> memory is used.  I have also run the benchmark for non-shared memory and
> >> the two performed nearly the same (MVAPICH2 was slightly faster).  Is
> >> this slowdown on shared memory a known issue and/or are there fixes or
> >> switches I can enable or disable to get more speed?
> >>
> >> To give an idea of what I'm seeing, for the simple Ping Pong test for
> >> two processes on the same chip, the numbers look like:
> >>
> >> Processes  # repetitions  #bytes    Intel MPI time (usec)  MVAPICH2 time (usec)
> >> 2          1000           0         0.35                   0.94
> >>            1000           1         0.44                   1.24
> >>            1000           2         0.45                   1.17
> >>            1000           4         0.45                   1.08
> >>            1000           8         0.45                   1.11
> >>            1000           16        0.44                   1.13
> >>            1000           32        0.45                   1.21
> >>            1000           64        0.47                   1.35
> >>            1000           128       0.48                   1.75
> >>            1000           256       0.51                   2.92
> >>            1000           512       0.57                   3.41
> >>            1000           1024      0.76                   3.85
> >>            1000           2048      0.98                   4.27
> >>            1000           4096      1.53                   5.14
> >>            1000           8192      2.59                   8.04
> >>            1000           16384     4.86                   14.34
> >>            1000           32768     7.17                   33.92
> >>            640            65536     11.65                  43.27
> >>            320            131072    20.97                  66.98
> >>            160            262144    39.64                  118.58
> >>            80             524288    84.91                  224.40
> >>            40             1048576   212.76                 461.80
> >>            20             2097152   458.55                 1053.67
> >>            10             4194304   1738.30                2649.30
> >>
> >>
> >> Hopefully the table came out clear.  MVAPICH2 always lags behind by a
> >> considerable amount.  Any insight is much appreciated.  Thanks!
> >>
> >>
> >> Chris Co
> >> _______________________________________________
> >> mvapich-discuss mailing list
> >> mvapich-discuss at cse.ohio-state.edu
> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>
> >>
> >
> >
>


