[mvapich-discuss] Shared Memory Performance
Dhabaleswar Panda
panda at cse.ohio-state.edu
Mon Jun 15 20:25:13 EDT 2009
Thanks for letting us know that you are using MVAPICH2 1.4. I believe you
are taking these numbers on Intel systems. Please note that on Intel
systems, two cores next to each other within the same chip are numbered
0 and 4 (not 0 and 1). Thus, the default setting (with processes 0 and 1)
runs across the chips, and that is why you are seeing worse performance.
Please run your tests across cores 0 and 4 and you should see better
performance. Depending on which pair of cores you use, you may see some
differences in performance for short and large messages (depending on
whether these cores are within the same chip, the same socket, or across
sockets). I am attaching some numbers below from our Nehalem system with
these two CPU mappings so you can see the performance difference.
MVAPICH2 provides flexible mapping of MPI processes to cores within a
node. You can try out various pairs and you will see the performance
differences. More details on such mapping are available here:
http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4rc1.html#x1-360006.8
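For example, a same-chip run pinned to cores 0 and 4 might look like the
following sketch (the hostname and benchmark path are placeholders for
your setup; MV2_CPU_MAPPING is the run-time mapping parameter described
in the user guide section above):

```shell
# Run the OSU latency test with both ranks on one node,
# pinned to cores 0 and 4 (adjacent cores on this Nehalem numbering).
# "node01" and ./osu_latency are placeholders for your host and binary.
mpirun_rsh -np 2 node01 node01 \
    MV2_CPU_MAPPING=0:4 ./osu_latency
```

The mapping string lists one core (or core set) per rank, separated by
colons, so `0:4` binds rank 0 to core 0 and rank 1 to core 4.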
Also, starting with MVAPICH2 1.4, a new single-copy kernel-based
shared-memory scheme (LiMIC2) is introduced. It is `off' by default.
You can use it to get better performance for larger message sizes. You
need to configure with --enable-limic2 and you also need to run with
MV2_SMP_USE_LIMIC2=1. More details are available here:
http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4rc1.html#x1-370006.9
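Putting the two steps together, a sketch of the build and run might look
like this (assuming the LiMIC2 kernel module is already installed, and
with the hostname and benchmark path again as placeholders):

```shell
# Build MVAPICH2 1.4 with LiMIC2 support enabled at configure time.
./configure --enable-limic2
make && make install

# Enable the single-copy LiMIC2 path at run time for large messages.
mpirun_rsh -np 2 node01 node01 \
    MV2_SMP_USE_LIMIC2=1 ./osu_latency
```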
Here are some performance numbers with different CPU mappings.
OSU MPI latency with Default CPU mapping (LiMIC2 is off)
--------------------------------------------------------
# OSU MPI Latency Test v3.1.1
# Size Latency (us)
0 0.77
1 0.95
2 0.95
4 0.94
8 0.94
16 0.94
32 0.96
64 0.99
128 1.09
256 1.22
512 1.37
1024 1.61
2048 1.79
4096 2.43
8192 5.42
16384 6.73
32768 9.57
65536 15.34
131072 28.71
262144 53.13
524288 100.24
1048576 199.98
2097152 387.28
4194304 991.68
OSU MPI latency with CPU mapping 0:4 (LiMIC2 is off)
----------------------------------------------------
# OSU MPI Latency Test v3.1.1
# Size Latency (us)
0 0.34
1 0.40
2 0.40
4 0.40
8 0.40
16 0.40
32 0.42
64 0.42
128 0.45
256 0.50
512 0.55
1024 0.67
2048 0.91
4096 1.35
8192 3.66
16384 5.01
32768 7.41
65536 12.90
131072 25.21
262144 49.71
524288 97.17
1048576 187.50
2097152 465.57
4194304 1196.31
Let us know if you get better performance with appropriate CPU mapping.
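Regarding your question about printing the nodes/cores in use: one quick
way to double-check where each rank lands is to have every rank report
its own affinity. A sketch on Linux (hostname is a placeholder, and this
assumes util-linux's taskset is on the compute nodes):

```shell
# Each rank prints its host, PID, and current CPU affinity list.
mpirun_rsh -np 2 node01 node01 \
    sh -c 'echo "$(hostname) pid $$: $(taskset -cp $$)"'
```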
Thanks,
DK
On Mon, 15 Jun 2009, Christopher Co wrote:
> I am using MVAPICH2 1.4 with the default configuration (since the CX-1
> uses Mellanox Infiniband). I am fairly certain my CPU mapping was
> on-node for both cases (curiously, is there a way for MVAPICH2 to print
> out the nodes/cores it is running on?). I have the numbers for Ping Pong for the
> off-node case. I should have included this in my earlier message:
> Processes  # repetitions  #bytes    Intel MPI time (usec)  MVAPICH2 time (usec)
> 2          1000           0         4.16                   3.4
>            1000           1         4.67                   3.56
>            1000           2         4.21                   3.56
>            1000           4         4.23                   3.62
>            1000           8         4.33                   3.63
>            1000           16        4.33                   3.64
>            1000           32        4.38                   3.73
>            1000           64        4.44                   3.92
>            1000           128       5.61                   4.71
>            1000           256       5.92                   5.23
>            1000           512       6.52                   5.79
>            1000           1024      7.68                   7.06
>            1000           2048      9.97                   9.36
>            1000           4096      12.39                  11.97
>            1000           8192      17.86                  22.53
>            1000           16384     27.44                  28.27
>            1000           32768     40.32                  39.82
>            640            65536     63.61                  62.97
>            320            131072    109.69                 110.01
>            160            262144    204.71                 206.9
>            80             524288    400.72                 397.1
>            40             1048576   775.64                 776.45
>            20             2097152   1523.95                1535.65
>            10             4194304   3018.84                3054.89
>
>
>
> Chris
>
>
> Dhabaleswar Panda wrote:
> > Can you tell us which version of MVAPICH2 you are using and which
> > option(s) are configured? Are you using correct CPU mapping in both
> > cases?
> >
> > DK
> >
> > On Mon, 15 Jun 2009, Christopher Co wrote:
> >
> >
> >> Hi,
> >>
> >> I am doing performance analysis on a Cray CX1 machine. I have run the
> >> Pallas MPI benchmark and have noticed a considerable performance
> >> difference between MVAPICH2 and Intel MPI on all the tests when shared
> >> memory is used. I have also run the benchmark for non-shared memory and
> >> the two performed nearly the same (MVAPICH2 was slightly faster). Is
> >> this slowdown on shared memory a known issue and/or are there fixes or
> >> switches I can enable or disable to get more speed?
> >>
> >> To give an idea of what I'm seeing, for the simple Ping Pong test for
> >> two processes on the same chip, the numbers look like:
> >>
> >> Processes  # repetitions  #bytes    Intel MPI time (usec)  MVAPICH2 time (usec)
> >> 2          1000           0         0.35                   0.94
> >>            1000           1         0.44                   1.24
> >>            1000           2         0.45                   1.17
> >>            1000           4         0.45                   1.08
> >>            1000           8         0.45                   1.11
> >>            1000           16        0.44                   1.13
> >>            1000           32        0.45                   1.21
> >>            1000           64        0.47                   1.35
> >>            1000           128       0.48                   1.75
> >>            1000           256       0.51                   2.92
> >>            1000           512       0.57                   3.41
> >>            1000           1024      0.76                   3.85
> >>            1000           2048      0.98                   4.27
> >>            1000           4096      1.53                   5.14
> >>            1000           8192      2.59                   8.04
> >>            1000           16384     4.86                   14.34
> >>            1000           32768     7.17                   33.92
> >>            640            65536     11.65                  43.27
> >>            320            131072    20.97                  66.98
> >>            160            262144    39.64                  118.58
> >>            80             524288    84.91                  224.40
> >>            40             1048576   212.76                 461.80
> >>            20             2097152   458.55                 1053.67
> >>            10             4194304   1738.30                2649.30
> >>
> >> Hopefully the table came out clear. MVAPICH2 always lags behind by a
> >> considerable amount. Any insight is much appreciated. Thanks!
> >>
> >>
> >> Chris Co
> >> _______________________________________________
> >> mvapich-discuss mailing list
> >> mvapich-discuss at cse.ohio-state.edu
> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>
> >>
> >
> >
>