[mvapich-discuss] Shared Memory Performance
Dhabaleswar Panda
panda at cse.ohio-state.edu
Tue Jun 16 21:43:59 EDT 2009
Could you let us know what issues you are seeing when using
MV2_CPU_MAPPING? The PLPA support is embedded in the MVAPICH2 code; it
does not require any additional configure or install step. I am assuming
that you are using the Gen2 (OFED) interface with mpirun_rsh and that your
systems are Linux-based.
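
In case it is useful, here is roughly how the mapping is passed through
mpirun_rsh at run time (the host name and executable below are just
placeholders for your setup):

    # pin rank 0 to core 0 and rank 1 to core 1 of node01
    mpirun_rsh -np 2 node01 node01 MV2_CPU_MAPPING=0:1 ./a.out

If the processes still do not land on the cores you list, the exact
command line and the behavior you observe would help us debug this.
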
Thanks,
DK
On Tue, 16 Jun 2009, Christopher Co wrote:
> I am having issues running processes on the cores I specify using
> MV2_CPU_MAPPING. Is the PLPA support for mapping MPI processes to cores
> embedded in MVAPICH2, or does it link against an existing PLPA installation
> at configure/install time? Also, I want to confirm that no extra configure
> options are needed to enable this feature.
>
>
> Thanks,
> Chris
>
> Dhabaleswar Panda wrote:
> > Thanks for letting us know that you are using MVAPICH2 1.4. I believe you
> > are taking numbers on Intel systems. Please note that on Intel systems,
> > two cores next to each other within the same chip are numbered 0 and 4
> > (not 0 and 1). Thus, the default setting (processes on cores 0 and 1) runs
> > across the chips, and that is why you are seeing worse performance. Please
> > run your tests across cores 0 and 4 and you should see better performance.
> > Depending on which pair of cores you use, you may see some differences in
> > performance for short and large messages (depending on whether the cores
> > are within the same chip, on the same socket, or across sockets). I am
> > attaching some numbers below from our Nehalem system with these two CPU
> > mappings so you can see the performance difference.
> >
> > MVAPICH2 provides flexible mapping of MPI processes to cores within a
> > node. You can try out various pairs and you will see the performance
> > differences. More details on such mapping are available here:
> >
> > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4rc1.html#x1-360006.8
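> >
> > As a quick sketch (the host name and binaries are placeholders), the
> > mapping is a colon-separated list of core ids, one per rank, passed on the
> > mpirun_rsh command line:
> >
> >     # two ranks on cores 0 and 4 (adjacent cores on this Intel numbering)
> >     mpirun_rsh -np 2 node01 node01 MV2_CPU_MAPPING=0:4 ./osu_latency
> >
> >     # four ranks placed on cores 0, 4, 1 and 5
> >     mpirun_rsh -np 4 node01 node01 node01 node01 MV2_CPU_MAPPING=0:4:1:5 ./a.out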
> >
> > Also, starting with MVAPICH2 1.4, a new single-copy kernel-based
> > shared-memory scheme (LiMIC2) is introduced. It is 'off' by default.
> > You can use it to get better performance for larger message sizes. You
> > need to configure with enable-limic2 and also set MV2_SMP_USE_LIMIC2=1
> > at run time. More details are available here:
> >
> > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4rc1.html#x1-370006.9
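> >
> > As a rough sketch of the two steps (please see the user guide above for
> > the exact configure options on your platform; host and binary below are
> > placeholders):
> >
> >     # build MVAPICH2 with LiMIC2 support
> >     ./configure --enable-limic2      # plus your usual configure options
> >     make && make install
> >
> >     # enable the single-copy kernel path at run time
> >     mpirun_rsh -np 2 node01 node01 MV2_SMP_USE_LIMIC2=1 MV2_CPU_MAPPING=0:4 ./osu_latency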
> >
> > Here are some performance numbers with different CPU mappings.
> >
> > OSU MPI latency with Default CPU mapping (LiMIC2 is off)
> > --------------------------------------------------------
> >
> > # OSU MPI Latency Test v3.1.1
> > # Size Latency (us)
> > 0 0.77
> > 1 0.95
> > 2 0.95
> > 4 0.94
> > 8 0.94
> > 16 0.94
> > 32 0.96
> > 64 0.99
> > 128 1.09
> > 256 1.22
> > 512 1.37
> > 1024 1.61
> > 2048 1.79
> > 4096 2.43
> > 8192 5.42
> > 16384 6.73
> > 32768 9.57
> > 65536 15.34
> > 131072 28.71
> > 262144 53.13
> > 524288 100.24
> > 1048576 199.98
> > 2097152 387.28
> > 4194304 991.68
> >
> > OSU MPI latency with CPU mapping 0:4 (LiMIC2 is off)
> > ----------------------------------------------------
> >
> > # OSU MPI Latency Test v3.1.1
> > # Size Latency (us)
> > 0 0.34
> > 1 0.40
> > 2 0.40
> > 4 0.40
> > 8 0.40
> > 16 0.40
> > 32 0.42
> > 64 0.42
> > 128 0.45
> > 256 0.50
> > 512 0.55
> > 1024 0.67
> > 2048 0.91
> > 4096 1.35
> > 8192 3.66
> > 16384 5.01
> > 32768 7.41
> > 65536 12.90
> > 131072 25.21
> > 262144 49.71
> > 524288 97.17
> > 1048576 187.50
> > 2097152 465.57
> > 4194304 1196.31
> >
> > Let us know if you get better performance with the appropriate CPU mapping.
> >
> > Thanks,
> >
> > DK
> >
> >
> > On Mon, 15 Jun 2009, Christopher Co wrote:
> >
> >
> >> I am using MVAPICH2 1.4 with the default configuration (since the CX-1
> >> uses Mellanox InfiniBand). I am fairly certain my CPU mapping was
> >> on-node for both cases (out of curiosity, is there a way for MVAPICH2 to
> >> print out the nodes/cores each process is running on?). I have the
> >> numbers for Ping Pong for the off-node case; I should have included
> >> these in my earlier message:
> >> (2 processes)
> >>
> >>  # repetitions     # bytes    Intel MPI time (usec)    MVAPICH2 time (usec)
> >>      1000                0            4.16                     3.4
> >>      1000                1            4.67                     3.56
> >>      1000                2            4.21                     3.56
> >>      1000                4            4.23                     3.62
> >>      1000                8            4.33                     3.63
> >>      1000               16            4.33                     3.64
> >>      1000               32            4.38                     3.73
> >>      1000               64            4.44                     3.92
> >>      1000              128            5.61                     4.71
> >>      1000              256            5.92                     5.23
> >>      1000              512            6.52                     5.79
> >>      1000             1024            7.68                     7.06
> >>      1000             2048            9.97                     9.36
> >>      1000             4096           12.39                    11.97
> >>      1000             8192           17.86                    22.53
> >>      1000            16384           27.44                    28.27
> >>      1000            32768           40.32                    39.82
> >>       640            65536           63.61                    62.97
> >>       320           131072          109.69                   110.01
> >>       160           262144          204.71                   206.9
> >>        80           524288          400.72                   397.1
> >>        40          1048576          775.64                   776.45
> >>        20          2097152         1523.95                  1535.65
> >>        10          4194304         3018.84                  3054.89
> >>
> >>
> >>
> >> Chris
> >>
> >>
> >> Dhabaleswar Panda wrote:
> >>
> >>> Can you tell us which version of MVAPICH2 you are using and which
> >>> option(s) it was configured with? Are you using the correct CPU mapping
> >>> in both cases?
> >>>
> >>> DK
> >>>
> >>> On Mon, 15 Jun 2009, Christopher Co wrote:
> >>>
> >>>
> >>>
> >>>> Hi,
> >>>>
> >>>> I am doing performance analysis on a Cray CX1 machine. I have run the
> >>>> Pallas MPI benchmark and have noticed a considerable performance
> >>>> difference between MVAPICH2 and Intel MPI on all the tests when shared
> >>>> memory is used. I have also run the benchmark for non-shared memory and
> >>>> the two performed nearly the same (MVAPICH2 was slightly faster). Is
> >>>> this slowdown on shared memory a known issue and/or are there fixes or
> >>>> switches I can enable or disable to get more speed?
> >>>>
> >>>> To give an idea of what I'm seeing, for the simple Ping Pong test with
> >>>> two processes on the same chip, the numbers look like this:
> >>>>
> >>>> (2 processes)
> >>>>
> >>>>  # repetitions     # bytes    Intel MPI time (usec)    MVAPICH2 time (usec)
> >>>>      1000                0            0.35                     0.94
> >>>>      1000                1            0.44                     1.24
> >>>>      1000                2            0.45                     1.17
> >>>>      1000                4            0.45                     1.08
> >>>>      1000                8            0.45                     1.11
> >>>>      1000               16            0.44                     1.13
> >>>>      1000               32            0.45                     1.21
> >>>>      1000               64            0.47                     1.35
> >>>>      1000              128            0.48                     1.75
> >>>>      1000              256            0.51                     2.92
> >>>>      1000              512            0.57                     3.41
> >>>>      1000             1024            0.76                     3.85
> >>>>      1000             2048            0.98                     4.27
> >>>>      1000             4096            1.53                     5.14
> >>>>      1000             8192            2.59                     8.04
> >>>>      1000            16384            4.86                    14.34
> >>>>      1000            32768            7.17                    33.92
> >>>>       640            65536           11.65                    43.27
> >>>>       320           131072           20.97                    66.98
> >>>>       160           262144           39.64                   118.58
> >>>>        80           524288           84.91                   224.40
> >>>>        40          1048576          212.76                   461.80
> >>>>        20          2097152          458.55                  1053.67
> >>>>        10          4194304         1738.30                  2649.30
> >>>>
> >>>>
> >>>> Hopefully the table came out clear. MVAPICH2 always lags behind by a
> >>>> considerable amount. Any insight is much appreciated. Thanks!
> >>>>
> >>>>
> >>>> Chris Co
> >>>> _______________________________________________
> >>>> mvapich-discuss mailing list
> >>>> mvapich-discuss at cse.ohio-state.edu
> >>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>>>
> >>>>
> >>>>
> >>>
> >
> >
>