[mvapich-discuss] All-to-All Benchmark with MV2 2.3.1 with Larger Node Sizes

Panda, Dhabaleswar panda at cse.ohio-state.edu
Wed Oct 9 23:48:20 EDT 2019


Thanks for your report. This is not expected. Do you see a similar trend with the latest MVAPICH2 2.3.2 version (released in August 2019)?

Thanks,

DK




________________________________________
From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> on behalf of Manalo, Kevin L <kevinlee at gatech.edu>
Sent: Wednesday, October 9, 2019 11:30 PM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] All-to-All Benchmark with MV2 2.3.1 with Larger Node Sizes

Hi MVAPICH Team:

I hope you are all doing well!

We are using MVAPICH2 2.3.1 on a cluster w/ Intel 19.

I have a question that came from working with a user at Georgia Tech PACE.

Our architecture is Intel Cascade Lake (dual-socket 6226 CPUs) using EDR InfiniBand with ConnectX-5 cards; the compilers are Intel 19.0.3.

When we run problems at larger node counts, we see a sudden drop in latency going from message size 16 to 32 bytes. Here’s one of the smaller benchmarks that shows this behavior. It appears at 4 nodes, but it’s easier to see at 8 or more. Below are two tests, with ppn 8 and ppn 24; in the ppn 24 run the drop occurs between sizes 32 and 64 instead.

mpiexec -n 64 -ppn 8 osu_alltoall

# OSU MPI All-to-All Personalized Exchange Latency Test v5.6.1
# Size       Avg Latency(us)
1                      10.74
2                      12.75
4                      17.37
8                      35.56
16                     69.18
32                     19.59
64                     21.60
128                    27.46
256                    37.38
512                    95.49
1024                  158.91
2048                  103.67
4096                  201.15
8192                  368.41
16384                 988.25
32768                1985.57
65536                3002.70
131072               5567.51
262144              10635.05
524288              20895.42
1048576             41338.57

mpiexec -n 64 -ppn 24 osu_alltoall

# OSU MPI All-to-All Personalized Exchange Latency Test v5.6.1
# Size       Avg Latency(us)
1                      13.38
2                      17.65
4                      27.68
8                      67.73
16                    125.09
32                    226.59
64                     26.58
128                    32.84
256                    54.04
512                   123.28
1024                  124.91
2048                  194.57
4096                  354.25
8192                  727.99
16384                1661.27
32768                3440.93
65536                6627.03
131072              11389.51
262144              22448.69
524288              48042.85
1048576             88304.84
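
For anyone who wants to flag these drops automatically rather than by eye, here is a minimal Python sketch. It assumes the two-column osu_alltoall output format shown above; the function names and the 0.9 ratio threshold are illustrative choices, not part of the OSU benchmarks or MVAPICH2.

```python
def parse_osu_output(text):
    """Return a list of (size_bytes, avg_latency_us) from osu_alltoall output."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' header lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 2:
            continue
        try:
            rows.append((int(parts[0]), float(parts[1])))
        except ValueError:
            continue
    return rows


def find_latency_drops(rows, ratio=0.9):
    """Return (prev_size, size, prev_latency, latency) wherever the latency
    falls below ratio * the latency of the previous (smaller) message size.
    The ratio is a hypothetical threshold chosen for illustration."""
    drops = []
    for (s0, l0), (s1, l1) in zip(rows, rows[1:]):
        if l1 < ratio * l0:
            drops.append((s0, s1, l0, l1))
    return drops


# First part of the ppn 8 run quoted above.
SAMPLE_PPN8 = """\
# OSU MPI All-to-All Personalized Exchange Latency Test v5.6.1
# Size       Avg Latency(us)
1                      10.74
2                      12.75
4                      17.37
8                      35.56
16                     69.18
32                     19.59
64                     21.60
128                    27.46
256                    37.38
512                    95.49
1024                  158.91
2048                  103.67
4096                  201.15
"""

for s0, s1, l0, l1 in find_latency_drops(parse_osu_output(SAMPLE_PPN8)):
    print(f"latency drops from {l0} us at {s0} bytes to {l1} us at {s1} bytes")
```

On the ppn 8 excerpt this flags the 16 to 32 byte transition and also the 1024 to 2048 byte transition.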

We’re not sure whether this behavior is expected. If it is tunable, is there an environment variable or MV2_* parameter we should adjust, or a configuration setting we should correct?

Thanks,
Kevin Manalo

Here is also the output of mpiname -a and the parameter dump printed when MV2_SHOW_ENV_INFO=2 is set.

$ mpiname -a
MVAPICH2 2.3.1 Fri Mar 1 22:00:00 EST 2019 ch3:mrail

Compilation
CC: /usr/local/pace-apps/spack/root/0.12/4b400d5/lib/spack/env/intel/icc    -DNDEBUG -DNVALGRIND -O2
CXX: /usr/local/pace-apps/spack/root/0.12/4b400d5/lib/spack/env/intel/icpc   -DNDEBUG -DNVALGRIND -O2
F77: /usr/local/pace-apps/spack/root/0.12/4b400d5/lib/spack/env/intel/ifort   -O2
FC: /usr/local/pace-apps/spack/root/0.12/4b400d5/lib/spack/env/intel/ifort   -O2

Configuration
--prefix=/usr/local/pace-apps/spack/packages/0.12/linux-rhel7-x86_64/intel-19.0.3/mvapich2-2.3.1-nib4xddpmv6xjfwvkwchggasrs6kfquj --enable-shared --enable-romio --disable-silent-rules --disable-new-dtags --enable-fortran=all --enable-threads=multiple --with-ch3-rank-bits=32 --disable-alloca --enable-fast=all --disable-cuda --enable-registration-cache --with-pm=hydra --with-device=ch3:mrail --with-rdma=gen2 --disable-mcast --with-file-system=gpfs+nfs

MVAPICH2-2.3.1 Parameters
---------------------------------------------------------------------
     PROCESSOR ARCH NAME            : MV2_ARCH_INTEL_GENERIC
     PROCESSOR FAMILY NAME          : MV2_CPU_FAMILY_INTEL
     PROCESSOR MODEL NUMBER         : 85
     HCA NAME                       : MV2_HCA_MLX_CX_EDR
     HETEROGENEOUS HCA              : NO
     MV2_VBUF_TOTAL_SIZE            : 16384
     MV2_IBA_EAGER_THRESHOLD        : 16384
     MV2_RDMA_FAST_PATH_BUF_SIZE    : 4096
     MV2_PUT_FALLBACK_THRESHOLD     : 8192
     MV2_GET_FALLBACK_THRESHOLD     : 262144
     MV2_EAGERSIZE_1SC              : 4096
     MV2_SMP_EAGERSIZE              : 65537
     MV2_SMPI_LENGTH_QUEUE          : 262144
     MV2_SMP_NUM_SEND_BUFFER        : 256
     MV2_SMP_BATCH_SIZE             : 8
---------------------------------------------------------------------
---------------------------------------------------------------------

MVAPICH2 All Parameters
     MV2_COMM_WORLD_LOCAL_RANK           : 0
     MPIRUN_RSH_LAUNCH                   : 0
     MV2_SHMEM_BACKED_UD_CM              : 1
     MV2_3DTORUS_SUPPORT                 : 0
     MV2_NUM_SA_QUERY_RETRIES            : 20
     MV2_NUM_SLS                         : 8
     MV2_DEFAULT_SERVICE_LEVEL           : 0
     MV2_PATH_SL_QUERY                   : 0
     MV2_USE_QOS                         : 0
     MV2_ALLGATHER_BRUCK_THRESHOLD       : 524288
     MV2_ALLGATHER_RD_THRESHOLD          : 81920
     MV2_ALLGATHER_REVERSE_RANKING       : 1
     MV2_ALLGATHERV_RD_THRESHOLD         : 0
     MV2_ALLREDUCE_2LEVEL_MSG            : 262144
     MV2_ALLREDUCE_SHORT_MSG             : 2048
     MV2_ALLTOALL_MEDIUM_MSG             : 16384
     MV2_ALLTOALL_SMALL_MSG              : 2048
     MV2_ALLTOALL_THROTTLE_FACTOR        : 32
     MV2_BCAST_TWO_LEVEL_SYSTEM_SIZE     : 64
     MV2_GATHER_SWITCH_PT                : 0
     MV2_INTRA_SHMEM_REDUCE_MSG          : 2048
     MV2_KNOMIAL_2LEVEL_BCAST_MESSAGE_SIZE_THRESHOLD : 2048
     MV2_KNOMIAL_2LEVEL_BCAST_SYSTEM_SIZE_THRESHOLD : 64
     MV2_KNOMIAL_INTER_LEADER_THRESHOLD  : 65536
     MV2_KNOMIAL_INTER_NODE_FACTOR       : 4
     MV2_KNOMIAL_INTRA_NODE_FACTOR       : 4
     MV2_KNOMIAL_INTRA_NODE_THRESHOLD    : 131072
     MV2_RED_SCAT_LARGE_MSG              : 524288
     MV2_RED_SCAT_SHORT_MSG              : 64
     MV2_REDUCE_2LEVEL_MSG               : 16384
     MV2_REDUCE_SHORT_MSG                : 8192
     MV2_SCATTER_MEDIUM_MSG              : 0
     MV2_SCATTER_SMALL_MSG               : 0
     MV2_SHMEM_ALLREDUCE_MSG             : 32768
     MV2_SHMEM_COLL_MAX_MSG_SIZE         : 131072
     MV2_SHMEM_COLL_NUM_COMM             : 8
     MV2_SHMEM_COLL_NUM_PROCS            : 8
     MV2_SHMEM_COLL_SPIN_COUNT           : 5
     MV2_SHMEM_REDUCE_MSG                : 4096
     MV2_USE_BCAST_SHORT_MSG             : 16384
     MV2_USE_DIRECT_GATHER               : 1
     MV2_USE_DIRECT_GATHER_SYSTEM_SIZE_MEDIUM : 1024
     MV2_USE_DIRECT_GATHER_SYSTEM_SIZE_SMALL : 384
     MV2_USE_DIRECT_SCATTER              : 1
     MV2_USE_OSU_COLLECTIVES             : 1
     MV2_USE_OSU_NB_COLLECTIVES          : 1
     MV2_USE_KNOMIAL_2LEVEL_BCAST        : 1
     MV2_USE_KNOMIAL_INTER_LEADER_BCAST  : 1
     MV2_USE_SCATTER_RD_INTER_LEADER_BCAST : 1
     MV2_USE_SCATTER_RING_INTER_LEADER_BCAST : 1
     MV2_USE_SHMEM_ALLREDUCE             : 1
     MV2_USE_SHMEM_BARRIER               : 1
     MV2_USE_SHMEM_BCAST                 : 1
     MV2_USE_SHMEM_COLL                  : 1
     MV2_USE_SHMEM_REDUCE                : 1
     MV2_USE_TWO_LEVEL_GATHER            : 1
     MV2_USE_TWO_LEVEL_SCATTER           : 1
     MV2_USE_XOR_ALLTOALL                : 1
     MV2_DEFAULT_SRC_PATH_BITS           : 0
     MV2_DEFAULT_STATIC_RATE             : 0
     MV2_DEFAULT_TIME_OUT                : 460564
     MV2_DEFAULT_MTU                     : 3
     MV2_DEFAULT_PKEY                    : 0
     MV2_DEFAULT_PORT                    : 0
     MV2_DEFAULT_GID_INDEX               : 0
     MV2_DEFAULT_PSN                     : 0
     MV2_DEFAULT_MAX_RECV_WQE            : 128
     MV2_DEFAULT_MAX_SEND_WQE            : 64
     MV2_DEFAULT_MAX_SG_LIST             : 1
     MV2_DEFAULT_MIN_RNR_TIMER           : 12
     MV2_DEFAULT_QP_OUS_RD_ATOM          : 268701700
     MV2_DEFAULT_RETRY_COUNT             : 1799
     MV2_DEFAULT_RNR_RETRY               : 7
     MV2_DEFAULT_MAX_CQ_SIZE             : 40000
     MV2_DEFAULT_MAX_RDMA_DST_OPS        : 4
     MV2_INITIAL_PREPOST_DEPTH           : 10
     MV2_IWARP_MULTIPLE_CQ_THRESHOLD     : 32
     MV2_NUM_HCAS                        : 1
     MV2_NUM_PORTS                       : 1
     MV2_NUM_QP_PER_PORT                 : 1
     MV2_MAX_RDMA_CONNECT_ATTEMPTS       : 20
     MV2_ON_DEMAND_UD_INFO_EXCHANGE      : 0
     MV2_PREPOST_DEPTH                   : 64
     MV2_HOMOGENEOUS_CLUSTER             : 0
     MV2_NUM_CQES_PER_POLL               : 96
     MV2_COALESCE_THRESHOLD              : 6
     MV2_DREG_CACHE_LIMIT                : 0
     MV2_IBA_EAGER_THRESHOLD             : 16384
     MV2_MAX_INLINE_SIZE                 : 168
     MV2_MAX_R3_PENDING_DATA             : 524288
     MV2_MED_MSG_RAIL_SHARING_POLICY     : 0
     MV2_NDREG_ENTRIES                   : 1228
     MV2_NUM_RDMA_BUFFER                 : 16
     MV2_NUM_SPINS_BEFORE_LOCK           : 2000
     MV2_POLLING_LEVEL                   : 1
     MV2_POLLING_SET_LIMIT               : 64
     MV2_POLLING_SET_THRESHOLD           : 256
     MV2_R3_NOCACHE_THRESHOLD            : 32768
     MV2_R3_THRESHOLD                    : 4096
     MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD : 16384
     MV2_RAIL_SHARING_MED_MSG_THRESHOLD  : 2048
     MV2_RAIL_SHARING_POLICY             : 4
     MV2_RDMA_EAGER_LIMIT                : 32
     MV2_RDMA_FAST_PATH_BUF_SIZE         : 4096
     MV2_RDMA_NUM_EXTRA_POLLS            : 1
     MV2_RNDV_EXT_SENDQ_SIZE             : 5
     MV2_RNDV_PROTOCOL                   : 3
     MV2_SMALL_MSG_RAIL_SHARING_POLICY   : 0
     MV2_SPIN_COUNT                      : 5000
     MV2_SRQ_LIMIT                       : 10
     MV2_SRQ_MAX_SIZE                    : 4096
     MV2_SRQ_SIZE                        : 80
     MV2_STRIPING_THRESHOLD              : 16384
     MV2_USE_COALESCE                    : 0
     MV2_USE_XRC                         : 0
     MV2_VBUF_MAX                        : -1
     MV2_VBUF_POOL_SIZE                  : 80
     MV2_VBUF_SECONDARY_POOL_SIZE        : 16
     MV2_VBUF_TOTAL_SIZE                 : 16384
     MV2_USE_IWARP_MODE                  : 0
     MV2_USE_HWLOC_CPU_BINDING           : 1
     MV2_ENABLE_AFFINITY                 : 1
     MV2_HCA_AWARE_PROCESS_MAPPING       : 1
     MV2_ENABLE_LEASTLOAD                : 0
     MV2_SMP_BATCH_SIZE                  : 8
     MV2_SMP_EAGERSIZE                   : 65537
     MV2_SMPI_LENGTH_QUEUE               : 262144
     MV2_SMP_NUM_SEND_BUFFER             : 256
     MV2_SMP_SEND_BUF_SIZE               : 8192
     MV2_USE_SHARED_MEM                  : 1
     MV2_SMP_CMA_MAX_SIZE                : 0
     MV2_SMP_LIMIC2_MAX_SIZE             : 0
     MV2_SHOW_ENV_INFO                   : 2
     MV2_DEFAULT_PUT_GET_LIST_SIZE       : 200
     MV2_EAGERSIZE_1SC                   : 4096
     MV2_GET_FALLBACK_THRESHOLD          : 262144
     MV2_PIN_POOL_SIZE                   : 2097152
     MV2_PUT_FALLBACK_THRESHOLD          : 8192
     MV2_USE_RDMA_CM                     : 1
     MV2_ASYNC_THREAD_STACK_SIZE         : 1048576
     MV2_THREAD_YIELD_SPIN_THRESHOLD     : 5
     MV2_SUPPORT_DPM                     : 0
     MV2_USE_HUGEPAGES                   : 1
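
The dump above lists several all-to-all parameters (MV2_ALLTOALL_SMALL_MSG, MV2_ALLTOALL_MEDIUM_MSG, MV2_ALLTOALL_THROTTLE_FACTOR, MV2_USE_XOR_ALLTOALL). One way to check whether any of them moves the latency drop is to rerun the benchmark while sweeping a single parameter. The Python sketch below only builds the command lines. It assumes the Hydra launcher (this build uses --with-pm=hydra) and its -genv option for passing environment variables; whether these parameters actually control the algorithm switch seen here is an assumption that this thread does not confirm, and the sweep values are arbitrary examples.

```python
def build_sweep_commands(param, values, nprocs=64, ppn=8,
                         benchmark="osu_alltoall"):
    """Build one mpiexec command (as an argument list) per candidate value.

    param  : name of an MV2_* parameter taken from the dump above
    values : candidate settings to try (illustrative, not recommendations)
    """
    commands = []
    for value in values:
        commands.append([
            "mpiexec",
            "-genv", param, str(value),   # Hydra: set env var for all ranks
            "-n", str(nprocs),
            "-ppn", str(ppn),
            benchmark,
        ])
    return commands


# Print the commands instead of running them, so they can be reviewed
# or pasted into a job script.
for cmd in build_sweep_commands("MV2_ALLTOALL_SMALL_MSG", [16, 256, 2048]):
    print(" ".join(cmd))
```

Comparing the resulting latency tables against the default run would show whether the message size at which the drop occurs follows the parameter.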




