[mvapich-discuss] All-to-All Benchmark with MV2 2.3.1 with Larger Node Sizes
Panda, Dhabaleswar
panda at cse.ohio-state.edu
Wed Oct 9 23:48:20 EDT 2019
Thanks for your report. This is not expected. Do you see a similar trend with the latest MVAPICH2 2.3.2 version (released in August 2019)?
Thanks,
DK
________________________________________
From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> on behalf of Manalo, Kevin L <kevinlee at gatech.edu>
Sent: Wednesday, October 9, 2019 11:30 PM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] All-to-All Benchmark with MV2 2.3.1 with Larger Node Sizes
Hi MVAPICH Team:
I hope you are all doing well!
We are using MVAPICH2 2.3.1 on a cluster w/ Intel 19.
I have a question that came from working with a user at Georgia Tech PACE.
Our architecture is Intel Cascade Lake (Dual-Socket 6226 CPU) using EDR w/ ConnectX-5 cards, compilers are Intel 19.0.3
When we run problems at larger node sizes, we see a drop in latency going from message size 16 to 32. Here is one of the smaller benchmarks showing this behavior; it appears at 4 nodes, but it is easier to see at 8 or more. Below are two tests with varying ppn (8 and 24).
mpiexec -n 64 -ppn 8 osu_alltoall
# OSU MPI All-to-All Personalized Exchange Latency Test v5.6.1
# Size Avg Latency(us)
1 10.74
2 12.75
4 17.37
8 35.56
16 69.18
32 19.59
64 21.60
128 27.46
256 37.38
512 95.49
1024 158.91
2048 103.67
4096 201.15
8192 368.41
16384 988.25
32768 1985.57
65536 3002.70
131072 5567.51
262144 10635.05
524288 20895.42
1048576 41338.57
mpiexec -n 64 -ppn 24 osu_alltoall
# OSU MPI All-to-All Personalized Exchange Latency Test v5.6.1
# Size Avg Latency(us)
1 13.38
2 17.65
4 27.68
8 67.73
16 125.09
32 226.59
64 26.58
128 32.84
256 54.04
512 123.28
1024 124.91
2048 194.57
4096 354.25
8192 727.99
16384 1661.27
32768 3440.93
65536 6627.03
131072 11389.51
262144 22448.69
524288 48042.85
1048576 88304.84
We are not sure about this behavior (is it expected?). If it is tunable, is there an environment variable or MV2_* parameter we should adjust, or a configuration option to correct it?
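In case it helps frame the question: one way we could probe whether this is tied to an algorithm switch point would be to sweep the alltoall cutoffs that show up in the environment dump below (e.g. MV2_ALLTOALL_SMALL_MSG, default 2048). A rough sketch, with purely illustrative threshold values, not recommendations:

```shell
#!/bin/sh
# Sketch only: re-run the benchmark while varying the alltoall
# small-message threshold reported by MV2_SHOW_ENV_INFO=2.
# The threshold values below are illustrative, not tuned settings.
for thresh in 512 1024 2048 4096; do
    echo "== MV2_ALLTOALL_SMALL_MSG=$thresh =="
    MV2_ALLTOALL_SMALL_MSG=$thresh mpiexec -n 64 -ppn 8 osu_alltoall
done
```

If the size at which latency drops shifts with the threshold, that would point at the algorithm crossover rather than the network itself.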
Thanks,
Kevin Manalo
Here is also the output of mpiname -a, followed by the parameter dump produced when MV2_SHOW_ENV_INFO=2 is active.
$ mpiname -a
MVAPICH2 2.3.1 Fri Mar 1 22:00:00 EST 2019 ch3:mrail
Compilation
CC: /usr/local/pace-apps/spack/root/0.12/4b400d5/lib/spack/env/intel/icc -DNDEBUG -DNVALGRIND -O2
CXX: /usr/local/pace-apps/spack/root/0.12/4b400d5/lib/spack/env/intel/icpc -DNDEBUG -DNVALGRIND -O2
F77: /usr/local/pace-apps/spack/root/0.12/4b400d5/lib/spack/env/intel/ifort -O2
FC: /usr/local/pace-apps/spack/root/0.12/4b400d5/lib/spack/env/intel/ifort -O2
Configuration
--prefix=/usr/local/pace-apps/spack/packages/0.12/linux-rhel7-x86_64/intel-19.0.3/mvapich2-2.3.1-nib4xddpmv6xjfwvkwchggasrs6kfquj --enable-shared --enable-romio --disable-silent-rules --disable-new-dtags --enable-fortran=all --enable-threads=multiple --with-ch3-rank-bits=32 --disable-alloca --enable-fast=all --disable-cuda --enable-registration-cache --with-pm=hydra --with-device=ch3:mrail --with-rdma=gen2 --disable-mcast --with-file-system=gpfs+nfs
MVAPICH2-2.3.1 Parameters
---------------------------------------------------------------------
PROCESSOR ARCH NAME : MV2_ARCH_INTEL_GENERIC
PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_INTEL
PROCESSOR MODEL NUMBER : 85
HCA NAME : MV2_HCA_MLX_CX_EDR
HETEROGENEOUS HCA : NO
MV2_VBUF_TOTAL_SIZE : 16384
MV2_IBA_EAGER_THRESHOLD : 16384
MV2_RDMA_FAST_PATH_BUF_SIZE : 4096
MV2_PUT_FALLBACK_THRESHOLD : 8192
MV2_GET_FALLBACK_THRESHOLD : 262144
MV2_EAGERSIZE_1SC : 4096
MV2_SMP_EAGERSIZE : 65537
MV2_SMPI_LENGTH_QUEUE : 262144
MV2_SMP_NUM_SEND_BUFFER : 256
MV2_SMP_BATCH_SIZE : 8
---------------------------------------------------------------------
---------------------------------------------------------------------
MVAPICH2 All Parameters
MV2_COMM_WORLD_LOCAL_RANK : 0
MPIRUN_RSH_LAUNCH : 0
MV2_SHMEM_BACKED_UD_CM : 1
MV2_3DTORUS_SUPPORT : 0
MV2_NUM_SA_QUERY_RETRIES : 20
MV2_NUM_SLS : 8
MV2_DEFAULT_SERVICE_LEVEL : 0
MV2_PATH_SL_QUERY : 0
MV2_USE_QOS : 0
MV2_ALLGATHER_BRUCK_THRESHOLD : 524288
MV2_ALLGATHER_RD_THRESHOLD : 81920
MV2_ALLGATHER_REVERSE_RANKING : 1
MV2_ALLGATHERV_RD_THRESHOLD : 0
MV2_ALLREDUCE_2LEVEL_MSG : 262144
MV2_ALLREDUCE_SHORT_MSG : 2048
MV2_ALLTOALL_MEDIUM_MSG : 16384
MV2_ALLTOALL_SMALL_MSG : 2048
MV2_ALLTOALL_THROTTLE_FACTOR : 32
MV2_BCAST_TWO_LEVEL_SYSTEM_SIZE : 64
MV2_GATHER_SWITCH_PT : 0
MV2_INTRA_SHMEM_REDUCE_MSG : 2048
MV2_KNOMIAL_2LEVEL_BCAST_MESSAGE_SIZE_THRESHOLD : 2048
MV2_KNOMIAL_2LEVEL_BCAST_SYSTEM_SIZE_THRESHOLD : 64
MV2_KNOMIAL_INTER_LEADER_THRESHOLD : 65536
MV2_KNOMIAL_INTER_NODE_FACTOR : 4
MV2_KNOMIAL_INTRA_NODE_FACTOR : 4
MV2_KNOMIAL_INTRA_NODE_THRESHOLD : 131072
MV2_RED_SCAT_LARGE_MSG : 524288
MV2_RED_SCAT_SHORT_MSG : 64
MV2_REDUCE_2LEVEL_MSG : 16384
MV2_REDUCE_SHORT_MSG : 8192
MV2_SCATTER_MEDIUM_MSG : 0
MV2_SCATTER_SMALL_MSG : 0
MV2_SHMEM_ALLREDUCE_MSG : 32768
MV2_SHMEM_COLL_MAX_MSG_SIZE : 131072
MV2_SHMEM_COLL_NUM_COMM : 8
MV2_SHMEM_COLL_NUM_PROCS : 8
MV2_SHMEM_COLL_SPIN_COUNT : 5
MV2_SHMEM_REDUCE_MSG : 4096
MV2_USE_BCAST_SHORT_MSG : 16384
MV2_USE_DIRECT_GATHER : 1
MV2_USE_DIRECT_GATHER_SYSTEM_SIZE_MEDIUM : 1024
MV2_USE_DIRECT_GATHER_SYSTEM_SIZE_SMALL : 384
MV2_USE_DIRECT_SCATTER : 1
MV2_USE_OSU_COLLECTIVES : 1
MV2_USE_OSU_NB_COLLECTIVES : 1
MV2_USE_KNOMIAL_2LEVEL_BCAST : 1
MV2_USE_KNOMIAL_INTER_LEADER_BCAST : 1
MV2_USE_SCATTER_RD_INTER_LEADER_BCAST : 1
MV2_USE_SCATTER_RING_INTER_LEADER_BCAST : 1
MV2_USE_SHMEM_ALLREDUCE : 1
MV2_USE_SHMEM_BARRIER : 1
MV2_USE_SHMEM_BCAST : 1
MV2_USE_SHMEM_COLL : 1
MV2_USE_SHMEM_REDUCE : 1
MV2_USE_TWO_LEVEL_GATHER : 1
MV2_USE_TWO_LEVEL_SCATTER : 1
MV2_USE_XOR_ALLTOALL : 1
MV2_DEFAULT_SRC_PATH_BITS : 0
MV2_DEFAULT_STATIC_RATE : 0
MV2_DEFAULT_TIME_OUT : 460564
MV2_DEFAULT_MTU : 3
MV2_DEFAULT_PKEY : 0
MV2_DEFAULT_PORT : 0
MV2_DEFAULT_GID_INDEX : 0
MV2_DEFAULT_PSN : 0
MV2_DEFAULT_MAX_RECV_WQE : 128
MV2_DEFAULT_MAX_SEND_WQE : 64
MV2_DEFAULT_MAX_SG_LIST : 1
MV2_DEFAULT_MIN_RNR_TIMER : 12
MV2_DEFAULT_QP_OUS_RD_ATOM : 268701700
MV2_DEFAULT_RETRY_COUNT : 1799
MV2_DEFAULT_RNR_RETRY : 7
MV2_DEFAULT_MAX_CQ_SIZE : 40000
MV2_DEFAULT_MAX_RDMA_DST_OPS : 4
MV2_INITIAL_PREPOST_DEPTH : 10
MV2_IWARP_MULTIPLE_CQ_THRESHOLD : 32
MV2_NUM_HCAS : 1
MV2_NUM_PORTS : 1
MV2_NUM_QP_PER_PORT : 1
MV2_MAX_RDMA_CONNECT_ATTEMPTS : 20
MV2_ON_DEMAND_UD_INFO_EXCHANGE : 0
MV2_PREPOST_DEPTH : 64
MV2_HOMOGENEOUS_CLUSTER : 0
MV2_NUM_CQES_PER_POLL : 96
MV2_COALESCE_THRESHOLD : 6
MV2_DREG_CACHE_LIMIT : 0
MV2_IBA_EAGER_THRESHOLD : 16384
MV2_MAX_INLINE_SIZE : 168
MV2_MAX_R3_PENDING_DATA : 524288
MV2_MED_MSG_RAIL_SHARING_POLICY : 0
MV2_NDREG_ENTRIES : 1228
MV2_NUM_RDMA_BUFFER : 16
MV2_NUM_SPINS_BEFORE_LOCK : 2000
MV2_POLLING_LEVEL : 1
MV2_POLLING_SET_LIMIT : 64
MV2_POLLING_SET_THRESHOLD : 256
MV2_R3_NOCACHE_THRESHOLD : 32768
MV2_R3_THRESHOLD : 4096
MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD : 16384
MV2_RAIL_SHARING_MED_MSG_THRESHOLD : 2048
MV2_RAIL_SHARING_POLICY : 4
MV2_RDMA_EAGER_LIMIT : 32
MV2_RDMA_FAST_PATH_BUF_SIZE : 4096
MV2_RDMA_NUM_EXTRA_POLLS : 1
MV2_RNDV_EXT_SENDQ_SIZE : 5
MV2_RNDV_PROTOCOL : 3
MV2_SMALL_MSG_RAIL_SHARING_POLICY : 0
MV2_SPIN_COUNT : 5000
MV2_SRQ_LIMIT : 10
MV2_SRQ_MAX_SIZE : 4096
MV2_SRQ_SIZE : 80
MV2_STRIPING_THRESHOLD : 16384
MV2_USE_COALESCE : 0
MV2_USE_XRC : 0
MV2_VBUF_MAX : -1
MV2_VBUF_POOL_SIZE : 80
MV2_VBUF_SECONDARY_POOL_SIZE : 16
MV2_VBUF_TOTAL_SIZE : 16384
MV2_USE_IWARP_MODE : 0
MV2_USE_HWLOC_CPU_BINDING : 1
MV2_ENABLE_AFFINITY : 1
MV2_HCA_AWARE_PROCESS_MAPPING : 1
MV2_ENABLE_LEASTLOAD : 0
MV2_SMP_BATCH_SIZE : 8
MV2_SMP_EAGERSIZE : 65537
MV2_SMPI_LENGTH_QUEUE : 262144
MV2_SMP_NUM_SEND_BUFFER : 256
MV2_SMP_SEND_BUF_SIZE : 8192
MV2_USE_SHARED_MEM : 1
MV2_SMP_CMA_MAX_SIZE : 0
MV2_SMP_LIMIC2_MAX_SIZE : 0
MV2_SHOW_ENV_INFO : 2
MV2_DEFAULT_PUT_GET_LIST_SIZE : 200
MV2_EAGERSIZE_1SC : 4096
MV2_GET_FALLBACK_THRESHOLD : 262144
MV2_PIN_POOL_SIZE : 2097152
MV2_PUT_FALLBACK_THRESHOLD : 8192
MV2_USE_RDMA_CM : 1
MV2_ASYNC_THREAD_STACK_SIZE : 1048576
MV2_THREAD_YIELD_SPIN_THRESHOLD : 5
MV2_SUPPORT_DPM : 0
MV2_USE_HUGEPAGES : 1