[mvapich-discuss] Job doesn't even start with core count > ~100. Help trying to diagnose the problem
Angel de Vicente
angelv at iac.es
Wed Dec 19 10:20:50 EST 2018
Hello,
many thanks for your help.
"Subramoni, Hari" <subramoni.1 at osu.edu> writes:
> Can you please send the following information
> 1. output of mpiname -a
,----
| MVAPICH2 2.3rc2 Mon Apr 30 22:00:00 EST 2018 ch3:mrail
|
| Compilation
| CC: icc -DNDEBUG -DNVALGRIND -O2
| CXX: icpc -DNDEBUG -DNVALGRIND -O2
| F77: ifort -L/lib -L/lib -O2
| FC: ifort -O2
|
| Configuration
| --prefix=/apps/MVAPICH2/2.3rc2/INTEL --with-pmi --with-slurm --enable-romio --with-file-system=lustre --with-mxm=/opt/mellanox/mxm --with-hcoll=/opt/mellanox/hcoll --with-knem=/opt/knem-1.1.2.90mlnx2
`----
> 2. output of 9 node run after setting the following environment
> variables "MV2_SHOW_ENV_INFO=2 MV2_SHOW_CPU_MAPPING=1
> MV2_SHOW_HCA_MAPPING=1"
With N=9 n=80 this produces no extra output, since the job gets stuck as
explained earlier, but with N=8 n=80 I get the following:
,----
| MVAPICH2-2.3rc2 Parameters
| ---------------------------------------------------------------------
| PROCESSOR ARCH NAME : MV2_ARCH_INTEL_XEON_E5_2670_16
| PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_INTEL
| PROCESSOR MODEL NUMBER : 45
| HCA NAME : MV2_HCA_MLX_CX_QDR
| HETEROGENEOUS HCA : NO
| MV2_VBUF_TOTAL_SIZE : 17408
| MV2_IBA_EAGER_THRESHOLD : 17408
| MV2_RDMA_FAST_PATH_BUF_SIZE : 5120
| MV2_PUT_FALLBACK_THRESHOLD : 8192
| MV2_GET_FALLBACK_THRESHOLD : 0
| MV2_EAGERSIZE_1SC : 8192
| MV2_SMP_EAGERSIZE : 32769
| MV2_SMPI_LENGTH_QUEUE : 131072
| MV2_SMP_NUM_SEND_BUFFER : 16
| MV2_SMP_BATCH_SIZE : 8
| ---------------------------------------------------------------------
|
| MVAPICH2 All Parameters
| MV2_COMM_WORLD_LOCAL_RANK : 0
| MPIRUN_RSH_LAUNCH : 0
| MV2_SHMEM_BACKED_UD_CM : 1
| MV2_3DTORUS_SUPPORT : 0
| MV2_NUM_SA_QUERY_RETRIES : 20
| MV2_NUM_SLS : 8
| MV2_DEFAULT_SERVICE_LEVEL : 0
| MV2_PATH_SL_QUERY : 0
| MV2_USE_QOS : 0
| MV2_USE_MCAST : 0
| MV2_ALLGATHER_BRUCK_THRESHOLD : 524288
| MV2_ALLGATHER_RD_THRESHOLD : 81920
| MV2_ALLGATHER_REVERSE_RANKING : 1
| MV2_ALLGATHERV_RD_THRESHOLD : 0
| MV2_ALLREDUCE_2LEVEL_MSG : 262144
| MV2_ALLREDUCE_SHORT_MSG : 2048
| MV2_ALLTOALL_MEDIUM_MSG : 16384
| MV2_ALLTOALL_SMALL_MSG : 2048
| MV2_ALLTOALL_THROTTLE_FACTOR : 4
| MV2_BCAST_TWO_LEVEL_SYSTEM_SIZE : 64
| MV2_GATHER_SWITCH_PT : 0
| MV2_INTRA_SHMEM_REDUCE_MSG : 2048
| MV2_KNOMIAL_2LEVEL_BCAST_MESSAGE_SIZE_THRESHOLD : 2048
| MV2_KNOMIAL_2LEVEL_BCAST_SYSTEM_SIZE_THRESHOLD : 64
| MV2_KNOMIAL_INTER_LEADER_THRESHOLD : 65536
| MV2_KNOMIAL_INTER_NODE_FACTOR : 4
| MV2_KNOMIAL_INTRA_NODE_FACTOR : 4
| MV2_KNOMIAL_INTRA_NODE_THRESHOLD : 131072
| MV2_RED_SCAT_LARGE_MSG : 524288
| MV2_RED_SCAT_SHORT_MSG : 64
| MV2_REDUCE_2LEVEL_MSG : 16384
| MV2_REDUCE_SHORT_MSG : 8192
| MV2_SCATTER_MEDIUM_MSG : 0
| MV2_SCATTER_SMALL_MSG : 0
| MV2_SHMEM_ALLREDUCE_MSG : 32768
| MV2_SHMEM_COLL_MAX_MSG_SIZE : 131072
| MV2_SHMEM_COLL_NUM_COMM : 8
| MV2_SHMEM_COLL_NUM_PROCS : 10
| MV2_SHMEM_COLL_SPIN_COUNT : 5
| MV2_SHMEM_REDUCE_MSG : 4096
| MV2_USE_BCAST_SHORT_MSG : 16384
| MV2_USE_DIRECT_GATHER : 1
| MV2_USE_DIRECT_GATHER_SYSTEM_SIZE_MEDIUM : 1024
| MV2_USE_DIRECT_GATHER_SYSTEM_SIZE_SMALL : 384
| MV2_USE_DIRECT_SCATTER : 1
| MV2_USE_OSU_COLLECTIVES : 1
| MV2_USE_OSU_NB_COLLECTIVES : 1
| MV2_USE_KNOMIAL_2LEVEL_BCAST : 1
| MV2_USE_KNOMIAL_INTER_LEADER_BCAST : 1
| MV2_USE_SCATTER_RD_INTER_LEADER_BCAST : 1
| MV2_USE_SCATTER_RING_INTER_LEADER_BCAST : 1
| MV2_USE_SHMEM_ALLREDUCE : 1
| MV2_USE_SHMEM_BARRIER : 1
| MV2_USE_SHMEM_BCAST : 1
| MV2_USE_SHMEM_COLL : 1
| MV2_USE_SHMEM_REDUCE : 1
| MV2_USE_TWO_LEVEL_GATHER : 1
| MV2_USE_TWO_LEVEL_SCATTER : 1
| MV2_USE_XOR_ALLTOALL : 1
| MV2_DEFAULT_SRC_PATH_BITS : 0
| MV2_DEFAULT_STATIC_RATE : 0
| MV2_DEFAULT_TIME_OUT : 17237780
| MV2_DEFAULT_MTU : 3
| MV2_DEFAULT_PKEY : 589824
| MV2_DEFAULT_PORT : 40208032
| MV2_DEFAULT_GID_INDEX : 0
| MV2_DEFAULT_PSN : 0
| MV2_DEFAULT_MAX_RECV_WQE : 128
| MV2_DEFAULT_MAX_SEND_WQE : 64
| MV2_DEFAULT_MAX_SG_LIST : 1
| MV2_DEFAULT_MIN_RNR_TIMER : 12
| MV2_DEFAULT_QP_OUS_RD_ATOM : 268701700
| MV2_DEFAULT_RETRY_COUNT : 16844551
| MV2_DEFAULT_RNR_RETRY : 65799
| MV2_DEFAULT_MAX_CQ_SIZE : 40000
| MV2_DEFAULT_MAX_RDMA_DST_OPS : 4
| MV2_INITIAL_PREPOST_DEPTH : 10
| MV2_IWARP_MULTIPLE_CQ_THRESHOLD : 32
| MV2_NUM_HCAS : 1
| MV2_NUM_PORTS : 1
| MV2_NUM_QP_PER_PORT : 1
| MV2_MAX_RDMA_CONNECT_ATTEMPTS : 20
| MV2_ON_DEMAND_UD_INFO_EXCHANGE : 1
| MV2_PREPOST_DEPTH : 64
| MV2_HOMOGENEOUS_CLUSTER : 0
| MV2_NUM_CQES_PER_POLL : 96
| MV2_COALESCE_THRESHOLD : 6
| MV2_DREG_CACHE_LIMIT : 0
| MV2_IBA_EAGER_THRESHOLD : 17408
| MV2_MAX_INLINE_SIZE : 168
| MV2_MAX_R3_PENDING_DATA : 524288
| MV2_MED_MSG_RAIL_SHARING_POLICY : 0
| MV2_NDREG_ENTRIES : 1260
| MV2_NUM_RDMA_BUFFER : 16
| MV2_NUM_SPINS_BEFORE_LOCK : 2000
| MV2_POLLING_LEVEL : 1
| MV2_POLLING_SET_LIMIT : 64
| MV2_POLLING_SET_THRESHOLD : 256
| MV2_R3_NOCACHE_THRESHOLD : 32768
| MV2_R3_THRESHOLD : 4096
| MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD : 17408
| MV2_RAIL_SHARING_MED_MSG_THRESHOLD : 2048
| MV2_RAIL_SHARING_POLICY : 4
| MV2_RDMA_EAGER_LIMIT : 32
| MV2_RDMA_FAST_PATH_BUF_SIZE : 5120
| MV2_RDMA_NUM_EXTRA_POLLS : 1
| MV2_RNDV_EXT_SENDQ_SIZE : 5
| MV2_RNDV_PROTOCOL : 3
| MV2_SMALL_MSG_RAIL_SHARING_POLICY : 0
| MV2_SPIN_COUNT : 5000
| MV2_SRQ_LIMIT : 10
| MV2_SRQ_MAX_SIZE : 4096
| MV2_SRQ_SIZE : 80
| MV2_STRIPING_THRESHOLD : 17408
| MV2_USE_COALESCE : 0
| MV2_USE_XRC : 0
| MV2_VBUF_MAX : -1
| MV2_VBUF_POOL_SIZE : 80
| MV2_VBUF_SECONDARY_POOL_SIZE : 16
| MV2_VBUF_TOTAL_SIZE : 17408
| MV2_USE_IWARP_MODE : 0
| MV2_USE_HWLOC_CPU_BINDING : 1
| MV2_ENABLE_AFFINITY : 1
| MV2_HCA_AWARE_PROCESS_MAPPING : 1
| MV2_ENABLE_LEASTLOAD : 0
| MV2_SMP_BATCH_SIZE : 8
| MV2_SMP_EAGERSIZE : 32769
| MV2_SMPI_LENGTH_QUEUE : 131072
| MV2_SMP_NUM_SEND_BUFFER : 16
| MV2_SMP_SEND_BUF_SIZE : 16384
| MV2_USE_SHARED_MEM : 1
| MV2_SMP_CMA_MAX_SIZE : 4194304
| MV2_SMP_LIMIC2_MAX_SIZE : 0
| MV2_SHOW_ENV_INFO : 2
| MV2_DEFAULT_PUT_GET_LIST_SIZE : 200
| MV2_EAGERSIZE_1SC : 8192
| MV2_GET_FALLBACK_THRESHOLD : 0
| MV2_PIN_POOL_SIZE : 2097152
| MV2_PUT_FALLBACK_THRESHOLD : 8192
| MV2_USE_RDMA_CM : 1
| MV2_ASYNC_THREAD_STACK_SIZE : 1048576
| MV2_THREAD_YIELD_SPIN_THRESHOLD : 5
| MV2_SUPPORT_DPM : 0
| MV2_USE_HUGEPAGES : 1
| ---------------------------------------------------------------------
| Warning: Process to core binding is enabled and OMP_NUM_THREADS is set to non-zero (1) value
| If your program has OpenMP sections, this can cause over-subscription of cores and consequently poor performance
| To avoid this, please re-run your application after setting MV2_ENABLE_AFFINITY=0
| Use MV2_USE_THREAD_WARNING=0 to suppress this message
`----
> I see that you are running the 9 node case also with 80 processes. Could you
> please try running the 9 node case with 90 processes (I am assuming you are
> running 10 processes per node). If this runs, can you please try setting
> MV2_USE_SHMEM_COLL=0 to see if it makes things pass at 9 nodes with 80
> processes?
Both (N=8, n=90) and (N=9, n=90) get stuck, with or without
MV2_USE_SHMEM_COLL=0.
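For reference, a minimal sketch of the arithmetic behind the (N, n)
combinations mentioned in this thread. It assumes N is the number of nodes,
n is the total number of processes, and that processes are spread as evenly
as possible across nodes; the actual placement depends on the scheduler's
distribution settings. The function name split_evenly is only illustrative.

```python
def split_evenly(total_procs, nodes):
    """Return (base, extra): every node gets `base` processes and
    `extra` of the nodes get one more. Assumes an even-as-possible spread."""
    base, extra = divmod(total_procs, nodes)
    return base, extra


# (N, n) combinations mentioned in this thread
for nodes, procs in [(8, 80), (9, 80), (8, 90), (9, 90)]:
    base, extra = split_evenly(procs, nodes)
    if extra == 0:
        print(f"N={nodes} n={procs}: {base} per node")
    else:
        print(f"N={nodes} n={procs}: {extra} nodes with {base + 1}, "
              f"{nodes - extra} nodes with {base}")
```

Under that assumption only (N=8, n=80) and (N=9, n=90) give exactly 10
processes per node; the other two combinations are uneven.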
Many thanks,
--
Ángel de Vicente
Tel.: +34 922 605 747
Web.: http://www.iac.es/proyecto/polmag/