[mvapich-discuss] Job doesn't even start with core count > ~100. Help trying to diagnose the problem

Angel de Vicente angelv at iac.es
Wed Dec 19 10:20:50 EST 2018


Hello,

many thanks for your help. 

"Subramoni, Hari" <subramoni.1 at osu.edu> writes:
> Can you please send the following information
> 1. output of mpiname -a

,----
| MVAPICH2 2.3rc2 Mon Apr 30 22:00:00 EST 2018 ch3:mrail
| 
| Compilation
| CC: icc    -DNDEBUG -DNVALGRIND -O2
| CXX: icpc   -DNDEBUG -DNVALGRIND -O2
| F77: ifort -L/lib -L/lib   -O2
| FC: ifort   -O2
| 
| Configuration
| --prefix=/apps/MVAPICH2/2.3rc2/INTEL --with-pmi --with-slurm --enable-romio --with-file-system=lustre --with-mxm=/opt/mellanox/mxm --with-hcoll=/opt/mellanox/hcoll --with-knem=/opt/knem-1.1.2.90mlnx2
`----


> 2. output of 9 node run after setting the following environment
> variables "MV2_SHOW_ENV_INFO=2 MV2_SHOW_CPU_MAPPING=1
> MV2_SHOW_HCA_MAPPING=1"

With N=9 n=80 this produces no extra output, because the job gets stuck
as explained earlier. With N=8 n=80 I get the following:

,----
|  MVAPICH2-2.3rc2 Parameters
| ---------------------------------------------------------------------
| 	PROCESSOR ARCH NAME            : MV2_ARCH_INTEL_XEON_E5_2670_16
| 	PROCESSOR FAMILY NAME          : MV2_CPU_FAMILY_INTEL
| 	PROCESSOR MODEL NUMBER         : 45
| 	HCA NAME                       : MV2_HCA_MLX_CX_QDR
| 	HETEROGENEOUS HCA              : NO
| 	MV2_VBUF_TOTAL_SIZE            : 17408
| 	MV2_IBA_EAGER_THRESHOLD        : 17408
| 	MV2_RDMA_FAST_PATH_BUF_SIZE    : 5120
| 	MV2_PUT_FALLBACK_THRESHOLD     : 8192
| 	MV2_GET_FALLBACK_THRESHOLD     : 0
| 	MV2_EAGERSIZE_1SC              : 8192
| 	MV2_SMP_EAGERSIZE              : 32769
| 	MV2_SMPI_LENGTH_QUEUE          : 131072
| 	MV2_SMP_NUM_SEND_BUFFER        : 16
| 	MV2_SMP_BATCH_SIZE             : 8
| ---------------------------------------------------------------------
| 
|  MVAPICH2 All Parameters
| 	MV2_COMM_WORLD_LOCAL_RANK           : 0
| 	MPIRUN_RSH_LAUNCH                   : 0
| 	MV2_SHMEM_BACKED_UD_CM              : 1
| 	MV2_3DTORUS_SUPPORT                 : 0
| 	MV2_NUM_SA_QUERY_RETRIES            : 20
| 	MV2_NUM_SLS                         : 8
| 	MV2_DEFAULT_SERVICE_LEVEL           : 0
| 	MV2_PATH_SL_QUERY                   : 0
| 	MV2_USE_QOS                         : 0
| 	MV2_USE_MCAST                       : 0
| 	MV2_ALLGATHER_BRUCK_THRESHOLD       : 524288
| 	MV2_ALLGATHER_RD_THRESHOLD          : 81920
| 	MV2_ALLGATHER_REVERSE_RANKING       : 1
| 	MV2_ALLGATHERV_RD_THRESHOLD         : 0
| 	MV2_ALLREDUCE_2LEVEL_MSG            : 262144
| 	MV2_ALLREDUCE_SHORT_MSG             : 2048
| 	MV2_ALLTOALL_MEDIUM_MSG             : 16384
| 	MV2_ALLTOALL_SMALL_MSG              : 2048
| 	MV2_ALLTOALL_THROTTLE_FACTOR        : 4
| 	MV2_BCAST_TWO_LEVEL_SYSTEM_SIZE     : 64
| 	MV2_GATHER_SWITCH_PT                : 0
| 	MV2_INTRA_SHMEM_REDUCE_MSG          : 2048
| 	MV2_KNOMIAL_2LEVEL_BCAST_MESSAGE_SIZE_THRESHOLD : 2048
| 	MV2_KNOMIAL_2LEVEL_BCAST_SYSTEM_SIZE_THRESHOLD : 64
| 	MV2_KNOMIAL_INTER_LEADER_THRESHOLD  : 65536
| 	MV2_KNOMIAL_INTER_NODE_FACTOR       : 4
| 	MV2_KNOMIAL_INTRA_NODE_FACTOR       : 4
| 	MV2_KNOMIAL_INTRA_NODE_THRESHOLD    : 131072
| 	MV2_RED_SCAT_LARGE_MSG              : 524288
| 	MV2_RED_SCAT_SHORT_MSG              : 64
| 	MV2_REDUCE_2LEVEL_MSG               : 16384
| 	MV2_REDUCE_SHORT_MSG                : 8192
| 	MV2_SCATTER_MEDIUM_MSG              : 0
| 	MV2_SCATTER_SMALL_MSG               : 0
| 	MV2_SHMEM_ALLREDUCE_MSG             : 32768
| 	MV2_SHMEM_COLL_MAX_MSG_SIZE         : 131072
| 	MV2_SHMEM_COLL_NUM_COMM             : 8
| 	MV2_SHMEM_COLL_NUM_PROCS            : 10
| 	MV2_SHMEM_COLL_SPIN_COUNT           : 5
| 	MV2_SHMEM_REDUCE_MSG                : 4096
| 	MV2_USE_BCAST_SHORT_MSG             : 16384
| 	MV2_USE_DIRECT_GATHER               : 1
| 	MV2_USE_DIRECT_GATHER_SYSTEM_SIZE_MEDIUM : 1024
| 	MV2_USE_DIRECT_GATHER_SYSTEM_SIZE_SMALL : 384
| 	MV2_USE_DIRECT_SCATTER              : 1
| 	MV2_USE_OSU_COLLECTIVES             : 1
| 	MV2_USE_OSU_NB_COLLECTIVES          : 1
| 	MV2_USE_KNOMIAL_2LEVEL_BCAST        : 1
| 	MV2_USE_KNOMIAL_INTER_LEADER_BCAST  : 1
| 	MV2_USE_SCATTER_RD_INTER_LEADER_BCAST : 1
| 	MV2_USE_SCATTER_RING_INTER_LEADER_BCAST : 1
| 	MV2_USE_SHMEM_ALLREDUCE             : 1
| 	MV2_USE_SHMEM_BARRIER               : 1
| 	MV2_USE_SHMEM_BCAST                 : 1
| 	MV2_USE_SHMEM_COLL                  : 1
| 	MV2_USE_SHMEM_REDUCE                : 1
| 	MV2_USE_TWO_LEVEL_GATHER            : 1
| 	MV2_USE_TWO_LEVEL_SCATTER           : 1
| 	MV2_USE_XOR_ALLTOALL                : 1
| 	MV2_DEFAULT_SRC_PATH_BITS           : 0
| 	MV2_DEFAULT_STATIC_RATE             : 0
| 	MV2_DEFAULT_TIME_OUT                : 17237780
| 	MV2_DEFAULT_MTU                     : 3
| 	MV2_DEFAULT_PKEY                    : 589824
| 	MV2_DEFAULT_PORT                    : 40208032
| 	MV2_DEFAULT_GID_INDEX               : 0
| 	MV2_DEFAULT_PSN                     : 0
| 	MV2_DEFAULT_MAX_RECV_WQE            : 128
| 	MV2_DEFAULT_MAX_SEND_WQE            : 64
| 	MV2_DEFAULT_MAX_SG_LIST             : 1
| 	MV2_DEFAULT_MIN_RNR_TIMER           : 12
| 	MV2_DEFAULT_QP_OUS_RD_ATOM          : 268701700
| 	MV2_DEFAULT_RETRY_COUNT             : 16844551
| 	MV2_DEFAULT_RNR_RETRY               : 65799
| 	MV2_DEFAULT_MAX_CQ_SIZE             : 40000
| 	MV2_DEFAULT_MAX_RDMA_DST_OPS        : 4
| 	MV2_INITIAL_PREPOST_DEPTH           : 10
| 	MV2_IWARP_MULTIPLE_CQ_THRESHOLD     : 32
| 	MV2_NUM_HCAS                        : 1
| 	MV2_NUM_PORTS                       : 1
| 	MV2_NUM_QP_PER_PORT                 : 1
| 	MV2_MAX_RDMA_CONNECT_ATTEMPTS       : 20
| 	MV2_ON_DEMAND_UD_INFO_EXCHANGE      : 1
| 	MV2_PREPOST_DEPTH                   : 64
| 	MV2_HOMOGENEOUS_CLUSTER             : 0
| 	MV2_NUM_CQES_PER_POLL               : 96
| 	MV2_COALESCE_THRESHOLD              : 6
| 	MV2_DREG_CACHE_LIMIT                : 0
| 	MV2_IBA_EAGER_THRESHOLD             : 17408
| 	MV2_MAX_INLINE_SIZE                 : 168
| 	MV2_MAX_R3_PENDING_DATA             : 524288
| 	MV2_MED_MSG_RAIL_SHARING_POLICY     : 0
| 	MV2_NDREG_ENTRIES                   : 1260
| 	MV2_NUM_RDMA_BUFFER                 : 16
| 	MV2_NUM_SPINS_BEFORE_LOCK           : 2000
| 	MV2_POLLING_LEVEL                   : 1
| 	MV2_POLLING_SET_LIMIT               : 64
| 	MV2_POLLING_SET_THRESHOLD           : 256
| 	MV2_R3_NOCACHE_THRESHOLD            : 32768
| 	MV2_R3_THRESHOLD                    : 4096
| 	MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD : 17408
| 	MV2_RAIL_SHARING_MED_MSG_THRESHOLD  : 2048
| 	MV2_RAIL_SHARING_POLICY             : 4
| 	MV2_RDMA_EAGER_LIMIT                : 32
| 	MV2_RDMA_FAST_PATH_BUF_SIZE         : 5120
| 	MV2_RDMA_NUM_EXTRA_POLLS            : 1
| 	MV2_RNDV_EXT_SENDQ_SIZE             : 5
| 	MV2_RNDV_PROTOCOL                   : 3
| 	MV2_SMALL_MSG_RAIL_SHARING_POLICY   : 0
| 	MV2_SPIN_COUNT                      : 5000
| 	MV2_SRQ_LIMIT                       : 10
| 	MV2_SRQ_MAX_SIZE                    : 4096
| 	MV2_SRQ_SIZE                        : 80
| 	MV2_STRIPING_THRESHOLD              : 17408
| 	MV2_USE_COALESCE                    : 0
| 	MV2_USE_XRC                         : 0
| 	MV2_VBUF_MAX                        : -1
| 	MV2_VBUF_POOL_SIZE                  : 80
| 	MV2_VBUF_SECONDARY_POOL_SIZE        : 16
| 	MV2_VBUF_TOTAL_SIZE                 : 17408
| 	MV2_USE_IWARP_MODE                  : 0
| 	MV2_USE_HWLOC_CPU_BINDING           : 1
| 	MV2_ENABLE_AFFINITY                 : 1
| 	MV2_HCA_AWARE_PROCESS_MAPPING       : 1
| 	MV2_ENABLE_LEASTLOAD                : 0
| 	MV2_SMP_BATCH_SIZE                  : 8
| 	MV2_SMP_EAGERSIZE                   : 32769
| 	MV2_SMPI_LENGTH_QUEUE               : 131072
| 	MV2_SMP_NUM_SEND_BUFFER             : 16
| 	MV2_SMP_SEND_BUF_SIZE               : 16384
| 	MV2_USE_SHARED_MEM                  : 1
| 	MV2_SMP_CMA_MAX_SIZE                : 4194304
| 	MV2_SMP_LIMIC2_MAX_SIZE             : 0
| 	MV2_SHOW_ENV_INFO                   : 2
| 	MV2_DEFAULT_PUT_GET_LIST_SIZE       : 200
| 	MV2_EAGERSIZE_1SC                   : 8192
| 	MV2_GET_FALLBACK_THRESHOLD          : 0
| 	MV2_PIN_POOL_SIZE                   : 2097152
| 	MV2_PUT_FALLBACK_THRESHOLD          : 8192
| 	MV2_USE_RDMA_CM                     : 1
| 	MV2_ASYNC_THREAD_STACK_SIZE         : 1048576
| 	MV2_THREAD_YIELD_SPIN_THRESHOLD     : 5
| 	MV2_SUPPORT_DPM                     : 0
| 	MV2_USE_HUGEPAGES                   : 1
| ---------------------------------------------------------------------
| Warning: Process to core binding is enabled and OMP_NUM_THREADS is set to non-zero (1) value
| If your program has OpenMP sections, this can cause over-subscription of cores and consequently poor performance
| To avoid this, please re-run your application after setting MV2_ENABLE_AFFINITY=0
| Use MV2_USE_THREAD_WARNING=0 to suppress this message
`----
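
For anyone wanting to reproduce the run above, here is a minimal sketch of
how the three diagnostic variables can be set before launching. It assumes
that N and n correspond to the -N (nodes) and -n (tasks) options of Slurm's
srun, which seems plausible given the --with-slurm --with-pmi build but is
not stated in the thread. The helper launch_cmd and the binary name
./my_app are hypothetical placeholders.

```shell
#!/bin/sh
# Diagnostic variables requested in the thread.
export MV2_SHOW_ENV_INFO=2
export MV2_SHOW_CPU_MAPPING=1
export MV2_SHOW_HCA_MAPPING=1

# Hypothetical helper: print the srun command line for a given
# node count ($1) and task count ($2). ./my_app is a placeholder.
launch_cmd() {
    echo "srun -N $1 -n $2 ./my_app"
}

launch_cmd 8 80   # the combination that runs
launch_cmd 9 80   # the combination that gets stuck

# To actually launch instead of printing, something like:
#   eval "$(launch_cmd 8 80)"
```

The helper only prints the command, so the node and task counts can be
checked before a real job is submitted.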



> I see that you are running the 9 node case also with 80 processes. Could you
> please try running the 9 node case with 90 processes (I am assuming you are
> running 10 processes per node). If this runs, can you please try setting
> MV2_USE_SHMEM_COLL=0 to see if it makes things pass at 9 nodes with 80
> processes?


Both (N=8, n=90) and (N=9, n=90) get stuck, with or without
MV2_USE_SHMEM_COLL=0.
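
Since the suggestion above assumes 10 processes per node, it may help to
see how the four combinations mentioned in this thread would divide. The
sketch below is plain arithmetic and assumes the tasks are spread as evenly
as possible over the nodes; the actual placement is decided by the launcher
and was not checked here. The helper name split_tasks is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: split n tasks ($2) over N nodes ($1) as evenly
# as possible. This is arithmetic only, not a query of the scheduler.
split_tasks() {
    nodes=$1
    tasks=$2
    base=$((tasks / nodes))    # tasks that every node gets at least
    extra=$((tasks % nodes))   # number of nodes that get one more task
    echo "N=$nodes n=$tasks: $extra node(s) with $((base + 1)), $((nodes - extra)) node(s) with $base"
}

split_tasks 8 80
split_tasks 9 80
split_tasks 8 90
split_tasks 9 90
```

Under this assumption only (N=8, n=80) and (N=9, n=90) come out at exactly
10 processes per node; the other two combinations are uneven.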



Many thanks,
-- 
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://www.iac.es/proyecto/polmag/

