[mvapich-discuss] Stuck in wait with blocking connections
Maksym Planeta
mplaneta at os.inf.tu-dresden.de
Wed Feb 17 09:42:58 EST 2016
Hi,
I found a situation where a program hangs in MPI_Wait while waiting for
an MPI_Igather call to complete.
Here is an example program that shows the effect:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    int size;
    MPI_Comm world_dup;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("%d %d\n", __LINE__, rank);
    fflush(stdout);

    MPI_Comm_dup(MPI_COMM_WORLD, &world_dup);
    MPI_Barrier(world_dup);

    int *array = calloc(size, sizeof(int));
    MPI_Gather(&rank, 1, MPI_INT, array, 1, MPI_INT, 0, world_dup);
    free(array);

    MPI_Barrier(MPI_COMM_WORLD);
    printf("%d %d\n", __LINE__, rank);
    fflush(stdout);

    array = calloc(size, sizeof(int));
    MPI_Request request;
    MPI_Igather(&rank, 1, MPI_INT, array, 1, MPI_INT, 0, world_dup,
                &request);
    printf("%d %d\n", __LINE__, rank);
    MPI_Wait(&request, MPI_STATUS_IGNORE);
    free(array);

    printf("Hi %d\n", rank);
    fflush(stdout);
    MPI_Finalize();
}
To reproduce the hang, it was important to duplicate the
MPI_COMM_WORLD communicator, use many processes per node, and use
MPI_Igather. Adding an fflush before MPI_Wait allows the program to continue.
I ran the reproducer with mvapich-2.2b, with no further modifications.
I used the following srun command:
srun --nodes=2 --overcommit --ntasks=384 --distribution=block
--mem-per-cpu=2500 --cpu_bind=v,none --kill-on-bad-exit --mpi=pmi2
I also set the following environment variables:
export MV2_ON_DEMAND_THRESHOLD=1000
export MV2_USE_BLOCKING=1
export MV2_ENABLE_AFFINITY=0
export MV2_USE_SHARED_MEM=0
export MV2_RDMA_NUM_EXTRA_POLLS=1
export MV2_USE_EAGER_FAST_SEND=0
export MV2_USE_UD_HYBRID=0
export MV2_SHMEM_BACKED_UD_CM=0
export MV2_CM_MAX_SPIN_COUNT=1
export MV2_SPIN_COUNT=1
export MV2_DEBUG_SHOW_BACKTRACE=1
export MV2_DEBUG_CORESIZE=unlimited
Compilation configuration:
$ mpiname -a
MVAPICH2 2.2b Mon Nov 12 20:00:00 EST 2015 ch3:mrail
Compilation
CC: gcc -DNDEBUG -DNVALGRIND -O2
CXX: g++ -DNDEBUG -DNVALGRIND -O2
F77: gfortran -O2
FC: gfortran -O2
Configuration
--enable-fortran=all --enable-cxx --with-rdma=gen2
--with-device=ch3:mrail --enable-alloca --enable-hwloc
--disable-dependency-tracking --with-pmi=pmi2 --with-pm=slurm
--with-slurm=/opt/slurm/15.08.6_20151221-0628/ --prefix=<path>
--
Regards,
Maksym Planeta