[mvapich-discuss] Hangs in blocking progress mode

Georg Geiser Georg.Geiser at dlr.de
Fri Aug 30 10:59:35 EDT 2019


I am facing several hangs when using blocking mode progress via 
MV2_USE_BLOCKING=1. Here is a first standalone reproducer:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char* argv[]) {

   double data = 1;
   MPI_Request request;

   MPI_Init(&argc, &argv);

   MPI_Iallreduce(MPI_IN_PLACE, &data, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD, &request);

   MPI_Wait(&request, MPI_STATUS_IGNORE);

   MPI_Finalize();

   return EXIT_SUCCESS;
}

This hangs with:

MV2_USE_BLOCKING=1 MV2_SPIN_COUNT=0 mpirun -np 1
MV2_USE_BLOCKING=1 MV2_SPIN_COUNT=1 mpirun -np 1

and succeeds with:

MV2_USE_BLOCKING=1 MV2_SPIN_COUNT=2 mpirun -np 1

as well as with higher -np values and arbitrary values of MV2_SPIN_COUNT.

Admittedly, this allreduce is effectively a no-op for -np 1, but my real
application also hangs at random in places where actual communication is
performed and MV2_SPIN_COUNT is left at its default of 5000. Choosing
higher spin counts helps there as well, but in my opinion MVAPICH
should always make progress, even with MV2_SPIN_COUNT=0.
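
The expectation above can be illustrated with a generic spin-then-block
wait. This is only a sketch on a plain POSIX file descriptor, not
MVAPICH's progress engine; spin_then_block() is a hypothetical helper
introduced for illustration. With a spin count of 0 the loop body is
skipped, and the blocking poll() still returns as soon as the
descriptor is readable.

```c
#include <poll.h>

/* Hypothetical helper, not an MVAPICH function.
 * Waits until fd becomes readable: polls without blocking up to
 * spin_count times, then falls back to a blocking poll().
 * Returns 0 if the event was seen while spinning, 1 if it was seen
 * by the blocking wait, and -1 on error (EINTR is not retried). */
static int spin_then_block(int fd, int spin_count) {
    struct pollfd p = { .fd = fd, .events = POLLIN };

    for (int i = 0; i < spin_count; ++i) {
        int rc = poll(&p, 1, 0);      /* timeout 0: return immediately */
        if (rc < 0) return -1;
        if (rc > 0) return 0;         /* event found while spinning */
    }

    /* Spin budget exhausted (or zero): block until the event arrives.
     * A pending event makes this return at once, so nothing is lost. */
    if (poll(&p, 1, -1) < 0) return -1;
    return 1;
}
```

If the event is already pending, spin_then_block(fd, 0) takes the
blocking path and returns at once, while spin_then_block(fd, 5) finds it
on the first non-blocking poll.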

I have not yet been able to track down the problem myself. However, while
inspecting MPIDI_CH3I_MRAILI_Cq_poll_ib() I may have stumbled upon an
additional issue. This function loops over all available HCAs (using the
i variable), but blocks on only a single HCA once
rdma_blocking_spin_count_threshold is exceeded. I guess it should
instead block on all HCAs to ensure that all events are captured,
though this would require one separate thread for each blocked HCA.
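
As a possible alternative to one thread per HCA: in libibverbs a
completion channel (struct ibv_comp_channel) exposes a file descriptor
in its fd member, so several channels can in principle be waited on
with a single poll() call. The sketch below shows only that pattern on
generic descriptors. wait_any_channel() and MAX_CHANNELS are
hypothetical names, and whether MVAPICH's progress engine can be
restructured this way is an assumption this sketch does not establish.

```c
#include <poll.h>

#define MAX_CHANNELS 8   /* assumption: small fixed bound for the sketch */

/* Hypothetical helper, not an MVAPICH function.
 * Blocks until at least one of the n descriptors is readable.
 * Returns the index of the first readable descriptor, or -1 on
 * error or if n is out of range (EINTR is not retried). */
static int wait_any_channel(const int *fds, int n) {
    struct pollfd p[MAX_CHANNELS];

    if (n <= 0 || n > MAX_CHANNELS) return -1;

    for (int i = 0; i < n; ++i) {
        p[i].fd = fds[i];
        p[i].events = POLLIN;
        p[i].revents = 0;
    }

    /* One blocking call covers every descriptor at once. */
    if (poll(p, (nfds_t)n, -1) < 0) return -1;

    for (int i = 0; i < n; ++i) {
        if (p[i].revents & POLLIN) return i;
    }
    return -1;
}
```

With real verbs code, fds[i] would be the fd member of the completion
channel for HCA i, and the caller would then handle the event on the
channel whose index is returned.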

However, I only have one HCA so this is not the reason for my hangs.

It seems like the problem has already been reported here:

http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2016-September/006178.html

However, there seems to have been no progress on fixing it.


Here is the output from mpiname -a:

MVAPICH2 2.3.2 Fri August 9 22:00:00 EST 2019 ch3:mrail


Compilation
CC: /export/opt/gcc-8.2.0/bin/gcc    -DNDEBUG -DNVALGRIND -O2
CXX: /export/opt/gcc-8.2.0/bin/g++   -DNDEBUG -DNVALGRIND -O2
F77: /export/opt/gcc-8.2.0/bin/gfortran -L/lib -L/lib   -O2
FC: /export/opt/gcc-8.2.0/bin/gfortran   -O2

Configuration
--enable-threads=funneled --disable-mcast


Georg


