[mvapich-discuss] asynchronous progress with CUDA

Tue Apr 23 03:47:55 EDT 2013

Dear mvapich2 team

I have a fat node with 8 GPUs and a simple communication pattern that uses MPI_Isend and MPI_Irecv on GPU (device) pointers. I would like to drive its progress from an additional thread.

Below is a snippet of the function that I start with pthread_create. (The MPI_Irecv in it uses a tag that is never sent, so the request never completes.)

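void* mpi_test_fn(void* ptr)
{
  MPI_Request req;
  MPI_Status status;
  double* b;
  cudaMalloc((void**)&b, sizeof(double));  /* device buffer */

  /* Pass the device pointer itself, not the address of the host variable. */
  /* This receive is never matched; it only gives MPI_Test a request to poll. */
  MPI_Irecv(b, 1, MPI_DOUBLE, 0, 599999, MPI_COMM_WORLD, &req);
  int flag;
  while(true)
    MPI_Test(&req, &flag, &status);  /* drives the MPI progress engine */
}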

The trick works with CPU communication, i.e. if the pointer I pass to MPI_Isend and MPI_Irecv is a host pointer, and the asynchronous progress seems to work as well.
But it crashes when I use GPU pointers. The thread that crashes is the one created with pthread_create and calling MPI_Test.
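
For reference, the polling thread could also be written so that it can be stopped and joined instead of spinning forever. The following is only a minimal sketch in plain C with pthreads, not what I actually run: progress_ctx, progress_start, progress_stop and count_poll are hypothetical names (none of them are MPI or MVAPICH2 API), and poll_once is a stand-in for the MPI_Test call.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical context for a polling thread; not part of MPI or MVAPICH2. */
typedef struct {
    pthread_t thread;
    atomic_bool stop;            /* set by the owner to end the loop */
    void (*poll_once)(void *);   /* stand-in for MPI_Test(&req, &flag, &status) */
    void *poll_arg;
} progress_ctx;

/* Example stand-in for the poll call: it only counts invocations. */
void count_poll(void *arg)
{
    unsigned long *calls = arg;
    ++*calls;
}

static void *progress_loop(void *arg)
{
    progress_ctx *ctx = arg;
    /* Poll at least once, then keep polling until asked to stop. */
    do {
        ctx->poll_once(ctx->poll_arg);
    } while (!atomic_load(&ctx->stop));
    return NULL;
}

/* Start the polling thread; returns 0 on success, like pthread_create. */
int progress_start(progress_ctx *ctx, void (*poll_once)(void *), void *poll_arg)
{
    atomic_init(&ctx->stop, false);
    ctx->poll_once = poll_once;
    ctx->poll_arg = poll_arg;
    return pthread_create(&ctx->thread, NULL, progress_loop, ctx);
}

/* Ask the loop to exit and wait for the thread; returns 0 on success. */
int progress_stop(progress_ctx *ctx)
{
    atomic_store(&ctx->stop, true);
    return pthread_join(ctx->thread, NULL);
}
```

In the real case poll_once would wrap the MPI_Test call, and progress_stop would be called before MPI_Finalize. The pending receive would still have to be matched or cancelled before finalizing.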

The segmentation fault happens in
src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_rndv.c
in the MPIDI_CH3_CUDAIPC_Rendezvous_push function.

Early in this function there is code like the following (simplifying):
        cudaStream_t strm = 0;
        strm = stream_d2h;
But stream_d2h was never created, so strm holds a null pointer, which later triggers the segmentation fault.

The crash only happens with VAPI_PROTOCOL_CUDAIPC. I also tested with devices that have no peer-to-peer capability; in that case the whole communication has to go through VAPI_PROTOCOL_R3, which seems to work: there is no crash and progress happens.

Am I missing something? Has anyone already succeeded with asynchronous progress on CUDA device communication, perhaps with a different approach?

For reference, I am using mvapich2/1.9rc1 with the following configure line:
./configure --enable-threads=multiple --enable-shared --enable-sharedlibs=gcc --enable-fc --enable-cxx --with-mpe --enable-rdma-cm --enable-fast --enable-smpcoll --with-hwloc --enable-xrc --with-device=ch3:mrail --with-rdma=gen2 --enable-cuda --enable-g=dbg --enable-debuginfo --enable-async-progress CC=gcc CXX=g++ FC=gfortran F77=gfortran

Thanks for the help, Carlos