[mvapich-discuss] asynchronous progress with CUDA

Devendar Bureddy bureddy at cse.ohio-state.edu
Tue Apr 23 12:02:15 EDT 2013


Hi Carlos

Thanks for your report. We will take a look at it.

-Devendar


On Tue, Apr 23, 2013 at 3:47 AM, Osuna Escamilla Carlos <
carlos.osuna at env.ethz.ch> wrote:

> Dear mvapich2 team
>
> I have a fat node with 8 GPUs and a simple communication with MPI_Isend &
> MPI_Irecv on gpu pointers, which I would like to progress with an
> additional thread.
>
> Below I post a snippet with the function that is passed to
> pthread_create (the tag in the MPI_Irecv is deliberately never
> matched, so the test loop keeps running and driving progress).
>
> void* mpi_test_fn(void* ptr)
> {
>   MPI_Request req;
>   MPI_Status status;
>   int flag = 0;
>   double* b;
>   cudaMalloc((void**)&b, sizeof(double));
>
>   /* tag 599999 is deliberately never matched; the loop below only
>      serves to drive MPI progress */
>   MPI_Irecv(b, 1, MPI_DOUBLE, 0, 599999, MPI_COMM_WORLD, &req);
>   while (true)
>     MPI_Test(&req, &flag, &status);
>   return NULL;
> }
>
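> For readers trying the same pattern, the launch side can be sketched
> without MPI or CUDA (a minimal illustration of the progress-thread
> idea only; run_progress_demo and the shared flag are hypothetical
> stand-ins for the MPI request, not part of the code above):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Stands in for MPI_Test's completion flag in this sketch. */
static atomic_int request_done;

static void *progress_fn(void *arg)
{
    (void)arg;
    /* Poll until the "request" completes, as mpi_test_fn polls
     * MPI_Test; a real progress thread would call MPI_Test here. */
    while (!atomic_load(&request_done))
        ;
    return NULL;
}

/* Launch the progress thread, "complete" the request, and join.
 * Returns 1 on success, 0 if the thread could not be created. */
int run_progress_demo(void)
{
    pthread_t tid;
    if (pthread_create(&tid, NULL, progress_fn, NULL) != 0)
        return 0;
    atomic_store(&request_done, 1);
    pthread_join(tid, NULL);
    return 1;
}
```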
> The trick works for CPU communication, i.e. if the pointers I pass to
> MPI_Isend & MPI_Irecv are host pointers, and the asynchronous progress
> seems to work as well. But it crashes when I use GPU pointers (it is
> the thread created with pthread_create, the one calling MPI_Test, that
> crashes).
>
> The segmentation fault happens in
> src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_rndv.c
> in the MPIDI_CH3_CUDAIPC_Rendezvous_push function.
>
> Early in this function, there is some code like (simplifying):
>         cudaStream_t strm = 0;
>         strm = stream_d2h;
> But stream_d2h was never created, so strm ends up holding a null
> stream handle, which later triggers the segmentation fault.
>
> The crash only happens with VAPI_PROTOCOL_CUDAIPC. I also tested with
> devices without peer-to-peer capability; there the whole communication
> goes via VAPI_PROTOCOL_R3, which works, i.e. there is no crash and the
> progress happens.
>
> Am I missing something? Or has someone already achieved this kind of
> asynchronous progress for CUDA device communication with a different
> approach?
>
> For reference, I am using mvapich2/1.9rc1 with the following configure:
> ./configure --enable-threads=multiple --enable-shared
> --enable-sharedlibs=gcc --enable-fc --enable-cxx --with-mpe
> --enable-rdma-cm --enable-fast --enable-smpcoll --with-hwloc --enable-xrc
> --with-device=ch3:mrail --with-rdma=gen2 --enable-cuda --enable-g=dbg
> --enable-debuginfo --enable-async-progress CC=gcc CXX=g++ FC=gfortran
> F77=gfortran
>
>
> thanks for the help, Carlos
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>



-- 
Devendar