[mvapich-discuss] asynchronous progress with CUDA
Carlos Osuna
carlos.osuna at env.ethz.ch
Wed May 8 07:51:37 EDT 2013
Dear Devendar,
Thanks for providing the patch. I recently had a chance to try it, and it
works perfectly.
Cheers, Carlos
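For the archives, this is roughly how a launch with the internal progress thread enabled can look. This is only a sketch: the hostnames, the application name, and the extra MV2_USE_CUDA=1 switch are my own assumptions, not something Devendar specified.

```shell
# Enable MVAPICH2's internal asynchronous progress thread (the run-time
# flag mentioned in the thread) plus the usual CUDA switch (assumed).
export MPICH_ASYNC_PROGRESS=1
export MV2_USE_CUDA=1

# Launch only if mpirun_rsh is on the PATH; node01/node02 and
# ./my_gpu_app are placeholders for real hosts and the real binary.
if command -v mpirun_rsh >/dev/null 2>&1; then
  mpirun_rsh -np 2 node01 node02 \
    MPICH_ASYNC_PROGRESS=1 MV2_USE_CUDA=1 ./my_gpu_app
fi
```

mpirun_rsh accepts NAME=VALUE pairs before the executable, which is why the variables appear both as exports and on the launch line.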
On 04/29/2013 01:29 AM, Devendar Bureddy wrote:
> Hi Carlos
>
> This issue is caused by no CUDA context being set in the thread. MVAPICH2
> supports an asynchronous communication progress thread via the run-time
> parameter MPICH_ASYNC_PROGRESS=1; you can use this functionality instead
> of your own thread. The attached patch fixes the context issue in the
> internal async progress thread. You can try the patch together with the
> above-mentioned run-time flag. Please follow the instructions below to
> apply the patch.
>
> [mvapich2-1.9rc1]$ patch -p1 < ./diff.patch
> patching file src/mpi/init/async.c
> patching file src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_util.c
> patching file src/mpid/ch3/include/mpidimpl.h
>
> -Devendar
>
>
> On Tue, Apr 23, 2013 at 12:02 PM, Devendar Bureddy
> <bureddy at cse.ohio-state.edu <mailto:bureddy at cse.ohio-state.edu>> wrote:
>
> Hi Carlos
>
> Thanks for your report. We will take a look at it.
>
> -Devendar
>
>
> On Tue, Apr 23, 2013 at 3:47 AM, Osuna Escamilla Carlos
> <carlos.osuna at env.ethz.ch <mailto:carlos.osuna at env.ethz.ch>> wrote:
>
> Dear mvapich2 team
>
> I have a fat node with 8 GPUs and a simple communication with
> MPI_Isend & MPI_Irecv on GPU pointers, which I would like to
> progress with an additional thread.
>
> Below I post a snippet with the function that is passed to
> pthread_create (the tag in the MPI_Irecv is never matched, so the
> loop spins indefinitely).
>
> void* mpi_test_fn(void* ptr)
> {
>     MPI_Request req;
>     MPI_Status status;
>     double* b;
>     cudaMalloc(&b, sizeof(double));
>
>     /* tag 599999 is never matched, so MPI_Test spins forever,
>        driving MPI progress from this helper thread */
>     MPI_Irecv(b, 1, MPI_DOUBLE, 0, 599999, MPI_COMM_WORLD, &req);
>     int flag;
>     while (1)
>         MPI_Test(&req, &flag, &status);
>     return NULL;  /* unreachable */
> }
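Given the diagnosis above (no CUDA context set in the thread), one workaround sketch would be to bind a CUDA context to the helper thread before it makes any MPI call on device pointers. The device index 0 and the cudaFree(0) context-creation idiom below are my assumptions for illustration, not something from the original report:

```c
#include <cuda_runtime.h>
#include <mpi.h>
#include <stddef.h>

void* mpi_test_fn_with_context(void* ptr)
{
    /* Bind a CUDA context to this thread before any MPI call that
       touches device memory (device 0 is an assumption). */
    cudaSetDevice(0);
    cudaFree(0);  /* forces lazy context creation to happen now */

    double* b;
    cudaMalloc(&b, sizeof(double));

    MPI_Request req;
    MPI_Status status;
    int flag;
    /* as in the snippet above: tag never matched, spin to drive progress */
    MPI_Irecv(b, 1, MPI_DOUBLE, 0, 599999, MPI_COMM_WORLD, &req);
    while (1)
        MPI_Test(&req, &flag, &status);
    return NULL;  /* unreachable */
}
```

With the attached patch applied, MVAPICH2's own async progress thread should make this manual context handling unnecessary.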
>
> This trick works with CPU communication, i.e. if the pointer I
> pass to MPI_Isend & MPI_Irecv is a host pointer, and the
> asynchronous progress seems to work as well.
> But it crashes when I use GPU pointers (it is the pthread-created
> thread, the one calling MPI_Test, that crashes).
>
> The segmentation fault happens in
> src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_rndv.c
> in the MPIDI_CH3_CUDAIPC_Rendezvous_push function.
>
> Early in this function, there is some code like (simplifying)
>     cudaStream_t strm = 0;
>     strm = stream_d2h;
> But stream_d2h was never created, so strm contains a null
> pointer, which later on triggers the segfault.
>
> The crash only happens with VAPI_PROTOCOL_CUDAIPC. I also tested
> with devices without peer-to-peer capability; there the whole
> communication has to go via VAPI_PROTOCOL_R3, which seems to
> work, i.e. there is no crash and progress happens.
>
> Am I missing something? Perhaps someone has already succeeded
> with asynchronous progress for CUDA device communication using a
> different approach?
>
> For reference, I am using mvapich2/1.9rc1 with the following
> configure
> ./configure --enable-threads=multiple --enable-shared
> --enable-sharedlibs=gcc --enable-fc --enable-cxx --with-mpe
> --enable-rdma-cm --enable-fast --enable-smpcoll --with-hwloc
> --enable-xrc --with-device=ch3:mrail --with-rdma=gen2
> --enable-cuda --enable-g=dbg --enable-debuginfo
> --enable-async-progress CC=gcc CXX=g++ FC=gfortran F77=gfortran
>
>
> thanks for the help, Carlos
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> <mailto:mvapich-discuss at cse.ohio-state.edu>
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>
>
>
> --
> Devendar
>
>
>
>
> --
> Devendar
--
----------------------------------------------------
Carlos Osuna
ETH Zürich
Institute for Atmospheric and Climate Sciences
Universitätstrasse 16
CH-8092 Zurich, Switzerland
Tel: +41 (44) 632 82 66
----------------------------------------------------