[mvapich-discuss] MVAPICH2.0-8-GDR - error if cudaSetDevice() on multi node/multi GPU

khaled hamidouche khaledhamidouche at gmail.com
Mon Nov 16 07:49:46 EST 2015


Dear Sebastian,

Apologies for the delay in getting back to you on this.
MVAPICH2-GDR itself does not require you to select the GPU; it is the CUDA-aware
model that requires it. Since you send/receive data from the GPU, the buffer has
already been allocated on a GPU, which means a GPU has already been selected.
The statement in the user guide means that MVAPICH2 cannot control the selection
of the GPU (that is done by the application); however, MVAPICH2 will select the
best HCA for the selected GPU. I hope this is clear.
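
For illustration, a minimal CUDA-aware send could look like the sketch below
(the helper name, buffer and arguments are placeholders of mine, not from your
application): the cudaSetDevice()/cudaMalloc() pair is what selects the GPU,
and MVAPICH2 only ever sees the resulting device pointer.

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Hypothetical helper: send `count` floats that live on GPU `dev` to rank `dest`. */
    static void send_from_gpu(int dev, int count, int dest, int tag)
    {
        float *d_buf = NULL;
        cudaSetDevice(dev);                                  /* the application picks the GPU */
        cudaMalloc((void **)&d_buf, count * sizeof(float));  /* buffer now lives on that GPU  */
        /* MVAPICH2 detects that d_buf is a device pointer, finds the GPU that
           owns it, and picks the closest HCA by itself. */
        MPI_Send(d_buf, count, MPI_FLOAT, dest, tag, MPI_COMM_WORLD);
        cudaFree(d_buf);
    }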

Now, regarding the issue: you are assigning 2 GPUs to the same process, which is
not allowed by MVAPICH2.
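
The supported pattern is one GPU per MPI process: launch one rank per GPU and
pin each rank to a single device before any CUDA allocation. A rough sketch
(it assumes the MV2_COMM_WORLD_LOCAL_RANK environment variable provided by the
MVAPICH2 launcher and one rank per GPU on the node):

    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* Call once at startup, before MPI_Init() and before any cudaMalloc(). */
    static void bind_rank_to_one_gpu(void)
    {
        const char *lrank = getenv("MV2_COMM_WORLD_LOCAL_RANK");
        int local_rank  = lrank ? atoi(lrank) : 0;
        int num_devices = 0;
        cudaGetDeviceCount(&num_devices);
        cudaSetDevice(local_rank % num_devices);  /* each rank sticks to exactly one GPU */
    }

With that mapping, each node runs two ranks (one per device) and the halo
exchange in your code goes between ranks instead of between two devices owned
by the same rank.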

Thanks

On Wed, Nov 11, 2015 at 5:11 AM, Prepelita Sebastian <
sebastian.prepelita at aalto.fi> wrote:

> Hi,
>
>
>
> I have read in the user guide (for MVAPICH2 2.1) that it is recommended to
> set the CUDA device before sending/receiving MPI data. The exact phrasing is
> a bit unclear to me and I don’t understand whether it is a mandatory step:
> “When multiple GPUs are present on a node, users might want to set the MPI
> process affinity to a particular GPU using cuda calls like cudaSetDevice().”
>
>
>
> However, if I do so I get an error:
>
>
>
> mpi_rank_1, task 1: Exited with exit code 253 (cudaEventRecord failed),
> inside MPIDI_CH3I_MRAILI_Rendezvous_rput_push_cuda() @
> src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_rndv.c:636
>
>
>
> The library version I am using is “mvapich 2.0-8-gdr”.
>
>
>
> Each MPI node has 2 CUDA devices and the code uses both devices. On 2 MPI
> nodes, the code runs OK. However, on 3 nodes I get the above error. Googling
> around, I found these two related discussions:
>
> http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2013-June/004471.html
>
> http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2014-April/004971.html
>
>
>
> For 3 nodes, the middle node (rank_1) first receives data to CUDA device 0
> and then sends data from CUDA device 1. It’s in the sending part that the
> program crashes when cudaSetDevice() is used. Here is a code overview with
> some of the values of interest:
>
> float MPI_switch_single(CudaMesh* d_mesh, unsigned int step,
>                         int MPI_rank,
>                         int MPI_rank_neigbor_down,
>                         int MPI_rank_neigbor_up){
>     clock_t start_t;
>     clock_t end_t;
>
>     start_t = clock();
>     // MPI message tagging:
>     const int MPI_HALOTAG = 1;
>     /// Prerequisite variables:
>     int MPI_halo_size = d_mesh->getHaloSize();
>     MPI_Status MPI_rec_status_from_DOWN, MPI_rec_status_from_UP; // Receive status info
>
>     if (MPI_rank_neigbor_down != -2){
>         cudaSetDevice(0);
>         MPI_Recv((float *)pointer_on_CUDA_device_0, MPI_halo_size, MPI_FLOAT,
>                  MPI_rank_neigbor_down, MPI_HALOTAG, MPI_COMM_WORLD,
>                  &MPI_rec_status_from_DOWN);
>     }
>     if (MPI_rank_neigbor_up != -2){
>         // Crash here for MPI RANK 1
>         cudaSetDevice(1);
>         MPI_Send((float *)pointer_on_CUDA_device_1, MPI_halo_size, MPI_FLOAT,
>                  MPI_rank_neigbor_up, MPI_HALOTAG, MPI_COMM_WORLD);
>     }
>     if (MPI_rank_neigbor_up != -2){
>         cudaSetDevice(1);
>         MPI_Recv((float *)pointer_on_CUDA_device_1, MPI_halo_size, MPI_FLOAT,
>                  MPI_rank_neigbor_up, MPI_HALOTAG, MPI_COMM_WORLD,
>                  &MPI_rec_status_from_UP);
>     }
>     if (MPI_rank_neigbor_down != -2){
>         cudaSetDevice(0);
>         MPI_Send((float *)pointer_on_CUDA_device_0 + MPI_halo_size, MPI_halo_size,
>                  MPI_FLOAT, MPI_rank_neigbor_down, MPI_HALOTAG, MPI_COMM_WORLD);
>     }
>
>     // Wait for ALL MPI data transfers to end:
>     MPI_Barrier(MPI_COMM_WORLD);
>
>     end_t = clock() - start_t;
>     return ((float)end_t / CLOCKS_PER_SEC);
> }
>
>
>
> Now, if I comment out the cudaSetDevice(…); lines, the code works and the
> results are correct. The pointers given to MPI_Send/MPI_Recv are allocated on
> different devices, as shown in the code above.
>
> Is it safe to remove the cudaSetDevice() calls? Should I expect buggy
> behavior or possible crashes?
>
>
>
> Thank you,
>
> Sebastian.
>
>
>


-- 
 K.H