[mvapich-discuss] MVAPICH2.0-8-GDR - error if cudaSetDevice() on multi node/multi GPU

Prepelita Sebastian sebastian.prepelita at aalto.fi
Wed Nov 11 05:11:02 EST 2015


Hi,

I have read in the user guide (for MVAPICH 2.1) that it is recommended to set the CUDA device before sending/receiving MPI data. The exact phrasing is a bit unclear to me, and I cannot tell whether it is a mandatory step: "When multiple GPUs are present on a node, users might want to set the MPI process affinity to a particular GPU using cuda calls like cudaSetDevice()."
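
If I understand that sentence correctly, for the simple one-GPU-per-process case the recommendation would look roughly like the sketch below. This is only my own reading of the guide (not code from the guide, and the rank-to-device mapping is just an assumption); my actual setup differs in that each rank drives both GPUs:

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, num_devices = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&num_devices);

    // Bind this process to one GPU before any device allocations or MPI transfers.
    // (Mapping by global rank only works if ranks are placed on nodes consecutively.)
    cudaSetDevice(rank % num_devices);

    // ... cudaMalloc device buffers here, then MPI_Send/MPI_Recv on them ...

    MPI_Finalize();
    return 0;
}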

However, if I set the device with cudaSetDevice() in my code, I get the following error:

mpi_rank_1, task 1: Exited with exit code 253 (cudaEventRecord failed), inside MPIDI_CH3I_MRAILI_Rendezvous_rput_push_cuda() @ src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_rndv.c:636

The library version I am using is "mvapich 2.0-8-gdr".

Each MPI node has 2 CUDA devices, and the code uses both of them. With 2 MPI nodes the code runs fine; with 3 nodes, however, I get the error above. Googling around, I found these two related issues:
http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2013-June/004471.html
http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2014-April/004971.html

With 3 nodes, the middle node (rank_1) first receives data into CUDA device 0 and then sends data from CUDA device 1. It is in the sending part that the program crashes when cudaSetDevice() is used. Here is an overview of the code with some of the values of interest:
float MPI_switch_single(CudaMesh* d_mesh, unsigned int step,
                        int MPI_rank,
                        int MPI_rank_neigbor_down,
                        int MPI_rank_neigbor_up) {
    clock_t start_t;
    clock_t end_t;

    start_t = clock();
    // MPI message tagging:
    const int MPI_HALOTAG = 1;
    /// Prerequisite variables:
    int MPI_halo_size = d_mesh->getHaloSize();
    MPI_Status MPI_rec_status_from_DOWN, MPI_rec_status_from_UP; // Receive status info

    // Receive the halo from the neighbor below into the buffer on device 0:
    if (MPI_rank_neigbor_down != -2) {
        cudaSetDevice(0);
        MPI_Recv((float *)pointer_on_CUDA_device_0, MPI_halo_size, MPI_FLOAT,
                 MPI_rank_neigbor_down, MPI_HALOTAG, MPI_COMM_WORLD, &MPI_rec_status_from_DOWN);
    }
    // Send the halo to the neighbor above from the buffer on device 1:
    if (MPI_rank_neigbor_up != -2) {
        // Crash here for MPI RANK 1
        cudaSetDevice(1);
        MPI_Send((float *)pointer_on_CUDA_device_1, MPI_halo_size, MPI_FLOAT,
                 MPI_rank_neigbor_up, MPI_HALOTAG, MPI_COMM_WORLD);
    }
    // Receive the halo from the neighbor above into the buffer on device 1:
    if (MPI_rank_neigbor_up != -2) {
        cudaSetDevice(1);
        MPI_Recv((float *)pointer_on_CUDA_device_1, MPI_halo_size, MPI_FLOAT,
                 MPI_rank_neigbor_up, MPI_HALOTAG, MPI_COMM_WORLD, &MPI_rec_status_from_UP);
    }
    // Send the halo to the neighbor below from the buffer on device 0:
    if (MPI_rank_neigbor_down != -2) {
        cudaSetDevice(0);
        MPI_Send((float *)pointer_on_CUDA_device_0 + MPI_halo_size, MPI_halo_size, MPI_FLOAT,
                 MPI_rank_neigbor_down, MPI_HALOTAG, MPI_COMM_WORLD);
    }
    // Wait for ALL MPI data transfers to end:
    MPI_Barrier(MPI_COMM_WORLD);
    end_t = clock() - start_t;
    return ((float)end_t / CLOCKS_PER_SEC);
}


Now, if I comment out the "cudaSetDevice(...);" lines, the code works and the results are correct. The pointers passed to MPI_Send/MPI_Recv are allocated on different devices, as indicated in the code above.
Is it safe to remove the cudaSetDevice() calls? Should I expect buggy behavior or possible crashes?
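
For completeness, the two buffers are allocated on their respective devices roughly like this (an illustrative sketch, not my exact allocation code; the buffer sizes are assumptions, with the device-0 buffer holding two halos because the send above uses an offset of MPI_halo_size):

float *pointer_on_CUDA_device_0 = NULL; // halos exchanged with the DOWN neighbor
float *pointer_on_CUDA_device_1 = NULL; // halos exchanged with the UP neighbor

// Allocate one buffer on each GPU; the device a pointer belongs to is fixed at allocation time.
cudaSetDevice(0);
cudaMalloc((void **)&pointer_on_CUDA_device_0, 2 * MPI_halo_size * sizeof(float));

cudaSetDevice(1);
cudaMalloc((void **)&pointer_on_CUDA_device_1, MPI_halo_size * sizeof(float));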

Thank you,
Sebastian.
