[mvapich-discuss] MVAPICH2.0-8-GDR - error if cudaSetDevice() on multi node/multi GPU

Jiri Kraus jkraus at nvidia.com
Sun Nov 15 22:13:09 EST 2015


Hi Sebastian,

You are changing the device used by a rank, but it is not possible to control multiple GPUs from a single MPI rank when you want to use the CUDA-aware features of MVAPICH2. You should start one MPI rank per GPU, so in your case two per node.
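As a rough, untested sketch of that setup (assuming the MVAPICH2 launcher exports MV2_COMM_WORLD_LOCAL_RANK, which it normally does), each rank would pick its GPU once at startup and never switch devices around the MPI calls:

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Select the GPU before MPI_Init so the CUDA-aware transport
     * initializes against the right device. MVAPICH2 normally exports
     * MV2_COMM_WORLD_LOCAL_RANK for exactly this purpose. */
    const char *lrank_env = getenv("MV2_COMM_WORLD_LOCAL_RANK");
    int local_rank = lrank_env ? atoi(lrank_env) : 0;

    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    int my_device = (num_devices > 0) ? (local_rank % num_devices) : 0;
    cudaSetDevice(my_device);

    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* From here on this rank owns exactly one GPU: all device buffers it
     * passes to MPI_Send/MPI_Recv live on that GPU, and cudaSetDevice()
     * is never called again around the MPI calls. */
    printf("rank %d is using GPU %d\n", rank, my_device);

    MPI_Finalize();
    return 0;
}

You would then launch two ranks per node (six ranks in your 3-node case), with MV2_USE_CUDA=1 set if your GDR build does not already enable it by default.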

Hope this helps

Jiri

Sent from my smartphone. Please excuse autocorrect typos.


---- Prepelita Sebastian wrote ----

Hi,

I have read in the user guide (for MVAPICH2 2.1) that it is recommended to set the CUDA device before sending/receiving MPI data. The exact phrasing is a bit unclear to me, and I am not sure whether it is a mandatory step: "When multiple GPUs are present on a node, users might want to set the MPI process affinity to a particular GPU using cuda calls like cudaSetDevice()."

However, if I do so I get an error:

mpi_rank_1, task 1: Exited with exit code 253 (cudaEventRecord failed), inside MPIDI_CH3I_MRAILI_Rendezvous_rput_push_cuda() @ src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_rndv.c:636

The library version I am using is "mvapich 2.0-8-gdr".

Each MPI node has 2 CUDA devices and the code uses both devices. On 2 MPI nodes the code runs OK; however, on 3 nodes I get the above error. Googling around, I found these two related issues:
http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2013-June/004471.html
http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2014-April/004971.html

For 3 nodes, the middle node (rank_1) first receives data to CUDA device 0 and then sends data from CUDA device 1. It is in the sending part that the program crashes when cudaSetDevice() is used. Here is a code overview with some of the values of interest:
float MPI_switch_single(CudaMesh* d_mesh, unsigned int step,
                        int MPI_rank,
                        int MPI_rank_neigbor_down,
                        int MPI_rank_neigbor_up) {
    clock_t start_t;
    clock_t end_t;

    start_t = clock();
    // MPI message tagging:
    const int MPI_HALOTAG = 1;
    /// Prerequisite variables:
    int MPI_halo_size = d_mesh->getHaloSize();
    MPI_Status MPI_rec_status_from_DOWN, MPI_rec_status_from_UP; // Receive status info

    if (MPI_rank_neigbor_down != -2) {
        cudaSetDevice(0);
        MPI_Recv((float *)pointer_on_CUDA_device_0, MPI_halo_size, MPI_FLOAT, MPI_rank_neigbor_down, MPI_HALOTAG, MPI_COMM_WORLD, &MPI_rec_status_from_DOWN);
    }
    if (MPI_rank_neigbor_up != -2) {
        // Crash here for MPI RANK 1
        cudaSetDevice(1);
        MPI_Send((float *)pointer_on_CUDA_device_1, MPI_halo_size, MPI_FLOAT, MPI_rank_neigbor_up, MPI_HALOTAG, MPI_COMM_WORLD);
    }
    if (MPI_rank_neigbor_up != -2) {
        cudaSetDevice(1);
        MPI_Recv((float *)pointer_on_CUDA_device_1, MPI_halo_size, MPI_FLOAT, MPI_rank_neigbor_up, MPI_HALOTAG, MPI_COMM_WORLD, &MPI_rec_status_from_UP);
    }
    if (MPI_rank_neigbor_down != -2) {
        cudaSetDevice(0);
        MPI_Send((float *)pointer_on_CUDA_device_0 + MPI_halo_size, MPI_halo_size, MPI_FLOAT, MPI_rank_neigbor_down, MPI_HALOTAG, MPI_COMM_WORLD);
    }
    // Wait for ALL MPI data transfers to end:
    MPI_Barrier(MPI_COMM_WORLD);
    end_t = clock() - start_t;
    return ((float)end_t / CLOCKS_PER_SEC);
}


Now, if I comment out the "cudaSetDevice(...);" lines, the code works and the results are correct. The pointers given to MPI_Send/MPI_Recv are allocated on different devices, as shown in the code above.
Is it safe to remove the cudaSetDevice code? Should I expect some buggy behavior or possible crashes?

Thank you,
Sebastian.


