[mvapich-discuss] MVAPICH2.0-8-GDR - error if cudaSetDevice() on multi node/multi GPU
Prepelita Sebastian
sebastian.prepelita at aalto.fi
Wed Nov 11 05:11:02 EST 2015
Hi,
I have read in the user guide (for MVAPICH 2.1) that it is recommended to set the CUDA device before sending/receiving MPI data. The exact phrasing is a bit unclear to me, and I cannot tell whether this step is mandatory: "When multiple GPUs are present on a node, users might want to set the MPI process affinity to a particular GPU using cuda calls like cudaSetDevice()."
However, if I do so, I get an error:
mpi_rank_1, task 1: Exited with exit code 253 (cudaEventRecord failed), inside MPIDI_CH3I_MRAILI_Rendezvous_rput_push_cuda() @ src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_rndv.c:636
The library version I am using is "mvapich 2.0-8-gdr".
Each MPI node has 2 CUDA devices, and the code uses both devices. On 2 MPI nodes the code runs OK; however, on 3 nodes I get the above error. Googling around, I found these two related issues:
http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2013-June/004471.html
http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2014-April/004971.html
For 3 nodes, the middle node (rank 1) first receives data on CUDA device 0 and then sends data from CUDA device 1. It is in the sending part that the program crashes when cudaSetDevice() is used. Here is a code overview with some of the values of interest:
float MPI_switch_single(CudaMesh* d_mesh, unsigned int step,
                        int MPI_rank,
                        int MPI_rank_neigbor_down,
                        int MPI_rank_neigbor_up){
    clock_t start_t;
    clock_t end_t;
    start_t = clock();
    // MPI message tagging:
    const int MPI_HALOTAG = 1;
    // Prerequisite variables:
    int MPI_halo_size = d_mesh->getHaloSize();
    MPI_Status MPI_rec_status_from_DOWN, MPI_rec_status_from_UP; // Receive status info
    // A neighbor value of -2 means there is no neighbor in that direction:
    if (MPI_rank_neigbor_down != -2){
        cudaSetDevice(0);
        MPI_Recv((float *)pointer_on_CUDA_device_0, MPI_halo_size, MPI_FLOAT,
                 MPI_rank_neigbor_down, MPI_HALOTAG, MPI_COMM_WORLD,
                 &MPI_rec_status_from_DOWN);
    }
    if (MPI_rank_neigbor_up != -2){
        // Crash here for MPI rank 1:
        cudaSetDevice(1);
        MPI_Send((float *)pointer_on_CUDA_device_1, MPI_halo_size, MPI_FLOAT,
                 MPI_rank_neigbor_up, MPI_HALOTAG, MPI_COMM_WORLD);
    }
    if (MPI_rank_neigbor_up != -2){
        cudaSetDevice(1);
        MPI_Recv((float *)pointer_on_CUDA_device_1, MPI_halo_size, MPI_FLOAT,
                 MPI_rank_neigbor_up, MPI_HALOTAG, MPI_COMM_WORLD,
                 &MPI_rec_status_from_UP);
    }
    if (MPI_rank_neigbor_down != -2){
        cudaSetDevice(0);
        MPI_Send((float *)pointer_on_CUDA_device_0 + MPI_halo_size, MPI_halo_size, MPI_FLOAT,
                 MPI_rank_neigbor_down, MPI_HALOTAG, MPI_COMM_WORLD);
    }
    // Wait for ALL MPI data transfers to end:
    MPI_Barrier(MPI_COMM_WORLD);
    end_t = clock() - start_t;
    return ((float)end_t/CLOCKS_PER_SEC);
}
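In case it clarifies the pattern: the same exchange could be collapsed so that each matching send/receive pair becomes a single call, using MPI_Sendrecv for the disjoint device-0 buffers and MPI_Sendrecv_replace for the device-1 buffer that is both sent and then overwritten. This is only a sketch with the same placeholder pointer names as above; I have not verified it against the GDR build:

    // Treat a missing neighbor (-2 above) as MPI_PROC_NULL so the calls become no-ops:
    int down = (MPI_rank_neigbor_down != -2) ? MPI_rank_neigbor_down : MPI_PROC_NULL;
    int up   = (MPI_rank_neigbor_up   != -2) ? MPI_rank_neigbor_up   : MPI_PROC_NULL;

    // Device 0: send the upper halo down and receive the lower halo (disjoint buffers):
    cudaSetDevice(0);
    MPI_Sendrecv((float *)pointer_on_CUDA_device_0 + MPI_halo_size, MPI_halo_size, MPI_FLOAT,
                 down, MPI_HALOTAG,
                 (float *)pointer_on_CUDA_device_0, MPI_halo_size, MPI_FLOAT,
                 down, MPI_HALOTAG,
                 MPI_COMM_WORLD, &MPI_rec_status_from_DOWN);

    // Device 1: the same buffer is sent and then overwritten, so use the in-place
    // variant (which may stage through a temporary buffer internally):
    cudaSetDevice(1);
    MPI_Sendrecv_replace((float *)pointer_on_CUDA_device_1, MPI_halo_size, MPI_FLOAT,
                         up, MPI_HALOTAG, up, MPI_HALOTAG,
                         MPI_COMM_WORLD, &MPI_rec_status_from_UP);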
Now, if I comment out the cudaSetDevice(...) calls, the code works and the results are correct. The pointers passed to MPI_Send/MPI_Recv are allocated on different devices, as written in the code.
Is it safe to remove the cudaSetDevice() calls? Should I expect buggy behavior or possible crashes?
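My working assumption for why it works without the calls (please correct me if I am wrong): on 64-bit platforms CUDA uses unified virtual addressing, so the library can recover the owning device from the pointer itself with cudaPointerGetAttributes(), regardless of which device is current. A minimal illustration, using the same placeholder pointer as above:

    // With unified virtual addressing, the owning device is encoded in the pointer:
    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, pointer_on_CUDA_device_1) == cudaSuccess) {
        printf("buffer resides on device %d\n", attr.device);
    }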
Thank you,
Sebastian.