[mvapich-discuss] GPU affinity and clusters with multi-GPU nodes

Rustico, Eugenio eugenio.rustico at baw.de
Mon Apr 28 09:07:25 EDT 2014


Hello,

I work on a cluster of 2-GPU nodes running MVAPICH2 1.9. I have one thread for
each device, and arbitrary pairs of devices need to exchange data over the
network. Device buffer pointers are passed directly to the MPI calls.

If I run a 4-GPU simulation over 2 nodes, no error is encountered. The same
holds if I run a single-GPU-per-node simulation on up to 8 nodes. However, as
soon as I run a multi-GPU simulation over 3 or more nodes (3 * 2, 4 * 2 GPUs
and so on), it crashes with:

  [MPIDI_CH3I_MRAILI_Process_cuda_finish] src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_rndv.c:865: cudaEventRecord failed

I read that setting the CUDA device after MPI_Init() is supported only from
version 2.0 on, and if I evaluate MV2_COMM_WORLD_LOCAL_RANK, it is always 0. My
guess is that the problem is a wrong GPU affinity, i.e. MVAPICH tries to use
the wrong GPU.
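
For reference, a minimal sketch of the usual local-rank-based device selection,
assuming one MPI process per GPU (which is not my setup: with one thread per
device in a single process per node, the local rank is always 0, as noted
above). The helper names device_from_local_rank() and select_device_index()
are hypothetical; only MV2_COMM_WORLD_LOCAL_RANK comes from MVAPICH2. The
returned index would be passed to cudaSetDevice() before MPI_Init().

```c
#include <stdlib.h>

/* Hypothetical helper: turn the string value of a per-node local rank
 * into a device index in [0, devices_per_node).
 * Falls back to device 0 if the value is missing or not a valid number. */
int device_from_local_rank(const char *local_rank_str, int devices_per_node)
{
    char *end = NULL;
    long rank;

    if (local_rank_str == NULL || devices_per_node <= 0)
        return 0;

    rank = strtol(local_rank_str, &end, 10);
    if (end == local_rank_str || rank < 0)
        return 0;

    /* Round-robin assignment of local ranks to the devices of the node. */
    return (int)(rank % devices_per_node);
}

/* Hypothetical helper: read the local rank that the MVAPICH2 launcher
 * exports in the environment and map it to a device index.
 * Assumption: the caller then does cudaSetDevice(index) before MPI_Init(). */
int select_device_index(int devices_per_node)
{
    return device_from_local_rank(getenv("MV2_COMM_WORLD_LOCAL_RANK"),
                                  devices_per_node);
}
```

On a 2-GPU node, select_device_index(2) would give 0 for local rank 0 and 1
for local rank 1. It does not help when a single process drives both devices
from separate threads.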

Is there any way to use multiple GPUs with version 1.9, e.g. by setting an
environment variable? Otherwise, I guess I will have to stage transfers on the
host and add a cudaMemcpy() after each transfer.

Thanks,
Eugenio Rustico