[mvapich-discuss] GPU affinity and clusters with multi-GPU nodes

Jonathan Perkins perkinjo at cse.ohio-state.edu
Mon Apr 28 14:08:49 EDT 2014


Thanks for the note.  It's surprising that MV2_COMM_WORLD_LOCAL_RANK
is always being detected as 0.  Can you please share how you are
launching the jobs (which launcher are you using in particular)?
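
In the meantime, the pattern we usually suggest is to read the local
rank from the environment and select the CUDA device before MPI_Init(),
so that the library initializes its CUDA support on the right GPU.
A rough sketch (assuming two GPUs per node; adjust the modulo for other
layouts):

#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int local_rank = 0;
    int num_devices = 0;

    /* MV2_COMM_WORLD_LOCAL_RANK is exported by the MVAPICH2 launchers. */
    char *str = getenv("MV2_COMM_WORLD_LOCAL_RANK");
    if (str != NULL)
        local_rank = atoi(str);

    /* Bind this process to one of the node-local GPUs before MPI_Init(). */
    cudaGetDeviceCount(&num_devices);
    if (num_devices > 0)
        cudaSetDevice(local_rank % num_devices);

    MPI_Init(&argc, &argv);
    /* ... rest of the application ... */
    MPI_Finalize();
    return 0;
}

If the variable really is unset in your runs, that points at the
launcher, which is why I'm asking how the jobs are started.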

On Mon, Apr 28, 2014 at 9:07 AM, Rustico, Eugenio
<eugenio.rustico at baw.de> wrote:
> Hello,
>
> I work on a cluster of 2-GPU nodes featuring MVAPICH2-1.9. I have one thread
> for each device and arbitrary pairs of devices need to exchange data over
> the network. Device buffers pointers are passed directly.
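>
> For reference, the exchange looks roughly like this (names here are
> only illustrative), relying on the CUDA support in the library:
>
> #include <mpi.h>
> #include <cuda_runtime.h>
>
> /* d_send and d_recv are device pointers obtained from cudaMalloc(). */
> void exchange_with_peer(float *d_send, float *d_recv, int count, int peer)
> {
>     MPI_Request reqs[2];
>
>     MPI_Irecv(d_recv, count, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
>     MPI_Isend(d_send, count, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &reqs[1]);
>     MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
> }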
>
> If I run a 4-GPU simulation over 2 nodes, no error is encountered. The
> same holds for a single-GPU, multi-node simulation with up to 8 nodes.
> However, as soon as I run a multi-GPU simulation over 3 or more nodes
> (so 3 * 2, 4 * 2 GPUs and so on), it crashes with:
>
>   [MPIDI_CH3I_MRAILI_Process_cuda_finish]
> src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_rndv.c:865: cudaEventRecord
> failed
>
> I read that setting the CUDA device after MPI_Init() is supported only
> from version 2.0 on, and when I evaluate MV2_COMM_WORLD_LOCAL_RANK it
> is always 0. My guess is that the problem is wrong GPU affinity, i.e.
> MVAPICH tries to use the wrong GPU.
>
> Is there any way to use multiple GPUs per node with version 1.9, e.g.
> by setting an environment variable? Otherwise, I guess I will have to
> stage the transfers on the host, adding a cudaMemcpy() after each
> transfer.
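>
> That fallback would look roughly like this (names only illustrative):
>
> #include <mpi.h>
> #include <cuda_runtime.h>
>
> /* Stage through a host buffer instead of passing device pointers. */
> void send_via_host(const float *d_buf, float *h_buf, int count, int peer)
> {
>     cudaMemcpy(h_buf, d_buf, count * sizeof(float), cudaMemcpyDeviceToHost);
>     MPI_Send(h_buf, count, MPI_FLOAT, peer, 0, MPI_COMM_WORLD);
> }
>
> void recv_via_host(float *d_buf, float *h_buf, int count, int peer)
> {
>     MPI_Recv(h_buf, count, MPI_FLOAT, peer, 0, MPI_COMM_WORLD,
>              MPI_STATUS_IGNORE);
>     cudaMemcpy(d_buf, h_buf, count * sizeof(float), cudaMemcpyHostToDevice);
> }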
>
> Thanks,
> Eugenio Rustico
>



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo
