[mvapich-discuss] MVAPICH2-GDR 2.2 and MPI_Put in combination with cudaMallocManaged on GPU cluster

Ammar Ahmad Awan ammar.ahmad.awan at gmail.com
Mon Dec 18 10:08:09 EST 2017


Dear Yussuf,

MVAPICH2-GDR 2.2 supports efficient communication over NVLink for all basic
point-to-point and collective operations.
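
For reference, a minimal sketch (not an official example) of the kind of
CUDA-aware point-to-point exchange this covers: device buffers allocated with
cudaMalloc are handed directly to MPI and the job is run with MV2_USE_CUDA=1.
The buffer names, the 1 MiB message size, and the assumption of 4 GPUs per
node are purely illustrative, and error checking is omitted for brevity.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(rank % 4);                 // assume one rank per GPU

    const int N = 1 << 20;                   // illustrative message size
    char *sendbuf, *recvbuf;
    cudaMalloc((void**)&sendbuf, N);
    cudaMalloc((void**)&recvbuf, N);
    cudaMemset(sendbuf, rank, N);

    // Device pointers are passed straight to MPI; for intra-node peers the
    // library takes the GPU-to-GPU path (CUDA IPC / NVLink where available).
    const int peer = rank ^ 1;
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, N, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    cudaFree(sendbuf);
    cudaFree(recvbuf);
    MPI_Finalize();
    return 0;
}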

We have added new optimizations in the MVAPICH2-GDR 2.3a release that will
provide much better performance for large messages.

If possible, we highly recommend upgrading to 2.3a.

Thanks,
Ammar


On Mon, Dec 18, 2017 at 2:02 AM, Yussuf Ali <Yussuf.ali at jaea.go.jp> wrote:

> Dear Hari,
>
>
>
> Thank you for your answer.
>
>
>
> I have another question regarding NVLink. We use MVAPICH2-GDR 2.2; does
> this version use NVLink for intra-node GPU-to-GPU communication?
>
>
>
> For example, if I use MPI_Isend to send data from one GPU to another
> GPU, is NVLink used in that case?
>
>
>
> Best Regards,
>
> Yussuf
>
>
>
>
>
> *From: *Subramoni, Hari <subramoni.1 at osu.edu>
> *Sent: *Saturday, December 16, 2017 8:36 AM
> *To: *Yussuf Ali <Yussuf.ali at jaea.go.jp>; mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
> *Cc: *Subramoni, Hari <subramoni.1 at osu.edu>
> *Subject: *RE: [mvapich-discuss] MVAPICH2-GDR 2.2 and MPI_Put in
> combination with cudaMallocManaged on GPU cluster
>
> Hi, Yussuf.
>
>
>
> MVAPICH2-GDR 2.3a supports high-performance communication from managed
> memory only for basic point-to-point and collective operations. Advanced
> managed-memory support for RMA is on our roadmap and will be available in
> future releases.
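>
> As an interim workaround, you could keep the pattern you already report
> working: back the window with device memory from cudaMalloc and copy the
> result back explicitly. A minimal sketch of that variant, reusing the
> names from your program below (f0, g0, NN, win_host), is:
>
>     // Back the RMA window with device memory from cudaMalloc instead of
>     // cudaMallocManaged. Illustrative fragment only: MPI_Init, rank, and
>     // cudaSetDevice are assumed to be set up as in the full program.
>     char *f0, *g0;
>     const int NN = 1;
>     cudaMalloc((void**)&f0, NN);
>     cudaMalloc((void**)&g0, NN);
>     cudaMemset(f0, rank + 65, NN);
>
>     MPI_Win win_host;
>     MPI_Win_create(g0, NN, sizeof(char), MPI_INFO_NULL, MPI_COMM_WORLD, &win_host);
>     MPI_Win_fence(0, win_host);
>     MPI_Put(f0, NN, MPI_CHAR, rank ^ 1, 0, NN, MPI_CHAR, win_host);
>     MPI_Win_fence(0, win_host);
>     MPI_Win_free(&win_host);
>
>     // g0 is now plain device memory, so copy the result back before printing.
>     char result;
>     cudaMemcpy(&result, g0, sizeof(char), cudaMemcpyDeviceToHost);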
>
>
>
> Regards,
>
> Hari.
>
>
>
> *From:* mvapich-discuss-bounces at cse.ohio-state.edu *On Behalf Of *Yussuf Ali
> *Sent:* Thursday, December 14, 2017 9:18 PM
> *To:* mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
> *Subject:* [mvapich-discuss] MVAPICH2-GDR 2.2 and MPI_Put in combination
> with cudaMallocManaged on GPU cluster
>
>
>
> Dear MVAPICH2 developers and users,
>
>
>
> I’m trying to get a very simple MPI program running on a GPU cluster with
> 4 NVIDIA Tesla P100-SXM2-16GB GPUs.
>
>
>
> In this example program, each GPU exchanges 1 byte with its neighbor
> (intra-node communication) using unified memory.
>
> The program works fine if I use cudaMalloc instead of cudaMallocManaged
> and explicitly manage the transfers between host and device.
>
>
>
> Is there something wrong with this program, or is the system misconfigured?
>
>
>
> The program crashes with the following error:
>
> ____________________________________________________________
>
> [jtesla1:mpi_rank_1][MPIDI_CH3I_CUDA_IPC_win_create] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_1sc.c:1338: rdma_iba_1sc: cuIpcGetMemHandle failed
> [jtesla1:mpi_rank_0][MPIDI_CH3I_CUDA_IPC_win_create] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_1sc.c:1338: rdma_iba_1sc: cuIpcGetMemHandle failed
> [jtesla1:mpi_rank_2][MPIDI_CH3I_CUDA_IPC_win_create] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_1sc.c:1338: rdma_iba_1sc: cuIpcGetMemHandle failed
> [jtesla1:mpi_rank_3][MPIDI_CH3I_CUDA_IPC_win_create] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_1sc.c:1338: rdma_iba_1sc: cuIpcGetMemHandle failed
>
> ____________________________________________________________
>
>
>
> This is the program:
>
> ____________________________________________________________
>
> #include <mpi.h>
> #include <iostream>
> #include <cuda.h>
> #include <cuda_runtime.h>  // declares the runtime API calls used below
>
> int main(int argc, char* argv[])
> {
>     MPI_Init(&argc, &argv);
>     int ncpu, rank;
>     MPI_Comm_size(MPI_COMM_WORLD, &ncpu);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     cudaSetDevice(rank % 4);  // one rank per GPU on the 4-GPU node
>
>     char* f0;
>     char* g0;
>     const int NN = 1;
>     cudaMallocManaged((void**)&f0, sizeof(char) * NN);
>     cudaMallocManaged((void**)&g0, sizeof(char) * NN);
>
>     char c = rank + 65;  // 'A' + rank
>     cudaMemset(f0, c, sizeof(char) * NN);
>
>     MPI_Win win_host;
>     MPI_Win_create(g0, sizeof(char) * NN, sizeof(char), MPI_INFO_NULL, MPI_COMM_WORLD, &win_host);
>     MPI_Win_fence(0, win_host);
>     const int rank_dst = (rank ^ 1);  // exchange with neighboring rank
>     MPI_Put(f0, NN, MPI_CHAR, rank_dst, 0, NN, MPI_CHAR, win_host);
>     MPI_Win_fence(0, win_host);
>     MPI_Win_free(&win_host);
>
>     std::cout << "I'm rank: " << rank << " and my data is: " << g0[0] << "\n";
>     MPI_Barrier(MPI_COMM_WORLD);
>     MPI_Finalize();
>     return 0;
> }
> ____________________________________________________________
>
>
>
> Inside the PBS script I set the following environment variables:
>
> _________________________________________________
>
> export LD_LIBRARY_PATH=/home/app/mvapich2/2.2-gdr-cuda8.0/gnu/lib64:$LD_LIBRARY_PATH
> export MV2_PATH=/home/app/mvapich2/2.2-gdr-cuda8.0/gnu
> export MV2_GPUDIRECT_GDRCOPY_LIB=/home/app/mvapich2/2.2-gdr-cuda8.0/gnu/gdrcopy-master/libgdrapi.so
> export MV2_USE_CUDA=1
> export MV2_USE_GPUDIRECT=1
> export MV2_GPUDIRECT_GDRCOPY=1
> export MV2_USE_GPUDIRECT_GDRCOPY=1
>
> _________________________________________________
>
>
>
> This is the output of “mpiname -a”:
>
> _____________________________________________
>
> MVAPICH2-GDR 2.2 Tue Oct 25 22:00:00 EST 2016 ch3:mrail
>
>
>
> Compilation
>
> CC: gcc -I/usr/local/cuda-8.0/include   -DNDEBUG -DNVALGRIND -O2
>
> CXX: g++ -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
> -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches
> -m64 -mtune=generic  -DNDEBUG -DNVALGRIND -O2
>
> F77: gfortran -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
> -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches
> -m64 -mtune=generic -I/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/lib64/gfortran/modules
> -O2
>
> FC: gfortran -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
> -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches
> -m64 -mtune=generic -I/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/lib64/gfortran/modules
> -O2
>
>
>
> Configuration
>
> --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu
> --program-prefix= --disable-dependency-tracking
> --prefix=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5
> --exec-prefix=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5
> --bindir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/bin
> --sbindir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/sbin
> --sysconfdir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/etc
> --datadir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/share
> --includedir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/include
> --libdir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/lib64
> --libexecdir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/libexec
> --localstatedir=/var --sharedstatedir=/var/lib
> --mandir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/share/man
> --infodir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/share/info
> --disable-rpath --disable-static --enable-shared --disable-rdma-cm
> --without-hydra-ckpointlib --with-pbs=/opt/pbs --with-pm=hydra
> --disable-mcast --with-core-direct --enable-cuda
> CPPFLAGS=-I/usr/local/cuda-8.0/include CFLAGS=-I/usr/local/cuda-8.0/include
> LDFLAGS=-L/usr/local/lib -lcuda -L/usr/local/cuda-8.0/lib64 -lcudart -lrt
> -lstdc++ -Wl,-rpath,/usr/local/cuda-8.0/lib64
> -Wl,-rpath,XORIGIN/placeholder -Wl,--build-id CC=gcc CXX=g++ F77=gfortran
> FC=gfortran
>
> ________________________________________________
>
>
>
> Thank you for your help!
>
>
>

