[mvapich-discuss] MVAPICH2-GDR 2.2 and MPI_Put in combination with cudaMallocManaged on GPU cluster

Yussuf Ali Yussuf.ali at jaea.go.jp
Tue Dec 19 03:34:44 EST 2017


Dear Ammar,

Thank you for your answer! We will try 2.3a.

Best Regards,
Yussuf 


From: Ammar Ahmad Awan
Sent: Tuesday, December 19, 2017 12:08 AM
To: Yussuf Ali
Cc: Subramoni, Hari; mvapich-discuss at cse.ohio-state.edu; Awan, Ammar Ahmad
Subject: [SECURITY WARNING: FREE E-MAIL] Re: [mvapich-discuss] MVAPICH2-GDR 2.2 and MPI_Put in combination with cudaMallocManaged on GPU cluster

Dear Yussuf,

MVAPICH2-GDR 2.2 supports efficient communication over NVLink for all basic point-to-point and collective operations. 

We have added new optimizations in the MVAPICH2-GDR 2.3a release that will provide much better performance for large messages. 

If possible, we highly recommend upgrading to 2.3a. 
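
Either way, the NVLink path in 2.2/2.3a is exercised by ordinary CUDA-aware MPI calls on device buffers. As a minimal collective sketch (the buffer size, data type, and rank-to-GPU mapping below are only illustrative, not a tuned benchmark):

__________________________________________________
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(rank % 4);               // one GPU per local rank (illustrative mapping)

    const int N = 1 << 20;                 // 1M floats, illustrative message size
    float *sendbuf, *recvbuf;
    cudaMalloc((void**)&sendbuf, N * sizeof(float));
    cudaMalloc((void**)&recvbuf, N * sizeof(float));
    cudaMemset(sendbuf, 0, N * sizeof(float));

    // With MV2_USE_CUDA=1 the device pointers are passed directly to MPI;
    // intra-node GPU-to-GPU traffic can then go over CUDA IPC / NVLink.
    MPI_Allreduce(sendbuf, recvbuf, N, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    cudaFree(sendbuf);
    cudaFree(recvbuf);
    MPI_Finalize();
    return 0;
}
__________________________________________________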

Thanks,
Ammar


On Mon, Dec 18, 2017 at 2:02 AM, Yussuf Ali <Yussuf.ali at jaea.go.jp> wrote:
Dear Hari,
 
Thank you for your answer. 
 
I have another question regarding NVLink. We use MVAPICH2-GDR 2.2; does this version use NVLink for intra-node GPU-to-GPU communication?
 
For example, if I use MPI_Isend to send data from one GPU to another GPU on the same node, is NVLink used in that case? 
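
Concretely, the pattern I have in mind is a plain device-to-device exchange along the lines of this sketch (the buffer size, tag, and rank-to-GPU mapping are only illustrative):

__________________________________________________
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(rank % 4);                 // one GPU per local rank

    const int N = 1 << 20;                   // illustrative message size
    char *sendbuf, *recvbuf;
    cudaMalloc((void**)&sendbuf, N);
    cudaMalloc((void**)&recvbuf, N);
    cudaMemset(sendbuf, rank, N);

    const int peer = rank ^ 1;               // exchange with the neighboring rank
    MPI_Request reqs[2];
    // Device pointers are passed directly to MPI (MV2_USE_CUDA=1); the question
    // is whether this intra-node path is routed over NVLink in GDR 2.2.
    MPI_Irecv(recvbuf, N, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    cudaFree(sendbuf);
    cudaFree(recvbuf);
    MPI_Finalize();
    return 0;
}
__________________________________________________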
 
Best Regards,
Yussuf 
 
 
From: Subramoni, Hari
Sent: Saturday, December 16, 2017 8:36 AM
To: Yussuf Ali; mvapich-discuss at cse.ohio-state.edu
Cc: Subramoni, Hari
Subject: RE: [mvapich-discuss] MVAPICH2-GDR 2.2 and MPI_Put in combination with cudaMallocManaged on GPU cluster
Hi, Yussuf.
 
MVAPICH2-GDR 2.3a only supports high-performance communication from managed memory for basic point-to-point and collective operations. Advanced managed-memory support for RMA is on our roadmap and will be available in a future release.
 
Regards,
Hari.
 
From: mvapich-discuss-bounces at cse.ohio-state.edu On Behalf Of Yussuf Ali
Sent: Thursday, December 14, 2017 9:18 PM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] MVAPICH2-GDR 2.2 and MPI_Put in combination with cudaMallocManaged on GPU cluster
 
Dear MVAPICH2 developers and users,
 
I’m trying to get a very simple MPI program running on a GPU cluster with 4 NVIDIA Tesla P100-SXM2-16GB GPUs.
 
In this example program each GPU exchanges 1 byte with its neighbor (intra-node communication) using unified memory.
The program works fine if I use cudaMalloc instead of cudaMallocManaged and explicitly manage the host and device communication (a simplified sketch of that variant is included after the program below). 
 
Is there something wrong with this program or is the system misconfigured? 
 
The program crashes with the following error:
_____________________________________________________________________________________________________________________________________
[jtesla1:mpi_rank_1][MPIDI_CH3I_CUDA_IPC_win_create] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_1sc.c:1338: rdma_iba_1sc: cuIpcGetMemHandle failed 
[jtesla1:mpi_rank_0][MPIDI_CH3I_CUDA_IPC_win_create] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_1sc.c:1338: rdma_iba_1sc: cuIpcGetMemHandle failed 
[jtesla1:mpi_rank_2][MPIDI_CH3I_CUDA_IPC_win_create] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_1sc.c:1338: rdma_iba_1sc: cuIpcGetMemHandle failed 
[jtesla1:mpi_rank_3][MPIDI_CH3I_CUDA_IPC_win_create] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_1sc.c:1338: rdma_iba_1sc: cuIpcGetMemHandle failed 
______________________________________________________________________________________________________________________________________
 
This is the program:
_____________________________________________________________________________________________________________________________________
#include <mpi.h>
#include <iostream>
#include <cuda_runtime.h>  // CUDA runtime API (cudaSetDevice, cudaMallocManaged, cudaMemset)
 
int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int  ncpu, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &ncpu);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(rank % 4);
 
    char* f0;
    char* g0;
    const int NN = 1;
    cudaMallocManaged((void**)&f0, sizeof(char) * NN);
    cudaMallocManaged((void**)&g0, sizeof(char) * NN);
 
    char c = rank + 65;                  // 'A' + rank, a distinct letter per rank
    cudaMemset(f0, c, sizeof(char)*NN);  // initialize the managed send buffer on the device
 
    // RMA window backed by managed memory; MPI_Win_create is where the
    // cuIpcGetMemHandle error above is raised.
    MPI_Win  win_host;
    MPI_Win_create(g0, sizeof(char)*NN, sizeof(char), MPI_INFO_NULL, MPI_COMM_WORLD, &win_host);
    MPI_Win_fence(0, win_host);
    const int rank_dst = (rank ^ 1);  // exchange with the neighboring rank on the same node
    MPI_Put(f0, NN, MPI_CHAR, rank_dst, 0, NN, MPI_CHAR, win_host);
    MPI_Win_fence(0, win_host);
    MPI_Win_free(&win_host);
 
    std::cout << "I'm rank: " << rank << " and my data is: " << g0[0] << "\n";
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
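_____________________________________________________________________________________________________________________________________

For comparison, the working cudaMalloc variant looks roughly like the following simplified sketch, which stages the data through host memory around the RMA epoch (this is not our exact code; buffer names mirror the program above):

__________________________________________________
#include <mpi.h>
#include <iostream>
#include <cuda_runtime.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int ncpu, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &ncpu);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(rank % 4);

    const int NN = 1;
    char *f0_d, *g0_d;                       // device buffers
    cudaMalloc((void**)&f0_d, NN);
    cudaMalloc((void**)&g0_d, NN);
    cudaMemset(f0_d, rank + 65, NN);

    char* f0_h = new char[NN];               // host staging buffers
    char* g0_h = new char[NN]();
    cudaMemcpy(f0_h, f0_d, NN, cudaMemcpyDeviceToHost);

    // The RMA window is backed by ordinary host memory, so MPI_Win_create
    // does not need a CUDA IPC handle for the window buffer.
    MPI_Win win_host;
    MPI_Win_create(g0_h, NN, sizeof(char), MPI_INFO_NULL, MPI_COMM_WORLD, &win_host);
    MPI_Win_fence(0, win_host);
    MPI_Put(f0_h, NN, MPI_CHAR, rank ^ 1, 0, NN, MPI_CHAR, win_host);
    MPI_Win_fence(0, win_host);
    MPI_Win_free(&win_host);

    cudaMemcpy(g0_d, g0_h, NN, cudaMemcpyHostToDevice);  // stage the result back to the GPU
    std::cout << "I'm rank: " << rank << " and my data is: " << g0_h[0] << "\n";

    delete[] f0_h;
    delete[] g0_h;
    cudaFree(f0_d);
    cudaFree(g0_d);
    MPI_Finalize();
    return 0;
}
__________________________________________________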
 
Inside the PBS script I set the following environment variables:
_________________________________________________
export LD_LIBRARY_PATH=/home/app/mvapich2/2.2-gdr-cuda8.0/gnu/lib64:$LD_LIBRARY_PATH
export MV2_PATH=/home/app/mvapich2/2.2-gdr-cuda8.0/gnu
export MV2_GPUDIRECT_GDRCOPY_LIB=/home/app/mvapich2/2.2-gdr-cuda8.0/gnu/gdrcopy-master/libgdrapi.so
export MV2_USE_CUDA=1
export MV2_USE_GPUDIRECT=1
export MV2_GPUDIRECT_GDRCOPY=1
export MV2_USE_GPUDIRECT_GDRCOPY=1
___________________________________________________
 
This is the output of “mpiname -a”:
_____________________________________________
MVAPICH2-GDR 2.2 Tue Oct 25 22:00:00 EST 2016 ch3:mrail
 
Compilation
CC: gcc -I/usr/local/cuda-8.0/include   -DNDEBUG -DNVALGRIND -O2
CXX: g++ -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic  -DNDEBUG -DNVALGRIND -O2
F77: gfortran -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic -I/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/lib64/gfortran/modules  -O2
FC: gfortran -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic -I/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/lib64/gfortran/modules  -O2
 
Configuration
--build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5 --exec-prefix=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5 --bindir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/bin --sbindir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/sbin --sysconfdir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/etc --datadir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/share --includedir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/include --libdir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/lib64 --libexecdir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/share/man --infodir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/share/info --disable-rpath --disable-static --enable-shared --disable-rdma-cm --without-hydra-ckpointlib --with-pbs=/opt/pbs --with-pm=hydra --disable-mcast --with-core-direct --enable-cuda CPPFLAGS=-I/usr/local/cuda-8.0/include CFLAGS=-I/usr/local/cuda-8.0/include LDFLAGS=-L/usr/local/lib -lcuda -L/usr/local/cuda-8.0/lib64 -lcudart -lrt -lstdc++ -Wl,-rpath,/usr/local/cuda-8.0/lib64 -Wl,-rpath,XORIGIN/placeholder -Wl,--build-id CC=gcc CXX=g++ F77=gfortran FC=gfortran
________________________________________________
 
Thank you for your help! 
 

_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

