[mvapich-discuss] MVAPICH2-GDR 2.2 and MPI_Put in combination with cudaMallocManaged on GPU cluster

Subramoni, Hari subramoni.1 at osu.edu
Fri Dec 15 18:35:36 EST 2017


Hi, Yussuf.

MVAPICH2-GDR 2.3a only supports high-performance communication from managed memory for basic point-to-point and collective operations. Advanced managed memory support for RMA is on our roadmap and will be available in a future release.
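
In the meantime, the point-to-point path does work from managed buffers, so one workaround is to replace the MPI_Put/MPI_Win_fence pair with a pairwise MPI_Sendrecv. Below is a minimal, untested sketch of your example rewritten that way (same one-rank-per-GPU mapping assumed):
_____________________________________________________________________________________________________________________________________
#include <mpi.h>
#include <iostream>
#include <cuda_runtime.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int ncpu, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &ncpu);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(rank % 4);             // one GPU per MPI rank

    const int NN = 1;
    char* f0;                            // managed send buffer
    char* g0;                            // managed receive buffer
    cudaMallocManaged((void**)&f0, sizeof(char) * NN);
    cudaMallocManaged((void**)&g0, sizeof(char) * NN);

    cudaMemset(f0, 'A' + rank, sizeof(char) * NN);
    cudaDeviceSynchronize();             // ensure the memset is complete before MPI reads f0

    // Pairwise exchange with the neighbor rank using point-to-point
    // communication instead of MPI_Put on a window
    const int rank_dst = rank ^ 1;
    MPI_Sendrecv(f0, NN, MPI_CHAR, rank_dst, 0,
                 g0, NN, MPI_CHAR, rank_dst, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    std::cout << "I'm rank: " << rank << " and my data is: " << g0[0] << "\n";

    cudaFree(f0);
    cudaFree(g0);
    MPI_Finalize();
    return 0;
}
_____________________________________________________________________________________________________________________________________
This is also consistent with the cuIpcGetMemHandle failure you see at window creation: CUDA IPC handles can be obtained for regular cudaMalloc allocations but not for managed ones, which is why the cudaMalloc version of your window works.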

Regards,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu On Behalf Of Yussuf Ali
Sent: Thursday, December 14, 2017 9:18 PM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] MVAPICH2-GDR 2.2 and MPI_Put in combination with cudaMallocManaged on GPU cluster

Dear MVAPICH2 developers and users,

I’m trying to get a very simple MPI program running on a GPU cluster with 4 NVIDIA Tesla P100-SXM2-16GB GPUs.

In this example program, each GPU exchanges 1 byte with its neighbor (intra-node communication) using unified memory.
The program works fine if I use cudaMalloc instead of cudaMallocManaged and explicitly manage the host and device communication (a sketch of that variant follows the program listing below).

Is there something wrong with this program or is the system misconfigured?

The program crashes with the following error:
_____________________________________________________________________________________________________________________________________
[jtesla1:mpi_rank_1][MPIDI_CH3I_CUDA_IPC_win_create] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_1sc.c:1338: rdma_iba_1sc: cuIpcGetMemHandle failed
[jtesla1:mpi_rank_0][MPIDI_CH3I_CUDA_IPC_win_create] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_1sc.c:1338: rdma_iba_1sc: cuIpcGetMemHandle failed
[jtesla1:mpi_rank_2][MPIDI_CH3I_CUDA_IPC_win_create] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_1sc.c:1338: rdma_iba_1sc: cuIpcGetMemHandle failed
[jtesla1:mpi_rank_3][MPIDI_CH3I_CUDA_IPC_win_create] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_1sc.c:1338: rdma_iba_1sc: cuIpcGetMemHandle failed
______________________________________________________________________________________________________________________________________

This is the program:
_____________________________________________________________________________________________________________________________________
#include <mpi.h>
#include <iostream>
#include <cuda_runtime.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int  ncpu, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &ncpu);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(rank % 4);  // one GPU per MPI rank on the 4-GPU node

    char* f0;  // send buffer
    char* g0;  // receive buffer (window memory)
    const int NN = 1;
    cudaMallocManaged((void**)&f0, sizeof(char) * NN);
    cudaMallocManaged((void**)&g0, sizeof(char) * NN);

    char c = rank + 65;  // 'A' + rank
    cudaMemset(f0, c, sizeof(char) * NN);

    // Expose the managed receive buffer as an RMA window
    MPI_Win win_host;
    MPI_Win_create(g0, sizeof(char) * NN, sizeof(char), MPI_INFO_NULL, MPI_COMM_WORLD, &win_host);
    MPI_Win_fence(0, win_host);
    const int rank_dst = (rank ^ 1);  // exchange with the neighbor rank on the same node
    MPI_Put(f0, NN, MPI_CHAR, rank_dst, 0, NN, MPI_CHAR, win_host);
    MPI_Win_fence(0, win_host);
    MPI_Win_free(&win_host);

    std::cout << "I'm rank: " << rank << " and my data is: " << g0[0] << "\n";
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
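_____________________________________________________________________________________________________________________________________

For comparison, this is roughly the cudaMalloc variant with explicit host staging that I mentioned above (a sketch, not my exact working code):
_____________________________________________________________________________________________________________________________________
#include <mpi.h>
#include <iostream>
#include <cuda_runtime.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int ncpu, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &ncpu);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(rank % 4);

    const int NN = 1;
    char* f0_dev;                        // device send buffer
    char* g0_dev;                        // device receive buffer
    cudaMalloc((void**)&f0_dev, sizeof(char) * NN);
    cudaMalloc((void**)&g0_dev, sizeof(char) * NN);
    cudaMemset(f0_dev, 'A' + rank, sizeof(char) * NN);

    char f0_host[NN];                    // host staging buffers
    char g0_host[NN];
    cudaMemcpy(f0_host, f0_dev, sizeof(char) * NN, cudaMemcpyDeviceToHost);

    // The window lives in host memory, so no CUDA IPC handle is involved
    MPI_Win win_host;
    MPI_Win_create(g0_host, sizeof(char) * NN, sizeof(char), MPI_INFO_NULL, MPI_COMM_WORLD, &win_host);
    MPI_Win_fence(0, win_host);
    const int rank_dst = rank ^ 1;       // exchange with the neighbor rank on the same node
    MPI_Put(f0_host, NN, MPI_CHAR, rank_dst, 0, NN, MPI_CHAR, win_host);
    MPI_Win_fence(0, win_host);
    MPI_Win_free(&win_host);

    // Copy the received byte back to the device for further GPU work
    cudaMemcpy(g0_dev, g0_host, sizeof(char) * NN, cudaMemcpyHostToDevice);
    std::cout << "I'm rank: " << rank << " and my data is: " << g0_host[0] << "\n";

    cudaFree(f0_dev);
    cudaFree(g0_dev);
    MPI_Finalize();
    return 0;
}
_____________________________________________________________________________________________________________________________________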

Inside the PBS script I set the following environment variables:
_________________________________________________
export LD_LIBRARY_PATH=/home/app/mvapich2/2.2-gdr-cuda8.0/gnu/lib64:$LD_LIBRARY_PATH
export MV2_PATH=/home/app/mvapich2/2.2-gdr-cuda8.0/gnu
export MV2_GPUDIRECT_GDRCOPY_LIB=/home/app/mvapich2/2.2-gdr-cuda8.0/gnu/gdrcopy-master/libgdrapi.so
export MV2_USE_CUDA=1
export MV2_USE_GPUDIRECT=1
export MV2_GPUDIRECT_GDRCOPY=1
export MV2_USE_GPUDIRECT_GDRCOPY=1
___________________________________________________

This is the output of “mpiname -a”:
_____________________________________________
MVAPICH2-GDR 2.2 Tue Oct 25 22:00:00 EST 2016 ch3:mrail

Compilation
CC: gcc -I/usr/local/cuda-8.0/include   -DNDEBUG -DNVALGRIND -O2
CXX: g++ -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic  -DNDEBUG -DNVALGRIND -O2
F77: gfortran -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic -I/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/lib64/gfortran/modules  -O2
FC: gfortran -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic -I/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/lib64/gfortran/modules  -O2

Configuration
--build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5 --exec-prefix=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5 --bindir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/bin --sbindir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/sbin --sysconfdir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/etc --datadir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/share --includedir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/include --libdir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/lib64 --libexecdir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/share/man --infodir=/opt/mvapich2/gdr/no-mcast/2.2/cuda8.0/pbs/gnu4.8.5/share/info --disable-rpath --disable-static --enable-shared --disable-rdma-cm --without-hydra-ckpointlib --with-pbs=/opt/pbs --with-pm=hydra --disable-mcast --with-core-direct --enable-cuda CPPFLAGS=-I/usr/local/cuda-8.0/include CFLAGS=-I/usr/local/cuda-8.0/include LDFLAGS=-L/usr/local/lib -lcuda -L/usr/local/cuda-8.0/lib64 -lcudart -lrt -lstdc++ -Wl,-rpath,/usr/local/cuda-8.0/lib64 -Wl,-rpath,XORIGIN/placeholder -Wl,--build-id CC=gcc CXX=g++ F77=gfortran FC=gfortran
________________________________________________

Thank you for your help!