[mvapich-discuss] Memory transfer performance using CUDA-Aware RDMA (Infiniband) with MVAPICH2-GDR 2.0 (Released 08/23/2014)

Mohamad Amirul Abdullah amirul.abdullah at mimos.my
Sat Sep 6 02:30:50 EDT 2014


Hi MVAPICH,

I'm trying to test the performance of MVAPICH2-GDR 2.0 with RDMA. My current setup has 2 nodes (named NODE_1 and NODE_2), connected to each other with InfiniBand ConnectX-3. Each node has an NVIDIA K20c. I ran 3 experiments:

Experiment 1:
Transferring data from CPU RAM on NODE_1 to GPU RAM on NODE_2 (simulating GPUDirect from one machine's CPU RAM to another machine's GPU RAM).

Experiment 2:
Transferring data from GPU RAM on NODE_1 to GPU RAM on NODE_2 (simulating GPUDirect from a GPU on one machine to a GPU on another machine).

Experiment 3:
Transferring data from GPU RAM on NODE_1 to CPU RAM on NODE_1, then to CPU RAM on NODE_2, then to GPU RAM on NODE_2 (this simulates the traditional MPI transfer path, without GPUDirect).

I ran the experiments with an array of 50,000,000 ints (approximately 190 MiB). However, I found that the performance is as follows:

CPU_RAM_NODE_1 --> to --> GPU_RAM_NODE_2  : 150 ms
GPU_RAM_NODE_1 --> to --> GPU_RAM_NODE_2 : 300 ms
GPU_RAM_NODE_1 --> CPU_RAM_NODE_1 --> to --> CPU_RAM_NODE_2 --> GPU_RAM_NODE_2 : 226 ms
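
For reference, assuming 4-byte ints, the 50,000,000-element payload is about 200 MB, so the effective bandwidths work out to roughly:

200 MB / 0.150 s  ~ 1.33 GB/s   (CPU RAM -> remote GPU RAM)
200 MB / 0.300 s  ~ 0.67 GB/s   (GPU RAM -> remote GPU RAM)
200 MB / 0.226 s  ~ 0.88 GB/s   (staged through host memory on both sides)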

Besides that, I also found that if I send an array of more than 110,000,000 ints (approximately 420 MiB) with MPI_Send, the program crashes with a segmentation fault.

My questions are:

* Is the performance shown above expected? Is sending data directly from one GPU's RAM to another GPU's RAM always slower?

* What is the proper size for sending data? Is it faster to send smaller chunks repeatedly than to send one big chunk at a time? (A sketch of what I mean by chunking is right after this list.)

* Why does it crash when I send more than roughly 420 MiB? I remember reporting a similar issue before about sending a big chunk of data and it crashing. Should I be sending smaller chunks to get better performance, or is something else wrong?

* What is the most effective way, or are there any guidelines, to properly use GPUDirect RDMA?
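
To be concrete about the chunking question, this is the kind of thing I mean (a sketch only, reusing X_dev, SIZE and myrank from the program below; CHUNK is just a placeholder size I have not tuned):

const int CHUNK = 1048576;   /* 4 MiB worth of 4-byte ints; arbitrary placeholder */
if (myrank == 0) {
    /* send the buffer in CHUNK-sized pieces instead of one large MPI_Send */
    for (int off = 0; off < SIZE; off += CHUNK) {
        int count = (SIZE - off < CHUNK) ? (SIZE - off) : CHUNK;
        MPI_Send(X_dev + off, count, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }
} else {
    /* matching receives, piece by piece, into the device buffer */
    for (int off = 0; off < SIZE; off += CHUNK) {
        int count = (SIZE - off < CHUNK) ? (SIZE - off) : CHUNK;
        MPI_Recv(X_dev + off, count, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}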

Below is the sample code I used; you can reproduce the results with it. I also include the compile command and the mpirun command.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime_api.h>
#include <sys/time.h>

/* Convert a struct timeval to milliseconds. */
static unsigned long long to_ms(struct timeval tv)
{
    return (unsigned long long)tv.tv_sec * 1000 + tv.tv_usec / 1000;
}

int main(int argc, char **argv)
{
    int myrank;
    int *X, *X_dev;
    struct timeval tv1, tv2;
    int SIZE = 50000000;

    if (argc > 1)
        SIZE = atoi(argv[1]);

    /* MPI initialization */
    MPI_Init(&argc, &argv);
    printf("inited!");

    /* Find out my identity in the default communicator */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf("check rank :%i\n", myrank);

    /* Allocate one device buffer and one host buffer per rank */
    cudaMalloc((void **)&X_dev, sizeof(int) * SIZE);
    X = (int *)malloc(sizeof(int) * SIZE);

    MPI_Barrier(MPI_COMM_WORLD);

    if (myrank == 0) {
        for (int i = 0; i < SIZE; i++)
            X[i] = i;
        cudaMemcpy(X_dev, X, sizeof(int) * SIZE, cudaMemcpyHostToDevice);
    }

    /* EXPERIMENT 1: rank 0 sends from host memory, rank 1 receives into device memory */
    if (myrank == 0) {
        gettimeofday(&tv1, NULL);
        MPI_Send(X, SIZE, MPI_INT, 1, 0, MPI_COMM_WORLD);
        gettimeofday(&tv2, NULL);
        printf("CPU_RAM_NODE_1 --> to --> GPU_RAM_NODE_2  : %llu ms\n",
               to_ms(tv2) - to_ms(tv1));
    } else {
        MPI_Recv(X_dev, SIZE, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* EXPERIMENT 2: rank 0 sends from device memory, rank 1 receives into device memory */
    if (myrank == 0) {
        gettimeofday(&tv1, NULL);
        MPI_Send(X_dev, SIZE, MPI_INT, 1, 0, MPI_COMM_WORLD);
        gettimeofday(&tv2, NULL);
        printf("GPU_RAM_NODE_1 --> to --> GPU_RAM_NODE_2 : %llu ms\n",
               to_ms(tv2) - to_ms(tv1));
    } else {
        MPI_Recv(X_dev, SIZE, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Barrier(MPI_COMM_WORLD);

    /* EXPERIMENT 3: staged path; device -> host on rank 0, host -> host over MPI,
     * host -> device on rank 1. The barrier below closes the timed interval. */
    if (myrank == 0) {
        gettimeofday(&tv1, NULL);
        cudaMemcpy(X, X_dev, sizeof(int) * SIZE, cudaMemcpyDeviceToHost);
        MPI_Send(X, SIZE, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(X, SIZE, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(X_dev, X, sizeof(int) * SIZE, cudaMemcpyHostToDevice);
    }

    MPI_Barrier(MPI_COMM_WORLD);

    if (myrank == 0) {
        gettimeofday(&tv2, NULL);
        printf("GPU_RAM_NODE_1 --> CPU_RAM_NODE_1 --> to --> CPU_RAM_NODE_2 --> GPU_RAM_NODE_2 : %llu ms\n",
               to_ms(tv2) - to_ms(tv1));
    }

    free(X);
    cudaFree(X_dev);
    MPI_Finalize();
    return 0;
}
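
As a side note, the timing could presumably also be done with MPI_Wtime instead of gettimeofday; this is just a sketch of what I mean for the Experiment 2 measurement, not something I have benchmarked separately:

double t1 = MPI_Wtime();   /* wall-clock time in seconds, as a double */
MPI_Send(X_dev, SIZE, MPI_INT, 1, 0, MPI_COMM_WORLD);
double t2 = MPI_Wtime();
printf("GPU_RAM_NODE_1 --> to --> GPU_RAM_NODE_2 : %.1f ms\n", (t2 - t1) * 1000.0);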

Compile command
/opt/mvapich2/gdr/2.0/gnu/bin/mpicc -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcuda  -lcudart -m64 -lstdc++ -o main main.cpp

mpirun command
[mimos at wnode4 RDMA_test]$ /opt/mvapich2/gdr/2.0/gnu/bin/mpirun_rsh -np 2  -hostfile hostfile ./main 50000000
inited!check rank :0
inited!check rank :1
CPU_RAM_NODE_1 --> to --> GPU_RAM_NODE_2  : 150 ms
GPU_RAM_NODE_1 --> to --> GPU_RAM_NODE_2 : 300 ms
GPU_RAM_NODE_1 --> CPU_RAM_NODE_1 --> to --> CPU_RAM_NODE_2 --> GPU_RAM_NODE_2 : 226 ms
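
For reference, I would also like to cross-check these numbers against the OSU micro-benchmarks bundled with MVAPICH2-GDR, e.g. device-to-device bandwidth (the benchmark path below is my assumption based on my install prefix; the exact location may differ):

/opt/mvapich2/gdr/2.0/gnu/bin/mpirun_rsh -np 2 -hostfile hostfile MV2_USE_CUDA=1 /opt/mvapich2/gdr/2.0/gnu/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw D D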


Regards,
Amirul
Software Engineer,
MIMOS Berhad.
