[mvapich-discuss] CUDA running issue in MVAPICH2

Dun Liang randonlang at gmail.com
Fri Apr 10 04:31:45 EDT 2015


Hi Jonathan,
I tested cudaMemcpy myself; the latency is about 10 µs within one card and
about 23 µs between different cards (8 NVIDIA Tesla GPUs in one machine).
Here is the `mpiname -a` output:
```
MVAPICH2 2.1rc2 Thu Mar 12 20:00:00 EDT 2014 ch3:mrail

Compilation
CC: gcc    -DNDEBUG -DNVALGRIND -O2
CXX: g++   -DNDEBUG -DNVALGRIND -O2
F77: gfortran -L/lib -L/lib   -O2
FC: gfortran   -O2

Configuration
--prefix=/home/liangdun/mvapich/build --enable-cuda --disable-mcast
--with-cuda=/usr/local/cuda --with-device=ch3:mrail
```
and the output with MV2_SHOW_ENV_INFO=1 set:
```
 MVAPICH2-2.1rc2 Parameters
---------------------------------------------------------------------
        PROCESSOR ARCH NAME            : MV2_ARCH_INTEL_GENERIC
        PROCESSOR FAMILY NAME          : MV2_CPU_FAMILY_INTEL
        PROCESSOR MODEL NUMBER         : 62
        HCA NAME                       : MV2_HCA_UNKWN
        HETEROGENEOUS HCA              : NO
        MV2_EAGERSIZE_1SC              : 0
        MV2_SMP_EAGERSIZE              : 65537
        MV2_SMPI_LENGTH_QUEUE          : 262144
        MV2_SMP_NUM_SEND_BUFFER        : 256
        MV2_SMP_BATCH_SIZE             : 8
---------------------------------------------------------------------
```

I also modified osu_benchmarks/osu_latency.c to measure cudaMemcpy
latency; the changed part is below:
```
        if(myid == 0) {
            for(i = 0; i < loop + skip; i++) {
                if(i == skip) t_start = MPI_Wtime();
                //fprintf(stdout, "0 %d\n", i);
                //MPI_Send(s_buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
                //MPI_Recv(r_buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD, &reqstat);
                /* modified: MPI send/recv replaced with direct device-to-device copies */
                cudaMemcpy(s_buf, r_buf, size, cudaMemcpyDeviceToDevice);
                cudaMemcpy(r_buf, s_buf, size, cudaMemcpyDeviceToDevice);
                cudaDeviceSynchronize();
            }

            t_end = MPI_Wtime();
        }

        else if(myid == 1) {
            for(i = 0; i < loop + skip; i++) {
                //fprintf(stdout, "1 %d\n", i);
                //MPI_Recv(r_buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &reqstat);
                //MPI_Send(s_buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
            }
        }
```
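
(As an aside, an equivalent way to time the raw copies outside of the
benchmark harness would be CUDA events; the standalone sketch below is only
for illustration and is not part of the change above. The buffer size and
loop count are arbitrary assumptions.)
```
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t size = 1 << 20;   /* 1 MiB transfer size, arbitrary */
    const int loop = 1000;         /* number of timed iterations, arbitrary */
    char *s_buf, *r_buf;
    cudaEvent_t start, stop;
    float ms = 0.0f;

    cudaMalloc((void **)&s_buf, size);
    cudaMalloc((void **)&r_buf, size);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* time 'loop' device-to-device copies on the default stream */
    cudaEventRecord(start, 0);
    for (int i = 0; i < loop; i++)
        cudaMemcpy(r_buf, s_buf, size, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    cudaEventElapsedTime(&ms, start, stop);  /* elapsed time in milliseconds */
    printf("average latency: %.2f us\n", ms * 1000.0f / loop);

    cudaFree(s_buf);
    cudaFree(r_buf);
    return 0;
}
```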
I also modified the function allocate_memory to specify where the buffers
are placed (on the same GPU or on different GPUs):
```
                /* modified: place the send buffer on GPU 0 */
                cudaSetDevice(0);
                if (allocate_device_buffer(sbuf)) {
                    fprintf(stderr, "Error allocating cuda memory\n");
                    return 1;
                }

                /* modified: place the receive buffer on GPU 1 */
                cudaSetDevice(1);
                if (allocate_device_buffer(rbuf)) {
                    fprintf(stderr, "Error allocating cuda memory\n");
                    return 1;
                }
```
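
(One thing that can affect the cross-GPU numbers is whether peer-to-peer
(P2P) access is enabled between the two devices; whether it is available at
all depends on the PCIe topology of the machine. A minimal sketch of checking
and enabling it for devices 0 and 1, not part of the benchmark modification
above, would be:)
```
int can_01 = 0, can_10 = 0;
cudaDeviceCanAccessPeer(&can_01, 0, 1);   /* can device 0 access device 1 directly? */
cudaDeviceCanAccessPeer(&can_10, 1, 0);   /* and the other direction */

if (can_01) {
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);     /* second argument (flags) must be 0 */
}
if (can_10) {
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
}
```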
Here is the result when both buffers are allocated on the same GPU:
```
# OSU MPI-CUDA Latency Test
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size            Latency (us)
1                        10.04
2                        10.02
4                        10.07
8                        10.17
16                       10.04
32                       10.01
64                       10.03
128                       9.43
256                       9.73
512                       9.50
1024                      9.51
2048                      9.54
4096                      9.50
8192                      9.75
16384                     9.94
32768                     9.60
65536                     9.81
131072                   10.53
262144                   11.36
524288                   13.82
1048576                  21.45
2097152                  31.96
4194304                  53.96
```

And here is the result when the buffers are allocated on different GPUs:
```
# OSU MPI-CUDA Latency Test
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size            Latency (us)
1                        22.79
2                        22.66
4                        23.37
8                        24.38
16                       24.49
32                       24.48
64                       24.41
128                      24.29
256                      24.20
512                      24.19
1024                     24.85
2048                     25.00
4096                     25.72
8192                     30.06
16384                    30.67
32768                    33.18
65536                    37.00
131072                   51.04
262144                   73.39
524288                  129.99
1048576                 236.01
2097152                 362.99
4194304                 595.19
```
Both results are faster than the corresponding MPI_Send/MPI_Recv latencies, so what is slowing the MPI path down compared to plain cudaMemcpy?


2015-04-10 0:28 GMT+08:00 Jonathan Perkins <perkinjo at cse.ohio-state.edu>:

> Hi Dun.  Your results look "okay" to me.  Transfers originating from or
> landing on a GPU have much higher latency than those using standard CPU
> memory.
>
> We are able to achieve slightly lower latency in house but this may be due
> to our hardware and build settings compared to yours.  Can you share the
> output of mpiname -a as well as the output from an osu_latency run
> with MV2_SHOW_ENV_INFO=1 also set?
>
> On Thu, Apr 9, 2015 at 12:12 PM randonlang at gmail.com <randonlang at gmail.com>
> wrote:
>
>> Thanks, Jonathan, it works! And thanks khaled too.
>> Sorry to bother you again :p
>> But I got some weird output: D to D is far slower than H to H when
>> transferring small data, and even slower than H to D.
>>
>> Here are the benchmark results:
>>
>> # OSU MPI-CUDA Latency Test
>> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
>> # Size Latency (us)
>> 1 63.42
>> 2 63.02
>> 4 61.95
>> 8 61.96
>> 16 61.87
>> 32 61.95
>> 64 61.92
>> 128 61.94
>> 256 61.97
>> 512 61.98
>> 1024 62.06
>> 2048 62.05
>> 4096 62.12
>> 8192 62.15
>> 16384 74.19
>> 32768 74.25
>> 65536 75.24
>> 131072 82.66
>> 262144 81.32
>> 524288 85.70
>> 1048576 121.99
>> 2097152 272.36
>> 4194304 585.34
>>
>> # OSU MPI-CUDA Latency Test
>> # Send Buffer on HOST (H) and Receive Buffer on HOST (H)
>> # Size Latency (us)
>> 1 0.92
>> 2 0.91
>> 4 0.91
>> 8 0.92
>> 16 0.91
>> 32 0.93
>> 64 0.99
>> 128 0.96
>> 256 1.03
>> 512 1.11
>> 1024 1.20
>> 2048 1.39
>> 4096 1.78
>> 8192 2.74
>> 16384 5.31
>> 32768 7.32
>> 65536 8.00
>> 131072 13.95
>> 262144 29.38
>> 524288 57.95
>> 1048576 115.65
>> 2097152 226.63
>> 4194304 571.31
>>
>>
>>
>> # OSU MPI-CUDA Latency Test
>>
>> # Send Buffer on HOST (H) and Receive Buffer on DEVICE (D)
>> # Size Latency (us)
>> 1 9.59
>> 2 9.73
>> 4 9.56
>> 8 9.66
>> 16 9.83
>> 32 9.63
>> 64 9.75
>> 128 8.57
>> 256 8.42
>> 512 8.87
>> 1024 8.62
>> 2048 8.79
>> 4096 9.34
>> 8192 10.37
>> 16384 12.40
>> 32768 19.03
>> 65536 21.84
>> 131072 35.24
>> 262144 66.08
>> 524288 110.40
>> 1048576 207.23
>> 2097152 354.09
>> 4194304 669.29
>>
>>
>> From: Jonathan Perkins <perkinjo at cse.ohio-state.edu>
>> Date: 2015-04-09 21:40
>> To: Dun Liang <randonlang at gmail.com>; mvapich-discuss
>> <mvapich-discuss at cse.ohio-state.edu>
>> Subject: Re: [mvapich-discuss] CUDA running issue in MVAPICH2
>>
>> Hi Dun, can you try setting MV2_USE_CUDA=1 when you run the benchmarks
>> with the device buffers?
>>
>> Example:
>> mpirun_rsh -np 2 debian81 debian81 MV2_USE_CUDA=1 ./osu_latency D D
>>
>> On Thu, Apr 9, 2015 at 8:54 AM Dun Liang <randonlang at gmail.com> wrote:
>>
>>> Dear developers:
>>>
>>> currently I have some problems running mvapich with cuda,
>>> the program is osu_latency
>>> here is the error msg:
>>> ```
>>> ┌─[liangdun at debian81] -
>>> [~/mvapich/mvapich2-2.1rc2_ib/mvapich2-2.1rc2/osu_benchmarks/.libs] -
>>> [2015-04-09 06:17:20]
>>> └─[1] <> mpirun_rsh -np 2 debian81 debian81 ./osu_latency D D
>>> # OSU MPI-CUDA Latency Test
>>> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
>>> # Size            Latency (us)
>>> [debian81:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
>>> (signal 11)
>>> [debian81:mpispawn_0][readline] Unexpected End-Of-File on file
>>> descriptor 6. MPI process died?
>>> [debian81:mpispawn_0][mtpmi_processops] Error while reading PMI socket.
>>> MPI process died?
>>> [debian81:mpispawn_0][child_handler] MPI process (rank: 0, pid: 1376)
>>> terminated with signal 11 -> abort job
>>> [debian81:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node
>>> debian81 aborted: Error while reading a PMI socket (4)
>>>
>>> ```
>>> it works fine when I run `./osu_latency H H`
>>> ```
>>> ┌─[liangdun at debian81] -
>>> [~/mvapich/mvapich2-2.1rc2_ib/mvapich2-2.1rc2/osu_benchmarks/.libs] -
>>> [2015-04-09 06:17:41]
>>> └─[1] <> mpirun_rsh -np 2 debian81 debian81 ./osu_latency H H
>>> # OSU MPI-CUDA Latency Test
>>> # Send Buffer on HOST (H) and Receive Buffer on HOST (H)
>>> # Size            Latency (us)
>>> 1                         0.28
>>> 2                         0.27
>>> 4                         0.27
>>> 8                         0.29
>>> 16                        0.27
>>> 32                        0.28
>>> 64                        0.31
>>> 128                       0.33
>>> 256                       0.39
>>> 512                       0.46
>>> 1024                      0.56
>>> 2048                      0.75
>>> 4096                      1.24
>>> 8192                      1.99
>>> 16384                     3.71
>>> 32768                     6.49
>>> 65536                     6.96
>>> 131072                   12.95
>>> 262144                   27.73
>>> 524288                   56.53
>>> 1048576                 113.61
>>> 2097152                 226.53
>>> 4194304                 628.29
>>>
>>> ```
>>>
>>> here is my mpi version info:
>>> ```
>>> MVAPICH2 Version:       2.1rc2
>>> MVAPICH2 Release date:  Thu Mar 12 20:00:00 EDT 2014
>>> MVAPICH2 Device:        ch3:mrail
>>> MVAPICH2 configure:     --prefix=/home/liangdun/mvapich/build
>>> --enable-cuda --disable-mcast --with-cuda=/usr/local/cuda
>>> --with-device=ch3:mrail
>>> MVAPICH2 CC:    gcc    -DNDEBUG -DNVALGRIND -O2
>>> MVAPICH2 CXX:   g++   -DNDEBUG -DNVALGRIND -O2
>>> MVAPICH2 F77:   gfortran -L/lib -L/lib   -O2
>>> MVAPICH2 FC:    gfortran   -O2
>>> ```
>>> The special circumstance is that there is no InfiniBand installed in my
>>> computer, but I still have to test CUDA. I found that the --enable-cuda
>>> configuration does not work when I use --with-device=ch3:sock.
>>>
>>> Here are my questions:
>>> * Is this CUDA error caused by the lack of an InfiniBand installation?
>>> * Is there any way to test CUDA with a TCP/IP setup?
>>>
>>> Sorry for my poor English, and I appreciate MVAPICH's work!
>>>
>>> best regards!
>>>
>>> Dun
>