[mvapich-discuss] CUDA running issue in MVAPICH2
Dun Liang
randonlang at gmail.com
Fri Apr 10 04:31:45 EDT 2015
Hi Jonathan,
I tested cudaMemcpy myself; the latency within one card is about 10 us, and
about 23 us between different cards (8 NVIDIA Tesla cards in one machine).
Here is the `mpiname -a` output:
```
MVAPICH2 2.1rc2 Thu Mar 12 20:00:00 EDT 2014 ch3:mrail
Compilation
CC: gcc -DNDEBUG -DNVALGRIND -O2
CXX: g++ -DNDEBUG -DNVALGRIND -O2
F77: gfortran -L/lib -L/lib -O2
FC: gfortran -O2
Configuration
--prefix=/home/liangdun/mvapich/build --enable-cuda --disable-mcast
--with-cuda=/usr/local/cuda --with-device=ch3:mrail
```
and here is the output with MV2_SHOW_ENV_INFO=1:
```
MVAPICH2-2.1rc2 Parameters
---------------------------------------------------------------------
PROCESSOR ARCH NAME : MV2_ARCH_INTEL_GENERIC
PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_INTEL
PROCESSOR MODEL NUMBER : 62
HCA NAME : MV2_HCA_UNKWN
HETEROGENEOUS HCA : NO
MV2_EAGERSIZE_1SC : 0
MV2_SMP_EAGERSIZE : 65537
MV2_SMPI_LENGTH_QUEUE : 262144
MV2_SMP_NUM_SEND_BUFFER : 256
MV2_SMP_BATCH_SIZE : 8
---------------------------------------------------------------------
```
I also changed osu_benchmarks/osu_latency.c to measure cudaMemcpy latency
directly; the changed part is below:
```
    if(myid == 0) {
        for(i = 0; i < loop + skip; i++) {
            if(i == skip) t_start = MPI_Wtime();
            //fprintf(stdout, "0 %d\n", i);
            //MPI_Send(s_buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
            //MPI_Recv(r_buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD, &reqstat);
            cudaMemcpy(s_buf, r_buf, size, cudaMemcpyDeviceToDevice);
            cudaMemcpy(r_buf, s_buf, size, cudaMemcpyDeviceToDevice);
            cudaDeviceSynchronize();
        }
        t_end = MPI_Wtime();
    }
    else if(myid == 1) {
        for(i = 0; i < loop + skip; i++) {
            //fprintf(stdout, "1 %d\n", i);
            //MPI_Recv(r_buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &reqstat);
            //MPI_Send(s_buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
```
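As a cross-check on the MPI_Wtime-based loop above, the same round-trip copy can be timed with CUDA events. This is a minimal sketch, not part of the original patch; `s_buf`, `r_buf`, `size`, and `loop` are assumed to match the benchmark's variables:

```cuda
#include <cuda_runtime.h>

/* Time `loop` round-trip device-to-device copies with CUDA events.
 * s_buf and r_buf are assumed to be device pointers of at least `size` bytes.
 * Returns the average one-way latency in microseconds. */
static float time_d2d_roundtrip(void *s_buf, void *r_buf, size_t size, int loop)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < loop; i++) {
        cudaMemcpy(s_buf, r_buf, size, cudaMemcpyDeviceToDevice);
        cudaMemcpy(r_buf, s_buf, size, cudaMemcpyDeviceToDevice);
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);           /* wait for all copies to finish */
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms * 1000.0f / (2.0f * loop);  /* ms -> us, two copies per loop */
}
```

Event timing avoids any host-side timer skew around the per-iteration synchronization.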
I also changed the allocate_memory function to place the buffers on the same
or on different GPUs:
```
    cudaSetDevice(0);
    if (allocate_device_buffer(sbuf)) {
        fprintf(stderr, "Error allocating cuda memory\n");
        return 1;
    }
    cudaSetDevice(1);
    if (allocate_device_buffer(rbuf)) {
        fprintf(stderr, "Error allocating cuda memory\n");
        return 1;
    }
```
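When the two buffers live on different GPUs, the achievable device-to-device latency also depends on whether peer-to-peer access is enabled between them; without it, the copy is staged through host memory. A hedged sketch of the check, not part of the benchmark, with device numbers assumed to match the allocation above:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

/* Check and enable peer access between GPU 0 and GPU 1.
 * Returns 0 on success, -1 if P2P is not supported. */
static int enable_peer_access(void)
{
    int can01 = 0, can10 = 0;

    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) {
        fprintf(stderr, "P2P not supported between GPU 0 and GPU 1\n");
        return -1;
    }

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);   /* flags argument must be 0 */
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    return 0;
}
```

With peer access enabled, cross-device copies can also be expressed explicitly via cudaMemcpyPeer(dst, dstDevice, src, srcDevice, count).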
Here is the test result when both buffers are allocated on the same GPU:
```
# OSU MPI-CUDA Latency Test
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Latency (us)
1 10.04
2 10.02
4 10.07
8 10.17
16 10.04
32 10.01
64 10.03
128 9.43
256 9.73
512 9.50
1024 9.51
2048 9.54
4096 9.50
8192 9.75
16384 9.94
32768 9.60
65536 9.81
131072 10.53
262144 11.36
524288 13.82
1048576 21.45
2097152 31.96
4194304 53.96
```
Here is the test result when the buffers are allocated on different GPUs:
```
# OSU MPI-CUDA Latency Test
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Latency (us)
1 22.79
2 22.66
4 23.37
8 24.38
16 24.49
32 24.48
64 24.41
128 24.29
256 24.20
512 24.19
1024 24.85
2048 25.00
4096 25.72
8192 30.06
16384 30.67
32768 33.18
65536 37.00
131072 51.04
262144 73.39
524288 129.99
1048576 236.01
2097152 362.99
4194304 595.19
```
Both are faster than MPI_Send/MPI_Recv, so what is slowing down cudaMemcpy in
the MPI path?
2015-04-10 0:28 GMT+08:00 Jonathan Perkins <perkinjo at cse.ohio-state.edu>:
> Hi Dun. Your results look "okay" to me. Transfers originating from or
> landing on a GPU have much higher latency than those using standard CPU
> memory.
>
> We are able to achieve slightly lower latency in house but this may be due
> to our hardware and build settings compared to yours. Can you share the
> output of mpiname -a as well as the output from an osu_latency run
> with MV2_SHOW_ENV_INFO=1 also set?
>
> On Thu, Apr 9, 2015 at 12:12 PM randonlang at gmail.com <randonlang at gmail.com>
> wrote:
>
>> Thanks, Jonathan, it works! And thanks to Khaled too.
>> Sorry to bother you again, but I got some strange output: D to D is far
>> slower than H to H for small transfers, and even slower than H to D.
>>
>> here is the benchmark result:
>>
>> # OSU MPI-CUDA Latency Test
>> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
>> # Size Latency (us)
>> 1 63.42
>> 2 63.02
>> 4 61.95
>> 8 61.96
>> 16 61.87
>> 32 61.95
>> 64 61.92
>> 128 61.94
>> 256 61.97
>> 512 61.98
>> 1024 62.06
>> 2048 62.05
>> 4096 62.12
>> 8192 62.15
>> 16384 74.19
>> 32768 74.25
>> 65536 75.24
>> 131072 82.66
>> 262144 81.32
>> 524288 85.70
>> 1048576 121.99
>> 2097152 272.36
>> 4194304 585.34
>>
>> # OSU MPI-CUDA Latency Test
>> # Send Buffer on HOST (H) and Receive Buffer on HOST (H)
>> # Size Latency (us)
>> 1 0.92
>> 2 0.91
>> 4 0.91
>> 8 0.92
>> 16 0.91
>> 32 0.93
>> 64 0.99
>> 128 0.96
>> 256 1.03
>> 512 1.11
>> 1024 1.20
>> 2048 1.39
>> 4096 1.78
>> 8192 2.74
>> 16384 5.31
>> 32768 7.32
>> 65536 8.00
>> 131072 13.95
>> 262144 29.38
>> 524288 57.95
>> 1048576 115.65
>> 2097152 226.63
>> 4194304 571.31
>>
>>
>>
>> # OSU MPI-CUDA Latency Test
>>
>> # Send Buffer on HOST (H) and Receive Buffer on DEVICE (D)
>> # Size Latency (us)
>> 1 9.59
>> 2 9.73
>> 4 9.56
>> 8 9.66
>> 16 9.83
>> 32 9.63
>> 64 9.75
>> 128 8.57
>> 256 8.42
>> 512 8.87
>> 1024 8.62
>> 2048 8.79
>> 4096 9.34
>> 8192 10.37
>> 16384 12.40
>> 32768 19.03
>> 65536 21.84
>> 131072 35.24
>> 262144 66.08
>> 524288 110.40
>> 1048576 207.23
>> 2097152 354.09
>> 4194304 669.29
>>
>>
>> From: Jonathan Perkins <perkinjo at cse.ohio-state.edu>
>> Date: 2015-04-09 21:40
>> To: Dun Liang <randonlang at gmail.com>; mvapich-discuss
>> <mvapich-discuss at cse.ohio-state.edu>
>> Subject: Re: [mvapich-discuss] CUDA running issue in MVAPICH2
>>
>> Hi Dun, can you try setting MV2_USE_CUDA=1 when you run the benchmarks
>> with the device buffers?
>>
>> Example:
>> mpirun_rsh -np 2 debian81 debian81 MV2_USE_CUDA=1 ./osu_latency D D
>>
>> On Thu, Apr 9, 2015 at 8:54 AM Dun Liang <randonlang at gmail.com> wrote:
>>
>>> Dear developers:
>>>
>>> Currently I have some problems running MVAPICH2 with CUDA. The program
>>> is osu_latency. Here is the error message:
>>> ```
>>> ┌─[liangdun at debian81] -
>>> [~/mvapich/mvapich2-2.1rc2_ib/mvapich2-2.1rc2/osu_benchmarks/.libs] -
>>> [2015-04-09 06:17:20]
>>> └─[1] <> mpirun_rsh -np 2 debian81 debian81 ./osu_latency D D
>>> # OSU MPI-CUDA Latency Test
>>> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
>>> # Size Latency (us)
>>> [debian81:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
>>> (signal 11)
>>> [debian81:mpispawn_0][readline] Unexpected End-Of-File on file
>>> descriptor 6. MPI process died?
>>> [debian81:mpispawn_0][mtpmi_processops] Error while reading PMI socket.
>>> MPI process died?
>>> [debian81:mpispawn_0][child_handler] MPI process (rank: 0, pid: 1376)
>>> terminated with signal 11 -> abort job
>>> [debian81:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node
>>> debian81 aborted: Error while reading a PMI socket (4)
>>>
>>> ```
>>> It works fine when I run `./osu_latency H H`:
>>> ```
>>> ┌─[liangdun at debian81] -
>>> [~/mvapich/mvapich2-2.1rc2_ib/mvapich2-2.1rc2/osu_benchmarks/.libs] -
>>> [2015-04-09 06:17:41]
>>> └─[1] <> mpirun_rsh -np 2 debian81 debian81 ./osu_latency H H
>>> # OSU MPI-CUDA Latency Test
>>> # Send Buffer on HOST (H) and Receive Buffer on HOST (H)
>>> # Size Latency (us)
>>> 1 0.28
>>> 2 0.27
>>> 4 0.27
>>> 8 0.29
>>> 16 0.27
>>> 32 0.28
>>> 64 0.31
>>> 128 0.33
>>> 256 0.39
>>> 512 0.46
>>> 1024 0.56
>>> 2048 0.75
>>> 4096 1.24
>>> 8192 1.99
>>> 16384 3.71
>>> 32768 6.49
>>> 65536 6.96
>>> 131072 12.95
>>> 262144 27.73
>>> 524288 56.53
>>> 1048576 113.61
>>> 2097152 226.53
>>> 4194304 628.29
>>>
>>> ```
>>>
>>> Here is my MPI version info:
>>> ```
>>> MVAPICH2 Version: 2.1rc2
>>> MVAPICH2 Release date: Thu Mar 12 20:00:00 EDT 2014
>>> MVAPICH2 Device: ch3:mrail
>>> MVAPICH2 configure: --prefix=/home/liangdun/mvapich/build
>>> --enable-cuda --disable-mcast --with-cuda=/usr/local/cuda
>>> --with-device=ch3:mrail
>>> MVAPICH2 CC: gcc -DNDEBUG -DNVALGRIND -O2
>>> MVAPICH2 CXX: g++ -DNDEBUG -DNVALGRIND -O2
>>> MVAPICH2 F77: gfortran -L/lib -L/lib -O2
>>> MVAPICH2 FC: gfortran -O2
>>> ```
>>> The special circumstance is that there is no InfiniBand installed on my
>>> computer, but I still have to test CUDA, and I found that the
>>> --enable-cuda option does not work when I use --with-device=ch3:sock.
>>>
>>> Here are my questions:
>>> * Is this CUDA error caused by the missing InfiniBand installation?
>>> * Is there any way to test CUDA with a TCP/IP setup?
>>>
>>> Sorry for my poor English; I appreciate the MVAPICH team's work!
>>>
>>> best regards!
>>>
>>> Dun
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss at cse.ohio-state.edu
>>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>
>