[mvapich-discuss] MVAPICH 2.1a GDR cuda-aware/GDR data corrupted?
Filippo SPIGA
fs395 at cam.ac.uk
Sun Apr 12 17:14:05 EDT 2015
Dear MVAPICH-2 developers,
any news about this issue?
F
On Apr 1, 2015, at 4:02 PM, Jens Glaser <jsglaser at umich.edu> wrote:
> Khaled,
>
> setting the parameter as you suggested fixes the cuda-aware MPI case.
>
> However, all GDR tests still fail. As an example, I am showing the one for GDR without gdrcopy or loopback.
>
> Jens
>
> env | grep MV2
> MV2_USE_APM=0
> MV2_SMP_USE_LIMIC2=1
> MV2_CUDA_NONBLOCKING_STREAMS=0
>
> mpirun -np $SLURM_NTASKS -ppn 2 -genvall \
> -genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
> -genv MV2_CPU_MAPPING 0:1 \
> -genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
> -genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
> -genv MV2_USE_SHARED_MEM 0 \
> -genv MV2_USE_CUDA 1 -genv MV2_USE_GPUDIRECT 1 -genv MV2_CUDA_IPC 0 -genv MV2_USE_GPUDIRECT_LOOPBACK_LIMIT 9999999 \
> sh /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/get_local_rank /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw D D
>
> # OSU MPI-CUDA Bandwidth Test v4.4
> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
> # Size Bandwidth (MB/s)
> 1 0.01
> 2 0.03
> 4 0.05
> 8 0.11
> 16 0.21
> 32 0.81
> 64 1.62
> 128 3.24
> 256 6.35
> 512 12.80
> 1024 24.99
> 2048 48.55
> 4096 89.94
> 8192 594.49
> 16384 666.69
> Message byte 0, b != a   [repeated 21 times]
> 32768 14.72
> Message byte 0, b != a
> 65536 698.79
> 131072 778.72
> 262144 788.29
> 524288 776.84
> 1048576 774.68
> 2097152 186.34
> 4194304 196.13
>
>> On Mar 31, 2015, at 10:00 PM, Jens Glaser <jsglaser at umich.edu> wrote:
>>
>> Hi Khaled,
>>
>> no, I haven’t run inter-node. A quick test suggests that the behavior may be more sporadic there.
>> Let me know if you need detailed data.
>>
>> Jens
>>
>>
>>> On Mar 31, 2015, at 9:31 PM, khaled hamidouche <khaledhamidouche at gmail.com> wrote:
>>>
>>> Hi Jens,
>>>
>>> Thanks a lot for the reproducer; we will take a look at it and get back to you.
>>> In the meantime, I see that you specify ppn=2. Does this mean this is an intra-node job? Since osu_bw uses only 2 processes, are these processes on the same node? Does it also happen inter-node?
>>>
>>> Thanks
>>>
>>> On Tue, Mar 31, 2015 at 9:10 PM, Jens Glaser <jsglaser at umich.edu> wrote:
>>> Hi,
>>>
>>> I am observing bad data with MVAPICH 2.1a GDR in non-blocking, point-to-point communication.
>>> Host-host communication is fine, but both cuda-aware MPI and cuda-aware MPI with GPUDirect RDMA fail.
>>> I have additional data showing similar behavior for MVAPICH 2.0 GDR.
>>>
>>> Jens
>>>
>>> DETAILS:
>>>
>>> 1. To test communication correctness, I modify the MPI_Irecv call in the bandwidth test of the OSU micro-benchmarks (4.4)
>>> so that the data received in each iteration of the benchmark, for a given message size,
>>> is written contiguously into an expanded output buffer. I then check whether the received characters match
>>> the expected value ('a'). (osu_bw fills the send buffer with 'a' and the receive buffer with 'b', so a reported "b != a" means the received bytes were never overwritten.)
>>>
>>> Patch to osu_bw.c:
>>> --- osu_bw.c 2015-03-31 20:29:32.000000000 -0400
>>> +++ osu_bw_expanded_buf.c 2015-03-31 20:24:22.000000000 -0400
>>> @@ -42,9 +42,11 @@
>>>
>>> #define MAX_REQ_NUM 1000
>>>
>>> +#define WINDOW_SIZE 64
>>> +
>>> #define MAX_ALIGNMENT 65536
>>> #define MAX_MSG_SIZE (1<<22)
>>> -#define MYBUFSIZE (MAX_MSG_SIZE + MAX_ALIGNMENT)
>>> +#define MYBUFSIZE (MAX_MSG_SIZE*WINDOW_SIZE + MAX_ALIGNMENT)
>>>
>>> #define LOOP_LARGE 20
>>> #define WINDOW_SIZE_LARGE 64
>>> @@ -98,6 +100,7 @@
>>> int allocate_memory (char **sbuf, char **rbuf, int rank);
>>> void print_header (int rank);
>>> void touch_data (void *sbuf, void *rbuf, int rank, size_t size);
>>> +void check_data (void *buf, size_t size);
>>> void free_memory (void *sbuf, void *rbuf, int rank);
>>> int init_accel (void);
>>> int cleanup_accel (void);
>>> @@ -110,7 +113,7 @@
>>> char *s_buf, *r_buf;
>>> double t_start = 0.0, t_end = 0.0, t = 0.0;
>>> int loop = 100;
>>> - int window_size = 64;
>>> + int window_size = WINDOW_SIZE;
>>> int skip = 10;
>>> int po_ret = process_options(argc, argv);
>>>
>>> @@ -205,12 +208,16 @@
>>> else if(myid == 1) {
>>> for(i = 0; i < loop + skip; i++) {
>>> for(j = 0; j < window_size; j++) {
>>> - MPI_Irecv(r_buf, size, MPI_CHAR, 0, 100, MPI_COMM_WORLD,
>>> + MPI_Irecv(r_buf + j*size, size, MPI_CHAR, 0, 100, MPI_COMM_WORLD,
>>> request + j);
>>> }
>>>
>>> MPI_Waitall(window_size, request, reqstat);
>>> MPI_Send(s_buf, 4, MPI_CHAR, 0, 101, MPI_COMM_WORLD);
>>> +
>>> + check_data(r_buf, size*window_size);
>>> +
>>> +
>>> }
>>> }
>>>
>>> @@ -564,6 +571,40 @@
>>> }
>>> }
>>>
>>> +void
>>> +check_data (void * buf, size_t size)
>>> +{
>>> + char *h_rbuf;
>>> + #ifdef _ENABLE_CUDA_
>>> + if ('D' == options.dst) {
>>> + h_rbuf = malloc(size);
>>> + cudaError_t cuerr = cudaMemcpy(h_rbuf, buf, size, cudaMemcpyDeviceToHost);
>>> + if (cudaSuccess != cuerr) {
>>> + fprintf(stderr, "Error copying D2H\n");
>>> + free(h_rbuf);
>>> + return;
>>> + }
>>> + } else
>>> + #endif
>>> + {
>>> + h_rbuf = buf;
>>> + }
>>> +
>>> + unsigned int i;
>>> + for (i = 0; i < size; ++i)
>>> + {
>>> + if (h_rbuf[i] != 'a')
>>> + {
>>> + printf("Message byte %d, %c != %c\n", i, h_rbuf[i], 'a');
>>> + break;
>>> + }
>>> + }
>>> + if ('D' == options.dst) {
>>> + free(h_rbuf);
>>> + }
>>> +}
>>> +
>>> +
>>> int
>>> free_device_buffer (void * buf)
>>> {
>>>
>>>
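>>> For reference, the same check can be exercised without patching the benchmark. Below is a minimal, self-contained sketch of the idea (the message size, tag, and error handling are illustrative choices of mine, not values taken from osu_bw.c, and it assumes a CUDA-aware MPI):
>>>
>>> #include <mpi.h>
>>> #include <cuda_runtime.h>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>>
>>> #define WINDOW 64
>>> #define MSG (32*1024)            /* 32 KB, one of the failing sizes above */
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     int rank, j;
>>>     size_t i, total = (size_t) MSG * WINDOW;
>>>     char *d_buf, *h_buf = malloc(total);
>>>     MPI_Request req[WINDOW];
>>>
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>     cudaMalloc((void **) &d_buf, total);
>>>
>>>     if (rank == 0) {             /* sender: device buffer filled with 'a' */
>>>         cudaMemset(d_buf, 'a', total);
>>>         for (j = 0; j < WINDOW; j++)
>>>             MPI_Isend(d_buf + (size_t) j * MSG, MSG, MPI_CHAR, 1, 100,
>>>                       MPI_COMM_WORLD, &req[j]);
>>>         MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
>>>     } else if (rank == 1) {      /* receiver: one offset per window slot */
>>>         for (j = 0; j < WINDOW; j++)
>>>             MPI_Irecv(d_buf + (size_t) j * MSG, MSG, MPI_CHAR, 0, 100,
>>>                       MPI_COMM_WORLD, &req[j]);
>>>         MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
>>>         /* validate on the host, like check_data() above */
>>>         cudaMemcpy(h_buf, d_buf, total, cudaMemcpyDeviceToHost);
>>>         for (i = 0; i < total; i++)
>>>             if (h_buf[i] != 'a') {
>>>                 printf("Message byte %zu, %c != a\n", i, h_buf[i]);
>>>                 break;
>>>             }
>>>     }
>>>
>>>     cudaFree(d_buf);
>>>     free(h_buf);
>>>     MPI_Finalize();
>>>     return 0;
>>> }
>>>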
>>> 2. I execute the test on a dual-rail node with two GPUs and two HCAs on different PCIe segments.
>>> Specifically, I am testing on the Wilkes cluster. The four configurations tested are:
>>> Host-Host, Device-Device cuda-aware, Device-Device GDR, and Device-Device GDR without loopback. The CUDA toolkit version is 6.5.
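>>>
>>> As an aside, the get_local_rank wrapper used in the commands below exports each process's node-local rank so that it can select its own GPU. A minimal sketch of that selection logic (assuming the MV2_COMM_WORLD_LOCAL_RANK variable exported by the MVAPICH2 launcher; the function itself is my illustration, not the script's literal contents):
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <cuda_runtime.h>
>>>
>>> /* Sketch: bind this process to a GPU chosen by its node-local rank. */
>>> static int bind_gpu_by_local_rank(void)
>>> {
>>>     const char *s = getenv("MV2_COMM_WORLD_LOCAL_RANK"); /* assumed source */
>>>     int local_rank = s ? atoi(s) : 0;
>>>     int ndev = 0;
>>>
>>>     if (cudaGetDeviceCount(&ndev) != cudaSuccess || ndev < 1) {
>>>         fprintf(stderr, "no CUDA devices visible\n");
>>>         return -1;
>>>     }
>>>     /* On this dual-GPU node, local ranks 0 and 1 land on GPU0 and GPU1. */
>>>     return (cudaSetDevice(local_rank % ndev) == cudaSuccess) ? 0 : -1;
>>> }
>>>
>>> Calling this before any CUDA allocations puts ranks 0 and 1 on GPU0 and GPU1, matching the rail mapping used below.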
>>>
>>> These are the results:
>>>
>>> a) Host-Host
>>> mpirun -np $SLURM_NTASKS -ppn 2 -genvall \
>>> -genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
>>> -genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
>>> -genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
>>> -genv MV2_USE_SHARED_MEM 0 \
>>> -genv MV2_USE_CUDA 1 -genv MV2_USE_GPUDIRECT 0 -genv MV2_GPUDIRECT_GDRCOPY_LIB ${GDRCOPY_LIBRARY_PATH}/libgdrapi.so -genv MV2_CUDA_IPC 0 \
>>> sh /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/get_local_rank /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw H H
>>>
>>> # OSU MPI-CUDA Bandwidth Test v4.4
>>> # Send Buffer on HOST (H) and Receive Buffer on HOST (H)
>>> # Size Bandwidth (MB/s)
>>> 1 1.11
>>> 2 2.20
>>> 4 4.43
>>> 8 8.89
>>> 16 17.84
>>> 32 35.64
>>> 64 70.33
>>> 128 133.84
>>> 256 242.58
>>> 512 359.18
>>> 1024 578.63
>>> 2048 828.26
>>> 4096 1011.72
>>> 8192 1134.18
>>> 16384 1205.19
>>> 32768 1261.87
>>> 65536 1272.95
>>> 131072 1279.46
>>> 262144 1275.65
>>> 524288 1275.42
>>> 1048576 1275.61
>>> 2097152 1277.70
>>> 4194304 1278.82
>>>
>>> -> OK
>>>
>>> b) Device-Device cuda-aware
>>>
>>> mpirun -np $SLURM_NTASKS -ppn 2 -genvall \
>>> -genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
>>> -genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
>>> -genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
>>> -genv MV2_USE_SHARED_MEM 0 \
>>> -genv MV2_USE_CUDA 1 -genv MV2_USE_GPUDIRECT 0 -genv MV2_GPUDIRECT_GDRCOPY_LIB ${GDRCOPY_LIBRARY_PATH}/libgdrapi.so -genv MV2_CUDA_IPC 0 \
>>> sh /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/get_local_rank /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw D D
>>>
>>> # OSU MPI-CUDA Bandwidth Test v4.4
>>> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
>>> # Size Bandwidth (MB/s)
>>> Warning *** The GPU and IB selected are not on the same socket. Do not delever the best performance
>>> 1 0.07
>>> 2 0.13
>>> 4 0.34
>>> 8 0.69
>>> 16 1.36
>>> 32 2.73
>>> 64 5.42
>>> 128 10.87
>>> 256 21.70
>>> 512 43.21
>>> 1024 84.83
>>> 2048 161.70
>>> 4096 299.68
>>> 8192 412.03
>>> 16384 501.18
>>> 32768 543.28
>>> Message byte 0, b != a
>>> 65536 661.09
>>> Message byte 0, b != a
>>> 131072 739.19
>>> Message byte 0, b != a
>>> 262144 770.89
>>> Message byte 0, b != a
>>> 524288 761.48
>>> 1048576 756.53
>>> 2097152 757.82
>>> Message byte 0, b != a
>>> 4194304 755.51
>>>
>>> -> FAIL
>>>
>>> c) Device-Device GDR
>>> unset MV2_GPUDIRECT_GDRCOPY_LIB
>>> mpirun -np $SLURM_NTASKS -ppn 2 -genvall \
>>> -genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
>>> -genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
>>> -genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
>>> -genv MV2_USE_SHARED_MEM 0 \
>>> -genv MV2_USE_CUDA 1 -genv MV2_USE_GPUDIRECT 1 -genv MV2_CUDA_IPC 0 \
>>> sh /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/get_local_rank /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw D D
>>>
>>> # OSU MPI-CUDA Bandwidth Test v4.4
>>> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
>>> # Size Bandwidth (MB/s)
>>> Warning *** The GPU and IB selected are not on the same socket. Do not delever the best performance
>>> 1 0.01
>>> 2 0.03
>>> 4 0.05
>>> 8 0.11
>>> 16 0.22
>>> 32 0.84
>>> 64 1.69
>>> 128 3.35
>>> 256 6.61
>>> 512 13.22
>>> 1024 25.67
>>> 2048 49.59
>>> 4096 92.64
>>> Message byte 0, b != a   [repeated 109 times]
>>> 8192 14.81
>>> Message byte 0, b != a
>>> 16384 421.67
>>> 32768 608.24
>>> 65536 721.74
>>> 131072 792.72
>>> 262144 795.85
>>> 524288 780.61
>>> 1048576 776.48
>>> 2097152 160.07
>>> 4194304 401.23
>>>
>>> -> FAIL
>>>
>>> d) Device-Device GDR (no loopback)
>>> unset MV2_GPUDIRECT_GDRCOPY_LIB
>>> mpirun -np $SLURM_NTASKS -ppn 2 -genvall \
>>> -genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
>>> -genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
>>> -genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
>>> -genv MV2_USE_SHARED_MEM 0 \
>>> -genv MV2_USE_CUDA 1 -genv MV2_USE_GPUDIRECT 1 -genv MV2_CUDA_IPC 0 -genv MV2_USE_GPUDIRECT_LOOPBACK_LIMIT 9999999 \
>>> sh /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/get_local_rank /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw D D
>>>
>>> # OSU MPI-CUDA Bandwidth Test v4.4
>>> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
>>> # Size Bandwidth (MB/s)
>>> Warning *** The GPU and IB selected are not on the same socket. Do not delever the best performance
>>> 1 0.01
>>> 2 0.03
>>> 4 0.05
>>> 8 0.11
>>> 16 0.22
>>> 32 0.83
>>> 64 1.67
>>> 128 3.33
>>> 256 6.57
>>> 512 13.08
>>> 1024 25.40
>>> 2048 49.38
>>> 4096 91.31
>>> 8192 595.21
>>> 16384 666.12
>>> Message byte 0, b != a
>>> 32768 605.65
>>> 65536 721.52
>>> 131072 791.46
>>> 262144 794.08
>>> 524288 779.70
>>> 1048576 776.23
>>> 2097152 187.64
>>> 4194304 196.25
>>>
>>> -> FAIL
>>>
>>> 3. Additional info:
>>>
>>> MVAPICH2 Version: 2.1a
>>> MVAPICH2 Release date: Sun Sep 21 12:00:00 EDT 2014
>>> MVAPICH2 Device: ch3:mrail
>>> MVAPICH2 configure: --build=x86_64-unknown-linux-gnu --host=x86_64-unknown-linux-gnu --target=x86_64-redhat-linux-gnu --program-prefix= --prefix=/opt/mvapich2/gdr/2.1a/gnu --exec-prefix=/opt/mvapich2/gdr/2.1a/gnu --bindir=/opt/mvapich2/gdr/2.1a/gnu/bin --sbindir=/opt/mvapich2/gdr/2.1a/gnu/sbin --sysconfdir=/opt/mvapich2/gdr/2.1a/gnu/etc --datadir=/opt/mvapich2/gdr/2.1a/gnu/share --includedir=/opt/mvapich2/gdr/2.1a/gnu/include --libdir=/opt/mvapich2/gdr/2.1a/gnu/lib64 --libexecdir=/opt/mvapich2/gdr/2.1a/gnu/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/opt/mvapich2/gdr/2.1a/gnu/share/man --infodir=/opt/mvapich2/gdr/2.1a/gnu/share/info --disable-rpath --disable-static --enable-shared --disable-rdma-cm --disable-mcast --enable-cuda --without-hydra-ckpointlib CPPFLAGS=-I/usr/local/cuda/include LDFLAGS=-L/usr/local/cuda/lib64 -Wl,-rpath,/usr/local/cuda/lib64 -Wl,-rpath,XORIGIN/placeholder
>>> MVAPICH2 CC: gcc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -DNDEBUG -DNVALGRIND -O2
>>> MVAPICH2 CXX: g++ -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -DNDEBUG -DNVALGRIND -O2
>>> MVAPICH2 F77: gfortran -L/lib -L/lib -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -I/opt/mvapich2/gdr/2.1a/gnu/lib64/gfortran/modules -O2
>>> MVAPICH2 FC: gfortran -O2
>>>
>>> ldd /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw
>>> linux-vdso.so.1 => (0x00007fff4ec9a000)
>>> libmpi.so.12 => /usr/local/Cluster-Apps/mvapich2-GDR/gnu/2.1a_cuda-6.5/lib64/libmpi.so.12 (0x00007fd83ab34000)
>>> libc.so.6 => /lib64/libc.so.6 (0x00007fd83a776000)
>>> libcudart.so.6.5 => /usr/local/Cluster-Apps/cuda/6.5/lib64/libcudart.so.6.5 (0x00007fd83a526000)
>>> libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x00007fd8395b4000)
>>> libstdc++.so.6 => /usr/local/Cluster-Apps/gcc/4.8.1/lib64/libstdc++.so.6 (0x00007fd8392ab000)
>>> libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00007fd8390a0000)
>>> libibumad.so.3 => /usr/lib64/libibumad.so.3 (0x00007fd838e98000)
>>> libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00007fd838c82000)
>>> libdl.so.2 => /lib64/libdl.so.2 (0x00007fd838a7e000)
>>> librt.so.1 => /lib64/librt.so.1 (0x00007fd838875000)
>>> libgfortran.so.3 => /usr/local/Cluster-Apps/gcc/4.8.1/lib64/libgfortran.so.3 (0x00007fd83855f000)
>>> libm.so.6 => /lib64/libm.so.6 (0x00007fd8382db000)
>>> libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fd8380bd000)
>>> libgcc_s.so.1 => /usr/local/Cluster-Apps/gcc/4.8.1/lib64/libgcc_s.so.1 (0x00007fd837ea8000)
>>> /lib64/ld-linux-x86-64.so.2 (0x00007fd83b1f2000)
>>> libnl.so.1 => /lib64/libnl.so.1 (0x00007fd837c55000)
>>> libquadmath.so.0 => /usr/local/Cluster-Apps/gcc/4.8.1/lib/../lib64/libquadmath.so.0 (0x00007fd837a1a000)
>>>
>>> [hpcgla1@tesla80 qc_spiga]$ nvidia-smi topo -m
>>> GPU0 GPU1 mlx5_0 mlx5_1 CPU Affinity
>>> GPU0 X SOC PHB SOC 0-0,2-2,4-4,6-6,8-8,10-10
>>> GPU1 SOC X SOC PHB 1-1,3-3,5-5,7-7,9-9,11-11
>>> mlx5_0 PHB SOC X SOC
>>> mlx5_1 SOC PHB SOC X
>>>
>>> Legend:
>>>
>>> X = Self
>>> SOC = Path traverses a socket-level link (e.g. QPI)
>>> PHB = Path traverses a PCIe host bridge
>>> PXB = Path traverses multiple PCIe internal switches
>>> PIX = Path traverses a PCIe internal switch
>>>
>>> The warning message
>>> Warning *** The GPU and IB selected are not on the same socket. Do not delever the best performance
>>> goes away if I set MV2_CPU_MAPPING 0:1, but the behavior is otherwise unchanged.
>>>
>>> Additional details (IB configuration, loaded modules, OFED version, ...) available upon request.
>>>
>>> --
>>> K.H
>>
>
--
Mr. Filippo SPIGA, M.Sc. - HPC Application Specialist
High Performance Computing Service, University of Cambridge (UK)
http://www.hpc.cam.ac.uk/ ~ http://filippospiga.info ~ skype: filippo.spiga
«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert