[mvapich-discuss] MVAPICH 2.1a GDR cuda-aware/GDR data corrupted?

Filippo SPIGA fs395 at cam.ac.uk
Sun Apr 12 17:14:05 EDT 2015


Dear MVAPICH-2 developers,

any news about this issue?

F

On Apr 1, 2015, at 4:02 PM, Jens Glaser <jsglaser at umich.edu> wrote:
> Khaled,
> 
> setting the parameter as you suggested fixes the cuda-aware MPI case.
> 
> However, all GDR tests still fail. As an example, here is the output for the GDR case without gdrcopy or loopback.
> 
> Jens
> 
> env | grep MV2
> MV2_USE_APM=0
> MV2_SMP_USE_LIMIC2=1
> MV2_CUDA_NONBLOCKING_STREAMS=0
> 
> mpirun -np $SLURM_NTASKS -ppn 2  -genvall \
> -genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
> -genv MV2_CPU_MAPPING 0:1 \
> -genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
> -genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
> -genv MV2_USE_SHARED_MEM 0 \
> -genv MV2_USE_CUDA 1 -genv MV2_USE_GPUDIRECT 1 -genv MV2_CUDA_IPC 0 -genv MV2_USE_GPUDIRECT_LOOPBACK_LIMIT 9999999 \
> sh /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/get_local_rank /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw D D
> 
> # OSU MPI-CUDA Bandwidth Test v4.4
> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
> # Size      Bandwidth (MB/s)
> 1                       0.01
> 2                       0.03
> 4                       0.05
> 8                       0.11
> 16                      0.21
> 32                      0.81
> 64                      1.62
> 128                     3.24
> 256                     6.35
> 512                    12.80
> 1024                   24.99
> 2048                   48.55
> 4096                   89.94
> 8192                  594.49
> 16384                 666.69
> Message byte 0, b != a
> Message byte 0, b != a
> Message byte 0, b != a
> Message byte 0, b != a
> Message byte 0, b != a
> Message byte 0, b != a
> Message byte 0, b != a
> Message byte 0, b != a
> Message byte 0, b != a
> Message byte 0, b != a
> Message byte 0, b != a
> Message byte 0, b != a
> Message byte 0, b != a
> Message byte 0, b != a
> Message byte 0, b != a
> Message byte 0, b != a
> Message byte 0, b != a
> Message byte 0, b != a
> Message byte 0, b != a
> Message byte 0, b != a
> Message byte 0, b != a
> 32768                  14.72
> Message byte 0, b != a
> 65536                 698.79
> 131072                778.72
> 262144                788.29
> 524288                776.84
> 1048576               774.68
> 2097152               186.34
> 4194304               196.13
> 
>> On Mar 31, 2015, at 10:00 PM, Jens Glaser <jsglaser at umich.edu> wrote:
>> 
>> Hi Khaled,
>> 
>> no, I haven’t run inter-node. A quick test suggests that the behavior may be more sporadic there.
>> Let me know if you need detailed data.
>> 
>> Jens
>> 
>> 
>>> On Mar 31, 2015, at 9:31 PM, khaled hamidouche <khaledhamidouche at gmail.com> wrote:
>>> 
>>> Hi Jens, 
>>> 
>>> Thanks a lot for the reproducer; we will take a look at it and get back to you. 
>>> In the meantime, I see that you specify ppn=2. Does this mean this is an intra-node job? Since osu_bw uses only 2 processes, are both processes on the same node? Does the corruption also happen inter-node? 
>>> 
>>> Thanks 
>>> 
>>> On Tue, Mar 31, 2015 at 9:10 PM, Jens Glaser <jsglaser at umich.edu> wrote:
>>> Hi,
>>> 
>>> I am observing data corruption with MVAPICH2-GDR 2.1a in non-blocking point-to-point communication.
>>> Host-to-host communication is fine, but both cuda-aware MPI and cuda-aware MPI with GPUDirect RDMA deliver corrupted data.
>>> I have additional data showing similar behavior with MVAPICH2-GDR 2.0.
>>> 
>>> Jens
>>> 
>>> DETAILS:
>>> 
>>> 1. To test communication correctness, I modify the MPI_Irecv calls in the bandwidth test of the OSU micro-benchmarks (4.4)
>>> so that the window_size non-blocking receives of each iteration, for a given message size, land at consecutive
>>> offsets of an expanded receive buffer instead of overwriting the same location. After each window completes,
>>> I check that every received byte matches the expected value ('a'). A minimal standalone sketch of the same check follows the patch below.
>>> 
>>> Patch to osu_bw.c:
>>> --- osu_bw.c	2015-03-31 20:29:32.000000000 -0400
>>> +++ osu_bw_expanded_buf.c	2015-03-31 20:24:22.000000000 -0400
>>> @@ -42,9 +42,11 @@
>>>  
>>>  #define MAX_REQ_NUM 1000
>>>  
>>> +#define WINDOW_SIZE 64
>>> +
>>>  #define MAX_ALIGNMENT 65536
>>>  #define MAX_MSG_SIZE (1<<22)
>>> -#define MYBUFSIZE (MAX_MSG_SIZE + MAX_ALIGNMENT)
>>> +#define MYBUFSIZE (MAX_MSG_SIZE*WINDOW_SIZE + MAX_ALIGNMENT)
>>>  
>>>  #define LOOP_LARGE  20
>>>  #define WINDOW_SIZE_LARGE  64
>>> @@ -98,6 +100,7 @@
>>>  int allocate_memory (char **sbuf, char **rbuf, int rank);
>>>  void print_header (int rank);
>>>  void touch_data (void *sbuf, void *rbuf, int rank, size_t size);
>>> +void check_data (void *buf, size_t size);
>>>  void free_memory (void *sbuf, void *rbuf, int rank);
>>>  int init_accel (void);
>>>  int cleanup_accel (void);
>>> @@ -110,7 +113,7 @@
>>>      char *s_buf, *r_buf;
>>>      double t_start = 0.0, t_end = 0.0, t = 0.0;
>>>      int loop = 100;
>>> -    int window_size = 64;
>>> +    int window_size = WINDOW_SIZE;
>>>      int skip = 10;
>>>      int po_ret = process_options(argc, argv);
>>>  
>>> @@ -205,12 +208,16 @@
>>>          else if(myid == 1) {
>>>              for(i = 0; i < loop + skip; i++) {
>>>                  for(j = 0; j < window_size; j++) {
>>> -                    MPI_Irecv(r_buf, size, MPI_CHAR, 0, 100, MPI_COMM_WORLD,
>>> +                    MPI_Irecv(r_buf + j*size, size, MPI_CHAR, 0, 100, MPI_COMM_WORLD,
>>>                              request + j);
>>>                  }
>>>  
>>>                  MPI_Waitall(window_size, request, reqstat);
>>>                  MPI_Send(s_buf, 4, MPI_CHAR, 0, 101, MPI_COMM_WORLD);
>>> +
>>> +                check_data(r_buf, size*window_size);
>>> +
>>> +
>>>              }
>>>          }
>>>  
>>> @@ -564,6 +571,39 @@
>>>      }
>>>  }
>>>  
>>> +void
>>> +check_data (void * buf, size_t size)
>>> +{
>>> +    char *h_rbuf;
>>> +    #ifdef _ENABLE_CUDA_
>>> +    if ('D' == options.dst) {
>>> +        h_rbuf = malloc(size);
>>> +        cudaError_t cuerr = cudaMemcpy(h_rbuf, buf, size, cudaMemcpyDeviceToHost);
>>> +        if (cudaSuccess != cuerr) {
>>> +            fprintf(stderr, "Error copying D2H\n");
>>> +            return;
>>> +        }
>>> +    } else
>>> +    #endif
>>> +        {
>>> +        h_rbuf = buf;
>>> +    }
>>> +
>>> +    unsigned int i;
>>> +    for (i = 0; i < size; ++i)
>>> +        {
>>> +        if (h_rbuf[i] != 'a')
>>> +            {
>>> +            printf("Message byte %d, %c != %c\n", i, h_rbuf[i], 'a');
>>> +            break;
>>> +            }
>>> +        }
>>> +    if ('D' == options.dst) {
>>> +        free(h_rbuf);
>>> +        }
>>> +}
>>> +
>>> +
>>>  int
>>>  free_device_buffer (void * buf)
>>>  {
>>> 
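>>> For reference, here is a minimal standalone sketch of the same idea, stripped of the OSU scaffolding: rank 0 posts a window of sends of an 'a'-filled device buffer, rank 1 posts the matching MPI_Irecv calls into consecutive offsets of one large device buffer, then copies it back to the host and verifies every byte. The message size, window size, tag, and the 'b' poisoning of the receive buffer are arbitrary choices of mine, not taken from the benchmark; it has to be run with MV2_USE_CUDA=1 (plus the GDR settings above for the GDR cases).
>>> 
>>> /* check_window.c: standalone sketch of the expanded-buffer correctness check. */
>>> #include <mpi.h>
>>> #include <cuda_runtime.h>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> 
>>> #define MSG_SIZE (1 << 16)   /* 64 KiB per message (arbitrary) */
>>> #define WINDOW   64          /* number of outstanding receives (arbitrary) */
>>> 
>>> int main(int argc, char **argv)
>>> {
>>>     int rank;
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>> 
>>>     char *d_buf;
>>>     size_t total = (size_t)MSG_SIZE * WINDOW;
>>>     cudaMalloc((void **)&d_buf, total);
>>> 
>>>     MPI_Request req[WINDOW];
>>> 
>>>     if (rank == 0) {
>>>         /* Fill the whole send buffer with 'a' on the device. */
>>>         cudaMemset(d_buf, 'a', total);
>>>         cudaDeviceSynchronize();
>>>         for (int j = 0; j < WINDOW; j++)
>>>             MPI_Isend(d_buf, MSG_SIZE, MPI_CHAR, 1, 100, MPI_COMM_WORLD, &req[j]);
>>>         MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
>>>     } else if (rank == 1) {
>>>         /* Poison the receive buffer so stale data is detectable. */
>>>         cudaMemset(d_buf, 'b', total);
>>>         cudaDeviceSynchronize();
>>>         for (int j = 0; j < WINDOW; j++)
>>>             MPI_Irecv(d_buf + (size_t)j * MSG_SIZE, MSG_SIZE, MPI_CHAR, 0, 100,
>>>                       MPI_COMM_WORLD, &req[j]);
>>>         MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
>>> 
>>>         /* Copy back to the host and verify every byte. */
>>>         char *h_buf = (char *)malloc(total);
>>>         cudaMemcpy(h_buf, d_buf, total, cudaMemcpyDeviceToHost);
>>>         size_t i;
>>>         for (i = 0; i < total; i++) {
>>>             if (h_buf[i] != 'a') {
>>>                 printf("Byte %zu corrupted: %c != a\n", i, h_buf[i]);
>>>                 break;
>>>             }
>>>         }
>>>         if (i == total)
>>>             printf("All %zu bytes correct\n", total);
>>>         free(h_buf);
>>>     }
>>> 
>>>     cudaFree(d_buf);
>>>     MPI_Finalize();
>>>     return 0;
>>> }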
>>> 
>>> 2. I execute the test on a dual-rail node with two GPUs and two HCAs on different PCIe segments.
>>> Specifically, I am testing on the Wilkes cluster with CUDA toolkit 6.5. The configurations tested are:
>>> Host-Host, Device-Device cuda-aware, Device-Device GDR, and Device-Device GDR without loopback.
>>> 
>>> These are the results:
>>> 
>>> a) Host-Host
>>> mpirun -np $SLURM_NTASKS -ppn 2  -genvall \
>>> -genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
>>> -genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
>>> -genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
>>> -genv MV2_USE_SHARED_MEM 0 \
>>> -genv MV2_USE_CUDA 1 -genv MV2_USE_GPUDIRECT 0 -genv MV2_GPUDIRECT_GDRCOPY_LIB ${GDRCOPY_LIBRARY_PATH}/libgdrapi.so -genv MV2_CUDA_IPC 0 \
>>> sh /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/get_local_rank /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw H H
>>> 
>>> ldd /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw
>>> # OSU MPI-CUDA Bandwidth Test v4.4
>>> # Send Buffer on HOST (H) and Receive Buffer on HOST (H)
>>> # Size      Bandwidth (MB/s)
>>> 1                       1.11
>>> 2                       2.20
>>> 4                       4.43
>>> 8                       8.89
>>> 16                     17.84
>>> 32                     35.64
>>> 64                     70.33
>>> 128                   133.84
>>> 256                   242.58
>>> 512                   359.18
>>> 1024                  578.63
>>> 2048                  828.26
>>> 4096                 1011.72
>>> 8192                 1134.18
>>> 16384                1205.19
>>> 32768                1261.87
>>> 65536                1272.95
>>> 131072               1279.46
>>> 262144               1275.65
>>> 524288               1275.42
>>> 1048576              1275.61
>>> 2097152              1277.70
>>> 4194304              1278.82
>>> 
>>> -> OK
>>> 
>>> b) Device-Device cuda-aware
>>> 
>>> mpirun -np $SLURM_NTASKS -ppn 2  -genvall \
>>> -genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
>>> -genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
>>> -genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
>>> -genv MV2_USE_SHARED_MEM 0 \
>>> -genv MV2_USE_CUDA 1 -genv MV2_USE_GPUDIRECT 0 -genv MV2_GPUDIRECT_GDRCOPY_LIB ${GDRCOPY_LIBRARY_PATH}/libgdrapi.so -genv MV2_CUDA_IPC 0 \
>>> sh /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/get_local_rank /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw D D
>>> 
>>> # OSU MPI-CUDA Bandwidth Test v4.4
>>> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
>>> # Size      Bandwidth (MB/s)
>>> Warning *** The GPU and IB selected are not on the same socket. Do not delever the best performance 
>>> 1                       0.07
>>> 2                       0.13
>>> 4                       0.34
>>> 8                       0.69
>>> 16                      1.36
>>> 32                      2.73
>>> 64                      5.42
>>> 128                    10.87
>>> 256                    21.70
>>> 512                    43.21
>>> 1024                   84.83
>>> 2048                  161.70
>>> 4096                  299.68
>>> 8192                  412.03
>>> 16384                 501.18
>>> 32768                 543.28
>>> Message byte 0, b != a
>>> 65536                 661.09
>>> Message byte 0, b != a
>>> 131072                739.19
>>> Message byte 0, b != a
>>> 262144                770.89
>>> Message byte 0, b != a
>>> 524288                761.48
>>> 1048576               756.53
>>> 2097152               757.82
>>> Message byte 0, b != a
>>> 4194304               755.51
>>> 
>>> -> FAIL
>>> 
>>> c) Device-device GDR
>>> unset MV2_GPUDIRECT_GDRCOPY_LIB
>>> mpirun -np $SLURM_NTASKS -ppn 2  -genvall \
>>> -genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
>>> -genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
>>> -genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
>>> -genv MV2_USE_SHARED_MEM 0 \
>>> -genv MV2_USE_CUDA 1 -genv MV2_USE_GPUDIRECT 1 -genv MV2_CUDA_IPC 0 \
>>> sh /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/get_local_rank /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw D D
>>> 
>>> # OSU MPI-CUDA Bandwidth Test v4.4
>>> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
>>> # Size      Bandwidth (MB/s)
>>> Warning *** The GPU and IB selected are not on the same socket. Do not delever the best performance 
>>> 1                       0.01
>>> 2                       0.03
>>> 4                       0.05
>>> 8                       0.11
>>> 16                      0.22
>>> 32                      0.84
>>> 64                      1.69
>>> 128                     3.35
>>> 256                     6.61
>>> 512                    13.22
>>> 1024                   25.67
>>> 2048                   49.59
>>> 4096                   92.64
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> Message byte 0, b != a
>>> 8192                   14.81
>>> Message byte 0, b != a
>>> 16384                 421.67
>>> 32768                 608.24
>>> 65536                 721.74
>>> 131072                792.72
>>> 262144                795.85
>>> 524288                780.61
>>> 1048576               776.48
>>> 2097152               160.07
>>> 4194304               401.23
>>> 
>>> -> FAIL
>>> 
>>> d) Device-Device GDR (no loopback)
>>> unset MV2_GPUDIRECT_GDRCOPY_LIB
>>> mpirun -np $SLURM_NTASKS -ppn 2  -genvall \
>>> -genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
>>> -genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
>>> -genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
>>> -genv MV2_USE_SHARED_MEM 0 \
>>> -genv MV2_USE_CUDA 1 -genv MV2_USE_GPUDIRECT 1 -genv MV2_CUDA_IPC 0 -genv MV2_USE_GPUDIRECT_LOOPBACK_LIMIT 9999999 \
>>> sh /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/get_local_rank /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw D D
>>> 
>>> # OSU MPI-CUDA Bandwidth Test v4.4
>>> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
>>> # Size      Bandwidth (MB/s)
>>> Warning *** The GPU and IB selected are not on the same socket. Do not delever the best performance 
>>> 1                       0.01
>>> 2                       0.03
>>> 4                       0.05
>>> 8                       0.11
>>> 16                      0.22
>>> 32                      0.83
>>> 64                      1.67
>>> 128                     3.33
>>> 256                     6.57
>>> 512                    13.08
>>> 1024                   25.40
>>> 2048                   49.38
>>> 4096                   91.31
>>> 8192                  595.21
>>> 16384                 666.12
>>> Message byte 0, b != a
>>> 32768                 605.65
>>> 65536                 721.52
>>> 131072                791.46
>>> 262144                794.08
>>> 524288                779.70
>>> 1048576               776.23
>>> 2097152               187.64
>>> 4194304               196.25
>>> 
>>> -> FAIL
>>> 
>>> 3. Additional info:
>>> 
>>> MVAPICH2 Version:     	2.1a
>>> MVAPICH2 Release date:	Sun Sep 21 12:00:00 EDT 2014
>>> MVAPICH2 Device:      	ch3:mrail
>>> MVAPICH2 configure:   	--build=x86_64-unknown-linux-gnu --host=x86_64-unknown-linux-gnu --target=x86_64-redhat-linux-gnu --program-prefix= --prefix=/opt/mvapich2/gdr/2.1a/gnu --exec-prefix=/opt/mvapich2/gdr/2.1a/gnu --bindir=/opt/mvapich2/gdr/2.1a/gnu/bin --sbindir=/opt/mvapich2/gdr/2.1a/gnu/sbin --sysconfdir=/opt/mvapich2/gdr/2.1a/gnu/etc --datadir=/opt/mvapich2/gdr/2.1a/gnu/share --includedir=/opt/mvapich2/gdr/2.1a/gnu/include --libdir=/opt/mvapich2/gdr/2.1a/gnu/lib64 --libexecdir=/opt/mvapich2/gdr/2.1a/gnu/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/opt/mvapich2/gdr/2.1a/gnu/share/man --infodir=/opt/mvapich2/gdr/2.1a/gnu/share/info --disable-rpath --disable-static --enable-shared --disable-rdma-cm --disable-mcast --enable-cuda --without-hydra-ckpointlib CPPFLAGS=-I/usr/local/cuda/include LDFLAGS=-L/usr/local/cuda/lib64 -Wl,-rpath,/usr/local/cuda/lib64 -Wl,-rpath,XORIGIN/placeholder
>>> MVAPICH2 CC:  	gcc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic   -DNDEBUG -DNVALGRIND -O2
>>> MVAPICH2 CXX: 	g++ -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic  -DNDEBUG -DNVALGRIND -O2
>>> MVAPICH2 F77: 	gfortran -L/lib -L/lib -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -I/opt/mvapich2/gdr/2.1a/gnu/lib64/gfortran/modules  -O2
>>> MVAPICH2 FC:  	gfortran   -O2
>>> 
>>> ldd /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw
>>>     linux-vdso.so.1 =>  (0x00007fff4ec9a000)
>>>     libmpi.so.12 => /usr/local/Cluster-Apps/mvapich2-GDR/gnu/2.1a_cuda-6.5/lib64/libmpi.so.12 (0x00007fd83ab34000)
>>>     libc.so.6 => /lib64/libc.so.6 (0x00007fd83a776000)
>>>     libcudart.so.6.5 => /usr/local/Cluster-Apps/cuda/6.5/lib64/libcudart.so.6.5 (0x00007fd83a526000)
>>>     libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x00007fd8395b4000)
>>>     libstdc++.so.6 => /usr/local/Cluster-Apps/gcc/4.8.1/lib64/libstdc++.so.6 (0x00007fd8392ab000)
>>>     libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00007fd8390a0000)
>>>     libibumad.so.3 => /usr/lib64/libibumad.so.3 (0x00007fd838e98000)
>>>     libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00007fd838c82000)
>>>     libdl.so.2 => /lib64/libdl.so.2 (0x00007fd838a7e000)
>>>     librt.so.1 => /lib64/librt.so.1 (0x00007fd838875000)
>>>     libgfortran.so.3 => /usr/local/Cluster-Apps/gcc/4.8.1/lib64/libgfortran.so.3 (0x00007fd83855f000)
>>>     libm.so.6 => /lib64/libm.so.6 (0x00007fd8382db000)
>>>     libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fd8380bd000)
>>>     libgcc_s.so.1 => /usr/local/Cluster-Apps/gcc/4.8.1/lib64/libgcc_s.so.1 (0x00007fd837ea8000)
>>>     /lib64/ld-linux-x86-64.so.2 (0x00007fd83b1f2000)
>>>     libnl.so.1 => /lib64/libnl.so.1 (0x00007fd837c55000)
>>>     libquadmath.so.0 => /usr/local/Cluster-Apps/gcc/4.8.1/lib/../lib64/libquadmath.so.0 (0x00007fd837a1a000)
>>> 
>>> [hpcgla1 at tesla80 qc_spiga]$ nvidia-smi topo -m
>>>         GPU0 GPU1    mlx5_0  mlx5_1  CPU Affinity
>>> GPU0     X      SOC     PHB     SOC     0-0,2-2,4-4,6-6,8-8,10-10
>>> GPU1    SOC      X      SOC     PHB     1-1,3-3,5-5,7-7,9-9,11-11
>>> mlx5_0  PHB     SOC      X      SOC     
>>> mlx5_1  SOC     PHB     SOC      X      
>>> 
>>> Legend:
>>> 
>>>   X   = Self
>>>   SOC = Path traverses a socket-level link (e.g. QPI)
>>>   PHB = Path traverses a PCIe host bridge
>>>   PXB = Path traverses multiple PCIe internal switches
>>>   PIX = Path traverses a PCIe internal switch
>>> 
>>> The warning message
>>> Warning *** The GPU and IB selected are not on the same socket. Do not delever the best performance
>>> goes away if I set MV2_CPU_MAPPING to 0:1, but the data corruption is otherwise unchanged.
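>>> 
>>> For completeness, the kind of per-rank device selection the get_local_rank wrapper is there to support can be sketched roughly as below. This is only an illustrative sketch, not the wrapper's actual contents; it assumes the launcher exports MV2_COMM_WORLD_LOCAL_RANK to each process (MVAPICH2 does), and the modulo mapping and fallback to device 0 are my own choices.
>>> 
>>> /* Illustrative sketch only: bind each rank to the GPU matching its local
>>>  * rank, so that GPU, HCA and CPU socket line up on this dual-rail node. */
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <cuda_runtime.h>
>>> 
>>> int main(void)
>>> {
>>>     /* MV2_COMM_WORLD_LOCAL_RANK is exported by the MVAPICH2 launcher;
>>>      * falling back to 0 is just a choice for this sketch. */
>>>     const char *env = getenv("MV2_COMM_WORLD_LOCAL_RANK");
>>>     int local_rank = env ? atoi(env) : 0;
>>> 
>>>     int ndev = 0;
>>>     cudaGetDeviceCount(&ndev);
>>>     int dev = (ndev > 0) ? local_rank % ndev : 0;
>>> 
>>>     /* Select the device before MPI_Init so the CUDA context the MPI
>>>      * library picks up lives on the intended GPU. */
>>>     cudaSetDevice(dev);
>>>     printf("local rank %d -> CUDA device %d\n", local_rank, dev);
>>> 
>>>     /* ... MPI_Init and the rest of the application would follow ... */
>>>     return 0;
>>> }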
>>> 
>>> Additional details (IB configuration, loaded modules, OFED version, ...) are available upon request.
>>> 
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss at cse.ohio-state.edu
>>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>>  K.H
>> 
> 
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

--
Mr. Filippo SPIGA, M.Sc. - HPC  Application Specialist
High Performance Computing Service, University of Cambridge (UK)
http://www.hpc.cam.ac.uk/ ~ http://filippospiga.info ~ skype: filippo.spiga

«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert

*****
Disclaimer: "Please note this message and any attachments are CONFIDENTIAL and may be privileged or otherwise protected from disclosure. The contents are not to be disclosed to anyone other than the addressee. Unauthorized recipients are requested to preserve this confidentiality and to advise the sender immediately of any error in transmission."



