[mvapich-discuss] MVAPICH 2.1a GDR cuda-aware/GDR data corrupted?

khaled hamidouche hamidouc at cse.ohio-state.edu
Sun Apr 12 18:02:48 EDT 2015


Hi Filippo,
Sorry for the delay,

As I mentioned to Jens earlier in this thread, disabling non-blocking
streams will fix the non-GDR issue.
For the GDR-related issue, can you please try setting
MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=32768
or 65536 (64K)? Please try 32K first and let us know if this fixes your issue.
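
For example, keeping the rest of your mpirun options as they appear in your logs, the GDR run
would look something like this (alternatively, export the variable in the environment and rely on -genvall):

mpirun -np $SLURM_NTASKS -ppn 2 -genvall \
-genv MV2_USE_CUDA 1 -genv MV2_USE_GPUDIRECT 1 -genv MV2_CUDA_IPC 0 \
-genv MV2_USE_GPUDIRECT_GDRCOPY_LIMIT 32768 \
sh /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/get_local_rank /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw D D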

Thanks a lot

On Sun, Apr 12, 2015 at 11:14 PM, Filippo SPIGA <fs395 at cam.ac.uk> wrote:

> Dear MVAPICH-2 developers,
>
> any news about this issue?
>
> F
>
> On Apr 1, 2015, at 4:02 PM, Jens Glaser <jsglaser at umich.edu> wrote:
> > Khaled,
> >
> > setting the parameter as you suggested fixes the cuda-aware MPI case.
> >
> > However, all GDR tests still fail. As an example, I am showing the one for GDR without gdrcopy or loopback.
> >
> > Jens
> >
> > env | grep MV2
> > MV2_USE_APM=0
> > MV2_SMP_USE_LIMIC2=1
> > MV2_CUDA_NONBLOCKING_STREAMS=0
> >
> > mpirun -np $SLURM_NTASKS -ppn 2 -genvall \
> > -genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
> > -genv MV2_CPU_MAPPING 0:1 \
> > -genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
> > -genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
> > -genv MV2_USE_SHARED_MEM 0 \
> > -genv MV2_USE_CUDA 1 -genv MV2_USE_GPUDIRECT 1 -genv MV2_CUDA_IPC 0 -genv MV2_USE_GPUDIRECT_LOOPBACK_LIMIT 9999999 \
> > sh /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/get_local_rank /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw D D
> >
> > # OSU MPI-CUDA Bandwidth Test v4.4
> > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
> > # Size      Bandwidth (MB/s)
> > 1                       0.01
> > 2                       0.03
> > 4                       0.05
> > 8                       0.11
> > 16                      0.21
> > 32                      0.81
> > 64                      1.62
> > 128                     3.24
> > 256                     6.35
> > 512                    12.80
> > 1024                   24.99
> > 2048                   48.55
> > 4096                   89.94
> > 8192                  594.49
> > 16384                 666.69
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > 32768                  14.72
> > Message byte 0, b != a
> > 65536                 698.79
> > 131072                778.72
> > 262144                788.29
> > 524288                776.84
> > 1048576               774.68
> > 2097152               186.34
> > 4194304               196.13
> >
> >> On Mar 31, 2015, at 10:00 PM, Jens Glaser <jsglaser at umich.edu> wrote:
> >>
> >> Hi Khaled,
> >>
> >> no, I haven’t run inter-node. A quick test suggests that the behavior may be more sporadic there.
> >> Let me know if you need detailed data.
> >>
> >> Jens
> >>
> >>
> >>> On Mar 31, 2015, at 9:31 PM, khaled hamidouche <
> khaledhamidouche at gmail.com> wrote:
> >>>
> >>> Hi Jens,
> >>>
> >>> Thanks a lot for the reproducer; we will take a look at it and get back to you.
> >>> In the meantime, I see that you specify ppn=2; does this mean that this is an intra-node job?
> >>> Since osu_bw uses only 2 processes, are these processes on the same node? Does it also happen
> >>> for inter-node runs?
> >>>
> >>> Thanks
> >>>
> >>> On Tue, Mar 31, 2015 at 9:10 PM, Jens Glaser <jsglaser at umich.edu>
> wrote:
> >>> Hi,
> >>>
> >>> I am observing bad data with MVAPICH 2.1a GDR in non-blocking, point-to-point communication.
> >>> Host-host communication is fine, but both cuda-aware MPI and cuda-aware MPI with GPUDirect RDMA fail.
> >>> I have additional data showing similar behavior for MVAPICH 2.0 GDR.
> >>>
> >>> Jens
> >>>
> >>> DETAILS:
> >>>
> >>> 1. To test communication correctness, I modify the MPI_Irecv call in the bandwidth test of the
> >>> OSU micro-benchmarks (4.4) so that the data received in each iteration of the benchmark, for a
> >>> given message size, is written contiguously into an expanded output buffer. I then check whether
> >>> the received characters match the expected value ('a').
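> >>>
> >>> In essence, the receiver side boils down to the following minimal sketch (the sizes here are
> >>> illustrative and GPU selection via get_local_rank is omitted; the actual modification is the
> >>> patch below):
> >>>
> >>> /* sketch of the non-blocking device-to-device pattern, assuming a CUDA-aware MPI build */
> >>> #include <mpi.h>
> >>> #include <cuda_runtime.h>
> >>> #include <stdio.h>
> >>> #include <stdlib.h>
> >>>
> >>> #define MSG_SIZE 65536
> >>> #define WINDOW   64
> >>>
> >>> int main(int argc, char **argv)
> >>> {
> >>>     int rank, j;
> >>>     size_t i;
> >>>     char *d_buf, *h_buf;
> >>>     MPI_Request req[WINDOW];
> >>>
> >>>     MPI_Init(&argc, &argv);
> >>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >>>     cudaMalloc((void **)&d_buf, (size_t)MSG_SIZE * WINDOW);
> >>>
> >>>     if (rank == 0) {
> >>>         /* sender: fill the device buffer with 'a', post WINDOW non-blocking sends */
> >>>         cudaMemset(d_buf, 'a', (size_t)MSG_SIZE * WINDOW);
> >>>         for (j = 0; j < WINDOW; j++)
> >>>             MPI_Isend(d_buf, MSG_SIZE, MPI_CHAR, 1, 100, MPI_COMM_WORLD, &req[j]);
> >>>         MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
> >>>     } else if (rank == 1) {
> >>>         /* receiver: one non-blocking receive per window slot, into distinct offsets */
> >>>         for (j = 0; j < WINDOW; j++)
> >>>             MPI_Irecv(d_buf + (size_t)j * MSG_SIZE, MSG_SIZE, MPI_CHAR, 0, 100,
> >>>                       MPI_COMM_WORLD, &req[j]);
> >>>         MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
> >>>
> >>>         /* copy back to the host and verify every received byte is 'a' */
> >>>         h_buf = malloc((size_t)MSG_SIZE * WINDOW);
> >>>         cudaMemcpy(h_buf, d_buf, (size_t)MSG_SIZE * WINDOW, cudaMemcpyDeviceToHost);
> >>>         for (i = 0; i < (size_t)MSG_SIZE * WINDOW; i++) {
> >>>             if (h_buf[i] != 'a') {
> >>>                 printf("byte %zu: %c != a\n", i, h_buf[i]);
> >>>                 break;
> >>>             }
> >>>         }
> >>>         free(h_buf);
> >>>     }
> >>>
> >>>     cudaFree(d_buf);
> >>>     MPI_Finalize();
> >>>     return 0;
> >>> }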
> >>>
> >>> Patch to osu_bw.c:
> >>> --- osu_bw.c        2015-03-31 20:29:32.000000000 -0400
> >>> +++ osu_bw_expanded_buf.c   2015-03-31 20:24:22.000000000 -0400
> >>> @@ -42,9 +42,11 @@
> >>>
> >>>  #define MAX_REQ_NUM 1000
> >>>
> >>> +#define WINDOW_SIZE 64
> >>> +
> >>>  #define MAX_ALIGNMENT 65536
> >>>  #define MAX_MSG_SIZE (1<<22)
> >>> -#define MYBUFSIZE (MAX_MSG_SIZE + MAX_ALIGNMENT)
> >>> +#define MYBUFSIZE (MAX_MSG_SIZE*WINDOW_SIZE + MAX_ALIGNMENT)
> >>>
> >>>  #define LOOP_LARGE  20
> >>>  #define WINDOW_SIZE_LARGE  64
> >>> @@ -98,6 +100,7 @@
> >>>  int allocate_memory (char **sbuf, char **rbuf, int rank);
> >>>  void print_header (int rank);
> >>>  void touch_data (void *sbuf, void *rbuf, int rank, size_t size);
> >>> +void check_data (void *buf, size_t size);
> >>>  void free_memory (void *sbuf, void *rbuf, int rank);
> >>>  int init_accel (void);
> >>>  int cleanup_accel (void);
> >>> @@ -110,7 +113,7 @@
> >>>      char *s_buf, *r_buf;
> >>>      double t_start = 0.0, t_end = 0.0, t = 0.0;
> >>>      int loop = 100;
> >>> -    int window_size = 64;
> >>> +    int window_size = WINDOW_SIZE;
> >>>      int skip = 10;
> >>>      int po_ret = process_options(argc, argv);
> >>>
> >>> @@ -205,12 +208,16 @@
> >>>          else if(myid == 1) {
> >>>              for(i = 0; i < loop + skip; i++) {
> >>>                  for(j = 0; j < window_size; j++) {
> >>> -                    MPI_Irecv(r_buf, size, MPI_CHAR, 0, 100, MPI_COMM_WORLD,
> >>> +                    MPI_Irecv(r_buf + j*size, size, MPI_CHAR, 0, 100, MPI_COMM_WORLD,
> >>>                              request + j);
> >>>                  }
> >>>
> >>>                  MPI_Waitall(window_size, request, reqstat);
> >>>                  MPI_Send(s_buf, 4, MPI_CHAR, 0, 101, MPI_COMM_WORLD);
> >>> +
> >>> +                check_data(r_buf, size*window_size);
> >>> +
> >>> +
> >>>              }
> >>>          }
> >>>
> >>> @@ -564,6 +571,39 @@
> >>>      }
> >>>  }
> >>>
> >>> +void
> >>> +check_data (void * buf, size_t size)
> >>> +{
> >>> +    char *h_rbuf;
> >>> +    #ifdef _ENABLE_CUDA_
> >>> +    if ('D' == options.dst) {
> >>> +        h_rbuf = malloc(size);
> >>> +        cudaError_t cuerr = cudaMemcpy(h_rbuf, buf, size, cudaMemcpyDeviceToHost);
> >>> +        if (cudaSuccess != cuerr) {
> >>> +            fprintf(stderr, "Error copying D2H\n");
> >>> +            return;
> >>> +        }
> >>> +    } else
> >>> +    #endif
> >>> +        {
> >>> +        h_rbuf = buf;
> >>> +    }
> >>> +
> >>> +    unsigned int i;
> >>> +    for (i = 0; i < size; ++i)
> >>> +        {
> >>> +        if (h_rbuf[i] != 'a')
> >>> +            {
> >>> +            printf("Message byte %d, %c != %c\n", i, h_rbuf[i], 'a');
> >>> +            break;
> >>> +            }
> >>> +        }
> >>> +    if ('D' == options.dst) {
> >>> +        free(h_rbuf);
> >>> +        }
> >>> +}
> >>> +
> >>> +
> >>>  int
> >>>  free_device_buffer (void * buf)
> >>>  {
> >>>
> >>>
> >>> 2. I execute the test on a dual-rail node, with two GPUs and two HCAs on different PCIe segments.
> >>> Specifically, I am testing on the Wilkes cluster. The configurations are:
> >>> Host-Host, Device-Device cuda-aware, and Device-Device GDR (with and without loopback).
> >>> The CUDA toolkit version is 6.5.
> >>>
> >>> These are the results:
> >>>
> >>> a) Host-Host
> >>> mpirun -np $SLURM_NTASKS -ppn 2 -genvall \
> >>> -genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
> >>> -genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
> >>> -genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
> >>> -genv MV2_USE_SHARED_MEM 0 \
> >>> -genv MV2_USE_CUDA 1 -genv MV2_USE_GPUDIRECT 0 -genv MV2_GPUDIRECT_GDRCOPY_LIB ${GDRCOPY_LIBRARY_PATH}/libgdrapi.so -genv MV2_CUDA_IPC 0 \
> >>> sh /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/get_local_rank /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw H H
> >>>
> >>> ldd /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw
> >>> # OSU MPI-CUDA Bandwidth Test v4.4
> >>> # Send Buffer on HOST (H) and Receive Buffer on HOST (H)
> >>> # Size      Bandwidth (MB/s)
> >>> 1                       1.11
> >>> 2                       2.20
> >>> 4                       4.43
> >>> 8                       8.89
> >>> 16                     17.84
> >>> 32                     35.64
> >>> 64                     70.33
> >>> 128                   133.84
> >>> 256                   242.58
> >>> 512                   359.18
> >>> 1024                  578.63
> >>> 2048                  828.26
> >>> 4096                 1011.72
> >>> 8192                 1134.18
> >>> 16384                1205.19
> >>> 32768                1261.87
> >>> 65536                1272.95
> >>> 131072               1279.46
> >>> 262144               1275.65
> >>> 524288               1275.42
> >>> 1048576              1275.61
> >>> 2097152              1277.70
> >>> 4194304              1278.82
> >>>
> >>> -> OK
> >>>
> >>> b) Device-Device cuda-aware
> >>>
> >>> mpirun -np $SLURM_NTASKS -ppn 2 -genvall \
> >>> -genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
> >>> -genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
> >>> -genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
> >>> -genv MV2_USE_SHARED_MEM 0 \
> >>> -genv MV2_USE_CUDA 1 -genv MV2_USE_GPUDIRECT 0 -genv MV2_GPUDIRECT_GDRCOPY_LIB ${GDRCOPY_LIBRARY_PATH}/libgdrapi.so -genv MV2_CUDA_IPC 0 \
> >>> sh /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/get_local_rank /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw D D
> >>>
> >>> # OSU MPI-CUDA Bandwidth Test v4.4
> >>> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
> >>> # Size      Bandwidth (MB/s)
> >>> Warning *** The GPU and IB selected are not on the same socket. Do not delever the best performance
> >>> 1                       0.07
> >>> 2                       0.13
> >>> 4                       0.34
> >>> 8                       0.69
> >>> 16                      1.36
> >>> 32                      2.73
> >>> 64                      5.42
> >>> 128                    10.87
> >>> 256                    21.70
> >>> 512                    43.21
> >>> 1024                   84.83
> >>> 2048                  161.70
> >>> 4096                  299.68
> >>> 8192                  412.03
> >>> 16384                 501.18
> >>> 32768                 543.28
> >>> Message byte 0, b != a
> >>> 65536                 661.09
> >>> Message byte 0, b != a
> >>> 131072                739.19
> >>> Message byte 0, b != a
> >>> 262144                770.89
> >>> Message byte 0, b != a
> >>> 524288                761.48
> >>> 1048576               756.53
> >>> 2097152               757.82
> >>> Message byte 0, b != a
> >>> 4194304               755.51
> >>>
> >>> -> FAIL
> >>>
> >>> c) Device-device GDR
> >>> unset MV2_GPUDIRECT_GDRCOPY_LIB
> >>> mpirun -np $SLURM_NTASKS -ppn 2 -genvall \
> >>> -genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
> >>> -genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
> >>> -genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
> >>> -genv MV2_USE_SHARED_MEM 0 \
> >>> -genv MV2_USE_CUDA 1 -genv MV2_USE_GPUDIRECT 1 -genv MV2_CUDA_IPC 0 \
> >>> sh /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/get_local_rank /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw D D
> >>>
> >>> # OSU MPI-CUDA Bandwidth Test v4.4
> >>> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
> >>> # Size      Bandwidth (MB/s)
> >>> Warning *** The GPU and IB selected are not on the same socket. Do not delever the best performance
> >>> 1                       0.01
> >>> 2                       0.03
> >>> 4                       0.05
> >>> 8                       0.11
> >>> 16                      0.22
> >>> 32                      0.84
> >>> 64                      1.69
> >>> 128                     3.35
> >>> 256                     6.61
> >>> 512                    13.22
> >>> 1024                   25.67
> >>> 2048                   49.59
> >>> 4096                   92.64
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> 8192                   14.81
> >>> Message byte 0, b != a
> >>> 16384                 421.67
> >>> 32768                 608.24
> >>> 65536                 721.74
> >>> 131072                792.72
> >>> 262144                795.85
> >>> 524288                780.61
> >>> 1048576               776.48
> >>> 2097152               160.07
> >>> 4194304               401.23
> >>>
> >>> -> FAIL
> >>>
> >>> d) Device-Device GDR (no loopback)
> >>> unset MV2_GPUDIRECT_GDRCOPY_LIB
> >>> mpirun -np $SLURM_NTASKS -ppn 2 -genvall \
> >>> -genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
> >>> -genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
> >>> -genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
> >>> -genv MV2_USE_SHARED_MEM 0 \
> >>> -genv MV2_USE_CUDA 1 -genv MV2_USE_GPUDIRECT 1 -genv MV2_CUDA_IPC 0 -genv MV2_USE_GPUDIRECT_LOOPBACK_LIMIT 9999999 \
> >>> sh /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/get_local_rank /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw D D
> >>>
> >>> # OSU MPI-CUDA Bandwidth Test v4.4
> >>> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
> >>> # Size      Bandwidth (MB/s)
> >>> Warning *** The GPU and IB selected are not on the same socket. Do not delever the best performance
> >>> 1                       0.01
> >>> 2                       0.03
> >>> 4                       0.05
> >>> 8                       0.11
> >>> 16                      0.22
> >>> 32                      0.83
> >>> 64                      1.67
> >>> 128                     3.33
> >>> 256                     6.57
> >>> 512                    13.08
> >>> 1024                   25.40
> >>> 2048                   49.38
> >>> 4096                   91.31
> >>> 8192                  595.21
> >>> 16384                 666.12
> >>> Message byte 0, b != a
> >>> 32768                 605.65
> >>> 65536                 721.52
> >>> 131072                791.46
> >>> 262144                794.08
> >>> 524288                779.70
> >>> 1048576               776.23
> >>> 2097152               187.64
> >>> 4194304               196.25
> >>>
> >>> -> FAIL
> >>>
> >>> 3. Additional info:
> >>>
> >>> MVAPICH2 Version:           2.1a
> >>> MVAPICH2 Release date:      Sun Sep 21 12:00:00 EDT 2014
> >>> MVAPICH2 Device:            ch3:mrail
> >>> MVAPICH2 configure:         --build=x86_64-unknown-linux-gnu
> --host=x86_64-unknown-linux-gnu --target=x86_64-redhat-linux-gnu
> --program-prefix= --prefix=/opt/mvapich2/gdr/2.1a/gnu
> --exec-prefix=/opt/mvapich2/gdr/2.1a/gnu
> --bindir=/opt/mvapich2/gdr/2.1a/gnu/bin
> --sbindir=/opt/mvapich2/gdr/2.1a/gnu/sbin
> --sysconfdir=/opt/mvapich2/gdr/2.1a/gnu/etc
> --datadir=/opt/mvapich2/gdr/2.1a/gnu/share
> --includedir=/opt/mvapich2/gdr/2.1a/gnu/include
> --libdir=/opt/mvapich2/gdr/2.1a/gnu/lib64
> --libexecdir=/opt/mvapich2/gdr/2.1a/gnu/libexec --localstatedir=/var
> --sharedstatedir=/var/lib --mandir=/opt/mvapich2/gdr/2.1a/gnu/share/man
> --infodir=/opt/mvapich2/gdr/2.1a/gnu/share/info --disable-rpath
> --disable-static --enable-shared --disable-rdma-cm --disable-mcast
> --enable-cuda --without-hydra-ckpointlib CPPFLAGS=-I/usr/local/cuda/include
> LDFLAGS=-L/usr/local/cuda/lib64 -Wl,-rpath,/usr/local/cuda/lib64
> -Wl,-rpath,XORIGIN/placeholder
> >>> MVAPICH2 CC:        gcc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
> -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
> -mtune=generic   -DNDEBUG -DNVALGRIND -O2
> >>> MVAPICH2 CXX:       g++ -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
> -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
> -mtune=generic  -DNDEBUG -DNVALGRIND -O2
> >>> MVAPICH2 F77:       gfortran -L/lib -L/lib -O2 -g -pipe -Wall
> -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
> --param=ssp-buffer-size=4 -m64 -mtune=generic
> -I/opt/mvapich2/gdr/2.1a/gnu/lib64/gfortran/modules  -O2
> >>> MVAPICH2 FC:        gfortran   -O2
> >>>
> >>> ldd /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw
> >>>     linux-vdso.so.1 =>  (0x00007fff4ec9a000)
> >>>     libmpi.so.12 =>
> /usr/local/Cluster-Apps/mvapich2-GDR/gnu/2.1a_cuda-6.5/lib64/libmpi.so.12
> (0x00007fd83ab34000)
> >>>     libc.so.6 => /lib64/libc.so.6 (0x00007fd83a776000)
> >>>     libcudart.so.6.5 =>
> /usr/local/Cluster-Apps/cuda/6.5/lib64/libcudart.so.6.5 (0x00007fd83a526000)
> >>>     libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x00007fd8395b4000)
> >>>     libstdc++.so.6 =>
> /usr/local/Cluster-Apps/gcc/4.8.1/lib64/libstdc++.so.6 (0x00007fd8392ab000)
> >>>     libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00007fd8390a0000)
> >>>     libibumad.so.3 => /usr/lib64/libibumad.so.3 (0x00007fd838e98000)
> >>>     libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00007fd838c82000)
> >>>     libdl.so.2 => /lib64/libdl.so.2 (0x00007fd838a7e000)
> >>>     librt.so.1 => /lib64/librt.so.1 (0x00007fd838875000)
> >>>     libgfortran.so.3 =>
> /usr/local/Cluster-Apps/gcc/4.8.1/lib64/libgfortran.so.3
> (0x00007fd83855f000)
> >>>     libm.so.6 => /lib64/libm.so.6 (0x00007fd8382db000)
> >>>     libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fd8380bd000)
> >>>     libgcc_s.so.1 =>
> /usr/local/Cluster-Apps/gcc/4.8.1/lib64/libgcc_s.so.1 (0x00007fd837ea8000)
> >>>     /lib64/ld-linux-x86-64.so.2 (0x00007fd83b1f2000)
> >>>     libnl.so.1 => /lib64/libnl.so.1 (0x00007fd837c55000)
> >>>     libquadmath.so.0 =>
> /usr/local/Cluster-Apps/gcc/4.8.1/lib/../lib64/libquadmath.so.0
> (0x00007fd837a1a000)
> >>>
> >>> [hpcgla1 at tesla80 qc_spiga]$ nvidia-smi topo -m
> >>>         GPU0 GPU1    mlx5_0  mlx5_1  CPU Affinity
> >>> GPU0     X      SOC     PHB     SOC     0-0,2-2,4-4,6-6,8-8,10-10
> >>> GPU1    SOC      X      SOC     PHB     1-1,3-3,5-5,7-7,9-9,11-11
> >>> mlx5_0  PHB     SOC      X      SOC
> >>> mlx5_1  SOC     PHB     SOC      X
> >>>
> >>> Legend:
> >>>
> >>>   X   = Self
> >>>   SOC = Path traverses a socket-level link (e.g. QPI)
> >>>   PHB = Path traverses a PCIe host bridge
> >>>   PXB = Path traverses multiple PCIe internal switches
> >>>   PIX = Path traverses a PCIe internal switch
> >>>
> >>> The warning message
> >>> Warning *** The GPU and IB selected are not on the same socket. Do not delever the best performance
> >>> goes away if I set MV2_CPU_MAPPING 0:1, but the behavior is otherwise unchanged.
> >>>
> >>> Additional details (IB configuration, loaded modules, OFED version, ...) are available upon request.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>>  K.H
> >>
> >
>
> --
> Mr. Filippo SPIGA, M.Sc. - HPC  Application Specialist
> High Performance Computing Service, University of Cambridge (UK)
> http://www.hpc.cam.ac.uk/ ~ http://filippospiga.info ~ skype:
> filippo.spiga
>
> «Nobody will drive us out of Cantor's paradise.» ~ David Hilbert
>
>