[mvapich-discuss] MVAPICH 2.1a GDR cuda-aware/GDR data corrupted?
khaled hamidouche
hamidouc at cse.ohio-state.edu
Sun Apr 12 18:02:48 EDT 2015
Hi Filippo,
Sorry for the delay,
As I mentioned to Jens earlier in this thread, disabling non-blocking
CUDA streams (MV2_CUDA_NONBLOCKING_STREAMS=0) fixes the non-GDR issue.
For the GDR-related issue, can you please try setting
MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=32768
or 65536? Please try 32K first and let us know if this fixes your issue.
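As a quick sketch of how to set it (the -genv syntax matches the commands quoted below; the plain ./osu_bw invocation here is just shorthand for your full command line):
  export MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=32768    # retry with 65536 if the problem persists
  # or directly on the mpirun command line:
  mpirun -np 2 -ppn 2 -genvall -genv MV2_USE_GPUDIRECT_GDRCOPY_LIMIT 32768 ./osu_bw D D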
Thanks a lot
On Sun, Apr 12, 2015 at 11:14 PM, Filippo SPIGA <fs395 at cam.ac.uk> wrote:
> Dear MVAPICH-2 developers,
>
> any news about this issue?
>
> F
>
> On Apr 1, 2015, at 4:02 PM, Jens Glaser <jsglaser at umich.edu> wrote:
> > Khaled,
> >
> > setting the parameter as you suggested fixes the cuda-aware MPI case.
> >
> > However, all GDR tests still fail. As an example, I am showing the one
> for GDR without gdrcopy or loopback.
> >
> > Jens
> >
> > env | grep MV2
> > MV2_USE_APM=0
> > MV2_SMP_USE_LIMIC2=1
> > MV2_CUDA_NONBLOCKING_STREAMS=0
> >
> > mpirun -np $SLURM_NTASKS -ppn 2 -genvall \
> > -genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
> > -genv MV2_CPU_MAPPING 0:1 \
> > -genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
> > -genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
> > -genv MV2_USE_SHARED_MEM 0 \
> > -genv MV2_USE_CUDA 1 -genv MV2_USE_GPUDIRECT 1 -genv MV2_CUDA_IPC 0 -genv MV2_USE_GPUDIRECT_LOOPBACK_LIMIT 9999999 \
> > sh /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/get_local_rank /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw D D
> >
> > # OSU MPI-CUDA Bandwidth Test v4.4
> > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
> > # Size Bandwidth (MB/s)
> > 1 0.01
> > 2 0.03
> > 4 0.05
> > 8 0.11
> > 16 0.21
> > 32 0.81
> > 64 1.62
> > 128 3.24
> > 256 6.35
> > 512 12.80
> > 1024 24.99
> > 2048 48.55
> > 4096 89.94
> > 8192 594.49
> > 16384 666.69
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > Message byte 0, b != a
> > 32768 14.72
> > Message byte 0, b != a
> > 65536 698.79
> > 131072 778.72
> > 262144 788.29
> > 524288 776.84
> > 1048576 774.68
> > 2097152 186.34
> > 4194304 196.13
> >
> >> On Mar 31, 2015, at 10:00 PM, Jens Glaser <jsglaser at umich.edu> wrote:
> >>
> >> Hi Khaled,
> >>
> >> no, I haven’t run inter-node. A quick test suggests that the behavior
> may be more sporadic there.
> >> Let me know if you need detailed data.
> >>
> >> Jens
> >>
> >>
> >>> On Mar 31, 2015, at 9:31 PM, khaled hamidouche <
> khaledhamidouche at gmail.com> wrote:
> >>>
> >>> Hi Jens,
> >>>
> >>> Thanks a lot for the reproducer; we will take a look at it and get back to you.
> >>> In the meantime, I see that you specify ppn=2. Does this mean this is an intra-node job?
> >>> Since osu_bw uses only 2 processes, are both processes on the same node? Does it also happen inter-node?
> >>>
> >>> Thanks
> >>>
> >>> On Tue, Mar 31, 2015 at 9:10 PM, Jens Glaser <jsglaser at umich.edu>
> wrote:
> >>> Hi,
> >>>
> >>> I am observing bad data with MVAPICH 2.1a GDR in non-blocking,
> point-to-point communication.
> >>> Host-host communication is fine, but both cuda-aware MPI and
> cuda-aware MPI with GPUDirect RDMA fail.
> >>> I have additional data showing similar behavior for MVAPICH 2.0 GDR.
> >>>
> >>> Jens
> >>>
> >>> DETAILS:
> >>>
> >>> 1. To test communication correctness, I modify the MPI_Irecv calls in the bandwidth test of the OSU micro-benchmarks (4.4)
> >>> so that the messages received within one window iteration for a given message size
> >>> are written contiguously into an expanded receive buffer. After each window
> >>> I check that the received characters match the expected value ('a').
> >>>
> >>> Patch to osu_bw.c:
> >>> --- osu_bw.c 2015-03-31 20:29:32.000000000 -0400
> >>> +++ osu_bw_expanded_buf.c 2015-03-31 20:24:22.000000000 -0400
> >>> @@ -42,9 +42,11 @@
> >>>
> >>> #define MAX_REQ_NUM 1000
> >>>
> >>> +#define WINDOW_SIZE 64
> >>> +
> >>> #define MAX_ALIGNMENT 65536
> >>> #define MAX_MSG_SIZE (1<<22)
> >>> -#define MYBUFSIZE (MAX_MSG_SIZE + MAX_ALIGNMENT)
> >>> +#define MYBUFSIZE (MAX_MSG_SIZE*WINDOW_SIZE + MAX_ALIGNMENT)
> >>>
> >>> #define LOOP_LARGE 20
> >>> #define WINDOW_SIZE_LARGE 64
> >>> @@ -98,6 +100,7 @@
> >>> int allocate_memory (char **sbuf, char **rbuf, int rank);
> >>> void print_header (int rank);
> >>> void touch_data (void *sbuf, void *rbuf, int rank, size_t size);
> >>> +void check_data (void *buf, size_t size);
> >>> void free_memory (void *sbuf, void *rbuf, int rank);
> >>> int init_accel (void);
> >>> int cleanup_accel (void);
> >>> @@ -110,7 +113,7 @@
> >>> char *s_buf, *r_buf;
> >>> double t_start = 0.0, t_end = 0.0, t = 0.0;
> >>> int loop = 100;
> >>> - int window_size = 64;
> >>> + int window_size = WINDOW_SIZE;
> >>> int skip = 10;
> >>> int po_ret = process_options(argc, argv);
> >>>
> >>> @@ -205,12 +208,16 @@
> >>> else if(myid == 1) {
> >>> for(i = 0; i < loop + skip; i++) {
> >>> for(j = 0; j < window_size; j++) {
> >>> - MPI_Irecv(r_buf, size, MPI_CHAR, 0, 100, MPI_COMM_WORLD,
> >>> + MPI_Irecv(r_buf + j*size, size, MPI_CHAR, 0, 100, MPI_COMM_WORLD,
> >>> request + j);
> >>> }
> >>>
> >>> MPI_Waitall(window_size, request, reqstat);
> >>> MPI_Send(s_buf, 4, MPI_CHAR, 0, 101, MPI_COMM_WORLD);
> >>> +
> >>> + check_data(r_buf, size*window_size);
> >>> +
> >>> +
> >>> }
> >>> }
> >>>
> >>> @@ -564,6 +571,39 @@
> >>> }
> >>> }
> >>>
> >>> +void
> >>> +check_data (void * buf, size_t size)
> >>> +{
> >>> + char *h_rbuf;
> >>> + #ifdef _ENABLE_CUDA_
> >>> + if ('D' == options.dst) {
> >>> + h_rbuf = malloc(size);
> >>> + cudaError_t cuerr = cudaMemcpy(h_rbuf, buf, size, cudaMemcpyDeviceToHost);
> >>> + if (cudaSuccess != cuerr) {
> >>> + fprintf(stderr, "Error copying D2H\n");
> >>> + return;
> >>> + }
> >>> + } else
> >>> + #endif
> >>> + {
> >>> + h_rbuf = buf;
> >>> + }
> >>> +
> >>> + unsigned int i;
> >>> + for (i = 0; i < size; ++i)
> >>> + {
> >>> + if (h_rbuf[i] != 'a')
> >>> + {
> >>> + printf("Message byte %d, %c != %c\n", i, h_rbuf[i], 'a');
> >>> + break;
> >>> + }
> >>> + }
> >>> + if ('D' == options.dst) {
> >>> + free(h_rbuf);
> >>> + }
> >>> +}
> >>> +
> >>> +
> >>> int
> >>> free_device_buffer (void * buf)
> >>> {
> >>>
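> >>>
> >>> For completeness, a minimal sketch of how the patched benchmark is rebuilt (this assumes the OSU tree was already configured in place, and the diff above was saved as osu_bw_expanded_buf.patch, which is just an illustrative name):
> >>>
> >>> cd /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt
> >>> patch -p0 < osu_bw_expanded_buf.patch   # applies the diff above to osu_bw.c
> >>> make osu_bw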
> >>>
> >>> 2. I execute the test on a dual-rail node with two GPUs and two HCAs on different PCIe segments.
> >>> Specifically, I am testing on the Wilkes cluster. The configurations tested are:
> >>> Host-Host, Device-Device cuda-aware, Device-Device GDR, and Device-Device GDR without loopback. The CUDA toolkit version is 6.5.
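> >>>
> >>> Summarizing the full commands below, the runs differ mainly in these MV2 settings (everything else is identical; (a) Host-Host uses the same settings as (b), just with H H buffers):
> >>>
> >>> # (b) cuda-aware, no GDR: MV2_USE_CUDA=1 MV2_USE_GPUDIRECT=0 MV2_GPUDIRECT_GDRCOPY_LIB=${GDRCOPY_LIBRARY_PATH}/libgdrapi.so
> >>> # (c) GDR:                MV2_USE_CUDA=1 MV2_USE_GPUDIRECT=1 (MV2_GPUDIRECT_GDRCOPY_LIB unset)
> >>> # (d) GDR, no loopback:   MV2_USE_CUDA=1 MV2_USE_GPUDIRECT=1 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=9999999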
> >>>
> >>> These are the results:
> >>>
> >>> a) Host-Host
> >>> mpirun -np $SLURM_NTASKS -ppn 2 -genvall \
> >>> -genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
> >>> -genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
> >>> -genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
> >>> -genv MV2_USE_SHARED_MEM 0 \
> >>> -genv MV2_USE_CUDA 1 -genv MV2_USE_GPUDIRECT 0 -genv MV2_GPUDIRECT_GDRCOPY_LIB ${GDRCOPY_LIBRARY_PATH}/libgdrapi.so -genv MV2_CUDA_IPC 0 \
> >>> sh /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/get_local_rank /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw H H
> >>>
> >>> ldd /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw
> >>> # OSU MPI-CUDA Bandwidth Test v4.4
> >>> # Send Buffer on HOST (H) and Receive Buffer on HOST (H)
> >>> # Size Bandwidth (MB/s)
> >>> 1 1.11
> >>> 2 2.20
> >>> 4 4.43
> >>> 8 8.89
> >>> 16 17.84
> >>> 32 35.64
> >>> 64 70.33
> >>> 128 133.84
> >>> 256 242.58
> >>> 512 359.18
> >>> 1024 578.63
> >>> 2048 828.26
> >>> 4096 1011.72
> >>> 8192 1134.18
> >>> 16384 1205.19
> >>> 32768 1261.87
> >>> 65536 1272.95
> >>> 131072 1279.46
> >>> 262144 1275.65
> >>> 524288 1275.42
> >>> 1048576 1275.61
> >>> 2097152 1277.70
> >>> 4194304 1278.82
> >>>
> >>> -> OK
> >>>
> >>> b) Device-Device cuda-aware
> >>>
> >>> mpirun -np $SLURM_NTASKS -ppn 2 -genvall \
> >>> -genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
> >>> -genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
> >>> -genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
> >>> -genv MV2_USE_SHARED_MEM 0 \
> >>> -genv MV2_USE_CUDA 1 -genv MV2_USE_GPUDIRECT 0 -genv MV2_GPUDIRECT_GDRCOPY_LIB ${GDRCOPY_LIBRARY_PATH}/libgdrapi.so -genv MV2_CUDA_IPC 0 \
> >>> sh /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/get_local_rank /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw D D
> >>>
> >>> # OSU MPI-CUDA Bandwidth Test v4.4
> >>> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
> >>> # Size Bandwidth (MB/s)
> >>> Warning *** The GPU and IB selected are not on the same socket. Do not delever the best performance
> >>> 1 0.07
> >>> 2 0.13
> >>> 4 0.34
> >>> 8 0.69
> >>> 16 1.36
> >>> 32 2.73
> >>> 64 5.42
> >>> 128 10.87
> >>> 256 21.70
> >>> 512 43.21
> >>> 1024 84.83
> >>> 2048 161.70
> >>> 4096 299.68
> >>> 8192 412.03
> >>> 16384 501.18
> >>> 32768 543.28
> >>> Message byte 0, b != a
> >>> 65536 661.09
> >>> Message byte 0, b != a
> >>> 131072 739.19
> >>> Message byte 0, b != a
> >>> 262144 770.89
> >>> Message byte 0, b != a
> >>> 524288 761.48
> >>> 1048576 756.53
> >>> 2097152 757.82
> >>> Message byte 0, b != a
> >>> 4194304 755.51
> >>>
> >>> -> FAIL
> >>>
> >>> c) Device-device GDR
> >>> unset MV2_GPUDIRECT_GDRCOPY_LIB
> >>> mpirun -np $SLURM_NTASKS -ppn 2 -genvall \
> >>> -genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
> >>> -genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
> >>> -genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
> >>> -genv MV2_USE_SHARED_MEM 0 \
> >>> -genv MV2_USE_CUDA 1 -genv MV2_USE_GPUDIRECT 1 -genv MV2_CUDA_IPC 0 \
> >>> sh /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/get_local_rank /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw D D
> >>>
> >>> # OSU MPI-CUDA Bandwidth Test v4.4
> >>> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
> >>> # Size Bandwidth (MB/s)
> >>> Warning *** The GPU and IB selected are not on the same socket. Do not delever the best performance
> >>> 1 0.01
> >>> 2 0.03
> >>> 4 0.05
> >>> 8 0.11
> >>> 16 0.22
> >>> 32 0.84
> >>> 64 1.69
> >>> 128 3.35
> >>> 256 6.61
> >>> 512 13.22
> >>> 1024 25.67
> >>> 2048 49.59
> >>> 4096 92.64
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> Message byte 0, b != a
> >>> 8192 14.81
> >>> Message byte 0, b != a
> >>> 16384 421.67
> >>> 32768 608.24
> >>> 65536 721.74
> >>> 131072 792.72
> >>> 262144 795.85
> >>> 524288 780.61
> >>> 1048576 776.48
> >>> 2097152 160.07
> >>> 4194304 401.23
> >>>
> >>> -> FAIL
> >>>
> >>> d) Device-Device GDR (no loopback)
> >>> unset MV2_GPUDIRECT_GDRCOPY_LIB
> >>> mpirun -np $SLURM_NTASKS -ppn 2 -genvall \
> >>> -genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
> >>> -genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
> >>> -genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
> >>> -genv MV2_USE_SHARED_MEM 0 \
> >>> -genv MV2_USE_CUDA 1 -genv MV2_USE_GPUDIRECT 1 -genv MV2_CUDA_IPC 0 -genv MV2_USE_GPUDIRECT_LOOPBACK_LIMIT 9999999 \
> >>> sh /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/get_local_rank /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw D D
> >>>
> >>> # OSU MPI-CUDA Bandwidth Test v4.4
> >>> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
> >>> # Size Bandwidth (MB/s)
> >>> Warning *** The GPU and IB selected are not on the same socket. Do not delever the best performance
> >>> 1 0.01
> >>> 2 0.03
> >>> 4 0.05
> >>> 8 0.11
> >>> 16 0.22
> >>> 32 0.83
> >>> 64 1.67
> >>> 128 3.33
> >>> 256 6.57
> >>> 512 13.08
> >>> 1024 25.40
> >>> 2048 49.38
> >>> 4096 91.31
> >>> 8192 595.21
> >>> 16384 666.12
> >>> Message byte 0, b != a
> >>> 32768 605.65
> >>> 65536 721.52
> >>> 131072 791.46
> >>> 262144 794.08
> >>> 524288 779.70
> >>> 1048576 776.23
> >>> 2097152 187.64
> >>> 4194304 196.25
> >>>
> >>> -> FAIL
> >>>
> >>> 3. Additional info:
> >>>
> >>> MVAPICH2 Version: 2.1a
> >>> MVAPICH2 Release date: Sun Sep 21 12:00:00 EDT 2014
> >>> MVAPICH2 Device: ch3:mrail
> >>> MVAPICH2 configure: --build=x86_64-unknown-linux-gnu
> --host=x86_64-unknown-linux-gnu --target=x86_64-redhat-linux-gnu
> --program-prefix= --prefix=/opt/mvapich2/gdr/2.1a/gnu
> --exec-prefix=/opt/mvapich2/gdr/2.1a/gnu
> --bindir=/opt/mvapich2/gdr/2.1a/gnu/bin
> --sbindir=/opt/mvapich2/gdr/2.1a/gnu/sbin
> --sysconfdir=/opt/mvapich2/gdr/2.1a/gnu/etc
> --datadir=/opt/mvapich2/gdr/2.1a/gnu/share
> --includedir=/opt/mvapich2/gdr/2.1a/gnu/include
> --libdir=/opt/mvapich2/gdr/2.1a/gnu/lib64
> --libexecdir=/opt/mvapich2/gdr/2.1a/gnu/libexec --localstatedir=/var
> --sharedstatedir=/var/lib --mandir=/opt/mvapich2/gdr/2.1a/gnu/share/man
> --infodir=/opt/mvapich2/gdr/2.1a/gnu/share/info --disable-rpath
> --disable-static --enable-shared --disable-rdma-cm --disable-mcast
> --enable-cuda --without-hydra-ckpointlib CPPFLAGS=-I/usr/local/cuda/include
> LDFLAGS=-L/usr/local/cuda/lib64 -Wl,-rpath,/usr/local/cuda/lib64
> -Wl,-rpath,XORIGIN/placeholder
> >>> MVAPICH2 CC: gcc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
> -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
> -mtune=generic -DNDEBUG -DNVALGRIND -O2
> >>> MVAPICH2 CXX: g++ -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
> -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
> -mtune=generic -DNDEBUG -DNVALGRIND -O2
> >>> MVAPICH2 F77: gfortran -L/lib -L/lib -O2 -g -pipe -Wall
> -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
> --param=ssp-buffer-size=4 -m64 -mtune=generic
> -I/opt/mvapich2/gdr/2.1a/gnu/lib64/gfortran/modules -O2
> >>> MVAPICH2 FC: gfortran -O2
> >>>
> >>> ldd /home/hpcgla1/osu-micro-benchmarks-4.4-expanded/mpi/pt2pt/osu_bw
> >>> linux-vdso.so.1 => (0x00007fff4ec9a000)
> >>> libmpi.so.12 =>
> /usr/local/Cluster-Apps/mvapich2-GDR/gnu/2.1a_cuda-6.5/lib64/libmpi.so.12
> (0x00007fd83ab34000)
> >>> libc.so.6 => /lib64/libc.so.6 (0x00007fd83a776000)
> >>> libcudart.so.6.5 =>
> /usr/local/Cluster-Apps/cuda/6.5/lib64/libcudart.so.6.5 (0x00007fd83a526000)
> >>> libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x00007fd8395b4000)
> >>> libstdc++.so.6 =>
> /usr/local/Cluster-Apps/gcc/4.8.1/lib64/libstdc++.so.6 (0x00007fd8392ab000)
> >>> libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00007fd8390a0000)
> >>> libibumad.so.3 => /usr/lib64/libibumad.so.3 (0x00007fd838e98000)
> >>> libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00007fd838c82000)
> >>> libdl.so.2 => /lib64/libdl.so.2 (0x00007fd838a7e000)
> >>> librt.so.1 => /lib64/librt.so.1 (0x00007fd838875000)
> >>> libgfortran.so.3 =>
> /usr/local/Cluster-Apps/gcc/4.8.1/lib64/libgfortran.so.3
> (0x00007fd83855f000)
> >>> libm.so.6 => /lib64/libm.so.6 (0x00007fd8382db000)
> >>> libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fd8380bd000)
> >>> libgcc_s.so.1 =>
> /usr/local/Cluster-Apps/gcc/4.8.1/lib64/libgcc_s.so.1 (0x00007fd837ea8000)
> >>> /lib64/ld-linux-x86-64.so.2 (0x00007fd83b1f2000)
> >>> libnl.so.1 => /lib64/libnl.so.1 (0x00007fd837c55000)
> >>> libquadmath.so.0 =>
> /usr/local/Cluster-Apps/gcc/4.8.1/lib/../lib64/libquadmath.so.0
> (0x00007fd837a1a000)
> >>>
> >>> [hpcgla1 at tesla80 qc_spiga]$ nvidia-smi topo -m
> >>> GPU0 GPU1 mlx5_0 mlx5_1 CPU Affinity
> >>> GPU0 X SOC PHB SOC 0-0,2-2,4-4,6-6,8-8,10-10
> >>> GPU1 SOC X SOC PHB 1-1,3-3,5-5,7-7,9-9,11-11
> >>> mlx5_0 PHB SOC X SOC
> >>> mlx5_1 SOC PHB SOC X
> >>>
> >>> Legend:
> >>>
> >>> X = Self
> >>> SOC = Path traverses a socket-level link (e.g. QPI)
> >>> PHB = Path traverses a PCIe host bridge
> >>> PXB = Path traverses multiple PCIe internal switches
> >>> PIX = Path traverses a PCIe internal switch
> >>>
> >>> The warning message
> >>> Warning *** The GPU and IB selected are not on the same socket. Do not delever the best performance
> >>> goes away if I set MV2_CPU_MAPPING 0:1, but the behavior is otherwise unchanged.
> >>>
> >>> Additional details (IB configuration, loaded modules, OFED version, ...) available upon request.
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> K.H
> >>
> >
>
> --
> Mr. Filippo SPIGA, M.Sc. - HPC Application Specialist
> High Performance Computing Service, University of Cambridge (UK)
> http://www.hpc.cam.ac.uk/ ~ http://filippospiga.info ~ skype:
> filippo.spiga
>
> «Nobody will drive us out of Cantor's paradise.» ~ David Hilbert
>