[mvapich-discuss] MV2_USE_CUDA=1 gets ignored?

Joshua Anderson joaander at umich.edu
Wed Feb 6 11:58:28 EST 2013


Hi Christian,

I'm not sure if this is related, but I see similar behavior on our cluster when I link MVAPICH2 against the libcuda.so the admins provide on an NFS share. They do this because the head nodes don't have GPUs and therefore don't have libcuda.so. When I instead compile on a compute node and link against the libcuda.so on the local file system, the problem goes away. This is very strange because the two files are identical.
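
In case it's useful, this is roughly how I check which copy of libcuda.so a binary actually resolves at run time, and that the two copies really are identical (the NFS path below is just a placeholder for our setup):

ldd ./osu_bw | grep libcuda
md5sum /usr/lib64/libcuda.so.1 /nfs/apps/cuda/lib64/libcuda.so.1

ldd shows which file the runtime linker picks up (which depends on LD_LIBRARY_PATH and any rpath baked in at link time), and md5sum is how I convinced myself the two files are byte-for-byte identical.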

- Josh

On Feb 6, 2013, at 11:44 AM, Christian Trott wrote:

> Hi
> 
> I am trying to use GPU-to-GPU MPI communication on a new cluster of ours, and it always fails with segfaults. The funny thing is that I get the same valgrind output whether I set MV2_USE_CUDA=1 or not (output is further down). I downloaded the most recent 1.9a2 release, and this is my current configure line:
> 
> ./configure --enable-cuda --with-cuda=/home/crtrott/lib/cuda-5.0/ --prefix=/home/crtrott/mpi/mvapich2-1.9/gcc/cuda50a --disable-rdmacm --disable-mcast --enable-g=dbg --disable-fast
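> 
> (As a sanity check that CUDA support actually made it into the build: I believe the mpiname utility installed with MVAPICH2 prints back the version and the configure options it was built with, e.g.
> 
> /home/crtrott/mpi/mvapich2-1.9/gcc/cuda50a/bin/mpiname -a
> 
> and the output should list --enable-cuda and --with-cuda among the flags.)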
> 
> This is my run command:
> 
> mpirun -np 2 env MV2_USE_CUDA=1 MV2_DEBUG_SHOW_BACKTRACE=1 valgrind ./osu_bw D D
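> 
> (A quick way to confirm the variable actually reaches both ranks would be something like
> 
> mpirun -np 2 env MV2_USE_CUDA=1 printenv MV2_USE_CUDA
> 
> which should print 1 twice.)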
> 
> And this is the relevant valgrind output (which, as mentioned above, is the same with or without MV2_USE_CUDA=1):
> 
> ==58800== Warning: set address range perms: large range [0x3d00000000, 0x5e00000000) (noaccess)
> ==58801== Warning: set address range perms: large range [0x3d00000000, 0x5e00000000) (noaccess)
> ==58800== Warning: set address range perms: large range [0x2d00000000, 0x3100000000) (noaccess)
> ==58801== Warning: set address range perms: large range [0x2d00000000, 0x3100000000) (noaccess)
> # OSU MPI-CUDA Bandwidth Test
> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
> # Size        Bandwidth (MB/s)
> ==58800== Invalid read of size 1
> ==58800==    at 0x4A08020: memcpy (mc_replace_strmem.c:628)
> ==58800==    by 0x4452D6: MPIUI_Memcpy (mpiimpl.h:146)
> ==58800==    by 0x44D41E: MPIDI_CH3I_SMP_writev (ch3_smp_progress.c:2895)
> ==58800==    by 0x5DA884: MPIDI_CH3_SMP_iSendv (ch3_isendv.c:108)
> ==58800==    by 0x5DAC39: MPIDI_CH3_iSendv (ch3_isendv.c:187)
> ==58800==    by 0x5D1BBA: MPIDI_CH3_EagerContigIsend (ch3u_eager.c:632)
> ==58800==    by 0x42E09E: MPID_Isend (mpid_isend.c:220)
> ==58800==    by 0x40C1B3: PMPI_Isend (isend.c:122)
> ==58800==    by 0x406E85: main (osu_bw.c:242)
> ==58800==  Address 0x2d00200000 is not stack'd, malloc'd or (recently) free'd
> ==58800==
> [k20-0001:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
> [k20-0001:mpi_rank_0][print_backtrace]   0: ./osu_bw() [0x4b65a2]
> [k20-0001:mpi_rank_0][print_backtrace]   1: ./osu_bw() [0x4b66de]
> [k20-0001:mpi_rank_0][print_backtrace]   2: /lib64/libpthread.so.0() [0x38b7a0f4a0]
> [k20-0001:mpi_rank_0][print_backtrace]   3: /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so(_vgrZU_libcZdsoZa_memcpy+0x160) [0x4a08020]
> [k20-0001:mpi_rank_0][print_backtrace]   4: ./osu_bw() [0x4452d7]
> [k20-0001:mpi_rank_0][print_backtrace]   5: ./osu_bw() [0x44d41f]
> [k20-0001:mpi_rank_0][print_backtrace]   6: ./osu_bw() [0x5da885]
> [k20-0001:mpi_rank_0][print_backtrace]   7: ./osu_bw() [0x5dac3a]
> [k20-0001:mpi_rank_0][print_backtrace]   8: ./osu_bw() [0x5d1bbb]
> [k20-0001:mpi_rank_0][print_backtrace]   9: ./osu_bw() [0x42e09f]
> [k20-0001:mpi_rank_0][print_backtrace]  10: ./osu_bw() [0x40c1b4]
> [k20-0001:mpi_rank_0][print_backtrace]  11: ./osu_bw() [0x406e86]
> [k20-0001:mpi_rank_0][print_backtrace]  12: /lib64/libc.so.6(__libc_start_main+0xfd) [0x38b6e1ecdd]
> [k20-0001:mpi_rank_0][print_backtrace]  13: ./osu_bw() [0x4066a9]
> 
> Any suggestions would be greatly appreciated.
> 
> Christian
> 
> 
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss



