[EXTERNAL] Re: [mvapich-discuss] MV2_USE_CUDA=1 gets ignored?

Christian Trott crtrott at sandia.gov
Wed Feb 6 12:11:08 EST 2013


Hi

Do you mean you compiled mvapich on the compute node, linking against 
local files?
I am already compiling on the compute nodes, but the filesystem is NFS, 
if I am not mistaken.
Here is one more piece of info:

I added some printout to 
src/mpid/ch3/channels/mrail/src/rdma/ch3_smp_progress.c around line 2858:

#if defined(_ENABLE_CUDA_)
         if (rdma_enable_cuda) {
             printf("Test\n");
             iov_isdev = is_device_buffer((void *) iov[i].MPID_IOV_BUF);
             printf("Test %i %p\n",iov_isdev,(void *) iov[i].MPID_IOV_BUF);
         }

And this is my output:

Test
Test 0 0x7fefff4b0
# OSU MPI-CUDA Bandwidth Test
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size        Bandwidth (MB/s)
Test
Test 0 0x7feffe950
Test
Test 0 0x2d00300000
==61548== Invalid read of size 1
==61548==    at 0x4A08020: memcpy (mc_replace_strmem.c:628)
==61548==    by 0x445462: MPIUI_Memcpy (mpiimpl.h:146)
==61548==    by 0x44D5DE: MPIDI_CH3I_SMP_writev (ch3_smp_progress.c:2897)
==61548==    by 0x5DAA44: MPIDI_CH3_SMP_iSendv (ch3_isendv.c:108)
==61548==    by 0x5DADF9: MPIDI_CH3_iSendv (ch3_isendv.c:187)
==61548==    by 0x5D1D7A: MPIDI_CH3_EagerContigIsend (ch3u_eager.c:632)
==61548==    by 0x42E22A: MPID_Isend (mpid_isend.c:220)
==61548==    by 0x40C33F: PMPI_Isend (isend.c:122)
==61548==    by 0x407001: main (osu_bw.c:243)
==61548==  Address 0x2d00300000 is not stack'd, malloc'd or (recently) 
free'd
==61548==
[k20-0001:mpi_rank_0][error_sighandler] Caught error: Segmentation fault 
(signal 11)
[k20-0001:mpi_rank_0][print_backtrace]   0: ./out() [0x4b6762]
[k20-0001:mpi_rank_0][print_backtrace]   1: ./out() [0x4b689e]
[k20-0001:mpi_rank_0][print_backtrace]   2: /lib64/libpthread.so.0() 
[0x38b7a0f4a0]
[k20-0001:mpi_rank_0][print_backtrace]   3: 
/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so(_vgrZU_libcZdsoZa_memcpy+0x160) 
[0x4a08020]
[k20-0001:mpi_rank_0][print_backtrace]   4: ./out() [0x445463]
[k20-0001:mpi_rank_0][print_backtrace]   5: ./out() [0x44d5df]
[k20-0001:mpi_rank_0][print_backtrace]   6: ./out() [0x5daa45]
[k20-0001:mpi_rank_0][print_backtrace]   7: ./out() [0x5dadfa]
[k20-0001:mpi_rank_0][print_backtrace]   8: ./out() [0x5d1d7b]
[k20-0001:mpi_rank_0][print_backtrace]   9: ./out() [0x42e22b]
[k20-0001:mpi_rank_0][print_backtrace]  10: ./out() [0x40c340]
[k20-0001:mpi_rank_0][print_backtrace]  11: ./out() [0x407002]
[k20-0001:mpi_rank_0][print_backtrace]  12: 
/lib64/libc.so.6(__libc_start_main+0xfd) [0x38b6e1ecdd]
[k20-0001:mpi_rank_0][print_backtrace]  13: ./out() [0x406829]

My guess is that the address 0x2d00300000 should be on the GPU, so the 
is_device_buffer test seems to fail. Maybe that is connected to the 
rather unusual configuration of our machine: we have 128GB of RAM per 
node, of which apparently 64GB are configured as a RAMDISK for /tmp.
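For reference, a device-pointer check along these lines is typically done with the CUDA driver API's cuPointerGetAttribute. The sketch below is only my guess at what is_device_buffer may be doing internally, not the actual MVAPICH2 code; the function name looks_like_device_buffer is hypothetical:

```c
/* Hedged sketch of a device-buffer check via the CUDA driver API.
 * Requires a working CUDA driver (libcuda.so) and an initialized context. */
#include <cuda.h>

static int looks_like_device_buffer(const void *buf)
{
    CUmemorytype mem_type = (CUmemorytype) 0;
    CUresult err = cuPointerGetAttribute(&mem_type,
                                         CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                                         (CUdeviceptr) buf);
    /* For a plain host pointer that CUDA does not know about, the call
     * fails (e.g. CUDA_ERROR_INVALID_VALUE); treat any failure as
     * "not a device buffer". */
    return (err == CUDA_SUCCESS && mem_type == CU_MEMORYTYPE_DEVICE);
}
```

If the check queries the driver like this, a mismatched or oddly loaded libcuda.so could plausibly make it return 0 for a genuine device pointer, which would send the buffer down the host memcpy path and segfault exactly as in the trace above.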

Cheers
Christian

On 02/06/2013 09:58 AM, Joshua Anderson wrote:
> Hi Christian,
>
> I'm not sure if this is related, but I get similar behavior on our cluster when I link mvapich to the libcuda.so the admins provide on an NFS share. They do this because the head nodes don't have GPUs and thus don't have libcuda.so. When I instead compile on the compute node and link against the libcuda.so on the local file system, the problem goes away. This is very strange because the two files are identical.
>
> - Josh
>
> On Feb 6, 2013, at 11:44 AM, Christian Trott wrote:
>
>> Hi
>>
>> I am trying to use GPU to GPU mpi communication on a new cluster of ours, and it always fails with segfaults. The funny thing is I get the same valgrind output whether I use MV2_USE_CUDA=1 or not (output comes further down). I downloaded the most recent 1.9a2 version and this is my current config line:
>>
>> ./configure --enable-cuda --with-cuda=/home/crtrott/lib/cuda-5.0/ --prefix=/home/crtrott/mpi/mvapich2-1.9/gcc/cuda50a --disable-rdmacm --disable-mcast --enable-g=dbg --disable-fast
>>
>> This is my run command:
>>
>> mpirun -np 2 env MV2_USE_CUDA=1 MV2_DEBUG_SHOW_BACKTRACE=1 valgrind ./osu_bw D D
>>
>> And this is the relevant valgrind output:
>>
>> ==58800== Warning: set address range perms: large range [0x3d00000000, 0x5e00000000) (noaccess)
>> ==58801== Warning: set address range perms: large range [0x3d00000000, 0x5e00000000) (noaccess)
>> ==58800== Warning: set address range perms: large range [0x2d00000000, 0x3100000000) (noaccess)
>> ==58801== Warning: set address range perms: large range [0x2d00000000, 0x3100000000) (noaccess)
>> # OSU MPI-CUDA Bandwidth Test
>> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
>> # Size        Bandwidth (MB/s)
>> ==58800== Invalid read of size 1
>> ==58800==    at 0x4A08020: memcpy (mc_replace_strmem.c:628)
>> ==58800==    by 0x4452D6: MPIUI_Memcpy (mpiimpl.h:146)
>> ==58800==    by 0x44D41E: MPIDI_CH3I_SMP_writev (ch3_smp_progress.c:2895)
>> ==58800==    by 0x5DA884: MPIDI_CH3_SMP_iSendv (ch3_isendv.c:108)
>> ==58800==    by 0x5DAC39: MPIDI_CH3_iSendv (ch3_isendv.c:187)
>> ==58800==    by 0x5D1BBA: MPIDI_CH3_EagerContigIsend (ch3u_eager.c:632)
>> ==58800==    by 0x42E09E: MPID_Isend (mpid_isend.c:220)
>> ==58800==    by 0x40C1B3: PMPI_Isend (isend.c:122)
>> ==58800==    by 0x406E85: main (osu_bw.c:242)
>> ==58800==  Address 0x2d00200000 is not stack'd, malloc'd or (recently) free'd
>> ==58800==
>> [k20-0001:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
>> [k20-0001:mpi_rank_0][print_backtrace]   0: ./osu_bw() [0x4b65a2]
>> [k20-0001:mpi_rank_0][print_backtrace]   1: ./osu_bw() [0x4b66de]
>> [k20-0001:mpi_rank_0][print_backtrace]   2: /lib64/libpthread.so.0() [0x38b7a0f4a0]
>> [k20-0001:mpi_rank_0][print_backtrace]   3: /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so(_vgrZU_libcZdsoZa_memcpy+0x160) [0x4a08020]
>> [k20-0001:mpi_rank_0][print_backtrace]   4: ./osu_bw() [0x4452d7]
>> [k20-0001:mpi_rank_0][print_backtrace]   5: ./osu_bw() [0x44d41f]
>> [k20-0001:mpi_rank_0][print_backtrace]   6: ./osu_bw() [0x5da885]
>> [k20-0001:mpi_rank_0][print_backtrace]   7: ./osu_bw() [0x5dac3a]
>> [k20-0001:mpi_rank_0][print_backtrace]   8: ./osu_bw() [0x5d1bbb]
>> [k20-0001:mpi_rank_0][print_backtrace]   9: ./osu_bw() [0x42e09f]
>> [k20-0001:mpi_rank_0][print_backtrace]  10: ./osu_bw() [0x40c1b4]
>> [k20-0001:mpi_rank_0][print_backtrace]  11: ./osu_bw() [0x406e86]
>> [k20-0001:mpi_rank_0][print_backtrace]  12: /lib64/libc.so.6(__libc_start_main+0xfd) [0x38b6e1ecdd]
>> [k20-0001:mpi_rank_0][print_backtrace]  13: ./osu_bw() [0x4066a9]
>>
>> Any suggestions would be greatly appreciated.
>>
>> Christian
>>
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>