Re: [mvapich-discuss] MV2_USE_CUDA=1 gets ignored?

Devendar Bureddy bureddy at cse.ohio-state.edu
Wed Feb 6 12:26:27 EST 2013


Hi Christian

Can you please try the attached small test program to see whether this
(GPU memory not being detected correctly) is the cause of this issue?

$ mpicc -o test ./test.c

$ ./test
memory type detected correctly
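
The attachment was scrubbed by the archive (see the link at the end of
this message), so as a sketch only: a minimal test of this kind, assuming
it queries the CUDA driver the same way MVAPICH2's is_device_buffer()
does via cuPointerGetAttribute(), might look like the following. The
build flags below are assumptions, not the original test.c:

/* Hypothetical sketch of the scrubbed test.c: allocate a device
 * buffer, then ask the CUDA driver for its memory type -- essentially
 * the check MVAPICH2's is_device_buffer() performs.
 * Possible build line (paths are assumptions):
 *   mpicc -o test test.c -I$CUDA_HOME/include -L$CUDA_HOME/lib64 \
 *         -lcudart -lcuda
 */
#include <stdio.h>
#include <stdint.h>
#include <cuda.h>            /* driver API: cuPointerGetAttribute */
#include <cuda_runtime.h>    /* runtime API: cudaMalloc */

int main(void)
{
    void *buf = NULL;
    CUmemorytype memtype = (CUmemorytype)0;
    CUresult err;

    /* cudaMalloc also creates the CUDA context the driver API needs */
    if (cudaMalloc(&buf, 1 << 20) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    /* ask the driver what kind of pointer this is */
    err = cuPointerGetAttribute(&memtype,
                                CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                                (CUdeviceptr)(uintptr_t)buf);

    if (err == CUDA_SUCCESS && memtype == CU_MEMORYTYPE_DEVICE)
        printf("memory type detected correctly\n");
    else
        printf("memory type NOT detected: err=%d memtype=%d\n",
               (int)err, (int)memtype);

    cudaFree(buf);
    return 0;
}

If this prints the failure line on your nodes, the is_device_buffer()
path in MVAPICH2 would be getting the same wrong answer.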

-Devendar

On Wed, Feb 6, 2013 at 12:11 PM, Christian Trott <crtrott at sandia.gov> wrote:
> Hi
>
> Do you mean you compiled mvapich on the compute node, linking against local
> files?
> I am already compiling on the compute nodes, but the filesystem is NFS, if I
> am not mistaken.
> Here is one more piece of info:
>
> I added some printouts to src/mpid/ch3/channels/mrail/src/rdma/ch3_smp_progress.c
> around line 2858:
>
> #if defined(_ENABLE_CUDA_)
>         if (rdma_enable_cuda) {
>             printf("Test\n");
>             iov_isdev = is_device_buffer((void *) iov[i].MPID_IOV_BUF);
>             printf("Test %i %p\n",iov_isdev,(void *) iov[i].MPID_IOV_BUF);
>         }
>
> And this is my output:
>
> Test
> Test 0 0x7fefff4b0
>
> # OSU MPI-CUDA Bandwidth Test
> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
> # Size        Bandwidth (MB/s)
> Test
> Test 0 0x7feffe950
> Test
> Test 0 0x2d00300000
> ==61548== Invalid read of size 1
> ==61548==    at 0x4A08020: memcpy (mc_replace_strmem.c:628)
> ==61548==    by 0x445462: MPIUI_Memcpy (mpiimpl.h:146)
> ==61548==    by 0x44D5DE: MPIDI_CH3I_SMP_writev (ch3_smp_progress.c:2897)
> ==61548==    by 0x5DAA44: MPIDI_CH3_SMP_iSendv (ch3_isendv.c:108)
> ==61548==    by 0x5DADF9: MPIDI_CH3_iSendv (ch3_isendv.c:187)
> ==61548==    by 0x5D1D7A: MPIDI_CH3_EagerContigIsend (ch3u_eager.c:632)
> ==61548==    by 0x42E22A: MPID_Isend (mpid_isend.c:220)
> ==61548==    by 0x40C33F: PMPI_Isend (isend.c:122)
> ==61548==    by 0x407001: main (osu_bw.c:243)
> ==61548==  Address 0x2d00300000 is not stack'd, malloc'd or (recently)
> free'd
> ==61548==
>
> [k20-0001:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [k20-0001:mpi_rank_0][print_backtrace]   0: ./out() [0x4b6762]
> [k20-0001:mpi_rank_0][print_backtrace]   1: ./out() [0x4b689e]
>
> [k20-0001:mpi_rank_0][print_backtrace]   2: /lib64/libpthread.so.0()
> [0x38b7a0f4a0]
> [k20-0001:mpi_rank_0][print_backtrace]   3:
> /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so(_vgrZU_libcZdsoZa_memcpy+0x160)
> [0x4a08020]
> [k20-0001:mpi_rank_0][print_backtrace]   4: ./out() [0x445463]
> [k20-0001:mpi_rank_0][print_backtrace]   5: ./out() [0x44d5df]
> [k20-0001:mpi_rank_0][print_backtrace]   6: ./out() [0x5daa45]
> [k20-0001:mpi_rank_0][print_backtrace]   7: ./out() [0x5dadfa]
> [k20-0001:mpi_rank_0][print_backtrace]   8: ./out() [0x5d1d7b]
> [k20-0001:mpi_rank_0][print_backtrace]   9: ./out() [0x42e22b]
> [k20-0001:mpi_rank_0][print_backtrace]  10: ./out() [0x40c340]
> [k20-0001:mpi_rank_0][print_backtrace]  11: ./out() [0x407002]
>
> [k20-0001:mpi_rank_0][print_backtrace]  12:
> /lib64/libc.so.6(__libc_start_main+0xfd) [0x38b6e1ecdd]
> [k20-0001:mpi_rank_0][print_backtrace]  13: ./out() [0x406829]
>
> My guess is that the address 0x2d00300000 should be on the GPU, so the
> is_device_buffer test seems to fail. Maybe that is connected to the rather
> interesting configuration of our machine: we have 128GB of RAM per node, of
> which apparently 64GB are configured to be used as a RAMDISK for /tmp.
>
> Cheers
> Christian
>
>
> On 02/06/2013 09:58 AM, Joshua Anderson wrote:
>>
>> Hi Christian,
>>
>> I'm not sure if this is related but I get similar behavior on our cluster
>> when I link mvapich to the libcuda.so the admins provide on an NFS share.
>> They do this because the head nodes don't have GPUs and thus don't have
>> libcuda.so. When I instead compile on the compute node and link against the
>> libcuda.so on the local file system, the problem goes away. This is very
>> strange because the two files are identical.
>>
>> - Josh
>>
>> On Feb 6, 2013, at 11:44 AM, Christian Trott wrote:
>>
>>> Hi
>>>
>>> I am trying to use GPU-to-GPU MPI communication on a new cluster of ours,
>>> and it always fails with segfaults. The funny thing is that I get the same
>>> valgrind output whether I use MV2_USE_CUDA=1 or not (output further down).
>>> I downloaded the most recent 1.9a2 version, and this is my current
>>> configure line:
>>>
>>> ./configure --enable-cuda --with-cuda=/home/crtrott/lib/cuda-5.0/
>>> --prefix=/home/crtrott/mpi/mvapich2-1.9/gcc/cuda50a --disable-rdmacm
>>> --disable-mcast --enable-g=dbg --disable-fast
>>>
>>> This is my run command:
>>>
>>> mpirun -np 2 env MV2_USE_CUDA=1 MV2_DEBUG_SHOW_BACKTRACE=1 valgrind
>>> ./osu_bw D D
>>>
>>> And this is the relevant valgrind output:
>>>
>>> ==58800== Warning: set address range perms: large range [0x3d00000000,
>>> 0x5e00000000) (noaccess)
>>> ==58801== Warning: set address range perms: large range [0x3d00000000,
>>> 0x5e00000000) (noaccess)
>>> ==58800== Warning: set address range perms: large range [0x2d00000000,
>>> 0x3100000000) (noaccess)
>>> ==58801== Warning: set address range perms: large range [0x2d00000000,
>>> 0x3100000000) (noaccess)
>>> # OSU MPI-CUDA Bandwidth Test
>>> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
>>> # Size        Bandwidth (MB/s)
>>> ==58800== Invalid read of size 1
>>> ==58800==    at 0x4A08020: memcpy (mc_replace_strmem.c:628)
>>> ==58800==    by 0x4452D6: MPIUI_Memcpy (mpiimpl.h:146)
>>> ==58800==    by 0x44D41E: MPIDI_CH3I_SMP_writev (ch3_smp_progress.c:2895)
>>> ==58800==    by 0x5DA884: MPIDI_CH3_SMP_iSendv (ch3_isendv.c:108)
>>> ==58800==    by 0x5DAC39: MPIDI_CH3_iSendv (ch3_isendv.c:187)
>>> ==58800==    by 0x5D1BBA: MPIDI_CH3_EagerContigIsend (ch3u_eager.c:632)
>>> ==58800==    by 0x42E09E: MPID_Isend (mpid_isend.c:220)
>>> ==58800==    by 0x40C1B3: PMPI_Isend (isend.c:122)
>>> ==58800==    by 0x406E85: main (osu_bw.c:242)
>>> ==58800==  Address 0x2d00200000 is not stack'd, malloc'd or (recently)
>>> free'd
>>> ==58800==
>>> [k20-0001:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
>>> (signal 11)
>>> [k20-0001:mpi_rank_0][print_backtrace]   0: ./osu_bw() [0x4b65a2]
>>> [k20-0001:mpi_rank_0][print_backtrace]   1: ./osu_bw() [0x4b66de]
>>> [k20-0001:mpi_rank_0][print_backtrace]   2: /lib64/libpthread.so.0()
>>> [0x38b7a0f4a0]
>>> [k20-0001:mpi_rank_0][print_backtrace]   3:
>>> /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so(_vgrZU_libcZdsoZa_memcpy+0x160)
>>> [0x4a08020]
>>> [k20-0001:mpi_rank_0][print_backtrace]   4: ./osu_bw() [0x4452d7]
>>> [k20-0001:mpi_rank_0][print_backtrace]   5: ./osu_bw() [0x44d41f]
>>> [k20-0001:mpi_rank_0][print_backtrace]   6: ./osu_bw() [0x5da885]
>>> [k20-0001:mpi_rank_0][print_backtrace]   7: ./osu_bw() [0x5dac3a]
>>> [k20-0001:mpi_rank_0][print_backtrace]   8: ./osu_bw() [0x5d1bbb]
>>> [k20-0001:mpi_rank_0][print_backtrace]   9: ./osu_bw() [0x42e09f]
>>> [k20-0001:mpi_rank_0][print_backtrace]  10: ./osu_bw() [0x40c1b4]
>>> [k20-0001:mpi_rank_0][print_backtrace]  11: ./osu_bw() [0x406e86]
>>> [k20-0001:mpi_rank_0][print_backtrace]  12:
>>> /lib64/libc.so.6(__libc_start_main+0xfd) [0x38b6e1ecdd]
>>> [k20-0001:mpi_rank_0][print_backtrace]  13: ./osu_bw() [0x4066a9]
>>>
>>> Any suggestions would be greatly appreciated.
>>>
>>> Christian



-- 
Devendar
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.c
Type: text/x-csrc
Size: 1234 bytes
Desc: not available
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20130206/fd50f6bd/test.bin

