Re: [mvapich-discuss] MV2_USE_CUDA=1 gets ignored?

Christian Trott crtrott at sandia.gov
Wed Feb 6 12:54:48 EST 2013


The test code works. I modified it slightly to be able to run 2 processes 
on two different GPUs, and added the same output to it that I added to 
mvapich. This is what I get:

memory type detected correctly
Test: 0 0x2700720000 0 0 2 2
memory type detected correctly
Test: 1 0x2700720000 0 0 2 2
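
For reference, the modified test looks roughly like this (a minimal 
sketch, not the exact program Devendar attached; the link line and the 
exact printout fields are my guesses, compiled with something like 
"mpicc test.c -o test -lcuda -lcudart"):

#include <stdio.h>
#include <mpi.h>
#include <cuda.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank;
    void *buf = NULL;
    CUmemorytype mem_type = 0;
    CUresult err;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one GPU per rank */
    cudaSetDevice(rank);
    /* device allocation; also initializes the CUDA context */
    cudaMalloc(&buf, 4 * 1024 * 1024);

    /* the query that (presumably) mvapich's is_device_buffer uses */
    err = cuPointerGetAttribute(&mem_type,
                                CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                                (CUdeviceptr) buf);

    if (err == CUDA_SUCCESS && mem_type == CU_MEMORYTYPE_DEVICE)
        printf("memory type detected correctly\n");

    printf("Test: %d %p %d %d\n", rank, buf, (int) err, (int) mem_type);

    cudaFree(buf);
    MPI_Finalize();
    return 0;
}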


And this is what I got for the same line with osu_bw:
IsDevicePointer2: 0x2700720000 1 0 0 2

The difference is that in the mvapich code, cuPointerGetAttribute returns 
an error for what is actually the same address!
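
My interpretation of the failure (simplified sketch below, not the 
actual mvapich source) is that an error return from cuPointerGetAttribute 
makes is_device_buffer classify the buffer as host memory, so the device 
address falls through to the plain memcpy path in MPIDI_CH3I_SMP_writev 
and segfaults there:

#include <cuda.h>

/* Simplified sketch of the suspected logic; the real is_device_buffer
 * may differ. */
static int is_device_buffer_sketch(const void *buf)
{
    CUmemorytype mem_type = 0;
    CUresult err = cuPointerGetAttribute(&mem_type,
            CU_POINTER_ATTRIBUTE_MEMORY_TYPE, (CUdeviceptr) buf);
    if (err != CUDA_SUCCESS)
        return 0;  /* error => buffer treated as host memory */
    return mem_type == CU_MEMORYTYPE_DEVICE;
}

With iov_isdev == 0, the writev path does a plain memcpy on the device 
address, which matches the invalid read valgrind reports.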

Christian


On 02/06/2013 10:26 AM, Devendar Bureddy wrote:
> Hi Christian
>
> Can you please try the attached small test program to see if this
> (detecting GPU memory correctly) is the reason for this issue?
>
> $ mpicc -o test ./test.c
>
> $ ./test
> memory type detected correctly
>
> -Devendar
>
> On Wed, Feb 6, 2013 at 12:11 PM, Christian Trott<crtrott at sandia.gov>  wrote:
>> Hi
>>
>> You mean you compiled mvapich on the compute node, linking against local
>> files?
>> I am already compiling on the compute nodes, but the filesystem is NFS, if
>> I am not mistaken.
>> Here is one more piece of info:
>>
>> I added some printout to src/mpid/ch3/channels/mrail/src/rdma/ch3_smp_progress.c
>> at line 2858:
>>
>> #if defined(_ENABLE_CUDA_)
>>          if (rdma_enable_cuda) {
>>              printf("Test\n");
>>              /* nonzero when the IOV entry points at device memory */
>>              iov_isdev = is_device_buffer((void *) iov[i].MPID_IOV_BUF);
>>              printf("Test %i %p\n", iov_isdev, (void *) iov[i].MPID_IOV_BUF);
>>          }
>>
>> And this is my output:
>>
>> Test
>> Test 0 0x7fefff4b0
>>
>> # OSU MPI-CUDA Bandwidth Test
>> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
>> # Size        Bandwidth (MB/s)
>> Test
>> Test 0 0x7feffe950
>> Test
>> Test 0 0x2d00300000
>> ==61548== Invalid read of size 1
>> ==61548==    at 0x4A08020: memcpy (mc_replace_strmem.c:628)
>> ==61548==    by 0x445462: MPIUI_Memcpy (mpiimpl.h:146)
>> ==61548==    by 0x44D5DE: MPIDI_CH3I_SMP_writev (ch3_smp_progress.c:2897)
>> ==61548==    by 0x5DAA44: MPIDI_CH3_SMP_iSendv (ch3_isendv.c:108)
>> ==61548==    by 0x5DADF9: MPIDI_CH3_iSendv (ch3_isendv.c:187)
>> ==61548==    by 0x5D1D7A: MPIDI_CH3_EagerContigIsend (ch3u_eager.c:632)
>> ==61548==    by 0x42E22A: MPID_Isend (mpid_isend.c:220)
>> ==61548==    by 0x40C33F: PMPI_Isend (isend.c:122)
>> ==61548==    by 0x407001: main (osu_bw.c:243)
>> ==61548==  Address 0x2d00300000 is not stack'd, malloc'd or (recently)
>> free'd
>> ==61548==
>>
>> [k20-0001:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [k20-0001:mpi_rank_0][print_backtrace]   0: ./out() [0x4b6762]
>> [k20-0001:mpi_rank_0][print_backtrace]   1: ./out() [0x4b689e]
>>
>> [k20-0001:mpi_rank_0][print_backtrace]   2: /lib64/libpthread.so.0()
>> [0x38b7a0f4a0]
>> [k20-0001:mpi_rank_0][print_backtrace]   3:
>> /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so(_vgrZU_libcZdsoZa_memcpy+0x160)
>> [0x4a08020]
>> [k20-0001:mpi_rank_0][print_backtrace]   4: ./out() [0x445463]
>> [k20-0001:mpi_rank_0][print_backtrace]   5: ./out() [0x44d5df]
>> [k20-0001:mpi_rank_0][print_backtrace]   6: ./out() [0x5daa45]
>> [k20-0001:mpi_rank_0][print_backtrace]   7: ./out() [0x5dadfa]
>> [k20-0001:mpi_rank_0][print_backtrace]   8: ./out() [0x5d1d7b]
>> [k20-0001:mpi_rank_0][print_backtrace]   9: ./out() [0x42e22b]
>> [k20-0001:mpi_rank_0][print_backtrace]  10: ./out() [0x40c340]
>> [k20-0001:mpi_rank_0][print_backtrace]  11: ./out() [0x407002]
>>
>> [k20-0001:mpi_rank_0][print_backtrace]  12:
>> /lib64/libc.so.6(__libc_start_main+0xfd) [0x38b6e1ecdd]
>> [k20-0001:mpi_rank_0][print_backtrace]  13: ./out() [0x406829]
>>
>> My guess is that the address 0x2d00300000 should be on the GPU, so the
>> is_device_buffer test seems to fail. Maybe that is connected to the rather
>> interesting configuration of our machine: we have 128 GB of RAM per node, of
>> which apparently 64 GB are configured to be used as a RAMDISK for /tmp.
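>>
>> As a cross-check, one could classify the same address through the
>> runtime API as well (sketch, untested; cudaPointerGetAttributes is the
>> runtime-API counterpart of the driver call):
>>
>> #include <stdio.h>
>> #include <cuda_runtime.h>
>>
>> /* Print how the runtime API classifies a pointer. */
>> static void check_ptr(const void *ptr)
>> {
>>     struct cudaPointerAttributes attr;
>>     cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
>>     if (err == cudaSuccess)
>>         printf("%p: memoryType=%d device=%d\n",
>>                ptr, (int) attr.memoryType, attr.device);
>>     else
>>         printf("%p: cudaPointerGetAttributes failed (err=%d)\n",
>>                ptr, (int) err);
>> }
>>
>> If the runtime call sees the address as device memory while the driver
>> call errors out, that would point at a context or libcuda mismatch
>> rather than at the address itself.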
>>
>> Cheers
>> Christian
>>
>>
>> On 02/06/2013 09:58 AM, Joshua Anderson wrote:
>>> Hi Christian,
>>>
>>> I'm not sure if this is related, but I get similar behavior on our cluster
>>> when I link mvapich against the libcuda.so the admins provide on an NFS
>>> share. They do this because the head nodes don't have GPUs and thus don't
>>> have libcuda.so. When I instead compile on the compute node and link against
>>> the libcuda.so on the local file system, the problem goes away. This is very
>>> strange because the two files are identical.
>>>
>>> - Josh
>>>
>>> On Feb 6, 2013, at 11:44 AM, Christian Trott wrote:
>>>
>>>> Hi
>>>>
>>>> I am trying to use GPU-to-GPU MPI communication on a new cluster of ours,
>>>> and it always fails with segfaults. The funny thing is that I get the same
>>>> valgrind output whether I use MV2_USE_CUDA=1 or not (output comes further
>>>> down). I downloaded the most recent 1.9a2 version, and this is my current
>>>> configure line:
>>>>
>>>> ./configure --enable-cuda --with-cuda=/home/crtrott/lib/cuda-5.0/
>>>> --prefix=/home/crtrott/mpi/mvapich2-1.9/gcc/cuda50a --disable-rdmacm
>>>> --disable-mcast --enable-g=dbg --disable-fast
>>>>
>>>> This is my run command:
>>>>
>>>> mpirun -np 2 env MV2_USE_CUDA=1 MV2_DEBUG_SHOW_BACKTRACE=1 valgrind
>>>> ./osu_bw D D
>>>>
>>>> And this is the relevant valgrind output:
>>>>
>>>> ==58800== Warning: set address range perms: large range [0x3d00000000,
>>>> 0x5e00000000) (noaccess)
>>>> ==58801== Warning: set address range perms: large range [0x3d00000000,
>>>> 0x5e00000000) (noaccess)
>>>> ==58800== Warning: set address range perms: large range [0x2d00000000,
>>>> 0x3100000000) (noaccess)
>>>> ==58801== Warning: set address range perms: large range [0x2d00000000,
>>>> 0x3100000000) (noaccess)
>>>> # OSU MPI-CUDA Bandwidth Test
>>>> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
>>>> # Size        Bandwidth (MB/s)
>>>> ==58800== Invalid read of size 1
>>>> ==58800==    at 0x4A08020: memcpy (mc_replace_strmem.c:628)
>>>> ==58800==    by 0x4452D6: MPIUI_Memcpy (mpiimpl.h:146)
>>>> ==58800==    by 0x44D41E: MPIDI_CH3I_SMP_writev (ch3_smp_progress.c:2895)
>>>> ==58800==    by 0x5DA884: MPIDI_CH3_SMP_iSendv (ch3_isendv.c:108)
>>>> ==58800==    by 0x5DAC39: MPIDI_CH3_iSendv (ch3_isendv.c:187)
>>>> ==58800==    by 0x5D1BBA: MPIDI_CH3_EagerContigIsend (ch3u_eager.c:632)
>>>> ==58800==    by 0x42E09E: MPID_Isend (mpid_isend.c:220)
>>>> ==58800==    by 0x40C1B3: PMPI_Isend (isend.c:122)
>>>> ==58800==    by 0x406E85: main (osu_bw.c:242)
>>>> ==58800==  Address 0x2d00200000 is not stack'd, malloc'd or (recently)
>>>> free'd
>>>> ==58800==
>>>> [k20-0001:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
>>>> (signal 11)
>>>> [k20-0001:mpi_rank_0][print_backtrace]   0: ./osu_bw() [0x4b65a2]
>>>> [k20-0001:mpi_rank_0][print_backtrace]   1: ./osu_bw() [0x4b66de]
>>>> [k20-0001:mpi_rank_0][print_backtrace]   2: /lib64/libpthread.so.0()
>>>> [0x38b7a0f4a0]
>>>> [k20-0001:mpi_rank_0][print_backtrace]   3:
>>>> /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so(_vgrZU_libcZdsoZa_memcpy+0x160)
>>>> [0x4a08020]
>>>> [k20-0001:mpi_rank_0][print_backtrace]   4: ./osu_bw() [0x4452d7]
>>>> [k20-0001:mpi_rank_0][print_backtrace]   5: ./osu_bw() [0x44d41f]
>>>> [k20-0001:mpi_rank_0][print_backtrace]   6: ./osu_bw() [0x5da885]
>>>> [k20-0001:mpi_rank_0][print_backtrace]   7: ./osu_bw() [0x5dac3a]
>>>> [k20-0001:mpi_rank_0][print_backtrace]   8: ./osu_bw() [0x5d1bbb]
>>>> [k20-0001:mpi_rank_0][print_backtrace]   9: ./osu_bw() [0x42e09f]
>>>> [k20-0001:mpi_rank_0][print_backtrace]  10: ./osu_bw() [0x40c1b4]
>>>> [k20-0001:mpi_rank_0][print_backtrace]  11: ./osu_bw() [0x406e86]
>>>> [k20-0001:mpi_rank_0][print_backtrace]  12:
>>>> /lib64/libc.so.6(__libc_start_main+0xfd) [0x38b6e1ecdd]
>>>> [k20-0001:mpi_rank_0][print_backtrace]  13: ./osu_bw() [0x4066a9]
>>>>
>>>> Any suggestions would be greatly appreciated.
>>>>
>>>> Christian
>>>>
>>>>
>>>
>>
>
>



