[EXTERNAL] Re: [mvapich-discuss] MV2_USE_CUDA=1 gets ignored?

Devendar Bureddy bureddy at cse.ohio-state.edu
Wed Feb 6 13:17:56 EST 2013


By default osu_bw will use only one GPU on the system.  Can you try
with the get_local_rank script shipped with osu_benchmarks, so that the
two processes use two different GPUs, and see if that makes any
difference?

mpirun -np 2 env MV2_USE_CUDA=1 MV2_DEBUG_SHOW_BACKTRACE=1
./get_local_rank ./osu_bw D D
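
For reference, the wrapper simply exports the node-local rank
(MV2_COMM_WORLD_LOCAL_RANK) as LOCAL_RANK before exec'ing the benchmark,
and the benchmark picks its GPU from that. The selection logic boils down
to something like the following (an illustrative sketch, not the exact
benchmark code):

/* Sketch: map the node-local rank to a CUDA device, the way osu_bw
 * does when run under the get_local_rank wrapper.  Assumes the wrapper
 * has exported LOCAL_RANK from MV2_COMM_WORLD_LOCAL_RANK. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    char *str = getenv("LOCAL_RANK");
    int local_rank = str ? atoi(str) : 0;
    int ndev = 0;

    if (cudaGetDeviceCount(&ndev) != cudaSuccess || ndev == 0) {
        fprintf(stderr, "no CUDA devices visible\n");
        return 1;
    }
    /* Two ranks on one node land on two different GPUs. */
    if (cudaSetDevice(local_rank % ndev) != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice(%d) failed\n", local_rank % ndev);
        return 1;
    }
    printf("local rank %d -> GPU %d of %d\n",
           local_rank, local_rank % ndev, ndev);
    return 0;
}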

-Devendar

On Wed, Feb 6, 2013 at 12:54 PM, Christian Trott <crtrott at sandia.gov> wrote:
> The test code works. I modified it slightly so that I could run 2 processes
> on two different GPUs, and I added the same debug output to it that I added
> to mvapich. This is what I get:
>
> memory type detected correctly
> Test: 0 0x2700720000 0 0 2 2
> memory type detected correctly
> Test: 1 0x2700720000 0 0 2 2
>
>
> And this is what I got for the same line with osu_bw:
> IsDevicePointer2: 0x2700720000 1 0 0 2
>
> The difference is that in the mvapich code cuPointerGetAttribute returns an
> error for what is actually the same address!
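>
> To dig further I will log the raw CUresult right where mvapich makes the
> call; roughly like this (a quick sketch from memory, and since this is
> CUDA 5.0 I just print the numeric error code):
>
> CUresult res;
> unsigned int mem_type = 0;
>
> res = cuPointerGetAttribute(&mem_type, CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
>                             (CUdeviceptr) iov[i].MPID_IOV_BUF);
> if (res != CUDA_SUCCESS) {
>     /* CUDA_ERROR_INVALID_VALUE is 1, CUDA_ERROR_INVALID_CONTEXT is 201;
>        the latter would hint at a context mismatch between the library
>        and the application. */
>     fprintf(stderr, "cuPointerGetAttribute: CUresult=%d for %p\n",
>             (int) res, (void *) iov[i].MPID_IOV_BUF);
> }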
>
> Christian
>
>
>
> On 02/06/2013 10:26 AM, Devendar Bureddy wrote:
>>
>> Hi Christian
>>
>> Can you please try the attached small test program to see if this
>> (detecting GPU memory correctly) is the reason for this issue.
>>
>> $ mpicc -o test ./test.c
>>
>> $ ./test
>> memory type detected correctly
>>
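>> For reference, the attached program boils down to something like the
>> following (sketched here for the list since attachments do not always
>> make it through; the actual file may differ in details):
>>
>> #include <stdio.h>
>> #include <cuda.h>
>> #include <cuda_runtime.h>
>>
>> int main(void)
>> {
>>     void *dbuf = NULL;
>>     unsigned int mem_type = 0;
>>     CUresult res;
>>
>>     /* Runtime-API allocation; this also initializes the primary
>>        context that the driver-API query below needs. */
>>     if (cudaMalloc(&dbuf, 1 << 20) != cudaSuccess) {
>>         printf("cudaMalloc failed\n");
>>         return 1;
>>     }
>>     res = cuPointerGetAttribute(&mem_type,
>>                                 CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
>>                                 (CUdeviceptr) dbuf);
>>     if (res == CUDA_SUCCESS && mem_type == CU_MEMORYTYPE_DEVICE)
>>         printf("memory type detected correctly\n");
>>     else
>>         printf("detection failed: res=%d type=%u\n", (int) res, mem_type);
>>     cudaFree(dbuf);
>>     return 0;
>> }
>>
>> If mpicc does not pull in the CUDA paths on its own, add the -I/-L
>> flags for your CUDA installation plus -lcuda -lcudart.
>>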
>> -Devendar
>>
>> On Wed, Feb 6, 2013 at 12:11 PM, Christian Trott <crtrott at sandia.gov>
>> wrote:
>>>
>>> Hi
>>>
>>> You mean you compiled mvapich on the compute node, linking against local
>>> files? I am already compiling on the compute nodes, but the filesystem is
>>> an NFS share, if I am not mistaken.
>>> Here is one more piece of info:
>>>
>>> I added some debug output to the file
>>> src/mpid/ch3/channels/mrail/src/rdma/ch3_smp_progress.c
>>> around line 2858:
>>>
>>> #if defined(_ENABLE_CUDA_)
>>>         if (rdma_enable_cuda) {
>>>             printf("Test\n");
>>>             iov_isdev = is_device_buffer((void *) iov[i].MPID_IOV_BUF);
>>>             printf("Test %i %p\n", iov_isdev, (void *) iov[i].MPID_IOV_BUF);
>>>         }
>>>
>>> And this is my output:
>>>
>>> Test
>>> Test 0 0x7fefff4b0
>>>
>>> # OSU MPI-CUDA Bandwidth Test
>>> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
>>> # Size        Bandwidth (MB/s)
>>> Test
>>> Test 0 0x7feffe950
>>> Test
>>> Test 0 0x2d00300000
>>> ==61548== Invalid read of size 1
>>> ==61548==    at 0x4A08020: memcpy (mc_replace_strmem.c:628)
>>> ==61548==    by 0x445462: MPIUI_Memcpy (mpiimpl.h:146)
>>> ==61548==    by 0x44D5DE: MPIDI_CH3I_SMP_writev (ch3_smp_progress.c:2897)
>>> ==61548==    by 0x5DAA44: MPIDI_CH3_SMP_iSendv (ch3_isendv.c:108)
>>> ==61548==    by 0x5DADF9: MPIDI_CH3_iSendv (ch3_isendv.c:187)
>>> ==61548==    by 0x5D1D7A: MPIDI_CH3_EagerContigIsend (ch3u_eager.c:632)
>>> ==61548==    by 0x42E22A: MPID_Isend (mpid_isend.c:220)
>>> ==61548==    by 0x40C33F: PMPI_Isend (isend.c:122)
>>> ==61548==    by 0x407001: main (osu_bw.c:243)
>>> ==61548==  Address 0x2d00300000 is not stack'd, malloc'd or (recently)
>>> free'd
>>> ==61548==
>>>
>>> [k20-0001:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
>>> (signal 11)
>>> [k20-0001:mpi_rank_0][print_backtrace]   0: ./out() [0x4b6762]
>>> [k20-0001:mpi_rank_0][print_backtrace]   1: ./out() [0x4b689e]
>>>
>>> [k20-0001:mpi_rank_0][print_backtrace]   2: /lib64/libpthread.so.0()
>>> [0x38b7a0f4a0]
>>> [k20-0001:mpi_rank_0][print_backtrace]   3:
>>>
>>> /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so(_vgrZU_libcZdsoZa_memcpy+0x160)
>>> [0x4a08020]
>>> [k20-0001:mpi_rank_0][print_backtrace]   4: ./out() [0x445463]
>>> [k20-0001:mpi_rank_0][print_backtrace]   5: ./out() [0x44d5df]
>>> [k20-0001:mpi_rank_0][print_backtrace]   6: ./out() [0x5daa45]
>>> [k20-0001:mpi_rank_0][print_backtrace]   7: ./out() [0x5dadfa]
>>> [k20-0001:mpi_rank_0][print_backtrace]   8: ./out() [0x5d1d7b]
>>> [k20-0001:mpi_rank_0][print_backtrace]   9: ./out() [0x42e22b]
>>> [k20-0001:mpi_rank_0][print_backtrace]  10: ./out() [0x40c340]
>>> [k20-0001:mpi_rank_0][print_backtrace]  11: ./out() [0x407002]
>>>
>>> [k20-0001:mpi_rank_0][print_backtrace]  12:
>>> /lib64/libc.so.6(__libc_start_main+0xfd) [0x38b6e1ecdd]
>>> [k20-0001:mpi_rank_0][print_backtrace]  13: ./out() [0x406829]
>>>
>>> My guess is that the address 0x2d00300000 should be on the GPU, so the
>>> is_device_buffer test seems to fail. Maybe that is connected to the rather
>>> unusual setup of our machine: we have 128GB of RAM per node, of which
>>> apparently 64GB are configured to be used as a RAMDISK for /tmp.
>>>
>>> Cheers
>>> Christian
>>>
>>>
>>> On 02/06/2013 09:58 AM, Joshua Anderson wrote:
>>>>
>>>> Hi Christian,
>>>>
>>>> I'm not sure if this is related, but I get similar behavior on our
>>>> cluster when I link mvapich against the libcuda.so the admins provide on
>>>> an NFS share. They do this because the head nodes don't have GPUs and
>>>> thus don't have libcuda.so. When I instead compile on the compute node
>>>> and link against the libcuda.so on the local file system, the problem
>>>> goes away. This is very strange, because the two files are identical.
>>>>
>>>> - Josh
>>>>
>>>> On Feb 6, 2013, at 11:44 AM, Christian Trott wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I am trying to use GPU-to-GPU MPI communication on a new cluster of
>>>>> ours, and it always fails with segfaults. The funny thing is that I get
>>>>> the same valgrind output whether I use MV2_USE_CUDA=1 or not (the output
>>>>> follows further down). I downloaded the most recent 1.9a2 version, and
>>>>> this is my current configure line:
>>>>>
>>>>> ./configure --enable-cuda --with-cuda=/home/crtrott/lib/cuda-5.0/
>>>>> --prefix=/home/crtrott/mpi/mvapich2-1.9/gcc/cuda50a --disable-rdmacm
>>>>> --disable-mcast --enable-g=dbg --disable-fast
>>>>>
>>>>> This is my run command:
>>>>>
>>>>> mpirun -np 2 env MV2_USE_CUDA=1 MV2_DEBUG_SHOW_BACKTRACE=1 valgrind
>>>>> ./osu_bw D D
>>>>>
>>>>> And this is the relevant valgrind output:
>>>>>
>>>>> ==58800== Warning: set address range perms: large range [0x3d00000000,
>>>>> 0x5e00000000) (noaccess)
>>>>> ==58801== Warning: set address range perms: large range [0x3d00000000,
>>>>> 0x5e00000000) (noaccess)
>>>>> ==58800== Warning: set address range perms: large range [0x2d00000000,
>>>>> 0x3100000000) (noaccess)
>>>>> ==58801== Warning: set address range perms: large range [0x2d00000000,
>>>>> 0x3100000000) (noaccess)
>>>>> # OSU MPI-CUDA Bandwidth Test
>>>>> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
>>>>> # Size        Bandwidth (MB/s)
>>>>> ==58800== Invalid read of size 1
>>>>> ==58800==    at 0x4A08020: memcpy (mc_replace_strmem.c:628)
>>>>> ==58800==    by 0x4452D6: MPIUI_Memcpy (mpiimpl.h:146)
>>>>> ==58800==    by 0x44D41E: MPIDI_CH3I_SMP_writev
>>>>> (ch3_smp_progress.c:2895)
>>>>> ==58800==    by 0x5DA884: MPIDI_CH3_SMP_iSendv (ch3_isendv.c:108)
>>>>> ==58800==    by 0x5DAC39: MPIDI_CH3_iSendv (ch3_isendv.c:187)
>>>>> ==58800==    by 0x5D1BBA: MPIDI_CH3_EagerContigIsend (ch3u_eager.c:632)
>>>>> ==58800==    by 0x42E09E: MPID_Isend (mpid_isend.c:220)
>>>>> ==58800==    by 0x40C1B3: PMPI_Isend (isend.c:122)
>>>>> ==58800==    by 0x406E85: main (osu_bw.c:242)
>>>>> ==58800==  Address 0x2d00200000 is not stack'd, malloc'd or (recently)
>>>>> free'd
>>>>> ==58800==
>>>>> [k20-0001:mpi_rank_0][error_sighandler] Caught error: Segmentation
>>>>> fault
>>>>> (signal 11)
>>>>> [k20-0001:mpi_rank_0][print_backtrace]   0: ./osu_bw() [0x4b65a2]
>>>>> [k20-0001:mpi_rank_0][print_backtrace]   1: ./osu_bw() [0x4b66de]
>>>>> [k20-0001:mpi_rank_0][print_backtrace]   2: /lib64/libpthread.so.0()
>>>>> [0x38b7a0f4a0]
>>>>> [k20-0001:mpi_rank_0][print_backtrace]   3:
>>>>>
>>>>> /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so(_vgrZU_libcZdsoZa_memcpy+0x160)
>>>>> [0x4a08020]
>>>>> [k20-0001:mpi_rank_0][print_backtrace]   4: ./osu_bw() [0x4452d7]
>>>>> [k20-0001:mpi_rank_0][print_backtrace]   5: ./osu_bw() [0x44d41f]
>>>>> [k20-0001:mpi_rank_0][print_backtrace]   6: ./osu_bw() [0x5da885]
>>>>> [k20-0001:mpi_rank_0][print_backtrace]   7: ./osu_bw() [0x5dac3a]
>>>>> [k20-0001:mpi_rank_0][print_backtrace]   8: ./osu_bw() [0x5d1bbb]
>>>>> [k20-0001:mpi_rank_0][print_backtrace]   9: ./osu_bw() [0x42e09f]
>>>>> [k20-0001:mpi_rank_0][print_backtrace]  10: ./osu_bw() [0x40c1b4]
>>>>> [k20-0001:mpi_rank_0][print_backtrace]  11: ./osu_bw() [0x406e86]
>>>>> [k20-0001:mpi_rank_0][print_backtrace]  12:
>>>>> /lib64/libc.so.6(__libc_start_main+0xfd) [0x38b6e1ecdd]
>>>>> [k20-0001:mpi_rank_0][print_backtrace]  13: ./osu_bw() [0x4066a9]
>>>>>
>>>>> Any suggestions would be greatly appreciated.
>>>>>
>>>>> Christian
>>>>>
>>>>>



-- 
Devendar

