[mvapich-discuss] Error registering memory with CUDA

sreeram potluri potluri at cse.ohio-state.edu
Sun Aug 11 09:53:43 EDT 2013


Adam,

We have not seen this behavior in our automated testing, and we run several
tests that share a GPU among multiple processes. In your email, did you mean
to say that this error occurs even when each process is using a different
GPU exclusively?

Which version of the CUDA driver do you have on the systems? Can you try an
upgrade if it is not the latest?

Can you give us a reproducer for the issue?
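
In the meantime, since delaying the processes appears to avoid the failure,
the delay could be applied from the application side, before MPI_Init(),
without touching the library. The following is only a rough sketch of that
workaround, not a fix for the underlying problem, which we have not yet
identified. It assumes the launcher exports a node-local rank in an
environment variable; the variable name is passed in by the caller, and the
helper names (parse_local_rank, stagger_delay_ms, stagger_startup) are made
up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>
#include <time.h>

/* Parse a node-local rank from the text of an environment variable.
 * Returns -1 if the value is missing or is not a non-negative integer. */
int parse_local_rank(const char *value)
{
    char *end = NULL;
    long rank;

    if (value == NULL || *value == '\0')
        return -1;
    errno = 0;
    rank = strtol(value, &end, 10);
    if (errno != 0 || *end != '\0' || rank < 0 || rank > 1000000)
        return -1;
    return (int)rank;
}

/* Delay in milliseconds for a given local rank: rank 0 starts at once,
 * rank 1 waits step_ms, rank 2 waits 2 * step_ms, and so on.
 * An unknown rank (negative) gets no delay. */
long stagger_delay_ms(int local_rank, long step_ms)
{
    if (local_rank < 0 || step_ms < 0)
        return 0;
    return (long)local_rank * step_ms;
}

/* Sleep for the staggered delay. Meant to be called before MPI_Init(),
 * so that processes on one node reach the registration at different times.
 * env_name is whatever variable the launcher uses for the local rank
 * (an assumption; check your launcher's documentation). */
void stagger_startup(const char *env_name, long step_ms)
{
    long ms = stagger_delay_ms(parse_local_rank(getenv(env_name)), step_ms);
    struct timespec ts;

    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    /* Resume the sleep if a signal interrupts it. */
    while (nanosleep(&ts, &ts) == -1 && errno == EINTR)
        ;
}
```

A call such as stagger_startup("MV2_COMM_WORLD_LOCAL_RANK", 200) placed at
the top of main() would be one way to use it; that variable name and the
200 ms step are examples only and would need to be adjusted for the launcher
and system in question.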

Thank you
Sreeram Potluri

On Thursday, August 8, 2013, Adam T. Moody wrote:

> Hi Sreeram,
> As far as we can tell, different procs are picking different devices, and
> the GPU is in the correct mode.  However, one clue we uncovered is that
> forcing procs to sleep for different amounts of time before registering
> helps.  The problem seems to be a race condition when two procs call
> cudaHostRegister at the same time.  If we force a delay between procs,
> there is no error.  Any idea what's going on here?
> -Adam
>
>
> sreeram potluri wrote:
>
>> Hi Adam,
>>
>> I have seen this error before when a user tries to share a GPU between two
>> processes while the GPU is set to thread-exclusive or process-exclusive
>> mode. Can you check with the user whether this is the case?
>>
>> This can also happen in other cases, such as when the devices have not been
>> initialized properly using deviceQuery. However, I suspect the former is
>> the case.
>>
>> Best
>> Sreeram Potluri
>>
>> On Fri, Jul 19, 2013 at 8:49 PM, Adam T. Moody <moody20 at llnl.gov> wrote:
>>
>>> Hello MVAPICH team,
>>> Someone is running on a system using MVAPICH2-1.9 with CUDA enabled, but
>>> most of his runs (about 90%) fail with the following error.
>>>
>>> [edge42:mpi_rank_0][ibv_cuda_register]
>>> src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_util.c:704:
>>> cudaHostRegister Failed
>>>
>>> [edge42:mpi_rank_1][ibv_cuda_register]
>>> src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_util.c:704:
>>> cudaHostRegister Failed
>>>
>>> [edge63:mpi_rank_2][ibv_cuda_register]
>>> src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_util.c:704:
>>> cudaHostRegister Failed
>>>
>>> [edge63:mpi_rank_3][ibv_cuda_register]
>>> src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_util.c:704:
>>> cudaHostRegister Failed
>>>
>>> Have you seen this before?  Do you know why it might happen?
>>> Thanks,
>>> -Adam