[mvapich-discuss] problems running osu benchmarks with RDMA CM

Hari Subramoni subramoni.1 at osu.edu
Thu Mar 26 10:19:55 EDT 2015


Hello Jesus,

This is strange. We've always been able to test RDMA_CM successfully in our
internal testing.

Can you tell me if you have multiple HCAs per node or just one HCA?
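
If it helps, either of the standard OFED tools below should list the HCAs
on a node (this assumes the OFED userspace utilities are installed):

    # Run on each compute node; either command lists the installed HCAs.
    $ ibv_devices
    $ ibstat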

Thx,
Hari.

On Wed, Mar 25, 2015 at 1:38 PM, Jesus Camacho Villanueva <
jesus.camacho at fabriscale.com> wrote:

> Hello Hari,
>
> I hit this issue most of the time; it only works on rare occasions.
> I tried increasing the number of attempts without success.
> I doubt the system is overloaded, because I am the only one using this
> small cluster of four switches and 8 HCAs for these tests.
>
> Do you have any other suggestions for me?
>
> Thanks for your quick response!
> Jesus
>
>
> On 25 March 2015 at 18:06, Hari Subramoni <subramoni.1 at osu.edu> wrote:
>
>> Hello Jesus,
>>
>> Are you facing this issue every time, or only intermittently (with some
>> runs passing and some failing with this error)?
>>
>> If you're facing this issue every time, please make sure that you've set
>> things up as described in the following section of the MVAPICH2 userguide:
>>
>> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1rc2-userguide.html#x1-360005.2.6
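>>
>> In particular (a minimal sketch; the interface name and address below are
>> placeholders, and your install may differ): RDMA CM needs an IP address
>> configured on the IPoIB interface of every node, and the local IPoIB
>> address of each node listed in its /etc/mv2.conf file:
>>
>>     # On every node (ib0 and the address are examples only):
>>     $ ifconfig ib0                        # verify IPoIB is up with an IP
>>     $ echo 192.168.2.11 > /etc/mv2.conf   # this node's IPoIB address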
>>
>> If you're facing this issue only intermittently, then it's most likely a
>> system issue. Typically, this indicates that the system is overloaded and
>> hence unable to resolve the address properly.
>>
>> One thing you can try in this case is increasing the number of retries
>> using the environment variable "MV2_MAX_RDMA_CONNECT_ATTEMPTS".
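>>
>> For example (the "after 20 attempts" in your output suggests 20 is the
>> default; 50 here is just an arbitrary larger value):
>>
>>     $ mpirun_rsh -hostfile host -np 2 MV2_USE_RDMA_CM=1 \
>>           MV2_MAX_RDMA_CONNECT_ATTEMPTS=50 ./osu_acc_latency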
>>
>> Please let us know if either of these suggestions helps in your case.
>>
>> Thx,
>> Hari.
>>
>> On Wed, Mar 25, 2015 at 9:32 AM, Jesus Camacho Villanueva <
>> jesus.camacho at fabriscale.com> wrote:
>>
>>> Hello,
>>>
>>> I can run the OSU benchmarks without any problem, but when I run them
>>> with the RDMA connection manager they crash.
>>> I have previously run performance tests over InfiniBand using the RDMA
>>> connection manager without problems.
>>> Now, when using the MV2_USE_RDMA_CM option, I get the following output:
>>>
>>> # mpirun_rsh -hostfile host -np 2 MV2_USE_RDMA_CM=1 ./osu_acc_latency
>>> [compute-0-1.local:mpi_rank_1][ib_cma_event_handler]
>>> src/mpid/ch3/channels/common/src/rdma_cm/rdma_cm.c:210: rdma_connect error
>>> -1 after 20 attempts
>>> : Invalid argument (22)
>>> [compute-0-1.local:mpispawn_1][readline] Unexpected End-Of-File on file
>>> descriptor 5. MPI process died?
>>> [compute-0-1.local:mpispawn_1][mtpmi_processops] Error while reading PMI
>>> socket. MPI process died?
>>> [compute-0-1.local:mpispawn_1][child_handler] MPI process (rank: 1, pid:
>>> 20837) exited with status 253
>>> [root at sunshine osu_benchmarks]#
>>> [compute-0-0.local:mpispawn_0][read_size] Unexpected End-Of-File on file
>>> descriptor 7. MPI process died?
>>> [compute-0-0.local:mpispawn_0][read_size] Unexpected End-Of-File on file
>>> descriptor 7. MPI process died?
>>> [compute-0-0.local:mpispawn_0][handle_mt_peer] Error while reading PMI
>>> socket. MPI process died?
>>>
>>> Can someone help me with this?
>>>
>>> Best regards,
>>> Jesus
>>>
>>>
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss at cse.ohio-state.edu
>>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>
>>>
>>
>