[mvapich-discuss] problems running osu benchmarks with RDMA CM

Jesus Camacho Villanueva jesus.camacho at fabriscale.com
Wed Mar 25 13:38:37 EDT 2015


Hello Hari,

I hit this issue most of the time; only on rare occasions does a run succeed.
I tried increasing the number of attempts, but without success.
I doubt the system is overloaded, since I am the only one using this small
cluster (four switches and 8 HCAs) for these tests.

Do you have any other suggestions for me?

Thanks for your quick response!
Jesus

On 25 March 2015 at 18:06, Hari Subramoni <subramoni.1 at osu.edu> wrote:

> Hello Jesus,
>
> Are you facing this issue at all times or in a random fashion (with some
> runs passing and some failing with this error)?
>
> If you're facing this issue at all times, please make sure that you've set
> things up as described in the following section of the MVAPICH2 userguide:
>
>
> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1rc2-userguide.html#x1-360005.2.6
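>
> (For reference, and as far as I recall that section, the setup boils down
> to creating an /etc/mv2.conf file on every node containing the IP address
> of that node's IPoIB interface, which RDMA CM uses for address resolution.
> Roughly, on each node:
>
>     echo 10.1.1.11 > /etc/mv2.conf
>
> The address above is only a placeholder; substitute the IPoIB address that
> node's ib0 interface is actually configured with.)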
>
> If you're facing this issue in a random fashion, then it's most likely a
> system issue. Typically, this indicates that the system is overloaded and
> hence unable to resolve the address properly.
>
> One thing you can try in this case is to increase the number of retries
> using the environment variable "MV2_MAX_RDMA_CONNECT_ATTEMPTS".
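>
> For example, on the mpirun_rsh command line (the retry count below is only
> an illustration, not a recommended value):
>
>     mpirun_rsh -hostfile host -np 2 MV2_USE_RDMA_CM=1 \
>         MV2_MAX_RDMA_CONNECT_ATTEMPTS=50 ./osu_acc_latency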
>
> Please let us know if either one of these suggestions helps in your case.
>
> Thx,
> Hari.
>
> On Wed, Mar 25, 2015 at 9:32 AM, Jesus Camacho Villanueva <
> jesus.camacho at fabriscale.com> wrote:
>
>> Hello,
>>
>> I can run the OSU benchmarks without any problem, but when I run them with
>> the RDMA connection manager they crash.
>> Previously I have run InfiniBand performance tests using the RDMA
>> connection manager without any problems.
>> Now, when using the MV2_USE_RDMA_CM option, I get the following output:
>>
>> # mpirun_rsh -hostfile host -np 2 MV2_USE_RDMA_CM=1 ./osu_acc_latency
>> [compute-0-1.local:mpi_rank_1][ib_cma_event_handler]
>> src/mpid/ch3/channels/common/src/rdma_cm/rdma_cm.c:210: rdma_connect error
>> -1 after 20 attempts
>> : Invalid argument (22)
>> [compute-0-1.local:mpispawn_1][readline] Unexpected End-Of-File on file
>> descriptor 5. MPI process died?
>> [compute-0-1.local:mpispawn_1][mtpmi_processops] Error while reading PMI
>> socket. MPI process died?
>> [compute-0-1.local:mpispawn_1][child_handler] MPI process (rank: 1, pid:
>> 20837) exited with status 253
>> [root at sunshine osu_benchmarks]#
>> [compute-0-0.local:mpispawn_0][read_size] Unexpected End-Of-File on file
>> descriptor 7. MPI process died?
>> [compute-0-0.local:mpispawn_0][read_size] Unexpected End-Of-File on file
>> descriptor 7. MPI process died?
>> [compute-0-0.local:mpispawn_0][handle_mt_peer] Error while reading PMI
>> socket. MPI process died?
>>
>> Can someone help me with this?
>>
>> Best regards,
>> Jesus
>>
>>
>