[mvapich-discuss] problems running osu benchmarks with RDMA CM

Hari Subramoni subramoni.1 at osu.edu
Thu Mar 26 12:14:38 EDT 2015


Hi Jesus,

There are some known limitations with rdma_cm and multiple HCAs. Sometimes
the address resolution gets mixed up by the underlying rdma_cm
implementation, leading to the rdma_connect errors you are seeing. I would
recommend not trying RDMA_CM with multiple HCAs. You can use the MV2_IBA_HCA
environment variable to select the HCA you want to use; you don't have to
disable the other HCA.
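
For example, something along these lines (mlx4_0 is just a placeholder HCA
name; use whatever ibstat reports on your nodes):

  mpirun_rsh -hostfile host -np 2 MV2_USE_RDMA_CM=1 MV2_IBA_HCA=mlx4_0 ./osu_acc_latency   # mlx4_0 = placeholder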

The new error is from the job launcher. It looks like it's related to a
firewall blocking the connection, or the network routes between the two
hosts were disrupted when you disabled the other HCA.
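
As a quick sanity check (the hostname is taken from your output; these
commands are only illustrative):

  ping compute-0-1.local    # basic reachability between the nodes
  iptables -L -n            # on each node, look for rules rejecting traffic between the hosts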

Thx,
Hari.

On Thu, Mar 26, 2015 at 11:58 AM, Jesus Camacho Villanueva <
jesus.camacho at fabriscale.com> wrote:

> Hello Hari,
>
> I have two interfaces (ib0 and ib1). I have disabled ib1 but it is still
> not working.
>
> For some reason, there are two new lines in the output:
>
> [compute-0-1.local:mpispawn_1][report_error] connect() failed: Connection
> refused (111)
> [compute-0-0.local:mpispawn_0][report_error] connect() failed: Connection
> refused (111)
>
> Any idea about this?
>
> Best regards,
> Jesus
>
>
> On 26 March 2015 at 15:19, Hari Subramoni <subramoni.1 at osu.edu> wrote:
>
>> Hello Jesus,
>>
>> This is strange. We've always been able to test RDMA_CM successfully in
>> our internal testing.
>>
>> Can you tell me if you have multiple HCAs per node or just one HCA?
>>
>> Thx,
>> Hari.
>>
>> On Wed, Mar 25, 2015 at 1:38 PM, Jesus Camacho Villanueva <
>> jesus.camacho at fabriscale.com> wrote:
>>
>>> Hello Hari,
>>>
>>> I usually have this issue; it only works on rare occasions.
>>> I tried increasing the number of attempts without success.
>>> I doubt the system is overloaded, because I am the only one using a
>>> small cluster with four switches and 8 HCAs for these tests.
>>>
>>> Do you have any other suggestion for me?
>>>
>>> Thanks for your quick response!
>>> Jesus
>>>
>>>
>>> On 25 March 2015 at 18:06, Hari Subramoni <subramoni.1 at osu.edu> wrote:
>>>
>>>> Hello Jesus,
>>>>
>>>> Are you facing this issue at all times or in a random fashion (with
>>>> some runs passing and some failing with this error)?
>>>>
>>>> If you're facing this issue at all times, please make sure that you've
>>>> set things up as described in the following section of the MVAPICH2 userguide:
>>>>
>>>>
>>>> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1rc2-userguide.html#x1-360005.2.6
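>>>>
>>>> (In short, that section needs the IPoIB interface on each node to be up
>>>> and, if I recall correctly, each node's IPoIB address listed in
>>>> /etc/mv2.conf; the interface name and address below are just examples.)
>>>>
>>>> ip addr show ib0     # IPoIB interface should be UP with an IP address
>>>> cat /etc/mv2.conf    # e.g. 192.168.100.10 (this node's IPoIB address)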
>>>>
>>>> If you're facing this issue in a random fashion, then it's most likely a
>>>> system issue. Typically, this indicates that the system might be overloaded
>>>> and hence unable to resolve the address properly.
>>>>
>>>> One thing you can try in this case is to increase the number of retries
>>>> using the environment variable "MV2_MAX_RDMA_CONNECT_ATTEMPTS".
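>>>>
>>>> For example (the value 50 here is arbitrary, just larger than the 20
>>>> attempts shown in your log):
>>>>
>>>> mpirun_rsh -hostfile host -np 2 MV2_USE_RDMA_CM=1 MV2_MAX_RDMA_CONNECT_ATTEMPTS=50 ./osu_acc_latency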
>>>>
>>>> Please let us know if either one of these suggestions helps in your
>>>> case.
>>>>
>>>> Thx,
>>>> Hari.
>>>>
>>>> On Wed, Mar 25, 2015 at 9:32 AM, Jesus Camacho Villanueva <
>>>> jesus.camacho at fabriscale.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I can run the OSU benchmarks without any problem, but when running them
>>>>> with the RDMA connection manager they crash.
>>>>> Previously I have run performance tests for InfiniBand using the RDMA
>>>>> connection manager without problems.
>>>>> Now, when using the MV2_USE_RDMA_CM option, I obtain the following output:
>>>>>
>>>>> # mpirun_rsh -hostfile host -np 2 MV2_USE_RDMA_CM=1 ./osu_acc_latency
>>>>> [compute-0-1.local:mpi_rank_1][ib_cma_event_handler]
>>>>> src/mpid/ch3/channels/common/src/rdma_cm/rdma_cm.c:210: rdma_connect error
>>>>> -1 after 20 attempts
>>>>> : Invalid argument (22)
>>>>> [compute-0-1.local:mpispawn_1][readline] Unexpected End-Of-File on
>>>>> file descriptor 5. MPI process died?
>>>>> [compute-0-1.local:mpispawn_1][mtpmi_processops] Error while reading
>>>>> PMI socket. MPI process died?
>>>>> [compute-0-1.local:mpispawn_1][child_handler] MPI process (rank: 1,
>>>>> pid: 20837) exited with status 253
>>>>> [root@sunshine osu_benchmarks]#
>>>>> [compute-0-0.local:mpispawn_0][read_size] Unexpected End-Of-File on file
>>>>> descriptor 7. MPI process died?
>>>>> [compute-0-0.local:mpispawn_0][read_size] Unexpected End-Of-File on
>>>>> file descriptor 7. MPI process died?
>>>>> [compute-0-0.local:mpispawn_0][handle_mt_peer] Error while reading PMI
>>>>> socket. MPI process died?
>>>>>
>>>>> Can someone help me with this?
>>>>>
>>>>> Best regards,
>>>>> Jesus
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> mvapich-discuss mailing list
>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>
>>>>>
>>>>
>>>
>>
>

