[mvapich-discuss] problems running osu benchmarks with RDMA CM

Jesus Camacho Villanueva jesus.camacho at fabriscale.com
Tue Apr 7 08:49:18 EDT 2015


Hi again,

I have only one HCA per node, but two ports per HCA. Is there any issue
with this configuration?

I have configured the HCAs to use port 1 only, but I still have the same
problem.
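
For reference, this is how I am checking that only port 1 is active
(ibstat is the standard InfiniBand diagnostic tool; the device name
mlx4_0 is just what the HCA happens to be called on these nodes):

# ibstat mlx4_0 1 | grep "State:"
        State: Active
# ibstat mlx4_0 2 | grep "State:"
        State: Down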

I need to use the RDMA connection manager, so I would really appreciate
any other hints about this.

Many thanks,
Jesus



On 26 March 2015 at 17:14, Hari Subramoni <subramoni.1 at osu.edu> wrote:

> Hi Jesus,
>
> There are some known limitations with rdma_cm and multiple HCAs. Sometimes
> the address resolution gets screwed up by the underlying rdma_cm
> implementation, leading to the rdma_connect errors. I would recommend not
> trying RDMA_CM with multiple HCAs. You can use the MV2_IBA_HCA environment
> variable to select the HCA you need to use. You don't have to disable the
> other HCA.
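>
> For example (the HCA name below is just a placeholder; use whatever
> ibstat reports on your nodes):
>
> # mpirun_rsh -hostfile host -np 2 MV2_USE_RDMA_CM=1 MV2_IBA_HCA=mlx4_0 ./osu_acc_latency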
>
> The new error is from the job launcher. It looks like it's related to a
> firewall blocking the connection, or the network routes between the two
> hosts got messed up when you disabled the other HCA.
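>
> A quick way to rule both out (standard Linux tools, nothing
> MVAPICH2-specific; replace <peer-ip> with the other node's address):
>
> # iptables -L -n            (look for rules rejecting TCP connections)
> # ip route get <peer-ip>    (check which interface the route now uses)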
>
> Thx,
> Hari.
>
> On Thu, Mar 26, 2015 at 11:58 AM, Jesus Camacho Villanueva <
> jesus.camacho at fabriscale.com> wrote:
>
>> Hello Hari,
>>
>> I have two interfaces (ib0 and ib1). I have disabled ib1 but it is still
>> not working.
>>
>> For some reason, there are two new lines in the output:
>>
>> [compute-0-1.local:mpispawn_1][report_error] connect() failed: Connection
>> refused (111)
>> [compute-0-0.local:mpispawn_0][report_error] connect() failed: Connection
>> refused (111)
>>
>> Any idea about this?
>>
>> Best regards,
>> Jesus
>>
>>
>> On 26 March 2015 at 15:19, Hari Subramoni <subramoni.1 at osu.edu> wrote:
>>
>>> Hello Jesus,
>>>
>>> This is strange. We've always been able to test RDMA_CM successfully in
>>> our internal testing.
>>>
>>> Can you tell me if you have multiple HCAs per node or just one HCA?
>>>
>>> Thx,
>>> Hari.
>>>
>>> On Wed, Mar 25, 2015 at 1:38 PM, Jesus Camacho Villanueva <
>>> jesus.camacho at fabriscale.com> wrote:
>>>
>>>> Hello Hari,
>>>>
>>>> I usually hit this issue, although it does work on rare occasions.
>>>> I tried increasing the number of attempts without success.
>>>> I doubt the system is overloaded, because I am the only one using this
>>>> small cluster of four switches and 8 HCAs for these tests.
>>>>
>>>> Do you have any other suggestions for me?
>>>>
>>>> Thanks for your quick response!
>>>> Jesus
>>>>
>>>>
>>>> On 25 March 2015 at 18:06, Hari Subramoni <subramoni.1 at osu.edu> wrote:
>>>>
>>>>> Hello Jesus,
>>>>>
>>>>> Are you facing this issue at all times or in a random fashion (with
>>>>> some runs passing and some failing with this error)?
>>>>>
>>>>> If you're facing this issue at all times, please make sure that you've
>>>>> set things up as described in the following section of the MVAPICH2
>>>>> userguide.
>>>>>
>>>>>
>>>>> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1rc2-userguide.html#x1-360005.2.6
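>>>>>
>>>>> Roughly speaking (the linked section has the exact steps), RDMA CM
>>>>> needs the RDMA CM kernel module loaded and an IP address configured
>>>>> on the IPoIB interface of every node, something like the following
>>>>> (the address and netmask here are only placeholders for your setup):
>>>>>
>>>>> # modprobe rdma_ucm
>>>>> # ifconfig ib0 10.10.1.1 netmask 255.255.255.0
>>>>>
>>>>> and then setting MV2_USE_RDMA_CM=1 at run time, as you are already
>>>>> doing.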
>>>>>
>>>>> If you're facing this issue in a random fashion, then it's most likely
>>>>> a system issue. Typically, this indicates that the system might be
>>>>> overloaded and hence unable to resolve the address properly.
>>>>>
>>>>> One thing you can try in this case is to increase the number of
>>>>> retries using the environment variable "MV2_MAX_RDMA_CONNECT_ATTEMPTS".
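>>>>>
>>>>> For example (your error message shows 20 attempts; the value below is
>>>>> just an arbitrary larger number to try):
>>>>>
>>>>> # mpirun_rsh -hostfile host -np 2 MV2_USE_RDMA_CM=1 MV2_MAX_RDMA_CONNECT_ATTEMPTS=50 ./osu_acc_latency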
>>>>>
>>>>> Please let us know if either one of these suggestions helps in your
>>>>> case.
>>>>>
>>>>> Thx,
>>>>> Hari.
>>>>>
>>>>> On Wed, Mar 25, 2015 at 9:32 AM, Jesus Camacho Villanueva <
>>>>> jesus.camacho at fabriscale.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I can run the OSU benchmarks without any problem, but when I run them
>>>>>> with the RDMA connection manager they crash.
>>>>>> Previously, I have run InfiniBand performance tests using the RDMA
>>>>>> connection manager without problems.
>>>>>> Now, when using the MV2_USE_RDMA_CM option, I get the following output:
>>>>>>
>>>>>> # mpirun_rsh -hostfile host -np 2 MV2_USE_RDMA_CM=1 ./osu_acc_latency
>>>>>> [compute-0-1.local:mpi_rank_1][ib_cma_event_handler]
>>>>>> src/mpid/ch3/channels/common/src/rdma_cm/rdma_cm.c:210: rdma_connect error
>>>>>> -1 after 20 attempts
>>>>>> : Invalid argument (22)
>>>>>> [compute-0-1.local:mpispawn_1][readline] Unexpected End-Of-File on
>>>>>> file descriptor 5. MPI process died?
>>>>>> [compute-0-1.local:mpispawn_1][mtpmi_processops] Error while reading
>>>>>> PMI socket. MPI process died?
>>>>>> [compute-0-1.local:mpispawn_1][child_handler] MPI process (rank: 1,
>>>>>> pid: 20837) exited with status 253
>>>>>> [root at sunshine osu_benchmarks]#
>>>>>> [compute-0-0.local:mpispawn_0][read_size] Unexpected End-Of-File on file
>>>>>> descriptor 7. MPI process died?
>>>>>> [compute-0-0.local:mpispawn_0][read_size] Unexpected End-Of-File on
>>>>>> file descriptor 7. MPI process died?
>>>>>> [compute-0-0.local:mpispawn_0][handle_mt_peer] Error while reading
>>>>>> PMI socket. MPI process died?
>>>>>>
>>>>>> Can someone help me with this?
>>>>>>
>>>>>> Best regards,
>>>>>> Jesus
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> mvapich-discuss mailing list
>>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>