[Hadoop-RDMA-discuss] Problem with RDMA-Hadoop and Mellanox CX5

Mark Goddard mark at stackhpc.com
Wed Jun 28 10:30:42 EDT 2017


Thanks, DK. Please let us know if we can be of any assistance in your
investigations.

Regards,
Mark

On 28 June 2017 at 15:10, Panda, Dhabaleswar <panda at cse.ohio-state.edu>
wrote:

> We are taking a look at this issue to see what could be happening here. We
> will keep you updated.
>
> Thanks,
>
> DK
> ------------------------------
> *From:* rdma-hadoop-discuss-bounces at cse.ohio-state.edu on behalf of Mark
> Goddard [mark at stackhpc.com]
> *Sent:* Wednesday, June 28, 2017 9:58 AM
> *To:* Vittorio Rebecchi - A3Cube Inc.
> *Cc:* rdma-hadoop-discuss at cse.ohio-state.edu
> *Subject:* Re: [Hadoop-RDMA-discuss] Problem with RDMA-Hadoop and
> Mellanox CX5
>
> Thanks for the response, Vittorio.
>
> I see the same behaviour, with the NICs being enumerated in reverse order.
> What's odd in my setup, though, is that the link of the second port is
> actually down, but this doesn't prevent ib_send_bw or RDMA-Hadoop from
> trying to use it.
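>
> As an aside, the port state is easy to check from code before committing
> to a device. Here's a minimal libibverbs sketch that prints whether port 1
> of each device is ACTIVE; it's just an illustration of the verbs calls,
> not how RDMA-Hadoop actually selects devices:
>
> /* Print each IB device and the link state of its port 1.
>  * Build with: gcc check_ports.c -libverbs (file name is arbitrary) */
> #include <stdio.h>
> #include <infiniband/verbs.h>
>
> int main(void)
> {
>     int num = 0;
>     struct ibv_device **list = ibv_get_device_list(&num);
>
>     for (int i = 0; list && i < num; i++) {
>         struct ibv_context *ctx = ibv_open_device(list[i]);
>         struct ibv_port_attr attr;
>
>         /* Port numbers are 1-based in the verbs API. */
>         if (ctx && ibv_query_port(ctx, 1, &attr) == 0)
>             printf("%s port 1: %s\n", ibv_get_device_name(list[i]),
>                    attr.state == IBV_PORT_ACTIVE ? "ACTIVE" : "not active");
>         if (ctx)
>             ibv_close_device(ctx);
>     }
>     if (list)
>         ibv_free_device_list(list);
>     return 0;
> }
>
> A device picker that skipped ports not reporting IBV_PORT_ACTIVE would
> avoid the down second port entirely.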
>
> Regards,
> Mark
>
> On 28 June 2017 at 14:43, <vittorio at a3cube-inc.com> wrote:
>
>> Hi Mark,
>>
>> I think you have the same issue we have with Mellanox ConnectX-5 cards.
>>
>> The new Mellanox driver maps the ports as independent devices, mlx5_0 and
>> mlx5_1, with one port each (I've seen that you have dual-port Mellanox CX4
>> cards; we have dual-port Mellanox ConnectX-5 cards).
>> The problem is that the driver (the new one, I think) activates them in
>> the opposite order, mlx5_1 before mlx5_0, so when your software picks the
>> "first" IB device (by some index), it gets mlx5_1 (which in your case is
>> the "ethernet" port) instead of mlx5_0. You can see this by running
>> ibv_devinfo.
>>
>> The solution is to tell your software to use a specific IB device by its
>> name. For example, with ib_write_bw I must pass the option "-d mlx5_0" to
>> be able to run these simple tests on ConnectX-5 cards.
>> We do the same with our distributed filesystem "Anima" for IB data
>> exchange, and everything works nicely. With RDMA-Hadoop I think we need
>> an option to specify the proper device, because I suspect that multi-card
>> or multi-port systems will have the same issue.
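>>
>> To make that concrete, here is a minimal libibverbs sketch of the by-name
>> lookup, roughly what a "-d"-style option has to do under the hood (the
>> function name and the device name are only examples):
>>
>> /* Open an IB device by name instead of blindly taking list[0]. */
>> #include <string.h>
>> #include <infiniband/verbs.h>
>>
>> struct ibv_context *open_by_name(const char *name)
>> {
>>     int num = 0;
>>     struct ibv_device **list = ibv_get_device_list(&num);
>>     struct ibv_context *ctx = NULL;
>>
>>     if (!list)
>>         return NULL;
>>     /* The enumeration order is exactly what bites you here: on our
>>      * ConnectX-5 nodes list[0] is mlx5_1, not mlx5_0, so we match on
>>      * the name instead of the index. */
>>     for (int i = 0; i < num && !ctx; i++)
>>         if (!strcmp(ibv_get_device_name(list[i]), name))
>>             ctx = ibv_open_device(list[i]);
>>     ibv_free_device_list(list);
>>     return ctx;  /* e.g. open_by_name("mlx5_0"); NULL if not found */
>> }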
>>
>> Just to be sure, I have run the same setup on a mini-cluster with
>> single-port ConnectX-3 cards, and the IB communication between the nodes
>> works. We then hit another issue, but I think it is a matter of
>> configuring the map/reduce section together with IB, as I reported in my
>> post: teragen (which creates the dataset) works fine, but terasort has
>> some problems during the reduce phase.
>>
>> Best Regards.
>>
>> -- Vittorio Rebecchi
>>
>>
>> On 2017-06-28 11:24, Mark Goddard wrote:
>>
>>> Hi Vittorio,
>>>
>>> It sounds like we're experiencing similar issues. I'm using Mellanox
>>> ConnectX-4 dual-port NICs with port 0 in IB mode and am unable to start
>>> the HDFS services. I've not tried running YARN.
>>>
>>> Here's my email to this list on the issue:
>>> http://mailman.cse.ohio-state.edu/pipermail/rdma-hadoop-discuss/2017-June/000095.html
>>>
>>> Regards,
>>> Mark
>>>
>>> On 27 June 2017 at 16:09, Vittorio Rebecchi - A3Cube Inc.
>>> <vittorio at a3cube-inc.com> wrote:
>>>
>>>> Hi Xiaoyi,
>>>>
>>>> Thank you for your attention.
>>>>
>>>> I'm sending, as attachments to this mail, our configuration and logs
>>>> from the clusters on which I run RDMA-Hadoop.
>>>> I've managed to get terasort to keep going by removing all the
>>>> optimizations I usually add. The logs were regenerated today for a
>>>> better picture of the problem.
>>>>
>>>> Let me describe in more detail the environment in which we use
>>>> RDMA-Hadoop: we use neither HDFS nor Lustre as the filesystem, but our
>>>> own distributed filesystem, "A3Cube Anima", which works really well
>>>> with Hadoop via our specific plug-in. You will see its setup in the
>>>> confs together with RDMA-Hadoop. We usually run terasort on our
>>>> distributed filesystem without any problems.
>>>> I'm trying to use RDMA-Hadoop for its map/reduce improvements over IB.
>>>> Teragen works fine and creates the files for terasort to process.
>>>> Then we run terasort with: ./bin/hadoop jar
>>>> ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar
>>>> terasort teraInput teraOutput
>>>>
>>>> The IB IPC communication in YARN and other Hadoop parts works fine.
>>>> Terasort completes the map phase, but some way into the reduce phase,
>>>> for example at "mapreduce.Job:  map 100% reduce 33%", it won't go any
>>>> further and the logs report the following lines:
>>>> 2017-06-27 07:11:12,277 INFO
>>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl:
>>>> Removed completed containers from NM context:
>>>> [container_1498572173359_0002_01_000154,
>>>> container_1498572173359_0002_01_000151]
>>>> 2017-06-27 07:15:26,097 INFO
>>>> org.apache.hadoop.mapred.HOMRShuffleHandler: RDMA Receiver 0 is
>>>> turning off!!!
>>>>
>>>> What I have described happens on the 2-node ConnectX-3 cluster
>>>> (i7-4770 CPUs with 16 GB RAM). Logs1.tar.gz contains the logs of
>>>> node1, logs2.tar.gz the logs of node2, etc.tar.gz has the conf files,
>>>> and terasort.sh is the terasort script with some optimizations that
>>>> break the Java communication after a while.
>>>>
>>>> I have included terasort.sh with some optimizations we usually use
>>>> with terasort because, during the reduce phase, the nodemanager
>>>> crashes with the following output:
>>>>
>>>> # A fatal error has been detected by the Java Runtime Environment:
>>>> #
>>>> #  SIGSEGV (0xb) at pc=0x00007fb5c0600761, pid=18982,
>>>> tid=0x00007fb5bcebd700
>>>> #
>>>> # JRE version: Java(TM) SE Runtime Environment (8.0_131-b11) (build
>>>> 1.8.0_131-b11)
>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode
>>>> linux-amd64 compressed oops)
>>>> # Problematic frame:
>>>> # C  [librdmamrserver.so.1.1.0+0x7761]
>>>> ucr_setup_ib_qp_init_params+0x21
>>>>
>>>> I have added the logs from that run as well; the archive is called
>>>> logs_extended.tar.gz.
>>>>
>>>> Let me add that the same setup I have reported won't work on our
>>>> ConnectX-5 cluster, because I think RDMA-Hadoop cannot connect to the
>>>> proper Mellanox device in the system.
>>>> It seems to be mandatory to specify which Mellanox device to use
>>>> (mlx5_0 or mlx5_1), even for the ib_write_bw testing command.
>>>>
>>>> Do you have any hints on the map/reduce setup? Is there something to
>>>> fix in the setup to allow terasort to complete its processing?
>>>> Do you have any suggestions for enabling RDMA-Hadoop with Mellanox
>>>> ConnectX-5 cards and for fixing the reduce behaviour? I have checked
>>>> the configuration against your manual and everything seems correct.
>>>>
>>>> Thanks in advance.
>>>>
>>>> Bye
>>>>
>>>> -- Vittorio Rebecchi
>>>>
>>>> On 26/06/2017 19:48, Xiaoyi Lu wrote:
>>>> Hi, Vittorio,
>>>>
>>>> Thanks for your interest in our project. We locally tried to run
>>>> some benchmarks on our ConnectX-5 nodes and things run fine.
>>>>
>>>> Can you please send us your logs, confs, and exact commands? We can
>>>> try to reproduce this and get back to you.
>>>>
>>>> Thanks,
>>>> Xiaoyi
>>>>
>>>> On Jun 26, 2017, at 9:20 AM, vittorio at a3cube-inc.com wrote:
>>>>
>>>> Hi,
>>>>
>>>> My name is Vittorio Rebecchi and I'm testing RDMA-based Apache
>>>> Hadoop on an 8-node cluster with one Mellanox ConnectX-5 card in
>>>> each machine.
>>>>
>>>> The Mellanox ConnectX-5 cards have 2 ports, which are mapped as 2
>>>> independent devices (mlx5_0 and mlx5_1) on the OS by the Mellanox
>>>> drivers. The cards work properly (tested with ib_write_lat and
>>>> ib_write_bw), but I must specify which IB device to use (in
>>>> ib_write_lat, for example, I must pass "-d mlx5_0").
>>>>
>>>> On my setup, currently, YARN starts but the nodemanager nodes report
>>>> the following message:
>>>>
>>>> ctx error: ibv_poll_cq() failed: IBV_WC_SUCCESS != wc.status
>>>> IBV_WC_SUCCESS != wc.status (12)
>>>> ucr_probe_blocking return value -1
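>>>>
>>>> If the "(12)" in that message is the raw ibv_wc_status value, it would
>>>> correspond to IBV_WC_RETRY_EXC_ERR (transport retry counter exceeded),
>>>> which is what you typically see when a QP was brought up on a port
>>>> that cannot reach its peer. A small sketch of how that status can be
>>>> decoded with ibv_wc_status_str, assuming 'cq' is an already-created
>>>> completion queue:
>>>>
>>>> #include <stdio.h>
>>>> #include <infiniband/verbs.h>
>>>>
>>>> /* Poll one completion and print a human-readable status on error. */
>>>> int poll_one(struct ibv_cq *cq)
>>>> {
>>>>     struct ibv_wc wc;
>>>>     int n = ibv_poll_cq(cq, 1, &wc);
>>>>
>>>>     if (n > 0 && wc.status != IBV_WC_SUCCESS) {
>>>>         /* status 12 prints as "transport retry counter exceeded" */
>>>>         fprintf(stderr, "completion failed: %s (%d)\n",
>>>>                 ibv_wc_status_str(wc.status), wc.status);
>>>>         return -1;
>>>>     }
>>>>     return n;  /* 0 = empty CQ, 1 = one successful completion */
>>>> }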
>>>>
>>>> I tested the same installation base on two other nodes with
>>>> ConnectX-3 cards, and RDMA-Hadoop works without showing that message.
>>>> So I suppose this error is due to the fact that ConnectX-5 cards have
>>>> 2 ports that are exposed to applications as independent devices by
>>>> the new Mellanox driver (4.0, the one that supports CX5), and
>>>> RDMA-Hadoop cannot establish which device to use. In other software
>>>> we must specify the device (and sometimes even the port) to use, such
>>>> as "mlx5_0", to solve similar problems.
>>>>
>>>> Is there a way to specify, in the RDMA-based Hadoop (and plugin)
>>>> setup, the proper IB device to use?
>>>>
>>>> Thanks.
>>>>
>>>> Vittorio Rebecchi
>>>
>>
>