[Hadoop-RDMA-discuss] Problem with RDMA-Hadoop and Mellanox CX5
Mark Goddard
mark at stackhpc.com
Wed Jun 28 11:55:24 EDT 2017
Hi Xiaoyi,
Gmail doesn't seem to want to let me share a tarball via email, so I've
uploaded it to Google Drive. You should be able to access it here:
https://drive.google.com/drive/folders/0B29_pUBk8Ck8TnJabUpLS1lCQW8?usp=sharing
Thanks,
Mark
On 28 June 2017 at 16:50, Xiaoyi Lu <lu.932 at osu.edu> wrote:
> Hi, Vittorio,
>
> It seems your attachment was not sent successfully; we could not
> receive it. This may be because of some security-check settings.
>
> Maybe you need to rename the file and send it again.
>
> Xiaoyi
>
> > On Jun 28, 2017, at 11:37 AM, Vittorio Rebecchi - A3Cube Inc. <
> vittorio at a3cube-inc.com> wrote:
> >
> > Thank you all :-)
> >
> > Sorry! I had already prepared the configuration tarball but forgot to
> > attach it ;-) together with the logs.
> > Just for information, this tarball was taken from the first node of the
> > cluster, which acts as both resourcemanager and nodemanager (because our
> > distributed filesystem "anima" does not have the metadata overhead of
> > HDFS). The other nodes have a similar setup, and plain Hadoop with
> > "anima PFS" works without any problem in this configuration.
> > In the slaves file you will find names like "FORTISSIMO<X>DATA", which
> > we map to the IPs assigned to the IB cards, while the name
> > "FORTISSIMO<X>" is associated with the management network of our
> > clusters (a normal Ethernet network).
> >
> > Thank you again for your attention.
> >
> > -- Vittorio Rebecchi
> >
> > On 28/06/2017 16:58, Xiaoyi Lu wrote:
> >> Thanks for your feedback. As Dr. Panda mentioned, we are looking into
> this issue. Will get back to you later.
> >>
> >> Mark - Can you please send us your logs and configurations?
> >>
> >> Vittorio - Thanks for sending us your logs. Can you please send us your
> configurations as well?
> >>
> >> Thanks,
> >> Xiaoyi
> >>
> >>> On Jun 28, 2017, at 9:58 AM, Mark Goddard <mark at stackhpc.com> wrote:
> >>>
> >>> Thanks for the response, Vittorio,
> >>>
> >>> I see the same behaviour, with the NICs being enumerated in reverse
> >>> order. What's odd in my setup, though, is that the link of the second
> >>> port is actually down, but this doesn't prevent ib_send_bw or
> >>> RDMA-Hadoop from trying to use it.
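> >>>
> >>> For reference, here's a minimal libibverbs sketch of the kind of check
> >>> I'd expect before a device is picked: enumerate everything and report
> >>> which ports are actually ACTIVE. This is only an illustration (not the
> >>> RDMA-Hadoop code); build with something like "gcc ibports.c -libverbs",
> >>> where the file name is mine:
> >>>
> >>>   #include <stdio.h>
> >>>   #include <stdint.h>
> >>>   #include <infiniband/verbs.h>
> >>>
> >>>   int main(void)
> >>>   {
> >>>       int n;
> >>>       struct ibv_device **devs = ibv_get_device_list(&n);
> >>>       if (!devs) { perror("ibv_get_device_list"); return 1; }
> >>>       for (int i = 0; i < n; i++) {
> >>>           struct ibv_context *ctx = ibv_open_device(devs[i]);
> >>>           struct ibv_device_attr da;
> >>>           if (!ctx)
> >>>               continue;
> >>>           if (ibv_query_device(ctx, &da) == 0) {
> >>>               /* The verbs API numbers ports from 1. */
> >>>               for (uint8_t p = 1; p <= da.phys_port_cnt; p++) {
> >>>                   struct ibv_port_attr pa;
> >>>                   if (ibv_query_port(ctx, p, &pa) == 0)
> >>>                       printf("%s port %u: %s\n",
> >>>                              ibv_get_device_name(devs[i]), (unsigned)p,
> >>>                              pa.state == IBV_PORT_ACTIVE
> >>>                                  ? "ACTIVE" : "not active");
> >>>               }
> >>>           }
> >>>           ibv_close_device(ctx);
> >>>       }
> >>>       ibv_free_device_list(devs);
> >>>       return 0;
> >>>   }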
> >>>
> >>> Regards,
> >>> Mark
> >>>
> >>> On 28 June 2017 at 14:43, <vittorio at a3cube-inc.com> wrote:
> >>> Hi Mark,
> >>>
> >>> I think you have the same issue we have with Mellanox ConnectX5 cards.
> >>>
> >>> The new Mellanox driver (I've seen that you have dual-port Mellanox
> >>> CX4 cards) maps the ports as independent devices, mlx5_0 and mlx5_1,
> >>> with one port each. We have dual-channel Mellanox ConnectX5 cards.
> >>> The problem is that the driver (the new one, I think) activates them
> >>> in the opposite order, mlx5_1 before mlx5_0, so when your software
> >>> starts to use the "first" IB device (by a sort of index), it will
> >>> receive the mlx5_1 device instead of mlx5_0, and mlx5_1, in your case,
> >>> is the "ethernet" port. You can see this by running ibv_devinfo.
> >>>
> >>> The solution is to tell your software to use a specific IB device,
> >>> by its name. For example, with ib_write_bw I must use the option
> >>> "-d mlx5_0" to be able to run these simple tests on ConnectX5 cards.
> >>> We do the same with our distributed filesystem "Anima" for IB data
> >>> exchange, and everything works nicely. With RDMA_hadoop I think we
> >>> need an option to specify the proper device, because I suspect that
> >>> multi-card or multi-channel systems will have the same issue.
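> >>>
> >>> At the verbs level, picking a device by name instead of by index is
> >>> just a walk over the device list. A minimal sketch (illustrative only,
> >>> plain libibverbs, not the actual RDMA_hadoop internals; the helper
> >>> name "open_by_name" is mine):
> >>>
> >>>   #include <string.h>
> >>>   #include <infiniband/verbs.h>
> >>>
> >>>   /* Open the device whose name matches "wanted" (e.g. "mlx5_0")
> >>>    * instead of blindly taking devs[0], which on our systems turns
> >>>    * out to be mlx5_1. */
> >>>   static struct ibv_context *open_by_name(const char *wanted)
> >>>   {
> >>>       int n;
> >>>       struct ibv_device **devs = ibv_get_device_list(&n);
> >>>       struct ibv_context *ctx = NULL;
> >>>       if (!devs)
> >>>           return NULL;
> >>>       for (int i = 0; i < n; i++) {
> >>>           if (strcmp(ibv_get_device_name(devs[i]), wanted) == 0) {
> >>>               ctx = ibv_open_device(devs[i]);
> >>>               break;
> >>>           }
> >>>       }
> >>>       /* The list may be freed once the device is open. */
> >>>       ibv_free_device_list(devs);
> >>>       return ctx;
> >>>   }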
> >>>
> >>> Just to be sure, I have run the same setup in a mini-cluster with
> >>> single-channel ConnectX3 cards, and the IB communication between
> >>> nodes works.
> >>> Then we have another issue, but I think it is a matter of configuring
> >>> the map/reduce section together with IB, as I reported in my post:
> >>> teragen (which creates the dataset) works fine, but terasort has some
> >>> problems during the reduce phase.
> >>>
> >>> Best Regards.
> >>>
> >>> -- Vittorio Rebecchi
> >>>
> >>>
> >>> On 2017-06-28 11:24, Mark Goddard wrote:
> >>> Hi Vittorio,
> >>>
> >>> It sounds like we're experiencing similar issues. I'm using Mellanox
> >>> ConnectX4 dual-port NICs with port 0 in IB mode and am unable to start
> >>> HDFS services. I've not tried running YARN.
> >>>
> >>> Here's my email to this list on the issue:
> >>> http://mailman.cse.ohio-state.edu/pipermail/rdma-hadoop-discuss/2017-June/000095.html
> >>>
> >>> Regards,
> >>> Mark
> >>>
> >>> On 27 June 2017 at 16:09, Vittorio Rebecchi - A3Cube Inc.
> >>> <vittorio at a3cube-inc.com> wrote:
> >>>
> >>> Hi Xiaoyi,
> >>>
> >>> thanks for your attention.
> >>>
> >>> I'm sending, as an attachment to this mail, our configuration and logs
> >>> from the clusters on which I run RDMA-hadoop.
> >>> I've managed to get further with terasort by removing all the
> >>> optimizations I usually add. The logs were regenerated today for a
> >>> better picture of the problem.
> >>>
> >>> Let me describe the environment in which we use RDMA-hadoop in more
> >>> detail: we use neither HDFS nor Lustre as the filesystem, but our own
> >>> distributed filesystem, "A3Cube Anima", and it works really well with
> >>> Hadoop through our specific plug-in. You will see its setup in the
> >>> confs together with RDMA-hadoop. We usually run terasort on our
> >>> distributed filesystem without any problems.
> >>> I'm trying to use RDMA-hadoop for its map/reduce improvements over
> >>> IB. Teragen works fine and creates the files to process with
> >>> terasort. Then we run terasort with: ./bin/hadoop jar
> >>> ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar
> >>> terasort teraInput teraOutput
> >>>
> >>> The IB IPC communication in YARN and other Hadoop parts works fine.
> >>> Terasort completes the map phase but, at some point during the reduce
> >>> phase, for example at "mapreduce.Job: map 100% reduce 33%", it
> >>> won't go further and the logs report the following lines:
> >>> 2017-06-27 07:11:12,277 INFO
> >>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl:
> >>> Removed completed containers from NM context:
> >>> [container_1498572173359_0002_01_000154,
> >>> container_1498572173359_0002_01_000151]
> >>> 2017-06-27 07:15:26,097 INFO
> >>> org.apache.hadoop.mapred.HOMRShuffleHandler: RDMA Receiver 0 is
> >>> turning off!!!
> >>>
> >>> What I have described happens on the 2-node ConnectX-3 cluster
> >>> (i7-4770 CPUs with 16 GB RAM). Logs1.tar.gz is the log of node1,
> >>> logs2.tar.gz is the log of node2, etc.tar.gz has the conf files, and
> >>> terasort.sh is the terasort script with some optimizations that break
> >>> Java communication after a while.
> >>>
> >>> I have added terasort.sh with some optimizations we usually use with
> >>> terasort because, during the reduce phase, the nodemanager crashes
> >>> with the following output:
> >>>
> >>> # A fatal error has been detected by the Java Runtime Environment:
> >>> #
> >>> # SIGSEGV (0xb) at pc=0x00007fb5c0600761, pid=18982,
> >>> tid=0x00007fb5bcebd700
> >>> #
> >>> # JRE version: Java(TM) SE Runtime Environment (8.0_131-b11) (build
> >>> 1.8.0_131-b11)
> >>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode
> >>> linux-amd64 compressed oops)
> >>> # Problematic frame:
> >>> # C [librdmamrserver.so.1.1.0+0x7761]
> >>> ucr_setup_ib_qp_init_params+0x21
> >>>
> >>> I have added the logs for this run; the file is called
> >>> logs_extended.tar.gz.
> >>>
> >>> Let me add that the same setup I have reported won't work on our
> >>> ConnectX5 cluster, because I think RDMA-hadoop cannot connect to the
> >>> proper Mellanox device in the system.
> >>> It appears to be mandatory to specify which Mellanox device to use
> >>> (mlx5_0 or mlx5_1), even for the ib_write_bw testing command.
> >>>
> >>> Do you have any hints on the map/reduce setup? Is there something to
> >>> fix in the setup to allow terasort to complete its processing?
> >>> Do you have any suggestions for enabling RDMA-hadoop with Mellanox
> >>> ConnectX5 cards and for fixing the reduce behaviour? I have checked
> >>> the configuration against your manual and everything seems correct.
> >>>
> >>> Thanks in advance.
> >>>
> >>> Bye
> >>>
> >>> -- Vittorio Rebecchi
> >>>
> >>> On 26/06/2017 19:48, Xiaoyi Lu wrote:
> >>> Hi, Vittorio,
> >>>
> >>> Thanks for your interest in our project. We locally tried to run
> >>> some benchmarks on our ConnectX-5 nodes and things run fine.
> >>>
> >>> Can you please send us your logs, confs, and exact commands? We can
> >>> try to reproduce this and get back to you.
> >>>
> >>> Thanks,
> >>> Xiaoyi
> >>>
> >>> On Jun 26, 2017, at 9:20 AM, vittorio at a3cube-inc.com wrote:
> >>>
> >>> Hi,
> >>>
> >>> My name is Vittorio Rebecchi and I'm testing RDMA-based Apache
> >>> Hadoop on an 8-node cluster with one Mellanox ConnectX 5 card in
> >>> each machine.
> >>>
> >>> The Mellanox ConnectX 5 cards have 2 ports, and they are mapped as 2
> >>> independent devices (mlx5_0 and mlx5_1) on the OS by the Mellanox
> >>> drivers. The cards work properly (tested with ib_write_lat and
> >>> ib_write_bw) but I must specify which IB device to use (in
> >>> ib_write_lat, for example, I must specify "-d mlx5_0").
> >>>
> >>> On my setup, currently, YARN starts but the nodemanager nodes report
> >>> the following message:
> >>>
> >>> ctx error: ibv_poll_cq() failed: IBV_WC_SUCCESS != wc.status
> >>> IBV_WC_SUCCESS != wc.status (12)
> >>> ucr_probe_blocking return value -1
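> >>>
> >>> For what it's worth, the numeric code can be decoded with libibverbs
> >>> itself; assuming the stock ibv_wc_status enum, 12 should be
> >>> IBV_WC_RETRY_EXC_ERR (transport retry counter exceeded):
> >>>
> >>>   #include <stdio.h>
> >>>   #include <infiniband/verbs.h>
> >>>
> >>>   int main(void)
> >>>   {
> >>>       /* ibv_wc_status_str() maps a wc.status value to its name. */
> >>>       printf("status 12 = %s\n",
> >>>              ibv_wc_status_str((enum ibv_wc_status)12));
> >>>       return 0;
> >>>   }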
> >>>
> >>> I tested the same installation base on two other nodes with ConnectX
> >>> 3, and RDMA-hadoop works without showing that message.
> >>> So I suppose this error is due to the fact that ConnectX5 cards have
> >>> 2 ports that are exposed to applications as independent devices by
> >>> the new Mellanox driver (4.0, the one that supports CX5), and
> >>> RDMA-hadoop cannot establish which device to use. In other software
> >>> we must specify the device (and sometimes even the port) to use, such
> >>> as "mlx5_0", to solve similar problems.
> >>>
> >>> Is there a way to specify, in the RDMA-based Hadoop (and plugin)
> >>> setup, the proper IB device to use?
> >>>
> >>> Thanks.
> >>>
> >>> Vittorio Rebecchi
> >>> _______________________________________________
> >>> RDMA-Hadoop-discuss mailing list
> >>> RDMA-Hadoop-discuss at cse.ohio-state.edu
> >>> http://mailman.cse.ohio-state.edu/mailman/listinfo/rdma-hadoop-discuss
> >>>
> >
> > <Quarantined Attachment.txt>
> > _______________________________________________
> > RDMA-Hadoop-discuss mailing list
> > RDMA-Hadoop-discuss at cse.ohio-state.edu
> > http://mailman.cse.ohio-state.edu/mailman/listinfo/rdma-hadoop-discuss
>
>