[Hadoop-RDMA-discuss] Problem with RDMA-Hadoop and Mellanox CX5

Vittorio Rebecchi - A3Cube Inc. vittorio at a3cube-inc.com
Wed Jun 28 11:57:46 EDT 2017


OK, the antispam is always right :-)
I'm sending the Dropbox link:
https://dl.dropboxusercontent.com/u/47682138/etc.tar.gz

-- Vittorio

On 28/06/2017 17:55, Mark Goddard wrote:
> Hi Xiaoyi,
>
> Gmail doesn't seem to want to let me share a tarball via email, so I've 
> uploaded it to Google Drive. You should be able to access it here: 
> https://drive.google.com/drive/folders/0B29_pUBk8Ck8TnJabUpLS1lCQW8?usp=sharing.
>
> Thanks,
> Mark
>
> On 28 June 2017 at 16:50, Xiaoyi Lu <lu.932 at osu.edu 
> <mailto:lu.932 at osu.edu>> wrote:
>
>     Hi, Vittorio,
>
>     It seems your attachment was somehow not sent successfully; we
>     could not get it. This may be due to some security check settings.
>
>     Maybe you need to rename the file and send it again.
>
>     Xiaoyi
>
>     > On Jun 28, 2017, at 11:37 AM, Vittorio Rebecchi - A3Cube Inc.
>     <vittorio at a3cube-inc.com <mailto:vittorio at a3cube-inc.com>> wrote:
>     >
>     > Thank you all :-)
>     >
>     > Sorry! I had already prepared the configuration tarball but I
>     > forgot to add it ;-) together with the logs.
>     > Just for information, this tarball was taken from the first node
>     > of the cluster, which acts as both resourcemanager and nodemanager
>     > (because our distributed filesystem "anima" does not have the
>     > metadata overhead of HDFS). The other nodes have a similar setup,
>     > and plain Hadoop with "anima PFS" works without any problem on
>     > this setup.
>     > In the slaves file you will find names like "FORTISSIMO<X>DATA",
>     > which we associate with the IPs of the IB cards, while the name
>     > "FORTISSIMO<X>" is associated with the management network of our
>     > clusters (a normal Ethernet network).
>     >
>     > Thank you again for your attention.
>     >
>     > -- Vittorio Rebecchi
>     >
>     > On 28/06/2017 16:58, Xiaoyi Lu wrote:
>     >> Thanks for your feedback. As Dr. Panda mentioned, we are
>     looking into this issue. Will get back to you later.
>     >>
>     >> Mark - Can you please send us your logs and configurations?
>     >>
>     >> Vittorio - Thanks for sending us your logs. Can you please send
>     us your configurations as well?
>     >>
>     >> Thanks,
>     >> Xiaoyi
>     >>
>     >>> On Jun 28, 2017, at 9:58 AM, Mark Goddard <mark at stackhpc.com
>     <mailto:mark at stackhpc.com>> wrote:
>     >>>
>     >>> Thanks for the response, Vittorio.
>     >>>
>     >>> I see the same behaviour, with the NICs being enumerated in
>     >>> reverse order. What's odd in my setup, though, is that the link of
>     >>> the second port is actually down, but this doesn't prevent
>     >>> ib_send_bw or RDMA-Hadoop from trying to use it.
>     >>>
>     >>> Regards,
>     >>> Mark
>     >>>
>     >>> On 28 June 2017 at 14:43, <vittorio at a3cube-inc.com
>     <mailto:vittorio at a3cube-inc.com>> wrote:
>     >>> Hi Mark,
>     >>>
>     >>> I think you have the same issue we have with Mellanox
>     ConnectX5 cards.
>     >>>
>     >>> The new Mellanox driver (I've seen that you have dual-port
>     >>> Mellanox CX4 cards) maps the ports as independent devices, mlx5_0
>     >>> and mlx5_1, with one port each. We have dual-channel Mellanox
>     >>> ConnectX5 cards.
>     >>> The problem is that the driver (the new one, I think) activates
>     >>> them in the opposite order, mlx5_1 before mlx5_0, so when your
>     >>> software picks the "first" IB device (by some sort of index), it
>     >>> will get mlx5_1 before mlx5_0, which in your case is the
>     >>> "ethernet" port. You can see that by running ibv_devinfo.
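>     >>> To make that concrete, here is a minimal libibverbs sketch of the
>     >>> same check, illustrative only (not RDMA-Hadoop code): it prints the
>     >>> devices in the order the verbs library enumerates them and opens
>     >>> one by name instead of trusting index 0.
>     >>>
>     >>>     #include <stdio.h>
>     >>>     #include <string.h>
>     >>>     #include <infiniband/verbs.h>
>     >>>
>     >>>     int main(void)
>     >>>     {
>     >>>         int num = 0;
>     >>>         /* Enumeration order is driver-dependent, so do not rely on it. */
>     >>>         struct ibv_device **devs = ibv_get_device_list(&num);
>     >>>         const char *wanted = "mlx5_0";    /* pick the device explicitly */
>     >>>         struct ibv_context *ctx = NULL;
>     >>>
>     >>>         if (!devs)
>     >>>             return 1;
>     >>>         for (int i = 0; i < num; i++) {
>     >>>             printf("index %d -> %s\n", i, ibv_get_device_name(devs[i]));
>     >>>             if (!ctx && strcmp(ibv_get_device_name(devs[i]), wanted) == 0)
>     >>>                 ctx = ibv_open_device(devs[i]);   /* open by name, not by index */
>     >>>         }
>     >>>         if (ctx)
>     >>>             ibv_close_device(ctx);
>     >>>         ibv_free_device_list(devs);
>     >>>         return 0;
>     >>>     }
>     >>>
>     >>> (Compile with "gcc -o list_devs list_devs.c -libverbs"; the file
>     >>> name is of course arbitrary.)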
>     >>>
>     >>> The solution is to tell your software to use a specific IB device,
>     >>> by name. For example, with ib_write_bw I must use the option
>     >>> "-d mlx5_0" to be able to run these simple tests on ConnectX5
>     >>> cards (see the command sketch below).
>     >>> We do the same with our distributed filesystem "Anima" for IB
>     >>> data exchange, and everything works nicely. With RDMA-Hadoop I
>     >>> think we need an option to specify the proper device, because I
>     >>> suspect that multi-card or multi-channel systems will have the
>     >>> same issue.
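>     >>> For reference, on our ConnectX5 nodes the point-to-point test is
>     >>> run roughly like this (the hostname is only a placeholder for one
>     >>> of our nodes):
>     >>>
>     >>>     node A (server):  ib_write_bw -d mlx5_0
>     >>>     node B (client):  ib_write_bw -d mlx5_0 <nodeA-hostname>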
>     >>>
>     >>> Just to be sure, I have run the same setup in a mini-cluster
>     >>> with single-channel ConnectX3 cards, and the IB communication
>     >>> between nodes works.
>     >>> We then have another issue, but I think it is a matter of
>     >>> configuring the map/reduce section together with IB, as I
>     >>> reported in my post: teragen (which creates the dataset) works
>     >>> fine, but terasort has some problems during the reduce phase.
>     >>>
>     >>> Best Regards.
>     >>>
>     >>> -- Vittorio Rebecchi
>     >>>
>     >>>
>     >>> On 2017-06-28 11:24, Mark Goddard wrote:
>     >>> Hi Vittorio,
>     >>>
>     >>> It sounds like we're experiencing similar issues. I'm using
>     >>> Mellanox ConnectX4 dual-port NICs with port 0 in IB mode and am
>     >>> unable to start HDFS services. I've not tried running YARN.
>     >>>
>     >>> Here's my email to this list on the issue:
>     >>>
>     >>> http://mailman.cse.ohio-state.edu/pipermail/rdma-hadoop-discuss/2017-June/000095.html
>     >>>
>     >>> Regards,
>     >>> Mark
>     >>>
>     >>> On 27 June 2017 at 16:09, Vittorio Rebecchi - A3Cube Inc.
>     >>> <vittorio at a3cube-inc.com <mailto:vittorio at a3cube-inc.com>> wrote:
>     >>>
>     >>> Hi Xiaoyi,
>     >>>
>     >>> Thank you for your attention.
>     >>>
>     >>> I'm sending, as an attachment to this mail, our configuration and
>     >>> logs from the clusters on which I run RDMA-Hadoop.
>     >>> I've managed to get terasort to go further by removing all the
>     >>> optimizations I usually add. The logs were regenerated today to
>     >>> give a better picture of the problem.
>     >>>
>     >>> Let me better describe the environment in which we use
>     >>> RDMA-Hadoop: we use neither HDFS nor Lustre as the filesystem, but
>     >>> our own distributed filesystem, "A3Cube Anima", and it works
>     >>> really well with Hadoop through our specific plug-in. You will see
>     >>> its setup in the confs together with RDMA-Hadoop. We usually run
>     >>> terasort on our distributed filesystem without any problems.
>     >>> I'm trying to use RDMA-Hadoop for its improvements in map/reduce
>     >>> with IB. Teragen works fine and creates the files to be processed
>     >>> by terasort. Then we run terasort with: ./bin/hadoop jar
>     >>> ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar
>     >>> terasort teraInput teraOutput
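>     >>> (For completeness, the teragen step before that is invoked the
>     >>> same way, roughly: ./bin/hadoop jar
>     >>> ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar
>     >>> teragen <number-of-rows> teraInput, where <number-of-rows> is just
>     >>> a placeholder for the dataset size we generate.)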
>     >>>
>     >>> The IB IPC communication in YARN and other Hadoop parts works
>     >>> fine. Terasort completes the map phase and then, some way into the
>     >>> reduce phase, for example at "mapreduce.Job: map 100% reduce 33%",
>     >>> it won't go any further and the logs report the following lines:
>     >>> 2017-06-27 07:11:12,277 INFO
>     >>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl:
>     >>> Removed completed containers from NM context:
>     >>> [container_1498572173359_0002_01_000154,
>     >>> container_1498572173359_0002_01_000151]
>     >>> 2017-06-27 07:15:26,097 INFO
>     >>> org.apache.hadoop.mapred.HOMRShuffleHandler: RDMA Receiver 0 is
>     >>> turning off!!!
>     >>>
>     >>> What I have described happens on the 2-node ConnectX-3 cluster
>     >>> (i7-4770 CPUs with 16 GB RAM). logs1.tar.gz is the log of node1,
>     >>> logs2.tar.gz is the log of node2, etc.tar.gz has the conf files,
>     >>> and terasort.sh is the terasort script with some optimizations
>     >>> that break the Java communication after a while.
>     >>>
>     >>> I have added terasort.sh with some optimizations we usually use
>     >>> with terasort because, during the reduce phase, the nodemanager
>     >>> crashes with the following output:
>     >>>
>     >>> # A fatal error has been detected by the Java Runtime Environment:
>     >>> #
>     >>> #  SIGSEGV (0xb) at pc=0x00007fb5c0600761, pid=18982, tid=0x00007fb5bcebd700
>     >>> #
>     >>> # JRE version: Java(TM) SE Runtime Environment (8.0_131-b11) (build 1.8.0_131-b11)
>     >>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode linux-amd64 compressed oops)
>     >>> # Problematic frame:
>     >>> # C  [librdmamrserver.so.1.1.0+0x7761]  ucr_setup_ib_qp_init_params+0x21
>     >>>
>     >>> I have added the logs for this run; the file is called
>     >>> logs_extended.tar.gz.
>     >>>
>     >>> Let me add that the same setup I have reported won't work on our
>     >>> ConnectX5 cluster, because I think RDMA-Hadoop cannot connect to
>     >>> the proper Mellanox device in the system.
>     >>> It seems to be mandatory to specify which Mellanox device to use
>     >>> (mlx5_0 or mlx5_1) even for the ib_write_bw testing command.
>     >>>
>     >>> Do you have any hints on the map/reduce setup? Is there something
>     >>> to fix in the setup to allow terasort to complete its processing?
>     >>> Do you have any suggestions for enabling RDMA-Hadoop with Mellanox
>     >>> ConnectX5 cards and for fixing the reduce behaviour? I have
>     >>> checked the configuration against your manual and everything
>     >>> seems correct.
>     >>>
>     >>> Thanks in advance.
>     >>>
>     >>> Bye
>     >>>
>     >>> -- Vittorio Rebecchi
>     >>>
>     >>> On 26/06/2017 19:48, Xiaoyi Lu wrote:
>     >>> Hi, Vittorio,
>     >>>
>     >>> Thanks for your interest in our project. We locally tried to run
>     >>> some benchmarks on our ConnectX-5 nodes and things run fine.
>     >>>
>     >>> Can you please send us your logs, confs, and exact commands?
>     We can
>     >>> try to reproduce this and get back to you.
>     >>>
>     >>> Thanks,
>     >>> Xiaoyi
>     >>>
>     >>> On Jun 26, 2017, at 9:20 AM, vittorio at a3cube-inc.com
>     <mailto:vittorio at a3cube-inc.com> wrote:
>     >>>
>     >>> Hi,
>     >>>
>     >>> My name is Vittorio Rebecchi and I'm testing RDMA-based Apache
>     >>> Hadoop on an 8-node cluster with one Mellanox ConnectX5 card in
>     >>> each machine.
>     >>>
>     >>> The Mellanox ConnectX5 cards have 2 ports and they are mapped as
>     >>> 2 independent devices (mlx5_0 and mlx5_1) on the OS by the
>     >>> Mellanox drivers. The cards work properly (tested with
>     >>> ib_write_lat and ib_write_bw), but I must specify which IB device
>     >>> to use (in ib_write_lat, for example, I must specify "-d mlx5_0").
>     >>>
>     >>> On my setup, currently, YARN starts but the nodemanager nodes
>     >>> report the following message:
>     >>>
>     >>> ctx error: ibv_poll_cq() failed: IBV_WC_SUCCESS != wc.status
>     >>> IBV_WC_SUCCESS != wc.status (12)
>     >>> ucr_probe_blocking return value -1
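>     >>> For what it's worth, status 12 in the verbs completion-status
>     >>> enum is, if I read infiniband/verbs.h correctly,
>     >>> IBV_WC_RETRY_EXC_ERR (transport retry counter exceeded), which is
>     >>> typically what you get when a QP is brought up towards a port with
>     >>> no usable link or towards the wrong device. The check that
>     >>> produces a message like the one above usually looks roughly like
>     >>> this (illustrative sketch, not RDMA-Hadoop's actual code):
>     >>>
>     >>>     #include <stdio.h>
>     >>>     #include <infiniband/verbs.h>
>     >>>
>     >>>     /* Poll one completion and report a non-success status. */
>     >>>     static void check_one_completion(struct ibv_cq *cq)
>     >>>     {
>     >>>         struct ibv_wc wc;
>     >>>         int n = ibv_poll_cq(cq, 1, &wc);
>     >>>         if (n > 0 && wc.status != IBV_WC_SUCCESS)
>     >>>             fprintf(stderr, "ctx error: %s (%d)\n",
>     >>>                     ibv_wc_status_str(wc.status), wc.status);
>     >>>     }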
>     >>>
>     >>> I tested the same installation base on two other nodes with
>     >>> ConnectX3 and RDMA-Hadoop works without showing that message.
>     >>> So I suppose this error is due to the fact that ConnectX5 cards
>     >>> have 2 ports that are exposed to applications as independent
>     >>> devices by the new Mellanox driver (4.0, the one that supports
>     >>> CX5) and RDMA-Hadoop cannot determine which device to use. In
>     >>> other software we must specify the device (and sometimes even the
>     >>> port) to use, such as "mlx5_0", to solve similar problems.
>     >>>
>     >>> Is there a way to specify, in the RDMA-based Hadoop (and plugin)
>     >>> setup, the proper IB device to use?
>     >>>
>     >>> Thanks.
>     >>>
>     >>> Vittorio Rebecchi
>     >
>
>


