[Hadoop-RDMA-discuss] Problem with RDMA-Hadoop and Mellanox CX5
Vittorio Rebecchi - A3Cube Inc.
vittorio at a3cube-inc.com
Wed Jun 28 11:57:46 EDT 2017
OK, the antispam is always right :-)
I'm sending the Dropbox link:
https://dl.dropboxusercontent.com/u/47682138/etc.tar.gz
-- Vittorio
On 28/06/2017 17:55, Mark Goddard wrote:
> Hi Xiaoyi,
>
> Gmail doesn't seem to want to let me share a tarball via email, so I've
> uploaded it to Google Drive. You should be able to access it here:
> https://drive.google.com/drive/folders/0B29_pUBk8Ck8TnJabUpLS1lCQW8?usp=sharing.
>
> Thanks,
> Mark
>
> On 28 June 2017 at 16:50, Xiaoyi Lu <lu.932 at osu.edu> wrote:
>
> Hi, Vittorio,
>
> It seems your attachment somehow was not sent successfully. We
> could not receive it. This may be due to some security check settings.
>
> Maybe you need to rename the file and send it again.
>
> Xiaoyi
>
> > On Jun 28, 2017, at 11:37 AM, Vittorio Rebecchi - A3Cube Inc.
> > <vittorio at a3cube-inc.com> wrote:
> >
> > Thank you all :-)
> >
> > Sorry! I had already prepared the configuration tarball, but I
> > forgot to add it ;-) together with the logs.
> > Just for information, this tarball was taken from the first node
> > of the cluster, which acts as both resourcemanager and nodemanager
> > (because our distributed filesystem "anima" does not have the
> > metadata overhead of HDFS). The other nodes have a similar setup,
> > and plain Hadoop with "anima PFS" works without any problem with
> > this setup.
> > In the slaves file you will find names like "FORTISSIMO<X>DATA", which
> > we associate with the IPs assigned to the IB cards, while the name
> > "FORTISSIMO<X>" is associated with the management network of our
> > clusters (a normal Ethernet network).
> >
> > Thank you again for your attention.
> >
> > -- Vittorio Rebecchi
> >
> > On 28/06/2017 16:58, Xiaoyi Lu wrote:
> >> Thanks for your feedback. As Dr. Panda mentioned, we are
> >> looking into this issue. We will get back to you later.
> >>
> >> Mark - Can you please send us your logs and configurations?
> >>
> >> Vittorio - Thanks for sending us your logs. Can you please send
> us your configurations as well?
> >>
> >> Thanks,
> >> Xiaoyi
> >>
> >>> On Jun 28, 2017, at 9:58 AM, Mark Goddard <mark at stackhpc.com> wrote:
> >>>
> >>> Thanks for the response Vittorio,
> >>>
> >>> I see the same behaviour, with the NICs being enumerated in
> >>> reverse order. What's odd in my setup, though, is that the link of
> >>> the second port is actually down, but this doesn't prevent
> >>> ib_send_bw or RDMA-Hadoop from trying to use it.
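> >>>
> >>> (A minimal sketch, assuming libibverbs, of the kind of port-state
> >>> check a caller could make before picking a device; the helper name is
> >>> just illustrative, not from RDMA-Hadoop or the perftest tools:)
> >>>
> >>> #include <stdio.h>
> >>> #include <infiniband/verbs.h>
> >>>
> >>> /* Returns 1 if the given port on an opened device is ACTIVE,
> >>>  * 0 otherwise (or if the query itself fails). */
> >>> static int port_is_active(struct ibv_context *ctx, uint8_t port)
> >>> {
> >>>     struct ibv_port_attr attr;
> >>>
> >>>     if (ibv_query_port(ctx, port, &attr))
> >>>         return 0;
> >>>     printf("port %d state: %s\n", port,
> >>>            attr.state == IBV_PORT_ACTIVE ? "ACTIVE" : "not active");
> >>>     return attr.state == IBV_PORT_ACTIVE;
> >>> }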
> >>>
> >>> Regards,
> >>> Mark
> >>>
> >>> On 28 June 2017 at 14:43, <vittorio at a3cube-inc.com> wrote:
> >>> Hi Mark,
> >>>
> >>> I think you have the same issue we have with Mellanox
> ConnectX5 cards.
> >>>
> >>> The new Mellanox driver maps the ports as independent devices, mlx5_0
> >>> and mlx5_1, with one port each (I've seen that you have dual-port
> >>> Mellanox CX4 cards; we have dual-channel Mellanox ConnectX5 cards).
> >>> The problem is that the driver (the new one, I think) activates
> >>> them in the opposite order, mlx5_1 before mlx5_0, so when your
> >>> software starts to use the "first" IB device (by a sort of index),
> >>> it will receive the mlx5_1 device, which in your case is the
> >>> "ethernet" port, before mlx5_0. You can see this by running
> >>> ibv_devinfo.
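> >>>
> >>> (A minimal sketch, assuming libibverbs, that just prints the devices
> >>> in the order the verbs library enumerates them; on a node with the
> >>> behaviour described above it would show mlx5_1 at index 0:)
> >>>
> >>> #include <stdio.h>
> >>> #include <infiniband/verbs.h>
> >>>
> >>> int main(void)
> >>> {
> >>>     int num = 0;
> >>>     struct ibv_device **devs = ibv_get_device_list(&num);
> >>>
> >>>     if (!devs) {
> >>>         perror("ibv_get_device_list");
> >>>         return 1;
> >>>     }
> >>>     /* Any code that blindly takes index 0 gets whatever the driver
> >>>      * happened to register first, not necessarily mlx5_0. */
> >>>     for (int i = 0; i < num; i++)
> >>>         printf("index %d -> %s\n", i, ibv_get_device_name(devs[i]));
> >>>     ibv_free_device_list(devs);
> >>>     return 0;
> >>> }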
> >>>
> >>> The solution is to tell your software to use a specific IB
> >>> device, by its name. For example, with ib_write_bw I must use the
> >>> option "-d mlx5_0" to be able to run these simple tests on
> >>> ConnectX5 cards.
> >>> We do the same with our distributed filesystem "Anima" for IB
> >>> data exchange, and everything works nicely. With RDMA-Hadoop I
> >>> think we need an option to specify the proper device, because I
> >>> suspect that multi-card or multi-channel systems will have the
> >>> same issue.
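> >>>
> >>> (A minimal sketch, assuming libibverbs, of the same "pick the device
> >>> by its name" idea in verbs code; the helper name open_by_name is just
> >>> illustrative, not an RDMA-Hadoop function:)
> >>>
> >>> #include <stdio.h>
> >>> #include <string.h>
> >>> #include <infiniband/verbs.h>
> >>>
> >>> /* Open a specific HCA by name instead of taking index 0. */
> >>> static struct ibv_context *open_by_name(const char *name)
> >>> {
> >>>     int num = 0;
> >>>     struct ibv_device **devs = ibv_get_device_list(&num);
> >>>     struct ibv_context *ctx = NULL;
> >>>
> >>>     if (!devs)
> >>>         return NULL;
> >>>     for (int i = 0; i < num; i++) {
> >>>         if (strcmp(ibv_get_device_name(devs[i]), name) == 0) {
> >>>             ctx = ibv_open_device(devs[i]);   /* NULL on failure */
> >>>             break;
> >>>         }
> >>>     }
> >>>     ibv_free_device_list(devs);
> >>>     return ctx;
> >>> }
> >>>
> >>> int main(void)
> >>> {
> >>>     struct ibv_context *ctx = open_by_name("mlx5_0");
> >>>
> >>>     if (!ctx) {
> >>>         fprintf(stderr, "could not open mlx5_0\n");
> >>>         return 1;
> >>>     }
> >>>     printf("opened %s\n", ibv_get_device_name(ctx->device));
> >>>     ibv_close_device(ctx);
> >>>     return 0;
> >>> }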
> >>>
> >>> Just to be sure, I have run the same setup in a mini-cluster
> >>> with single-channel ConnectX3 cards, and the IB communication between
> >>> nodes works.
> >>> Then we have another issue, but I think it is a matter of the
> >>> map/reduce configuration together with IB, as I reported in my
> >>> post, because teragen (which creates the dataset) works fine but
> >>> terasort has some problems during the reduce phase.
> >>>
> >>> Best Regards.
> >>>
> >>> -- Vittorio Rebecchi
> >>>
> >>>
> >>> On 2017-06-28 11:24, Mark Goddard wrote:
> >>> Hi Vittorio,
> >>>
> >>> It sounds like we're experiencing similar issues. I'm using Mellanox
> >>> ConnectX4 dual-port NICs with port 0 in IB mode and am unable to start
> >>> HDFS services. I've not tried running YARN.
> >>>
> >>> Here's my email to this list on the issue:
> >>>
> >>> http://mailman.cse.ohio-state.edu/pipermail/rdma-hadoop-discuss/2017-June/000095.html
> >>>
> >>> Regards,
> >>> Mark
> >>>
> >>> On 27 June 2017 at 16:09, Vittorio Rebecchi - A3Cube Inc.
> >>> <vittorio at a3cube-inc.com <mailto:vittorio at a3cube-inc.com>> wrote:
> >>>
> >>> Hi Xiaoyi,
> >>>
> >>> thank you for your attention.
> >>>
> >>> I'm sending, as an attachment to this mail, our configuration and logs
> >>> from the clusters on which I run RDMA-Hadoop.
> >>> I've managed to make progress with terasort by removing all the
> >>> optimizations I usually add. The logs were regenerated today for a
> >>> better picture of the problem.
> >>>
> >>> Let me describe in more detail the environment in which we use
> >>> RDMA-Hadoop:
> >>> we use neither HDFS nor Lustre as the filesystem but our own
> >>> distributed filesystem, "A3Cube Anima", and it works really well
> >>> with Hadoop through our specific plug-in. You will see its setup in
> >>> the confs together with RDMA-Hadoop. We usually run terasort on our
> >>> distributed filesystem without any problems.
> >>> I'm trying to use RDMA-Hadoop for its improvements in map/reduce with
> >>> IB. Teragen works fine and creates the files to process with
> >>> terasort. Then we run terasort with: ./bin/hadoop jar
> >>> ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar
> >>> terasort teraInput teraOutput
> >>>
> >>> The IB IPC communication in YARN and other Hadoop parts works
> >>> fine.
> >>> Terasort completes the map phase and, when it's time for reduce,
> >>> after a while, for example at "mapreduce.Job: map 100% reduce 33%", it
> >>> won't go further and the logs report the following lines:
> >>> 2017-06-27 07:11:12,277 INFO
> >>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl:
> >>> Removed completed containers from NM context:
> >>> [container_1498572173359_0002_01_000154,
> >>> container_1498572173359_0002_01_000151]
> >>> 2017-06-27 07:15:26,097 INFO
> >>> org.apache.hadoop.mapred.HOMRShuffleHandler: RDMA Receiver 0 is
> >>> turning off!!!
> >>>
> >>> What I have described happens on the 2-node ConnectX-3 cluster
> >>> (i7-4770 CPUs with 16 GB RAM). Logs1.tar.gz is the log of node1,
> >>> logs2.tar.gz is the log of node2, etc.tar.gz has the conf files, and
> >>> terasort.sh is the terasort script with some optimizations that break
> >>> Java communication after a while.
> >>>
> >>> I have added terasort.sh with some optimizations we usually use
> >>> with terasort because, during the reduce phase, the nodemanager crashes
> >>> with the following output:
> >>>
> >>> # A fatal error has been detected by the Java Runtime Environment:
> >>> #
> >>> # SIGSEGV (0xb) at pc=0x00007fb5c0600761, pid=18982,
> >>> tid=0x00007fb5bcebd700
> >>> #
> >>> # JRE version: Java(TM) SE Runtime Environment (8.0_131-b11)
> (build
> >>> 1.8.0_131-b11)
> >>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed
> mode
> >>> linux-amd64 compressed oops)
> >>> # Problematic frame:
> >>> # C [librdmamrserver.so.1.1.0+0x7761]
> >>> ucr_setup_ib_qp_init_params+0x21
> >>>
> >>> I have added the logs for this run; the file is called
> >>> logs_extended.tar.gz
> >>>
> >>> Let me add that the same setup I have reported won't work on our
> >>> ConnectX5 cluster, because I think RDMA-Hadoop cannot connect to the
> >>> proper Mellanox device in the system.
> >>> It seems to be mandatory to specify which Mellanox device to use
> >>> (mlx5_0 or mlx5_1) even for the ib_write_bw testing command.
> >>>
> >>> Do you have any hints on the map/reduce setup? Is there
> >>> something to fix in the setup to allow terasort to complete its
> >>> processing?
> >>> Do you have any suggestions for enabling RDMA-Hadoop with
> >>> Mellanox ConnectX5 cards and for fixing the reduce behaviour? I have
> >>> checked the configuration against your manual and everything seems
> >>> correct.
> >>>
> >>> Thanks in advance.
> >>>
> >>> Bye
> >>>
> >>> -- Vittorio Rebecchi
> >>>
> >>> On 26/06/2017 19:48, Xiaoyi Lu wrote:
> >>> Hi, Vittorio,
> >>>
> >>> Thanks for your interest in our project. We locally tried to run
> >>> some benchmarks on our ConnectX-5 nodes and things run fine.
> >>>
> >>> Can you please send us your logs, confs, and exact commands?
> We can
> >>> try to reproduce this and get back to you.
> >>>
> >>> Thanks,
> >>> Xiaoyi
> >>>
> >>> On Jun 26, 2017, at 9:20 AM, vittorio at a3cube-inc.com wrote:
> >>>
> >>> Hi,
> >>>
> >>> My name is Vittorio Rebecchi and I'm testing RDMA-based Apache
> >>> Hadoop with an 8-node cluster with one Mellanox ConnectX 5 card on
> >>> each machine.
> >>>
> >>> The Mellanox ConnectX 5 cards have 2 ports and they are mapped as 2
> >>> independent devices (mlx5_0 and mlx5_1) on the OS by the Mellanox
> >>> drivers. The cards work properly (tested with ib_write_lat and
> >>> ib_write_bw), but I must specify which IB device to use (in
> >>> ib_write_lat, for example, I must specify "-d mlx5_0").
> >>>
> >>> On my setup, currently, YARN starts but the nodemanager nodes
> >>> report the following message:
> >>>
> >>> ctx error: ibv_poll_cq() failed: IBV_WC_SUCCESS != wc.status
> >>> IBV_WC_SUCCESS != wc.status (12)
> >>> ucr_probe_blocking return value -1
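> >>>
> >>> (A minimal sketch, assuming libibverbs, of the generic verbs completion
> >>> check that produces a message of this kind; this is not the
> >>> RDMA-Hadoop source, and the helper name is illustrative:)
> >>>
> >>> #include <stdio.h>
> >>> #include <infiniband/verbs.h>
> >>>
> >>> /* Returns 0 on a good completion, -1 on error or an empty CQ. */
> >>> static int check_one_completion(struct ibv_cq *cq)
> >>> {
> >>>     struct ibv_wc wc;
> >>>     int n = ibv_poll_cq(cq, 1, &wc);
> >>>
> >>>     if (n <= 0)
> >>>         return -1;              /* poll failed or nothing completed */
> >>>     if (wc.status != IBV_WC_SUCCESS) {
> >>>         /* In the standard ibv_wc_status enum, a status of 12 is
> >>>          * IBV_WC_RETRY_EXC_ERR: the QP retried and never reached the
> >>>          * peer, consistent with talking to the wrong device/port. */
> >>>         fprintf(stderr, "IBV_WC_SUCCESS != wc.status (%d)\n", wc.status);
> >>>         return -1;
> >>>     }
> >>>     return 0;
> >>> }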
> >>>
> >>> I tested the same installation on two other nodes with ConnectX
> >>> 3, and RDMA-Hadoop works without showing that message.
> >>> So I suppose this error is due to the fact that ConnectX5 cards have 2
> >>> ports that are exposed to applications as independent devices by the
> >>> new Mellanox driver (4.0, the one that supports CX5), and
> >>> RDMA-Hadoop cannot establish which device to use. In other software
> >>> we must specify the device (and sometimes even the port) to use, such
> >>> as "mlx5_0", to solve similar problems.
> >>>
> >>> Is there a way to specify, in the RDMA-based Hadoop (and plugin)
> >>> setup, the proper IB device to use?
> >>>
> >>> Thanks.
> >>>
> >>> Vittorio Rebecchi