[Hadoop-RDMA-discuss] Problem with RDMA-Hadoop and Mellanox CX5
Xiaoyi Lu
lu.932 at osu.edu
Wed Jun 28 10:58:30 EDT 2017
Thanks for your feedback. As Dr. Panda mentioned, we are looking into this issue. Will get back to you later.
Mark - Can you please send us your logs and configurations?
Vittorio - Thanks for sending us your logs. Can you please send us your configurations as well?
Thanks,
Xiaoyi
> On Jun 28, 2017, at 9:58 AM, Mark Goddard <mark at stackhpc.com> wrote:
>
> Thanks for the response Vittorio,
>
> I see the same behaviour with the NICs being enumerated in reverse order. What's odd in my setup though is that the link of the second port is actually down, but this doesn't prevent ib_send_bw or RDMA-Hadoop from trying to use it.
>
> Regards,
> Mark
>
> On 28 June 2017 at 14:43, <vittorio at a3cube-inc.com> wrote:
> Hi Mark,
>
> I think you have the same issue we have with Mellanox ConnectX-5 cards.
>
> The new Mellanox driver (I've seen that you have dual-port Mellanox CX4 cards) maps the ports as independent devices, mlx5_0 and mlx5_1, with one port each. We have dual-port Mellanox ConnectX-5 cards.
> The problem is that the driver (the new one, I think) enumerates them in the opposite order, mlx5_1 before mlx5_0, so when your software picks the "first" IB device (by index) it gets mlx5_1 instead of mlx5_0, which in your case is the "ethernet" port. You can see this by running ibv_devinfo.
>
> The solution is to tell your software to use a specific IB device by name. For example, with ib_write_bw I must use the option "-d mlx5_0" to be able to run these simple tests on ConnectX-5 cards.
> We do the same with our distributed filesystem "Anima" for IB data exchange, and everything works nicely. With RDMA-Hadoop I think we need an option to specify the proper device, because I suspect that multi-card or multi-port systems will hit the same issue.
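> As a quick sketch (assuming the standard perftest tools are installed; the server hostname below is a placeholder), checking the enumeration order and pinning the test to one HCA looks like this:
>
>   # list the devices in the order the driver exposes them
>   ibv_devinfo | grep hca_id
>
>   # on the server node
>   ib_write_bw -d mlx5_0
>
>   # on the client node, pointing at the server
>   ib_write_bw -d mlx5_0 <server-hostname>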
>
> Just to be sure, I have run the same setup on a mini-cluster with single-port ConnectX-3 cards, and the IB communication between nodes works.
> Then we have another issue, but I think it is a matter of the map/reduce configuration together with IB, as I reported in my post, because teragen (which creates the dataset) works fine but terasort has some problems during the reduce phase.
>
> Best Regards.
>
> -- Vittorio Rebecchi
>
>
> On 2017-06-28 11:24, Mark Goddard wrote:
> Hi Vittorio,
>
> It sounds like we're experiencing similar issues. I'm using Mellanox
> ConnectX-4 dual-port NICs with port 0 in IB mode and am unable to start
> HDFS services. I've not tried running YARN.
>
> Here's my email to this list on the issue:
> http://mailman.cse.ohio-state.edu/pipermail/rdma-hadoop-discuss/2017-June/000095.html.
>
> Regards,
> Mark
>
> On 27 June 2017 at 16:09, Vittorio Rebecchi - A3Cube Inc.
> <vittorio at a3cube-inc.com> wrote:
>
> Hi Xiaoyi,
>
> Thanks for your attention.
>
> I'm sending, as attachments to this mail, our configuration and logs
> from the clusters on which I run RDMA-Hadoop.
> I've managed to get further with terasort by removing all the
> optimizations I usually add. The logs were regenerated today to give a
> better picture of the problem.
>
> Let me describe the environment in which we use RDMA-Hadoop in more
> detail: we use neither HDFS nor Lustre as the filesystem but our own
> distributed filesystem, "A3Cube Anima", which works really well with
> Hadoop through our specific plug-in. You will see its setup in the
> confs together with RDMA-Hadoop. We usually run terasort on our
> distributed filesystem without any problems.
> I'm trying to use RDMA-Hadoop for its map/reduce improvements over
> IB. Teragen works fine and creates the files to process with
> terasort. Then we run terasort with:
>
> ./bin/hadoop jar \
>   ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar \
>   terasort teraInput teraOutput
>
> The IB IPC communication in YARN and other Hadoop parts works fine.
> Terasort completes the map phase, but some way into the reduce phase,
> for example at "mapreduce.Job: map 100% reduce 33%", it
> won't go any further and the logs report the following lines:
> 2017-06-27 07:11:12,277 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl:
> Removed completed containers from NM context:
> [container_1498572173359_0002_01_000154,
> container_1498572173359_0002_01_000151]
> 2017-06-27 07:15:26,097 INFO
> org.apache.hadoop.mapred.HOMRShuffleHandler: RDMA Receiver 0 is
> turning off!!!
>
> What I have described happens on the two-node ConnectX-3 cluster
> (i7-4770 CPUs with 16 GB RAM). Logs1.tar.gz is the log of node1,
> logs2.tar.gz is the log of node2, etc.tar.gz has the conf files, and
> terasort.sh is the terasort script with some optimizations that break
> Java communication after a while.
>
> I have added terasort.sh with some optimizations we usually use with
> terasort because, during the reduce phase, the NodeManager crashes with
> the following output:
>
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x00007fb5c0600761, pid=18982,
> tid=0x00007fb5bcebd700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_131-b11) (build
> 1.8.0_131-b11)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode
> linux-amd64 compressed oops)
> # Problematic frame:
> # C [librdmamrserver.so.1.1.0+0x7761]
> ucr_setup_ib_qp_init_params+0x21
>
> I have added the logs for this run; the archive is called
> logs_extended.tar.gz.
>
> Let me add that the same setup I have reported won't work on our
> ConnectX-5 cluster, because I think RDMA-Hadoop cannot connect to the
> proper Mellanox device in the system.
> It appears to be mandatory to specify which Mellanox device to use
> (mlx5_0 or mlx5_1), even for the ib_write_bw test command.
>
> Do you have any hints on the map/reduce setup? Is there
> something to fix in the setup to allow terasort to complete its
> processing?
> Do you have any suggestions for getting RDMA-Hadoop to work with
> Mellanox ConnectX-5 cards and for fixing the reduce behaviour? I have
> checked the configuration against your manual and everything seems
> correct.
>
> Thanks in advance.
>
> Bye
>
> -- Vittorio Rebecchi
>
> On 26/06/2017 19:48, Xiaoyi Lu wrote:
> Hi, Vittorio,
>
> Thanks for your interest in our project. We locally tried to run
> some benchmarks on our ConnectX-5 nodes and things run fine.
>
> Can you please send us your logs, confs, and exact commands? We can
> try to reproduce this and get back to you.
>
> Thanks,
> Xiaoyi
>
> On Jun 26, 2017, at 9:20 AM, vittorio at a3cube-inc.com wrote:
>
> Hi,
>
> My name is Vittorio Rebecchi and I'm testing RDMA-based Apache
> Hadoop on an 8-node cluster with one Mellanox ConnectX-5 card in
> each machine.
>
> The Mellanox ConnectX-5 cards have 2 ports and they are mapped as 2
> independent devices (mlx5_0 and mlx5_1) on the OS by the Mellanox
> drivers. The cards work properly (tested with ib_write_lat and
> ib_write_bw) but I must specify which IB device to use (in
> ib_write_lat, for example, I must specify "-d mlx5_0").
>
> On my setup, currently, YARN starts but the NodeManager nodes report
> the following message:
>
> ctx error: ibv_poll_cq() failed: IBV_WC_SUCCESS != wc.status
> IBV_WC_SUCCESS != wc.status (12)
> ucr_probe_blocking return value -1
>
> I tested the same installation base on two other nodes with ConnectX-3
> cards and RDMA-Hadoop works without showing that message.
> So I suppose this error is due to the fact that ConnectX-5 cards have 2
> ports that are exposed to applications as independent devices by the
> new Mellanox driver (4.0, the one that supports CX5), and
> RDMA-Hadoop cannot determine which device to use. In other software
> we must specify the device (and sometimes even the port) to use, e.g.
> "mlx5_0", to solve similar problems.
>
> Is there a way to specify, in the RDMA-based Hadoop (and plugin) setup,
> the proper IB device to use?
>
> Thanks.
>
> Vittorio Rebecchi
> _______________________________________________
> RDMA-Hadoop-discuss mailing list
> RDMA-Hadoop-discuss at cse.ohio-state.edu
>
> http://mailman.cse.ohio-state.edu/mailman/listinfo/rdma-hadoop-discuss
>