[Hadoop-RDMA-discuss] Problem with RDMA-Hadoop and Mellanox CX5
Panda, Dhabaleswar
panda at cse.ohio-state.edu
Wed Jun 28 10:10:48 EDT 2017
We are taking a look at this issue to see what could be happening here. We will keep you updated.
Thanks,
DK
________________________________
From: rdma-hadoop-discuss-bounces at cse.ohio-state.edu on behalf of Mark Goddard [mark at stackhpc.com]
Sent: Wednesday, June 28, 2017 9:58 AM
To: Vittorio Rebecchi - A3Cube Inc.
Cc: rdma-hadoop-discuss at cse.ohio-state.edu
Subject: Re: [Hadoop-RDMA-discuss] Problem with RDMA-Hadoop and Mellanox CX5
Thanks for the response, Vittorio.
I see the same behaviour, with the NICs being enumerated in reverse order. What's odd in my setup, though, is that the link on the second port is actually down, yet this doesn't prevent ib_send_bw or RDMA-Hadoop from trying to use it.
Regards,
Mark
On 28 June 2017 at 14:43, <vittorio at a3cube-inc.com> wrote:
Hi Mark,
I think you have the same issue we have with Mellanox ConnectX-5 cards.
The new Mellanox driver (I've seen that you have dual-port Mellanox CX4 cards) maps the two ports as independent devices, mlx5_0 and mlx5_1, with one port each. We have dual-channel Mellanox ConnectX-5 cards.
The problem is that the driver (the new one, I think) activates them in the opposite order, mlx5_1 before mlx5_0, so when your software opens the "first" IB device (by a sort of index) it will get mlx5_1 before mlx5_0, which in your case is the "ethernet" port. You can see this by running ibv_devinfo.
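For example, something along these lines will list the devices the driver exposes and the state of each port (the grep pattern is just a convenience, and the exact output depends on the driver version):

ibv_devices
ibv_devinfo | egrep 'hca_id|port:|state|link_layer'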
The solution is to tell your software to use a specific IB device by name. For example, with ib_write_bw I must use the option "-d mlx5_0" to be able to run these simple tests on ConnectX-5 cards.
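For instance, a minimal point-to-point test with the device pinned on both ends looks roughly like this ("server-node" is just a placeholder hostname):

ib_write_bw -d mlx5_0                # on the server node
ib_write_bw -d mlx5_0 server-node    # on the client node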
We do the same with our distributed filesystem "Anima" for IB data exchange, and everything works nicely. With RDMA-Hadoop I think we need an option to specify the proper device, because I suspect that multi-card or multi-channel systems will have the same issue.
Just to be sure, I have run the same setup on a mini-cluster with single-channel ConnectX-3 cards, and the IB communication between nodes works.
We then have another issue, but I think it is a matter of configuring the map/reduce section together with IB, as I reported in my post: teragen (which creates the dataset) works fine, but terasort has some problems during the reduce phase.
Best Regards.
-- Vittorio Rebecchi
On 2017-06-28 11:24, Mark Goddard wrote:
Hi Vittorio,
It sounds like we're experiencing similar issues. I'm using Mellanox
ConnectX-4 dual-port NICs with port 0 in IB mode and am unable to start
the HDFS services. I've not tried running YARN.
Here's my email to this list on the issue:
http://mailman.cse.ohio-state.edu/pipermail/rdma-hadoop-discuss/2017-June/000095.html.
Regards,
Mark
On 27 June 2017 at 16:09, Vittorio Rebecchi - A3Cube Inc.
<vittorio at a3cube-inc.com> wrote:
Hi Xiaoyi,
thank you for your attention.
I'm sending, attached to this mail, our configuration and the logs
from the clusters on which I run RDMA-Hadoop.
I've managed to get further with terasort by removing all the
optimizations I usually add. The logs were regenerated today to give a
better picture of the problem.
Let me describe in more detail the environment in which we use RDMA-Hadoop:
we use neither HDFS nor Lustre as the filesystem, but our own
distributed filesystem, "A3Cube Anima", and it works really well
with Hadoop through our specific plug-in. You will see its setup in the
confs together with RDMA-Hadoop. We usually run terasort on our
distributed filesystem without any problems.
I'm trying to use RDMA-Hadoop for its map/reduce improvements over
IB. Teragen works fine and creates the files to be processed by
terasort. We then run terasort with:
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar terasort teraInput teraOutput
The IB IPC communication in YARN and the other Hadoop parts works fine.
Terasort completes the map phase and then, some way into the reduce phase,
for example at "mapreduce.Job: map 100% reduce 33%", it
won't go any further and the logs report the following lines:
2017-06-27 07:11:12,277 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_1498572173359_0002_01_000154, container_1498572173359_0002_01_000151]
2017-06-27 07:15:26,097 INFO org.apache.hadoop.mapred.HOMRShuffleHandler: RDMA Receiver 0 is turning off!!!
What I have described happens on the 2-node ConnectX-3 cluster
(i7-4770 CPUs with 16 GB of RAM). Logs1.tar.gz is the log of node 1,
logs2.tar.gz is the log of node 2, etc.tar.gz has the conf files, and
terasort.sh is the terasort script with some optimizations that break
the Java communication after a while.
I have included terasort.sh with some optimizations we usually use with
terasort because, during the reduce phase, the nodemanager crashes with
the following output:
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fb5c0600761, pid=18982, tid=0x00007fb5bcebd700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_131-b11) (build 1.8.0_131-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [librdmamrserver.so.1.1.0+0x7761]  ucr_setup_ib_qp_init_params+0x21
I have added the logs from this run as well; the archive is called
logs_extended.tar.gz.
Let me add that the same setup I have reported won't work on our
ConnectX-5 cluster, because I think RDMA-Hadoop cannot connect to the
proper Mellanox device in the system.
It seems to be mandatory to specify which Mellanox device to use
(mlx5_0 or mlx5_1) even for the ib_write_bw testing command.
Do you have any hints on the map/reduce setup? Is there
something to fix in the configuration to allow terasort to complete its
processing?
Do you have any suggestions for enabling RDMA-Hadoop with
Mellanox ConnectX-5 cards and for fixing the reduce behaviour? I have
checked the configuration against your manual and everything seems
correct.
Thanks in advance.
Bye
-- Vittorio Rebecchi
On 26/06/2017 19:48, Xiaoyi Lu wrote:
Hi, Vittorio,
Thanks for your interest in our project. We locally tried to run
some benchmarks on our ConnectX-5 nodes and things run fine.
Can you please send us your logs, confs, and exact commands? We can
try to reproduce this and get back to you.
Thanks,
Xiaoyi
On Jun 26, 2017, at 9:20 AM, vittorio at a3cube-inc.com wrote:
Hi,
My name is Vittorio Rebecchi and I'm testing RDMA-based Apache
Hadoop on an 8-node cluster with one Mellanox ConnectX-5 card in
each machine.
The Mellanox ConnectX-5 cards have 2 ports and they are mapped as 2
independent devices (mlx5_0 and mlx5_1) on the OS by the Mellanox
drivers. The cards work properly (tested with ib_write_lat and
ib_write_bw), but I must specify which IB device to use (in
ib_write_lat, for example, I must specify "-d mlx5_0").
On my setup, currently, YARN starts but the nodemanager nodes report
the following message:
ctx error: ibv_poll_cq() failed: IBV_WC_SUCCESS != wc.status
IBV_WC_SUCCESS != wc.status (12)
ucr_probe_blocking return value -1
I tested the same installation base on two other nodes with ConnectX-3
cards, and RDMA-Hadoop works without showing that message.
So I suppose this error is due to the fact that ConnectX-5 cards have 2
ports that are exposed to applications as independent devices by the
new Mellanox driver (4.0, the one that supports CX5), and
RDMA-Hadoop cannot establish which device to use. In other software
we must specify the device (and sometimes even the port) to use, such as
"mlx5_0", to solve similar problems.
Is there a way to specify, in the RDMA-based Hadoop (and plugin) setup,
the proper IB device to use?
Thanks.
Vittorio Rebecchi
_______________________________________________
RDMA-Hadoop-discuss mailing list
RDMA-Hadoop-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/rdma-hadoop-discuss