[Hadoop-RDMA-discuss] Problem with RDMA-Hadoop and Mellanox CX5

Vittorio Rebecchi - A3Cube Inc. vittorio at a3cube-inc.com
Tue Jun 27 11:09:09 EDT 2017


Hi Xiaoyi,

thank you for your attention.

I'm sending, as attachments to this mail, our configuration and logs from 
the clusters on which I run RDMA-hadoop.
I have managed to get further with terasort by removing all the 
optimizations I usually add. The logs were regenerated today to give a 
better picture of the problem.

Let me better describe the environment in which we use RDMA-hadoop: we 
use neither HDFS nor Lustre as the filesystem but our own distributed 
filesystem, "A3Cube Anima", which works really well with hadoop through 
our specific plug-in. You will see its setup in the confs together with 
RDMA-hadoop. We usually run terasort on our distributed filesystem 
without any problems.
I'm trying to use RDMA-hadoop for its improvements in map/reduce over IB. 
Teragen works fine and creates the files for terasort to process. 
Then we run terasort with: ./bin/hadoop jar 
./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar terasort 
teraInput teraOutput

The IB IPC communication in yarn and other hadoop parts works fine. 
Terasort completes the map phase and, some time into the reduce phase, 
for example at "mapreduce.Job:  map 100% reduce 33%", it won't go any 
further and the logs report the following lines:
     2017-06-27 07:11:12,277 INFO 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed 
completed containers from NM context: 
[container_1498572173359_0002_01_000154, 
container_1498572173359_0002_01_000151]
     2017-06-27 07:15:26,097 INFO 
org.apache.hadoop.mapred.HOMRShuffleHandler: RDMA Receiver 0 is turning 
off!!!
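
To locate the moment the shuffle receiver stops in a long nodemanager log, a small filter can help. The following is a minimal sketch, assuming each log entry sits on one physical line in the log file (the lines above are wrapped by the mail client) and follows the timestamp layout shown above. The function name `find_receiver_shutdowns` is hypothetical and is not part of RDMA-hadoop.

```python
import re
from datetime import datetime

# Layout of the log lines quoted above: "<date> <time>,<ms> <LEVEL> <logger>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\s+"
    r"(?P<level>[A-Z]+)\s+(?P<logger>\S+): (?P<msg>.*)$"
)

def find_receiver_shutdowns(lines):
    """Return (timestamp, message) pairs for 'RDMA Receiver ... turning off' entries."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # continuation lines, stack traces, blank lines
        msg = m.group("msg")
        if "RDMA Receiver" in msg and "turning off" in msg:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            hits.append((ts, msg))
    return hits

# The two entries from this mail, joined back into single lines.
sample = [
    "2017-06-27 07:11:12,277 INFO "
    "org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed "
    "completed containers from NM context: "
    "[container_1498572173359_0002_01_000154, "
    "container_1498572173359_0002_01_000151]",
    "2017-06-27 07:15:26,097 INFO "
    "org.apache.hadoop.mapred.HOMRShuffleHandler: RDMA Receiver 0 is turning "
    "off!!!",
]

for ts, msg in find_receiver_shutdowns(sample):
    print(ts.isoformat(), msg)
```

Comparing the reported timestamps with the point where the job progress stops moving shows whether the receiver shutdown precedes or follows the stall.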

What I have described happens on the 2-node ConnectX-3 cluster (i7-4770 
CPUs with 16 GB RAM). logs1.tar.gz is the log of node1, logs2.tar.gz is 
the log of node2, etc.tar.gz has the conf files and terasort.sh is the 
terasort script with some optimizations that break Java communication 
after a while.

I have added terasort.sh with some optimizations we usually use with 
terasort because, during the reduce phase, the nodemanager crashes with 
the following output:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fb5c0600761, pid=18982, tid=0x00007fb5bcebd700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_131-b11) (build 1.8.0_131-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [librdmamrserver.so.1.1.0+0x7761] ucr_setup_ib_qp_init_params+0x21
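
When several crash reports have to be compared, the "Problematic frame" entry can be pulled out automatically. The following is a minimal sketch, assuming the report header has the layout shown above. The helper name `parse_problematic_frame` is hypothetical. The leading "C" in the frame line is the JVM's marker for a native code frame, so the crash here is reported inside the native library rather than in Java code.

```python
import re

# Matches a frame line such as:
#   "# C  [libfoo.so+0x1234] some_symbol+0x21"
FRAME_RE = re.compile(
    r"^#\s*(?P<kind>\w)\s+\[(?P<lib>.+?)\+(?P<offset>0x[0-9a-fA-F]+)\]"
    r"(?:\s+(?P<symbol>[^\s+]+)\+(?P<sym_off>0x[0-9a-fA-F]+))?"
)

def parse_problematic_frame(report_lines):
    """Return details of the line that follows '# Problematic frame:', or None."""
    lines = iter(report_lines)
    for line in lines:
        if line.strip().startswith("# Problematic frame"):
            match = FRAME_RE.match(next(lines, "").strip())
            if match is None:
                return None
            sym_off = match.group("sym_off")
            return {
                "frame_type": match.group("kind"),
                "library": match.group("lib"),
                "library_offset": int(match.group("offset"), 16),
                "symbol": match.group("symbol"),
                "symbol_offset": int(sym_off, 16) if sym_off else None,
            }
    return None

# The relevant lines from the crash output in this mail.
sample_report = [
    "# Problematic frame:",
    "# C  [librdmamrserver.so.1.1.0+0x7761] ucr_setup_ib_qp_init_params+0x21",
]

print(parse_problematic_frame(sample_report))
```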

I have attached the logs from this run as logs_terasort_extended.tar.gz.

Let me add that the same setup I have reported won't work on our 
ConnectX-5 cluster because I think RDMA-hadoop cannot connect to the 
proper Mellanox device in the system.
It seems to be mandatory to specify the Mellanox device to use (mlx5_0 
or mlx5_1), even for the ib_write_bw test command.
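
To see which RDMA devices and ports a node exposes, and in which state they are, the Linux sysfs tree can be read directly. The following is a minimal sketch, assuming the standard `/sys/class/infiniband/<device>/ports/<n>/state` layout. The function name `list_ib_ports` is hypothetical, and this only lists devices; it does not tell RDMA-hadoop which one to use.

```python
from pathlib import Path

def list_ib_ports(sysfs_root="/sys/class/infiniband"):
    """Map each RDMA device name to {port_number: state_string}."""
    root = Path(sysfs_root)
    devices = {}
    if not root.is_dir():
        return devices  # no RDMA stack loaded, or not a Linux host
    for dev in sorted(root.iterdir()):
        ports = {}
        ports_dir = dev / "ports"
        if ports_dir.is_dir():
            for port in sorted(ports_dir.iterdir()):
                state_file = port / "state"
                if state_file.is_file():
                    # The file holds text such as "4: ACTIVE" or "1: DOWN".
                    ports[port.name] = state_file.read_text().strip()
        devices[dev.name] = ports
    return devices

if __name__ == "__main__":
    for name, ports in list_ib_ports().items():
        for port, state in ports.items():
            print(f"{name} port {port}: {state}")
```

Running this on each node shows whether mlx5_0 and mlx5_1 are both present and which of them has an active port.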

Do you have any hints on the map/reduce setup? Is there something 
to fix in the setup to allow terasort to complete its processing?
Do you have any suggestions for enabling RDMA-hadoop with 
Mellanox ConnectX-5 cards and for fixing the reduce behaviour? I have 
checked the configuration against your manual and everything seems correct.

Thanks in advance.

Bye

-- Vittorio Rebecchi

On 26/06/2017 19:48, Xiaoyi Lu wrote:
> Hi, Vittorio,
>
> Thanks for your interest in our project. We locally tried to run some benchmarks on our ConnectX-5 nodes and things run fine.
>
> Can you please send us your logs, confs, and exact commands? We can try to reproduce this and get back to you.
>
> Thanks,
> Xiaoyi
>
>> On Jun 26, 2017, at 9:20 AM, vittorio at a3cube-inc.com wrote:
>>
>> Hi,
>>
>> My name is Vittorio Rebecchi and I'm testing RDMA-based Apache hadoop with an 8 node cluster with 1 Mellanox ConnectX 5 card on each machine.
>>
>> The Mellanox ConnectX 5 cards have 2 ports and they are mapped as 2 independent devices (mlx5_0 and mlx5_1) on the OS by the Mellanox drivers. The cards work properly (tested with ib_write_lat and ib_write_bw) but I must specify which IB device to use (in ib_write_lat, for example, I must specify "-d mlx5_0").
>>
>> On my setup, currently, Yarn starts but the nodemanager nodes report the following message:
>>
>> ctx error: ibv_poll_cq() failed: IBV_WC_SUCCESS != wc.status
>> IBV_WC_SUCCESS != wc.status (12)
>> ucr_probe_blocking return value -1
>>
>> I tested the same installation on two other nodes with ConnectX 3 and RDMA-hadoop works without showing that message.
>> So I suppose this error is due to the fact that ConnectX 5 cards have 2 ports that are exposed to applications as independent devices by the new Mellanox driver (4.0, the one that supports CX5) and RDMA-hadoop cannot establish which device to use. In other software we must specify the device (and sometimes even the port) to use, such as "mlx5_0", to solve similar problems.
>>
>> Is there a way to specify, in the RDMA-based hadoop (and plugin) setup, the proper IB device to use?
>>
>> Thanks.
>>
>> Vittorio Rebecchi
>> _______________________________________________
>> RDMA-Hadoop-discuss mailing list
>> RDMA-Hadoop-discuss at cse.ohio-state.edu
>> http://mailman.cse.ohio-state.edu/mailman/listinfo/rdma-hadoop-discuss
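
The number in the quoted "wc.status (12)" message can be decoded against `enum ibv_wc_status` from libibverbs. The following is a minimal sketch, assuming the value in parentheses is the raw enum value printed by the library (this is an assumption about the message format, not something confirmed in this thread). The function name `decode_wc_status` is hypothetical. Under that assumption, 12 is IBV_WC_RETRY_EXC_ERR, the transport retry counter being exceeded, which by itself does not identify the cause.

```python
import re

# Values of `enum ibv_wc_status` as declared in <infiniband/verbs.h>.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    6: "IBV_WC_MW_BIND_ERR",
    7: "IBV_WC_BAD_RESP_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
    14: "IBV_WC_LOC_RDD_VIOL_ERR",
    15: "IBV_WC_REM_INV_RD_REQ_ERR",
    16: "IBV_WC_REM_ABORT_ERR",
    17: "IBV_WC_INV_EECN_ERR",
    18: "IBV_WC_INV_EEC_STATE_ERR",
    19: "IBV_WC_FATAL_ERR",
    20: "IBV_WC_RESP_TIMEOUT_ERR",
    21: "IBV_WC_GENERAL_ERR",
}

def decode_wc_status(message):
    """Return the enum name for a 'wc.status (N)' message, or None if absent."""
    m = re.search(r"wc\.status \((\d+)\)", message)
    if m is None:
        return None
    code = int(m.group(1))
    return IBV_WC_STATUS.get(code, f"unknown ({code})")

print(decode_wc_status("IBV_WC_SUCCESS != wc.status (12)"))
```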


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Quarantined Attachment.txt
URL: <http://mailman.cse.ohio-state.edu/pipermail/rdma-hadoop-discuss/attachments/20170627/8c875df8/attachment-0001.txt>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: logs1.tar.gz
Type: application/gzip
Size: 132499 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/rdma-hadoop-discuss/attachments/20170627/8c875df8/attachment-0004.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: logs2.tar.gz
Type: application/gzip
Size: 517508 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/rdma-hadoop-discuss/attachments/20170627/8c875df8/attachment-0005.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: terasort.sh
Type: application/x-shellscript
Size: 973 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/rdma-hadoop-discuss/attachments/20170627/8c875df8/attachment-0006.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: logs_terasort_extended.tar.gz
Type: application/gzip
Size: 226317 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/rdma-hadoop-discuss/attachments/20170627/8c875df8/attachment-0007.bin>

