[Hadoop-RDMA-discuss] Issue for RDMA Hadoop package

Gugnani, Shashank gugnani.2 at buckeyemail.osu.edu
Mon Apr 29 17:20:14 EDT 2019


Hi Peini,

Can you make sure that you are launching the docker image using a command like this:
docker run --name <name> -it --pid=host --ipc=host --privileged <image> /bin/bash
It is important to use "--pid=host --ipc=host --privileged", otherwise RDMA will not work inside the container.
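
To quickly double-check that the IB device is actually visible inside a running container, something along these lines should work (assuming the libibverbs utilities are installed in the image; these commands are only a suggestion, not part of the RDMA-Hadoop package):

  # run inside the container
  ibv_devinfo   # should list the Mellanox HCA with port state PORT_ACTIVE
  ulimit -l     # memlock limit; RDMA memory registration needs a large or unlimited value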

Can you also share your environment information (OS, IB device, OFED version, etc.), hadoop logs, and configuration files?
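
For the environment information, standard commands like the following should be enough (assuming Mellanox OFED is installed; adjust for your setup):

  cat /etc/os-release   # OS distribution and version
  uname -r              # kernel version
  ibstat                # IB device, firmware, and port/link state
  ofed_info -s          # installed OFED version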


Thanks,

Shashank

________________________________
From: RDMA-Hadoop-discuss <rdma-hadoop-discuss-bounces at cse.ohio-state.edu> on behalf of peini liu <peini.liu at bsc.es>
Sent: Friday, April 26, 2019 6:24 AM
To: rdma-hadoop-discuss at cse.ohio-state.edu
Subject: [Hadoop-RDMA-discuss] Issue for RDMA Hadoop package


Hi Xiaoyi, all


This is Peini Liu. I was trying to test the 'rdma-hadoop-2.x-1.3.5-x86-bin.tar.gz' distribution with Docker on our testbed.

The versions: Ubuntu 16.04, Docker 18.09, jdk1.8.0_201

The configuration: followed the instructions

The hardware: Mellanox InfiniBand

Other things: I have already tested the RDMA connection with the 'ib_write_bw' tool (sketched below) and it works. I also have SSH connectivity between the containers. When I run the TeraSort benchmark, it works fine at the beginning, but it sometimes hangs unexpectedly while the experiment is running.
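
For reference, the ib_write_bw sanity check between two containers was run roughly like this (the hostname below is a placeholder):

  # on the first container (server side)
  ib_write_bw
  # on the second container (client side)
  ib_write_bw <server-container-hostname>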

-----------------------------------------------------------------------------------------------------------------------------------

There are no ERRORs in the log file, but I get the following exit code and exception:

Exception from container-launch.
Container id: container_1556185871097_0003_01_000040
Exit code: 134
Exception message: /bin/bash: line 1:  6614 Aborted                 /opt/java/jdk1.8.0_201/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2752m -Djava.io.tmpdir=/opt/Programs/hadoop/hdfs/nm-local-dir/usercache/root/appcache/application_1556185871097_0003/container_1556185871097_0003_01_000040/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 172.30.48.3 38354 attempt_1556185871097_0003_m_000038_0 40 > /opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040/stdout 2> /opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040/stderr

Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1:  6614 Aborted                 /opt/java/jdk1.8.0_201/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2752m -Djava.io.tmpdir=/opt/Programs/hadoop/hdfs/nm-local-dir/usercache/root/appcache/application_1556185871097_0003/container_1556185871097_0003_01_000040/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 172.30.48.3 38354 attempt_1556185871097_0003_m_000038_0 40 > /opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040/stdout 2> /opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040/stderr

        at org.apache.hadoop.util.Shell.runCommand(SourceFile:972)
        at org.apache.hadoop.util.Shell.run(SourceFile:869)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(SourceFile:1170)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)


With the above exception, I checked the stderr and it seems fine:

buf_id = 0
ep id = 0 receive index = 0
receive buf  = 0
buf_id = 1
ep id = 0 receive index = 0
receive buf  = 1


--------------------------------------------------------------------------------------------------------

Container exited with a non-zero exit code 134

java: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
/opt/Programs/exec_terasuite-ubuntu.sh: line 106: 22768 Aborted                 (core dumped) ${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar teragen -Dmapreduce.job.maps=$n_maps_tg -Dmapreduce.map.memory.mb=$map_mem_tg -Dmapreduce.map.java.opts=-Xmx"${map_heap_tg}"m -Dmapreduce.map.cpu.vcores=$map_vcores_tg $size_kilo /benchmarks/teragen-${size_giga}G-${nb_ctn}ctn-${k}

Several threads are mutex-locked here.

---------------------------------------------------------------------------------------------------------------

After this, I think the connection is down, so I started getting ConnectException and it retried several times:

Couldn't connect to hadoopnode-ubuntu-2-1:0
Connect failed, and sleep 3 seconds to retry.
exchange_ep_info() failed
Couldn't connect to hadoopnode-ubuntu-2-1:0
Connect failed, and sleep 3 seconds to retry.

------------------------------------------------------------------------------------------------------------------------------------------

At last, it shows a fatal error in the library librdmadfsclient.so.1.3.5-x86:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f3cac80a86f, pid=23219, tid=0x00007f3ca0ba6700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_201-b09) (build 1.8.0_201-b09)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.201-b09 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [librdmadfsclient.so.1.3.5-x86+0x786f]  Java_org_apache_hadoop_hdfs_RdmaDFSClient_ucrSendBlocking+0x62
#
# Core dump written. Default location: /tmp/core or core.23219


Thank you so much for your help! :) I have kept those files; if you need anything, please let me know.


Best Regards,


Peini





