[Hadoop-RDMA-discuss] Issue for RDMA Hadoop package
Gugnani, Shashank
gugnani.2 at buckeyemail.osu.edu
Mon Apr 29 17:20:14 EDT 2019
Hi Peini,
Can you make sure that you are launching the docker image using a command like this:
docker run --name <name> -it --pid=host --ipc=host --privileged <image> /bin/bash
It is important to use "--pid=host --ipc=host --privileged", otherwise RDMA will not work inside the container.
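One way to confirm that a running container actually picked up these settings is to look at the HostConfig section of the "docker inspect" output. Below is a minimal Python sketch of that check. It is not part of the RDMA-Hadoop package; the function names missing_rdma_flags and inspect_host_config are hypothetical, and it assumes the docker CLI is on the PATH.

```python
import json
import subprocess

# HostConfig keys in `docker inspect` output, mapped to the `docker run`
# flag that sets them and the value that flag is expected to produce.
REQUIRED = {
    "PidMode": ("--pid=host", "host"),
    "IpcMode": ("--ipc=host", "host"),
    "Privileged": ("--privileged", True),
}


def missing_rdma_flags(host_config):
    """Return the docker run flags that are not reflected in host_config."""
    return [
        flag
        for key, (flag, expected) in REQUIRED.items()
        if host_config.get(key) != expected
    ]


def inspect_host_config(container_name):
    """Fetch the HostConfig dict of a container via `docker inspect`."""
    result = subprocess.run(
        ["docker", "inspect", container_name],
        check=True,
        capture_output=True,
        text=True,
    )
    # `docker inspect` prints a JSON array with one object per container.
    return json.loads(result.stdout)[0]["HostConfig"]
```

For example, missing_rdma_flags(inspect_host_config("<name>")) should return an empty list when all three settings are in place, and the list of missing flags otherwise.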
Can you also share your environment information (OS, IB device, OFED version, etc.), Hadoop logs, and configuration files?
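To gather most of the environment information in one go, something like the following Python sketch could be run inside the container. The command list is an assumption: ibv_devinfo ships with the libibverbs utilities and ofed_info with OFED installations, so either may be missing on a given system, in which case the script only notes that. The function name collect_info is hypothetical.

```python
import shutil
import subprocess

# Assumed set of commands; adjust to what is installed on your system.
COMMANDS = {
    "kernel": ["uname", "-a"],
    "os": ["cat", "/etc/os-release"],
    "ib_devices": ["ibv_devinfo"],
    "ofed_version": ["ofed_info", "-s"],
    "java": ["java", "-version"],
}


def collect_info(commands):
    """Run each command and return a dict mapping label to its output."""
    report = {}
    for label, argv in commands.items():
        if shutil.which(argv[0]) is None:
            report[label] = "(" + argv[0] + " not found)"
            continue
        try:
            proc = subprocess.run(
                argv, capture_output=True, text=True, timeout=30
            )
            # Some tools (e.g. `java -version`) write to stderr,
            # so keep both streams.
            report[label] = (proc.stdout + proc.stderr).strip()
        except (OSError, subprocess.TimeoutExpired) as exc:
            report[label] = "(failed: " + str(exc) + ")"
    return report


if __name__ == "__main__":
    for label, text in collect_info(COMMANDS).items():
        print("== " + label + " ==")
        print(text)
```

The Hadoop logs and configuration files still need to be attached separately.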
Thanks,
Shashank
________________________________
From: RDMA-Hadoop-discuss <rdma-hadoop-discuss-bounces at cse.ohio-state.edu> on behalf of peini liu <peini.liu at bsc.es>
Sent: Friday, April 26, 2019 6:24 AM
To: rdma-hadoop-discuss at cse.ohio-state.edu
Subject: [Hadoop-RDMA-discuss] Issue for RDMA Hadoop package
Hi Xiaoyi, all
Hello, this is Peini Liu. I was trying to test the 'rdma-hadoop-2.x-1.3.5-x86-bin.tar.gz' distribution with Docker on our testbed.
The versions: Ubuntu 16.04, Docker 18.09, JDK 1.8.0_201
The configuration: set up following the instructions
The hardware: Mellanox InfiniBand
Other things: I have already tested the RDMA connection with the 'ib_write_bw' tool, and it works. I also have SSH connectivity between the containers. When I run the TeraSort benchmark, it works fine at the beginning, but it sometimes hangs unexpectedly while the experiment is running.
-----------------------------------------------------------------------------------------------------------------------------------
There are no ERROR entries in the log file, but there are the following exit code and exception:
Exception from container-launch.
Container id: container_1556185871097_0003_01_000040
Exit code: 134
Exception message: /bin/bash: line 1: 6614 Aborted /opt/java/jdk1.8.0_201/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2752m -Djava.io.tmpdir=/opt/Programs/hadoop/hdfs/nm-local-dir/usercache/root/appcache/application_1556185871097_0003/container_1556185871097_0003_01_000040/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 172.30.48.3 38354 attempt_1556185871097_0003_m_000038_0 40 > /opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040/stdout 2> /opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040/stderr
Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 6614 Aborted /opt/java/jdk1.8.0_201/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2752m -Djava.io.tmpdir=/opt/Programs/hadoop/hdfs/nm-local-dir/usercache/root/appcache/application_1556185871097_0003/container_1556185871097_0003_01_000040/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 172.30.48.3 38354 attempt_1556185871097_0003_m_000038_0 40 > /opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040/stdout 2> /opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040/stderr
at org.apache.hadoop.util.Shell.runCommand(SourceFile:972)
at org.apache.hadoop.util.Shell.run(SourceFile:869)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(SourceFile:1170)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
For the above exception, I checked the stderr, and it seems fine:
buf_id = 0
ep id = 0 receive index = 0
receive buf = 0
buf_id = 1
ep id = 0 receive index = 0
receive buf = 1
--------------------------------------------------------------------------------------------------------
Container exited with a non-zero exit code 134
java: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
/opt/Programs/exec_terasuite-ubuntu.sh: line 106: 22768 Aborted (core dumped) ${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar teragen -Dmapreduce.job.maps=$n_maps_tg -Dmapreduce.map.memory.mb=$map_mem_tg -Dmapreduce.map.java.opts=-Xmx"${map_heap_tg}"m -Dmapreduce.map.cpu.vcores=$map_vcores_tg $size_kilo /benchmarks/teragen-${size_giga}G-${nb_ctn}ctn-${k}
Several threads appear to be stuck on a mutex lock here.
---------------------------------------------------------------------------------------------------------------
After this, I think the connection went down, because I started getting ConnectException errors and the connection was retried several times:
Couldn't connect to hadoopnode-ubuntu-2-1:0
Connect failed, and sleep 3 seconds to retry.
exchange_ep_info() failed
Couldn't connect to hadoopnode-ubuntu-2-1:0
Connect failed, and sleep 3 seconds to retry.
------------------------------------------------------------------------------------------------------------------------------------------
Finally, a fatal error is reported in the library librdmadfsclient.so.1.3.5-x86:
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f3cac80a86f, pid=23219, tid=0x00007f3ca0ba6700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_201-b09) (build 1.8.0_201-b09)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.201-b09 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [librdmadfsclient.so.1.3.5-x86+0x786f] Java_org_apache_hadoop_hdfs_RdmaDFSClient_ucrSendBlocking+0x62
#
# Core dump written. Default location: /tmp/core or core.23219
Thank you so much for your help! :) I have kept those files; if you need anything, please let me know.
Best Regards,
Peini