[Hadoop-RDMA-discuss] Issue for RDMA Hadoop package
peini liu
peini.liu at bsc.es
Fri Apr 26 06:24:38 EDT 2019
Hi Xiaoyi, all,
Hello, this is Peini Liu. I have been testing the
'rdma-hadoop-2.x-1.3.5-x86-bin.tar.gz' distribution with Docker on our
testbed.
The versions: Ubuntu 16.04, Docker 18.09, JDK 1.8.0_201
The configuration: following the instructions
The hardware: Mellanox InfiniBand
Other things: I have already tested the RDMA connection with the
'ib_write_bw' tool and it works! I also have SSH connectivity between
the containers.
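For reference, the ib_write_bw sanity check looks roughly like the sketch below. The device name mlx5_0 and the server hostname are placeholders for illustration; adjust them to the actual fabric.

```shell
# Sketch: sanity-check the raw RDMA path between two containers with
# perftest's ib_write_bw before involving Hadoop at all.
# 'mlx5_0' and '<server-hostname>' below are placeholders.
if command -v ib_write_bw >/dev/null 2>&1; then
  msg="perftest found: $(ib_write_bw --version 2>&1 | head -n1)"
else
  msg="ib_write_bw not found; install the 'perftest' package first"
fi
echo "$msg"
# Typical two-sided run:
#   server container:  ib_write_bw -d mlx5_0 --report_gbits
#   client container:  ib_write_bw -d mlx5_0 --report_gbits <server-hostname>
```

If this bandwidth test passes between the same two containers that later fail in Hadoop, the fault is unlikely to be in the IB fabric itself.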
When I run the TeraSort benchmark, it works fine at the beginning, but
sometimes it hangs unexpectedly while the experiment is running.
-----------------------------------------------------------------------------------------------------------------------------------
There are no ERRORs in the log file, but I get the following exit code
and exception:
Exception from container-launch.
Container id: container_1556185871097_0003_01_000040
Exit code: 134
Exception message: /bin/bash: line 1: 6614 Aborted
/opt/java/jdk1.8.0_201/bin/java -Djava.net.preferIPv4Stack=true
-Dhadoop.metrics.log.level=WARN -Xmx2752m
-Djava.io.tmpdir=/opt/Programs/hadoop/hdfs/nm-local-dir/usercache/root/appcache/application_1556185871097_0003/container_1556185871097_0003_01_000040/tmp
-Dlog4j.configuration=container-log4j.properties
-Dyarn.app.container.log.dir=/opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040
-Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA
-Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild
172.30.48.3 38354 attempt_1556185871097_0003_m_000038_0 40 >
/opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040/stdout
2>
/opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040/stderr
Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 6614
Aborted /opt/java/jdk1.8.0_201/bin/java
-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN
-Xmx2752m
-Djava.io.tmpdir=/opt/Programs/hadoop/hdfs/nm-local-dir/usercache/root/appcache/application_1556185871097_0003/container_1556185871097_0003_01_000040/tmp
-Dlog4j.configuration=container-log4j.properties
-Dyarn.app.container.log.dir=/opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040
-Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA
-Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild
172.30.48.3 38354 attempt_1556185871097_0003_m_000038_0 40 >
/opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040/stdout
2>
/opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040/stderr
at org.apache.hadoop.util.Shell.runCommand(SourceFile:972)
at org.apache.hadoop.util.Shell.run(SourceFile:869)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(SourceFile:1170)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Given the above exception, I checked the stderr and it seems fine:
buf_id = 0
ep id = 0 receive index = 0
receive buf = 0
buf_id = 1
ep id = 0 receive index = 0
receive buf = 1
--------------------------------------------------------------------------------------------------------
Container exited with a non-zero exit code 134
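As a side note on decoding this status: exit code 134 is 128 + 6, i.e. the container JVM was terminated by SIGABRT (an abort()/failed assertion), which is consistent with the "Aborted (core dumped)" messages. A quick sanity check of that mapping on any Linux shell:

```shell
# Exit code 134 = 128 + 6 (SIGABRT): the process died via abort().
ulimit -c 0                       # don't actually write a core file here
status=0
sh -c 'kill -ABRT $$' || status=$?  # child shell aborts itself
echo "exit status: $status"         # 128 + SIGABRT(6) = 134
```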
java: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion
`mutex->__data.__owner == 0' failed.
/opt/Programs/exec_terasuite-ubuntu.sh: line 106: 22768
Aborted (core dumped) ${HADOOP_HOME}/bin/hadoop jar
${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar
teragen -Dmapreduce.job.maps=$n_maps_tg
-Dmapreduce.map.memory.mb=$map_mem_tg
-Dmapreduce.map.java.opts=-Xmx"${map_heap_tg}"m
-Dmapreduce.map.cpu.vcores=$map_vcores_tg $size_kilo
/benchmarks/teragen-${size_giga}G-${nb_ctn}ctn-${k}
Several threads are blocked on the mutex here.
---------------------------------------------------------------------------------------------------------------
After this, I think the connection went down; I started getting
ConnectException and it retried several times:
Couldn't connect to hadoopnode-ubuntu-2-1:0
Connect failed, and sleep 3 seconds to retry.
exchange_ep_info() failed
Couldn't connect to hadoopnode-ubuntu-2-1:0
Connect failed, and sleep 3 seconds to retry.
------------------------------------------------------------------------------------------------------------------------------------------
Finally, it shows a fatal error in the library librdmadfsclient.so.1.3.5-x86:
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f3cac80a86f, pid=23219, tid=0x00007f3ca0ba6700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_201-b09) (build 1.8.0_201-b09)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.201-b09 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [librdmadfsclient.so.1.3.5-x86+0x786f]  Java_org_apache_hadoop_hdfs_RdmaDFSClient_ucrSendBlocking+0x62
#
# Core dump written. Default location: /tmp/core or core.23219
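Since the core dump was written, a native backtrace can be pulled from it with gdb. The sketch below uses the paths from the crash log above; the core may be /tmp/core or core.23219 depending on the system's core pattern.

```shell
# Sketch: get a native backtrace from the saved core dump with gdb.
# JAVA_BIN and CORE follow the crash log above; adjust to the real paths.
JAVA_BIN=/opt/java/jdk1.8.0_201/bin/java
CORE=/tmp/core                  # or core.23219, per the log message
if command -v gdb >/dev/null 2>&1 && [ -r "$CORE" ] && [ -x "$JAVA_BIN" ]; then
  gdb -batch -ex 'bt' "$JAVA_BIN" "$CORE" || true
  result="backtrace printed"
else
  result="skipped: gdb, core file, or JVM binary not present"
fi
echo "$result"
```

The frames around Java_org_apache_hadoop_hdfs_RdmaDFSClient_ucrSendBlocking in that backtrace should show where inside the native library the SIGSEGV occurred.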
Thank you so much for your help! :) I have kept those files; if you
need anything else, please let me know.
Best Regards,
Peini