[Hadoop-RDMA-discuss] Issue for RDMA Hadoop package

peini liu peini.liu at bsc.es
Fri Apr 26 06:24:38 EDT 2019


Hi Xiaoyi, all


Hello, this is Peini Liu. I was trying to test the
'rdma-hadoop-2.x-1.3.5-x86-bin.tar.gz' distribution with Docker on our
testbed.

The versions: Ubuntu 16.04, Docker 18.09, JDK 1.8.0_201

The configuration: I followed the instructions

The hardware: Mellanox InfiniBand

Other things: I have already tested the RDMA connection with the
'ib_write_bw' tool, and it works. I also have SSH connectivity between
the containers. When I run the TeraSort benchmark, it works fine at the
beginning, but it sometimes hangs unexpectedly while the experiment is
running.

-----------------------------------------------------------------------------------------------------------------------------------

There are no ERROR entries in the log file, but the exit code and
exception are as follows:

Exception from container-launch.
Container id: container_1556185871097_0003_01_000040
Exit code: 134
Exception message: /bin/bash: line 1:  6614 Aborted                 
/opt/java/jdk1.8.0_201/bin/java -Djava.net.preferIPv4Stack=true 
-Dhadoop.metrics.log.level=WARN -Xmx2752m 
-Djava.io.tmpdir=/opt/Programs/hadoop/hdfs/nm-local-dir/usercache/root/appcache/application_1556185871097_0003/container_1556185871097_0003_01_000040/tmp 
-Dlog4j.configuration=container-log4j.properties 
-Dyarn.app.container.log.dir=/opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040 
-Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA 
-Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 
172.30.48.3 38354 attempt_1556185871097_0003_m_000038_0 40 > 
/opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040/stdout 
2> 
/opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040/stderr

Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 6614 
Aborted                 /opt/java/jdk1.8.0_201/bin/java 
-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN 
-Xmx2752m 
-Djava.io.tmpdir=/opt/Programs/hadoop/hdfs/nm-local-dir/usercache/root/appcache/application_1556185871097_0003/container_1556185871097_0003_01_000040/tmp 
-Dlog4j.configuration=container-log4j.properties 
-Dyarn.app.container.log.dir=/opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040 
-Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA 
-Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 
172.30.48.3 38354 attempt_1556185871097_0003_m_000038_0 40 > 
/opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040/stdout 
2> 
/opt/Programs/rdma-hadoop-2.x-1.3.5-x86/logs/userlogs/application_1556185871097_0003/container_1556185871097_0003_01_000040/stderr

         at org.apache.hadoop.util.Shell.runCommand(SourceFile:972)
         at org.apache.hadoop.util.Shell.run(SourceFile:869)
         at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(SourceFile:1170)
         at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
         at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
         at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
         at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
         at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)


After seeing the above exception I checked the container's stderr, and it seems fine:

buf_id = 0
ep id = 0 receive index = 0
receive buf  = 0
buf_id = 1
ep id = 0 receive index = 0
receive buf  = 1

--------------------------------------------------------------------------------------------------------

Container exited with a non-zero exit code 134

java: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion 
`mutex->__data.__owner == 0' failed.
/opt/Programs/exec_terasuite-ubuntu.sh: line 106: 22768 
Aborted                 (core dumped) ${HADOOP_HOME}/bin/hadoop jar 
${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar 
teragen -Dmapreduce.job.maps=$n_maps_tg 
-Dmapreduce.map.memory.mb=$map_mem_tg 
-Dmapreduce.map.java.opts=-Xmx"${map_heap_tg}"m 
-Dmapreduce.map.cpu.vcores=$map_vcores_tg $size_kilo 
/benchmarks/teragen-${size_giga}G-${nb_ctn}ctn-${k}

Several threads appear to be blocked on the mutex lock here.
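As a side note on reading the exit code 134 reported above: shells conventionally report a process killed by signal N as status 128 + N, so 134 corresponds to signal 6 (SIGABRT on Linux), which matches the "Aborted" message and the failed glibc assertion. The sketch below only illustrates that arithmetic; the function name `decode_exit_code` is made up for this example and is not part of Hadoop or YARN.

```python
import signal


def decode_exit_code(code: int) -> str:
    """Interpret a shell-style exit status (hypothetical helper)."""
    if code > 128:
        # Shell convention: 128 + N means "terminated by signal N".
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited normally with status {code}"


if __name__ == "__main__":
    # Exit code taken from the container-launch exception in this report.
    print(decode_exit_code(134))
```

This only explains how the process died (an abort raised from native code), not why the mutex assertion failed.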

---------------------------------------------------------------------------------------------------------------

After this, I think the connection went down, because I started getting
ConnectException and the connection was retried several times.

Couldn't connect to hadoopnode-ubuntu-2-1:0
Connect failed, and sleep 3 seconds to retry.
exchange_ep_info() failed
Couldn't connect to hadoopnode-ubuntu-2-1:0
Connect failed, and sleep 3 seconds to retry.

------------------------------------------------------------------------------------------------------------------------------------------

Finally, a fatal error is reported in the library librdmadfsclient.so.1.3.5-x86:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f3cac80a86f, pid=23219, tid=0x00007f3ca0ba6700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_201-b09) (build 
1.8.0_201-b09)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.201-b09 mixed mode 
linux-amd64 compressed oops)
# Problematic frame:
# C  [librdmadfsclient.so.1.3.5-x86+0x786f] 
Java_org_apache_hadoop_hdfs_RdmaDFSClient_ucrSendBlocking+0x62
#
# Core dump written. Default location: /tmp/core or core.23219
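In case it helps with triage, here is a minimal sketch of pulling the key fields (signal, pid, library, offset, symbol) out of a JVM fatal-error header like the one above. The function name `parse_crash_header` and the regular expressions are my own assumptions based only on the lines shown in this report; the problematic-frame line is joined back onto one line because the mail client wrapped it.

```python
import re

# Header lines copied from the crash report above (frame line un-wrapped).
HEADER = """\
#  SIGSEGV (0xb) at pc=0x00007f3cac80a86f, pid=23219, tid=0x00007f3ca0ba6700
# Problematic frame:
# C  [librdmadfsclient.so.1.3.5-x86+0x786f] Java_org_apache_hadoop_hdfs_RdmaDFSClient_ucrSendBlocking+0x62
"""


def parse_crash_header(text):
    """Extract signal and native-frame details (hypothetical helper)."""
    sig = re.search(
        r"#\s+(SIG\w+) \((0x[0-9a-f]+)\) at pc=(0x[0-9a-f]+), pid=(\d+)", text
    )
    frame = re.search(
        r"#\s+C\s+\[(\S+?)\+(0x[0-9a-f]+)\]\s+(\S+?)\+(0x[0-9a-f]+)", text
    )
    if sig is None or frame is None:
        return None  # header does not look like the format assumed here
    return {
        "signal": sig.group(1),
        "pc": int(sig.group(3), 16),
        "pid": int(sig.group(4)),
        "library": frame.group(1),
        "lib_offset": int(frame.group(2), 16),   # offset inside the .so
        "symbol": frame.group(3),
        "symbol_offset": int(frame.group(4), 16),
    }


if __name__ == "__main__":
    for key, value in parse_crash_header(HEADER).items():
        print(key, "=", value)
```

If a build of the library with debug symbols exists, the library offset (0x786f here) is the value one would normally pass to a tool such as addr2line; I do not know whether such a build is available for this package.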


Thank you so much for your help! :)  I have kept those files; if you
need anything, please let me know.


Best Regards,


Peini




