[Hadoop-RDMA] Troubles with deploying hadoop-rdma-0.9.8 over IB cluster
Panda, Dhabaleswar
panda at cse.ohio-state.edu
Thu Feb 6 08:32:58 EST 2014
Glad to know that you were able to resolve the problem.
Thanks,
DK
________________________________
From: Hadoop-RDMA [hadoop-rdma-bounces at cse.ohio-state.edu] on behalf of Alexander Frolov [alexndr.frolov at gmail.com]
Sent: Wednesday, February 05, 2014 11:46 AM
To: hadoop-rdma at cse.ohio-state.edu
Subject: Re: [Hadoop-RDMA] Troubles with deploying hadoop-rdma-0.9.8 over IB cluster
It seems that I have solved my problem. The issue was the configuration of hadoop.tmp.dir, which had been set to an NFS partition. By default it points to /tmp/hadoop-${user.name}, which is on the local filesystem. After removing hadoop.tmp.dir from core-site.xml, the problem went away.
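For anyone hitting the same symptom: the working core-site.xml now simply omits hadoop.tmp.dir, so it falls back to the stock default (/tmp/hadoop-${user.name}, on the local disk). If the property is kept, it has to point to a node-local directory; a sketch along these lines (the value shown is just the stock default):

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
  <description>Base directory for Hadoop's working files. Must be on a
  local filesystem: dfs.name.dir and dfs.data.dir default to
  subdirectories of this path, so if it resolves to a shared NFS mount
  the DataNodes can end up clashing over the same storage directory.
  </description>
</property>

A quick way to check whether a given directory is NFS-backed is df -PT <dir>, which prints the filesystem type; in my setup the home partition (/home/frolo) is the NFS mount.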
Thank you.
On Wed, Feb 5, 2014 at 7:26 PM, Alexander Frolov <alexndr.frolov at gmail.com> wrote:
UPD: I forgot to attach the config files:
frolo@A11:~/hadoop-rdma-0.9.8> cat conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/frolo/hadoop-rdma-0.9.8/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://A11:9000</value>
    <description>URL of NameNode</description>
  </property>
  <property>
    <name>hadoop.ib.enabled</name>
    <value>true</value>
    <description>Enable the RDMA feature over IB. Default value of hadoop.ib.enabled is true.</description>
  </property>
  <property>
    <name>hadoop.roce.enabled</name>
    <value>false</value>
    <description>Disable the RDMA feature over RoCE. Default value of hadoop.roce.enabled is false.</description>
  </property>
</configuration>
frolo@A11:~/hadoop-rdma-0.9.8> cat conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <!--
  <property>
    <name>dfs.name.dir</name>
    <value>/home/frolo/hadoop-rdma-0.9.8/HadoopName</value>
    <description>Path on the local filesystem where the NameNode stores the namespace and transaction logs</description>
  </property>
  -->
  <!--
  <property>
    <name>dfs.data.dir</name>
    <value>/home/frolo/hadoop-rdma-0.9.8/HadoopName</value>
    <description>List of paths on the local filesystem of a DataNode where it should store its blocks</description>
  </property>
  -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
frolo@A11:~/hadoop-rdma-0.9.8> cat conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>A11:9001</value>
    <description>Host or IP and port of JobTracker</description>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/home/frolo/hadoop-rdma-0.9.8/tmp</value>
    <description>Path on the HDFS where the MapReduce framework stores system files</description>
  </property>
</configuration>
On Wed, Feb 5, 2014 at 7:22 PM, Alexander Frolov <alexndr.frolov at gmail.com> wrote:
Hello,
I am trying to deploy Hadoop-RDMA on an 8-node IB (OFED-1.5.3-4.0.42) cluster and have run into the following problem (a.k.a. "File ... could only be replicated to 0 nodes, instead of 1"):
frolo@A11:~/hadoop-rdma-0.9.8> ./bin/hadoop dfs -copyFromLocal ../pg132.txt /user/frolo/input/pg132.txt
Warning: $HADOOP_HOME is deprecated.
14/02/05 19:06:30 WARN hdfs.DFSClient: DataStreamer Exception: java.lang.reflect.UndeclaredThrowableException
at com.sun.proxy.$Proxy1.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(Unknown Source)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(Unknown Source)
at com.sun.proxy.$Proxy1.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.From.Code(Unknown Source)
at org.apache.hadoop.hdfs.From.F(Unknown Source)
at org.apache.hadoop.hdfs.From.F(Unknown Source)
at org.apache.hadoop.hdfs.The.run(Unknown Source)
Caused by: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/frolo/input/pg132.txt could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(Unknown Source)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(Unknown Source)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.ipc.RPC$Server.call(Unknown Source)
at org.apache.hadoop.ipc.rdma.madness.Code(Unknown Source)
at org.apache.hadoop.ipc.rdma.madness.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(Unknown Source)
at org.apache.hadoop.ipc.rdma.be.run(Unknown Source)
at org.apache.hadoop.ipc.rdma.RDMAClient.Code(Unknown Source)
at org.apache.hadoop.ipc.rdma.RDMAClient.call(Unknown Source)
at org.apache.hadoop.ipc.Tempest.invoke(Unknown Source)
... 12 more
14/02/05 19:06:30 WARN hdfs.DFSClient: Error Recovery for null bad datanode[0] nodes == null
14/02/05 19:06:30 WARN hdfs.DFSClient: Could not get block locations. Source file "/user/frolo/input/pg132.txt" - Aborting...
14/02/05 19:06:30 INFO hdfs.DFSClient: exception in isClosed
It seems that no data is transferred to the DataNodes when I start copying from the local filesystem to HDFS. I checked the availability of the DataNodes:
frolo@A11:~/hadoop-rdma-0.9.8> ./bin/hadoop dfsadmin -report
Warning: $HADOOP_HOME is deprecated.
Configured Capacity: 0 (0 KB)
Present Capacity: 0 (0 KB)
DFS Remaining: 0 (0 KB)
DFS Used: 0 (0 KB)
DFS Used%: �%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 0 (4 total, 4 dead)
Name: 10.10.1.13:50010
Decommission Status : Normal
Configured Capacity: 0 (0 KB)
DFS Used: 0 (0 KB)
Non DFS Used: 0 (0 KB)
DFS Remaining: 0(0 KB)
DFS Used%: 100%
DFS Remaining%: 0%
Last contact: Wed Feb 05 19:02:54 MSK 2014
Name: 10.10.1.14:50010
Decommission Status : Normal
Configured Capacity: 0 (0 KB)
DFS Used: 0 (0 KB)
Non DFS Used: 0 (0 KB)
DFS Remaining: 0(0 KB)
DFS Used%: 100%
DFS Remaining%: 0%
Last contact: Wed Feb 05 19:02:54 MSK 2014
Name: 10.10.1.16:50010
Decommission Status : Normal
Configured Capacity: 0 (0 KB)
DFS Used: 0 (0 KB)
Non DFS Used: 0 (0 KB)
DFS Remaining: 0(0 KB)
DFS Used%: 100%
DFS Remaining%: 0%
Last contact: Wed Feb 05 19:02:54 MSK 2014
Name: 10.10.1.11:50010
Decommission Status : Normal
Configured Capacity: 0 (0 KB)
DFS Used: 0 (0 KB)
Non DFS Used: 0 (0 KB)
DFS Remaining: 0(0 KB)
DFS Used%: 100%
DFS Remaining%: 0%
Last contact: Wed Feb 05 19:02:55 MSK 2014
I also tried to mkdir in the HDFS filesystem, which succeeded (presumably because mkdir only updates NameNode metadata and does not involve the DataNodes). Restarting the Hadoop daemons had no effect.
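For reference, the mkdir test was along these lines (the exact path is not important):

frolo@A11:~/hadoop-rdma-0.9.8> ./bin/hadoop dfs -mkdir /user/frolo/input
frolo@A11:~/hadoop-rdma-0.9.8> ./bin/hadoop dfs -ls /user/frolo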
Could you please help me with this issue? Thank you.
Best,
Alex