how to access a remote ha-enabled hdfs in a (oozie) distcp action

Accessing a remote hdfs that has high availability enabled it not that straight forward as it used to be with non-ha hdfs setups. In a non-ha hdfs setup you simply use the namenode hostname and port, but with ha-enabled an alias is used and your local hadoop configuration needs to contain a couple of properties so that the hdfs client can determine the active namenode and do a failover if it’s neccessary.

Adding this configuration to your local cluster configuration is quite cumbersome. If you are using Cloudera’s CDH5 distribution of hadoop you need to inject it into the correct “safety valves”. If the remote and your local cluster are both using high availability you will need to override a property that is also essential for your local high availability configuration. And any change of the remote ha-configuration could require a restart of your cluster services for your changes to become active.

Most of the time this remote configuration is only needed when you need to pull data from a remote hdfs using distcp. This blog post will describe how to inject the neccessary configuration directly into the distcp commandline.

collect remote hdfs configuration

First you need to take a look at the client configuration of the remote cluster. See HDFS High Availability Documentation for futher details. These are the properties that are relevant for a hdfs client accessing a remote ha-enabled hdfs:

dfs.nameservices – the logical name for this new nameservice
dfs.ha.namenodes.[nameservice ID] – unique identifiers for each NameNode in the nameservice
dfs.client.failover.proxy.provider.[nameservice ID] – the Java class that HDFS clients use to contact the Active NameNode
dfs.namenode.rpc-address.[nameservice ID].[name node ID] – the fully-qualified RPC address for each NameNode to listen on
dfs.namenode.http-address.[nameservice ID].[name node ID] – the fully-qualified HTTP address for each NameNode to listen on

Your actual configuration may look like this:

 

<property>
	<name>dfs.nameservices</name>
	<value>remotehdfs</value>
</property>
<property>
	<name>dfs.client.failover.proxy.provider.remotehdfs</name>
	<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
	<name>dfs.ha.automatic-failover.enabled.remotehdfs</name>
	<value>true</value>
</property>
<property>
	<name>dfs.ha.namenodes.remotehdfs</name>
	<value>namenode1,namenode2</value>
</property>
<property>
	<name>dfs.namenode.rpc-address.remotehdfs.namenode1</name>
	<value>namenodehost-a:8020</value>
</property>
<property>
<name>dfs.namenode.servicerpc-address.remotehdfs.namenode1/name>
	<value>namenodehost-a:8022</value>
</property>
<property>
	<name>dfs.namenode.http-address.remotehdfs.namenode1</name>
	<value>namenodehost-a:50070</value>
</property>
<property>
	<name>dfs.namenode.https-address.remotehdfs.namenode1</name>
	<value>namenodehost-a:50470</value>
</property>
<property>
	<name>dfs.namenode.rpc-address.remotehdfs.namenode2</name>
	<value>namenodehost-b:8020</value>
</property>
<property>
	<name>dfs.namenode.servicerpc-address.remotehdfs.namenode2</name>
	<value>namenodehost-b:8022</value>
</property>
<property>
	<name>dfs.namenode.http-address.remotehdfs.namenode2</name>
	<value>namenodehost-b:50070</value>
</property>
<property>
	<name>dfs.namenode.https-address.remotehdfs.namenode2</name>
	<value>namenodehost-b:50470</value>
</property>

commandline example

If your local cluster is also ha-enabled, you need to modify the dfs.nameservices property and add your local alias as well (comma separated).

 

You can inject these configuration properties into your distcp command line using the standard hadoop -D parameter:

distcp \
-Ddfs.nameservices=remote-hdfs \
-Ddfs.client.failover.proxy.provider.remote-hdfs=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider \
-Ddfs.ha.namenodes.remote-hdfs=namenode1,namenode2 \
-Ddfs.namenode.rpc-address.remote-hdfs.namenode1=namenodehost-a:8020 \
-Ddfs.namenode.servicerpc-address.remote-hdfs.namenode1=namenodehost-a:8022 \
-Ddfs.namenode.http-address.remote-hdfs.namenode1=namenodehost-a:50070 \
-Ddfs.namenode.https-address.remote-hdfs.namenode1=namenodehost-a:50470 \
-Ddfs.namenode.rpc-address.remote-hdfs.namenode2=namenodehost-b:8020 \
-Ddfs.namenode.servicerpc-address.remote-hdfs.namenode2=namenodehost-b:8022 \
-Ddfs.namenode.http-address.remote-hdfs.namenode2=namenodehost-b:50070 \
-Ddfs.namenode.https-address.remote-hdfs.namenode2=namenodehost-b:50470 \
-m 256 \
hdfs://remote-hdfs/data/dir-to-copy \
/tmp/local-target-dir

oozie distcp action example

Or if you are executing the distcp as an oozie action:

    <action name="transfer">
        <distcp xmlns ="uri:oozie:distcp-action:0.2">
            <arg>-Ddfs.nameservices=remote-hdfs</arg>
            <arg>-Ddfs.client.failover.proxy.provider.remote-hdfs=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider </arg>
            <arg>-Ddfs.ha.namenodes.remote-hdfs=namenode1,namenode2 </arg>
            <arg>-Ddfs.namenode.rpc-address.remote-hdfs.namenode1=namenodehost-a:8020 </arg>
            <arg>-Ddfs.namenode.servicerpc-address.remote-hdfs.namenode1=namenodehost-a:8022 </arg>
            <arg>-Ddfs.namenode.http-address.remote-hdfs.namenode1=namenodehost-a:50070 </arg>
            <arg>-Ddfs.namenode.https-address.remote-hdfs.namenode1=namenodehost-a:50470 </arg>
            <arg>-Ddfs.namenode.rpc-address.remote-hdfs.namenode2=namenodehost-b:8020 </arg>
            <arg>-Ddfs.namenode.servicerpc-address.remote-hdfs.namenode2=namenodehost-b:8022 </arg>
            <arg>-Ddfs.namenode.http-address.remote-hdfs.namenode2=namenodehost-b:50070 </arg>
            <arg>-Ddfs.namenode.https-address.remote-hdfs.namenode2=namenodehost-b:50470 </arg>
            <arg>-m 256 </arg>
            <arg>hdfs://remote-hdfs/data/dir-to-copy</arg>
            <arg>/tmp/local-target-dir</arg>
        </distcp>
        <ok to="move"/>
        <error to="killOnError"/>
    </action>