how to use dynamic allocation in an oozie spark action on CDH5

by Thomas Memenga on 2017-06-21


using spark’s dynamic allocation feature in an oozie spark action can be tricky.

enable dynamic allocation

First you need to make sure that dynamic allocation is actually available on your cluster. Navigate to your “Spark” service, then “Configuration” and search for “dynamic”.

(screenshot: Spark service configuration in Cloudera Manager, filtered for “dynamic”)

Both the shuffle service and dynamic allocation need to be enabled.
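Behind those two checkboxes, Cloudera Manager ends up setting the following Spark properties (a sketch; the port shown is the CDH default for the YARN shuffle service, adjust if your cluster uses a different one):

spark.dynamicAllocation.enabled  true
spark.shuffle.service.enabled    true
spark.shuffle.service.port       7337

You normally don’t edit these by hand on CDH5 — the point is just to know which properties the checkboxes correspond to, because the oozie spark action will need them again later.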

how to configure the oozie spark action

If you just omit --num-executors from your spark-opts definition, your job will fall back to the configuration defaults and utilize only two executors:

(screenshot: job running with only two executors when --num-executors is omitted)
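For illustration, this is the kind of spark-opts section that triggers the fallback — no --num-executors at all (the memory/cores values are placeholders, not recommendations):

<spark-opts>
  --executor-memory 2G
  --executor-cores 2
</spark-opts>

Without an explicit executor count, the job runs with the default of spark.executor.instances, which is 2.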

If you go for --num-executors 0, your oozie workflow will fail with this strange error message:

Number of executors was 0, but must be at least 1 (or 0 if dynamic executor allocation is enabled).
Usage: org.apache.spark.deploy.yarn.Client [options]
Options:
  --jar JAR_PATH      Path to your application's JAR file (required in yarn-cluster mode)
  --class CLASS_NAME  Name of your application's main class (required)
  --primary-py-file   A main Python file
  --primary-r-file    A main R file
  --arg ARG           Argument to be passed to

So even if you enabled the shuffle service on your cluster and set spark.dynamicAllocation.enabled to true, the spark action seems unaware of these settings.

adding missing configuration properties

To get this working you need to provide spark.dynamicAllocation.enabled=true and spark.shuffle.service.enabled=true together with --num-executors 0 in the spark-opts section:

<master>yarn-cluster</master>
<mode>cluster</mode>
<name>spark-job-name</name>
<class>com.syscrest.demo.CustomSparkOozieAction</class>
<jar>${sparkJarPath}</jar>
<spark-opts>
  ... your job specific options ...
  --num-executors 0
  --conf spark.dynamicAllocation.enabled=true
  --conf spark.shuffle.service.enabled=true
</spark-opts>
<arg>-t</arg>
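To show where this fragment lives, here is a sketch of a complete workflow.xml around it (the workflow name, schema versions, and the fail/end node names are placeholder choices, not prescribed by the article):

<workflow-app name="dynamic-allocation-demo" xmlns="uri:oozie:workflow:0.5">
  <start to="spark-node"/>
  <action name="spark-node">
    <spark xmlns="uri:oozie:spark-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <master>yarn-cluster</master>
      <mode>cluster</mode>
      <name>spark-job-name</name>
      <class>com.syscrest.demo.CustomSparkOozieAction</class>
      <jar>${sparkJarPath}</jar>
      <spark-opts>--num-executors 0 --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true</spark-opts>
      <arg>-t</arg>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Spark action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>

With --num-executors 0 plus the two --conf flags in place, the spark action starts with zero fixed executors and lets dynamic allocation scale the executor count up and down.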