Posts

how to use dynamic allocation in an oozie spark action on CDH5

Using spark’s dynamic allocation feature in an oozie spark action can be tricky.

enable dynamic allocation

First you need to make sure that dynamic allocation is actually available on your cluster. Navigate to your “Spark” service, then “Configuration” and search for “dynamic”.

[Screenshot: Spark service configuration with dynamic allocation and the shuffle service enabled]

Both settings (shuffle service and dynamic allocation) need to be enabled.
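
Under the hood, the shuffle service setting makes Cloudera Manager register Spark’s external shuffle service as a YARN auxiliary service on every NodeManager, and the dynamic allocation setting adds spark.dynamicAllocation.enabled to the Spark client defaults. As a rough sketch, the managed yarn-site.xml then contains something like the following (shown for reference only; do not hand-edit it on a CM-managed cluster, and the exact aux-services list depends on what else is enabled):

<!-- yarn-site.xml, as deployed by Cloudera Manager with the Spark shuffle service enabled -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>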

how to configure the oozie spark action

If you just omit --num-executors in your spark-opts definition, your job will fall back to the configuration defaults and utilize only two executors:

[Screenshot: oozie spark action running with only two executors when --num-executors is omitted]
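
The fallback in question is spark.executor.instances, whose default on YARN is 2, so leaving the flag out is roughly equivalent to writing (a sketch, keeping your other options as they are):

<spark-opts>
... your job specific options ...
--num-executors 2
</spark-opts>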

If you go for --num-executors 0, your oozie workflow will fail with this strange error message:

Number of executors was 0, but must be at least 1 (or 0 if dynamic executor allocation is enabled).

Usage: org.apache.spark.deploy.yarn.Client [options]
Options:
  --jar JAR_PATH        Path to your application's JAR file (required in yarn-cluster mode)
  --class CLASS_NAME    Name of your application's main class (required)
  --primary-py-file     A main Python file
  --primary-r-file      A main R file
  --arg ARG             Argument to be passed to

So even if you enabled the shuffle service on your cluster and set spark.dynamicAllocation.enabled to true, the spark action seems unaware of these settings.

adding missing configuration properties

To get this working, you need to provide spark.dynamicAllocation.enabled=true and spark.shuffle.service.enabled=true together with --num-executors 0 in the spark-opts section:

<master>yarn-cluster</master>
<mode>cluster</mode>
<name>spark-job-name</name>
<class>com.syscrest.demo.CustomSparkOozieAction</class>
<jar>${sparkJarPath}</jar>
<spark-opts>
... your job specific options ...
--num-executors 0 
--conf spark.dynamicAllocation.enabled=true 
--conf spark.shuffle.service.enabled=true 
</spark-opts> 
<arg>-t</arg>
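
If you also want to bound how far dynamic allocation scales up or down, you can cap it in the same spark-opts section; the two limits below are illustrative values, not recommendations:

--conf spark.dynamicAllocation.minExecutors=1
--conf spark.dynamicAllocation.maxExecutors=20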

spark oozie action jobs not showing up on spark history server

If you execute spark jobs within an oozie workflow using a <spark> action node on a Cloudera CDH5 cluster, your jobs may not show up on your spark history server. Even if you configured the history server through Cloudera Manager, it may only list jobs started on the command line using spark-submit.

adding missing configuration properties

When using mode = cluster and master = yarn-cluster or yarn-client, you need to set spark.yarn.historyServer.address to the actual address of your spark history server, spark.eventLog.dir to the fully qualified hdfs path of the applicationHistory directory, and spark.eventLog.enabled to true.

<master>yarn-cluster</master>
<mode>cluster</mode>
<name>spark-job-name</name>
<class>com.syscrest.demo.CustomSparkOozieAction</class>
<jar>${sparkJarPath}</jar>
<spark-opts>
... your job specific options ...
--conf spark.yarn.historyServer.address=http://historyservernode:18088
--conf spark.eventLog.dir=hdfs://nameservicehost:8020/user/spark/applicationHistory 
--conf spark.eventLog.enabled=true
</spark-opts> 
<arg>-t</arg>

With these additional configuration properties, oozie-controlled spark jobs should also show up on your spark history server.
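
For comparison, these are the same properties that plain spark-submit already picks up from the gateway’s spark-defaults.conf, which is why command-line jobs appear on the history server without any extra work (a sketch; the host names and the CDH default port 18088 are taken from the example above and will differ on your cluster):

# spark-defaults.conf on a CDH gateway host, as deployed by Cloudera Manager
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://nameservicehost:8020/user/spark/applicationHistory
spark.yarn.historyServer.address=http://historyservernode:18088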

Patching Oozie in a parcel-based CDH 5.8.0 Installation

This blog post will guide you through the process of cloning, patching, building and deploying a custom version of the oozie workflow engine based on the cdh 5.8.0 source code that is available on github.
Read more

how to access a remote ha-enabled hdfs in an (oozie) distcp action

how to inject the configuration of a remote ha-hdfs into a distcp call without modifying the local cluster configuration.
Read more

Passing many parameters from Java action to Oozie workflow

oozie’s ‘capture-output’ is a powerful way to pass dynamic configuration properties from action to action, but you may hit the maximum size limit quite fast.
Read more

Using Accumulo's RangePartitioner in an m/r job (and Oozie workflow)

How to use Accumulo's RangePartitioner to increase your m/r job ingest rate (and the necessary pieces to include it in an oozie workflow)
Read more

oozie-graphite available for CDH5

Our open source project oozie-graphite is now available for CDH5.
Read more

Increasing mapreduce.job.counters.max on CDH5 YARN (MR2)

How to increase mapreduce.job.counters.max on YARN (MR2) for HUE / HIVE / OOZIE.
Read more

Oozie bundle monitoring: tapping into hadoop counters

This is the first post about GraphiteMRCounterExecutor use cases: we start by utilizing already available hadoop counters that deliver very valuable graphs.
Read more