GDELT on SCDF 2.2.0: Bootstrapping spring cloud data flow on kubernetes using kubectl

by Thomas Memenga on 2019-08-04

In the first part of our blog post series “processing GDELT data with SCDF on kubernetes”, we walk through the steps to deploy Spring Cloud Data Flow (SCDF) on Kubernetes, including the latest version of the starter apps that will be used in the examples.

This is a repost of GDELT on SCDF 1.7.0: Bootstrapping spring cloud data flow on kubernetes using kubectl, updating the instructions and code from SCDF 1.7.0 to 2.2.0.

We follow the manual steps described in the official Spring Cloud Data Flow documentation to deploy all components into a dedicated namespace scdf-220 on our Kubernetes cluster and run the examples there.

This installation is not production-ready; it is meant for experimentation and to ensure compatibility with all further blog posts.

Preparations

Clone the git repository to retrieve the necessary Kubernetes configuration files and switch to the 2.2.0.RELEASE tag:

git clone https://github.com/spring-cloud/spring-cloud-dataflow.git
cd spring-cloud-dataflow
git checkout v2.2.0.RELEASE

Installation with kubectl

We want to use a dedicated namespace scdf-220 for our deployment, so we create it first:

echo '{ "kind": "Namespace", "apiVersion": "v1",  "metadata": { "name": "scdf-220", "labels": { "name": "scdf-220" } } }' | kubectl create -f -

Before we start the actual installation, we replace the Kafka version used in the deployment manifest (0.11.0.3) with a more recent one (2.0.0):

sed -i 's/2.11-0.11.0.3/2.11-2.0.0/g' src/kubernetes/kafka/kafka-deployment.yaml
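
A quick grep confirms that the image tag was actually rewritten (treat the expected value as a sanity check; the exact image line depends on the manifest):

# should now show the 2.11-2.0.0 tag instead of 2.11-0.11.0.3
grep 'image:' src/kubernetes/kafka/kafka-deployment.yaml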

Afterwards we can deploy the dependencies (kafka/mysql/redis) and the spring cloud dataflow server itself:

kubectl create -n scdf-220 -f src/kubernetes/kafka/
kubectl create -n scdf-220 -f src/kubernetes/mysql/
kubectl create -n scdf-220 -f src/kubernetes/skipper/skipper-config-kafka.yaml
kubectl create -n scdf-220 -f src/kubernetes/skipper/skipper-deployment.yaml
kubectl create -n scdf-220 -f src/kubernetes/skipper/skipper-svc.yaml
kubectl create -n scdf-220 -f src/kubernetes/server/server-roles.yaml
kubectl create -n scdf-220 -f src/kubernetes/server/server-rolebinding.yaml
kubectl create -n scdf-220 -f src/kubernetes/server/service-account.yaml
kubectl create -n scdf-220 -f src/kubernetes/server/server-config.yaml
kubectl create -n scdf-220 -f src/kubernetes/server/server-svc.yaml
kubectl create -n scdf-220 -f src/kubernetes/server/server-deployment.yaml
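
Image pulls can take a while on first deployment, so it may be a few minutes until everything is up. You can block until the two long-running deployments have rolled out (deployment names as shown in the verification step below):

kubectl -n scdf-220 rollout status deployment/skipper
kubectl -n scdf-220 rollout status deployment/scdf-server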

Verify and enable access

kubectl -n scdf-220 get all

The output should look like this:

NAME                               READY   STATUS    RESTARTS   AGE
pod/kafka-broker-c9bb65d79-rgjdk   1/1     Running   0          8m54s
pod/kafka-zk-5dbc848cc-ffsmh       1/1     Running   0          8m54s
pod/mysql-74c6bc5789-spxpw         1/1     Running   0          8m43s
pod/scdf-server-57cb49d876-x5pcw   1/1     Running   2          2m58s
pod/skipper-c787d9bbf-tl9jd        1/1     Running   0          81s

NAME                  TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/kafka         ClusterIP      10.110.232.68    <none>        9092/TCP                     8m54s
service/kafka-zk      ClusterIP      10.101.230.74    <none>        2181/TCP,2888/TCP,3888/TCP   8m54s
service/mysql         ClusterIP      10.104.158.144   <none>        3306/TCP                     8m43s
service/scdf-server   LoadBalancer   10.102.157.67    <pending>     80:32184/TCP                 2m50s
service/skipper       LoadBalancer   10.97.161.195    <pending>     80:32482/TCP                 76s

NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/kafka-broker   1/1     1            1           8m54s
deployment.apps/kafka-zk       1/1     1            1           8m54s
deployment.apps/mysql          1/1     1            1           8m43s
deployment.apps/scdf-server    1/1     1            1           2m58s
deployment.apps/skipper        1/1     1            1           81s

NAME                                     DESIRED   CURRENT   READY   AGE
replicaset.apps/kafka-broker-c9bb65d79   1         1         1       8m54s
replicaset.apps/kafka-zk-5dbc848cc       1         1         1       8m54s
replicaset.apps/mysql-74c6bc5789         1         1         1       8m43s
replicaset.apps/scdf-server-57cb49d876   1         1         1       2m58s
replicaset.apps/skipper-c787d9bbf        1         1         1       81s

Creating a port-forward

To access the UI (and the REST API used by the CLI), you can create a port-forward to the scdf-server pod (substituting your individual pod name, see the output of “kubectl -n scdf-220 get all”):

kubectl -n scdf-220 port-forward \
$(kubectl get pods --namespace scdf-220 -l "app=scdf-server" --no-headers | awk  '{print $1}') \
2345:80

where 2345 is the local port on your machine, on which you can now reach the UI ( http://localhost:2345 ) and the REST API.
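
Alternatively, recent kubectl versions can forward to the service directly, which saves the pod-name lookup; a quick curl against the server's /about endpoint then verifies that the REST API is reachable:

kubectl -n scdf-220 port-forward svc/scdf-server 2345:80

# in a second terminal: returns JSON with version information
curl http://localhost:2345/about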

Install the Data Flow CLI

You can download the jar file directly from Spring’s Maven repository:

wget https://repo.spring.io/release/org/springframework/cloud/spring-cloud-dataflow-shell/2.2.0.RELEASE/spring-cloud-dataflow-shell-2.2.0.RELEASE.jar      

and start the cli:

java -jar spring-cloud-dataflow-shell-2.2.0.RELEASE.jar \
--dataflow.uri=http://localhost:2345 

The output should look like this:

  ____                              ____ _                __
 / ___| _ __  _ __(_)_ __   __ _   / ___| | ___  _   _  __| |
 \___ \| '_ \| '__| | '_ \ / _` | | |   | |/ _ \| | | |/ _` |
  ___) | |_) | |  | | | | | (_| | | |___| | (_) | |_| | (_| |
 |____/| .__/|_|  |_|_| |_|\__, |  \____|_|\___/ \__,_|\__,_|
  ____ |_|    _          __|___/                 __________
 |  _ \  __ _| |_ __ _  |  ___| | _____      __  \ \ \ \ \ \
 | | | |/ _` | __/ _` | | |_  | |/ _ \ \ /\ / /   \ \ \ \ \ \
 | |_| | (_| | || (_| | |  _| | | (_) \ V  V /    / / / / / /
 |____/ \__,_|\__\__,_| |_|   |_|\___/ \_/\_/    /_/_/_/_/_/
 
2.2.0.RELEASE
 
Welcome to the Spring Cloud Data Flow shell. For assistance hit TAB or type "help".
dataflow:>
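
If you start the shell without the --dataflow.uri option, it comes up in "server-unknown" mode; in that case you can still point it at the server from within the shell:

server-unknown:>dataflow config server --uri http://localhost:2345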

Install starter apps

After the initial installation, no applications (task/stream) are registered yet. As you can see here (for stream apps), you need to pick an explicit combination of packaging format (jar vs. docker) and messaging technology (Kafka vs. RabbitMQ).

We will go with docker images (because we are running on Kubernetes) and Kafka as the messaging layer, based on the newer Spring Boot 2.1.x and Spring Cloud Stream 2.1.x releases.

We can register all available stream starter apps with a single CLI command:

dataflow:>app import --uri https://dataflow.spring.io/Einstein-SR3-stream-applications-kafka-docker

Output:

Successfully registered .........
.................................
...... processor.pmml.metadata, sink.router.metadata, sink.mongodb]
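
Instead of importing the whole bulk file, single apps can also be registered one at a time; the Docker coordinates below are only illustrative, the authoritative image tags are listed in the property file behind the URI above:

dataflow:>app register --name http --type source --uri docker:springcloudstream/http-source-kafka:2.1.2.RELEASE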

and the Spring Cloud Task Starter Apps (based on Spring Boot 2.1.x + Spring Cloud Task 2.1.x):

dataflow:> app import --uri https://dataflow.spring.io/Elston-GA-task-applications-docker

Output:

Successfully registered ..........
..................................
........ task.timestamp, task.timestamp.metadata]
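
To inspect a single registered app and its configuration properties, e.g. the just-imported timestamp task:

dataflow:>app info --name timestamp --type task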

You can verify the installation by listing all available applications:

dataflow:>app list

The output should look like this:

╔═══╤══════════════╤═══════════════════════════╤══════════════════════════╤════════════════════╗
║app│    source    │         processor         │           sink           │        task        ║
╠═══╪══════════════╪═══════════════════════════╪══════════════════════════╪════════════════════╣
║   │file          │bridge                     │aggregate-counter         │composed-task-runner║
║   │ftp           │filter                     │counter                   │timestamp           ║
║   │gemfire       │groovy-filter              │field-value-counter       │timestamp-batch     ║
║   │gemfire-cq    │groovy-transform           │file                      │                    ║
║   │http          │grpc                       │ftp                       │                    ║
║   │jdbc          │header-enricher            │gemfire                   │                    ║
║   │jms           │httpclient                 │hdfs                      │                    ║
║   │load-generator│image-recognition          │jdbc                      │                    ║
║   │loggregator   │object-detection           │log                       │                    ║
║   │mail          │pmml                       │mongodb                   │                    ║
║   │mongodb       │python-http                │mqtt                      │                    ║
║   │mqtt          │python-jython              │pgcopy                    │                    ║
║   │rabbit        │scriptable-transform       │rabbit                    │                    ║
║   │s3            │splitter                   │redis-pubsub              │                    ║
║   │sftp          │tasklaunchrequest-transform│router                    │                    ║
║   │syslog        │tcp-client                 │s3                        │                    ║
║   │tcp           │tensorflow                 │sftp                      │                    ║
║   │tcp-client    │transform                  │task-launcher-cloudfoundry│                    ║
║   │time          │twitter-sentiment          │task-launcher-local       │                    ║
║   │trigger       │                           │task-launcher-yarn        │                    ║
║   │triggertask   │                           │tcp                       │                    ║
║   │twitterstream │                           │throughput                │                    ║
║   │              │                           │websocket                 │                    ║
╚═══╧══════════════╧═══════════════════════════╧══════════════════════════╧════════════════════╝
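
As a quick smoke test that the server, skipper and the starter apps work together, you can deploy the classic ticktock stream and remove it again afterwards:

dataflow:>stream create --name ticktock --definition "time | log" --deploy
dataflow:>stream list
dataflow:>stream destroy --name ticktock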

Continue reading on how to implement a custom source application using the reactive programming model of Spring Cloud Stream in our next blog post: Implementing a custom reactive source application for spring cloud data flow.