Blog post series: Processing gdeltproject.org feeds with Spring Cloud Data Flow 1.7.0 on Kubernetes

by Thomas Memenga on 2018-10-20

This blog post has been updated for SCDF 2.2.0. Please continue reading here: Processing gdeltproject.org feeds with Spring Cloud Data Flow 2.2.0 on kubernetes.

We are starting a blog post series to dig deeper into the capabilities of Spring Cloud Data Flow (SCDF) running on Kubernetes. This blog post will be updated when new posts have been published.

List of blog posts for quick access:

About GDELT

Let’s have a look at the data we want to process:

Besides raw data feeds, there are powerful APIs to query and even visualize the GDELT datasets. This blog post is inspired by the “rss feed for web archiving coverage about climate change” taken from a blog post on gdeltproject.org:

We want to use this type of query to pull the latest articles matching a configurable search term from the GDELT project and run some more complex processing on them with Spring Cloud Data Flow (SCDF). We will download the articles, analyze their content, then store and visualize the results.
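To make the idea of a configurable query concrete, here is a minimal sketch of how such a request URL could be assembled. It is not taken from the series itself: the endpoint and the parameter names (`query`, `mode`, `maxrecords`, `format`) reflect our understanding of the GDELT DOC 2.0 API and should be checked against the GDELT documentation, and the helper `build_gdelt_query_url` is a hypothetical name introduced only for illustration.

```python
from urllib.parse import urlencode

# Assumption: endpoint of the GDELT DOC 2.0 API; verify against the GDELT docs.
GDELT_DOC_API = "https://api.gdeltproject.org/api/v2/doc/doc"


def build_gdelt_query_url(query, max_records=75, output_format="rss"):
    """Build the request URL for an article-list query (hypothetical helper).

    The parameter names below are assumptions about the GDELT DOC 2.0 API.
    """
    params = {
        "query": query,             # the configurable search term
        "mode": "artlist",          # ask for a list of matching articles
        "maxrecords": max_records,  # upper bound on returned articles
        "format": output_format,    # e.g. "rss" or "json"
    }
    # urlencode percent-encodes the values and joins them with "&"
    return GDELT_DOC_API + "?" + urlencode(params)


# Example: the "climate change" query that inspired this series
url = build_gdelt_query_url("climate change")
```

Keeping the search term as a plain function argument mirrors the goal stated above: the query stays configurable, while the rest of the request is fixed.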

About Spring Cloud Data Flow (SCDF)

Taken from https://cloud.spring.io/spring-cloud-dataflow:

Pipelines consist of Spring Boot apps, built using the Spring Cloud Stream or Spring Cloud Task microservice frameworks. This makes Spring Cloud Data Flow suitable for a range of data processing use cases, from import/export to event streaming and predictive analytics.

Part 1 : Setting up the runtime environment

SCDF has multiple platform implementations, so it can run your streams and tasks on a local server, on Cloud Foundry, on Kubernetes, on YARN and even on Mesos.

As these different implementations require different packaging (jar vs. Docker image), we chose Kubernetes as the platform and Kafka as the messaging solution for our examples.

You can find instructions to install Spring Cloud Data Flow in our first blog post about GDELT on SCDF: Bootstrapping SCDF on Kubernetes using KUBECTL.

Part 2 : How to pull gdelt data into SCDF

Continue reading to learn how to implement a custom source application using the reactive framework with Spring Cloud Stream: Implementing a custom reactive source application for spring cloud data flow.

Part 3 : How to filter duplicates

The next planned blog post will continue to enhance your first stream definition by adding a custom processor to drop duplicate articles. Stay tuned.