We are starting a blog post series to dig deeper into the capabilities of Spring Cloud Data Flow (SCDF) running on Kubernetes. This post will be updated as new parts of the series are published.
List of blog posts for quick access:
- This introduction to our GDELT on SCDF series
- Getting Started: Bootstrapping SCDF on Kubernetes using kubectl
- Implementing a custom reactive source application
- Implementing a very basic filtering application to drop duplicate data
- Implementing an advanced processor to drop duplicate data with Kafka Streams
Let’s have a look at the data we want to process:
Supported by [Google Jigsaw](https://jigsaw.google.com/), the [GDELT Project](https://www.gdeltproject.org/) monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.
Besides raw data feeds, there are powerful APIs to query and even visualize the GDELT datasets. This blog post is inspired by the "RSS feed for web archiving coverage about climate change" example taken from a blog post on gdeltproject.org:
This searches for all articles published in the last hour mentioning "climate change" or "global warming" and returns the first 200 articles, ordered by date with the newest articles first and returned as an RSS feed that includes the primary URL of each article as one item and, as a separate item, the URL of the mobile/AMP edition of the page, if available. This demonstrates how to use the API as a data source for web archiving.
We want to use this type of query to pull the latest articles for a configurable search term from the GDELT Project and do some more complex processing on Spring Cloud Data Flow (SCDF). We will download the articles, analyze their content, and store and visualize the results.
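To make the query described above concrete, the following sketch builds such a request URL against the GDELT DOC 2.0 API. The endpoint and parameter names (`query`, `mode`, `timespan`, `maxrecords`, `sort`, `format`) are taken from the public GDELT API documentation as we understand it; treat them as assumptions and verify against gdeltproject.org before relying on them:

```python
from urllib.parse import urlencode

# Assumed GDELT DOC 2.0 API endpoint (verify against the GDELT docs).
BASE_URL = "https://api.gdeltproject.org/api/v2/doc/doc"

def gdelt_query_url(query: str, timespan: str = "1h",
                    maxrecords: int = 200, fmt: str = "rss") -> str:
    """Build a DOC API URL for the latest articles matching `query`."""
    params = {
        "query": query,
        "mode": "artlist",        # return a list of matching articles
        "timespan": timespan,     # e.g. articles from the last hour
        "maxrecords": maxrecords, # cap the number of returned articles
        "sort": "datedesc",       # newest articles first
        "format": fmt,            # rss, json, html, ...
    }
    return BASE_URL + "?" + urlencode(params)

url = gdelt_query_url('"climate change" OR "global warming"')
print(url)
```

A configurable version of exactly this kind of URL is what our custom source application will poll periodically.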
About Spring Cloud Data Flow (SCDF)
Taken from https://cloud.spring.io/spring-cloud-dataflow:
Spring Cloud Data Flow is a toolkit for building data integration and real-time data processing pipelines. Pipelines consist of Spring Boot apps, built using the Spring Cloud Stream or Spring Cloud Task microservice frameworks. This makes Spring Cloud Data Flow suitable for a range of data processing use cases, from import/export to event streaming and predictive analytics.
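As an illustration of such a pipeline, SCDF wires apps together with a Unix-pipe-style stream DSL. The snippet below uses the standard `time` and `log` starter apps from the SCDF shell; the stream name is our own choice:

```
stream create --name ticker --definition "time | log" --deploy
```

In later parts of this series we will build streams from our own custom source and processor apps in the same way.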
Part 1 : Setting up the runtime environment
As the different deployment platforms require different packaging (JAR vs. Docker), we went for Kubernetes as the platform and Kafka as the messaging solution in our examples.
You can find instructions to install Spring Cloud Data Flow in our first blog post about GDELT on SCDF: Bootstrapping SCDF on Kubernetes using kubectl.
Part 2 : How to pull GDELT data into SCDF
Read on to learn how to implement a custom source application using the reactive programming model of Spring Cloud Stream: Implementing a custom reactive source application for Spring Cloud Data Flow.
Part 3 : How to filter duplicates
After pulling some data with our custom source, we add our first custom processor to filter the stream: Implementing a very basic filtering application to drop duplicate data.
Part 4 : How to filter duplicates (using Kafka Streams)
We improve the deduplication filter by reimplementing it using Kafka Streams: Implementing an advanced processor based on Kafka Streams.