Spark

by Thomas Memenga on 03 Feb 2023

Integrating dbt with Real-Time Data Streams: Challenges and Solutions

Businesses increasingly need real-time data processing and analytics to make timely, informed decisions. This shift calls for integrating batch-oriented data transformation tools like dbt (data build tool) with streaming platforms such as Apache Kafka or AWS Kinesis. While dbt excels at batch processing, adapting it to a streaming context presents distinct challenges and opportunities. In this post, we’ll explore these challenges and the solutions that can be used to integrate dbt effectively with real-time data streams.

by Thomas Memenga on 10 Mar 2017

spark oozie action jobs not showing up on spark history server

If you execute Spark jobs within an Oozie workflow using a Spark action node on a Cloudera CDH5 cluster, your jobs may not show up on your Spark history server. Even if you configured everything using Cloudera Manager, your history server may only list jobs started on the command line using spark-submit.

by Thomas Memenga on 28 Nov 2016

fixing spark classpath issues on CDH5 accessing Accumulo 1.7.2

We ran into a strange NoSuchMethodError while migrating an Accumulo-based application from 1.6.0 to 1.7.2 running on CDH5. A couple of code changes were necessary when moving from 1.6.0 to 1.7.2, but these were pretty straightforward (member visibility changed, some getters were introduced). Everything compiled fine, but when we executed the Spark application on the cluster we got an exception that pointed directly to a line we had changed during the migration: