From Batches to Pipelines - Real-time data streaming architecture

John Sigvald Skauge

Francisco Roque

Half-day workshop - in English

Exploring the treasure trove of (Big) Data is essential for companies to survive in an increasingly competitive landscape. By putting data at the forefront and adopting data-centric architectures, a company can establish a platform for building data-driven products, make decisions grounded in its data, and ultimately take control of the ever-growing volumes of data it generates.

This workshop focuses on learning the concepts behind, and then building, a real-time data streaming architecture with Apache Kafka as the central Data Hub. We will start with an overview of data-centric architectures, Apache Kafka, Spark Streaming, and the data modeling process. A large chunk of time is reserved for hands-on deployment of the different components and for interactive data exploration using Jupyter Notebooks (in Python). The aim is to enable you to build applications that take advantage of the low-latency, real-time nature of the architecture.
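To give a flavour of the pieces involved, here is a minimal sketch of the produce-and-consume flow we will work with: a Python producer publishing JSON events to a Kafka topic, and a Spark Streaming job counting them in micro-batches. The topic name, broker address, and event fields are illustrative assumptions rather than workshop material, and running the consumer assumes Spark's Kafka integration package is available on the classpath.

    # Minimal sketch of the Kafka -> Spark Streaming flow. Topic name,
    # broker address, and payload shape are illustrative assumptions.
    import json
    from kafka import KafkaProducer  # kafka-python client

    # --- Producer side: publish JSON events to a Kafka topic ---
    producer = KafkaProducer(
        bootstrap_servers='localhost:9092',
        value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    )
    producer.send('events', {'user': 'alice', 'action': 'click'})
    producer.flush()

    # --- Consumer side: count events per action with Spark Streaming ---
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName='EventCounter')
    ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

    stream = KafkaUtils.createDirectStream(
        ssc, ['events'], {'metadata.broker.list': 'localhost:9092'})

    counts = (stream.map(lambda kv: json.loads(kv[1]))   # values are JSON strings
                    .map(lambda event: (event['action'], 1))
                    .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()

The direct stream delivers (key, value) pairs once per micro-batch, so the familiar Spark primitives (map, reduceByKey, windowing) apply directly to the live event flow.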

Participants will also use Ansible, AWS, and Docker to prepare the workshop environment, after a short introduction to these tools. This is not the main focus of the workshop, but it is needed to get everyone started quickly.

Important topics / technologies covered:

  • Apache Kafka
  • Spark Streaming
  • Jupyter Notebooks
  • Ansible
  • Docker

Primarily for: Developers, Project managers, Architects, Product developers, Managers

Participant requirements: Laptop, AWS account (free tier), SSH client