From Batches to Pipelines - Real-time data streaming architecture
John Sigvald Skauge
Francisco Roque
Half-day workshop - in English
Exploring the treasure trove of (Big) Data is essential for companies to survive in an increasingly competitive landscape. By putting data at the forefront and focusing on data-centric architectures, a company can establish a platform for building data-driven products, make decisions based on its data, and ultimately take control of the growing amounts of data it generates.
This workshop focuses on learning the concepts behind, and building, a real-time data streaming architecture with Apache Kafka as the main data hub. We will start with an overview of data-centric architectures, Apache Kafka, Spark Streaming, and the data modeling process. A large part of the time is reserved for hands-on deployment of the different components and for interactive data exploration using Jupyter Notebooks (in Python). The aim is to enable you to build applications that take advantage of the low-latency, real-time nature of the architecture.
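To give a flavour of what we will build, below is a minimal sketch (not workshop material) of a Spark Streaming job consuming a Kafka topic from Python. It assumes the spark-streaming-kafka integration package is available, a broker at localhost:9092, and a hypothetical topic named "events"; the workshop environment will use its own addresses and topic names.

    # Minimal sketch: consume a Kafka topic with Spark Streaming and
    # count the messages in each 10-second micro-batch.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="kafka-streaming-sketch")
    ssc = StreamingContext(sc, batchDuration=10)

    # Read directly from the Kafka brokers; the topic "events" and the
    # broker address are placeholders for illustration only.
    stream = KafkaUtils.createDirectStream(
        ssc,
        topics=["events"],
        kafkaParams={"metadata.broker.list": "localhost:9092"},
    )

    # Each element is a (key, value) pair; count the values per batch.
    stream.map(lambda kv: kv[1]).count().pprint()

    ssc.start()
    ssc.awaitTermination()

Running this kind of consumption logic interactively, and extending it into real applications, is exactly the sort of exploration planned for the hands-on part.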
Participants will also use Ansible, AWS, and Docker to prepare the workshop environment, after a short introduction to these tools. This is not the main focus of the workshop, but it is needed to get everyone started quickly.
Important topics / technologies covered:
- Apache Kafka
- Spark Streaming
- Jupyter Notebooks
- Ansible
- Docker
Primarily for: Developers, Project managers, Architects, Product developers, Managers
Participant requirements: Laptop, AWS account (free tier), SSH client