Learn how to perform data transformations on real-time, event-driven data in Python by integrating distributed data pipelines with scalable, high-throughput, fault-tolerant streaming platforms.
This course provides a hands-on exploration of the industry-standard Apache Kafka distributed streaming platform and of how it integrates with distributed data pipelines via Apache Spark's Structured Streaming engine to build high-throughput, low-latency real-time data processing systems. Following on from our Distributed Data Engineering course, it enables experienced senior data engineers to build systems that transform, and derive actionable insight from, data in real time, including performing real-time SQL operations, joins, and deduplication, and handling late-arriving (out-of-order) data, that is, events whose timestamps are earlier than those of data already received.
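One of the real-time joins mentioned above is the stream-static join: each incoming event is enriched with fields looked up from a fixed reference table. The following is a minimal, framework-free sketch of that idea in plain Python; the field names and reference data are illustrative assumptions, not APIs from the course, and in Spark Structured Streaming the same pattern is expressed as a join between a streaming DataFrame and a static one.

```python
# Sketch of a stream-static join: enrich each incoming event with a
# value looked up from a static reference table, as a streaming
# enrichment join would. Reference data and fields are illustrative.
users = {"u1": "basic", "u2": "premium"}  # static side of the join

def enrich(stream):
    """Yield each event with a 'tier' field joined in from the users table."""
    for event in stream:
        tier = users.get(event["user"], "unknown")  # left-join semantics
        yield {**event, "tier": tier}

clicks = [{"user": "u1", "page": "/home"}, {"user": "u3", "page": "/buy"}]
print(list(enrich(clicks)))
# → [{'user': 'u1', 'page': '/home', 'tier': 'basic'},
#    {'user': 'u3', 'page': '/buy', 'tier': 'unknown'}]
```

Because `enrich` is a generator, it consumes events one at a time, mirroring how a streaming engine processes records incrementally rather than in bulk.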
- 1. Introduction to Apache Kafka
- 2. Apache Kafka and Python
- 3. Apache Spark Structured Streaming
- 4. Real-Time DataFrame Streaming
- 5. Real-Time Data Transformations
- 6. Real-Time Joins
- 7. Real-Time Deduplication
- 8. Late Data Handling
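The last two modules in the outline, Real-Time Deduplication and Late Data Handling, rest on two ideas that can be sketched without any streaming framework: dropping duplicate deliveries by key, and using a watermark (the maximum event time seen so far, minus a fixed delay) to decide when a late event is too late to process. Below is a minimal plain-Python sketch; the event structure and the 10-second threshold are illustrative assumptions, and in Spark Structured Streaming the production equivalents are `withWatermark` and `dropDuplicates`.

```python
# Sketch of streaming deduplication and watermark-based late-data
# handling, independent of Kafka/Spark. Events carry an id and an
# event-time timestamp; the watermark trails the max timestamp seen.

WATERMARK_DELAY = 10  # seconds; illustrative threshold

def process_stream(events):
    """Deduplicate by event id and drop events older than the watermark."""
    seen_ids = set()
    max_ts = float("-inf")
    accepted = []
    for event in events:
        max_ts = max(max_ts, event["ts"])
        watermark = max_ts - WATERMARK_DELAY
        if event["ts"] < watermark:
            continue  # too late: event time is behind the watermark
        if event["id"] in seen_ids:
            continue  # duplicate delivery (e.g. at-least-once semantics)
        seen_ids.add(event["id"])
        accepted.append(event)
    return accepted

events = [
    {"id": "a", "ts": 100},
    {"id": "b", "ts": 105},
    {"id": "a", "ts": 100},  # duplicate: dropped
    {"id": "c", "ts": 97},   # late, but within the watermark (105 - 10 = 95): kept
    {"id": "d", "ts": 80},   # older than the watermark: dropped
]
print([e["id"] for e in process_stream(events)])  # → ['a', 'b', 'c']
```

Note that event `c` arrives after `b` yet carries an earlier timestamp; the watermark is what lets the system accept such out-of-order data up to a bounded delay while still discarding arbitrarily old records.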
- Introduction to Python or equivalent.
- Python for Data Analysis or equivalent.
- Distributed Data Engineering or equivalent.
- The ability to apply data transformation techniques to event-driven data in real time.
- The ability to integrate distributed data pipelines with distributed streaming platforms in order to process and derive actionable insights from data in real time, including performing real-time SQL operations, joins, and deduplication, and handling late-arriving (out-of-order) data whose timestamps are earlier than those of data already received.
- Knowledge of the industry-standard Apache Spark Structured Streaming engine and Apache Kafka distributed streaming platform.