Computing Course • Jillur Quddus

Distributed Data Engineering

Jillur Quddus • Founder & Chief Data Scientist • 1st Sep 2020

Overview

Learn how to perform data transformations on big data in Python by building and deploying distributed data pipelines optimised for processing massive data volumes.

Course Details

This course provides a hands-on, in-depth exploration of Apache Spark, the industry-standard unified analytics engine, and specifically its Spark SQL, DataFrame and Dataset APIs, which are used to build distributed data pipelines capable of processing data volumes ranging from gigabytes (GB) to petabytes (PB). The course follows on from our Python for Data Analysis course and enables experienced senior data engineers to load, model, transform, merge and analyse huge volumes of structured and unstructured data.

Course Modules

  • 1. Introduction to Apache Spark
  • 2. Spark Session
  • 3. Loading Data Sources
  • 4. DataFrames
  • 5. Spark SQL
  • 6. Aggregate Functions
  • 7. Scalar Functions
  • 8. User Defined Functions
  • 9. Dataset API
  • 10. Persistence to Databases
  • 11. Optimising Performance Part 1
  • 12. Optimising Performance Part 2

Requirements

Outcomes

  • The ability to load large structured, semi-structured and unstructured data files (including Parquet, ORC, JSON and Avro files) into distributed and efficient in-memory data structures.
  • The ability to design, build and optimise end-to-end distributed data pipelines capable of loading, merging and transforming large disparate datasets, and saving the transformed and modelled data into distributed SQL and NoSQL databases.
  • The ability to analyse and derive actionable insights from large disparate datasets in order to solve real-world business problems, using techniques such as descriptive statistics, trend analysis and forecasting.
  • Knowledge of the industry-standard Apache Spark unified analytics engine for distributed transformations of big data.