Computing Course • Jillur Quddus

Distributed Machine Learning

Jillur Quddus • Founder & Chief Data Scientist • 1st Sep 2020

Overview

Learn how to apply statistical learning techniques to big data in Python by building, interpreting, visualising and evaluating distributed machine learning models optimised for massive data volumes.

Course Details

This course provides a hands-on, in-depth exploration of Apache Spark, the industry-standard unified analytics engine, and specifically of MLlib, its distributed machine learning library. Using MLlib, the course shows how to build, visualise and evaluate distributed machine learning models applied to real-world business problems and use-cases that require learning from massive data volumes, ranging from gigabytes (GB) to petabytes (PB) in size. It follows on from our Statistical Learning course and enables senior data scientists to apply the mathematical techniques introduced there to real-world use-cases, making predictions and deriving actionable insights from big data. The course covers how to build and evaluate linear models for regression and classification, tree-based models and clustering models, along with applied techniques for model selection and fine-tuning at big data scale.

Course Modules

  • 1. Introduction to Apache Spark
  • 2. Apache Spark MLlib Basics
  • 3. Linear Models - Regression Part 1
  • 4. Linear Models - Regression Part 2
  • 5. Linear Models - Classification Part 1
  • 6. Linear Models - Classification Part 2
  • 7. Tree-Based Models Part 1
  • 8. Tree-Based Models Part 2
  • 9. Clustering Models Part 1
  • 10. Clustering Models Part 2
  • 11. Collaborative Filtering
  • 12. Model Selection and Evaluation

Requirements

Outcomes

  • The ability to apply statistical learning techniques in Apache Spark.
  • The ability to build, interpret, visualise and evaluate supervised and unsupervised distributed machine learning models applied to real-world business problems and use-cases that require learning from massive amounts of structured and unstructured data, ranging from gigabytes (GB) to petabytes (PB).
  • The ability to select and fine-tune distributed models applied to big data business problems.
  • Advanced knowledge of the industry-standard Apache Spark MLlib machine learning library.