This repository contains references and code for the book Distributed Machine Learning Patterns from Manning Publications by Yuan Tang.
🔥 Both eBook and physical copies of the book are available on Manning website.
In Distributed Machine Learning Patterns you will learn how to:
- Apply patterns to build scalable and reliable machine learning systems.
- Construct machine learning pipelines with data ingestion, distributed training, model serving, and more.
- Automate machine learning tasks with Kubernetes, TensorFlow, Kubeflow, and Argo Workflows.
- Make trade off decisions between different patterns and approaches.
- Manage and monitor machine learning workloads at scale.
This book teaches you how to take machine learning models from your personal laptop to large distributed clusters. You’ll explore key concepts and patterns behind successful distributed machine learning systems, and learn technologies like TensorFlow, Kubernetes, Kubeflow, and Argo Workflows directly from a key maintainer and contributor. Real-world scenarios, hands-on projects, and clear, practical advice DevOps techniques and let you easily launch, manage, and monitor cloud-native distributed machine learning pipelines.
Scaling up models from personal devices to large distributed clusters is one of the biggest challenges faced by modern machine learning practitioners. Distributing machine learning systems allow developers to handle extremely large datasets across multiple clusters, take advantage of automation tools, and benefit from hardware accelerations. In this book, Yuan Tang shares patterns, techniques, and experience gained from years spent building and managing cutting-edge distributed machine learning infrastructure.
Distributed Machine Learning Patterns is filled with practical patterns for running machine learning systems on distributed Kubernetes clusters in the cloud. Each pattern is designed to help solve common challenges faced when building distributed machine learning systems, including supporting distributed model training, handling unexpected failures, and dynamic model serving traffic. Real-world scenarios provide clear examples of how to apply each pattern, alongside the potential trade-offs for each approach. Once you’ve mastered these cutting-edge techniques, you’ll put them all into practice and finish up by building a comprehensive distributed machine learning system.
For data analysts, data scientists, and software engineers familiar with the basics of machine learning algorithms and running machine learning in production. Readers should be familiar with the basics of Bash, Python, and Docker.
Yuan is a principal software engineer at Red Hat, working on OpenShift AI. Previously, he has led teams to build AI infrastructure and platforms at various companies, including Alibaba and Akuity. He's a project lead of Argo and Kubeflow, a maintainer of TensorFlow and XGBoost, and an author of many popular open source projects. In addition, Yuan authored three machine learning books and published numerous impactful papers. He's a regular conference speaker, technical advisor, leader, and mentor at various organizations.
"This is a wonderful book for those wanting to understand how to be more effective with Machine Learning at scale, explained clearly and from first principles!"
-- Laurence Moroney, AI Developer Relations Lead at Google
"This book is an exceptionally timely and comprehensive guide to developing, running, and managing machine learning systems in a distributed environment. It covers essential topics such as data partitioning, ingestion, model training, serving, and workflow management. What truly sets this book apart is its discussion of these topics from a pattern perspective, accompanied by real-world examples and widely adopted systems like Kubernetes, Kubeflow, and Argo. I highly recommend it!"
-- Yuan Chen, Principal Software Engineer at Apple
"This book provides a high-level understanding of patterns with practical code examples needed for all MLOps engineering tasks. This is a must-read for anyone in the field."
-- Brian Ray, Global Head of Data Science and Artificial Intelligence at Eviden
"This book weaves together concepts from distributed systems, machine learning, and site reliability engineering in a way that’s approachable for beginners and that’ll excite and inspire experienced practitioners. As soon as I finished reading, I was ready to start building."
-- James Lamb, Staff Data Engineer at SpotHero
"Whatever your role is in the data ecosystem (scientist, analyst, or engineer), if you are looking to take your knowledge and skills to the next level, then this book is for you. This book is an amazing guide to the concepts and state-of-the-art when it comes to designing resilient and scalable, ML systems for both training and serving models. Regardless of what platform you may be working with, this book teaches you the patterns you should be familiar with when trying to scale out your systems."
-- Ryan Russon, Senior Manager of Model Training at Capital One
"AI is the new electricity, and distributed systems is the new power grid. Whether you are a research scientist, engineer, or product developer, you will find the best practices and recipes in this book to scale up your greatest endeavors."
-- Linxi "Jim" Fan, Senior AI Research Scientist at NVIDIA, Stanford PhD
"This book discusses various architectural approaches to tackle common data science problems such as scaling machine learning processes and building robust workflows and pipelines. It serves as an excellent introduction to the world of MLOps for data scientists and ML engineers who want to enhance their knowledge in this field."
-- Rami Krispin, Senior Data Science and Engineering Manager