tf.keras-Distributed-Training

Accompanies this report.

This repository shows how to seamlessly integrate tf.distribute.MirroredStrategy to distribute training workloads across multiple GPUs for tf.keras models. Distributed training is particularly useful when your dataset is very large and the cost of training grows with it: training on a single hardware accelerator (a single GPU in this case) becomes unrealistic, hence the need for distributed training.

TensorFlow's distribution strategies make it much easier to seamlessly scale heavy training workloads across multiple hardware accelerators -- be it GPUs or even TPUs. That said, distributed training of neural networks has long been a challenge. The primary challenges are as follows:

  • How are we going to distribute the model parameters across the different devices?
  • How are we going to accumulate the gradients during backpropagation?
  • How are the model parameters going to be updated?

All of this may sound daunting if you think of the training process end-to-end. Thankfully, libraries like TensorFlow let us incorporate distributed training very easily -- be it for tf.keras models with the classic compile and fit paradigm or for custom training loops. This report, however, only deals with the former (a minimal sketch follows below). If you are interested in learning more about distributed training with custom training loops, be sure to check out this tutorial.
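The sketch below illustrates the idea with the compile/fit paradigm: create a MirroredStrategy, build and compile the model inside its scope, and call fit() as usual. It is a minimal, illustrative example and not the exact training script from the report; the simple convolutional model, preprocessing, batch size, and data split are assumptions made for brevity.

# Minimal sketch: MirroredStrategy with the regular compile/fit paradigm.
# The model, preprocessing, and split below are illustrative assumptions.
import tensorflow as tf
import tensorflow_datasets as tfds

strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Scale the global batch size with the number of replicas so that each
# GPU processes the same per-replica batch size.
PER_REPLICA_BATCH = 32
GLOBAL_BATCH = PER_REPLICA_BATCH * strategy.num_replicas_in_sync

def preprocess(image, label):
    # Resize and normalize the Cats vs. Dogs images.
    image = tf.image.resize(image, (224, 224))
    image = tf.cast(image, tf.float32) / 255.0
    return image, label

train_ds = (
    tfds.load("cats_vs_dogs", split="train[:80%]", as_supervised=True)
    .map(preprocess, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .batch(GLOBAL_BATCH)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

# Model creation and compilation must happen inside the strategy scope
# so that the model variables are mirrored across the GPUs.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu",
                               input_shape=(224, 224, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )

# fit() works exactly as in the single-GPU case; the strategy takes care
# of replicating the forward/backward passes and aggregating gradients.
model.fit(train_ds, epochs=5)

Note that the only distribution-specific pieces are creating the strategy, building the model under strategy.scope(), and (optionally) scaling the batch size by num_replicas_in_sync; everything else is the familiar tf.keras workflow.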

Dataset used

Cats vs. Dogs

Hardware

  • n1-standard-4 (4 vCPUs, 15 GB RAM)
  • 4x Tesla K80 GPUs
  • Preconfigured image: TensorFlow 2.1 (with Intel MKL-DNN/MKL and CUDA 10.1)

Acknowledgements

  • ML-GDE Program (learn more about the GDE program here) for providing the GCP Cloud Credits; without them, these experiments (all of them were run on GCP) wouldn't have been possible.
  • Martin Gorner for his guidance.
