tf.keras-Distributed-Training

Accompanies this report.

This repository shows how to seamlessly integrate tf.distribute.MirroredStrategy to distribute training workloads across multiple GPUs for tf.keras models. Distributed training is particularly useful when your dataset is very large and the cost of training grows with it: training on a single hardware accelerator (a single GPU in this case) becomes unrealistic, hence the need for distributed training.

TensorFlow's distribution strategies make it much easier to seamlessly scale heavy training workloads across multiple hardware accelerators -- be it GPUs or even TPUs. That said, distributed training of neural networks has long been a challenge. The primary challenges are as follows:

  • How are we going to distribute the model parameters across the different devices?
  • How are we going to accumulate the gradients during backpropagation?
  • How are the model parameters going to be updated?

All of this may sound daunting if you think of the training process end-to-end. Thankfully, libraries like TensorFlow let us incorporate distributed training very easily -- be it for tf.keras models with the classic compile and fit paradigm or for custom training loops. This report, however, only deals with the former (a minimal sketch follows below). If you are interested in learning more about distributed training with custom training loops, be sure to check out this tutorial.
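The sketch below illustrates the idea with the compile/fit paradigm: create a MirroredStrategy, build and compile the model inside its scope, and call fit() as usual. It is a minimal, illustrative example and not the exact training script from the report; the simple convolutional model, preprocessing, batch size, and data split are assumptions made for brevity.

# Minimal sketch: MirroredStrategy with the regular compile/fit paradigm.
# The model, preprocessing, and split below are illustrative assumptions.
import tensorflow as tf
import tensorflow_datasets as tfds

strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Scale the global batch size with the number of replicas so that each
# GPU processes the same per-replica batch size.
PER_REPLICA_BATCH = 32
GLOBAL_BATCH = PER_REPLICA_BATCH * strategy.num_replicas_in_sync

def preprocess(image, label):
    # Resize and normalize the Cats vs. Dogs images.
    image = tf.image.resize(image, (224, 224))
    image = tf.cast(image, tf.float32) / 255.0
    return image, label

train_ds = (
    tfds.load("cats_vs_dogs", split="train[:80%]", as_supervised=True)
    .map(preprocess, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .batch(GLOBAL_BATCH)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

# Model creation and compilation must happen inside the strategy scope
# so that the model variables are mirrored across the GPUs.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu",
                               input_shape=(224, 224, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )

# fit() works exactly as in the single-GPU case; the strategy takes care
# of replicating the forward/backward passes and aggregating gradients.
model.fit(train_ds, epochs=5)

Note that the only distribution-specific pieces are creating the strategy, building the model under strategy.scope(), and (optionally) scaling the batch size by num_replicas_in_sync; everything else is the familiar tf.keras workflow.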

Dataset used

Cats vs. Dogs

Hardware

  • n1-standard-4 (4 vCPUs, 15 GB RAM)
  • 4x Tesla K80 GPUs
  • Preconfigured image: TensorFlow 2.1 (with Intel MKL-DNN/MKL and CUDA 10.1)

Acknowledgements

  • ML-GDE Program (learn more about the GDE program here) for providing the GCP Cloud Credits; without them, these experiments (all of them were run on GCP) wouldn't have been possible.
  • Martin Gorner for his guidance.
