Distributed Computing Overview
Albert Zeyer edited this page Nov 29, 2023
This page provides an overview of different distributed-computing techniques. Some of them are already supported, some partially supported, and some not yet implemented (though all could be done).
- Distributed PyTorch
- RETURNN multi-GPU training (using Horovod)
- Distributed TensorFlow
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding: paper
- GSPMD: General and Scalable Parallelization for Neural Networks: blog, paper.
- Mesh TensorFlow
- TensorFlow DTensor
- Pathways: Asynchronous Distributed Dataflow for ML: blog, paper. Used by PaLM; closed source
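To make the first item above concrete, here is a minimal sketch of distributed data-parallel training in PyTorch with `DistributedDataParallel`. It is not the RETURNN integration, just an illustration of the underlying API: it sets up a single-process `gloo` group (`world_size=1`) so it runs on any CPU machine; in real multi-GPU training you would launch one process per GPU (e.g. via `torchrun`) and use the `nccl` backend.

```python
# Hedged sketch: single-process DistributedDataParallel, for illustration only.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def train_step():
    # Rendezvous info; torchrun would normally set these environment variables.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # world_size=1 here; in practice, one process per GPU with nccl backend.
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = torch.nn.Linear(4, 2)
    # DDP all-reduces (averages) gradients across ranks during backward().
    ddp_model = DDP(model)
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    x = torch.randn(8, 4)
    y = torch.randn(8, 2)
    loss = torch.nn.functional.mse_loss(ddp_model(x), y)
    loss.backward()  # gradient synchronization happens here
    opt.step()

    dist.destroy_process_group()
    return loss.item()


if __name__ == "__main__":
    print("loss:", train_step())
```

With more than one rank, each process would feed a different data shard (e.g. via `DistributedSampler`), while DDP keeps the replicated parameters in sync by averaging gradients in `backward()`.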