torch: ProcessGroup

Definition

A ProcessGroup in PyTorch is an abstraction that represents a group of processes that can communicate with each other. These processes can be on the same machine or spread across multiple machines. The primary role of a ProcessGroup is to handle and coordinate the communication between these processes, which is essential for tasks like synchronized updates of model parameters during distributed training.
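For intuition, here is a minimal sketch, assuming the script is launched with a tool such as torchrun (which supplies RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT as environment variables), that creates the default ProcessGroup and asks it where this process sits in the group:

import torch.distributed as dist

# Join the default ProcessGroup; rank, world size, and master address
# are read from environment variables set by a launcher such as torchrun.
dist.init_process_group(backend='gloo', init_method='env://')

rank = dist.get_rank()               # this process's index within the group
world_size = dist.get_world_size()   # total number of processes in the group
print(f'process {rank} of {world_size} joined the default group')

dist.destroy_process_group()         # tear the group down when finished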

Functions of ProcessGroup

  1. Communication Backend: The ProcessGroup abstracts the communication backend (e.g., MPI, Gloo, or NCCL) used for inter-process communication. Each backend has different capabilities and is optimized for different environments. For instance, NCCL is highly efficient for collective operations such as all-reduce on GPUs, whereas Gloo is a more general-purpose backend that works on both CPUs and GPUs.
  2. Collective Operations: It supports various collective communication operations (see the sketch after this list), such as:
    • all_reduce: Reduces the tensor data across all processes so that every process receives the result.
    • broadcast: Sends a tensor from one process to all other processes.
    • all_gather: Gathers tensors from all processes and distributes the gathered result to all of them.
    • reduce: Reduces the tensor data across all processes, but only the destination process receives the result.
  3. Group Management: It manages the group of processes involved in distributed training: initializing the default group, creating subgroups of specific ranks (e.g., with new_group), and destroying groups once training is finished.
  4. Synchronizing Model Parameters: After each backward pass, DistributedDataParallel (DDP) uses the ProcessGroup to all-reduce gradients across all processes, so that every process (and thus every model replica) applies the same update and stays consistent with the others.
  5. Efficiency and Scalability: By efficiently managing the communication between processes, the ProcessGroup helps scale training to larger models and datasets, potentially across many GPUs and machines.
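The following sketch illustrates a few of these collective operations and subgroup creation. It assumes dist.init_process_group has already been called with the Gloo backend (so plain CPU tensors can be used) by at least two processes:

import torch
import torch.distributed as dist

rank = dist.get_rank()

# all_reduce: every process contributes its tensor and receives the elementwise sum
t = torch.tensor([float(rank)])
dist.all_reduce(t, op=dist.ReduceOp.SUM)      # t now holds the sum of all ranks

# broadcast: copy rank 0's tensor to every other process
buf = torch.tensor([42.0]) if rank == 0 else torch.zeros(1)
dist.broadcast(buf, src=0)                    # every process now holds 42.0

# group management: a subgroup containing only ranks 0 and 1
# (new_group must be called by every process, even those not in the subgroup)
subgroup = dist.new_group(ranks=[0, 1])
if rank in (0, 1):
    dist.all_reduce(t, group=subgroup)        # collective restricted to the subgroup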

Usage

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# init_process_group initializes the default ProcessGroup (backend, init_method)
dist.init_process_group(backend='nccl', init_method='env://')

# pin this process to its GPU and wrap the model so DDP uses the ProcessGroup
local_rank = int(os.environ['LOCAL_RANK'])    # set by launchers such as torchrun
model = torch.nn.Linear(10, 10).to(local_rank)  # placeholder model
model = DDP(model, device_ids=[local_rank])
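With init_method='env://', the master address, port, rank, and world size are read from environment variables that a launcher such as torchrun sets for each process before the script starts. DDP then performs its gradient all-reduce through the default ProcessGroup created here.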