---
description: NVIDIA Collective multi-GPU Communication Library
---

# NCCL

## Communication primitives

1. Point-to-point communication (1 sender + 1 receiver).
2. Collective communication (multiple senders + multiple receivers): broadcast, gather, all-gather, scatter, reduce, all-reduce, reduce-scatter, all-to-all.

* **Reduce**: receive data from multiple senders and combine it onto a single node.
* **All-reduce**: receive data from multiple senders and combine it onto every node.
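
A minimal Python sketch of the reduce vs. all-reduce semantics described above (plain lists stand in for per-GPU buffers; the function names are illustrative, not the NCCL API):

```python
# Illustrative simulation of reduce vs. all-reduce semantics (not the NCCL API).
# Each inner list models one GPU's buffer; reduction is an element-wise sum.

def reduce(buffers, root=0):
    """Combine data from all ranks onto a single root rank."""
    combined = [sum(vals) for vals in zip(*buffers)]
    out = [buf[:] for buf in buffers]   # non-root buffers keep their own data
    out[root] = combined
    return out

def all_reduce(buffers):
    """Combine data from all ranks and leave the result on every rank."""
    combined = [sum(vals) for vals in zip(*buffers)]
    return [combined[:] for _ in buffers]

buffers = [[1, 2], [3, 4], [5, 6], [7, 8]]   # 4 GPUs, 2 elements each
print(reduce(buffers))      # only rank 0 holds [16, 20]
print(all_reduce(buffers))  # every rank holds [16, 20]
```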

Collective communication assumes the node topology is a fat tree, which gives the highest communication efficiency. Real topologies can be more complex, so ring-based collective communication is applied instead.

## Ring-based collectives

Ring-based collectives connect all nodes into a directed cyclic ring and transmit data sequentially around it:

GPU0 -> GPU1 -> GPU2 -> GPU3 -> GPU0

Assume the data size is N, the link bandwidth is B, and the ring contains K GPUs. Passing the whole buffer around the ring then takes (K - 1) * N / B in total.
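
Plugging hypothetical numbers into the formula (the data size and bandwidth below are assumptions for illustration):

```python
# Un-pipelined ring: the whole buffer hops K - 1 times, so time = (K - 1) * N / B.
K = 4                  # GPUs in the ring
N = 1 * 1024**3        # data size: 1 GiB (assumed)
B = 16 * 1024**3       # per-link bandwidth: 16 GiB/s (assumed)

print((K - 1) * N / B)   # 0.1875 s, and it grows linearly with K
```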

### Transmit N/S data each time

Assume the data is instead split into S chunks of size N/S and pipelined around the ring. The total time is S * (N / (S * B)) + (K - 2) * (N / (S * B)) = N * (S + K - 2) / (S * B). When S >> K, this approaches N / B, which means the communication time no longer grows with the number of nodes.
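
A quick check of the pipelined formula with the same assumed numbers, sweeping the number of chunks S:

```python
# Pipelined ring: S chunks of size N/S, total time = N * (S + K - 2) / (S * B).
K = 4                  # GPUs in the ring
N = 1 * 1024**3        # data size: 1 GiB (assumed)
B = 16 * 1024**3       # per-link bandwidth: 16 GiB/s (assumed)

for S in (1, 4, 64, 1024):
    t = N * (S + K - 2) / (S * B)
    print(f"S={S:5d}: {t:.4f} s")
# S=1 reproduces the un-pipelined 0.1875 s; as S >> K the time approaches
# N / B = 0.0625 s, independent of the number of GPUs.
```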

### How to form a ring

* Single node, 4 GPUs connected over PCIe.
* Single node, 8 GPUs connected through 2 PCIe switches.

## NCCL implementation

* 3 primitives: Copy, Reduce, and ReduceAndCopy.
* Starting from NCCL 2.0, multi-node, multi-GPU communication is supported.

### Form communication ring

* The latency of the flat ring algorithm grows with the number of GPUs, so NCCL 2.4 introduced new algorithms such as a 2D ring algorithm to replace it (a rough step-count comparison is sketched below).
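
A rough step-count comparison, under the simplifying assumption that a flat ring all-reduce needs about 2(K - 1) sequential steps while a 2D ring over an R x C grid needs about 2(R - 1) + 2(C - 1); the constants are illustrative, not NCCL's exact cost model:

```python
# Latency steps (illustrative assumption): flat ring ~ 2*(K-1),
# 2D ring over an R x C grid ~ 2*(R-1) + 2*(C-1).
import math

for K in (16, 64, 256, 1024, 4096):
    R = C = math.isqrt(K)                      # assume a square grid of GPUs
    flat = 2 * (K - 1)
    two_d = 2 * (R - 1) + 2 * (C - 1)
    print(f"K={K:5d}: flat ring ~{flat:5d} steps, 2D ring ~{two_d:4d} steps")
```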

## Practices

* `torch.distributed` supports 3 native backends: NCCL, Gloo, and MPI (see the minimal example after this list).
* For ML model training, it is suggested to build a recent NCCL (>= 2.4) from source.
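
A minimal `torch.distributed` sketch using the NCCL backend, assuming PyTorch with CUDA and at least two GPUs on one node (launch with `torchrun --nproc_per_node=2 allreduce_demo.py`; the file name is arbitrary):

```python
# Minimal all-reduce over the NCCL backend via torch.distributed.
# Assumes PyTorch with CUDA and >= 2 GPUs; torchrun sets RANK/LOCAL_RANK/WORLD_SIZE.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")          # NCCL backend for GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a different tensor; all_reduce sums it in place on every rank.
    x = torch.ones(4, device="cuda") * (dist.get_rank() + 1)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: {x.tolist()}")   # every rank prints the same sum

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```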

## Reference

{% embed url="https://www.zhihu.com/question/63219175/answer/206697974" %}

{% embed url="https://developer.nvidia.com/blog/massively-scale-deep-learning-training-nccl-2-4/" %}