| description |
| --- |
| Nvidia Collective multi-GPU Communication Library |
- Point-to-point communication (1 sender + 1 receiver)
- Collective communication (multiple senders + multiple receivers): broadcast, gather, all-gather, scatter, reduce, all-reduce, reduce-scatter, all-to-all.
- Reduce: receive data from multiple senders and combine it onto one node.
- All-reduce: receive data from multiple senders and combine it onto every node.
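A minimal single-process sketch of these semantics (plain Python; the function names and buffers are illustrative, not NCCL's API):

```python
# Conceptual sketch: what reduce and all-reduce compute over per-rank buffers.
# Each inner list stands for one GPU's buffer; names are illustrative only.

def reduce(buffers, root=0):
    """Combine every rank's buffer element-wise onto the root rank only."""
    combined = [sum(vals) for vals in zip(*buffers)]
    return {root: combined}

def all_reduce(buffers):
    """Combine every rank's buffer element-wise; every rank gets the result."""
    combined = [sum(vals) for vals in zip(*buffers)]
    return {rank: list(combined) for rank in range(len(buffers))}

buffers = [[1, 2], [10, 20], [100, 200], [1000, 2000]]  # one buffer per GPU
print(reduce(buffers))      # {0: [1111, 2222]}
print(all_reduce(buffers))  # {0: [1111, 2222], 1: [1111, 2222], 2: ..., 3: ...}
```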
Collective communication assumes the node topology is a fat tree, which gives the highest communication efficiency. Real topologies can be more complex, so ring-based collective communication is applied instead.
Ring-based collectives form a directed cyclic ring over all nodes and transmit data sequentially around the ring.
GPU0 -> GPU1 -> GPU2 -> GPU3
Assume the data size is N, the bandwidth is B, and the ring has K nodes; sending the whole buffer around the ring takes (K - 1) * N / B.
If the data is instead split into S chunks of size N/S and pipelined around the ring, the total time is S * (N / (S * B)) + (K - 2) * (N / (S * B)) = N * (S + K - 2) / (S * B). When S >> K, this approaches N / B, which means the communication time does not grow with the number of nodes.
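As a quick numeric check of the formula (plain Python; the values of N, B, K, and S are made up for illustration):

```python
# Numeric check of the ring time model above. N, B, K, S are the symbols from
# the text; the concrete values below are illustrative only.

def naive_time(N, B, K):
    """Send the whole buffer (K - 1) hops around the ring."""
    return (K - 1) * N / B

def chunked_time(N, B, K, S):
    """Pipeline S chunks of size N/S: N * (S + K - 2) / (S * B)."""
    return N * (S + K - 2) / (S * B)

N, B, K = 1e9, 1e9, 8               # 1 GB of data, 1 GB/s links, 8 GPUs
print(naive_time(N, B, K))          # 7.0 s
print(chunked_time(N, B, K, 4))     # 2.5 s
print(chunked_time(N, B, K, 1024))  # ~1.006 s, approaching N / B = 1 s
```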
How to form a ring:
- Single node, 4 GPUs connected over PCIe
- Single node, 8 GPUs with 2 PCIe switches
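To see which ring NCCL actually builds for a given PCIe/NVLink layout, one option is to turn on NCCL's info logging before initialization (a sketch; `NCCL_DEBUG` is NCCL's standard logging environment variable, and the rest of the training setup is assumed to exist elsewhere):

```python
import os

# NCCL prints the rings/channels it builds when a communicator is initialized
# and NCCL_DEBUG=INFO is set. Set it before any NCCL communicator is created,
# e.g. before torch.distributed.init_process_group in a PyTorch job.
os.environ["NCCL_DEBUG"] = "INFO"

# ... then initialize and run multi-GPU training as usual; the ring order
# chosen for the PCIe / NVLink topology appears in the log output.
```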
- 3 primitives: Copy, Reduce, ReduceAndCopy
- Starting from NCCL 2.0, multi-node, multi-GPU communication is supported; the communication ring is formed across all nodes and their GPUs.
- The ring-based algorithm's latency grows linearly with the number of GPUs, so newer algorithms (hierarchical 2D rings, and the double binary trees introduced in NCCL 2.4) were added to complement the flat ring algorithm.
- `torch.distributed` supports 3 native backends: NCCL, Gloo, and MPI (see the sketch after this list).
- It is suggested to build a recent NCCL (>= 2.4) from source for ML model training.
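A minimal sketch of using the NCCL backend through `torch.distributed` (assumes one process per GPU launched with `torchrun`, which sets RANK, WORLD_SIZE, and LOCAL_RANK; the tensor contents are illustrative):

```python
# Sketch: select the NCCL backend in torch.distributed and run an all-reduce.
import os

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")   # NCCL backend for GPU tensors
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes its own value; after all_reduce every rank holds the sum.
    x = torch.full((4,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: {x.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Run with e.g. `torchrun --nproc_per_node=4 allreduce_demo.py` (filename is a placeholder). NCCL also reads environment variables such as `NCCL_DEBUG` (logging) and `NCCL_ALGO` (ring vs. tree selection) at runtime.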
{% embed url="https://www.zhihu.com/question/63219175/answer/206697974" %}
{% embed url="https://developer.nvidia.com/blog/massively-scale-deep-learning-training-nccl-2-4/" %}