NCCL Topology File, Rank Ordering, and Determinism #1461

asjaffe · 2024-09-25T18:45:29Z

This is a question about the order in which aggregation operations are applied across ranks in NCCL, and whether they can be made to run deterministically.

My team is particularly interested in maintaining determinism so that we can test for unexpected effects when we alter some aspect of our complicated LLM setup. Historically, one source of non-determinism we noticed was the order in which collective operations ran over ranks: floating-point addition is not associative, so a different order accumulates rounding error differently.
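To make the order-dependence concrete, here is a minimal sketch in plain Python (not NCCL code). The helper `reduce_in_order` and the sample values are hypothetical; each value stands in for one rank's contribution to a sum reduction, and the only thing that changes between the two calls is the order in which contributions are added.

```python
# Illustrative only: a sum "reduction" over per-rank values in a given order.
# reduce_in_order and the values below are hypothetical, not part of NCCL.

def reduce_in_order(values, order):
    """Accumulate values[i] in float64, visiting indices in the given order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total

# One contribution per "rank". The large magnitudes make rounding visible.
values = [1e16, 1.0, -1e16, 1.0]

sum_a = reduce_in_order(values, [0, 1, 2, 3])  # 1.0 is absorbed by 1e16 first
sum_b = reduce_in_order(values, [0, 2, 1, 3])  # large terms cancel first

print(sum_a, sum_b)    # 1.0 2.0
print(sum_a == sum_b)  # False
```

The inputs are identical in both calls, so any mechanism that changes the reduction order between runs can change the result bit-for-bit.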

What we also noticed empirically, however, is that when we supply an NCCL topology file, the orderings become deterministic, or at least the error-accumulation issue we previously encountered disappears. I don't have the training runs in front of me right now, but we have been running multi-node determinism checks for some time, and in the past, when we did NOT supply the topology file, they would almost always diverge very quickly.
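For readers who want to reproduce this kind of check, a minimal sketch follows. It assumes each run logs one scalar (for example, the loss) per step; `first_divergence` and the sample traces are hypothetical and are not our actual tooling. The comparison is bitwise rather than tolerance-based, since the point is to detect any difference at all.

```python
import struct

def float_bits(x):
    """Return the raw IEEE-754 float64 bytes of x, for exact comparison."""
    return struct.pack("<d", x)

def first_divergence(run_a, run_b):
    """Return the first step at which two per-step traces differ bitwise,
    or None if they agree over their common prefix."""
    for step, (a, b) in enumerate(zip(run_a, run_b)):
        if float_bits(a) != float_bits(b):
            return step
    return None

# Hypothetical per-step loss traces from two runs with identical seeds.
trace_1 = [2.5, 2.25, 2.125]
trace_2 = [2.5, 2.25, 2.125]
trace_3 = [2.5, 2.2500001, 2.13]

print(first_divergence(trace_1, trace_2))  # None
print(first_divergence(trace_1, trace_3))  # 1
```

A run that "diverges very quickly" in the sense above would report a small step index here.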

So, with this context, I have two questions.

1. Is our experience universally the case, or have we just been getting lucky? That is, does supplying an explicit topology file induce a deterministic ordering for a given operation?
2. If so, what is the mechanism? If not, could some alternative mechanism be at play?
