NCCL Topology File, Rank Ordering, and Determinism #1461

Open
asjaffe opened this issue Sep 25, 2024 · 0 comments

asjaffe commented Sep 25, 2024

This is a question about the order in which NCCL aggregates over ranks during collective operations, and whether that order can be made deterministic.

My team is particularly interested in maintaining determinism so that we can test for unexpected effects when we alter some aspect of our complicated LLM setup. Historically, one source of non-determinism we noticed was the order in which collective operations ran over ranks: floating-point addition is not associative, so a different order accumulates rounding error differently.
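To make the ordering effect concrete, here is a minimal sketch in plain Python (no NCCL involved). The function name `reduce_in_order` and the per-rank values are made up for illustration; the values are chosen so that the effect is large, whereas in real training the differences are usually tiny and compound over many steps.

```python
# Illustrative only: a sequential sum over "ranks" in a given order.
# This is NOT how NCCL implements its reductions; it only shows that
# floating-point addition is order-sensitive.

def reduce_in_order(contributions, order):
    """Sum per-rank contributions sequentially, visiting ranks in `order`."""
    acc = 0.0
    for rank in order:
        acc += contributions[rank]
    return acc

# Hypothetical per-rank values (one scalar per rank).
contributions = [1e16, 1.0, -1e16]

# Same values, two different rank orders.
sum_a = reduce_in_order(contributions, [0, 1, 2])  # (1e16 + 1.0) - 1e16
sum_b = reduce_in_order(contributions, [0, 2, 1])  # (1e16 - 1e16) + 1.0

print(sum_a, sum_b)  # the two results differ
```

In the first order the `1.0` is absorbed by rounding when added to `1e16`; in the second it survives because the large terms cancel first.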

What we also noticed empirically, however, is that when we supply an NCCL topology file, the ordering becomes deterministic, or at least the error-accumulation issue we previously encountered disappears. I don't have the training runs in front of me right now, but we have been running multi-node determinism checks for some time, and in the past, when we did NOT supply the topology file, runs would almost always diverge very quickly.
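For anyone trying to reproduce the comparison, a rough sketch of how a topology file can be supplied is below. It assumes the file is passed through the `NCCL_TOPO_FILE` environment variable and that `NCCL_TOPO_DUMP_FILE` is used to dump the topology NCCL ends up with, so the two configurations can be diffed. The helper name `nccl_topology_env` and both paths are hypothetical placeholders, not our actual setup.

```python
import os

def nccl_topology_env(topo_file, dump_file=None):
    """Return a copy of the current environment with NCCL topology settings.

    topo_file: path to the XML topology file to supply (placeholder path below).
    dump_file: optional path where NCCL should write the topology it detected.
    """
    env = dict(os.environ)
    env["NCCL_TOPO_FILE"] = topo_file
    if dump_file is not None:
        env["NCCL_TOPO_DUMP_FILE"] = dump_file
    return env

# Hypothetical paths; substitute the real ones for your cluster.
env = nccl_topology_env("/path/to/topo.xml", "/tmp/detected_topo.xml")

# The dict can then be passed to whatever launches the training processes,
# for example subprocess.run(cmd, env=env).
```

Diffing the dumped topology across runs with and without the supplied file might show whether the detected topology itself varies between runs.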

So, with this context, I have two questions.

  1. Is our experience universally the case, or have we just been getting lucky? That is, does supplying an explicit topology file induce a deterministic ordering for a particular operation?
  2. If so, what's the mechanism for this to happen? If not, could there be some alternative mechanism at play?