This is a question about the order in which aggregation operations are applied across ranks in NCCL, and whether they can be made to run deterministically.
My team is particularly interested in maintaining determinism so that we can test for unexpected effects when we alter some aspect of our complicated LLM setup. Historically, one source of non-determinism we noticed was the order in which collective operations ran over ranks: floating-point addition is not associative, so a different reduction order accumulates rounding errors differently.
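To make the effect concrete, here is a minimal sketch in plain Python (not NCCL code; the helper name `reduce_in_order` and the per-rank values are made up for illustration). It sums the same four contributions in two different orders and gets two different results:

```python
def reduce_in_order(values, order):
    """Sum values[i] sequentially in the given order (hypothetical helper)."""
    total = 0.0
    for i in order:
        total += values[i]
    return total

# Pretend each entry is the contribution from one rank (illustrative values).
values = [1e16, 1.0, -1e16, 1.0]

# Order A: the first 1.0 is absorbed by rounding when added to 1e16.
a = reduce_in_order(values, [0, 1, 2, 3])

# Order B: the large values cancel first, so both 1.0 terms survive.
b = reduce_in_order(values, [0, 2, 1, 3])

print(a, b, a == b)
```

The values are exaggerated so the difference shows up in double precision; with real gradients in lower precision the discrepancy is much smaller per step, but it compounds over training.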
What we also noticed empirically, however, was that when we supplied an NCCL topology file, the orderings became deterministic, or at least the error accumulation issue we had previously encountered disappeared. I don't have the training runs in front of me right now, but we've been running multi-node determinism checks for some time, and in the past, when we did NOT supply the topology file, the runs would almost always diverge very quickly.
So, with this context, I have two questions.
Is our experience universally the case, or have we just been getting lucky? That is, does supplying an explicit topology file induce a deterministic ordering for a particular operation?
If so, what's the mechanism for this to happen? If not, could there be some alternative mechanism at play?