Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support torchrun+SLURM multi-node trainings in
ReturnnTrainingJob
#552base: main
Are you sure you want to change the base?
Support torchrun+SLURM multi-node trainings in
ReturnnTrainingJob
#552Changes from 1 commit
b55c1cf
d049f01
c6b4190
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the whole logic is dependent on the cluster and maybe even on the specific configuration at our site.. could the cluster dependent logic somehow be abstracted in the sisyphus engine s.t. it can be implemented for other schedulers should we decide to switch in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently the information a job has about the environment it is running in is quite limited, this would be a first.
I could see an interface like:
I suppose array tasks would then see single-entry items in there, and be given information about how many array tasks they are and which one they are right now? And the information in there would then be filled out by the currently active engine, e.g. by querying the environment variables or using other kinds of APIs.
Tbqh I feel this API risks being reduced to the lowest common denominator between all the different schedulers there are. Also the API can be limiting. What if the nodes assigned to a job change dynamically on the fly, e.g. if broken nodes are put back into rotation? Maybe it's better after all to push this responsibility into the job (as it is now) and then adapt the job for different types of schedulers as needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do agree though, it's a bummer how much this mixes up generic job responsibilities and engine-specific responsibilities currently.
Especially the part on
NCCL_NET
is bad. NCCL seems to default to InfiniBand (IB) communication, even if there is no IB installed on the nodes. Then it crashes and you have to force it to use ethernet viaNCCL_NET=socket
(could be a peculiarity on our nodes though?). This env variable needs to be set on the training job, and it does not seem like setting it viags.check_engine_limits
works. Probably the job has already been pickled by that point, which is why changing the env from there no longer has an effect. Currently I set it in my regular training pipeline viajob.set_env(...)
, but I find this quite annoying as it ties the job very closely to the nodes that it is going to run on.Maybe we could extend the engines to set certain env variables on a per-cluster partition basis? I'm not sure. Or we improve RETURNN to try and detect the presence of IB and fall back to eth automatically if needed. I thought this would already be the default in NCCL. cc @albertz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you show a list of all the env vars which are set in such a multi node job? Maybe there are some others which are more generic (not so specific about Slurm).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, NCCL normally figures these things out by itself. Did you file a NCCL issue (https://github.com/NVIDIA/nccl/issues; they are usually quite responsive; code is actually also not so hard to understand)? Did you check the NCCL debug logs (
NCCL_DEBUG=INFO
)?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Horovod, we used mpirun, because usually, mpirin in a Cluster is better connected with the scheduling system, and usually that should always work, and MPI then should not about all the available nodes. Maybe that would be another option. Either to use mpirun directly instead of torchrun, or to somehow use MPI to get this info in some other way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree.
No engine and no Job currently supports this. But support can be added if need be. If it becomes relevant we could even define capabilities that a Job can request and then it can only be scheduled on engines that support these capabilities.
I strongly disagree. The point of Jobs is to define what is executed and sisyphus+engine care about the how
And besides:
if that is the only problem, there are engine independent ways around it. E.g. all tasks atomically write their hostname into a file and then wait until there are N names in the file. The first name becomes the master.
One could even implement a node drop out and be replaced scheme when writing time stamps and any node that does not update their timestamp in a period is dropped 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure there are relevant vars that are not SLURM-specific.
Okay, would you then rather have the runtime interface or the file-based solution? The latter sounds to me like quite the hack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see some MPI related env vars (HYDRA, OMPI, I_MPI). Probably MPI would be able to tell you the list of nodes in some way. That would already give you the runtime interface you are asking for, in a very standardized way (MPI should always be available in such cluster setting, esp with multi node). I.e. no need to introduce anything new here, no need to reinvent the wheel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I like this idea. I'm going to explore it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand, why is this need? You already have the nodes already (
partaking_nodes
)? How isnodes
different?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
partaking_nodes
is in SLURMs short abbreviated format, e.g. you'll get something likeg-[11-13],g-16
, and that command "decodes" the short form into the full list of proper node names.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case, I agree to @michelwi (see his comment above), we should not make this logic dependent on Slurm here in the job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed? Can't rdzv should pick some own random port? What happens when you just do:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All nodes need to pick the same port for the processes to find themselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you think that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean one node is the master node, to which the other nodes connect to. If the master node picks a random port, how are the other nodes supposed to establish a TCP connection to the
torchrun
process on that node? This is why you specify the RDZV-address as a tuple of host and port, no?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand why you need to specify that at all. With
mpirun
, we did not need that. It somehow did that automatically. I think SGE/Slurm did that already. Not sure how exactly it worked.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part here is also quite annoying. If you don't set the env var (at least for Gloo), you get a crash on the AppTek cluster. Gloo is also supposed to handle this automatically for you, but it does not seem like it works well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this. What crash? What is the problem? This sounds more like sth the admins should fix?
When reading how NCCL does the logic, it sounds almost the same as what you do here? (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html) So what's different here? Did you file an NCCL issue?
Also such NCCL issue should be referenced here in a code comment, explaining why NCCL does not work, and then pointing to the NCCL issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To expand here: This sounds like you just have something wrongly configured on your side. (Maybe also a NCCL bug, but even if so, probably could be worked around by some NCCL settings.) Thus I don't think this belongs here, but rather you should fix this in your env. (Unless you show me good reasons that this is not possible.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it can use multiple interfaces?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue here is that all processes need to specify the same number of interfaces according to the docs:
If we just specify one interface (for now) we don't break on setups where some nodes have different numbers of UP network interfaces than others. I wonder if this situation arises in practice. I guess ideally we'd always want to use the maximum number of available interfaces, but I'm not sure how to best find this out in distributed training scenaria.