
ReturnnTrainingJob with multiple processes (distributed training) sets use_horovod also for Torch #461

Open
albertz opened this issue Nov 26, 2023 · 0 comments


albertz commented Nov 26, 2023

First, note that our horovod_num_processes option is actually not only for Horovod but for any kind of distributed training (renaming it is a separate issue: #456).

In create_returnn_config, we do this:

        if horovod_num_processes is not None:
            config["use_horovod"] = True

This is a problem, because RETURNN then assumes the TF backend in several places (logging, dataset). I just pushed a commit on RETURNN (rwth-i6/returnn@9c72180) to work around this issue, so this might be solved now (needs more testing). However, I think this is still not quite correct in general.
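For illustration, the generated RETURNN config can currently end up with a mix of settings like the following. This is only a minimal sketch; the exact keys that create_returnn_config writes may differ.

    # Sketch of the conflicting settings the job can currently produce
    # (illustrative only, not the literal output of create_returnn_config):
    backend = "torch"    # the user selected the PyTorch backend
    use_horovod = True   # set unconditionally because horovod_num_processes is not None
    # RETURNN then takes Horovod/TF-specific code paths (e.g. logging, dataset handling)
    # even though the Torch backend is in use.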

Note that in principle PyTorch could also use Horovod; Horovod has PyTorch support. This would probably be configured via the torch_distributed setting, but that is currently not supported.
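For comparison, the native torch.distributed path in RETURNN is enabled via the torch_distributed config setting. A minimal sketch (the empty dict just enables the defaults; further options exist but I leave them out here):

    # Minimal sketch of a RETURNN config using native torch.distributed (not Horovod):
    backend = "torch"
    torch_distributed = {}  # enable distributed data-parallel training with default options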

Also note that TensorFlow supports other mechanisms for distributed training as well; we partly support those, although they are not well tested, and we usually use Horovod.

I'm not sure how to solve this properly. Maybe ReturnnTrainingJob should not always set use_horovod? But changing that now would break all existing hashes.
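One possible (hash-breaking) variant would be to only set use_horovod when the TF backend is in use. A purely hypothetical sketch of how the guard in create_returnn_config could look; the way the backend is looked up here is an assumption, not existing code:

    # Hypothetical sketch of a backend-aware guard (not existing code).
    # Any change along these lines would alter the job hash for existing setups.
    if horovod_num_processes is not None:
        backend = config.get("backend", "tensorflow")  # hypothetical lookup; default mirrors old behavior
        if backend.startswith("tensorflow"):
            config["use_horovod"] = True
        # The Torch case would presumably go through torch_distributed instead.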
