First note: our `horovod_num_processes` is actually not only for Horovod but for distributed training in general (this is a separate issue, we should rename it: #456).

In `create_returnn_config`, we do this: [code snippet missing]

This is a problem, because RETURNN then assumes the TF backend in several places (logging, dataset). I just pushed a commit to RETURNN (rwth-i6/returnn@9c72180) to work around this issue, so this might be solved now (needs more testing). However, I think this is still not quite correct in general.

Note that, in principle, PyTorch could also use Horovod, since Horovod supports PyTorch. This would probably be configured via the `torch_distributed` setting. This is currently not supported.

Also note that TensorFlow supports other ways of doing distributed training. We partly support those, although they are not well tested, and we usually use Horovod.

I'm not sure how to solve this now. Maybe `ReturnnTrainingJob` should not always set this? But that would break all hashes now.
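
To make the "should not always set this" idea concrete, here is a minimal sketch of a backend-aware variant. It is not the actual `create_returnn_config` code: the helper name `apply_distributed_settings` is hypothetical, the config is modelled as a plain dict, and the keys `use_horovod` and `backend` are assumptions about what the job writes and how the PyTorch backend is selected. Only `horovod_num_processes` and `torch_distributed` are taken from the text above.

```python
from typing import Any, Dict, Optional


def apply_distributed_settings(
    config: Dict[str, Any], horovod_num_processes: Optional[int]
) -> Dict[str, Any]:
    """Hypothetical helper (not i6_core code): decide whether to enable Horovod.

    Assumptions: "use_horovod" is the flag the job currently sets, and a
    PyTorch setup is recognizable by backend == "torch" or by the presence
    of a "torch_distributed" entry.
    """
    result = dict(config)  # copy, so the caller's config is left untouched

    if horovod_num_processes is None:
        # No distributed training requested: nothing to add.
        return result

    uses_torch = result.get("backend") == "torch" or "torch_distributed" in result
    if not uses_torch:
        # Only the TF case gets the Horovod flag in this sketch.
        result["use_horovod"] = True
    return result


tf_config = apply_distributed_settings({"batch_size": 1000}, horovod_num_processes=4)
torch_config = apply_distributed_settings({"backend": "torch"}, horovod_num_processes=4)
print(tf_config)
print(torch_config)
```

A check like this would keep the Horovod flag out of PyTorch configs, but it would not by itself resolve the hash concern raised above.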