The jobs were extended to enable multi-GPU usage for the torch backend (see #444 and #445). The `horovod_num_processes` variable name is now incorrect. This change needs to be done carefully, since it is a potentially hash-breaking change.

Analogous to `distributed_launch_command`, rename `horovod_num_processes` to `distributed_num_processes`? @albertz @Judyxujj @JackTemaki comments?

OK, we could maybe define a custom `_sis_hash` for `ReturnnTrainingJob` which uses exactly the old name (`horovod_num_processes`), and then in `__init__` we can do any handling we want for kwargs, supporting both the new name and the old name. I think something like this would work.
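A minimal sketch of that idea, not using the real sisyphus/i6_core API: the constructor accepts both kwarg names, while the hash is always computed from a dict keyed by the old name so existing setups keep their hashes. The hashing here is a stand-in (`hashlib` over a sorted kwargs dict), not sisyphus' actual mechanism.

```python
import hashlib


class ReturnnTrainingJob:
    """Toy stand-in for the real job, illustrating a hash-preserving rename."""

    def __init__(self, *, distributed_num_processes=None,
                 horovod_num_processes=None):
        # Accept both the new and the deprecated kwarg name.
        if horovod_num_processes is not None:
            assert distributed_num_processes is None, \
                "pass only one of horovod_num_processes / distributed_num_processes"
            distributed_num_processes = horovod_num_processes
        self.distributed_num_processes = distributed_num_processes

    def _sis_hash(self):
        # The hash always uses the *old* name, so jobs created with the
        # new kwarg still hash identically to existing ones.
        d = {"horovod_num_processes": self.distributed_num_processes}
        return hashlib.sha256(repr(sorted(d.items())).encode()).hexdigest()
```

With this, `ReturnnTrainingJob(horovod_num_processes=4)` and `ReturnnTrainingJob(distributed_num_processes=4)` produce the same hash, which is the point of the workaround.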