Enable multi-gpu training when "torch" is chosen as the RETURNN backend #444

Judyxujj · 2023-08-18T15:09:47Z

Launching torch distributed data parallel (DDP) training with the current mpirun setup fails with ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set. The env:// rendezvous reads RANK from the environment of each worker; mpirun does not set that variable, whereas torchrun does. Therefore, to enable multi-GPU training with the torch backend, ReturnnTrainingJob will now launch DDP training with torchrun whenever backend = "torch" is detected in the RETURNN config.
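The launcher switch described above could look roughly like the following. This is only a minimal sketch, not the actual ReturnnTrainingJob code: the helper name build_launch_command, its parameters, and the single-node assumption (--standalone, --nnodes=1) are hypothetical and chosen for illustration.

```python
from typing import List


def build_launch_command(
    backend: str, num_processes: int, returnn_entry: str, config_path: str
) -> List[str]:
    """Hypothetical helper: choose a launcher for a RETURNN training run.

    Assumes all processes run on one machine; this is not the real
    ReturnnTrainingJob implementation.
    """
    if num_processes <= 1:
        # Single process: no distributed launcher needed.
        return ["python3", returnn_entry, config_path]
    if backend == "torch":
        # torchrun exports RANK, LOCAL_RANK and WORLD_SIZE for each worker,
        # which is what the env:// rendezvous expects.
        return [
            "torchrun",
            "--standalone",
            "--nnodes=1",
            f"--nproc_per_node={num_processes}",
            returnn_entry,
            config_path,
        ]
    # Other backends keep the previous mpirun-based launch.
    return ["mpirun", "-np", str(num_processes), "python3", returnn_entry, config_path]


if __name__ == "__main__":
    print(" ".join(build_launch_command("torch", 4, "rnn.py", "returnn.config")))
```

With backend "torch" and more than one process the sketch returns a torchrun command; any other backend still gets the mpirun command, so existing non-torch setups would be unaffected.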


Judyxujj self-assigned this Aug 18, 2023

Judyxujj linked a pull request Aug 18, 2023 that will close this issue

Enable multi-gpu training when "torch" is chosen as the RETURNN backend #445

Merged

christophmluscher mentioned this issue Oct 20, 2023

Rename variable horovod_num_processes in ReturnnTrainingJob and ReturnnRasrTrainingJob #456

Open

curufinwe closed this as completed in #445 Oct 26, 2023
