Enable Horovod within the Merlin Models API #651

Closed
viswa-nvidia opened this issue Oct 6, 2022 · 1 comment · Fixed by NVIDIA-Merlin/models#825

@viswa-nvidia

No description provided.

@edknv
Contributor

edknv commented Oct 10, 2022

To run a vanilla TensorFlow/Keras training script on multiple GPUs with Horovod, one would make the changes listed below (a condensed sketch in code follows the list). One potential approach is to hide all of these details behind the Models API and let Models take care of them automatically for users. That is, a single-GPU training script that uses Merlin without any reference to Horovod should be able to run on multiple GPUs, with everything (e.g., hvd.init(), learning-rate scaling, etc.) handled automatically for the user, as long as the user launches the script with, e.g., horovodrun -np 4 python merlin_training.py.

Horovod with Keras

  1. Run hvd.init().
    • Solution: Check if horovod is installed. If so, run hvd.init() anywhere within Models. If not, skip this and all the following steps.
  2. Pin each GPU to a single process.
    • The official Horovod docs recommend using tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU').
    • set_visible_devices does not seem to work with the naive approach of calling it first thing in Models (i.e., in __init__.py). Some investigation is needed to make this work so that users don't have to call it themselves.
    • A last resort is setting the environment variable os.environ['CUDA_VISIBLE_DEVICES'] = str(hvd.local_rank()). This is not ideal because it must be set before TensorFlow is imported; if users import TensorFlow before Merlin, it breaks, so we would still have to guide users to set the variable themselves.
  3. Scale the learning rate by the number of workers, e.g., scaled_lr = 0.001 * hvd.size().
    • Solution: We can do this automatically inside Models' fit().
  4. Wrap the optimizer in hvd.DistributedOptimizer.
    • Solution: If horovod is installed and hvd.size() > 1, we wrap the optimizer that the user passed to fit() in hvd.DistributedOptimizer.
    • We additionally implement our own merlin.models.DistributedOptimizer in case users want to specify some optional parameters in DistributedOptimizer. This will be a thin wrapper around horovod's distributed optimizer for now, but we can extend this to support distributed embedding, which is on the roadmap for 22.12 (?).
  5. Add hvd.callbacks.BroadcastGlobalVariablesCallback(0) to fit() callbacks.
    • Solution: If horovod is installed and hvd.size() > 1, append BroadcastGlobalVariablesCallback to callbacks.
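
For reference, the manual changes above look roughly like this in a plain Keras training script (a condensed sketch following the Horovod Keras docs; build_model and build_dataset are placeholders for user code):

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# 1. Initialize Horovod.
hvd.init()

# 2. Pin this process to a single GPU (one process per GPU).
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = build_model()      # placeholder: user-defined Keras model
dataset = build_dataset()  # placeholder: this worker's shard of the training data

# 3. Scale the learning rate by the number of workers.
scaled_lr = 0.001 * hvd.size()

# 4. Wrap the optimizer so gradients are averaged across workers.
optimizer = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(learning_rate=scaled_lr))
model.compile(optimizer=optimizer, loss="binary_crossentropy")

# 5. Broadcast initial variable states from rank 0 so all workers start in sync.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

model.fit(dataset, epochs=1, callbacks=callbacks)
```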

More details at https://horovod.readthedocs.io/en/stable/keras.html.
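
On the Models side, the conditional handling described in the Solution notes above could look roughly like the following. This is only a sketch: the module path and helper name (maybe_distribute) are hypothetical, and assume Horovod is treated as an optional dependency.

```python
# Hypothetical internal module, e.g. merlin/models/tf/distributed/backend.py -- names are illustrative.
try:
    import horovod.tensorflow.keras as hvd

    hvd.init()  # step 1: safe to call once at import time if Horovod is available
    hvd_installed = True
except ImportError:
    hvd = None
    hvd_installed = False


def maybe_distribute(optimizer, callbacks, learning_rate=None):
    """Apply steps 3-5 only when the script is actually running under Horovod."""
    if not hvd_installed or hvd.size() <= 1:
        return optimizer, callbacks
    if learning_rate is not None:
        optimizer.learning_rate = learning_rate * hvd.size()  # scale LR by worker count
    optimizer = hvd.DistributedOptimizer(optimizer)
    callbacks = list(callbacks) + [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    return optimizer, callbacks
```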

Challenges with DataLoader

  • DataLoader has parameters global_size and global_rank, which can be set using hvd.size() and hvd.rank() (see the sketch after this list).
  • DataLoader produces an unequal number of batches per partition when the dataset is partitioned, which is problematic for Horovod: one worker may finish all of its batches and then sit idle, hang, or time out while the other worker(s) are still processing theirs. There are possible workarounds on the dataloader side (seeding, as mentioned in [BUG] Data parallel training freezes due to different number of batches dataloader#75) or on the Horovod side (hvd.join()), but the best solution is for the dataloader to produce an equal number of batches per partition.
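
A minimal sketch of the partitioned dataloader setup, assuming the TensorFlow Loader from merlin-dataloader (the import path, file path, and batch size are assumptions; global_size and global_rank are the parameters mentioned above):

```python
import horovod.tensorflow.keras as hvd
from merlin.io import Dataset
from merlin.loader.tensorflow import Loader  # import path assumed; adjust to the dataloader in use

hvd.init()

train = Dataset("train/*.parquet")  # placeholder path

# Each worker reads its own 1/hvd.size() partition of the dataset.
loader = Loader(
    train,
    batch_size=1024,
    shuffle=True,
    global_size=hvd.size(),  # total number of workers partitioning the data
    global_rank=hvd.rank(),  # this worker's partition index
)
```

The seeding workaround from dataloader#75 would amount to a seed callable on the loader that returns the same shuffle seed on every worker (e.g., derived via hvd.allreduce), but the cleaner fix is still for the dataloader to emit an equal number of batches per partition.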

Proposed API

  • Users do not have to make any modifications to their existing training script to run it on multiple GPUs (see the sketch after this list).
  • To run the script on multiple GPUs, the script is executed with horovodrun -np <number of GPUs> python script.py. If the script is executed normally (i.e., python script.py), it runs on a single GPU.
  • We additionally implement our own merlin.models.DistributedOptimizer in case users want to specify some optional parameters in DistributedOptimizer.
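
A sketch of the intended user experience under this proposal; the model definition is illustrative (the block names and the 'click' target are placeholders), and all of the Horovod handling described above happens inside Models:

```python
# train.py -- no Horovod imports or calls anywhere in the user's code.
import merlin.models.tf as mm
from merlin.io import Dataset

train = Dataset("train/*.parquet")  # placeholder path
schema = train.schema

model = mm.Model(
    mm.InputBlock(schema),
    mm.MLPBlock([64, 32]),
    mm.BinaryClassificationTask("click"),
)
model.compile(optimizer="adam")

# When launched under horovodrun, Models would call hvd.init(), pin the GPU,
# scale the learning rate, wrap the optimizer, and add the broadcast callback.
model.fit(train, batch_size=1024)
```

Running python train.py trains on a single GPU; horovodrun -np 4 python train.py runs the same unmodified file on four GPUs.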
