Enable Horovod within the Merlin Models API #651

Closed
viswa-nvidia opened this issue Oct 6, 2022 · 1 comment · Fixed by NVIDIA-Merlin/models#825

@viswa-nvidia

No description provided.

@edknv
Contributor

edknv commented Oct 10, 2022

To run a vanilla TensorFlow/Keras training script on multiple GPUs with Horovod, one would make the changes listed below (a condensed sketch in code follows the list). One potential approach is to hide all of these details behind the Models API and let Models take care of them automatically for users. That is, a single-GPU training script that uses Merlin without any reference to Horovod should be able to run on multiple GPUs, with everything (e.g., hvd.init(), learning-rate scaling, etc.) handled automatically for the user, as long as the user launches the script with, e.g., horovodrun -np 4 python merlin_training.py.

Horovod with Keras

  1. Run hvd.init().
    • Solution: Check if horovod is installed. If so, run hvd.init() anywhere within Models. If not, skip this and all the following steps.
  2. Pin each GPU to a single process.
    • The official Horovod docs recommend using tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU').
    • set_visible_devices does not seem to work with the naive approach of calling it first thing in Models (i.e., in __init__.py). Some investigation is needed to make this work so that users don't have to call it themselves.
    • A last resort is setting the environment variable os.environ['CUDA_VISIBLE_DEVICES'] = str(hvd.local_rank()). This is not ideal because it must be set before TensorFlow is imported; if users import TensorFlow before Merlin, it breaks, so we would still have to guide users to set the variable themselves.
  3. Scale the learning rate by the number of workers, e.g., scaled_lr = 0.001 * hvd.size().
    • Solution: We can do this automatically inside Models' fit().
  4. Wrap the optimizer in hvd.DistributedOptimizer.
    • Solution: If horovod is installed and hvd.size() > 1, we wrap the optimizer that the user passed to fit() in hvd.DistributedOptimizer.
    • We additionally implement our own merlin.models.DistributedOptimizer in case users want to specify some optional parameters in DistributedOptimizer. This will be a thin wrapper around horovod's distributed optimizer for now, but we can extend this to support distributed embedding, which is on the roadmap for 22.12 (?).
  5. Add hvd.callbacks.BroadcastGlobalVariablesCallback(0) to fit() callbacks.
    • Solution: If horovod is installed and hvd.size() > 1, append BroadcastGlobalVariablesCallback to callbacks.
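
For reference, the manual changes above look roughly like this in a plain Keras training script (a condensed sketch following the Horovod Keras docs; build_model and build_dataset are placeholders for user code):

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# 1. Initialize Horovod.
hvd.init()

# 2. Pin this process to a single GPU (one process per GPU).
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = build_model()      # placeholder: user-defined Keras model
dataset = build_dataset()  # placeholder: this worker's shard of the training data

# 3. Scale the learning rate by the number of workers.
scaled_lr = 0.001 * hvd.size()

# 4. Wrap the optimizer so gradients are averaged across workers.
optimizer = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(learning_rate=scaled_lr))
model.compile(optimizer=optimizer, loss="binary_crossentropy")

# 5. Broadcast initial variable states from rank 0 so all workers start in sync.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

model.fit(dataset, epochs=1, callbacks=callbacks)
```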

More details at https://horovod.readthedocs.io/en/stable/keras.html.
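
On the Models side, the conditional handling described in the Solution notes above could look roughly like the following. This is only a sketch: the module path and helper name (maybe_distribute) are hypothetical, and assume Horovod is treated as an optional dependency.

```python
# Hypothetical internal module, e.g. merlin/models/tf/distributed/backend.py -- names are illustrative.
try:
    import horovod.tensorflow.keras as hvd

    hvd.init()  # step 1: safe to call once at import time if Horovod is available
    hvd_installed = True
except ImportError:
    hvd = None
    hvd_installed = False


def maybe_distribute(optimizer, callbacks, learning_rate=None):
    """Apply steps 3-5 only when the script is actually running under Horovod."""
    if not hvd_installed or hvd.size() <= 1:
        return optimizer, callbacks
    if learning_rate is not None:
        optimizer.learning_rate = learning_rate * hvd.size()  # scale LR by worker count
    optimizer = hvd.DistributedOptimizer(optimizer)
    callbacks = list(callbacks) + [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    return optimizer, callbacks
```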

Challenges with DataLoader

  • DataLoader has parameters global_size and global_rank, which can be set using hvd.size() and hvd.rank() (see the sketch after this list).
  • DataLoader produces an unequal number of batches per partition when the dataset is partitioned, which is problematic for Horovod: one worker may finish all of its batches and then sit idle, hang, or time out while the other worker(s) are still processing theirs. There are possible workarounds on the dataloader side (seeding, as mentioned in [BUG] Data parallel training freezes due to different number of batches dataloader#75) or on the Horovod side (hvd.join()), but the best solution is for the dataloader to produce an equal number of batches per partition.
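
A minimal sketch of the partitioned dataloader setup, assuming the TensorFlow Loader from merlin-dataloader (the import path, file path, and batch size are assumptions; global_size and global_rank are the parameters mentioned above):

```python
import horovod.tensorflow.keras as hvd
from merlin.io import Dataset
from merlin.loader.tensorflow import Loader  # import path assumed; adjust to the dataloader in use

hvd.init()

train = Dataset("train/*.parquet")  # placeholder path

# Each worker reads its own 1/hvd.size() partition of the dataset.
loader = Loader(
    train,
    batch_size=1024,
    shuffle=True,
    global_size=hvd.size(),  # total number of workers partitioning the data
    global_rank=hvd.rank(),  # this worker's partition index
)
```

The seeding workaround from dataloader#75 would amount to a seed callable on the loader that returns the same shuffle seed on every worker (e.g., derived via hvd.allreduce), but the cleaner fix is still for the dataloader to emit an equal number of batches per partition.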

Proposed API

  • Users do not have to make any modifications to their existing training script to run it on multiple GPUs (see the sketch after this list).
  • To run the script on multiple GPUs, the script is executed with horovodrun -np <number of GPUs> python script.py. If the script is executed normally (i.e., python script.py), it runs on a single GPU.
  • We additionally implement our own merlin.models.DistributedOptimizer in case users want to specify some optional parameters in DistributedOptimizer.
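
A sketch of the intended user experience under this proposal; the model definition is illustrative (the block names and the 'click' target are placeholders), and all of the Horovod handling described above happens inside Models:

```python
# train.py -- no Horovod imports or calls anywhere in the user's code.
import merlin.models.tf as mm
from merlin.io import Dataset

train = Dataset("train/*.parquet")  # placeholder path
schema = train.schema

model = mm.Model(
    mm.InputBlock(schema),
    mm.MLPBlock([64, 32]),
    mm.BinaryClassificationTask("click"),
)
model.compile(optimizer="adam")

# When launched under horovodrun, Models would call hvd.init(), pin the GPU,
# scale the learning rate, wrap the optimizer, and add the broadcast callback.
model.fit(train, batch_size=1024)
```

Running python train.py trains on a single GPU; horovodrun -np 4 python train.py runs the same unmodified file on four GPUs.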
