[RMP] Multi-GPU Data Parallel training for Tensorflow in Merlin Models #536
Comments
@marcromeyn can you flesh this out a little further?
This is a related ticket: #764 - [BUG] Models does not support tf.distribute.MirroredStrategy() for data parallel training.
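For context, #764 refers to the pattern sketched below: single-node data parallelism with `tf.distribute.MirroredStrategy`. This is a minimal sketch using a toy Keras model and synthetic data as placeholders, not the Merlin Models API:

```python
import tensorflow as tf

# Minimal sketch of single-node data parallelism with MirroredStrategy.
# The toy model and synthetic data are placeholders, not Merlin Models code.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # The model (and therefore its variables and optimizer) must be created
    # inside the strategy scope so they are mirrored across all GPUs.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Keras splits each global batch across the replicas automatically.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 16]), tf.random.normal([1024, 1]))
).batch(64)
model.fit(dataset, epochs=1)
```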
@viswa-nvidia to follow up with the DLFW team regarding native support.
We need to add the following success criteria:
If we provide only the technical functionality without testing the points above, we cannot guarantee that it works. I get these questions from customers often.
I provided an example to show that Merlin Models works with Horovod: NVIDIA-Merlin/models#778. However, we need to address the points above plus the bug (NVIDIA-Merlin/dataloader#75). In addition, we should make it more user-friendly.
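For readers who have not used Horovod with Keras, the model/optimizer side of that approach generally looks like the sketch below. This is a simplified stand-in with a toy model and synthetic data, not the actual code from NVIDIA-Merlin/models#778:

```python
import horovod.tensorflow.keras as hvd
import tensorflow as tf

hvd.init()

# Pin each worker process to a single GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # toy placeholder model

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged across workers with allreduce.
# Note: with TF >= 2.11 the tf.keras.optimizers.legacy optimizers may be
# required by Horovod.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))
model.compile(optimizer=opt, loss="mse")

# Each worker would train on its own data partition; synthetic data here.
x = tf.random.normal([256, 8])
y = tf.random.normal([256, 1])
model.fit(
    x, y,
    batch_size=32,
    epochs=1,
    # Broadcast the initial weights from rank 0 so all workers start identical.
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    verbose=1 if hvd.rank() == 0 else 0,
)
```

Launched with something like `horovodrun -np 2 python train.py`, each process runs this script against its own shard of the data.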
Noted. @EvenOldridge, please review and add to the goals. I am not sure if this ticket is fully defined.
I've been doing some experimentation with Horovod integration in the Models API, based on @bschifferer's example NVIDIA-Merlin/models#778, and I fully agree with all the success criteria he listed above, as well as with the need for the dataloader to produce an equal number of batches across partitions, as mentioned in NVIDIA-Merlin/dataloader#75. Some additional notes on NVIDIA-Merlin/dataloader#75: the dataloader seems to produce an unequal number of batches when the dataset is partitioned, which is problematic for Horovod because one worker might finish processing all of its batches and then sit idle, hang, or time out while the other worker(s) are still processing theirs. There might be workarounds, such as seeding from the dataloader side as mentioned in the issue, or from the Horovod side.
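To make the failure mode concrete, one possible mitigation is for the workers to agree up front on a common number of batches per epoch, so no worker keeps waiting on an allreduce after the others have run out of data. This is a sketch of that idea only, not the fix tracked in NVIDIA-Merlin/dataloader#75:

```python
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

# Placeholder: in practice this would be the number of batches this worker's
# dataloader partition yields, e.g. len(loader).
local_num_batches = 100 + hvd.rank()

# Gather every worker's batch count and take the minimum, so all workers
# execute the same number of steps per epoch.
all_counts = hvd.allgather(tf.constant([local_num_batches], dtype=tf.int32))
common_num_batches = int(tf.reduce_min(all_counts).numpy())

# Each worker then stops after common_num_batches steps, for example:
# for step, (features, target) in enumerate(loader):
#     if step >= common_num_batches:
#         break
#     train_step(features, target)
```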
I provided the following example based on the current code, as I am OOO next week: NVIDIA-Merlin/models#855. @edknv did a great job providing the Horovod functionality in Merlin Models. I think we need to review the current multi-GPU flow; it is not yet a fully user-friendly, end-to-end integration:
We haven't looked at these points yet:
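To ground the discussion about the current flow, the data side that users currently wire up by hand looks roughly like the sketch below. This assumes the merlin-dataloader `Loader` arguments (`global_size`, `global_rank`, `seed_fn`) used in the multi-GPU examples; treat those argument names, the parquet path, and the batch size as assumptions/placeholders:

```python
import horovod.tensorflow as hvd
import numpy as np
import tensorflow as tf

from merlin.io import Dataset
from merlin.loader.tensorflow import Loader

hvd.init()

def seed_fn():
    # Workers contribute random fragments that are summed with allreduce,
    # so every worker shuffles its partitions with the same seed each epoch.
    max_rand = np.iinfo(np.int32).max // hvd.size()
    fragment = np.random.randint(0, max_rand)
    reduced = hvd.allreduce(tf.constant(fragment, dtype=tf.int64), op=hvd.Sum)
    return int(reduced.numpy()) % max_rand

train = Dataset("/path/to/train/*.parquet")  # placeholder path

loader = Loader(
    train,
    batch_size=65536,
    shuffle=True,
    seed_fn=seed_fn,
    global_size=hvd.size(),  # total number of workers
    global_rank=hvd.rank(),  # this worker's slice of the dataset
    drop_last=True,
)

# model.fit(loader, ...) then follows the usual Horovod/Keras pattern, and the
# whole script still has to be launched manually, e.g.:
#   horovodrun -np <num_gpus> python train.py
```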
@bschifferer, please create a separate RMP ticket for the multi-GPU enhancement.
@viswa-nvidia, I don't think the tasks under Starting Point and Examples are the relevant items. Can you please follow up with @edknv and @bschifferer to capture the tasks that enable this roadmap ticket?
Problem:
Single-GPU training takes significantly longer than multi-GPU training. Customers would like to accelerate their training workflows by distributing training across multiple GPUs on a single node.
Goal:
Enable customers to do data-parallel training within the Merlin Models training pipeline.
Constraints:
Starting Point:
Example