[RMP] Multi-GPU Data Parallel training for Tensorflow in Merlin Models #536

Closed · 2 tasks done
viswa-nvidia opened this issue Aug 10, 2022 · 10 comments

viswa-nvidia commented Aug 10, 2022

Problem:

Training on a single GPU takes significantly longer than training on multiple GPUs. Customers would like to accelerate their training workflows by distributing training across multiple GPUs on a single node.

Goal:

Enable customers to do data-parallel training within the Merlin Models training pipeline.

Constraints:

  • Single node
  • Embedding tables fit within the memory of a single gpu
  • Use NVIDIA best practices, i.e., Horovod

Starting Point:

Example
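
For context, a minimal sketch of the single-node Horovod data-parallel pattern targeted here could look like the following. The Keras model and synthetic data are placeholders standing in for a Merlin Models model and the Merlin dataloader, and the learning-rate scaling is a common heuristic rather than settled guidance:

```python
# Minimal sketch of single-node data-parallel training with Horovod + tf.keras.
# The model and data below are placeholders; in Merlin Models they would come
# from merlin.models.tf and the Merlin dataloader instead.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each worker process to a single GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Placeholder model; a ranking or retrieval model would go here.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Scale the learning rate with the number of workers and wrap the optimizer
# so gradients are averaged across workers each step.
opt = tf.keras.optimizers.Adam(learning_rate=1e-3 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt, loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])

callbacks = [
    # Make sure all workers start from the same initial weights.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Placeholder data; each worker should only see its own shard of the real dataset.
x = tf.random.uniform((1024, 16))
y = tf.cast(tf.random.uniform((1024, 1)) > 0.5, tf.float32)
model.fit(x, y, batch_size=128, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

A script like this would typically be launched on a single node with something like `horovodrun -np 2 python train.py`.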

viswa-nvidia changed the title from "[RMP] Merlin Models/systems enhancement - Multi-GPU training (DP)" to "[RMP] Merlin Models/systems enhancement - Multi-GPU (TF) training (DP)" on Aug 10, 2022
karlhigley changed the title to "[RMP] Multi-GPU Data Parallel training for Tensorflow" on Aug 15, 2022
EvenOldridge added this to the Merlin 22.11 milestone on Aug 31, 2022
EvenOldridge (Member) commented:

@marcromeyn, can you flesh this out a little further?

gabrielspmoreira (Member) commented Sep 26, 2022

A related ticket is #764 - [BUG] Models does not support tf.distribute.MirroredStrategy() for data parallel training. It is multi-GPU, but not Horovod-based.
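
For reference, the tf.distribute.MirroredStrategy path mentioned in #764 looks roughly like the generic Keras sketch below; this is an illustration of the TensorFlow API, not Merlin Models' actual integration:

```python
# Sketch of single-node data parallelism via tf.distribute.MirroredStrategy
# (the non-Horovod alternative referenced in #764). Placeholder Keras model.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # mirrors variables on all visible GPUs

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")

# model.fit(dataset) then runs data-parallel across the mirrored replicas,
# splitting each global batch among the GPUs.
```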

viswa-nvidia (Author) commented:

@viswa-nvidia to follow up with the DLFW team regarding native support.

@EvenOldridge EvenOldridge changed the title [RMP] Multi-GPU Data Parallel training for Tensorflow [RMP] Multi-GPU Data Parallel training for Tensorflow in Merlin Models Sep 28, 2022
bschifferer (Contributor) commented:

@viswa-nvidia @EvenOldridge

We need to add the following success criteria:

  • Analyze the scaling factor when using multiple GPUs: if we go from 1x GPU -> 2x GPUs -> 4x GPUs -> 8x GPUs, how much higher is the throughput?
  • Provide performance metrics (accuracy, AUC, etc.) to show that there is no negative effect on model performance
  • Provide guidance on how to set the global batch size, the batch size per GPU, and the learning rate when scaling (see the sketch below for one common convention)

If we only provide the technical functionality without testing the points above, we cannot guarantee that it works. I often get these questions from customers.
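
As a starting point for the third bullet, one common convention (an assumption here, not settled Merlin guidance) is to keep the per-GPU batch size fixed, let the global batch size grow with the number of workers, and scale the learning rate linearly:

```python
# Sketch of the linear-scaling convention: fixed per-GPU batch size,
# global batch size and learning rate scaled by the number of workers.
import horovod.tensorflow.keras as hvd

hvd.init()

per_gpu_batch_size = 8 * 1024                    # what fits in one GPU's memory
global_batch_size = per_gpu_batch_size * hvd.size()

base_learning_rate = 1e-3                        # tuned for single-GPU training
learning_rate = base_learning_rate * hvd.size()  # linear scaling rule

if hvd.rank() == 0:
    print(f"workers={hvd.size()} global_batch={global_batch_size} lr={learning_rate}")
```

Whether linear learning-rate scaling holds for these models is exactly what the benchmarking in the first two bullets would need to confirm.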

bschifferer (Contributor) commented:

I provided an example to show that Merlin Models works with Horovod: NVIDIA-Merlin/models#778

However, we still need to address the points above as well as the bug (NVIDIA-Merlin/dataloader#75).

In addition, we should make it more user-friendly.

viswa-nvidia (Author) commented:

Noted. @EvenOldridge, please review and add to the goals. I am not sure this ticket is fully defined.

edknv (Contributor) commented Oct 6, 2022

I've been experimenting with Horovod integration in the Models API, based on @bschifferer's example NVIDIA-Merlin/models#778. I fully agree with all the success criteria he listed above, and also with the need for the dataloader to produce an equal number of batches across partitions, as mentioned in NVIDIA-Merlin/dataloader#75.

Some additional notes on NVIDIA-Merlin/dataloader#75: the dataloader seems to produce an unequal number of batches when the dataset is partitioned, which is problematic for Horovod because one worker might finish all of its batches and then sit idle, hang, or time out while the other worker(s) are still processing theirs. There are possible workarounds, such as seeding on the dataloader side as mentioned in the issue, or using hvd.join() on the Horovod side, but the best solution is for the dataloader to produce an equal number of batches when partitioned (one possible interim workaround is sketched below).
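
One interim workaround (an assumption, not the agreed-upon fix for NVIDIA-Merlin/dataloader#75) is to cap every worker at the smallest per-worker batch count, so no rank ends up waiting on collective operations for batches that other ranks never produce:

```python
# Sketch: compute a common steps_per_epoch as the minimum batch count across workers.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Placeholder: the number of batches this worker's dataloader partition would yield.
local_num_batches = 1000

# Gather every worker's batch count and take the global minimum.
all_counts = hvd.allgather(tf.constant([local_num_batches], dtype=tf.int32))
steps_per_epoch = int(tf.reduce_min(all_counts).numpy())

# Every worker then trains for the same number of steps, e.g.:
# model.fit(loader, steps_per_epoch=steps_per_epoch, epochs=1)
```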

bschifferer (Contributor) commented:

I provided the following example based on the current code, since I am OOO next week: NVIDIA-Merlin/models#855

@edknv did a great job providing the Horovod functionality in Merlin Models.

I think we need to review the current multi-GPU flow; it is not yet a fully user-friendly, end-to-end integration:

  • I am not sure whether the issue with unequal batch counts in the dataloader is solved ([BUG] Data parallel training freezes due to different number of batches, dataloader#75). If the solution is to generate the data correctly up front, how does that work? How do we ensure it with NVTabular? What about users who do NOT use NVTabular?
  • The unit test is written so that each worker runs through the FULL dataset per epoch. That is incorrect: if we have 1M data points and 2 GPUs, each GPU should only run through 500k data points. I wrote the example so that NVTabular produces distinct files per worker; however, that alone does not guarantee the point above (see the file-sharding sketch after this comment).

We haven't looked at these points yet:

  • Analyze the scaling factor when using multiple GPUs: if we go from 1x GPU -> 2x GPUs -> 4x GPUs -> 8x GPUs, how much higher is the throughput?
  • Provide performance metrics (accuracy, AUC, etc.) to show that there is no negative effect on model performance
  • Provide guidance on how to set the global batch size, the batch size per GPU, and the learning rate when scaling
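
On the sharding point above, one simple illustration (hypothetical paths, not the final design) is to give each Horovod worker a distinct subset of the parquet files, so that one epoch covers the full dataset exactly once across all workers:

```python
# Sketch: round-robin shard the training files across Horovod workers, so with
# 1M rows and 2 GPUs each rank reads roughly 500k rows per epoch.
import glob
import horovod.tensorflow as hvd

hvd.init()

files = sorted(glob.glob("/path/to/train/*.parquet"))  # hypothetical path
worker_files = files[hvd.rank()::hvd.size()]            # this rank's shard

# worker_files would then be passed to merlin.io.Dataset / the Merlin dataloader
# for this rank only, instead of the full file list.
```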

viswa-nvidia (Author) commented:

@bschifferer, please create a separate RMP ticket for the multi-GPU enhancement.

jsohn-nvidia (Collaborator) commented Dec 16, 2022

@viswa-nvidia, I don't think the tasks under Starting Point and Example are the relevant items. Can you please follow up with @edknv and @bschifferer to capture the tasks that enabled this roadmap ticket? I think they are here: https://github.com/orgs/NVIDIA-Merlin/projects/6/views/34?filterQuery=RMP536-DATA+PARALLEL+MULTI-GPU
