[RMP] Multi-GPU Data Parallel training for Tensorflow in Merlin Models #536

Closed · 2 tasks done
viswa-nvidia opened this issue Aug 10, 2022 · 10 comments

viswa-nvidia commented Aug 10, 2022

Problem:

Training on a single GPU takes significantly longer than training on multiple GPUs. Customers would like to accelerate their training workflows by distributing training across multiple GPUs on a single node.

Goal:

Enable customers to do data-parallel training within the Merlin Models training pipeline.

Constraints:

  • Single node
  • Embedding tables fit within the memory of a single gpu
  • Use NVIDIA best practices, i.e., Horovod

Starting Point:

Example
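
For context, a minimal sketch of the single-node Horovod data-parallel pattern targeted here could look like the following. The Keras model and synthetic data are placeholders standing in for a Merlin Models model and the Merlin dataloader, and the learning-rate scaling is a common heuristic rather than settled guidance:

```python
# Minimal sketch of single-node data-parallel training with Horovod + tf.keras.
# The model and data below are placeholders; in Merlin Models they would come
# from merlin.models.tf and the Merlin dataloader instead.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each worker process to a single GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Placeholder model; a ranking or retrieval model would go here.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Scale the learning rate with the number of workers and wrap the optimizer
# so gradients are averaged across workers each step.
opt = tf.keras.optimizers.Adam(learning_rate=1e-3 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt, loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])

callbacks = [
    # Make sure all workers start from the same initial weights.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Placeholder data; each worker should only see its own shard of the real dataset.
x = tf.random.uniform((1024, 16))
y = tf.cast(tf.random.uniform((1024, 1)) > 0.5, tf.float32)
model.fit(x, y, batch_size=128, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

A script like this would typically be launched on a single node with something like `horovodrun -np 2 python train.py`.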

viswa-nvidia changed the title from "[RMP] Merlin Models/systems enhancement - Multi-GPU training (DP)" to "[RMP] Merlin Models/systems enhancement - Multi-GPU (TF) training (DP)" on Aug 10, 2022
karlhigley changed the title to "[RMP] Multi-GPU Data Parallel training for Tensorflow" on Aug 15, 2022
EvenOldridge added this to the Merlin 22.11 milestone on Aug 31, 2022
EvenOldridge (Member) commented:

@marcromeyn, can you flesh this out a little further?

gabrielspmoreira (Member) commented Sep 26, 2022

A related ticket is #764 - [BUG] Models does not support tf.distribute.MirroredStrategy() for data parallel training. It is multi-GPU, but not Horovod-based.
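
For reference, the tf.distribute.MirroredStrategy path mentioned in #764 looks roughly like the generic Keras sketch below; this is an illustration of the TensorFlow API, not Merlin Models' actual integration:

```python
# Sketch of single-node data parallelism via tf.distribute.MirroredStrategy
# (the non-Horovod alternative referenced in #764). Placeholder Keras model.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # mirrors variables on all visible GPUs

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")

# model.fit(dataset) then runs data-parallel across the mirrored replicas,
# splitting each global batch among the GPUs.
```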

viswa-nvidia (Author) commented:

@viswa-nvidia to follow up with the DLFW team regarding native support.

@EvenOldridge EvenOldridge changed the title [RMP] Multi-GPU Data Parallel training for Tensorflow [RMP] Multi-GPU Data Parallel training for Tensorflow in Merlin Models Sep 28, 2022
bschifferer (Contributor) commented:

@viswa-nvidia @EvenOldridge

We need to add the following success criteria:

  • Analyze the scaling factor when using multiple GPUs: if we go from 1x GPU -> 2x GPUs -> 4x GPUs -> 8x GPUs, how much higher is the throughput?
  • Provide performance metrics (accuracy, AUC, etc.) to show that there is no negative effect on model performance
  • Provide guidance on how to set the global batch size, the batch size per GPU, and the learning rate when scaling (see the sketch below for one common convention)

If we only provide the technical functionality without testing the points above, we cannot guarantee that it works. I often get these questions from customers.
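
As a starting point for the third bullet, one common convention (an assumption here, not settled Merlin guidance) is to keep the per-GPU batch size fixed, let the global batch size grow with the number of workers, and scale the learning rate linearly:

```python
# Sketch of the linear-scaling convention: fixed per-GPU batch size,
# global batch size and learning rate scaled by the number of workers.
import horovod.tensorflow.keras as hvd

hvd.init()

per_gpu_batch_size = 8 * 1024                    # what fits in one GPU's memory
global_batch_size = per_gpu_batch_size * hvd.size()

base_learning_rate = 1e-3                        # tuned for single-GPU training
learning_rate = base_learning_rate * hvd.size()  # linear scaling rule

if hvd.rank() == 0:
    print(f"workers={hvd.size()} global_batch={global_batch_size} lr={learning_rate}")
```

Whether linear learning-rate scaling holds for these models is exactly what the benchmarking in the first two bullets would need to confirm.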

bschifferer (Contributor) commented:

I provided an example to show that Merlin Models works with Horovod: NVIDIA-Merlin/models#778

However, we still need to address the points above as well as the bug (NVIDIA-Merlin/dataloader#75).

In addition, we should make it more user-friendly.

viswa-nvidia (Author) commented:

Noted. @EvenOldridge, please review and add to the goals. I am not sure this ticket is fully defined.

edknv (Contributor) commented Oct 6, 2022

I've been experimenting with Horovod integration in the Models API, based on @bschifferer's example NVIDIA-Merlin/models#778. I fully agree with all the success criteria he listed above, and also with the need for the dataloader to produce an equal number of batches across partitions, as mentioned in NVIDIA-Merlin/dataloader#75.

Some additional notes on NVIDIA-Merlin/dataloader#75: the dataloader seems to produce an unequal number of batches when the dataset is partitioned, which is problematic for Horovod because one worker might finish all of its batches and then sit idle, hang, or time out while the other worker(s) are still processing theirs. There are possible workarounds, such as seeding on the dataloader side as mentioned in the issue, or using hvd.join() on the Horovod side, but the best solution is for the dataloader to produce an equal number of batches when partitioned (one possible interim workaround is sketched below).
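
One interim workaround (an assumption, not the agreed-upon fix for NVIDIA-Merlin/dataloader#75) is to cap every worker at the smallest per-worker batch count, so no rank ends up waiting on collective operations for batches that other ranks never produce:

```python
# Sketch: compute a common steps_per_epoch as the minimum batch count across workers.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Placeholder: the number of batches this worker's dataloader partition would yield.
local_num_batches = 1000

# Gather every worker's batch count and take the global minimum.
all_counts = hvd.allgather(tf.constant([local_num_batches], dtype=tf.int32))
steps_per_epoch = int(tf.reduce_min(all_counts).numpy())

# Every worker then trains for the same number of steps, e.g.:
# model.fit(loader, steps_per_epoch=steps_per_epoch, epochs=1)
```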

bschifferer (Contributor) commented:

I provided the following example based on the current code, since I am OOO next week: NVIDIA-Merlin/models#855

@edknv did a great job providing the Horovod functionality in Merlin Models.

I think we need to review the current multi-GPU flow; it is not yet a fully user-friendly, end-to-end integration:

  • I am not sure whether the issue with unequal batch counts in the dataloader is solved ([BUG] Data parallel training freezes due to different number of batches, dataloader#75). If the solution is to generate the data correctly up front, how does that work? How do we ensure it with NVTabular? What about users who do NOT use NVTabular?
  • The unit test is written so that each worker runs through the FULL dataset per epoch. That is incorrect: if we have 1M data points and 2 GPUs, each GPU should only run through 500k data points. I wrote the example so that NVTabular produces distinct files per worker; however, that alone does not guarantee the point above (see the file-sharding sketch after this comment).

We haven't looked at these points yet:

  • Analyze the scaling factor when using multiple GPUs: if we go from 1x GPU -> 2x GPUs -> 4x GPUs -> 8x GPUs, how much higher is the throughput?
  • Provide performance metrics (accuracy, AUC, etc.) to show that there is no negative effect on model performance
  • Provide guidance on how to set the global batch size, the batch size per GPU, and the learning rate when scaling
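
On the sharding point above, one simple illustration (hypothetical paths, not the final design) is to give each Horovod worker a distinct subset of the parquet files, so that one epoch covers the full dataset exactly once across all workers:

```python
# Sketch: round-robin shard the training files across Horovod workers, so with
# 1M rows and 2 GPUs each rank reads roughly 500k rows per epoch.
import glob
import horovod.tensorflow as hvd

hvd.init()

files = sorted(glob.glob("/path/to/train/*.parquet"))  # hypothetical path
worker_files = files[hvd.rank()::hvd.size()]            # this rank's shard

# worker_files would then be passed to merlin.io.Dataset / the Merlin dataloader
# for this rank only, instead of the full file list.
```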

viswa-nvidia (Author) commented:

@bschifferer, please create a separate RMP ticket for the multi-GPU enhancement.

jsohn-nvidia (Collaborator) commented Dec 16, 2022

@viswa-nvidia, I don't think the tasks under Starting Point and Example are the relevant items. Can you please follow up with @edknv and @bschifferer to capture the tasks that enabled this roadmap ticket? I think they are here: https://github.com/orgs/NVIDIA-Merlin/projects/6/views/34?filterQuery=RMP536-DATA+PARALLEL+MULTI-GPU
