
[RMP] Refine Multi-GPU Data Parallel training for Tensorflow in Merlin Models #752

Open
bschifferer opened this issue Nov 30, 2022 · 1 comment

bschifferer commented Nov 30, 2022

Problem:

In #536, we provided Horovod functionality for Merlin Models and added features that automate the process on the Merlin Models side. However, the current feature is not fully user friendly, and there are still open questions about how a user should run multi-GPU data parallel training.

Goal:

  • Improve the user experience of multi-GPU data parallel training
  • Test multi-GPU data parallel training: does AUC stay on par with single-GPU training? How well does throughput scale up?

Constraints:

  • I am not sure whether the issue of workers seeing different numbers of batches from the data loader is solved: [BUG] Data parallel training freezes due to different number of batches (dataloader#75):
    -- If the solution is to generate the data correctly (i.e. evenly partitioned across workers), how does that work?
    -- How do we ensure it with NVTabular?
    -- How about users who do NOT use NVTabular?
  • The unit test is written so that each worker runs through the FULL dataset per epoch. That is incorrect: if we have 1M data points and 2 GPUs, each GPU should only run through 500k data points. I wrote the example so that NVTabular produces distinct files per worker. Is that the proposed workflow for a user? (See the sketch right after this list.)
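
A minimal sketch of the per-worker sharding described above, assuming the training data is a directory of roughly equally sized Parquet files (the path `/data/train/*.parquet` and all variable names are placeholders, not an existing API); the resulting file list would then be fed to whatever dataloader the user builds (Merlin dataloader, tf.data, ...):

```python
# Launched with e.g.: horovodrun -np 2 python train.py
import glob

import horovod.tensorflow.keras as hvd
import tensorflow as tf

hvd.init()

# Pin each worker to a single GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Round-robin shard of the file list: with 8 files and 2 workers, rank 0 reads
# files 0, 2, 4, 6 and rank 1 reads files 1, 3, 5, 7, so one epoch covers the
# full dataset exactly once across all workers instead of N times.
all_files = sorted(glob.glob("/data/train/*.parquet"))
worker_files = all_files[hvd.rank() :: hvd.size()]
```

If the shards end up with different numbers of batches (e.g. unequally sized files), workers can hang at the end of an epoch waiting for each other, which is exactly the situation described in dataloader#75.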

Starting Point:

  • Analyze the scaling factor when using multiple GPUs: if we go from 1x GPU -> 2x GPUs -> 4x GPUs -> 8x GPUs, how much higher is the throughput?
  • Provide performance metrics (accuracy / AUC / etc.) to show that there is no negative effect on model performance
  • Provide guidance on how to set the global batch size, the batch size per GPU, and the learning rate when scaling (see the sketch below)
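
A minimal sketch of one common convention for this (global batch size = per-GPU batch size x number of workers, linear learning-rate scaling with warmup), assuming `BATCH_SIZE_PER_GPU` and `BASE_LR` are values tuned on a single GPU; the tiny Keras model and dataset below are placeholders standing in for a Merlin Models model and this worker's dataloader shard:

```python
import horovod.tensorflow.keras as hvd
import tensorflow as tf

hvd.init()

BATCH_SIZE_PER_GPU = 4096                            # tuned to fit on one GPU
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_GPU * hvd.size()  # effective batch per step
BASE_LR = 1e-3                                       # tuned on a single GPU

# Placeholders: in practice this is a Merlin Models model and this worker's
# shard of the training data.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
train_loader = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform((GLOBAL_BATCH_SIZE, 16)),
     tf.random.uniform((GLOBAL_BATCH_SIZE, 1)))
).batch(BATCH_SIZE_PER_GPU)

# Linear LR scaling: the effective (global) batch grows with hvd.size(),
# so the learning rate is scaled by the same factor as a starting point.
opt = tf.keras.optimizers.Adam(learning_rate=BASE_LR * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt, loss="binary_crossentropy")

callbacks = [
    # Start all workers from identical weights.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    # Warm the learning rate up to the scaled value to stabilize early epochs.
    hvd.callbacks.LearningRateWarmupCallback(initial_lr=BASE_LR * hvd.size(),
                                             warmup_epochs=3, verbose=1),
]

model.fit(train_loader, epochs=5, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

Whether linear scaling (rather than e.g. square-root scaling) is the right default for recommender models is one of the things this ticket should validate experimentally.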
bschifferer (Contributor, Author) commented:
@EvenOldridge @viswa-nvidia - as we discussed, I created a follow-up roadmap ticket for the multi-GPU data parallel training feature
