
[RMP] Refine Multi-GPU Data Parallel training for Tensorflow in Merlin Models #752

Open
bschifferer opened this issue Nov 30, 2022 · 1 comment

bschifferer commented Nov 30, 2022

Problem:

In #536, we provided Horovod functionality for Merlin Models and added features that automate the process on the Merlin Models side. However, the current feature is not fully user friendly, and there are still open questions about how a user should run multi-GPU data parallel training.

Goal:

  • Improve the user experience of multi-GPU data parallel training
  • Test multi-GPU data parallel training: does AUC stay on par with single-GPU training? How well does throughput scale up?

Constraints:

  • I am not sure whether the issue of workers seeing different numbers of batches from the data loader is solved: [BUG] Data parallel training freezes due to different number of batches (dataloader#75):
    -- If the solution is to generate the data correctly (i.e. evenly partitioned across workers), how does that work?
    -- How do we ensure it with NVTabular?
    -- How about users who do NOT use NVTabular?
  • The unit test is written so that each worker runs through the FULL dataset per epoch. That is incorrect: if we have 1M data points and 2 GPUs, each GPU should only run through 500k data points. I wrote the example so that NVTabular produces distinct files per worker. Is that the proposed workflow for a user? (See the sketch right after this list.)
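
A minimal sketch of the per-worker sharding described above, assuming the training data is a directory of roughly equally sized Parquet files (the path `/data/train/*.parquet` and all variable names are placeholders, not an existing API); the resulting file list would then be fed to whatever dataloader the user builds (Merlin dataloader, tf.data, ...):

```python
# Launched with e.g.: horovodrun -np 2 python train.py
import glob

import horovod.tensorflow.keras as hvd
import tensorflow as tf

hvd.init()

# Pin each worker to a single GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Round-robin shard of the file list: with 8 files and 2 workers, rank 0 reads
# files 0, 2, 4, 6 and rank 1 reads files 1, 3, 5, 7, so one epoch covers the
# full dataset exactly once across all workers instead of N times.
all_files = sorted(glob.glob("/data/train/*.parquet"))
worker_files = all_files[hvd.rank() :: hvd.size()]
```

If the shards end up with different numbers of batches (e.g. unequally sized files), workers can hang at the end of an epoch waiting for each other, which is exactly the situation described in dataloader#75.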

Starting Point:

  • Analyze the scaling factor when using multiple GPUs: if we go from 1x GPU -> 2x GPUs -> 4x GPUs -> 8x GPUs, how much higher is the throughput?
  • Provide performance metrics (accuracy / AUC / etc.) to show that there is no negative effect on model performance
  • Provide guidance on how to set the global batch size, the batch size per GPU, and the learning rate when scaling (see the sketch below)
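
A minimal sketch of one common convention for this (global batch size = per-GPU batch size x number of workers, linear learning-rate scaling with warmup), assuming `BATCH_SIZE_PER_GPU` and `BASE_LR` are values tuned on a single GPU; the tiny Keras model and dataset below are placeholders standing in for a Merlin Models model and this worker's dataloader shard:

```python
import horovod.tensorflow.keras as hvd
import tensorflow as tf

hvd.init()

BATCH_SIZE_PER_GPU = 4096                            # tuned to fit on one GPU
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_GPU * hvd.size()  # effective batch per step
BASE_LR = 1e-3                                       # tuned on a single GPU

# Placeholders: in practice this is a Merlin Models model and this worker's
# shard of the training data.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
train_loader = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform((GLOBAL_BATCH_SIZE, 16)),
     tf.random.uniform((GLOBAL_BATCH_SIZE, 1)))
).batch(BATCH_SIZE_PER_GPU)

# Linear LR scaling: the effective (global) batch grows with hvd.size(),
# so the learning rate is scaled by the same factor as a starting point.
opt = tf.keras.optimizers.Adam(learning_rate=BASE_LR * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt, loss="binary_crossentropy")

callbacks = [
    # Start all workers from identical weights.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    # Warm the learning rate up to the scaled value to stabilize early epochs.
    hvd.callbacks.LearningRateWarmupCallback(initial_lr=BASE_LR * hvd.size(),
                                             warmup_epochs=3, verbose=1),
]

model.fit(train_loader, epochs=5, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

Whether linear scaling (rather than e.g. square-root scaling) is the right default for recommender models is one of the things this ticket should validate experimentally.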
bschifferer (Contributor, Author) commented:
@EvenOldridge @viswa-nvidia - as we discussed, I created a follow-up roadmap ticket for the multi-GPU data parallel training feature
