
[QST] An odd, sudden OOM using NVT Dataloaders with TorchRec #842

Open · shoyasaxa opened this issue Feb 27, 2023 · 3 comments
Labels: P1 Priority 1, question (Further information is requested)

Comments

shoyasaxa commented Feb 27, 2023

❓ Questions & Help

Hello, I was just playing around with using NVT dataloaders with TorchRec, and it was working fine for the most part. However, when I tried batch inference on a large dataset, I ran into a peculiar bug: the script would run perfectly fine for about an hour with stable GPU memory usage (around 94% on the first GPU), then suddenly, at a random point, the memory on the first of the four V100s I used would start creeping up towards 100% and quickly OOM. Oddly, I am no longer able to reproduce this issue, but I was still wondering if anyone had any ideas on why this could happen.

One possible idea that @rnyak suggested was that perhaps the data partitions are not evenly split, and one of the files happens to have bigger partitions than the others; when that file gets loaded, the GPU memory usage shoots up.
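
In case it helps with debugging, here is a rough sketch of how one could check whether the partitions are evenly sized (the path and the part_size value below are placeholders, not my actual setup):

```python
# Rough sketch: check whether the parquet partitions are evenly sized.
# The path and part_size are placeholders for illustration.
import nvtabular as nvt

ds = nvt.Dataset("/path/to/preprocessed/*.parquet", engine="parquet", part_size="128MB")
ddf = ds.to_ddf()

# One row count per dask partition; a large spread here would explain a sudden
# memory spike when the biggest partition gets loaded onto one GPU.
rows_per_partition = ddf.map_partitions(len).compute()
print(rows_per_partition.min(), rows_per_partition.max())
```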

Also, I am using NVTabular to preprocess the data. One feature request I have is for NVTabular to output an optimal number of files when preprocessing (currently, if I use 4 GPUs to preprocess a humongous dataset without setting an out_files_per_proc parameter, it spits out 4 humongous files).
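
For reference, this is roughly how I set out_files_per_proc explicitly today to get more, smaller output files (the preprocessing ops, column names, and paths below are placeholders):

```python
# Sketch: write the preprocessed dataset with an explicit out_files_per_proc so
# the output is split into more (smaller) files. Ops, columns, and paths are
# placeholders for illustration.
import nvtabular as nvt
from nvtabular import ops

dataset = nvt.Dataset("/path/to/raw/*.parquet", engine="parquet")
features = ["user_id", "item_id"] >> ops.Categorify()  # placeholder preprocessing graph
workflow = nvt.Workflow(features)

workflow.fit(dataset)
workflow.transform(dataset).to_parquet(
    output_path="/path/to/preprocessed",
    out_files_per_proc=8,  # e.g. 4 workers x 8 files each -> 32 smaller files
)
```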

shoyasaxa added the question (Further information is requested) label on Feb 27, 2023
rnyak (Contributor) commented Feb 28, 2023

@shoyasaxa thanks for creating the ticket.

Just to clarify: I thought you were doing batch inference on multiple GPUs, not on a single GPU? Can you please confirm/clarify that?

My suggestion was particularly for the multi-GPU training case, meaning that if you train your model with multiple GPUs, we expect the number of partitions per parquet file to be divisible by the number of GPUs. That is, if you are using 4 GPUs at the same time for model training (or inference) via torch.nn.parallel() or torch.distributed, your parquet files should have 4, 8, 12, 16, ... partitions so that they can be evenly distributed over the GPUs.
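
As an illustration, here is a rough sketch of repartitioning a dataset so its partition count is a multiple of the GPU count before writing it back out (the paths and the target partition count are placeholders, not a prescription):

```python
# Sketch: repartition the preprocessed data so the number of dask partitions is a
# multiple of the number of GPUs used for training/inference. Paths and the
# target partition count are placeholders.
import nvtabular as nvt

NUM_GPUS = 4
ds = nvt.Dataset("/path/to/preprocessed/*.parquet", engine="parquet")

ddf = ds.to_ddf().repartition(npartitions=4 * NUM_GPUS)  # 16 partitions
nvt.Dataset(ddf).to_parquet("/path/to/repartitioned")
```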

rnyak added the P1 Priority 1 label on Feb 28, 2023
shoyasaxa (Author) commented

Yes - I am doing batch inference on multiple GPUs (one instance with 4 V100 GPUs).

And yes - I also do the preprocessing using 4 GPUs, so the number of files outputted is a multiple of 4 as well.

rnyak (Contributor) commented Mar 9, 2023

> (currently, if I use 4 GPUs to preprocess a humongous dataset without setting an out_files_per_proc parameter, it spits out 4 humongous files)

We have a WIP PR that will hopefully address your request.
