❓ Questions & Help
Hello, I was just playing around with using NVT dataloaders with TorchRec, and it was working fine for the most part. However, when I tried batch inference on a large dataset, I ran into a peculiar bug: the script would run perfectly fine for about an hour with stable GPU memory usage (around 94% on the first GPU), then at a seemingly random point the memory on that first GPU (out of the four V100s I used) would start to creep up toward 100% and quickly OOM. Oddly, I am no longer able to reproduce the issue, but I was still wondering if anyone had ideas about why this could happen.
One possible explanation @rnyak suggested is that the data partitions are not evenly split, and one of the files happens to have larger partitions than the others, so GPU memory usage shoots up when that file is loaded. (A quick way to check this is sketched below.)
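For anyone who wants to test this hypothesis, here is a minimal sketch (the path is a placeholder, and it assumes pyarrow is available) that prints the row-group sizes of each preprocessed parquet file:

```python
# Minimal sketch: print the row-group (partition) sizes of each output file
# to spot a file whose partitions are much larger than the rest.
# OUTPUT_DIR is a placeholder; point it at the NVTabular output directory.
import glob

import pyarrow.parquet as pq

OUTPUT_DIR = "/path/to/nvt_output"

for path in sorted(glob.glob(f"{OUTPUT_DIR}/*.parquet")):
    meta = pq.ParquetFile(path).metadata
    rows_per_group = [meta.row_group(i).num_rows for i in range(meta.num_row_groups)]
    print(path, "row groups:", meta.num_row_groups, "largest:", max(rows_per_group), "rows")
```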
I am also using NVTabular to preprocess the data. One feature request: it would be great if NVTabular could pick an optimal number of output files during preprocessing. Currently, if I use 4 GPUs to preprocess a huge dataset without setting the out_files_per_proc parameter, it writes out 4 huge files (example below).
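For reference, this is roughly how the output gets written today; the paths and column names below are placeholders, not my real dataset, and the out_files_per_proc value is the hand-tuned knob I would like NVTabular to choose automatically:

```python
# Illustrative NVTabular preprocessing output step; paths and column names
# are placeholders, not the real dataset.
import nvtabular as nvt

# Minimal workflow, just to show the write step.
cat_features = ["user_id", "item_id"] >> nvt.ops.Categorify()
workflow = nvt.Workflow(cat_features)

dataset = nvt.Dataset("/path/to/raw_parquet/", engine="parquet", part_size="256MB")
workflow.fit(dataset)

# Without out_files_per_proc, a 4-GPU run writes 4 huge files;
# setting it by hand splits each worker's output into smaller files.
workflow.transform(dataset).to_parquet(
    output_path="/path/to/nvt_output",
    out_files_per_proc=8,  # hand-tuned today; ideally NVT would pick this
    shuffle=nvt.io.Shuffle.PER_PARTITION,
)
```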
Just to clarify: I thought you were doing batch inference on multiple GPUs, not on a single GPU? Can you please confirm?
My suggestion was specifically for the multi-GPU case: if you train (or run inference) with multiple GPUs, we expect the number of partitions per parquet file to be divisible by the number of GPUs. That means if you are using 4 GPUs at the same time via torch.nn.parallel() or torch.distributed, your parquet files should have 4, 8, 12, 16, ... partitions so they can be distributed evenly over the GPUs.
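As a rough way to sanity-check this before launching a multi-GPU run, something like the sketch below (the path and part_size are placeholders, not your exact setup) would report whether the partition count divides evenly over the GPUs:

```python
# Rough sanity check that the partition count divides evenly over the GPUs.
# The path and part_size are assumptions, not the exact setup from this issue.
import nvtabular as nvt

NUM_GPUS = 4

dataset = nvt.Dataset("/path/to/nvt_output/", engine="parquet", part_size="128MB")
npartitions = dataset.to_ddf().npartitions

if npartitions % NUM_GPUS != 0:
    print(
        f"{npartitions} partitions do not split evenly over {NUM_GPUS} GPUs; "
        f"consider rewriting with out_files_per_proc set to a multiple of {NUM_GPUS} "
        "or tuning part_size."
    )
else:
    print(f"{npartitions} partitions split evenly over {NUM_GPUS} GPUs.")
```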