The negative influence of directly converting inputs to tensors and transferring back to CPU #5659
Hello @Arith2,
In the meantime, if you use the latest release, 1.42, you can try to do the D2H copy inside DALI with the new "dynamic" executor:

self.to_tensor = fn.cast(self.gaussian_blur, dtype=types.FLOAT, device=device).cpu()
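A minimal, self-contained sketch of that approach (assumptions: the dynamic executor is enabled with an exec_dynamic flag on pipeline_def, and the input source and gaussian_blur parameters are placeholders; please double-check the flag name against the 1.42 release notes):

from nvidia.dali import pipeline_def, fn, types

@pipeline_def(batch_size=32, num_threads=4, device_id=0,
              exec_dynamic=True)  # assumed name of the flag that enables the new "dynamic" executor
def blur_pipeline():
    # Random placeholder images generated directly on the GPU, standing in for the real input.
    images = fn.random.uniform(range=[0.0, 255.0], shape=(96, 96, 3), device="gpu")
    blurred = fn.gaussian_blur(images, window_size=5)  # illustrative window size
    # The device-to-host copy now happens inside DALI instead of in the framework code.
    return fn.cast(blurred, dtype=types.FLOAT).cpu()

pipe = blur_pipeline()
pipe.build()
(out,) = pipe.run()  # `out` is already a CPU batch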
Hi @mzient, thanks very much for your quick response. I define load_binary_images() myself to read a binary file and then use fn.external_source() to load it. I am mainly interested in how the pipeline length affects the overall performance, and in the side effects of transferring data back to the CPU to simulate the case of sending data to other accelerators. This is the command I use to run the script, and this is the full file of my script, run_DALI_preprocess_batch.py:
My nvidia.dali version is currently 1.41.0 with cudatoolkit 11.0.
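(For readers without the full script: the general structure of such a setup, with hypothetical shapes and parameters rather than the actual run_DALI_preprocess_batch.py, could look roughly like this.)

import numpy as np
from nvidia.dali import pipeline_def, fn, types

BATCH_SIZE = 128

def load_binary_images(batch_size, height=96, width=96, channels=3):
    # Hypothetical stand-in for a loader that reads raw frames from a binary file;
    # here it simply fabricates a batch of the right shape.
    return [np.zeros((height, width, channels), dtype=np.uint8)
            for _ in range(batch_size)]

@pipeline_def(batch_size=BATCH_SIZE, num_threads=4, device_id=0)
def binary_pipeline():
    images = fn.external_source(source=lambda: load_binary_images(BATCH_SIZE),
                                device="cpu")
    images = images.gpu()                          # host-to-device copy
    blurred = fn.gaussian_blur(images, window_size=5)
    return fn.cast(blurred, dtype=types.FLOAT)     # output stays on the GPU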
Hello @Arith2! Thank you for bringing up this interesting issue. For now, I can only explain part of the results you are seeing. Let's list the factors that add up to the total time you are measuring:
Still, this doesn't explain why the performance degrades for bigger batch sizes. With the model I explained above, I'd expect the results to look like this:
Thank you for providing the reproduction code. I'll experiment with it a bit more and try to figure out what causes this sudden growth of runtime for batches larger than 128. I'll get back to you once I have some more answers. If you have any questions, or see something that doesn't fit the explanations above, please let me know.
@mzient @szkarpinski Hi Michal and Szymon, I have tested my baseline many times, and one interesting observation is that the execution time is the same for both online and offline training of ResNet50 (ImageNet, 13 GB in JPEG).
So these experiments imply that the cost of preprocessing with DALI is completely hidden inside the training (as long as the computation is not as light as AlexNet's). I am interested in a unified programming environment for both preprocessing and training. It looks like DALI adds no extra cost and online training reaches the same level of efficiency as offline training, yet DALI is not as widely used as a default data loader as I would expect. If you are also interested in this kind of question, feel free to contact me via email at yu.zhu@inf.ethz.ch.
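For reference, the online-training setup here follows the usual DALI-to-PyTorch pattern sketched below; the model and hyperparameters are placeholders rather than my exact benchmark, and train_pipeline is assumed to be a built DALI pipeline that outputs float CHW images and integer labels on the GPU:

import torch
import torchvision
from nvidia.dali.plugin.pytorch import DALIGenericIterator

# `train_pipeline` is an assumed, already-built DALI pipeline (not shown here).
# In a real run you would also pass reader_name= or size= so the iterator
# knows where an epoch ends.
loader = DALIGenericIterator(train_pipeline, ["images", "labels"])

model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

for data in loader:
    images = data[0]["images"]                     # already resident on the GPU
    labels = data[0]["labels"].squeeze(-1).long()
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    # While the GPU executes this training step, DALI's worker threads and
    # prefetch queue are already preparing the next batches, which is why the
    # preprocessing cost can be hidden behind the training time.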
Hello @Arith2! First, let me answer the question I left open in my previous response, where I couldn't explain why the performance degrades for bigger batch sizes. I still can't fully explain it, but it turns out it's not DALI, it's PyTorch ;) The following pure-PyTorch loop reproduces the effect:
import torch
import time

IMAGE_HEIGHT = 96
IMAGE_WIDTH = 96
NUM_CHANNELS = 3
DATASET_SIZE = 1024 * 16

for batch_size in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]:
    start_time = time.time()
    # One batch of mock data resident on the GPU.
    mock_gpu_data = torch.zeros((batch_size, NUM_CHANNELS, IMAGE_HEIGHT, IMAGE_WIDTH)).cuda()
    num_iterations = DATASET_SIZE // batch_size
    for _ in range(num_iterations):
        # The same device-to-host copy that the pipeline output would go through.
        mock_gpu_data.cpu()
    end_time = time.time()
    print("Batch size", batch_size, "took", end_time - start_time)

This yields:
It might be the case that PyTorch has some fast path for smaller copies. You can try to have a look at their source code or ask them on GitHub. Anyway, with your model running on a GPU, you shouldn't need this GPU->CPU copy.
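As a side note (an addition on timing hygiene, not part of the original analysis): CUDA work is queued asynchronously, so when benchmarking loops like the one above it can help to add explicit synchronization around the timed region, for example:

import time
import torch

def time_d2h(batch_size, iters, shape=(3, 96, 96)):
    data = torch.zeros((batch_size, *shape)).cuda()
    torch.cuda.synchronize()                # make sure setup work has finished
    start = time.time()
    for _ in range(iters):
        data.cpu()                          # blocking device-to-host copy
    torch.cuda.synchronize()                # make sure all queued work is done
    return time.time() - start

for batch_size in [32, 128, 256, 512]:
    print(batch_size, time_d2h(batch_size, (1024 * 16) // batch_size))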
I'm glad that your experiments confirmed that. As you say, for simple enough preprocessing pipelines, DALI's overhead should be minimal.
Also, offline preprocessing might not always be possible: in particular, when you perform many random augmentations, in most cases you want them to be different each epoch. With offline-preprocessed data, you can't achieve that unless you duplicate the data for each epoch. We believe that in such cases DALI is the optimal solution for data processing in terms of performance. Thank you for your interest in DALI! Please feel free to reach out to us again if you see other problems or things that need improvement!
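For illustration, a pipeline along these lines (the operators, parameters, and the file_root path are just an example, not taken from this issue) re-draws its random parameters for every sample, so each epoch sees different augmentations:

from nvidia.dali import pipeline_def, fn, types

@pipeline_def(batch_size=128, num_threads=4, device_id=0)
def augment_pipeline():
    jpegs, labels = fn.readers.file(file_root="/data/imagenet",  # placeholder path
                                    random_shuffle=True)
    images = fn.decoders.image(jpegs, device="mixed")
    # The crop window and the flip decision are sampled anew every time a
    # sample is produced, which a fixed, offline-preprocessed dataset cannot do.
    images = fn.random_resized_crop(images, size=(224, 224))
    images = fn.flip(images, horizontal=fn.random.coin_flip(probability=0.5))
    return fn.cast(images, dtype=types.FLOAT), labels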
Hi @Arith2,
Thank you for sharing your research interests. Feel free to reach us at Dali-Team@nvidia.com. When writing, please specify the aspects of collaboration you have in mind and how we can contribute.
Describe the question.
Hi, recently I have been running DALI for some preprocessing pipelines on the GPU, and I have found some problems that are very weird.
My pipeline is like this:
And this is how I access the data in this pipeline (preprocessed_images is not in use):
The problem is that:
Here is the plot:
In the command, I fix the other parameters, such as num_threads, prefetch_queue_depth, and py_num_workers.
I have tested on an RTX 3090 on a local machine and on an NVIDIA V100 on Google Cloud, and I observe a similar phenomenon on both.