Image Augmentations on GPU Tests #483
I think that if we want to use GPU preprocessing in the data loader we would be restricting our users to Python 2, which might be a bit too much. Also, I'm not 100% convinced that in the setup you showed it would be better to perform the operations on the GPU. The reason is that if we perform all the data augmentation on the CPU, the GPU is free to run the network (asynchronously!) while the data loader workers load and preprocess data in the background. Did you get a chance to check whether performing the operations on the GPU was actually useful in a full training pipeline?
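For illustration, a minimal sketch of the pattern described above, where the DataLoader workers do the CPU-side augmentation in the background while the GPU runs the network asynchronously; the dataset, transforms, and model here are placeholders rather than anything from this thread:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# CPU-side augmentation runs inside the DataLoader worker processes,
# so it overlaps with the (asynchronous) forward/backward pass on the GPU.
cpu_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

dataset = datasets.FakeData(size=512, transform=cpu_transform)  # placeholder dataset
loader = DataLoader(dataset, batch_size=32, num_workers=4, pin_memory=True)

device = torch.device("cuda")
model = torch.nn.Conv2d(3, 8, 3).to(device)  # stand-in for a real network

for images, _ in loader:
    images = images.to(device, non_blocking=True)  # copy from pinned memory overlaps with compute
    output = model(images)  # CUDA kernels are queued asynchronously
```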
I have the same impression. In my experience, I have a multithreaded pipeline to train my model, and the training thread (or process) is always waiting for the preprocessing (which includes augmentations) to finish. This is especially true for models that have relatively few computations per image, as in video deep learning. I'm looking into augmentations on the GPU; I just found out that the OpenCV Python bindings don't allow that.
@hyperfraise what kinds of augmentations are you looking for?
Well, of course I use a bit more than that, but even that would be very appreciable. I wonder, though, whether it would be that much more efficient for, say, 720p images, or even 224x224x3 like I use (since there is the transfer to take into account, and maybe Python would be slower even on the GPU than OpenCV's optimized C code). Anyway, here is what I'm using: I'm actually working with video, so doing all of this on the CPU is pretty costly.
BTW, have you looked at https://github.com/NVIDIA/nvvl ? Note that you can combine flipping / cropping / padding / resizing into a single kernel launch.
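A minimal sketch of one way to fuse those operations into a single resampling pass in PyTorch; since the exact API is not named above, picking torch.nn.functional.affine_grid together with grid_sample here is my assumption:

```python
import torch
import torch.nn.functional as F

imgs = torch.rand(8, 3, 256, 256, device="cuda")  # a batch already on the GPU

# One affine matrix per image: here a horizontal flip combined with a 2x zoom
# (i.e. a center crop of the middle half), in normalized [-1, 1] coordinates.
theta = torch.tensor([[-0.5, 0.0, 0.0],
                      [ 0.0, 0.5, 0.0]], device="cuda")
theta = theta.unsqueeze(0).repeat(imgs.size(0), 1, 1)

# affine_grid + grid_sample resample the whole batch in one pass, which also
# handles the resize to the requested output resolution (224x224 here).
grid = F.affine_grid(theta, size=(imgs.size(0), 3, 224, 224), align_corners=False)
out = F.grid_sample(imgs, grid, padding_mode="zeros", align_corners=False)
```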
I'm very curious about how Python 2 could help with preprocessing on the GPU. Could you elaborate on that, please?
Here I am trying to make my computer tell cats from dogs using a simple CNN and image augmentation with PIL transforms: if I use image augmentation, CPU usage is almost 100% while the GPU is bored. It also makes almost no difference whether I train the net on the GPU or the CPU, because the augmentation takes way longer anyway.
I was told that OpenCV 4 makes it possible to do augmentations on the GPU in Python. I haven't tested it.
@stmax82 Maybe you could look into NVIDIA's DALI framework.
Hello PyTorch vision people!
I am currently working on a project that needs lots of image augmentation to perform well, and I believe this is not only my case. Reading about topics such as domain randomization, we see that large variations in the images lead to much better generalization.
I saw that PyTorch does not seem to provide a way to perform image augmentation on the GPU, as commented in #45. In some posts people discourage doing it (https://discuss.pytorch.org/t/preprocess-images-on-gpu/5096), but I really disagree, especially for cases where several augmentations are applied.
To show this point I provide a gist with an example illustrating the possible speed-up gains for a simple multiplication operation (a brightness augmentation, say):
https://gist.github.com/felipecode/f3531e2d04e846da99053aff16b06028
In the gist, I show a GPU augmentation interface working as follows:
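(A minimal sketch of what such an interface could look like; the class and method names below are illustrative placeholders, not the actual code from the gist.)

```python
import torch

class GPUBrightness:
    """Multiplicative brightness jitter applied to a whole batch on the GPU."""
    def __init__(self, low=0.8, high=1.2):
        self.low, self.high = low, high

    def __call__(self, batch):  # batch: (N, C, H, W) float tensor already on the GPU
        factors = torch.empty(batch.size(0), 1, 1, 1, device=batch.device)
        factors.uniform_(self.low, self.high)
        return (batch * factors).clamp_(0.0, 1.0)

class GPUCompose:
    """Chains GPU augmentations, mirroring torchvision.transforms.Compose."""
    def __init__(self, ops):
        self.ops = ops

    def __call__(self, batch):
        for op in self.ops:
            batch = op(batch)
        return batch

augment = GPUCompose([GPUBrightness(0.7, 1.3)])
batch = torch.rand(64, 3, 224, 224, device="cuda")
batch = augment(batch)  # every operation runs on the GPU
```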
Unfortunately, the GPU augmentation could not be smoothly interfaced with the dataloader without sacrificing the multithreading for data reading. However, the speed-ups obtained seem promising.
When running the gist code with a TITAN Xp GPU and an Intel(R) Xeon(R) E5-1620 v3 @ 3.50GHz CPU I get the plot below (note that I removed the loading time when plotting). It shows the computation time as a function of the number of multiplications: for each data point, about 500 RGB images of 224x224 are multiplied by a constant.
Of course, there is no clear reason why one would do 60 multiplications in a row.
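(Roughly the kind of measurement loop involved; the sizes and the timing helper below are my own sketch rather than the gist's actual code.)

```python
import time
import torch

def time_multiplications(images, n_mults):
    """Multiply a batch of images by a constant n_mults times and return the elapsed time."""
    start = time.time()
    out = images
    for _ in range(n_mults):
        out = out * 1.01  # stand-in for a brightness-style augmentation
    if out.is_cuda:
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    return time.time() - start

cpu_batch = torch.rand(500, 3, 224, 224)   # ~500 RGB images of 224x224
gpu_batch = cpu_batch.to("cuda")           # the transfer itself is excluded from the timing

for n in (1, 10, 30, 60):
    print(n, time_multiplications(cpu_batch, n), time_multiplications(gpu_batch, n))
```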
Beyond this toy benchmark, I implemented a small library, using the imgaug library as a reference, and implemented more functions on the GPU. For the augmentation set used in my project I obtained about a 3-4x speed-up, and the speed-up gets even larger as more augmentations are added.
So, how can I improve this API? How could something like this fit into a pull request? How can it be merged more smoothly into the dataloader while keeping the multithreading for data reading?
I still have to test the training time for the full system, but I don't believe there will be any overhead, since the images have to be copied to the GPU anyway.
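One possible arrangement, sketched under the assumption that the workers only decode to CPU tensors and all random augmentation happens on the GPU right after the copy that has to happen anyway (the dataset and helper below are placeholders):

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Workers only decode to CPU tensors; the random augmentation happens on the GPU.
dataset = datasets.FakeData(size=512, transform=transforms.ToTensor())  # placeholder
loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)

def gpu_random_hflip(batch):
    # Flip a random subset of the batch along the width dimension, on the GPU.
    mask = torch.rand(batch.size(0), device=batch.device) < 0.5
    batch[mask] = batch[mask].flip(-1)
    return batch

device = torch.device("cuda")
for images, _ in loader:
    images = images.to(device, non_blocking=True)  # the copy that happens anyway
    images = gpu_random_hflip(images)              # augmentation after the copy
    # ... forward / backward pass here ...
```

This keeps the multi-worker loading untouched, at the cost of the augmentation being applied per batch in the main process.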