Performance improvements for transforms v2 vs. v1 #6818

Closed
31 tasks done
pmeier opened this issue Oct 24, 2022 · 8 comments · Fixed by #6983
Labels
module: transforms Perf For performance improvements prototype

Comments

@pmeier
Collaborator

pmeier commented Oct 24, 2022

In addition to a lot of other goodies that transforms v2 will bring, we are also actively working on improving the performance. This is a tracker / overview issue of our progress.

Performance was measured with this benchmark script. Unless noted otherwise, the performance improvements reported above were computed on uint8, RGB images and videos while running single-threaded on CPU. You can find the full benchmark results alongside the benchmark script. The results will be updated as new PRs that affect the kernels are merged.

Kernels:

Transform Classes:

C++ (PyTorch core):

cc @vfdev-5 @datumbox @bjuncek

@datumbox
Contributor

datumbox commented Oct 24, 2022

Concerning elastic and all the affine transform kernels (affine, perspective, rotate), there are some very limited opportunities for optimization. Perhaps a couple of in-place ops in elastic_transform & _perspective_grid and a few optimizations in _apply_grid_transform (split of mask and img, bilinear fill estimation etc). Also some minor fixes related to the input assertion. @vfdev-5 would you be OK to assess on your side whether it makes sense to do these or leave the methods on _FT to avoid copy-pasting? Perhaps you have in mind other optimizations that I can't see that could affect performance?

Concerning crop, erase, pad, resize, horizontal_flip and vertical_flip, I don't see any further improvements other than the input assertions. It might be worth having a look on your side, @pmeier and @vfdev-5, in case you see something I don't.

@pmeier
Collaborator Author

pmeier commented Oct 26, 2022

I did another deep dive into the ops in the second paragraph of #6818 (comment) and I'm fairly confident that there is little we can do to improve on our side. The only two things I found are

  • For padding modes "edge" and "reflect" we cast to float32 and back:

    if (padding_mode != "constant") and img.dtype not in (torch.float32, torch.float64):
        # Here we temporary cast input tensor to float
        # until pytorch issue is resolved :
        # https://github.com/pytorch/pytorch/issues/40763
        need_cast = True
        img = img.to(torch.float32)

    There is a long-standing issue on PyTorch core, "Enhance supported types of functional.pad" (pytorch#40763), that reports this and is assigned to @vfdev-5.

  • We support "symmetric" padding in F.pad, but torch.nn.functional.pad doesn't. Thus, we have a custom implementation for it

    def _pad_symmetric(img: Tensor, padding: List[int]) -> Tensor:

    Since it is written in Python, a possible speed up would be to implement this padding mode in C++ on the PyTorch core side.
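
For readers unfamiliar with the mode: "symmetric" padding mirrors the row including its edge pixel, whereas "reflect" mirrors it without repeating the edge. The plain-Python sketch below shows only the index arithmetic on a single row. It is an illustration, not the torchvision `_pad_symmetric` implementation; the helper names `pad_indices` and `pad_row` are hypothetical, and it assumes the padding is not larger than the row (strictly smaller for "reflect").

```python
def pad_indices(size, left, right, mode):
    """Return source indices for a padded row of length left + size + right.

    Hypothetical helper for illustration only. Assumes 0 <= left, right <= size
    for "symmetric" and 0 <= left, right < size for "reflect".
    """
    center = list(range(size))
    if mode == "symmetric":
        # Mirror around the border, repeating the edge element.
        left_idx = list(range(left - 1, -1, -1))
        right_idx = [size - 1 - i for i in range(right)]
    elif mode == "reflect":
        # Mirror around the edge element, without repeating it.
        left_idx = list(range(left, 0, -1))
        right_idx = [size - 2 - i for i in range(right)]
    else:
        raise ValueError(f"unsupported mode: {mode!r}")
    return left_idx + center + right_idx


def pad_row(row, left, right, mode):
    """Pad a 1-D sequence by gathering elements at the computed indices."""
    return [row[i] for i in pad_indices(len(row), left, right, mode)]


row = [1, 2, 3, 4]
print(pad_row(row, 2, 2, "symmetric"))  # [2, 1, 1, 2, 3, 4, 4, 3]
print(pad_row(row, 2, 2, "reflect"))    # [3, 2, 1, 2, 3, 4, 3, 2]
```

For an image the same index lists would be built once per axis and used to gather rows and columns.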

Fixing this, we would get speed-ups for padding modes "edge", "reflect", and "symmetric" but not for the default and ubiquitous "constant" padding mode. Skimming the repository, it seems the only time we use non-"constant" padding is

img = torch_pad(img, padding, mode="reflect")

There, the image is guaranteed to be float and thus would not get any performance boost.

While I think both things mentioned above would be good to have in general, I don't think we should prioritize them.

@vfdev-5
Collaborator

vfdev-5 commented Oct 28, 2022

Concerning elastic and all the affine transform kernels (affine, perspective, rotate), there are some very limited opportunities for optimization. Perhaps a couple of in-place ops in elastic_transform & _perspective_grid and a few optimizations in _apply_grid_transform (split of mask and img, bilinear fill estimation etc). Also some minor fixes related to the input assertion. @vfdev-5 would you be OK to assess on your side whether it makes sense to do these or leave the methods on _FT to avoid copy-pasting? Perhaps you have in mind other optimizations that I can't see that could affect performance?

Having checked various options with affine, I see no obvious way to improve runtime performance. Yes, we could make some of the suggested changes in-place ("split of mask and img, bilinear fill estimation etc"). There is also an open issue about incorrect behaviour of bilinear mode when a non-None fill is provided (#6517). Given that, I think we can keep this implementation.

@vadimkantorov

About not vectorized bitwise shifts, is there an issue in core?

@pmeier
Collaborator Author

pmeier commented Oct 31, 2022

About not vectorized bitwise shifts, is there an issue in core?

I don't think so, but @alexsamardzic wanted to have a look at it.

Edit: pytorch/pytorch#88607

@datumbox
Contributor

datumbox commented Nov 10, 2022

@pmeier I'm keeping the list up-to-date with all linked PRs. I'm marking as [NEEDS RETEST]/[NEEDS TEST] any kernel that I touch to run further benchmarks and update the numbers.

@vadimkantorov

vadimkantorov commented Nov 10, 2022

An interesting question is whether a sequence of these transformations can be fused with Inductor/Dynamo (or something else?) to produce a fused low-memory-access CPU kernel (working with uint8 or fp32?), and how that interacts with the randomness of whether to apply a transform or not.

@datumbox
Contributor

datumbox commented Nov 15, 2022

Speed Benchmarks V1 vs V2

Summary

The Transforms V2 API is faster than V1 (stable) because it introduces several optimizations on the Transform Classes and Functional kernels. Any summary of the performance gains in a single number should be taken with a grain of salt because:

  1. The performance heavily depends on the selected configuration (CPU vs CUDA device, Tensor vs PIL backend, uint8 vs float32 dtypes, number of threads etc). Though we included in our benchmarks the most common configurations, different setups might yield different results.
  2. The execution times of the different Transforms vary significantly (often by orders of magnitude). Though we report % differences, a simple unweighted average can't tell the full story.
  3. The training speed depends on a multitude of factors, including the mix of augmentations, the size of the model etc. Though we use a commonly used SoTA recipe, the results can differ depending on whether we are IO/Memory/Compute bound.
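
To make point 2 concrete, here is a small sketch that contrasts an unweighted mean of per-row percentage changes with the reduction in total time. The three (V1, V2) pairs are 1-thread medians copied from the RandomInvert table in the detailed benchmarks; the function names are hypothetical and this is not the aggregation script used for the statistics in this comment.

```python
# (V1 median, V2 median) in microseconds, copied from three 1-thread rows
# of the RandomInvert table in the detailed benchmarks.
rows = {
    "cuda torch.uint8 (3, 400, 400)": (20, 25),
    "cpu torch.float32 (16, 3, 400, 400)": (4103, 4096),
    "cpu torch.uint8 (3, 400, 400)": (164, 50),
}


def relative_reduction(v1, v2):
    """Fraction of V1's time saved by V2 (negative means V2 is slower)."""
    return (v1 - v2) / v1


def unweighted_mean_reduction(rows):
    """Average the per-row reductions, giving every row the same weight."""
    per_row = [relative_reduction(v1, v2) for v1, v2 in rows.values()]
    return sum(per_row) / len(per_row)


def total_time_reduction(rows):
    """Reduction of the summed time, so slow rows dominate the result."""
    total_v1 = sum(v1 for v1, _ in rows.values())
    total_v2 = sum(v2 for _, v2 in rows.values())
    return relative_reduction(total_v1, total_v2)


print(f"unweighted mean: {unweighted_mean_reduction(rows):.1%}")  # 14.9%
print(f"total time:      {total_time_reduction(rows):.1%}")       # 2.7%
```

On these three rows the unweighted mean is dominated by the fast uint8 rows, while the total-time figure is dominated by the slow float32 batch, which is why neither number alone tells the full story.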

With the above in mind, here are some statistics that summarize the performance of the new API:

  1. Training: Using TorchVision's latest training recipe, we observe a significant 18% improvement in training time using the Tensor backend. The performance of the PIL backend remains the same.
  2. Transform Classes: The average improvement for the transform classes is about 8%. On the Tensor backend, float32 ops were improved on average by 9% and uint8 by 12%. On PIL backend the performance remains the same.
  3. Functional Kernels: The average improvement for the functional kernels is about 21%. On the Tensor backend, cpu performance was improved by 23% and cuda by 29%. On PIL backend the performance remains the same.

To estimate the above aggregate statistics we used this script on top of the detailed benchmarks:

Aggregate Statistics
TRANSFORMS:
Overall execution time reduction: -8.37%
                    %
device dtype         
cpu    float32  -7.47
       pil      -0.10
       uint8   -11.61
cuda   float32  -8.43
       uint8   -13.47
----------------------------
DISPATCHERS:
Overall execution time reduction: -21.49%
                    %
device dtype         
cpu    float32 -21.31
       pil      -3.26
       uint8   -24.21
cuda   float32 -29.09
       uint8   -29.43
----------------------------

Speed Benchmarks

For all benchmarks below we use PyTorch nightly 1.14.0.dev20221115, CUDA 11.6 and TorchVision main from ad128b7. The statistics were estimated on a p4d.24xlarge AWS instance with A100 GPUs. Since both V1 and V2 use the same PyTorch version, the speed improvements below don't include performance optimizations performed on the C++ kernels of Core.

Training

To assess the performance in real world applications, we trained a ResNet50 using TorchVision's SoTA recipe for a reduced number of 10 epochs across different setups:

PYTHONPATH=$PYTHONPATH:`pwd` python -u run_with_submitit.py --ngpus 8 --nodes 1 --model resnet50 --batch-size 128 --lr 0.5 --lr-scheduler cosineannealinglr --lr-warmup-epochs 5 --lr-warmup-method linear --auto-augment ta_wide --epochs 10 --random-erase 0.1 --label-smoothing 0.1 --mixup-alpha 0.2 --cutmix-alpha 1.0 --weight-decay 0.00002 --norm-weight-decay 0.0 --train-crop-size 176 --model-ema --val-resize-size 232 --ra-sampler --ra-reps 4 --data-path /datasets01/imagenet_full_size/061417/
Detailed Benchmarks

V1 using ad128b7 of main branch (PIL):

Submitted job_id: 77904
Epoch: [0] Total time: 0:03:07
Epoch: [1] Total time: 0:03:04
Epoch: [2] Total time: 0:03:03
Epoch: [3] Total time: 0:03:03
Epoch: [4] Total time: 0:03:02
Epoch: [5] Total time: 0:03:03
Epoch: [6] Total time: 0:03:03
Epoch: [7] Total time: 0:03:02
Epoch: [8] Total time: 0:03:00
Epoch: [9] Total time: 0:03:05

V1 using 46bd6d9 of #6952 (Tensor uint8):

Submitted job_id: 77827
Epoch: [0] Total time: 0:03:43
Epoch: [1] Total time: 0:04:05
Epoch: [2] Total time: 0:03:59
Epoch: [3] Total time: 0:04:24
Epoch: [4] Total time: 0:04:39
Epoch: [5] Total time: 0:04:42
Epoch: [6] Total time: 0:04:46
Epoch: [7] Total time: 0:04:42
Epoch: [8] Total time: 0:03:40
Epoch: [9] Total time: 0:03:32

V2 using 8b53036 of #6433 (PIL). Marginal median improvement of 1.64%:

Submitted job_id: 77905
Epoch: [0] Total time: 0:03:09
Epoch: [1] Total time: 0:03:02
Epoch: [2] Total time: 0:03:00
Epoch: [3] Total time: 0:03:00
Epoch: [4] Total time: 0:03:00
Epoch: [5] Total time: 0:02:59
Epoch: [6] Total time: 0:03:00
Epoch: [7] Total time: 0:03:00
Epoch: [8] Total time: 0:03:00
Epoch: [9] Total time: 0:03:00

V2 using bda072d of #6433 (Tensor uint8). Median improvement of 18.27%:

Submitted job_id: 77901
Epoch: [0] Total time: 0:03:52
Epoch: [1] Total time: 0:03:36
Epoch: [2] Total time: 0:03:35
Epoch: [3] Total time: 0:03:31
Epoch: [4] Total time: 0:03:28
Epoch: [5] Total time: 0:03:28
Epoch: [6] Total time: 0:03:28
Epoch: [7] Total time: 0:03:26
Epoch: [8] Total time: 0:03:27
Epoch: [9] Total time: 0:03:25
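
The quoted median improvements can be reproduced from the epoch logs above. The source does not spell out the formula, so the sketch below assumes the plain median of the ten epoch times per run and the relative difference between the V1 and V2 medians; the helper names are hypothetical.

```python
from statistics import median


def to_seconds(stamp):
    """Convert an 'H:MM:SS' epoch time from the logs into seconds."""
    hours, minutes, seconds = (int(part) for part in stamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def median_improvement(v1_times, v2_times):
    """Relative reduction of the median epoch time going from V1 to V2."""
    v1 = median(to_seconds(t) for t in v1_times)
    v2 = median(to_seconds(t) for t in v2_times)
    return (v1 - v2) / v1


# Epoch times copied from the logs above.
v1_pil = ["0:03:07", "0:03:04", "0:03:03", "0:03:03", "0:03:02",
          "0:03:03", "0:03:03", "0:03:02", "0:03:00", "0:03:05"]
v2_pil = ["0:03:09", "0:03:02", "0:03:00", "0:03:00", "0:03:00",
          "0:02:59", "0:03:00", "0:03:00", "0:03:00", "0:03:00"]
v1_uint8 = ["0:03:43", "0:04:05", "0:03:59", "0:04:24", "0:04:39",
            "0:04:42", "0:04:46", "0:04:42", "0:03:40", "0:03:32"]
v2_uint8 = ["0:03:52", "0:03:36", "0:03:35", "0:03:31", "0:03:28",
            "0:03:28", "0:03:28", "0:03:26", "0:03:27", "0:03:25"]

print(f"PIL:          {median_improvement(v1_pil, v2_pil):.2%}")      # 1.64%
print(f"Tensor uint8: {median_improvement(v1_uint8, v2_uint8):.2%}")  # 18.27%
```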

V2 using 8f07159 of #6433 (Tensor float32). Note that this configuration wasn't supported in V1 because not all kernels and augmentations supported floats:

Submitted job_id: 77902
Epoch: [0] Total time: 0:04:25
Epoch: [1] Total time: 0:04:13
Epoch: [2] Total time: 0:04:12
Epoch: [3] Total time: 0:04:12
Epoch: [4] Total time: 0:04:13
Epoch: [5] Total time: 0:04:10
Epoch: [6] Total time: 0:04:11
Epoch: [7] Total time: 0:04:11
Epoch: [8] Total time: 0:04:12
Epoch: [9] Total time: 0:04:11

Transform Classes

Generated using the following script, inspired by earlier iterations by @vfdev-5 and amended by @pmeier. We compare V1 against V2 for all kernels across many configurations (batch size, dtype, device, number of threads etc) and then estimate the average performance improvement across all configurations to summarize the end result.
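
The layout of the tables below appears to match what `torch.utils.benchmark.Compare` prints, so a comparison of this kind can be sketched as follows. This is a minimal illustration, not the benchmark script referenced above: `invert_subtract` and `invert_bitwise` are hypothetical stand-ins rather than the V1 and V2 kernels, only CPU and uint8 are covered, and the run time per measurement is kept short.

```python
import torch
from torch.utils import benchmark


def invert_subtract(img):
    # Hypothetical "old" variant: subtract from the maximum uint8 value.
    return 255 - img


def invert_bitwise(img):
    # Hypothetical "new" variant: for uint8, 255 - x equals bitwise NOT.
    return img.bitwise_not()


def run_benchmark(shapes=((3, 400, 400), (16, 3, 400, 400)),
                  num_threads=1, min_run_time=0.05):
    """Time both variants on random uint8 CPU tensors of each shape."""
    results = []
    for shape in shapes:
        img = torch.randint(0, 256, shape, dtype=torch.uint8)
        for description, fn in (("V1", invert_subtract), ("V2", invert_bitwise)):
            timer = benchmark.Timer(
                stmt="fn(img)",
                globals={"fn": fn, "img": img},
                num_threads=num_threads,
                label="Invert (sketch)",
                sub_label=f"cpu {img.dtype} {tuple(shape)}",
                description=description,
            )
            # blocked_autorange picks the number of runs and returns a Measurement.
            results.append(timer.blocked_autorange(min_run_time=min_run_time))
    return results


results = run_benchmark()
# Rows are grouped by sub_label, columns by description.
benchmark.Compare(results).print()
```

Each `Measurement` exposes a `median`, which is one way to derive per-row percentage differences from such a run.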

Detailed Benchmarks
[-------------------------------- RandomErasing ---------------------------------]
                                            |         V1        |         V2      
1 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   359 (+- 92) us  |   333 (+-  2) us
      cuda torch.float32 (3, 400, 400)      |   322 (+-  2) us  |   331 (+-  3) us
      cpu torch.float32 (16, 3, 400, 400)   |  4995 (+- 80) us  |  4978 (+- 54) us
      cuda torch.float32 (16, 3, 400, 400)  |  2144 (+-102) us  |  2135 (+-102) us
      cpu torch.uint8 (3, 400, 400)         |   219 (+-  1) us  |   226 (+-  2) us
      cuda torch.uint8 (3, 400, 400)        |   227 (+-  2) us  |   236 (+-  2) us
      cpu torch.uint8 (16, 3, 400, 400)     |  1787 (+- 44) us  |  1789 (+- 42) us
      cuda torch.uint8 (16, 3, 400, 400)    |  1313 (+- 55) us  |  1316 (+- 56) us
6 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   410 (+-  4) us  |   418 (+-  3) us
      cpu torch.float32 (16, 3, 400, 400)   |  5191 (+- 78) us  |  5225 (+- 61) us
      cpu torch.uint8 (3, 400, 400)         |   302 (+-  3) us  |   310 (+-  4) us
      cpu torch.uint8 (16, 3, 400, 400)     |  1973 (+- 40) us  |  1977 (+- 49) us

Times are in microseconds (us).
Performance of V1 vs V2: -1.228% (slowdown)

[---------------------------------- AugMix ----------------------------------]
                                          |        V1        |        V2      
1 threads: -------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |   22 (+-  7) ms  |   19 (+-  2) ms
      cuda torch.uint8 (3, 400, 400)      |    2 (+-  1) ms  |    2 (+-  0) ms
      cpu torch.uint8 (16, 3, 400, 400)   |  736 (+-262) ms  |  738 (+-234) ms
      cuda torch.uint8 (16, 3, 400, 400)  |   10 (+-  3) ms  |    3 (+-  0) ms
      cpu pil (3, 400, 400)               |   25 (+-  3) ms  |   23 (+-  2) ms
6 threads: -------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |   27 (+-  3) ms  |   23 (+-  3) ms
      cpu torch.uint8 (16, 3, 400, 400)   |  803 (+-271) ms  |  735 (+-240) ms

Times are in milliseconds (ms).
Performance of V1 vs V2: 21.496% (improvement)

[------------------------------------ AutoAugment ------------------------------------]
                                          |           V1          |          V2        
1 threads: ----------------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |    3478 (+-253) us    |   2952 (+-251) us  
      cuda torch.uint8 (3, 400, 400)      |     746 (+- 27) us    |    317 (+-  6) us  
      cpu torch.uint8 (16, 3, 400, 400)   |  103178 (+-19894) us  |  87614 (+-27733) us
      cuda torch.uint8 (16, 3, 400, 400)  |    6868 (+-671) us    |    635 (+- 18) us  
      cpu pil (3, 400, 400)               |    1194 (+-133) us    |   1153 (+- 31) us  
6 threads: ----------------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |    4128 (+-269) us    |   3366 (+-278) us  
      cpu torch.uint8 (16, 3, 400, 400)   |   72148 (+-94797) us  |  89567 (+-30107) us

Times are in microseconds (us).
Performance of V1 vs V2: 30.867% (improvement)

[------------------------------------- RandAugment --------------------------------------]
                                          |           V1           |           V2         
1 threads: -------------------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |    6604 (+-1089) us    |    5928 (+-275) us   
      cuda torch.uint8 (3, 400, 400)      |     798 (+- 14) us     |     574 (+- 10) us   
      cpu torch.uint8 (16, 3, 400, 400)   |  172182 (+-119305) us  |  162579 (+-110068) us
      cuda torch.uint8 (16, 3, 400, 400)  |    2982 (+-580) us     |     945 (+- 47) us   
      cpu pil (3, 400, 400)               |    2036 (+-149) us     |    1933 (+-147) us   
6 threads: -------------------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |    7738 (+-1201) us    |    6920 (+-1190) us  
      cpu torch.uint8 (16, 3, 400, 400)   |  180085 (+-119892) us  |  163626 (+-115677) us

Times are in microseconds (us).
Performance of V1 vs V2: 20.997% (improvement)

[--------------------------------- TrivialAugmentWide ---------------------------------]
                                          |           V1           |          V2        
1 threads: -----------------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |    3387 (+-329) us     |   3081 (+-321) us  
      cuda torch.uint8 (3, 400, 400)      |     451 (+- 13) us     |    297 (+-  8) us  
      cpu torch.uint8 (16, 3, 400, 400)   |  101788 (+-91224) us   |  89224 (+-87124) us
      cuda torch.uint8 (16, 3, 400, 400)  |    1578 (+-373) us     |    501 (+- 19) us  
      cpu pil (3, 400, 400)               |    1133 (+-137) us     |   1062 (+-138) us  
6 threads: -----------------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |    4069 (+-355) us     |   3618 (+-361) us  
      cpu torch.uint8 (16, 3, 400, 400)   |  102527 (+-100556) us  |  91264 (+-90662) us

Times are in microseconds (us).
Performance of V1 vs V2: 22.838% (improvement)

[------------------------------------- ColorJitter --------------------------------------]
                                            |           V1          |           V2        
1 threads: -------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    24640 (+-766) us   |    16808 (+-187) us 
      cuda torch.float32 (3, 400, 400)      |    1071 (+- 36) us    |     791 (+- 33) us  
      cpu torch.float32 (16, 3, 400, 400)   |  899045 (+-18215) us  |  495452 (+-23080) us
      cuda torch.float32 (16, 3, 400, 400)  |    6444 (+-  6) us    |    2648 (+-  1) us  
      cpu torch.uint8 (3, 400, 400)         |    26271 (+-237) us   |    18410 (+-126) us 
      cuda torch.uint8 (3, 400, 400)        |    1200 (+-  9) us    |     887 (+-  5) us  
      cpu torch.uint8 (16, 3, 400, 400)     |  938875 (+-13454) us  |  534734 (+-12761) us
      cuda torch.uint8 (16, 3, 400, 400)    |    6657 (+-  1) us    |    2942 (+-  0) us  
      cpu pil (3, 400, 400)                 |    14835 (+-410) us   |    14801 (+-402) us 
6 threads: -------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    29098 (+-352) us   |    20871 (+-425) us 
      cpu torch.float32 (16, 3, 400, 400)   |  914067 (+-20531) us  |  528114 (+-15384) us
      cpu torch.uint8 (3, 400, 400)         |    31858 (+-314) us   |    23345 (+-330) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  946323 (+-17617) us  |  523300 (+-14203) us

Times are in microseconds (us).
Performance of V1 vs V2: 31.440% (improvement)

[-------------------------------- RandomAdjustSharpness ---------------------------------]
                                            |           V1          |           V2        
1 threads: -------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    5130 (+- 24) us    |    4030 (+- 74) us  
      cuda torch.float32 (3, 400, 400)      |     187 (+-  1) us    |     147 (+-  1) us  
      cpu torch.float32 (16, 3, 400, 400)   |   202595 (+-768) us   |   185755 (+-6337) us
      cuda torch.float32 (16, 3, 400, 400)  |     489 (+-  1) us    |     382 (+-  1) us  
      cpu torch.uint8 (3, 400, 400)         |    5564 (+- 39) us    |    4288 (+- 19) us  
      cuda torch.uint8 (3, 400, 400)        |     222 (+-  1) us    |     157 (+-  1) us  
      cpu torch.uint8 (16, 3, 400, 400)     |   217870 (+-6078) us  |   191308 (+-4504) us
      cuda torch.uint8 (16, 3, 400, 400)    |     578 (+-  1) us    |     458 (+-  1) us  
      cpu pil (3, 400, 400)                 |    3561 (+- 16) us    |    3581 (+- 12) us  
6 threads: -------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    6139 (+- 47) us    |    4912 (+- 44) us  
      cpu torch.float32 (16, 3, 400, 400)   |   220111 (+-8890) us  |   201278 (+-1894) us
      cpu torch.uint8 (3, 400, 400)         |    6848 (+- 41) us    |    5268 (+- 52) us  
      cpu torch.uint8 (16, 3, 400, 400)     |  235867 (+-27195) us  |  207550 (+-20399) us

Times are in microseconds (us).
Performance of V1 vs V2: 15.608% (improvement)

[------------------------------- RandomAutocontrast -------------------------------]
                                            |         V1         |         V2       
1 threads: -------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   721 (+-  1) us   |   572 (+-  3) us 
      cuda torch.float32 (3, 400, 400)      |   177 (+- 20) us   |   117 (+-  1) us 
      cpu torch.float32 (16, 3, 400, 400)   |  18869 (+-343) us  |  14033 (+- 95) us
      cuda torch.float32 (16, 3, 400, 400)  |   239 (+-  0) us   |   222 (+-  0) us 
      cpu torch.uint8 (3, 400, 400)         |  1144 (+-  8) us   |   809 (+-  5) us 
      cuda torch.uint8 (3, 400, 400)        |   177 (+-  1) us   |   132 (+-  1) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  24274 (+-155) us  |  13676 (+-130) us
      cuda torch.uint8 (16, 3, 400, 400)    |   256 (+-  5) us   |   273 (+-  0) us 
      cpu pil (3, 400, 400)                 |   747 (+-  2) us   |   767 (+-  1) us 
6 threads: -------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   943 (+- 19) us   |   791 (+- 23) us 
      cpu torch.float32 (16, 3, 400, 400)   |  19014 (+-248) us  |  14404 (+-359) us
      cpu torch.uint8 (3, 400, 400)         |  1460 (+- 15) us   |  1112 (+- 32) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  25074 (+-235) us  |  14291 (+-235) us

Times are in microseconds (us).
Performance of V1 vs V2: 17.171% (improvement)

[--------------------------------- RandomEqualize --------------------------------]
                                          |          V1         |         V2       
1 threads: ------------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |   2913 (+- 12) us   |  2411 (+- 11) us 
      cuda torch.uint8 (3, 400, 400)      |    978 (+-306) us   |   288 (+-  1) us 
      cpu torch.uint8 (16, 3, 400, 400)   |   47271 (+-185) us  |  40238 (+-157) us
      cuda torch.uint8 (16, 3, 400, 400)  |  14421 (+-1185) us  |   826 (+-  1) us 
      cpu pil (3, 400, 400)               |    756 (+-  2) us   |   776 (+-  1) us 
6 threads: ------------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |   3649 (+- 38) us   |  2615 (+- 28) us 
      cpu torch.uint8 (16, 3, 400, 400)   |  59636 (+-1869) us  |  40607 (+-454) us

Times are in microseconds (us).
Performance of V1 vs V2: 34.185% (improvement)

[--------------------------------- RandomInvert ---------------------------------]
                                            |         V1        |         V2      
1 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   187 (+-  1) us  |   195 (+-  1) us
      cuda torch.float32 (3, 400, 400)      |    20 (+-  0) us  |    28 (+-  0) us
      cpu torch.float32 (16, 3, 400, 400)   |  4103 (+- 33) us  |  4096 (+- 25) us
      cuda torch.float32 (16, 3, 400, 400)  |    49 (+-  0) us  |    49 (+-  0) us
      cpu torch.uint8 (3, 400, 400)         |   164 (+-  1) us  |    50 (+-  0) us
      cuda torch.uint8 (3, 400, 400)        |    20 (+-  0) us  |    25 (+-  0) us
      cpu torch.uint8 (16, 3, 400, 400)     |  2282 (+- 19) us  |   627 (+-  1) us
      cuda torch.uint8 (16, 3, 400, 400)    |    20 (+-  0) us  |    25 (+-  0) us
      cpu pil (3, 400, 400)                 |   327 (+-  1) us  |   346 (+-  1) us
6 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   234 (+-  3) us  |   242 (+-  3) us
      cpu torch.float32 (16, 3, 400, 400)   |  4412 (+- 56) us  |  4392 (+- 34) us
      cpu torch.uint8 (3, 400, 400)         |   208 (+-  3) us  |    95 (+-  2) us
      cpu torch.uint8 (16, 3, 400, 400)     |  2352 (+- 31) us  |   684 (+-  6) us

Times are in microseconds (us).
Performance of V1 vs V2: 3.451% (improvement)

[------------------------------ RandomPosterize -------------------------------]
                                          |         V1        |         V2      
1 threads: ---------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |   127 (+-  1) us  |   136 (+-  1) us
      cuda torch.uint8 (3, 400, 400)      |    20 (+-  0) us  |    28 (+-  0) us
      cpu torch.uint8 (16, 3, 400, 400)   |  1636 (+-  7) us  |  1642 (+- 20) us
      cuda torch.uint8 (16, 3, 400, 400)  |    20 (+-  0) us  |    28 (+-  0) us
      cpu pil (3, 400, 400)               |   334 (+-  1) us  |   354 (+-  2) us
6 threads: ---------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |   169 (+-  2) us  |   178 (+-  2) us
      cpu torch.uint8 (16, 3, 400, 400)   |  1700 (+- 17) us  |  1708 (+- 26) us

Times are in microseconds (us).
Performance of V1 vs V2: -16.203% (slowdown)

[--------------------------------- RandomSolarize ---------------------------------]
                                            |         V1         |         V2       
1 threads: -------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   957 (+-  6) us   |   961 (+-  6) us 
      cuda torch.float32 (3, 400, 400)      |    41 (+-  0) us   |    50 (+-  0) us 
      cpu torch.float32 (16, 3, 400, 400)   |  17249 (+- 98) us  |  17450 (+-231) us
      cuda torch.float32 (16, 3, 400, 400)  |   157 (+-  0) us   |   159 (+-  0) us 
      cpu torch.uint8 (3, 400, 400)         |  1081 (+-  7) us   |   976 (+-  8) us 
      cuda torch.uint8 (3, 400, 400)        |    40 (+-  0) us   |    44 (+-  0) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  18245 (+-111) us  |  16818 (+-108) us
      cuda torch.uint8 (16, 3, 400, 400)    |    60 (+-  0) us   |    62 (+-  0) us 
      cpu pil (3, 400, 400)                 |   333 (+-  1) us   |   353 (+-  1) us 
6 threads: -------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |  1104 (+- 20) us   |  1107 (+- 20) us 
      cpu torch.float32 (16, 3, 400, 400)   |  17576 (+-205) us  |  17469 (+-322) us
      cpu torch.uint8 (3, 400, 400)         |  1249 (+- 20) us   |  1139 (+- 82) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  18673 (+-280) us  |  17263 (+-308) us

Times are in microseconds (us).
Performance of V1 vs V2: -3.361% (slowdown)

[--------------------------------- CenterCrop ---------------------------------]
                                            |        V1        |        V2      
1 threads: ---------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   11 (+-  0) us  |    9 (+-  0) us
      cuda torch.float32 (3, 400, 400)      |   12 (+-  0) us  |   10 (+-  0) us
      cpu torch.float32 (16, 3, 400, 400)   |   12 (+-  0) us  |   10 (+-  0) us
      cuda torch.float32 (16, 3, 400, 400)  |   12 (+-  0) us  |   10 (+-  0) us
      cpu torch.uint8 (3, 400, 400)         |   11 (+-  0) us  |    9 (+-  0) us
      cuda torch.uint8 (3, 400, 400)        |   12 (+-  0) us  |   10 (+-  0) us
      cpu torch.uint8 (16, 3, 400, 400)     |   11 (+-  0) us  |    9 (+-  0) us
      cuda torch.uint8 (16, 3, 400, 400)    |   12 (+-  0) us  |   10 (+-  0) us
      cpu pil (3, 400, 400)                 |   17 (+-  0) us  |   16 (+-  0) us
6 threads: ---------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   11 (+-  0) us  |    9 (+-  0) us
      cpu torch.float32 (16, 3, 400, 400)   |   12 (+-  0) us  |    9 (+-  0) us
      cpu torch.uint8 (3, 400, 400)         |   11 (+-  0) us  |    9 (+-  0) us
      cpu torch.uint8 (16, 3, 400, 400)     |   12 (+-  0) us  |    9 (+-  0) us

Times are in microseconds (us).
Performance of V1 vs V2: 15.328% (improvement)

[------------------------------ ElasticTransform ------------------------------]
                                            |        V1        |        V2      
1 threads: ---------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |  100 (+-  1) ms  |  100 (+-  1) ms
      cuda torch.float32 (3, 400, 400)      |   96 (+-  1) ms  |   96 (+-  1) ms
      cpu torch.float32 (16, 3, 400, 400)   |  181 (+-  4) ms  |  166 (+-  2) ms
      cuda torch.float32 (16, 3, 400, 400)  |   97 (+-  1) ms  |   96 (+-  1) ms
      cpu torch.uint8 (3, 400, 400)         |  101 (+-  1) ms  |  100 (+-  1) ms
      cuda torch.uint8 (3, 400, 400)        |   96 (+-  1) ms  |   96 (+-  1) ms
      cpu torch.uint8 (16, 3, 400, 400)     |  193 (+-  5) ms  |  176 (+-  2) ms
      cuda torch.uint8 (16, 3, 400, 400)    |   97 (+-  2) ms  |   96 (+-  1) ms
      cpu pil (3, 400, 400)                 |  104 (+-  1) ms  |  103 (+-  1) ms
6 threads: ---------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |  103 (+-  2) ms  |  101 (+-  2) ms
      cpu torch.float32 (16, 3, 400, 400)   |  184 (+-  2) ms  |  170 (+-  3) ms
      cpu torch.uint8 (3, 400, 400)         |  103 (+-  1) ms  |  102 (+-  1) ms
      cpu torch.uint8 (16, 3, 400, 400)     |  197 (+-  2) ms  |  181 (+-  2) ms

Times are in milliseconds (ms).
Performance of V1 vs V2: 2.308% (improvement)

[---------------------------------- FiveCrop ----------------------------------]
                                            |        V1        |        V2      
1 threads: ---------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   40 (+-  0) us  |   27 (+-  0) us
      cuda torch.float32 (3, 400, 400)      |   42 (+-  0) us  |   29 (+-  0) us
      cpu torch.float32 (16, 3, 400, 400)   |   40 (+-  0) us  |   28 (+-  0) us
      cuda torch.float32 (16, 3, 400, 400)  |   42 (+-  0) us  |   29 (+-  0) us
      cpu torch.uint8 (3, 400, 400)         |   40 (+-  0) us  |   27 (+-  0) us
      cuda torch.uint8 (3, 400, 400)        |   41 (+-  0) us  |   28 (+-  0) us
      cpu torch.uint8 (16, 3, 400, 400)     |   40 (+-  0) us  |   27 (+-  0) us
      cuda torch.uint8 (16, 3, 400, 400)    |   42 (+-  0) us  |   29 (+-  0) us
      cpu pil (3, 400, 400)                 |  111 (+-  1) us  |  106 (+-  0) us
6 threads: ---------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   40 (+-  0) us  |   28 (+-  0) us
      cpu torch.float32 (16, 3, 400, 400)   |   40 (+-  0) us  |   28 (+-  0) us
      cpu torch.uint8 (3, 400, 400)         |   40 (+-  0) us  |   27 (+-  0) us
      cpu torch.uint8 (16, 3, 400, 400)     |   40 (+-  0) us  |   27 (+-  0) us

Times are in microseconds (us).
Performance of V1 vs V2: 26.077% (improvement)

[------------------------------------- Pad --------------------------------------]
                                            |         V1        |         V2      
1 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   287 (+-  1) us  |   297 (+-  1) us
      cuda torch.float32 (3, 400, 400)      |    27 (+-  0) us  |    34 (+-  0) us
      cpu torch.float32 (16, 3, 400, 400)   |  6883 (+- 43) us  |  6922 (+- 63) us
      cuda torch.float32 (16, 3, 400, 400)  |    79 (+-  0) us  |    79 (+-  0) us
      cpu torch.uint8 (3, 400, 400)         |   220 (+-  1) us  |   230 (+-  1) us
      cuda torch.uint8 (3, 400, 400)        |    27 (+-  0) us  |    34 (+-  0) us
      cpu torch.uint8 (16, 3, 400, 400)     |  3231 (+- 20) us  |  3249 (+- 11) us
      cuda torch.uint8 (16, 3, 400, 400)    |    38 (+-  0) us  |    38 (+-  0) us
      cpu pil (3, 400, 400)                 |   147 (+-  1) us  |   151 (+-  1) us
6 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   376 (+-  3) us  |   388 (+-  3) us
      cpu torch.float32 (16, 3, 400, 400)   |  6969 (+-193) us  |  7031 (+- 63) us
      cpu torch.uint8 (3, 400, 400)         |   302 (+-  3) us  |   314 (+-  3) us
      cpu torch.uint8 (16, 3, 400, 400)     |  3354 (+- 25) us  |  3379 (+- 35) us

Times are in microseconds (us).
Performance of V1 vs V2: -6.993% (slowdown)

[------------------------------------- Resize -------------------------------------]
                                            |         V1         |         V2       
1 threads: -------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |  1096 (+-  7) us   |  1103 (+-  7) us 
      cuda torch.float32 (3, 400, 400)      |    23 (+-  0) us   |    25 (+-  0) us 
      cpu torch.float32 (16, 3, 400, 400)   |  16734 (+-116) us  |  16712 (+- 95) us
      cuda torch.float32 (16, 3, 400, 400)  |   162 (+-  1) us   |   162 (+-  0) us 
      cpu torch.uint8 (3, 400, 400)         |  1391 (+-  8) us   |  1370 (+-  9) us 
      cuda torch.uint8 (3, 400, 400)        |    51 (+-  0) us   |    53 (+-  0) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  22197 (+-127) us  |  22000 (+-143) us
      cuda torch.uint8 (16, 3, 400, 400)    |   229 (+-  0) us   |   228 (+-  0) us 
      cpu pil (3, 400, 400)                 |  1124 (+-  5) us   |  1125 (+-  7) us 
6 threads: -------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |  1186 (+- 22) us   |  1191 (+- 20) us 
      cpu torch.float32 (16, 3, 400, 400)   |  16956 (+-317) us  |  16976 (+-184) us
      cpu torch.uint8 (3, 400, 400)         |  1608 (+- 21) us   |  1586 (+- 26) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  22713 (+-290) us  |  22526 (+-420) us

Times are in microseconds (us).
Performance of V1 vs V2: -1.247% (slowdown)

[----------------------------------- TenCrop ------------------------------------]
                                            |         V1        |         V2      
1 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   324 (+-  1) us  |   295 (+-  1) us
      cuda torch.float32 (3, 400, 400)      |    99 (+-  1) us  |    66 (+-  1) us
      cpu torch.float32 (16, 3, 400, 400)   |  6064 (+-163) us  |  5996 (+- 19) us
      cuda torch.float32 (16, 3, 400, 400)  |    99 (+-  1) us  |    66 (+-  0) us
      cpu torch.uint8 (3, 400, 400)         |   386 (+-  1) us  |   357 (+-  2) us
      cuda torch.uint8 (3, 400, 400)        |    98 (+-  1) us  |    66 (+-  0) us
      cpu torch.uint8 (16, 3, 400, 400)     |  4660 (+- 13) us  |  4626 (+- 17) us
      cuda torch.uint8 (16, 3, 400, 400)    |    99 (+-  1) us  |    67 (+-  0) us
      cpu pil (3, 400, 400)                 |   356 (+-  1) us  |   328 (+-  1) us
6 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   374 (+-  3) us  |   344 (+-  4) us
      cpu torch.float32 (16, 3, 400, 400)   |  6064 (+- 66) us  |  6027 (+- 57) us
      cpu torch.uint8 (3, 400, 400)         |   433 (+-  4) us  |   403 (+-  4) us
      cpu torch.uint8 (16, 3, 400, 400)     |  4741 (+- 36) us  |  4709 (+- 40) us

Times are in microseconds (us).
Performance of V1 vs V2: 16.425% (improvement)

[------------------------------------- RandomAffine -------------------------------------]
                                            |           V1          |           V2        
1 threads: -------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    14378 (+-906) us   |    14169 (+-104) us 
      cuda torch.float32 (3, 400, 400)      |     555 (+- 32) us    |     514 (+-  2) us  
      cpu torch.float32 (16, 3, 400, 400)   |  453405 (+-30956) us  |  456598 (+-31138) us
      cuda torch.float32 (16, 3, 400, 400)  |    1584 (+- 15) us    |    1579 (+- 10) us  
      cpu torch.uint8 (3, 400, 400)         |    14589 (+-319) us   |    14540 (+-305) us 
      cuda torch.uint8 (3, 400, 400)        |     550 (+-  1) us    |     543 (+-  2) us  
      cpu torch.uint8 (16, 3, 400, 400)     |  472450 (+-33505) us  |  463440 (+-32430) us
      cuda torch.uint8 (16, 3, 400, 400)    |    1685 (+-  9) us    |    1677 (+- 10) us  
      cpu pil (3, 400, 400)                 |     359 (+-  1) us    |     365 (+-  2) us  
6 threads: -------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    14949 (+-351) us   |    14951 (+-326) us 
      cpu torch.float32 (16, 3, 400, 400)   |  458052 (+-32305) us  |  457932 (+-31797) us
      cpu torch.uint8 (3, 400, 400)         |    15542 (+-337) us   |    15445 (+-329) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  470605 (+-33002) us  |  468819 (+-33084) us

Times are in microseconds (us).
Performance of V1 vs V2: 0.725% (improvement)

[---------------------------------- RandomCrop ----------------------------------]
                                            |         V1        |         V2      
1 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   323 (+-  1) us  |   335 (+-  1) us
      cuda torch.float32 (3, 400, 400)      |    55 (+-  0) us  |    62 (+-  0) us
      cpu torch.float32 (16, 3, 400, 400)   |  6914 (+-112) us  |  6929 (+- 34) us
      cuda torch.float32 (16, 3, 400, 400)  |    79 (+-  0) us  |    79 (+-  0) us
      cpu torch.uint8 (3, 400, 400)         |   250 (+-  1) us  |   262 (+-  1) us
      cuda torch.uint8 (3, 400, 400)        |    54 (+-  0) us  |    61 (+-  0) us
      cpu torch.uint8 (16, 3, 400, 400)     |  3241 (+-  8) us  |  3259 (+- 16) us
      cuda torch.uint8 (16, 3, 400, 400)    |    48 (+-  0) us  |    62 (+-  0) us
      cpu pil (3, 400, 400)                 |   203 (+-  1) us  |   208 (+- 11) us
6 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   416 (+-  4) us  |   428 (+-  4) us
      cpu torch.float32 (16, 3, 400, 400)   |  7003 (+-236) us  |  7069 (+- 66) us
      cpu torch.uint8 (3, 400, 400)         |   337 (+-  4) us  |   349 (+-  4) us
      cpu torch.uint8 (16, 3, 400, 400)     |  3375 (+- 36) us  |  3395 (+- 34) us

Times are in microseconds (us).
Performance of V1 vs V2: -6.462% (slowdown)

[----------------------------- RandomHorizontalFlip -----------------------------]
                                            |         V1        |         V2      
1 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   247 (+-  1) us  |   254 (+-  1) us
      cuda torch.float32 (3, 400, 400)      |    23 (+-  0) us  |    27 (+-  0) us
      cpu torch.float32 (16, 3, 400, 400)   |  5931 (+- 48) us  |  5919 (+- 15) us
      cuda torch.float32 (16, 3, 400, 400)  |    51 (+-  0) us  |    51 (+-  0) us
      cpu torch.uint8 (3, 400, 400)         |   309 (+-  1) us  |   316 (+-  1) us
      cuda torch.uint8 (3, 400, 400)        |    23 (+-  0) us  |    27 (+-  0) us
      cpu torch.uint8 (16, 3, 400, 400)     |  4574 (+-  9) us  |  4582 (+-  9) us
      cuda torch.uint8 (16, 3, 400, 400)    |    23 (+-  0) us  |    27 (+-  0) us
      cpu pil (3, 400, 400)                 |   133 (+-  1) us  |   138 (+-  1) us
6 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   296 (+-  2) us  |   304 (+-  3) us
      cpu torch.float32 (16, 3, 400, 400)   |  5932 (+- 76) us  |  5928 (+- 38) us
      cpu torch.uint8 (3, 400, 400)         |   354 (+-  3) us  |   360 (+-  4) us
      cpu torch.uint8 (16, 3, 400, 400)     |  4647 (+- 35) us  |  4654 (+- 44) us

Times are in microseconds (us).
Performance of V1 vs V2: -5.806% (slowdown)

[---------------------------------- RandomPerspective -----------------------------------]
                                            |           V1          |           V2        
1 threads: -------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    5669 (+-309) us    |    5016 (+- 14) us  
      cuda torch.float32 (3, 400, 400)      |     668 (+-  2) us    |     638 (+-  1) us  
      cpu torch.float32 (16, 3, 400, 400)   |  103699 (+-11683) us  |   87578 (+-11757) us
      cuda torch.float32 (16, 3, 400, 400)  |     872 (+- 11) us    |     852 (+-  6) us  
      cpu torch.uint8 (3, 400, 400)         |    6140 (+- 17) us    |    5418 (+- 14) us  
      cuda torch.uint8 (3, 400, 400)        |     707 (+-  2) us    |     672 (+-  1) us  
      cpu torch.uint8 (16, 3, 400, 400)     |  115905 (+-11269) us  |   96945 (+-11355) us
      cuda torch.uint8 (16, 3, 400, 400)    |     915 (+-  8) us    |     897 (+-  8) us  
      cpu pil (3, 400, 400)                 |    3385 (+- 40) us    |    3410 (+- 40) us  
6 threads: -------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    6295 (+- 50) us    |    5589 (+- 48) us  
      cpu torch.float32 (16, 3, 400, 400)   |  106728 (+-12443) us  |   90306 (+-12163) us
      cpu torch.uint8 (3, 400, 400)         |    6919 (+- 64) us    |    6111 (+- 39) us  
      cpu torch.uint8 (16, 3, 400, 400)     |  118305 (+-11773) us  |  100258 (+-11661) us

Times are in microseconds (us).
Performance of V1 vs V2: 6.612% (improvement)

[-------------------------------- RandomResizedCrop ---------------------------------]
                                            |          V1         |          V2       
1 threads: ---------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    845 (+- 20) us   |    835 (+- 20) us 
      cuda torch.float32 (3, 400, 400)      |    108 (+-  1) us   |     97 (+-  1) us 
      cpu torch.float32 (16, 3, 400, 400)   |   11057 (+-922) us  |   11051 (+-924) us
      cuda torch.float32 (16, 3, 400, 400)  |    119 (+-  2) us   |    119 (+-  1) us 
      cpu torch.uint8 (3, 400, 400)         |   1053 (+- 23) us   |   1014 (+- 23) us 
      cuda torch.uint8 (3, 400, 400)        |    134 (+-  1) us   |    122 (+-  1) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  14512 (+-1163) us  |  14136 (+-1084) us
      cuda torch.uint8 (16, 3, 400, 400)    |    130 (+-  1) us   |    129 (+-  1) us 
      cpu pil (3, 400, 400)                 |    902 (+- 70) us   |    885 (+-  6) us 
6 threads: ---------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    945 (+- 30) us   |    934 (+- 31) us 
      cpu torch.float32 (16, 3, 400, 400)   |   11308 (+-967) us  |   11291 (+-956) us
      cpu torch.uint8 (3, 400, 400)         |   1270 (+- 29) us   |   1230 (+- 36) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  14894 (+-1149) us  |  14507 (+-1140) us

Times are in microseconds (us).
Performance of V1 vs V2: 3.028% (improvement)

[------------------------------------- RandomRotation -------------------------------------]
                                            |           V1           |           V2         
1 threads: ---------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   17180 (+-4965) us    |   15761 (+-1568) us  
      cuda torch.float32 (3, 400, 400)      |     656 (+-  2) us     |     624 (+-  1) us   
      cpu torch.float32 (16, 3, 400, 400)   |  458672 (+-160941) us  |  430656 (+-28398) us 
      cuda torch.float32 (16, 3, 400, 400)  |    1581 (+- 41) us     |    1571 (+- 42) us   
      cpu torch.uint8 (3, 400, 400)         |   16548 (+-1619) us    |   16330 (+-1549) us  
      cuda torch.uint8 (3, 400, 400)        |     693 (+-  1) us     |     656 (+-  1) us   
      cpu torch.uint8 (16, 3, 400, 400)     |  477884 (+-173543) us  |  449887 (+-28933) us 
      cuda torch.uint8 (16, 3, 400, 400)    |    1737 (+- 47) us     |    1746 (+- 51) us   
      cpu pil (3, 400, 400)                 |     611 (+-  4) us     |     615 (+-  4) us   
6 threads: ---------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   16987 (+-1742) us    |   16902 (+-1664) us  
      cpu torch.float32 (16, 3, 400, 400)   |  464165 (+-160255) us  |  463919 (+-159642) us
      cpu torch.uint8 (3, 400, 400)         |   17776 (+-1622) us    |   17486 (+-1536) us  
      cpu torch.uint8 (16, 3, 400, 400)     |  481863 (+-176986) us  |  476903 (+-166256) us

Times are in microseconds (us).
Performance of V1 vs V2: 1.887% (improvement)

[------------------------------ RandomVerticalFlip ------------------------------]
                                            |         V1        |         V2      
1 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   184 (+-  1) us  |   192 (+-  1) us
      cuda torch.float32 (3, 400, 400)      |    23 (+-  0) us  |    27 (+-  0) us
      cpu torch.float32 (16, 3, 400, 400)   |  4606 (+- 37) us  |  4602 (+- 19) us
      cuda torch.float32 (16, 3, 400, 400)  |    52 (+-  0) us  |    52 (+-  0) us
      cpu torch.uint8 (3, 400, 400)         |    91 (+-  1) us  |    98 (+-  1) us
      cuda torch.uint8 (3, 400, 400)        |    23 (+-  0) us  |    26 (+-  0) us
      cpu torch.uint8 (16, 3, 400, 400)     |  1089 (+- 10) us  |  1097 (+-  9) us
      cuda torch.uint8 (16, 3, 400, 400)    |    24 (+-  0) us  |    27 (+-  1) us
      cpu pil (3, 400, 400)                 |    74 (+-  0) us  |    80 (+-  0) us
6 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   229 (+-  2) us  |   237 (+-  2) us
      cpu torch.float32 (16, 3, 400, 400)   |  4706 (+- 87) us  |  4707 (+- 37) us
      cpu torch.uint8 (3, 400, 400)         |   134 (+-  2) us  |   142 (+-  2) us
      cpu torch.uint8 (16, 3, 400, 400)     |  1152 (+- 19) us  |  1161 (+- 17) us

Times are in microseconds (us).
Performance of V1 vs V2: -6.617% (slowdown)

[------------------------------- ConvertImageDtype --------------------------------]
                                            |         V1         |         V2       
1 threads: -------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   134 (+-  1) us   |   132 (+-  1) us 
      cuda torch.float32 (3, 400, 400)      |    16 (+-  0) us   |    14 (+-  0) us 
      cpu torch.float32 (16, 3, 400, 400)   |  2519 (+- 20) us   |  2514 (+- 16) us 
      cuda torch.float32 (16, 3, 400, 400)  |    44 (+-  0) us   |    44 (+-  0) us 
      cpu torch.uint8 (3, 400, 400)         |  1053 (+-  5) us   |   981 (+-  6) us 
      cuda torch.uint8 (3, 400, 400)        |    25 (+-  0) us   |    22 (+-  0) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  16495 (+- 62) us  |  15196 (+- 62) us
      cuda torch.uint8 (16, 3, 400, 400)    |    52 (+-  0) us   |    40 (+-  0) us 
6 threads: -------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   180 (+-  2) us   |   177 (+-  3) us 
      cpu torch.float32 (16, 3, 400, 400)   |  2619 (+- 34) us   |  2617 (+- 39) us 
      cpu torch.uint8 (3, 400, 400)         |  1139 (+- 16) us   |  1062 (+- 14) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  16690 (+-252) us  |  15337 (+-256) us

Times are in microseconds (us).
Performance of V1 vs V2: 7.949% (improvement)

[------------------------------------- GaussianBlur -------------------------------------]
                                            |           V1          |           V2        
1 threads: -------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    3281 (+-260) us    |    3174 (+-258) us  
      cuda torch.float32 (3, 400, 400)      |     239 (+- 31) us    |     140 (+-  1) us  
      cpu torch.float32 (16, 3, 400, 400)   |  241303 (+-59097) us  |  241166 (+-58982) us
      cuda torch.float32 (16, 3, 400, 400)  |     305 (+-  1) us    |     221 (+-  0) us  
      cpu torch.uint8 (3, 400, 400)         |    3896 (+-239) us    |    3657 (+-246) us  
      cuda torch.uint8 (3, 400, 400)        |     257 (+-  2) us    |     171 (+-  1) us  
      cpu torch.uint8 (16, 3, 400, 400)     |   256446 (+-2638) us  |   254117 (+-796) us 
      cuda torch.uint8 (16, 3, 400, 400)    |     433 (+-  1) us    |     344 (+-  0) us  
      cpu pil (3, 400, 400)                 |    7085 (+-303) us    |    6921 (+-282) us  
6 threads: -------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    4452 (+-266) us    |    4315 (+-262) us  
      cpu torch.float32 (16, 3, 400, 400)   |   264282 (+-2007) us  |   264110 (+-2584) us
      cpu torch.uint8 (3, 400, 400)         |    5110 (+-257) us    |    4934 (+-258) us  
      cpu torch.uint8 (16, 3, 400, 400)     |   279173 (+-2179) us  |   276032 (+-3026) us

Times are in microseconds (us).
Performance of V1 vs V2: 13.555% (improvement)

[---------------------------------- Normalize -----------------------------------]
                                            |         V1        |         V2      
1 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   383 (+-  1) us  |   291 (+-  1) us
      cuda torch.float32 (3, 400, 400)      |   118 (+-  1) us  |    74 (+-  0) us
      cpu torch.float32 (16, 3, 400, 400)   |  6943 (+- 19) us  |  5478 (+- 55) us
      cuda torch.float32 (16, 3, 400, 400)  |   224 (+-  1) us  |   140 (+-  1) us
6 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   516 (+-  4) us  |   380 (+-  4) us
      cpu torch.float32 (16, 3, 400, 400)   |  7282 (+- 58) us  |  6006 (+- 49) us

Times are in microseconds (us).
Performance of V1 vs V2: 30.002% (improvement)

Functional Kernels

Generated using @pmeier's script. We compare V1 against V2 for every kernel across many configurations (batch size, dtype, device, number of threads, etc.) and then estimate the average performance improvement across all configurations to summarize the end result.
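For readers who want to sanity-check a summary line, here is a minimal sketch of one way to turn per-configuration timings into a single signed percentage. It assumes a plain arithmetic mean of the per-row relative change `(v1 - v2) / v1`; the actual script may weight or filter configurations differently, so this is not guaranteed to reproduce the figures reported here. The function names and the sample timings are hypothetical and are not taken from the tables.

```python
from statistics import mean


def relative_change(v1_us, v2_us):
    """Fraction of v1's time saved by v2; negative means v2 is slower."""
    return (v1_us - v2_us) / v1_us


def aggregate(pairs):
    """Assumed aggregation: unweighted mean over all (v1, v2) timing pairs."""
    return mean(relative_change(v1, v2) for v1, v2 in pairs)


def describe(change):
    """Format a fractional change as a signed percentage with a label."""
    label = "improvement" if change >= 0 else "slowdown"
    return f"{change * 100:+.1f}% ({label})"


# Hypothetical median timings in microseconds, one pair per configuration.
pairs = [(100, 50), (200, 220), (40, 40)]
print(describe(aggregate(pairs)))  # +13.3% (improvement)
```

Note that an unweighted mean lets a cheap configuration count as much as an expensive one, so a 2 us regression on a 25 us CUDA call can outweigh a small gain on a multi-millisecond CPU call.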

Detailed Benchmarks
[----------- adjust_brightness @ torchvision==0.15.0a0+1098dad ------------]
                                          |        v1       |        v2     
1 threads: -----------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   1303 (+-  5)  |   572 (+-  1) 
      (3, 400, 400) / uint8 / cuda        |    53 (+-  0)   |    25 (+-  0) 
      (3, 400, 400) / PIL                 |   814 (+-  2)   |   811 (+-  2) 
      (3, 400, 400) / float32 / cpu       |   830 (+-  4)   |   253 (+-  1) 
      (3, 400, 400) / float32 / cuda      |    44 (+-  0)   |    17 (+-  0) 
      (16, 3, 400, 400) / uint8 / cpu     |  31009 (+-836)  |  12549 (+- 61)
      (16, 3, 400, 400) / uint8 / cuda    |   261 (+-  0)   |   127 (+-  0) 
      (16, 3, 400, 400) / float32 / cpu   |  23236 (+-539)  |   5201 (+- 31)
      (16, 3, 400, 400) / float32 / cuda  |   241 (+-  0)   |    96 (+-  0) 
6 threads: -----------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   1650 (+- 26)  |   744 (+- 16) 
      (3, 400, 400) / float32 / cpu       |   1050 (+- 18)  |   339 (+-  4) 
      (16, 3, 400, 400) / uint8 / cpu     |  31815 (+-396)  |  12900 (+-247)
      (16, 3, 400, 400) / float32 / cpu   |  23572 (+-309)  |   5473 (+- 48)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +48.8% (improvement)
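If you want to post-process these dumps rather than read them by eye, the rows can be parsed with a small regular expression. This is only a sketch under the assumption that every data row has the shape `config | v1 (+- spread) | v2 (+- spread)`, optionally followed by `us`, as in the tables in this comment; `parse_row` is a hypothetical helper and not part of the benchmark script.

```python
import re

# One data row: "<config> | <v1> (+- <spread>) [us] | <v2> (+- <spread>) [us]"
ROW = re.compile(
    r"^\s*(?P<config>.+?)\s*\|"
    r"\s*(?P<v1>\d+)\s*\(\+-\s*\d+\)(?:\s*us)?\s*\|"
    r"\s*(?P<v2>\d+)\s*\(\+-\s*\d+\)(?:\s*us)?\s*$"
)


def parse_row(line):
    """Return (config, v1, v2) for a data row, or None for headers and rules."""
    match = ROW.match(line)
    if match is None:
        return None
    return match["config"], int(match["v1"]), int(match["v2"])


row = "      (3, 400, 400) / uint8 / cpu         |   1303 (+-  5)  |   572 (+-  1) "
print(parse_row(row))  # ('(3, 400, 400) / uint8 / cpu', 1303, 572)
```

Header lines, the `1 threads: ---` separators, and the summary lines do not match the pattern and come back as `None`, so they can be filtered out before any further arithmetic.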
[------------- adjust_contrast @ torchvision==0.15.0a0+1098dad -------------]
                                          |        v1        |        v2     
1 threads: ------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   1380 (+- 18)   |   954 (+-  4) 
      (3, 400, 400) / uint8 / cuda        |   134 (+-  2)    |    82 (+-  1) 
      (3, 400, 400) / PIL                 |   1081 (+-  8)   |   1077 (+-  5)
      (3, 400, 400) / float32 / cpu       |   905 (+- 12)    |   540 (+-  1) 
      (3, 400, 400) / float32 / cuda      |   107 (+-  2)    |    66 (+-  0) 
      (16, 3, 400, 400) / uint8 / cpu     |  35265 (+-122)   |  24302 (+-125)
      (16, 3, 400, 400) / uint8 / cuda    |   293 (+-  0)    |   242 (+-  0) 
      (16, 3, 400, 400) / float32 / cpu   |  23713 (+-109)   |  13330 (+- 86)
      (16, 3, 400, 400) / float32 / cuda  |   252 (+-  0)    |   197 (+-  0) 
6 threads: ------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   2046 (+- 30)   |   1501 (+- 20)
      (3, 400, 400) / float32 / cpu       |   1290 (+- 23)   |   841 (+- 20) 
      (16, 3, 400, 400) / uint8 / cpu     |  38422 (+-1500)  |  25921 (+-293)
      (16, 3, 400, 400) / float32 / cpu   |  23309 (+-428)   |  13884 (+-180)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +26.0% (improvement)
[--------------- adjust_gamma @ torchvision==0.15.0a0+1098dad ---------------]
                                          |        v1        |        v2      
1 threads: -------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   4640 (+- 12)   |   4362 (+-  8) 
      (3, 400, 400) / uint8 / cuda        |    81 (+-  1)    |    53 (+-  0)  
      (3, 400, 400) / PIL                 |   463 (+-  1)    |   457 (+-  1)  
      (3, 400, 400) / float32 / cpu       |   3789 (+- 17)   |   3641 (+-  8) 
      (3, 400, 400) / float32 / cuda      |    29 (+-  0)    |    21 (+-  0)  
      (16, 3, 400, 400) / uint8 / cpu     |  82220 (+-634)   |  72850 (+-394) 
      (16, 3, 400, 400) / uint8 / cuda    |   331 (+-  0)    |   312 (+-  0)  
      (16, 3, 400, 400) / float32 / cpu   |  63453 (+-496)   |  58586 (+-298) 
      (16, 3, 400, 400) / float32 / cuda  |   150 (+-  0)    |   142 (+-  0)  
6 threads: -------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   5042 (+- 47)   |   4751 (+- 29) 
      (3, 400, 400) / float32 / cpu       |   4003 (+- 46)   |   3866 (+- 36) 
      (16, 3, 400, 400) / uint8 / cpu     |  83791 (+-3026)  |  75086 (+-1903)
      (16, 3, 400, 400) / float32 / cpu   |  65180 (+-1774)  |  60190 (+-1816)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +10.4% (improvement)
[------------------ adjust_hue @ torchvision==0.15.0a0+1098dad ------------------]
                                          |         v1         |         v2       
1 threads: -----------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   21044 (+- 98)    |   15068 (+-119)  
      (3, 400, 400) / uint8 / cuda        |    904 (+- 47)     |    558 (+-  1)   
      (3, 400, 400) / PIL                 |   10260 (+- 54)    |   10253 (+- 59)  
      (3, 400, 400) / float32 / cpu       |   20317 (+-132)    |   14494 (+-171)  
      (3, 400, 400) / float32 / cuda      |    693 (+-  2)     |    510 (+-  2)   
      (16, 3, 400, 400) / uint8 / cpu     |  806291 (+-26060)  |  458049 (+-1384) 
      (16, 3, 400, 400) / uint8 / cuda    |    5805 (+-  1)    |    2331 (+-  0)  
      (16, 3, 400, 400) / float32 / cpu   |  769409 (+-9376)   |  442955 (+-5570) 
      (16, 3, 400, 400) / float32 / cuda  |    5633 (+-  1)    |    2170 (+-  3)  
6 threads: -----------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   24682 (+-213)    |   18549 (+-318)  
      (3, 400, 400) / float32 / cpu       |   23809 (+-217)    |   17842 (+-238)  
      (16, 3, 400, 400) / uint8 / cpu     |  799018 (+-14984)  |  467291 (+-6339) 
      (16, 3, 400, 400) / float32 / cpu   |  781586 (+-2532)   |  451900 (+-13095)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +32.7% (improvement)
[----------- adjust_saturation @ torchvision==0.15.0a0+1098dad ------------]
                                          |        v1       |        v2     
1 threads: -----------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   1462 (+-  7)  |   976 (+-  5) 
      (3, 400, 400) / uint8 / cuda        |   104 (+-  1)   |    65 (+-  0) 
      (3, 400, 400) / PIL                 |   932 (+- 43)   |   929 (+-  6) 
      (3, 400, 400) / float32 / cpu       |   986 (+-  3)   |   570 (+-  1) 
      (3, 400, 400) / float32 / cuda      |    86 (+-  0)   |    50 (+-  0) 
      (16, 3, 400, 400) / uint8 / cpu     |  35543 (+-128)  |  25018 (+- 88)
      (16, 3, 400, 400) / uint8 / cuda    |   287 (+-  0)   |   241 (+-  0) 
      (16, 3, 400, 400) / float32 / cpu   |  24104 (+-435)  |  13323 (+-223)
      (16, 3, 400, 400) / float32 / cuda  |   262 (+-  0)   |   199 (+-  0) 
6 threads: -----------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   2117 (+- 19)  |   1462 (+- 18)
      (3, 400, 400) / float32 / cpu       |   1344 (+- 19)  |   802 (+- 18) 
      (16, 3, 400, 400) / uint8 / cpu     |  38526 (+-345)  |  26558 (+-327)
      (16, 3, 400, 400) / float32 / cpu   |  25124 (+-212)  |  13919 (+-218)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +26.9% (improvement)
[-------------- adjust_sharpness @ torchvision==0.15.0a0+1098dad --------------]
                                          |         v1        |         v2      
1 threads: ---------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    5508 (+-167)   |    4232 (+- 36) 
      (3, 400, 400) / uint8 / cuda        |    205 (+-  1)    |    115 (+-  0)  
      (3, 400, 400) / PIL                 |    3532 (+- 11)   |    3523 (+-  8) 
      (3, 400, 400) / float32 / cpu       |    4955 (+- 34)   |    4040 (+- 39) 
      (3, 400, 400) / float32 / cuda      |    170 (+-  1)    |    104 (+-  0)  
      (16, 3, 400, 400) / uint8 / cpu     |  286173 (+-5670)  |   258815 (+-769)
      (16, 3, 400, 400) / uint8 / cuda    |    575 (+-  1)    |    455 (+-  0)  
      (16, 3, 400, 400) / float32 / cpu   |  270322 (+-7024)  |   252958 (+-710)
      (16, 3, 400, 400) / float32 / cuda  |    487 (+-  1)    |    380 (+-  3)  
6 threads: ---------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    6857 (+-193)   |    5214 (+- 62) 
      (3, 400, 400) / float32 / cpu       |    6000 (+- 47)   |    4875 (+- 58) 
      (16, 3, 400, 400) / uint8 / cpu     |  306676 (+-3053)  |  279421 (+-2652)
      (16, 3, 400, 400) / float32 / cpu   |  291542 (+-2580)  |  274144 (+-2823)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +18.4% (improvement)
[------------------- affine @ torchvision==0.15.0a0+1098dad -------------------]
                                          |         v1        |         v2      
1 threads: ---------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    2130 (+-821)   |    2021 (+-816) 
      (3, 400, 400) / uint8 / cuda        |    245 (+-  1)    |    221 (+-  2)  
      (3, 400, 400) / PIL                 |   3700 (+-1658)   |   3689 (+-1658) 
      (3, 400, 400) / float32 / cpu       |    1609 (+-795)   |    1590 (+-793) 
      (3, 400, 400) / float32 / cuda      |    212 (+-  1)    |    194 (+-  1)  
      (16, 3, 400, 400) / uint8 / cpu     |  70314 (+-11520)  |  68234 (+-11541)
      (16, 3, 400, 400) / uint8 / cuda    |    383 (+- 33)    |    373 (+- 34)  
      (16, 3, 400, 400) / float32 / cpu   |  58610 (+-11623)  |  29207 (+-14296)
      (16, 3, 400, 400) / float32 / cuda  |    256 (+- 34)    |    249 (+- 34)  
6 threads: ---------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    2455 (+-806)   |    2349 (+-808) 
      (3, 400, 400) / float32 / cpu       |    1802 (+-803)   |    1776 (+-804) 
      (16, 3, 400, 400) / uint8 / cpu     |  71781 (+-11712)  |  69805 (+-12170)
      (16, 3, 400, 400) / float32 / cpu   |  58848 (+-11868)  |  58908 (+-11957)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +3.4% (improvement)
[-------------- autocontrast @ torchvision==0.15.0a0+1098dad --------------]
                                          |        v1       |        v2     
1 threads: -----------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   1130 (+-  4)  |   767 (+-  2) 
      (3, 400, 400) / uint8 / cuda        |   163 (+-  1)   |    96 (+-  1) 
      (3, 400, 400) / PIL                 |   724 (+-  1)   |   720 (+-  2) 
      (3, 400, 400) / float32 / cpu       |   712 (+-  1)   |   532 (+-  1) 
      (3, 400, 400) / float32 / cuda      |   135 (+-  1)   |    81 (+-  0) 
      (16, 3, 400, 400) / uint8 / cpu     |  23736 (+-103)  |  13385 (+- 61)
      (16, 3, 400, 400) / uint8 / cuda    |   254 (+-  0)   |   270 (+-  0) 
      (16, 3, 400, 400) / float32 / cpu   |  18042 (+- 83)  |  13180 (+-131)
      (16, 3, 400, 400) / float32 / cuda  |   237 (+-  0)   |   221 (+-  0) 
6 threads: -----------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   1447 (+- 32)  |   1066 (+- 17)
      (3, 400, 400) / float32 / cpu       |   931 (+- 17)   |   745 (+- 14) 
      (16, 3, 400, 400) / uint8 / cpu     |  24379 (+-436)  |  13972 (+-231)
      (16, 3, 400, 400) / float32 / cpu   |  18798 (+-735)  |  13514 (+-260)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +20.9% (improvement)
[------------ center_crop @ torchvision==0.15.0a0+1098dad -------------]
                                          |       v1      |       v2    
1 threads: -------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   11 (+-  0)  |    5 (+-  0)
      (3, 400, 400) / uint8 / cuda        |   11 (+-  0)  |    5 (+-  0)
      (3, 400, 400) / PIL                 |   15 (+-  0)  |   11 (+-  0)
      (3, 400, 400) / float32 / cpu       |   11 (+-  0)  |    5 (+-  0)
      (3, 400, 400) / float32 / cuda      |   11 (+-  0)  |    5 (+-  0)
      (16, 3, 400, 400) / uint8 / cpu     |   11 (+-  0)  |    5 (+-  0)
      (16, 3, 400, 400) / uint8 / cuda    |   11 (+-  0)  |    6 (+-  0)
      (16, 3, 400, 400) / float32 / cpu   |   11 (+-  0)  |    5 (+-  0)
      (16, 3, 400, 400) / float32 / cuda  |   11 (+-  0)  |    6 (+-  0)
6 threads: -------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   11 (+-  0)  |    5 (+-  0)
      (3, 400, 400) / float32 / cpu       |   11 (+-  0)  |    5 (+-  0)
      (16, 3, 400, 400) / uint8 / cpu     |   11 (+-  0)  |    5 (+-  0)
      (16, 3, 400, 400) / float32 / cpu   |   11 (+-  0)  |    5 (+-  0)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +45.7% (improvement)
[---------- convert_color_space @ torchvision==0.15.0a0+1098dad ----------]
                                          |        v1       |       v2     
1 threads: ----------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   485 (+-  1)   |  310 (+-  1) 
      (3, 400, 400) / uint8 / cuda        |    58 (+-  0)   |   37 (+-  0) 
      (3, 400, 400) / PIL                 |   112 (+-  0)   |  111 (+-  1) 
      (3, 400, 400) / float32 / cpu       |   357 (+-  1)   |  165 (+-  0) 
      (3, 400, 400) / float32 / cuda      |    49 (+-  0)   |   29 (+-  0) 
      (16, 3, 400, 400) / uint8 / cpu     |   9109 (+- 57)  |  5401 (+- 10)
      (16, 3, 400, 400) / uint8 / cuda    |    83 (+-  1)   |   57 (+-  0) 
      (16, 3, 400, 400) / float32 / cpu   |   6209 (+- 12)  |  2763 (+-  9)
      (16, 3, 400, 400) / float32 / cuda  |    84 (+-  1)   |   44 (+-  0) 
6 threads: ----------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   854 (+- 19)   |  594 (+- 14) 
      (3, 400, 400) / float32 / cpu       |   562 (+-  4)   |  291 (+-  4) 
      (16, 3, 400, 400) / uint8 / cpu     |  10484 (+-233)  |  6001 (+- 39)
      (16, 3, 400, 400) / float32 / cpu   |   6770 (+-116)  |  3014 (+- 21)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +34.4% (improvement)
[----------- convert_dtype @ torchvision==0.15.0a0+1098dad ------------]
                                        |       v1       |       v2     
1 threads: -------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu       |  283 (+-  1)   |  189 (+-  1) 
      (3, 400, 400) / uint8 / cuda      |   24 (+-  0)   |   16 (+-  0) 
      (16, 3, 400, 400) / uint8 / cpu   |  6775 (+-124)  |  3609 (+- 11)
      (16, 3, 400, 400) / uint8 / cuda  |   90 (+-  1)   |   87 (+-  1) 
6 threads: -------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu       |  373 (+-  3)   |  274 (+-  3) 
      (16, 3, 400, 400) / uint8 / cpu   |  6845 (+-183)  |  3781 (+- 37)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +28.8% (improvement)
[---------------- crop @ torchvision==0.15.0a0+1098dad ----------------]
                                          |       v1      |       v2    
1 threads: -------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    6 (+-  0)  |    4 (+-  0)
      (3, 400, 400) / uint8 / cuda        |    7 (+-  0)  |    4 (+-  0)
      (3, 400, 400) / PIL                 |   11 (+-  0)  |   10 (+-  0)
      (3, 400, 400) / float32 / cpu       |    6 (+-  0)  |    4 (+-  0)
      (3, 400, 400) / float32 / cuda      |    7 (+-  0)  |    4 (+-  0)
      (16, 3, 400, 400) / uint8 / cpu     |    7 (+-  0)  |    4 (+-  0)
      (16, 3, 400, 400) / uint8 / cuda    |    7 (+-  0)  |    5 (+-  0)
      (16, 3, 400, 400) / float32 / cpu   |    6 (+-  0)  |    4 (+-  0)
      (16, 3, 400, 400) / float32 / cuda  |    7 (+-  0)  |    5 (+-  0)
6 threads: -------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    6 (+-  0)  |    4 (+-  0)
      (3, 400, 400) / float32 / cpu       |    6 (+-  0)  |    4 (+-  0)
      (16, 3, 400, 400) / uint8 / cpu     |    6 (+-  0)  |    4 (+-  0)
      (16, 3, 400, 400) / float32 / cpu   |    6 (+-  0)  |    4 (+-  0)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +29.2% (improvement)
[----------------- elastic @ torchvision==0.15.0a0+1098dad ------------------]
                                          |        v1        |        v2      
1 threads: -------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   4316 (+- 17)   |   4150 (+- 15) 
      (3, 400, 400) / uint8 / cuda        |   1004 (+-  5)   |   479 (+-  1)  
      (3, 400, 400) / PIL                 |   6821 (+- 16)   |   6664 (+- 18) 
      (3, 400, 400) / float32 / cpu       |   3823 (+- 15)   |   3722 (+- 13) 
      (3, 400, 400) / float32 / cuda      |   972 (+-  5)    |   455 (+-  1)  
      (16, 3, 400, 400) / uint8 / cpu     |  84065 (+-2585)  |  83337 (+-1763)
      (16, 3, 400, 400) / uint8 / cuda    |   1051 (+-  6)   |   493 (+-  1)  
      (16, 3, 400, 400) / float32 / cpu   |  73713 (+-268)   |  73115 (+-1222)
      (16, 3, 400, 400) / float32 / cuda  |   975 (+-  6)    |   448 (+-  1)  
6 threads: -------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   4590 (+- 29)   |   4426 (+- 30) 
      (3, 400, 400) / float32 / cpu       |   3975 (+- 38)   |   3880 (+- 42) 
      (16, 3, 400, 400) / uint8 / cpu     |  85571 (+-1098)  |  84945 (+-1901)
      (16, 3, 400, 400) / float32 / cpu   |  74562 (+-1859)  |  73956 (+-2809)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +22.5% (improvement)
[----------------- equalize @ torchvision==0.15.0a0+1098dad ----------------]
                                          |        v1        |        v2     
1 threads: ------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   2883 (+-  8)   |   2356 (+-  8)
      (3, 400, 400) / uint8 / cuda        |   904 (+-101)    |   239 (+-  1) 
      (3, 400, 400) / PIL                 |   731 (+-  1)    |   727 (+-  1) 
      (16, 3, 400, 400) / uint8 / cpu     |  46920 (+-246)   |  39804 (+-193)
      (16, 3, 400, 400) / uint8 / cuda    |  14259 (+-1271)  |   838 (+-  8) 
      (3, 400, 400) / float32 / cpu       |                  |   3001 (+- 12)
      (3, 400, 400) / float32 / cuda      |                  |   287 (+-  1) 
      (16, 3, 400, 400) / float32 / cpu   |                  |  53390 (+- 87)
      (16, 3, 400, 400) / float32 / cuda  |                  |   1010 (+- 23)
6 threads: ------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   3602 (+- 38)   |   2547 (+- 29)
      (16, 3, 400, 400) / uint8 / cpu     |  59143 (+-1811)  |  40616 (+-371)
      (3, 400, 400) / float32 / cpu       |                  |   3348 (+- 47)
      (16, 3, 400, 400) / float32 / cpu   |                  |  54473 (+-407)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +36.0% (improvement)
[------------- five_crop @ torchvision==0.15.0a0+1098dad --------------]
                                          |       v1      |       v2    
1 threads: -------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   39 (+-  0)  |   21 (+-  0)
      (3, 400, 400) / uint8 / cuda        |   41 (+-  0)  |   22 (+-  0)
      (3, 400, 400) / PIL                 |  104 (+-  1)  |   90 (+-  1)
      (3, 400, 400) / float32 / cpu       |   39 (+-  0)  |   21 (+-  0)
      (3, 400, 400) / float32 / cuda      |   41 (+-  0)  |   22 (+-  0)
      (16, 3, 400, 400) / uint8 / cpu     |   40 (+-  0)  |   21 (+-  0)
      (16, 3, 400, 400) / uint8 / cuda    |   41 (+-  0)  |   22 (+-  0)
      (16, 3, 400, 400) / float32 / cpu   |   40 (+-  0)  |   21 (+-  0)
      (16, 3, 400, 400) / float32 / cuda  |   41 (+-  0)  |   22 (+-  0)
6 threads: -------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   39 (+-  1)  |   21 (+-  0)
      (3, 400, 400) / float32 / cpu       |   39 (+-  0)  |   21 (+-  0)
      (16, 3, 400, 400) / uint8 / cpu     |   40 (+-  0)  |   21 (+-  0)
      (16, 3, 400, 400) / float32 / cpu   |   40 (+-  0)  |   21 (+-  0)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +40.2% (improvement)
[--------------- gaussian_blur @ torchvision==0.15.0a0+1098dad ----------------]
                                          |         v1        |         v2      
1 threads: ---------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    4256 (+-110)   |    4000 (+-  1) 
      (3, 400, 400) / uint8 / cuda        |    263 (+- 27)    |    139 (+-  1)  
      (3, 400, 400) / PIL                 |    7339 (+- 31)   |    7143 (+- 58) 
      (3, 400, 400) / float32 / cpu       |    3634 (+- 10)   |    3520 (+- 22) 
      (3, 400, 400) / float32 / cuda      |    211 (+-  2)    |    111 (+-  0)  
      (16, 3, 400, 400) / uint8 / cpu     |  261353 (+-2761)  |   258668 (+-930)
      (16, 3, 400, 400) / uint8 / cuda    |    701 (+- 17)    |    609 (+-  0)  
      (16, 3, 400, 400) / float32 / cpu   |  246807 (+-2297)  |   246708 (+-679)
      (16, 3, 400, 400) / float32 / cuda  |    575 (+-  1)    |    485 (+-  0)  
6 threads: ---------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    5466 (+- 52)   |    5246 (+- 31) 
      (3, 400, 400) / float32 / cpu       |    4730 (+- 40)   |    4597 (+- 38) 
      (16, 3, 400, 400) / uint8 / cpu     |  283245 (+-3426)  |  280457 (+-2995)
      (16, 3, 400, 400) / float32 / cpu   |  269702 (+-7677)  |  269556 (+-3773)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +13.7% (improvement)
[---------------- invert @ torchvision==0.15.0a0+1098dad ----------------]
                                          |       v1       |       v2     
1 threads: ---------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |  154 (+-  1)   |   19 (+-  0) 
      (3, 400, 400) / uint8 / cuda        |   13 (+-  0)   |    7 (+-  0) 
      (3, 400, 400) / PIL                 |  309 (+-  1)   |  306 (+-  1) 
      (3, 400, 400) / float32 / cpu       |  175 (+-  1)   |  166 (+-  1) 
      (3, 400, 400) / float32 / cuda      |   13 (+-  0)   |   10 (+-  0) 
      (16, 3, 400, 400) / uint8 / cpu     |  2269 (+-  6)  |  596 (+-  2) 
      (16, 3, 400, 400) / uint8 / cuda    |   14 (+-  0)   |   14 (+-  0) 
      (16, 3, 400, 400) / float32 / cpu   |  3982 (+-104)  |  4062 (+-107)
      (16, 3, 400, 400) / float32 / cuda  |   49 (+-  0)   |   49 (+-  0) 
6 threads: ---------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |  196 (+-  2)   |   58 (+-  0) 
      (3, 400, 400) / float32 / cpu       |  220 (+-  4)   |  210 (+-  3) 
      (16, 3, 400, 400) / uint8 / cpu     |  2330 (+- 28)  |  649 (+-  4) 
      (16, 3, 400, 400) / float32 / cpu   |  4142 (+- 80)  |  4001 (+- 42)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +23.3% (improvement)
[-------------- normalize @ torchvision==0.15.0a0+1098dad ---------------]
                                          |       v1       |       v2     
1 threads: ---------------------------------------------------------------
      (3, 400, 400) / float32 / cpu       |  376 (+-  1)   |  271 (+-  1) 
      (3, 400, 400) / float32 / cuda      |  116 (+-  0)   |   62 (+-  0) 
      (16, 3, 400, 400) / float32 / cpu   |  6698 (+- 15)  |  5207 (+- 19)
      (16, 3, 400, 400) / float32 / cuda  |  224 (+-  1)   |  139 (+-  0) 
6 threads: ---------------------------------------------------------------
      (3, 400, 400) / float32 / cpu       |  510 (+-  5)   |  359 (+-  4) 
      (16, 3, 400, 400) / float32 / cpu   |  7015 (+-227)  |  5805 (+- 43)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +33.6% (improvement)
[--------------- perspective @ torchvision==0.15.0a0+1098dad ----------------]
                                          |        v1        |        v2      
1 threads: -------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   4261 (+-148)   |   4000 (+-  8) 
      (3, 400, 400) / uint8 / cuda        |   502 (+- 49)    |   457 (+-  1)  
      (3, 400, 400) / PIL                 |   1819 (+- 28)   |   1809 (+-  7) 
      (3, 400, 400) / float32 / cpu       |   3763 (+- 11)   |   3644 (+-  7) 
      (3, 400, 400) / float32 / cuda      |   441 (+-  1)    |   426 (+-  1)  
      (16, 3, 400, 400) / uint8 / cpu     |  65562 (+-1493)  |  60287 (+-469) 
      (16, 3, 400, 400) / uint8 / cuda    |   475 (+-  1)    |   458 (+-  2)  
      (16, 3, 400, 400) / float32 / cpu   |  51973 (+-409)   |  51620 (+-473) 
      (16, 3, 400, 400) / float32 / cuda  |   438 (+-  1)    |   425 (+-  1)  
6 threads: -------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   4728 (+- 30)   |   4509 (+- 38) 
      (3, 400, 400) / float32 / cpu       |   4072 (+- 47)   |   4000 (+- 42) 
      (16, 3, 400, 400) / uint8 / cpu     |  67120 (+-1483)  |  61356 (+-1763)
      (16, 3, 400, 400) / float32 / cpu   |  51711 (+-546)   |  51035 (+-1607)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +3.7% (improvement)
[-------------- posterize @ torchvision==0.15.0a0+1098dad ---------------]
                                          |       v1       |       v2     
1 threads: ---------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |  116 (+-  1)   |  109 (+-  0) 
      (3, 400, 400) / uint8 / cuda        |   14 (+-  0)   |   10 (+-  0) 
      (3, 400, 400) / PIL                 |  318 (+-  1)   |  313 (+-  1) 
      (16, 3, 400, 400) / uint8 / cpu     |  1621 (+-  6)  |  1609 (+-  6)
      (16, 3, 400, 400) / uint8 / cuda    |   14 (+-  0)   |   14 (+-  0) 
      (3, 400, 400) / float32 / cpu       |                |  410 (+-  1) 
      (3, 400, 400) / float32 / cuda      |                |   26 (+-  0) 
      (16, 3, 400, 400) / float32 / cpu   |                |  8108 (+- 48)
      (16, 3, 400, 400) / float32 / cuda  |                |  178 (+-  0) 
6 threads: ---------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |  157 (+-  2)   |  148 (+-  3) 
      (16, 3, 400, 400) / uint8 / cpu     |  1706 (+- 24)  |  1673 (+- 25)
      (3, 400, 400) / float32 / cpu       |                |  579 (+-  4) 
      (16, 3, 400, 400) / float32 / cpu   |                |  8794 (+-202)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +6.9% (improvement)
[------------------- resize @ torchvision==0.15.0a0+1098dad -------------------]
                                          |         v1        |         v2      
1 threads: ---------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    1987 (+-706)   |    1945 (+-716) 
      (3, 400, 400) / uint8 / cuda        |     49 (+-  0)    |     43 (+-  0)  
      (3, 400, 400) / PIL                 |    1235 (+-427)   |    1228 (+-435) 
      (3, 400, 400) / float32 / cpu       |    1649 (+-744)   |    1638 (+-743) 
      (3, 400, 400) / float32 / cuda      |     21 (+-  0)    |     17 (+-  0)  
      (16, 3, 400, 400) / uint8 / cpu     |   9739 (+-9872)   |   8027 (+-11715)
      (16, 3, 400, 400) / uint8 / cuda    |     90 (+- 16)    |     89 (+- 16)  
      (16, 3, 400, 400) / float32 / cpu   |  26834 (+-12079)  |  26845 (+-12087)
      (16, 3, 400, 400) / float32 / cuda  |     23 (+- 13)    |     23 (+- 13)  
6 threads: ---------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    1527 (+-355)   |    1482 (+-359) 
      (3, 400, 400) / float32 / cpu       |    1073 (+-398)   |    1066 (+-403) 
      (16, 3, 400, 400) / uint8 / cpu     |   10123 (+-5967)  |   8402 (+-5907) 
      (16, 3, 400, 400) / float32 / cpu   |   16026 (+-6296)  |   16022 (+-6289)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +5.0% (improvement)
[----------------- resized_crop @ torchvision==0.15.0a0+1098dad -----------------]
                                          |         v1         |         v2       
1 threads: -----------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   1244 (+-1802)    |   1099 (+-1769)  
      (3, 400, 400) / uint8 / cuda        |     62 (+-  1)     |     52 (+-  0)   
      (3, 400, 400) / PIL                 |    2250 (+-979)    |    2236 (+-979)  
      (3, 400, 400) / float32 / cpu       |   2902 (+-2412)    |    531 (+-2408)  
      (3, 400, 400) / float32 / cuda      |     31 (+-  5)     |     24 (+-  5)   
      (16, 3, 400, 400) / uint8 / cpu     |  106292 (+-38242)  |  104513 (+-40684)
      (16, 3, 400, 400) / uint8 / cuda    |    218 (+- 25)     |    214 (+- 26)   
      (16, 3, 400, 400) / float32 / cpu   |   9466 (+-41424)   |   9384 (+-41517) 
      (16, 3, 400, 400) / float32 / cuda  |     81 (+- 16)     |     81 (+- 16)   
6 threads: -----------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   1431 (+-1432)    |   1283 (+-1406)  
      (3, 400, 400) / float32 / cpu       |   3951 (+-1671)    |   3945 (+-1684)  
      (16, 3, 400, 400) / uint8 / cpu     |  77433 (+-22907)   |  73118 (+-23566) 
      (16, 3, 400, 400) / float32 / cpu   |  61734 (+-26037)   |  61725 (+-26093) 

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +5.8% (improvement)
[------------------- rotate @ torchvision==0.15.0a0+1098dad -------------------]
                                          |         v1        |         v2      
1 threads: ---------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    2112 (+-854)   |    2026 (+-852) 
      (3, 400, 400) / uint8 / cuda        |    263 (+- 33)    |    217 (+-  3)  
      (3, 400, 400) / PIL                 |   3820 (+-1719)   |   3815 (+-1726) 
      (3, 400, 400) / float32 / cpu       |    1604 (+-815)   |    1588 (+-822) 
      (3, 400, 400) / float32 / cuda      |    204 (+-  1)    |    188 (+-  2)  
      (16, 3, 400, 400) / uint8 / cpu     |  81554 (+-16016)  |  78530 (+-15963)
      (16, 3, 400, 400) / uint8 / cuda    |    382 (+- 28)    |    371 (+- 28)  
      (16, 3, 400, 400) / float32 / cpu   |  66279 (+-14933)  |  66624 (+-15357)
      (16, 3, 400, 400) / float32 / cuda  |    255 (+- 28)    |    247 (+- 29)  
6 threads: ---------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    2434 (+-843)   |    2356 (+-828) 
      (3, 400, 400) / float32 / cpu       |    1791 (+-833)   |    1783 (+-834) 
      (16, 3, 400, 400) / uint8 / cpu     |  81143 (+-16182)  |  80019 (+-16450)
      (16, 3, 400, 400) / float32 / cpu   |  66174 (+-15374)  |  66373 (+-15422)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +3.6% (improvement)
[---------------- solarize @ torchvision==0.15.0a0+1098dad ----------------]
                                          |        v1       |        v2     
1 threads: -----------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   1088 (+-  5)  |   962 (+-  5) 
      (3, 400, 400) / uint8 / cuda        |    32 (+-  0)   |    23 (+-  0) 
      (3, 400, 400) / PIL                 |   316 (+-  1)   |   314 (+-  1) 
      (3, 400, 400) / float32 / cpu       |   2357 (+- 39)  |   2344 (+-  5)
      (3, 400, 400) / float32 / cuda      |    33 (+-  0)   |    27 (+-  0) 
      (16, 3, 400, 400) / uint8 / cpu     |  18286 (+- 67)  |  16803 (+- 67)
      (16, 3, 400, 400) / uint8 / cuda    |    59 (+-  0)   |    62 (+-  0) 
      (16, 3, 400, 400) / float32 / cpu   |  38606 (+- 85)  |  38526 (+-217)
      (16, 3, 400, 400) / float32 / cuda  |   157 (+-  0)   |   158 (+-  0) 
6 threads: -----------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   1247 (+- 22)  |   1119 (+- 18)
      (3, 400, 400) / float32 / cpu       |   2502 (+- 30)  |   2493 (+- 59)
      (16, 3, 400, 400) / uint8 / cpu     |  18600 (+-399)  |  17198 (+-256)
      (16, 3, 400, 400) / float32 / cpu   |  38938 (+-405)  |  38945 (+-370)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +6.2% (improvement)
[--------------- ten_crop @ torchvision==0.15.0a0+1098dad ---------------]
                                          |       v1       |       v2     
1 threads: ---------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |  381 (+-  1)   |  337 (+-  1) 
      (3, 400, 400) / uint8 / cuda        |   97 (+-  0)   |   56 (+-  0) 
      (3, 400, 400) / PIL                 |  346 (+-  1)   |  309 (+-  1) 
      (3, 400, 400) / float32 / cpu       |  318 (+-  2)   |  273 (+-  1) 
      (3, 400, 400) / float32 / cuda      |   98 (+-  1)   |   56 (+-  0) 
      (16, 3, 400, 400) / uint8 / cpu     |  4648 (+- 12)  |  4602 (+- 18)
      (16, 3, 400, 400) / uint8 / cuda    |   99 (+-  0)   |   57 (+-  0) 
      (16, 3, 400, 400) / float32 / cpu   |  5918 (+- 81)  |  5887 (+- 60)
      (16, 3, 400, 400) / float32 / cuda  |   98 (+-  1)   |   56 (+-  0) 
6 threads: ---------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |  426 (+-  5)   |  382 (+-  4) 
      (3, 400, 400) / float32 / cpu       |  367 (+-  4)   |  322 (+-  3) 
      (16, 3, 400, 400) / uint8 / cpu     |  4731 (+- 31)  |  4685 (+- 48)
      (16, 3, 400, 400) / float32 / cpu   |  5951 (+- 67)  |  5892 (+- 62)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +21.7% (improvement)
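
For anyone who wants to re-derive per-row numbers from the tables above, here is a minimal sketch that parses the printed rows and computes the relative time reduction of v2 vs. v1. Note that the exact formula behind the "Aggregated performance change" lines is not stated here, so `relative_improvement` and the plain mean below are assumptions for illustration and may not reproduce the reported aggregates. The helper names `parse_rows` and `relative_improvement` are hypothetical and not part of the benchmark script.

```python
from statistics import mean


def parse_rows(table: str):
    """Extract (label, v1_us, v2_us) from a printed comparison table.

    Rows without a v1 entry (e.g. float32 inputs that v1 does not support)
    are skipped, since no comparison is possible for them.
    """
    rows = []
    for line in table.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Data rows have exactly three cells and start with the input shape.
        if len(cells) != 3 or not cells[0].startswith("("):
            continue
        label, v1, v2 = cells
        if not v1 or not v2:
            continue
        # A cell looks like "116 (+-  1)"; the leading integer is the time in us.
        rows.append((label, int(v1.split()[0]), int(v2.split()[0])))
    return rows


def relative_improvement(v1_us: int, v2_us: int) -> float:
    """Fraction of v1's time saved by v2 (assumed metric, positive is faster)."""
    return (v1_us - v2_us) / v1_us


# A few rows copied from the posterize table above.
sample = """
                                          |       v1       |       v2
1 threads: ---------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |  116 (+-  1)   |  109 (+-  0)
      (3, 400, 400) / uint8 / cuda        |   14 (+-  0)   |   10 (+-  0)
      (3, 400, 400) / float32 / cpu       |                |  410 (+-  1)
"""

rows = parse_rows(sample)
for label, v1_us, v2_us in rows:
    print(f"{label}: {relative_improvement(v1_us, v2_us):+.1%}")

# Unweighted mean over the comparable rows (an assumption, see note above).
print(f"mean: {mean(relative_improvement(a, b) for _, a, b in rows):+.1%}")
```

Because the printed times are rounded to whole microseconds, results for the very fast kernels (e.g. `crop`, `center_crop`) are coarse and should be read as rough indications only.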
