Performance improvements for transforms v2 vs. v1 #6818

pmeier · 2022-10-24T09:47:51Z

In addition to a lot of other goodies that transforms v2 will bring, we are also actively working on improving the performance. This is a tracker / overview issue of our progress.

Performance was measured with this benchmark script. Unless noted otherwise, the performance improvements reported above were computed on uint8, RGB images and videos while running single-threaded on CPU. You can find the full benchmark results alongside the benchmark script. The results will be updated whenever a merged PR affects the kernels.

Kernels:

color
geometry
- affine Fix bug on prototype pad #6949
- center_crop [prototype] Optimize Center Crop performance #6880 Fix bug on prototype pad #6949
- crop Fix bug on prototype pad #6949
- elastic [prototype] Port elastic and minor cleanups #6942
- erase [prototype] Remove _FT aliases from functional #6983
- five_crop: Composite kernel Fix bug on prototype pad #6949
- pad Fix bug on prototype pad #6949
- perspective [proto] Small optim for perspective op on images #6907 Fix bug on prototype pad #6949
- resize [prototype] Clean up and port the resize kernel in V2 #6892
- resized_crop: Composite kernel [prototype] Clean up and port the resize kernel in V2 #6892 Fix bug on prototype pad #6949
- rotate Fix bug on prototype pad #6949
- ten_crop: Composite kernel Fix bug on prototype pad #6949
meta
- convert_color_space [proto] Speed up adjust color ops #6784 [prototype] Minor improvements on functional #6832
- convert_dtype improve perf on convert_image_dtype and add tests #6795 replace tensor division with scalar division and tensor multiplication #6903
  - There is still some performance gain left for int-to-int conversion. Currently, we are using a multiplication,
    but theoretically bit shifts are faster. However, in PyTorch core the CPU kernels for bit shifts are not
    vectorized, making them slower than a multiplication for regular-sized images. See "Vectorized CPU code implementing left shift operator." (pytorch#88607).
misc
- gaussian_blur [proto] Small optimization for gaussian_blur functional op #6762 [prototype] Gaussian Blur clean up #6888
- normalize [prototype] Speed improvement for normalize op #6821
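
The convert_dtype note above compares a multiplication with a bit shift for int-to-int conversion. The two are arithmetically equivalent when widening, because the scale factor is a power of two. The following is a minimal pure-Python sketch of that equivalence, not the torchvision kernel: the helper names and the choice of uint8 (8 value bits) to int32 (31 value bits) are assumptions made for illustration.

```python
# Value bits exclude the sign bit: uint8 has 8, int32 has 31 (assumed example dtypes).
IN_BITS = 8
OUT_BITS = 31


def widen_by_multiplication(value: int) -> int:
    """Scale a uint8 value into the int32 range with a multiplication."""
    factor = 2 ** (OUT_BITS - IN_BITS)
    return value * factor


def widen_by_shift(value: int) -> int:
    """Scale the same value with a left shift by the bit difference."""
    return value << (OUT_BITS - IN_BITS)


# Both formulations agree for every possible uint8 value.
results_match = all(
    widen_by_multiplication(v) == widen_by_shift(v) for v in range(256)
)
print(results_match)  # True
```

Which of the two is faster in practice depends on how the underlying tensor kernels are implemented, which is the point made in the note.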

Transform Classes:

MixUp/CutMix [prototype] Speed up Augment Transform Classes #6835
ColorJitter, RandomPhotometricDistort [prototype] Minor speed and nit optimizations on Transform Classes #6837

C++ (PyTorch core):

cc @vfdev-5 @datumbox @bjuncek


datumbox · 2022-10-24T14:48:12Z

Concerning elastic and all the affine transform kernels (affine, perspective, rotate), there are some very limited opportunities for optimization. Perhaps a couple of in-place ops in elastic_transform & _perspective_grid and a few optimizations in _apply_grid_transform (split of mask and img, bilinear fill estimation etc). Also some minor fixes related to the input assertion. @vfdev-5 would you be OK to assess on your side whether it makes sense to do these or leave the methods on _FT to avoid copy-pasting? Perhaps you have in mind other optimizations that I can't see that could affect performance?

Concerning crop, erase, pad, resize, horizontal_flip and vertical_flip, I don't see any further improvements other than the input assertions. It might be worth having a look on your side, @pmeier and @vfdev-5, in case you see something I don't.

pmeier · 2022-10-26T13:28:06Z

I did another deep dive into the ops in the second paragraph of #6818 (comment) and I'm fairly confident that there is little we can do to improve on our side. The only two things I found are

For padding modes "edge" and "reflect" we cast to float32 and back:

vision/torchvision/transforms/functional_tensor.py

Lines 426 to 431 in c84dbfa

    
    if (padding_mode != "constant") and img.dtype not in (torch.float32, torch.float64):
        # Here we temporary cast input tensor to float
        # until pytorch issue is resolved :
        # https://github.com/pytorch/pytorch/issues/40763
        need_cast = True
        img = img.to(torch.float32)

There is a long-standing issue on PyTorch core ("Enhance supported types of functional.pad", pytorch#40763) that reports this and is assigned to @vfdev-5.

We support "symmetric" padding in F.pad, but torch.nn.functional.pad doesn't. Thus, we have a custom implementation for it

vision/torchvision/transforms/functional_tensor.py

Line 330 in c84dbfa

def _pad_symmetric(img: Tensor, padding: List[int]) -> Tensor:

Since it is written in Python, a possible speed up would be to implement this padding mode in C++ on the PyTorch core side.
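
To make the difference between the padding modes concrete, here is a minimal sketch on a one-dimensional list. It is not the `_pad_symmetric` implementation referenced above. The helper names `pad_symmetric_1d` and `pad_reflect_1d` are hypothetical, and the sketch assumes the usual definitions: "symmetric" mirrors the data including the edge value, "reflect" mirrors it without repeating the edge value.

```python
from typing import List


def pad_symmetric_1d(values: List[int], left: int, right: int) -> List[int]:
    """Mirror the sequence across its borders, repeating the edge value."""
    n = len(values)
    if min(left, right) < 0 or max(left, right) > n:
        raise ValueError("padding must be between 0 and the input length")
    return values[:left][::-1] + values + values[n - right:][::-1]


def pad_reflect_1d(values: List[int], left: int, right: int) -> List[int]:
    """Mirror the sequence across its edge values without repeating them."""
    n = len(values)
    if min(left, right) < 0 or max(left, right) > n - 1:
        raise ValueError("padding must be between 0 and the input length minus one")
    return values[1:left + 1][::-1] + values + values[n - 1 - right:n - 1][::-1]


row = [1, 2, 3]
print(pad_symmetric_1d(row, 2, 2))  # [2, 1, 1, 2, 3, 3, 2]
print(pad_reflect_1d(row, 2, 2))    # [3, 2, 1, 2, 3, 2, 1]
```

A tensor version would apply the same indexing along the height and width dimensions; the slicing-and-concatenation structure is what makes a pure-Python implementation slower than a native kernel.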

Fixing this, we would get speed-ups for padding modes "edge", "reflect", and "symmetric" but not for the default and ubiquitous "constant" padding mode. Skimming the repository, it seems the only time we use non-"constant" padding is

vision/torchvision/transforms/functional_tensor.py

Line 762 in c84dbfa

img = torch_pad(img, padding, mode="reflect")

In there the image is guaranteed to be float and thus would not get any performance boost.

While I think both things mentioned above would be good to have in general, I don't think we should prioritize them.

vfdev-5 · 2022-10-28T15:16:23Z

> Concerning elastic and all the affine transform kernels (affine, perspective, rotate), there are some very limited opportunities for optimization. Perhaps a couple of in-place ops in elastic_transform & _perspective_grid and a few optimizations in _apply_grid_transform (split of mask and img, bilinear fill estimation etc). Also some minor fixes related to the input assertion. @vfdev-5 would you be OK to assess on your side whether it makes sense to do these or leave the methods on _FT to avoid copy-pasting? Perhaps you have in mind other optimizations that I can't see that could affect performance?

Checking various options with affine, there is no obvious way to improve runtime performance. Yes, we can make some of the "split of mask and img, bilinear fill estimation etc" in-place. There is also an open issue about incorrect behaviour of bilinear mode when a non-None fill is provided (#6517). Given that, I think we can keep this implementation.

vadimkantorov · 2022-10-28T15:52:13Z

About not vectorized bitwise shifts, is there an issue in core?

pmeier · 2022-10-31T19:59:29Z

> About not vectorized bitwise shifts, is there an issue in core?

I don't think so, but @alexsamardzic wanted to have a look at it.

Edit: pytorch/pytorch#88607

datumbox · 2022-11-10T11:34:43Z

@pmeier I'm keeping the list up-to-date with all linked PRs. I'm marking as [NEEDS RETEST]/[NEEDS TEST] any kernel that I touch to run further benchmarks and update the numbers.

vadimkantorov · 2022-11-10T12:20:29Z

An interesting question is whether a sequence of these transformations can be fused with Inductor/Dynamo (or something else?) to produce a fused low-memory-access CPU kernel (working with uint8 or fp32?), and how that connects with the randomness of whether to apply a transform or not.

datumbox · 2022-11-15T17:40:57Z

Speed Benchmarks V1 vs V2

Summary

The Transforms V2 API is faster than V1 (stable) because it introduces several optimizations on the Transform Classes and Functional kernels. Summarizing the performance gains in a single number should be taken with a grain of salt because:

- The performance heavily depends on the selected configuration (CPU vs CUDA device, Tensor vs PIL backend, uint8 vs float32 dtypes, number of threads etc). Though we included the most common configurations in our benchmarks, different setups might yield different results.
- The execution times of the different Transforms vary significantly (often by orders of magnitude). Though we report % differences, a simple unweighted average can't tell the full story.
- The training speed depends on a multitude of factors including the mix of augmentations, the size of the model etc. Though we use a commonly used SoTA recipe, the results can differ depending on whether we are IO/Memory/Compute bound.
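
The caveat about unweighted averages can be illustrated with a small sketch. It contrasts the mean of per-row percentage improvements with the improvement in total execution time. The function names are hypothetical, and neither formula is claimed to be the one used by the benchmark script; the sample rows are the four single-threaded CPU timings (in microseconds) from the RandomInvert table in this thread.

```python
from typing import List, Tuple

Row = Tuple[float, float]  # (V1 time, V2 time) in the same unit


def unweighted_mean_improvement(rows: List[Row]) -> float:
    """Average the per-row relative improvements; every row counts equally."""
    return sum((v1 - v2) / v1 * 100.0 for v1, v2 in rows) / len(rows)


def total_time_improvement(rows: List[Row]) -> float:
    """Compare summed execution times; slow rows dominate the result."""
    total_v1 = sum(v1 for v1, _ in rows)
    total_v2 = sum(v2 for _, v2 in rows)
    return (total_v1 - total_v2) / total_v1 * 100.0


# RandomInvert, 1 thread, CPU rows: float32 single, float32 batch, uint8 single, uint8 batch
random_invert_cpu = [(187.0, 195.0), (4103.0, 4096.0), (164.0, 50.0), (2282.0, 627.0)]

print(f"unweighted mean: {unweighted_mean_improvement(random_invert_cpu):.1f}%")
print(f"total time:      {total_time_improvement(random_invert_cpu):.1f}%")
```

The two summaries generally disagree because the first treats a 50 us row and a 4000 us row as equally important, while the second is dominated by the slowest rows.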

With the above in mind, here are some statistics that summarize the performance of the new API:

- Training: Using TorchVision's latest training recipe, we observe a significant 18% improvement in training time using the Tensor backend. The performance of the PIL backend remains the same.
- Transform Classes: The average improvement for the transform classes is about 8%. On the Tensor backend, float32 ops were improved on average by 9% and uint8 ops by 12%. On the PIL backend the performance remains the same.
- Functional Kernels: The average improvement for the functional kernels is about 21%. On the Tensor backend, CPU performance was improved by 23% and CUDA performance by 29%. On the PIL backend the performance remains the same.

To estimate the above aggregate statistics we used this script on top of the detailed benchmarks:

Aggregate Statistics

TRANSFORMS:
Overall execution time reduction: -8.37%
                    %
device dtype         
cpu    float32  -7.47
       pil      -0.10
       uint8   -11.61
cuda   float32  -8.43
       uint8   -13.47
----------------------------
DISPATCHERS:
Overall execution time reduction: -21.49%
                    %
device dtype         
cpu    float32 -21.31
       pil      -3.26
       uint8   -24.21
cuda   float32 -29.09
       uint8   -29.43
----------------------------
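
A grouped summary like the one above could be produced by averaging per-row percentage changes for each (device, dtype) pair. The sketch below uses only the standard library and is not the script referred to above: the row layout, the helper name `mean_change_by_group`, and the sign convention (negative means less execution time) are assumptions. The sample rows are taken from the single-threaded CenterCrop table in this thread, so the printed values describe that one transform only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (device, dtype, V1 time, V2 time); times in the same unit
BenchmarkRow = Tuple[str, str, float, float]


def mean_change_by_group(rows: List[BenchmarkRow]) -> Dict[Tuple[str, str], float]:
    """Average the relative change in execution time per (device, dtype)."""
    grouped = defaultdict(list)
    for device, dtype, v1, v2 in rows:
        grouped[(device, dtype)].append((v2 - v1) / v1 * 100.0)
    return {key: sum(changes) / len(changes) for key, changes in grouped.items()}


center_crop_rows = [
    ("cpu", "float32", 11.0, 9.0),
    ("cuda", "float32", 12.0, 10.0),
    ("cpu", "uint8", 11.0, 9.0),
    ("cuda", "uint8", 12.0, 10.0),
    ("cpu", "pil", 17.0, 16.0),
]

for (device, dtype), change in sorted(mean_change_by_group(center_crop_rows).items()):
    print(f"{device:<5} {dtype:<8} {change:7.2f}")
```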

Speed Benchmarks

For all benchmarks below we use PyTorch nightly 1.14.0.dev20221115, CUDA 11.6 and TorchVision main from ad128b7. The statistics were estimated on a p4d.24xlarge AWS instance with A100 GPUs. Since both V1 and V2 use the same PyTorch version, the speed improvements below don't include performance optimizations performed on the C++ kernels of Core.

Training

To assess the performance in real world applications, we trained a ResNet50 using TorchVision's SoTA recipe for a reduced number of 10 epochs across different setups:

PYTHONPATH=$PYTHONPATH:`pwd` python -u run_with_submitit.py --ngpus 8 --nodes 1 --model resnet50 --batch-size 128 --lr 0.5 --lr-scheduler cosineannealinglr --lr-warmup-epochs 5 --lr-warmup-method linear --auto-augment ta_wide --epochs 10 --random-erase 0.1 --label-smoothing 0.1 --mixup-alpha 0.2 --cutmix-alpha 1.0 --weight-decay 0.00002 --norm-weight-decay 0.0 --train-crop-size 176 --model-ema --val-resize-size 232 --ra-sampler --ra-reps 4 --data-path /datasets01/imagenet_full_size/061417/

Detailed Benchmarks

V1 using ad128b7 of main branch (PIL):

Submitted job_id: 77904
Epoch: [0] Total time: 0:03:07
Epoch: [1] Total time: 0:03:04
Epoch: [2] Total time: 0:03:03
Epoch: [3] Total time: 0:03:03
Epoch: [4] Total time: 0:03:02
Epoch: [5] Total time: 0:03:03
Epoch: [6] Total time: 0:03:03
Epoch: [7] Total time: 0:03:02
Epoch: [8] Total time: 0:03:00
Epoch: [9] Total time: 0:03:05

V1 using 46bd6d9 of #6952 (Tensor uint8):

Submitted job_id: 77827
Epoch: [0] Total time: 0:03:43
Epoch: [1] Total time: 0:04:05
Epoch: [2] Total time: 0:03:59
Epoch: [3] Total time: 0:04:24
Epoch: [4] Total time: 0:04:39
Epoch: [5] Total time: 0:04:42
Epoch: [6] Total time: 0:04:46
Epoch: [7] Total time: 0:04:42
Epoch: [8] Total time: 0:03:40
Epoch: [9] Total time: 0:03:32

V2 using 8b53036 of #6433 (PIL). Marginal median improvement of 1.64%:

Submitted job_id: 77905
Epoch: [0] Total time: 0:03:09
Epoch: [1] Total time: 0:03:02
Epoch: [2] Total time: 0:03:00
Epoch: [3] Total time: 0:03:00
Epoch: [4] Total time: 0:03:00
Epoch: [5] Total time: 0:02:59
Epoch: [6] Total time: 0:03:00
Epoch: [7] Total time: 0:03:00
Epoch: [8] Total time: 0:03:00
Epoch: [9] Total time: 0:03:00

V2 using bda072d of #6433 (Tensor uint8). Median improvement of 18.27%:

Submitted job_id: 77901
Epoch: [0] Total time: 0:03:52
Epoch: [1] Total time: 0:03:36
Epoch: [2] Total time: 0:03:35
Epoch: [3] Total time: 0:03:31
Epoch: [4] Total time: 0:03:28
Epoch: [5] Total time: 0:03:28
Epoch: [6] Total time: 0:03:28
Epoch: [7] Total time: 0:03:26
Epoch: [8] Total time: 0:03:27
Epoch: [9] Total time: 0:03:25
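
The median improvement can be recomputed from the epoch logs. The sketch below parses the "Total time" stamps of the two Tensor uint8 runs listed above and compares their medians. Whether the reported figure was originally computed exactly this way is an assumption, and the helper names are hypothetical, but this formula does reproduce the 18.27% quoted for the Tensor uint8 run.

```python
from statistics import median
from typing import List


def to_seconds(stamp: str) -> int:
    """Convert an 'H:MM:SS' stamp from the training log into seconds."""
    hours, minutes, seconds = (int(part) for part in stamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def median_improvement(v1_times: List[str], v2_times: List[str]) -> float:
    """Relative reduction of the median epoch time, in percent."""
    v1_median = median(to_seconds(t) for t in v1_times)
    v2_median = median(to_seconds(t) for t in v2_times)
    return (v1_median - v2_median) / v1_median * 100.0


# Epoch times copied from the V1 and V2 Tensor uint8 logs above.
V1_TENSOR_UINT8 = ["0:03:43", "0:04:05", "0:03:59", "0:04:24", "0:04:39",
                   "0:04:42", "0:04:46", "0:04:42", "0:03:40", "0:03:32"]
V2_TENSOR_UINT8 = ["0:03:52", "0:03:36", "0:03:35", "0:03:31", "0:03:28",
                   "0:03:28", "0:03:28", "0:03:26", "0:03:27", "0:03:25"]

print(f"{median_improvement(V1_TENSOR_UINT8, V2_TENSOR_UINT8):.2f}%")  # 18.27%
```

Using the median rather than the mean makes the comparison less sensitive to the fluctuating epochs in the V1 Tensor run.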

V2 using 8f07159 of #6433 (Tensor float32). Note that this configuration wasn't supported in V1 because not all kernels and augmentations supported floats:

Submitted job_id: 77902
Epoch: [0] Total time: 0:04:25
Epoch: [1] Total time: 0:04:13
Epoch: [2] Total time: 0:04:12
Epoch: [3] Total time: 0:04:12
Epoch: [4] Total time: 0:04:13
Epoch: [5] Total time: 0:04:10
Epoch: [6] Total time: 0:04:11
Epoch: [7] Total time: 0:04:11
Epoch: [8] Total time: 0:04:12
Epoch: [9] Total time: 0:04:11

Transform Classes

Generated using the following script, inspired by earlier iterations from @vfdev-5 and amended by @pmeier. We compare V1 against V2 for all kernels across many configurations (batch size, dtype, device, number of threads etc) and then estimate the average performance improvement across all configurations to summarize the end result.

Detailed Benchmarks

[-------------------------------- RandomErasing ---------------------------------]
                                            |         V1        |         V2      
1 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   359 (+- 92) us  |   333 (+-  2) us
      cuda torch.float32 (3, 400, 400)      |   322 (+-  2) us  |   331 (+-  3) us
      cpu torch.float32 (16, 3, 400, 400)   |  4995 (+- 80) us  |  4978 (+- 54) us
      cuda torch.float32 (16, 3, 400, 400)  |  2144 (+-102) us  |  2135 (+-102) us
      cpu torch.uint8 (3, 400, 400)         |   219 (+-  1) us  |   226 (+-  2) us
      cuda torch.uint8 (3, 400, 400)        |   227 (+-  2) us  |   236 (+-  2) us
      cpu torch.uint8 (16, 3, 400, 400)     |  1787 (+- 44) us  |  1789 (+- 42) us
      cuda torch.uint8 (16, 3, 400, 400)    |  1313 (+- 55) us  |  1316 (+- 56) us
6 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   410 (+-  4) us  |   418 (+-  3) us
      cpu torch.float32 (16, 3, 400, 400)   |  5191 (+- 78) us  |  5225 (+- 61) us
      cpu torch.uint8 (3, 400, 400)         |   302 (+-  3) us  |   310 (+-  4) us
      cpu torch.uint8 (16, 3, 400, 400)     |  1973 (+- 40) us  |  1977 (+- 49) us

Times are in microseconds (us).
Performance of V1 vs V2: -1.228% (slowdown)

[---------------------------------- AugMix ----------------------------------]
                                          |        V1        |        V2      
1 threads: -------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |   22 (+-  7) ms  |   19 (+-  2) ms
      cuda torch.uint8 (3, 400, 400)      |    2 (+-  1) ms  |    2 (+-  0) ms
      cpu torch.uint8 (16, 3, 400, 400)   |  736 (+-262) ms  |  738 (+-234) ms
      cuda torch.uint8 (16, 3, 400, 400)  |   10 (+-  3) ms  |    3 (+-  0) ms
      cpu pil (3, 400, 400)               |   25 (+-  3) ms  |   23 (+-  2) ms
6 threads: -------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |   27 (+-  3) ms  |   23 (+-  3) ms
      cpu torch.uint8 (16, 3, 400, 400)   |  803 (+-271) ms  |  735 (+-240) ms

Times are in milliseconds (ms).
Performance of V1 vs V2: 21.496% (improvement)

[------------------------------------ AutoAugment ------------------------------------]
                                          |           V1          |          V2        
1 threads: ----------------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |    3478 (+-253) us    |   2952 (+-251) us  
      cuda torch.uint8 (3, 400, 400)      |     746 (+- 27) us    |    317 (+-  6) us  
      cpu torch.uint8 (16, 3, 400, 400)   |  103178 (+-19894) us  |  87614 (+-27733) us
      cuda torch.uint8 (16, 3, 400, 400)  |    6868 (+-671) us    |    635 (+- 18) us  
      cpu pil (3, 400, 400)               |    1194 (+-133) us    |   1153 (+- 31) us  
6 threads: ----------------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |    4128 (+-269) us    |   3366 (+-278) us  
      cpu torch.uint8 (16, 3, 400, 400)   |   72148 (+-94797) us  |  89567 (+-30107) us

Times are in microseconds (us).
Performance of V1 vs V2: 30.867% (improvement)

[------------------------------------- RandAugment --------------------------------------]
                                          |           V1           |           V2         
1 threads: -------------------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |    6604 (+-1089) us    |    5928 (+-275) us   
      cuda torch.uint8 (3, 400, 400)      |     798 (+- 14) us     |     574 (+- 10) us   
      cpu torch.uint8 (16, 3, 400, 400)   |  172182 (+-119305) us  |  162579 (+-110068) us
      cuda torch.uint8 (16, 3, 400, 400)  |    2982 (+-580) us     |     945 (+- 47) us   
      cpu pil (3, 400, 400)               |    2036 (+-149) us     |    1933 (+-147) us   
6 threads: -------------------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |    7738 (+-1201) us    |    6920 (+-1190) us  
      cpu torch.uint8 (16, 3, 400, 400)   |  180085 (+-119892) us  |  163626 (+-115677) us

Times are in microseconds (us).
Performance of V1 vs V2: 20.997% (improvement)

[--------------------------------- TrivialAugmentWide ---------------------------------]
                                          |           V1           |          V2        
1 threads: -----------------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |    3387 (+-329) us     |   3081 (+-321) us  
      cuda torch.uint8 (3, 400, 400)      |     451 (+- 13) us     |    297 (+-  8) us  
      cpu torch.uint8 (16, 3, 400, 400)   |  101788 (+-91224) us   |  89224 (+-87124) us
      cuda torch.uint8 (16, 3, 400, 400)  |    1578 (+-373) us     |    501 (+- 19) us  
      cpu pil (3, 400, 400)               |    1133 (+-137) us     |   1062 (+-138) us  
6 threads: -----------------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |    4069 (+-355) us     |   3618 (+-361) us  
      cpu torch.uint8 (16, 3, 400, 400)   |  102527 (+-100556) us  |  91264 (+-90662) us

Times are in microseconds (us).
Performance of V1 vs V2: 22.838% (improvement)

[------------------------------------- ColorJitter --------------------------------------]
                                            |           V1          |           V2        
1 threads: -------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    24640 (+-766) us   |    16808 (+-187) us 
      cuda torch.float32 (3, 400, 400)      |    1071 (+- 36) us    |     791 (+- 33) us  
      cpu torch.float32 (16, 3, 400, 400)   |  899045 (+-18215) us  |  495452 (+-23080) us
      cuda torch.float32 (16, 3, 400, 400)  |    6444 (+-  6) us    |    2648 (+-  1) us  
      cpu torch.uint8 (3, 400, 400)         |    26271 (+-237) us   |    18410 (+-126) us 
      cuda torch.uint8 (3, 400, 400)        |    1200 (+-  9) us    |     887 (+-  5) us  
      cpu torch.uint8 (16, 3, 400, 400)     |  938875 (+-13454) us  |  534734 (+-12761) us
      cuda torch.uint8 (16, 3, 400, 400)    |    6657 (+-  1) us    |    2942 (+-  0) us  
      cpu pil (3, 400, 400)                 |    14835 (+-410) us   |    14801 (+-402) us 
6 threads: -------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    29098 (+-352) us   |    20871 (+-425) us 
      cpu torch.float32 (16, 3, 400, 400)   |  914067 (+-20531) us  |  528114 (+-15384) us
      cpu torch.uint8 (3, 400, 400)         |    31858 (+-314) us   |    23345 (+-330) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  946323 (+-17617) us  |  523300 (+-14203) us

Times are in microseconds (us).
Performance of V1 vs V2: 31.440% (improvement)

[-------------------------------- RandomAdjustSharpness ---------------------------------]
                                            |           V1          |           V2        
1 threads: -------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    5130 (+- 24) us    |    4030 (+- 74) us  
      cuda torch.float32 (3, 400, 400)      |     187 (+-  1) us    |     147 (+-  1) us  
      cpu torch.float32 (16, 3, 400, 400)   |   202595 (+-768) us   |   185755 (+-6337) us
      cuda torch.float32 (16, 3, 400, 400)  |     489 (+-  1) us    |     382 (+-  1) us  
      cpu torch.uint8 (3, 400, 400)         |    5564 (+- 39) us    |    4288 (+- 19) us  
      cuda torch.uint8 (3, 400, 400)        |     222 (+-  1) us    |     157 (+-  1) us  
      cpu torch.uint8 (16, 3, 400, 400)     |   217870 (+-6078) us  |   191308 (+-4504) us
      cuda torch.uint8 (16, 3, 400, 400)    |     578 (+-  1) us    |     458 (+-  1) us  
      cpu pil (3, 400, 400)                 |    3561 (+- 16) us    |    3581 (+- 12) us  
6 threads: -------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    6139 (+- 47) us    |    4912 (+- 44) us  
      cpu torch.float32 (16, 3, 400, 400)   |   220111 (+-8890) us  |   201278 (+-1894) us
      cpu torch.uint8 (3, 400, 400)         |    6848 (+- 41) us    |    5268 (+- 52) us  
      cpu torch.uint8 (16, 3, 400, 400)     |  235867 (+-27195) us  |  207550 (+-20399) us

Times are in microseconds (us).
Performance of V1 vs V2: 15.608% (improvement)

[------------------------------- RandomAutocontrast -------------------------------]
                                            |         V1         |         V2       
1 threads: -------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   721 (+-  1) us   |   572 (+-  3) us 
      cuda torch.float32 (3, 400, 400)      |   177 (+- 20) us   |   117 (+-  1) us 
      cpu torch.float32 (16, 3, 400, 400)   |  18869 (+-343) us  |  14033 (+- 95) us
      cuda torch.float32 (16, 3, 400, 400)  |   239 (+-  0) us   |   222 (+-  0) us 
      cpu torch.uint8 (3, 400, 400)         |  1144 (+-  8) us   |   809 (+-  5) us 
      cuda torch.uint8 (3, 400, 400)        |   177 (+-  1) us   |   132 (+-  1) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  24274 (+-155) us  |  13676 (+-130) us
      cuda torch.uint8 (16, 3, 400, 400)    |   256 (+-  5) us   |   273 (+-  0) us 
      cpu pil (3, 400, 400)                 |   747 (+-  2) us   |   767 (+-  1) us 
6 threads: -------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   943 (+- 19) us   |   791 (+- 23) us 
      cpu torch.float32 (16, 3, 400, 400)   |  19014 (+-248) us  |  14404 (+-359) us
      cpu torch.uint8 (3, 400, 400)         |  1460 (+- 15) us   |  1112 (+- 32) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  25074 (+-235) us  |  14291 (+-235) us

Times are in microseconds (us).
Performance of V1 vs V2: 17.171% (improvement)

[--------------------------------- RandomEqualize --------------------------------]
                                          |          V1         |         V2       
1 threads: ------------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |   2913 (+- 12) us   |  2411 (+- 11) us 
      cuda torch.uint8 (3, 400, 400)      |    978 (+-306) us   |   288 (+-  1) us 
      cpu torch.uint8 (16, 3, 400, 400)   |   47271 (+-185) us  |  40238 (+-157) us
      cuda torch.uint8 (16, 3, 400, 400)  |  14421 (+-1185) us  |   826 (+-  1) us 
      cpu pil (3, 400, 400)               |    756 (+-  2) us   |   776 (+-  1) us 
6 threads: ------------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |   3649 (+- 38) us   |  2615 (+- 28) us 
      cpu torch.uint8 (16, 3, 400, 400)   |  59636 (+-1869) us  |  40607 (+-454) us

Times are in microseconds (us).
Performance of V1 vs V2: 34.185% (improvement)

[--------------------------------- RandomInvert ---------------------------------]
                                            |         V1        |         V2      
1 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   187 (+-  1) us  |   195 (+-  1) us
      cuda torch.float32 (3, 400, 400)      |    20 (+-  0) us  |    28 (+-  0) us
      cpu torch.float32 (16, 3, 400, 400)   |  4103 (+- 33) us  |  4096 (+- 25) us
      cuda torch.float32 (16, 3, 400, 400)  |    49 (+-  0) us  |    49 (+-  0) us
      cpu torch.uint8 (3, 400, 400)         |   164 (+-  1) us  |    50 (+-  0) us
      cuda torch.uint8 (3, 400, 400)        |    20 (+-  0) us  |    25 (+-  0) us
      cpu torch.uint8 (16, 3, 400, 400)     |  2282 (+- 19) us  |   627 (+-  1) us
      cuda torch.uint8 (16, 3, 400, 400)    |    20 (+-  0) us  |    25 (+-  0) us
      cpu pil (3, 400, 400)                 |   327 (+-  1) us  |   346 (+-  1) us
6 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   234 (+-  3) us  |   242 (+-  3) us
      cpu torch.float32 (16, 3, 400, 400)   |  4412 (+- 56) us  |  4392 (+- 34) us
      cpu torch.uint8 (3, 400, 400)         |   208 (+-  3) us  |    95 (+-  2) us
      cpu torch.uint8 (16, 3, 400, 400)     |  2352 (+- 31) us  |   684 (+-  6) us

Times are in microseconds (us).
Performance of V1 vs V2: 3.451% (improvement)

[------------------------------ RandomPosterize -------------------------------]
                                          |         V1        |         V2      
1 threads: ---------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |   127 (+-  1) us  |   136 (+-  1) us
      cuda torch.uint8 (3, 400, 400)      |    20 (+-  0) us  |    28 (+-  0) us
      cpu torch.uint8 (16, 3, 400, 400)   |  1636 (+-  7) us  |  1642 (+- 20) us
      cuda torch.uint8 (16, 3, 400, 400)  |    20 (+-  0) us  |    28 (+-  0) us
      cpu pil (3, 400, 400)               |   334 (+-  1) us  |   354 (+-  2) us
6 threads: ---------------------------------------------------------------------
      cpu torch.uint8 (3, 400, 400)       |   169 (+-  2) us  |   178 (+-  2) us
      cpu torch.uint8 (16, 3, 400, 400)   |  1700 (+- 17) us  |  1708 (+- 26) us

Times are in microseconds (us).
Performance of V1 vs V2: -16.203% (slowdown)

[--------------------------------- RandomSolarize ---------------------------------]
                                            |         V1         |         V2       
1 threads: -------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   957 (+-  6) us   |   961 (+-  6) us 
      cuda torch.float32 (3, 400, 400)      |    41 (+-  0) us   |    50 (+-  0) us 
      cpu torch.float32 (16, 3, 400, 400)   |  17249 (+- 98) us  |  17450 (+-231) us
      cuda torch.float32 (16, 3, 400, 400)  |   157 (+-  0) us   |   159 (+-  0) us 
      cpu torch.uint8 (3, 400, 400)         |  1081 (+-  7) us   |   976 (+-  8) us 
      cuda torch.uint8 (3, 400, 400)        |    40 (+-  0) us   |    44 (+-  0) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  18245 (+-111) us  |  16818 (+-108) us
      cuda torch.uint8 (16, 3, 400, 400)    |    60 (+-  0) us   |    62 (+-  0) us 
      cpu pil (3, 400, 400)                 |   333 (+-  1) us   |   353 (+-  1) us 
6 threads: -------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |  1104 (+- 20) us   |  1107 (+- 20) us 
      cpu torch.float32 (16, 3, 400, 400)   |  17576 (+-205) us  |  17469 (+-322) us
      cpu torch.uint8 (3, 400, 400)         |  1249 (+- 20) us   |  1139 (+- 82) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  18673 (+-280) us  |  17263 (+-308) us

Times are in microseconds (us).
Performance of V1 vs V2: -3.361% (slowdown)

[--------------------------------- CenterCrop ---------------------------------]
                                            |        V1        |        V2      
1 threads: ---------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   11 (+-  0) us  |    9 (+-  0) us
      cuda torch.float32 (3, 400, 400)      |   12 (+-  0) us  |   10 (+-  0) us
      cpu torch.float32 (16, 3, 400, 400)   |   12 (+-  0) us  |   10 (+-  0) us
      cuda torch.float32 (16, 3, 400, 400)  |   12 (+-  0) us  |   10 (+-  0) us
      cpu torch.uint8 (3, 400, 400)         |   11 (+-  0) us  |    9 (+-  0) us
      cuda torch.uint8 (3, 400, 400)        |   12 (+-  0) us  |   10 (+-  0) us
      cpu torch.uint8 (16, 3, 400, 400)     |   11 (+-  0) us  |    9 (+-  0) us
      cuda torch.uint8 (16, 3, 400, 400)    |   12 (+-  0) us  |   10 (+-  0) us
      cpu pil (3, 400, 400)                 |   17 (+-  0) us  |   16 (+-  0) us
6 threads: ---------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   11 (+-  0) us  |    9 (+-  0) us
      cpu torch.float32 (16, 3, 400, 400)   |   12 (+-  0) us  |    9 (+-  0) us
      cpu torch.uint8 (3, 400, 400)         |   11 (+-  0) us  |    9 (+-  0) us
      cpu torch.uint8 (16, 3, 400, 400)     |   12 (+-  0) us  |    9 (+-  0) us

Times are in microseconds (us).
Performance of V1 vs V2: 15.328% (improvement)

[------------------------------ ElasticTransform ------------------------------]
                                            |        V1        |        V2      
1 threads: ---------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |  100 (+-  1) ms  |  100 (+-  1) ms
      cuda torch.float32 (3, 400, 400)      |   96 (+-  1) ms  |   96 (+-  1) ms
      cpu torch.float32 (16, 3, 400, 400)   |  181 (+-  4) ms  |  166 (+-  2) ms
      cuda torch.float32 (16, 3, 400, 400)  |   97 (+-  1) ms  |   96 (+-  1) ms
      cpu torch.uint8 (3, 400, 400)         |  101 (+-  1) ms  |  100 (+-  1) ms
      cuda torch.uint8 (3, 400, 400)        |   96 (+-  1) ms  |   96 (+-  1) ms
      cpu torch.uint8 (16, 3, 400, 400)     |  193 (+-  5) ms  |  176 (+-  2) ms
      cuda torch.uint8 (16, 3, 400, 400)    |   97 (+-  2) ms  |   96 (+-  1) ms
      cpu pil (3, 400, 400)                 |  104 (+-  1) ms  |  103 (+-  1) ms
6 threads: ---------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |  103 (+-  2) ms  |  101 (+-  2) ms
      cpu torch.float32 (16, 3, 400, 400)   |  184 (+-  2) ms  |  170 (+-  3) ms
      cpu torch.uint8 (3, 400, 400)         |  103 (+-  1) ms  |  102 (+-  1) ms
      cpu torch.uint8 (16, 3, 400, 400)     |  197 (+-  2) ms  |  181 (+-  2) ms

Times are in milliseconds (ms).
Performance of V1 vs V2: 2.308% (improvement)

[---------------------------------- FiveCrop ----------------------------------]
                                            |        V1        |        V2      
1 threads: ---------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   40 (+-  0) us  |   27 (+-  0) us
      cuda torch.float32 (3, 400, 400)      |   42 (+-  0) us  |   29 (+-  0) us
      cpu torch.float32 (16, 3, 400, 400)   |   40 (+-  0) us  |   28 (+-  0) us
      cuda torch.float32 (16, 3, 400, 400)  |   42 (+-  0) us  |   29 (+-  0) us
      cpu torch.uint8 (3, 400, 400)         |   40 (+-  0) us  |   27 (+-  0) us
      cuda torch.uint8 (3, 400, 400)        |   41 (+-  0) us  |   28 (+-  0) us
      cpu torch.uint8 (16, 3, 400, 400)     |   40 (+-  0) us  |   27 (+-  0) us
      cuda torch.uint8 (16, 3, 400, 400)    |   42 (+-  0) us  |   29 (+-  0) us
      cpu pil (3, 400, 400)                 |  111 (+-  1) us  |  106 (+-  0) us
6 threads: ---------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   40 (+-  0) us  |   28 (+-  0) us
      cpu torch.float32 (16, 3, 400, 400)   |   40 (+-  0) us  |   28 (+-  0) us
      cpu torch.uint8 (3, 400, 400)         |   40 (+-  0) us  |   27 (+-  0) us
      cpu torch.uint8 (16, 3, 400, 400)     |   40 (+-  0) us  |   27 (+-  0) us

Times are in microseconds (us).
Performance of V1 vs V2: 26.077% (improvement)

[------------------------------------- Pad --------------------------------------]
                                            |         V1        |         V2      
1 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   287 (+-  1) us  |   297 (+-  1) us
      cuda torch.float32 (3, 400, 400)      |    27 (+-  0) us  |    34 (+-  0) us
      cpu torch.float32 (16, 3, 400, 400)   |  6883 (+- 43) us  |  6922 (+- 63) us
      cuda torch.float32 (16, 3, 400, 400)  |    79 (+-  0) us  |    79 (+-  0) us
      cpu torch.uint8 (3, 400, 400)         |   220 (+-  1) us  |   230 (+-  1) us
      cuda torch.uint8 (3, 400, 400)        |    27 (+-  0) us  |    34 (+-  0) us
      cpu torch.uint8 (16, 3, 400, 400)     |  3231 (+- 20) us  |  3249 (+- 11) us
      cuda torch.uint8 (16, 3, 400, 400)    |    38 (+-  0) us  |    38 (+-  0) us
      cpu pil (3, 400, 400)                 |   147 (+-  1) us  |   151 (+-  1) us
6 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   376 (+-  3) us  |   388 (+-  3) us
      cpu torch.float32 (16, 3, 400, 400)   |  6969 (+-193) us  |  7031 (+- 63) us
      cpu torch.uint8 (3, 400, 400)         |   302 (+-  3) us  |   314 (+-  3) us
      cpu torch.uint8 (16, 3, 400, 400)     |  3354 (+- 25) us  |  3379 (+- 35) us

Times are in microseconds (us).
Performance of V1 vs V2: -6.993% (slowdown)

[------------------------------------- Resize -------------------------------------]
                                            |         V1         |         V2       
1 threads: -------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |  1096 (+-  7) us   |  1103 (+-  7) us 
      cuda torch.float32 (3, 400, 400)      |    23 (+-  0) us   |    25 (+-  0) us 
      cpu torch.float32 (16, 3, 400, 400)   |  16734 (+-116) us  |  16712 (+- 95) us
      cuda torch.float32 (16, 3, 400, 400)  |   162 (+-  1) us   |   162 (+-  0) us 
      cpu torch.uint8 (3, 400, 400)         |  1391 (+-  8) us   |  1370 (+-  9) us 
      cuda torch.uint8 (3, 400, 400)        |    51 (+-  0) us   |    53 (+-  0) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  22197 (+-127) us  |  22000 (+-143) us
      cuda torch.uint8 (16, 3, 400, 400)    |   229 (+-  0) us   |   228 (+-  0) us 
      cpu pil (3, 400, 400)                 |  1124 (+-  5) us   |  1125 (+-  7) us 
6 threads: -------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |  1186 (+- 22) us   |  1191 (+- 20) us 
      cpu torch.float32 (16, 3, 400, 400)   |  16956 (+-317) us  |  16976 (+-184) us
      cpu torch.uint8 (3, 400, 400)         |  1608 (+- 21) us   |  1586 (+- 26) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  22713 (+-290) us  |  22526 (+-420) us

Times are in microseconds (us).
Performance of V1 vs V2: -1.247% (slowdown)

[----------------------------------- TenCrop ------------------------------------]
                                            |         V1        |         V2      
1 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   324 (+-  1) us  |   295 (+-  1) us
      cuda torch.float32 (3, 400, 400)      |    99 (+-  1) us  |    66 (+-  1) us
      cpu torch.float32 (16, 3, 400, 400)   |  6064 (+-163) us  |  5996 (+- 19) us
      cuda torch.float32 (16, 3, 400, 400)  |    99 (+-  1) us  |    66 (+-  0) us
      cpu torch.uint8 (3, 400, 400)         |   386 (+-  1) us  |   357 (+-  2) us
      cuda torch.uint8 (3, 400, 400)        |    98 (+-  1) us  |    66 (+-  0) us
      cpu torch.uint8 (16, 3, 400, 400)     |  4660 (+- 13) us  |  4626 (+- 17) us
      cuda torch.uint8 (16, 3, 400, 400)    |    99 (+-  1) us  |    67 (+-  0) us
      cpu pil (3, 400, 400)                 |   356 (+-  1) us  |   328 (+-  1) us
6 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   374 (+-  3) us  |   344 (+-  4) us
      cpu torch.float32 (16, 3, 400, 400)   |  6064 (+- 66) us  |  6027 (+- 57) us
      cpu torch.uint8 (3, 400, 400)         |   433 (+-  4) us  |   403 (+-  4) us
      cpu torch.uint8 (16, 3, 400, 400)     |  4741 (+- 36) us  |  4709 (+- 40) us

Times are in microseconds (us).
Performance of V1 vs V2: 16.425% (improvement)

[------------------------------------- RandomAffine -------------------------------------]
                                            |           V1          |           V2        
1 threads: -------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    14378 (+-906) us   |    14169 (+-104) us 
      cuda torch.float32 (3, 400, 400)      |     555 (+- 32) us    |     514 (+-  2) us  
      cpu torch.float32 (16, 3, 400, 400)   |  453405 (+-30956) us  |  456598 (+-31138) us
      cuda torch.float32 (16, 3, 400, 400)  |    1584 (+- 15) us    |    1579 (+- 10) us  
      cpu torch.uint8 (3, 400, 400)         |    14589 (+-319) us   |    14540 (+-305) us 
      cuda torch.uint8 (3, 400, 400)        |     550 (+-  1) us    |     543 (+-  2) us  
      cpu torch.uint8 (16, 3, 400, 400)     |  472450 (+-33505) us  |  463440 (+-32430) us
      cuda torch.uint8 (16, 3, 400, 400)    |    1685 (+-  9) us    |    1677 (+- 10) us  
      cpu pil (3, 400, 400)                 |     359 (+-  1) us    |     365 (+-  2) us  
6 threads: -------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    14949 (+-351) us   |    14951 (+-326) us 
      cpu torch.float32 (16, 3, 400, 400)   |  458052 (+-32305) us  |  457932 (+-31797) us
      cpu torch.uint8 (3, 400, 400)         |    15542 (+-337) us   |    15445 (+-329) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  470605 (+-33002) us  |  468819 (+-33084) us

Times are in microseconds (us).
Performance of V1 vs V2: 0.725% (improvement)

[---------------------------------- RandomCrop ----------------------------------]
                                            |         V1        |         V2      
1 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   323 (+-  1) us  |   335 (+-  1) us
      cuda torch.float32 (3, 400, 400)      |    55 (+-  0) us  |    62 (+-  0) us
      cpu torch.float32 (16, 3, 400, 400)   |  6914 (+-112) us  |  6929 (+- 34) us
      cuda torch.float32 (16, 3, 400, 400)  |    79 (+-  0) us  |    79 (+-  0) us
      cpu torch.uint8 (3, 400, 400)         |   250 (+-  1) us  |   262 (+-  1) us
      cuda torch.uint8 (3, 400, 400)        |    54 (+-  0) us  |    61 (+-  0) us
      cpu torch.uint8 (16, 3, 400, 400)     |  3241 (+-  8) us  |  3259 (+- 16) us
      cuda torch.uint8 (16, 3, 400, 400)    |    48 (+-  0) us  |    62 (+-  0) us
      cpu pil (3, 400, 400)                 |   203 (+-  1) us  |   208 (+- 11) us
6 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   416 (+-  4) us  |   428 (+-  4) us
      cpu torch.float32 (16, 3, 400, 400)   |  7003 (+-236) us  |  7069 (+- 66) us
      cpu torch.uint8 (3, 400, 400)         |   337 (+-  4) us  |   349 (+-  4) us
      cpu torch.uint8 (16, 3, 400, 400)     |  3375 (+- 36) us  |  3395 (+- 34) us

Times are in microseconds (us).
Performance of V1 vs V2: -6.462% (slowdown)

[----------------------------- RandomHorizontalFlip -----------------------------]
                                            |         V1        |         V2      
1 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   247 (+-  1) us  |   254 (+-  1) us
      cuda torch.float32 (3, 400, 400)      |    23 (+-  0) us  |    27 (+-  0) us
      cpu torch.float32 (16, 3, 400, 400)   |  5931 (+- 48) us  |  5919 (+- 15) us
      cuda torch.float32 (16, 3, 400, 400)  |    51 (+-  0) us  |    51 (+-  0) us
      cpu torch.uint8 (3, 400, 400)         |   309 (+-  1) us  |   316 (+-  1) us
      cuda torch.uint8 (3, 400, 400)        |    23 (+-  0) us  |    27 (+-  0) us
      cpu torch.uint8 (16, 3, 400, 400)     |  4574 (+-  9) us  |  4582 (+-  9) us
      cuda torch.uint8 (16, 3, 400, 400)    |    23 (+-  0) us  |    27 (+-  0) us
      cpu pil (3, 400, 400)                 |   133 (+-  1) us  |   138 (+-  1) us
6 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   296 (+-  2) us  |   304 (+-  3) us
      cpu torch.float32 (16, 3, 400, 400)   |  5932 (+- 76) us  |  5928 (+- 38) us
      cpu torch.uint8 (3, 400, 400)         |   354 (+-  3) us  |   360 (+-  4) us
      cpu torch.uint8 (16, 3, 400, 400)     |  4647 (+- 35) us  |  4654 (+- 44) us

Times are in microseconds (us).
Performance of V1 vs V2: -5.806% (slowdown)

[---------------------------------- RandomPerspective -----------------------------------]
                                            |           V1          |           V2        
1 threads: -------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    5669 (+-309) us    |    5016 (+- 14) us  
      cuda torch.float32 (3, 400, 400)      |     668 (+-  2) us    |     638 (+-  1) us  
      cpu torch.float32 (16, 3, 400, 400)   |  103699 (+-11683) us  |   87578 (+-11757) us
      cuda torch.float32 (16, 3, 400, 400)  |     872 (+- 11) us    |     852 (+-  6) us  
      cpu torch.uint8 (3, 400, 400)         |    6140 (+- 17) us    |    5418 (+- 14) us  
      cuda torch.uint8 (3, 400, 400)        |     707 (+-  2) us    |     672 (+-  1) us  
      cpu torch.uint8 (16, 3, 400, 400)     |  115905 (+-11269) us  |   96945 (+-11355) us
      cuda torch.uint8 (16, 3, 400, 400)    |     915 (+-  8) us    |     897 (+-  8) us  
      cpu pil (3, 400, 400)                 |    3385 (+- 40) us    |    3410 (+- 40) us  
6 threads: -------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    6295 (+- 50) us    |    5589 (+- 48) us  
      cpu torch.float32 (16, 3, 400, 400)   |  106728 (+-12443) us  |   90306 (+-12163) us
      cpu torch.uint8 (3, 400, 400)         |    6919 (+- 64) us    |    6111 (+- 39) us  
      cpu torch.uint8 (16, 3, 400, 400)     |  118305 (+-11773) us  |  100258 (+-11661) us

Times are in microseconds (us).
Performance of V1 vs V2: 6.612% (improvement)

[-------------------------------- RandomResizedCrop ---------------------------------]
                                            |          V1         |          V2       
1 threads: ---------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    845 (+- 20) us   |    835 (+- 20) us 
      cuda torch.float32 (3, 400, 400)      |    108 (+-  1) us   |     97 (+-  1) us 
      cpu torch.float32 (16, 3, 400, 400)   |   11057 (+-922) us  |   11051 (+-924) us
      cuda torch.float32 (16, 3, 400, 400)  |    119 (+-  2) us   |    119 (+-  1) us 
      cpu torch.uint8 (3, 400, 400)         |   1053 (+- 23) us   |   1014 (+- 23) us 
      cuda torch.uint8 (3, 400, 400)        |    134 (+-  1) us   |    122 (+-  1) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  14512 (+-1163) us  |  14136 (+-1084) us
      cuda torch.uint8 (16, 3, 400, 400)    |    130 (+-  1) us   |    129 (+-  1) us 
      cpu pil (3, 400, 400)                 |    902 (+- 70) us   |    885 (+-  6) us 
6 threads: ---------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    945 (+- 30) us   |    934 (+- 31) us 
      cpu torch.float32 (16, 3, 400, 400)   |   11308 (+-967) us  |   11291 (+-956) us
      cpu torch.uint8 (3, 400, 400)         |   1270 (+- 29) us   |   1230 (+- 36) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  14894 (+-1149) us  |  14507 (+-1140) us

Times are in microseconds (us).
Performance of V1 vs V2: 3.028% (improvement)

[------------------------------------- RandomRotation -------------------------------------]
                                            |           V1           |           V2         
1 threads: ---------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   17180 (+-4965) us    |   15761 (+-1568) us  
      cuda torch.float32 (3, 400, 400)      |     656 (+-  2) us     |     624 (+-  1) us   
      cpu torch.float32 (16, 3, 400, 400)   |  458672 (+-160941) us  |  430656 (+-28398) us 
      cuda torch.float32 (16, 3, 400, 400)  |    1581 (+- 41) us     |    1571 (+- 42) us   
      cpu torch.uint8 (3, 400, 400)         |   16548 (+-1619) us    |   16330 (+-1549) us  
      cuda torch.uint8 (3, 400, 400)        |     693 (+-  1) us     |     656 (+-  1) us   
      cpu torch.uint8 (16, 3, 400, 400)     |  477884 (+-173543) us  |  449887 (+-28933) us 
      cuda torch.uint8 (16, 3, 400, 400)    |    1737 (+- 47) us     |    1746 (+- 51) us   
      cpu pil (3, 400, 400)                 |     611 (+-  4) us     |     615 (+-  4) us   
6 threads: ---------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   16987 (+-1742) us    |   16902 (+-1664) us  
      cpu torch.float32 (16, 3, 400, 400)   |  464165 (+-160255) us  |  463919 (+-159642) us
      cpu torch.uint8 (3, 400, 400)         |   17776 (+-1622) us    |   17486 (+-1536) us  
      cpu torch.uint8 (16, 3, 400, 400)     |  481863 (+-176986) us  |  476903 (+-166256) us

Times are in microseconds (us).
Performance of V1 vs V2: 1.887% (improvement)

[------------------------------ RandomVerticalFlip ------------------------------]
                                            |         V1        |         V2      
1 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   184 (+-  1) us  |   192 (+-  1) us
      cuda torch.float32 (3, 400, 400)      |    23 (+-  0) us  |    27 (+-  0) us
      cpu torch.float32 (16, 3, 400, 400)   |  4606 (+- 37) us  |  4602 (+- 19) us
      cuda torch.float32 (16, 3, 400, 400)  |    52 (+-  0) us  |    52 (+-  0) us
      cpu torch.uint8 (3, 400, 400)         |    91 (+-  1) us  |    98 (+-  1) us
      cuda torch.uint8 (3, 400, 400)        |    23 (+-  0) us  |    26 (+-  0) us
      cpu torch.uint8 (16, 3, 400, 400)     |  1089 (+- 10) us  |  1097 (+-  9) us
      cuda torch.uint8 (16, 3, 400, 400)    |    24 (+-  0) us  |    27 (+-  1) us
      cpu pil (3, 400, 400)                 |    74 (+-  0) us  |    80 (+-  0) us
6 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   229 (+-  2) us  |   237 (+-  2) us
      cpu torch.float32 (16, 3, 400, 400)   |  4706 (+- 87) us  |  4707 (+- 37) us
      cpu torch.uint8 (3, 400, 400)         |   134 (+-  2) us  |   142 (+-  2) us
      cpu torch.uint8 (16, 3, 400, 400)     |  1152 (+- 19) us  |  1161 (+- 17) us

Times are in microseconds (us).
Performance of V1 vs V2: -6.617% (slowdown)

[------------------------------- ConvertImageDtype --------------------------------]
                                            |         V1         |         V2       
1 threads: -------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   134 (+-  1) us   |   132 (+-  1) us 
      cuda torch.float32 (3, 400, 400)      |    16 (+-  0) us   |    14 (+-  0) us 
      cpu torch.float32 (16, 3, 400, 400)   |  2519 (+- 20) us   |  2514 (+- 16) us 
      cuda torch.float32 (16, 3, 400, 400)  |    44 (+-  0) us   |    44 (+-  0) us 
      cpu torch.uint8 (3, 400, 400)         |  1053 (+-  5) us   |   981 (+-  6) us 
      cuda torch.uint8 (3, 400, 400)        |    25 (+-  0) us   |    22 (+-  0) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  16495 (+- 62) us  |  15196 (+- 62) us
      cuda torch.uint8 (16, 3, 400, 400)    |    52 (+-  0) us   |    40 (+-  0) us 
6 threads: -------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   180 (+-  2) us   |   177 (+-  3) us 
      cpu torch.float32 (16, 3, 400, 400)   |  2619 (+- 34) us   |  2617 (+- 39) us 
      cpu torch.uint8 (3, 400, 400)         |  1139 (+- 16) us   |  1062 (+- 14) us 
      cpu torch.uint8 (16, 3, 400, 400)     |  16690 (+-252) us  |  15337 (+-256) us

Times are in microseconds (us).
Performance of V1 vs V2: 7.949% (improvement)

[------------------------------------- GaussianBlur -------------------------------------]
                                            |           V1          |           V2        
1 threads: -------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    3281 (+-260) us    |    3174 (+-258) us  
      cuda torch.float32 (3, 400, 400)      |     239 (+- 31) us    |     140 (+-  1) us  
      cpu torch.float32 (16, 3, 400, 400)   |  241303 (+-59097) us  |  241166 (+-58982) us
      cuda torch.float32 (16, 3, 400, 400)  |     305 (+-  1) us    |     221 (+-  0) us  
      cpu torch.uint8 (3, 400, 400)         |    3896 (+-239) us    |    3657 (+-246) us  
      cuda torch.uint8 (3, 400, 400)        |     257 (+-  2) us    |     171 (+-  1) us  
      cpu torch.uint8 (16, 3, 400, 400)     |   256446 (+-2638) us  |   254117 (+-796) us 
      cuda torch.uint8 (16, 3, 400, 400)    |     433 (+-  1) us    |     344 (+-  0) us  
      cpu pil (3, 400, 400)                 |    7085 (+-303) us    |    6921 (+-282) us  
6 threads: -------------------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |    4452 (+-266) us    |    4315 (+-262) us  
      cpu torch.float32 (16, 3, 400, 400)   |   264282 (+-2007) us  |   264110 (+-2584) us
      cpu torch.uint8 (3, 400, 400)         |    5110 (+-257) us    |    4934 (+-258) us  
      cpu torch.uint8 (16, 3, 400, 400)     |   279173 (+-2179) us  |   276032 (+-3026) us

Times are in microseconds (us).
Performance of V1 vs V2: 13.555% (improvement)

[---------------------------------- Normalize -----------------------------------]
                                            |         V1        |         V2      
1 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   383 (+-  1) us  |   291 (+-  1) us
      cuda torch.float32 (3, 400, 400)      |   118 (+-  1) us  |    74 (+-  0) us
      cpu torch.float32 (16, 3, 400, 400)   |  6943 (+- 19) us  |  5478 (+- 55) us
      cuda torch.float32 (16, 3, 400, 400)  |   224 (+-  1) us  |   140 (+-  1) us
6 threads: -----------------------------------------------------------------------
      cpu torch.float32 (3, 400, 400)       |   516 (+-  4) us  |   380 (+-  4) us
      cpu torch.float32 (16, 3, 400, 400)   |  7282 (+- 58) us  |  6006 (+- 49) us

Times are in microseconds (us).
Performance of V1 vs V2: 30.002% (improvement)

Functional Kernels

Generated using @pmeier's script. We compare V1 against V2 for all kernels across many configurations (batch size, dtype, device, number of threads, etc.) and then estimate the average performance improvement across all configurations to summarize the end result.
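As a rough illustration of the summary step described above, here is a minimal sketch of one way to turn per-configuration timings into a single aggregated percentage. The formula (unweighted mean of the per-row relative change) and the helper names `relative_improvement`, `aggregate_improvement`, and `describe` are assumptions for illustration. The benchmark script may aggregate differently, so this is not expected to reproduce the reported numbers exactly.

```python
from statistics import mean


def relative_improvement(v1_time, v2_time):
    """Fraction of v1's runtime saved by v2 (negative means v2 is slower)."""
    return (v1_time - v2_time) / v1_time


def aggregate_improvement(rows):
    """Unweighted mean of the per-configuration relative improvements.

    ASSUMPTION: the real script may weight or combine rows differently.
    """
    return mean(relative_improvement(v1, v2) for _, v1, v2 in rows)


def describe(change):
    """Format a fraction in the style of the summary lines in the tables."""
    label = "improvement" if change >= 0 else "slowdown"
    return f"{change:+.1%} ({label})"


# A subset of the single-threaded adjust_brightness rows from the
# detailed benchmarks (median times in microseconds): (config, v1, v2).
rows = [
    ("(3, 400, 400) / uint8 / cpu", 1303, 572),
    ("(3, 400, 400) / PIL", 814, 811),
    ("(3, 400, 400) / float32 / cpu", 830, 253),
]

print(describe(aggregate_improvement(rows)))
```

An unweighted mean treats every configuration equally, so a kernel that only improves on CUDA still moves the summary even if CPU timings are unchanged. That is worth keeping in mind when reading the aggregated figures.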

Detailed Benchmarks

[----------- adjust_brightness @ torchvision==0.15.0a0+1098dad ------------]
                                          |        v1       |        v2     
1 threads: -----------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   1303 (+-  5)  |   572 (+-  1) 
      (3, 400, 400) / uint8 / cuda        |    53 (+-  0)   |    25 (+-  0) 
      (3, 400, 400) / PIL                 |   814 (+-  2)   |   811 (+-  2) 
      (3, 400, 400) / float32 / cpu       |   830 (+-  4)   |   253 (+-  1) 
      (3, 400, 400) / float32 / cuda      |    44 (+-  0)   |    17 (+-  0) 
      (16, 3, 400, 400) / uint8 / cpu     |  31009 (+-836)  |  12549 (+- 61)
      (16, 3, 400, 400) / uint8 / cuda    |   261 (+-  0)   |   127 (+-  0) 
      (16, 3, 400, 400) / float32 / cpu   |  23236 (+-539)  |   5201 (+- 31)
      (16, 3, 400, 400) / float32 / cuda  |   241 (+-  0)   |    96 (+-  0) 
6 threads: -----------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   1650 (+- 26)  |   744 (+- 16) 
      (3, 400, 400) / float32 / cpu       |   1050 (+- 18)  |   339 (+-  4) 
      (16, 3, 400, 400) / uint8 / cpu     |  31815 (+-396)  |  12900 (+-247)
      (16, 3, 400, 400) / float32 / cpu   |  23572 (+-309)  |   5473 (+- 48)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +48.8% (improvement)

[------------- adjust_contrast @ torchvision==0.15.0a0+1098dad -------------]
                                          |        v1        |        v2     
1 threads: ------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   1380 (+- 18)   |   954 (+-  4) 
      (3, 400, 400) / uint8 / cuda        |   134 (+-  2)    |    82 (+-  1) 
      (3, 400, 400) / PIL                 |   1081 (+-  8)   |   1077 (+-  5)
      (3, 400, 400) / float32 / cpu       |   905 (+- 12)    |   540 (+-  1) 
      (3, 400, 400) / float32 / cuda      |   107 (+-  2)    |    66 (+-  0) 
      (16, 3, 400, 400) / uint8 / cpu     |  35265 (+-122)   |  24302 (+-125)
      (16, 3, 400, 400) / uint8 / cuda    |   293 (+-  0)    |   242 (+-  0) 
      (16, 3, 400, 400) / float32 / cpu   |  23713 (+-109)   |  13330 (+- 86)
      (16, 3, 400, 400) / float32 / cuda  |   252 (+-  0)    |   197 (+-  0) 
6 threads: ------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   2046 (+- 30)   |   1501 (+- 20)
      (3, 400, 400) / float32 / cpu       |   1290 (+- 23)   |   841 (+- 20) 
      (16, 3, 400, 400) / uint8 / cpu     |  38422 (+-1500)  |  25921 (+-293)
      (16, 3, 400, 400) / float32 / cpu   |  23309 (+-428)   |  13884 (+-180)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +26.0% (improvement)

[--------------- adjust_gamma @ torchvision==0.15.0a0+1098dad ---------------]
                                          |        v1        |        v2      
1 threads: -------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   4640 (+- 12)   |   4362 (+-  8) 
      (3, 400, 400) / uint8 / cuda        |    81 (+-  1)    |    53 (+-  0)  
      (3, 400, 400) / PIL                 |   463 (+-  1)    |   457 (+-  1)  
      (3, 400, 400) / float32 / cpu       |   3789 (+- 17)   |   3641 (+-  8) 
      (3, 400, 400) / float32 / cuda      |    29 (+-  0)    |    21 (+-  0)  
      (16, 3, 400, 400) / uint8 / cpu     |  82220 (+-634)   |  72850 (+-394) 
      (16, 3, 400, 400) / uint8 / cuda    |   331 (+-  0)    |   312 (+-  0)  
      (16, 3, 400, 400) / float32 / cpu   |  63453 (+-496)   |  58586 (+-298) 
      (16, 3, 400, 400) / float32 / cuda  |   150 (+-  0)    |   142 (+-  0)  
6 threads: -------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   5042 (+- 47)   |   4751 (+- 29) 
      (3, 400, 400) / float32 / cpu       |   4003 (+- 46)   |   3866 (+- 36) 
      (16, 3, 400, 400) / uint8 / cpu     |  83791 (+-3026)  |  75086 (+-1903)
      (16, 3, 400, 400) / float32 / cpu   |  65180 (+-1774)  |  60190 (+-1816)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +10.4% (improvement)

[------------------ adjust_hue @ torchvision==0.15.0a0+1098dad ------------------]
                                          |         v1         |         v2       
1 threads: -----------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   21044 (+- 98)    |   15068 (+-119)  
      (3, 400, 400) / uint8 / cuda        |    904 (+- 47)     |    558 (+-  1)   
      (3, 400, 400) / PIL                 |   10260 (+- 54)    |   10253 (+- 59)  
      (3, 400, 400) / float32 / cpu       |   20317 (+-132)    |   14494 (+-171)  
      (3, 400, 400) / float32 / cuda      |    693 (+-  2)     |    510 (+-  2)   
      (16, 3, 400, 400) / uint8 / cpu     |  806291 (+-26060)  |  458049 (+-1384) 
      (16, 3, 400, 400) / uint8 / cuda    |    5805 (+-  1)    |    2331 (+-  0)  
      (16, 3, 400, 400) / float32 / cpu   |  769409 (+-9376)   |  442955 (+-5570) 
      (16, 3, 400, 400) / float32 / cuda  |    5633 (+-  1)    |    2170 (+-  3)  
6 threads: -----------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   24682 (+-213)    |   18549 (+-318)  
      (3, 400, 400) / float32 / cpu       |   23809 (+-217)    |   17842 (+-238)  
      (16, 3, 400, 400) / uint8 / cpu     |  799018 (+-14984)  |  467291 (+-6339) 
      (16, 3, 400, 400) / float32 / cpu   |  781586 (+-2532)   |  451900 (+-13095)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +32.7% (improvement)

[----------- adjust_saturation @ torchvision==0.15.0a0+1098dad ------------]
                                          |        v1       |        v2     
1 threads: -----------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   1462 (+-  7)  |   976 (+-  5) 
      (3, 400, 400) / uint8 / cuda        |   104 (+-  1)   |    65 (+-  0) 
      (3, 400, 400) / PIL                 |   932 (+- 43)   |   929 (+-  6) 
      (3, 400, 400) / float32 / cpu       |   986 (+-  3)   |   570 (+-  1) 
      (3, 400, 400) / float32 / cuda      |    86 (+-  0)   |    50 (+-  0) 
      (16, 3, 400, 400) / uint8 / cpu     |  35543 (+-128)  |  25018 (+- 88)
      (16, 3, 400, 400) / uint8 / cuda    |   287 (+-  0)   |   241 (+-  0) 
      (16, 3, 400, 400) / float32 / cpu   |  24104 (+-435)  |  13323 (+-223)
      (16, 3, 400, 400) / float32 / cuda  |   262 (+-  0)   |   199 (+-  0) 
6 threads: -----------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   2117 (+- 19)  |   1462 (+- 18)
      (3, 400, 400) / float32 / cpu       |   1344 (+- 19)  |   802 (+- 18) 
      (16, 3, 400, 400) / uint8 / cpu     |  38526 (+-345)  |  26558 (+-327)
      (16, 3, 400, 400) / float32 / cpu   |  25124 (+-212)  |  13919 (+-218)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +26.9% (improvement)

[-------------- adjust_sharpness @ torchvision==0.15.0a0+1098dad --------------]
                                          |         v1        |         v2      
1 threads: ---------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    5508 (+-167)   |    4232 (+- 36) 
      (3, 400, 400) / uint8 / cuda        |    205 (+-  1)    |    115 (+-  0)  
      (3, 400, 400) / PIL                 |    3532 (+- 11)   |    3523 (+-  8) 
      (3, 400, 400) / float32 / cpu       |    4955 (+- 34)   |    4040 (+- 39) 
      (3, 400, 400) / float32 / cuda      |    170 (+-  1)    |    104 (+-  0)  
      (16, 3, 400, 400) / uint8 / cpu     |  286173 (+-5670)  |   258815 (+-769)
      (16, 3, 400, 400) / uint8 / cuda    |    575 (+-  1)    |    455 (+-  0)  
      (16, 3, 400, 400) / float32 / cpu   |  270322 (+-7024)  |   252958 (+-710)
      (16, 3, 400, 400) / float32 / cuda  |    487 (+-  1)    |    380 (+-  3)  
6 threads: ---------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    6857 (+-193)   |    5214 (+- 62) 
      (3, 400, 400) / float32 / cpu       |    6000 (+- 47)   |    4875 (+- 58) 
      (16, 3, 400, 400) / uint8 / cpu     |  306676 (+-3053)  |  279421 (+-2652)
      (16, 3, 400, 400) / float32 / cpu   |  291542 (+-2580)  |  274144 (+-2823)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +18.4% (improvement)

[------------------- affine @ torchvision==0.15.0a0+1098dad -------------------]
                                          |         v1        |         v2      
1 threads: ---------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    2130 (+-821)   |    2021 (+-816) 
      (3, 400, 400) / uint8 / cuda        |    245 (+-  1)    |    221 (+-  2)  
      (3, 400, 400) / PIL                 |   3700 (+-1658)   |   3689 (+-1658) 
      (3, 400, 400) / float32 / cpu       |    1609 (+-795)   |    1590 (+-793) 
      (3, 400, 400) / float32 / cuda      |    212 (+-  1)    |    194 (+-  1)  
      (16, 3, 400, 400) / uint8 / cpu     |  70314 (+-11520)  |  68234 (+-11541)
      (16, 3, 400, 400) / uint8 / cuda    |    383 (+- 33)    |    373 (+- 34)  
      (16, 3, 400, 400) / float32 / cpu   |  58610 (+-11623)  |  29207 (+-14296)
      (16, 3, 400, 400) / float32 / cuda  |    256 (+- 34)    |    249 (+- 34)  
6 threads: ---------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    2455 (+-806)   |    2349 (+-808) 
      (3, 400, 400) / float32 / cpu       |    1802 (+-803)   |    1776 (+-804) 
      (16, 3, 400, 400) / uint8 / cpu     |  71781 (+-11712)  |  69805 (+-12170)
      (16, 3, 400, 400) / float32 / cpu   |  58848 (+-11868)  |  58908 (+-11957)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +3.4% (improvement)

[-------------- autocontrast @ torchvision==0.15.0a0+1098dad --------------]
                                          |        v1       |        v2     
1 threads: -----------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   1130 (+-  4)  |   767 (+-  2) 
      (3, 400, 400) / uint8 / cuda        |   163 (+-  1)   |    96 (+-  1) 
      (3, 400, 400) / PIL                 |   724 (+-  1)   |   720 (+-  2) 
      (3, 400, 400) / float32 / cpu       |   712 (+-  1)   |   532 (+-  1) 
      (3, 400, 400) / float32 / cuda      |   135 (+-  1)   |    81 (+-  0) 
      (16, 3, 400, 400) / uint8 / cpu     |  23736 (+-103)  |  13385 (+- 61)
      (16, 3, 400, 400) / uint8 / cuda    |   254 (+-  0)   |   270 (+-  0) 
      (16, 3, 400, 400) / float32 / cpu   |  18042 (+- 83)  |  13180 (+-131)
      (16, 3, 400, 400) / float32 / cuda  |   237 (+-  0)   |   221 (+-  0) 
6 threads: -----------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   1447 (+- 32)  |   1066 (+- 17)
      (3, 400, 400) / float32 / cpu       |   931 (+- 17)   |   745 (+- 14) 
      (16, 3, 400, 400) / uint8 / cpu     |  24379 (+-436)  |  13972 (+-231)
      (16, 3, 400, 400) / float32 / cpu   |  18798 (+-735)  |  13514 (+-260)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +20.9% (improvement)

[------------ center_crop @ torchvision==0.15.0a0+1098dad -------------]
                                          |       v1      |       v2    
1 threads: -------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   11 (+-  0)  |    5 (+-  0)
      (3, 400, 400) / uint8 / cuda        |   11 (+-  0)  |    5 (+-  0)
      (3, 400, 400) / PIL                 |   15 (+-  0)  |   11 (+-  0)
      (3, 400, 400) / float32 / cpu       |   11 (+-  0)  |    5 (+-  0)
      (3, 400, 400) / float32 / cuda      |   11 (+-  0)  |    5 (+-  0)
      (16, 3, 400, 400) / uint8 / cpu     |   11 (+-  0)  |    5 (+-  0)
      (16, 3, 400, 400) / uint8 / cuda    |   11 (+-  0)  |    6 (+-  0)
      (16, 3, 400, 400) / float32 / cpu   |   11 (+-  0)  |    5 (+-  0)
      (16, 3, 400, 400) / float32 / cuda  |   11 (+-  0)  |    6 (+-  0)
6 threads: -------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   11 (+-  0)  |    5 (+-  0)
      (3, 400, 400) / float32 / cpu       |   11 (+-  0)  |    5 (+-  0)
      (16, 3, 400, 400) / uint8 / cpu     |   11 (+-  0)  |    5 (+-  0)
      (16, 3, 400, 400) / float32 / cpu   |   11 (+-  0)  |    5 (+-  0)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +45.7% (improvement)
[---------- convert_color_space @ torchvision==0.15.0a0+1098dad ----------]
                                          |        v1       |       v2     
1 threads: ----------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   485 (+-  1)   |  310 (+-  1) 
      (3, 400, 400) / uint8 / cuda        |    58 (+-  0)   |   37 (+-  0) 
      (3, 400, 400) / PIL                 |   112 (+-  0)   |  111 (+-  1) 
      (3, 400, 400) / float32 / cpu       |   357 (+-  1)   |  165 (+-  0) 
      (3, 400, 400) / float32 / cuda      |    49 (+-  0)   |   29 (+-  0) 
      (16, 3, 400, 400) / uint8 / cpu     |   9109 (+- 57)  |  5401 (+- 10)
      (16, 3, 400, 400) / uint8 / cuda    |    83 (+-  1)   |   57 (+-  0) 
      (16, 3, 400, 400) / float32 / cpu   |   6209 (+- 12)  |  2763 (+-  9)
      (16, 3, 400, 400) / float32 / cuda  |    84 (+-  1)   |   44 (+-  0) 
6 threads: ----------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   854 (+- 19)   |  594 (+- 14) 
      (3, 400, 400) / float32 / cpu       |   562 (+-  4)   |  291 (+-  4) 
      (16, 3, 400, 400) / uint8 / cpu     |  10484 (+-233)  |  6001 (+- 39)
      (16, 3, 400, 400) / float32 / cpu   |   6770 (+-116)  |  3014 (+- 21)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +34.4% (improvement)
[----------- convert_dtype @ torchvision==0.15.0a0+1098dad ------------]
                                        |       v1       |       v2     
1 threads: -------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu       |  283 (+-  1)   |  189 (+-  1) 
      (3, 400, 400) / uint8 / cuda      |   24 (+-  0)   |   16 (+-  0) 
      (16, 3, 400, 400) / uint8 / cpu   |  6775 (+-124)  |  3609 (+- 11)
      (16, 3, 400, 400) / uint8 / cuda  |   90 (+-  1)   |   87 (+-  1) 
6 threads: -------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu       |  373 (+-  3)   |  274 (+-  3) 
      (16, 3, 400, 400) / uint8 / cpu   |  6845 (+-183)  |  3781 (+- 37)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +28.8% (improvement)
[---------------- crop @ torchvision==0.15.0a0+1098dad ----------------]
                                          |       v1      |       v2    
1 threads: -------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    6 (+-  0)  |    4 (+-  0)
      (3, 400, 400) / uint8 / cuda        |    7 (+-  0)  |    4 (+-  0)
      (3, 400, 400) / PIL                 |   11 (+-  0)  |   10 (+-  0)
      (3, 400, 400) / float32 / cpu       |    6 (+-  0)  |    4 (+-  0)
      (3, 400, 400) / float32 / cuda      |    7 (+-  0)  |    4 (+-  0)
      (16, 3, 400, 400) / uint8 / cpu     |    7 (+-  0)  |    4 (+-  0)
      (16, 3, 400, 400) / uint8 / cuda    |    7 (+-  0)  |    5 (+-  0)
      (16, 3, 400, 400) / float32 / cpu   |    6 (+-  0)  |    4 (+-  0)
      (16, 3, 400, 400) / float32 / cuda  |    7 (+-  0)  |    5 (+-  0)
6 threads: -------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    6 (+-  0)  |    4 (+-  0)
      (3, 400, 400) / float32 / cpu       |    6 (+-  0)  |    4 (+-  0)
      (16, 3, 400, 400) / uint8 / cpu     |    6 (+-  0)  |    4 (+-  0)
      (16, 3, 400, 400) / float32 / cpu   |    6 (+-  0)  |    4 (+-  0)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +29.2% (improvement)
[----------------- elastic @ torchvision==0.15.0a0+1098dad ------------------]
                                          |        v1        |        v2      
1 threads: -------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   4316 (+- 17)   |   4150 (+- 15) 
      (3, 400, 400) / uint8 / cuda        |   1004 (+-  5)   |   479 (+-  1)  
      (3, 400, 400) / PIL                 |   6821 (+- 16)   |   6664 (+- 18) 
      (3, 400, 400) / float32 / cpu       |   3823 (+- 15)   |   3722 (+- 13) 
      (3, 400, 400) / float32 / cuda      |   972 (+-  5)    |   455 (+-  1)  
      (16, 3, 400, 400) / uint8 / cpu     |  84065 (+-2585)  |  83337 (+-1763)
      (16, 3, 400, 400) / uint8 / cuda    |   1051 (+-  6)   |   493 (+-  1)  
      (16, 3, 400, 400) / float32 / cpu   |  73713 (+-268)   |  73115 (+-1222)
      (16, 3, 400, 400) / float32 / cuda  |   975 (+-  6)    |   448 (+-  1)  
6 threads: -------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   4590 (+- 29)   |   4426 (+- 30) 
      (3, 400, 400) / float32 / cpu       |   3975 (+- 38)   |   3880 (+- 42) 
      (16, 3, 400, 400) / uint8 / cpu     |  85571 (+-1098)  |  84945 (+-1901)
      (16, 3, 400, 400) / float32 / cpu   |  74562 (+-1859)  |  73956 (+-2809)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +22.5% (improvement)
[----------------- equalize @ torchvision==0.15.0a0+1098dad ----------------]
                                          |        v1        |        v2     
1 threads: ------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   2883 (+-  8)   |   2356 (+-  8)
      (3, 400, 400) / uint8 / cuda        |   904 (+-101)    |   239 (+-  1) 
      (3, 400, 400) / PIL                 |   731 (+-  1)    |   727 (+-  1) 
      (16, 3, 400, 400) / uint8 / cpu     |  46920 (+-246)   |  39804 (+-193)
      (16, 3, 400, 400) / uint8 / cuda    |  14259 (+-1271)  |   838 (+-  8) 
      (3, 400, 400) / float32 / cpu       |                  |   3001 (+- 12)
      (3, 400, 400) / float32 / cuda      |                  |   287 (+-  1) 
      (16, 3, 400, 400) / float32 / cpu   |                  |  53390 (+- 87)
      (16, 3, 400, 400) / float32 / cuda  |                  |   1010 (+- 23)
6 threads: ------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   3602 (+- 38)   |   2547 (+- 29)
      (16, 3, 400, 400) / uint8 / cpu     |  59143 (+-1811)  |  40616 (+-371)
      (3, 400, 400) / float32 / cpu       |                  |   3348 (+- 47)
      (16, 3, 400, 400) / float32 / cpu   |                  |  54473 (+-407)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +36.0% (improvement)
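
Some rows above have an empty v1 cell (the float32 inputs for equalize), so anything that post-processes these tables has to tolerate missing values. A rough sketch of a row parser, assuming the `config | v1 | v2` layout shown here; `Row` and `parse_row` are hypothetical names and not part of the benchmark script.

```python
import re
from typing import NamedTuple, Optional


class Row(NamedTuple):
    config: str
    v1_us: Optional[int]
    v2_us: Optional[int]


# Matches a cell such as "2883 (+-  8)" and captures the leading time value.
_CELL = re.compile(r"^\s*(\d+)\s*\(\+-\s*\d+\)\s*$")


def _value(cell: str) -> Optional[int]:
    match = _CELL.match(cell)
    return int(match.group(1)) if match else None


def parse_row(line: str) -> Row:
    """Split one table row into its config label and the v1 / v2 times."""
    config, v1_cell, v2_cell = (part.strip() for part in line.split("|"))
    return Row(config, _value(v1_cell), _value(v2_cell))


print(parse_row("      (3, 400, 400) / uint8 / cpu         |   2883 (+-  8)   |   2356 (+-  8)"))
print(parse_row("      (3, 400, 400) / float32 / cpu       |                  |   3001 (+- 12)"))
```

Rows with a missing v1 value come back with `v1_us=None` and can be skipped in any v1 vs. v2 comparison.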
[------------- five_crop @ torchvision==0.15.0a0+1098dad --------------]
                                          |       v1      |       v2    
1 threads: -------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   39 (+-  0)  |   21 (+-  0)
      (3, 400, 400) / uint8 / cuda        |   41 (+-  0)  |   22 (+-  0)
      (3, 400, 400) / PIL                 |  104 (+-  1)  |   90 (+-  1)
      (3, 400, 400) / float32 / cpu       |   39 (+-  0)  |   21 (+-  0)
      (3, 400, 400) / float32 / cuda      |   41 (+-  0)  |   22 (+-  0)
      (16, 3, 400, 400) / uint8 / cpu     |   40 (+-  0)  |   21 (+-  0)
      (16, 3, 400, 400) / uint8 / cuda    |   41 (+-  0)  |   22 (+-  0)
      (16, 3, 400, 400) / float32 / cpu   |   40 (+-  0)  |   21 (+-  0)
      (16, 3, 400, 400) / float32 / cuda  |   41 (+-  0)  |   22 (+-  0)
6 threads: -------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   39 (+-  1)  |   21 (+-  0)
      (3, 400, 400) / float32 / cpu       |   39 (+-  0)  |   21 (+-  0)
      (16, 3, 400, 400) / uint8 / cpu     |   40 (+-  0)  |   21 (+-  0)
      (16, 3, 400, 400) / float32 / cpu   |   40 (+-  0)  |   21 (+-  0)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +40.2% (improvement)
[--------------- gaussian_blur @ torchvision==0.15.0a0+1098dad ----------------]
                                          |         v1        |         v2      
1 threads: ---------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    4256 (+-110)   |    4000 (+-  1) 
      (3, 400, 400) / uint8 / cuda        |    263 (+- 27)    |    139 (+-  1)  
      (3, 400, 400) / PIL                 |    7339 (+- 31)   |    7143 (+- 58) 
      (3, 400, 400) / float32 / cpu       |    3634 (+- 10)   |    3520 (+- 22) 
      (3, 400, 400) / float32 / cuda      |    211 (+-  2)    |    111 (+-  0)  
      (16, 3, 400, 400) / uint8 / cpu     |  261353 (+-2761)  |   258668 (+-930)
      (16, 3, 400, 400) / uint8 / cuda    |    701 (+- 17)    |    609 (+-  0)  
      (16, 3, 400, 400) / float32 / cpu   |  246807 (+-2297)  |   246708 (+-679)
      (16, 3, 400, 400) / float32 / cuda  |    575 (+-  1)    |    485 (+-  0)  
6 threads: ---------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    5466 (+- 52)   |    5246 (+- 31) 
      (3, 400, 400) / float32 / cpu       |    4730 (+- 40)   |    4597 (+- 38) 
      (16, 3, 400, 400) / uint8 / cpu     |  283245 (+-3426)  |  280457 (+-2995)
      (16, 3, 400, 400) / float32 / cpu   |  269702 (+-7677)  |  269556 (+-3773)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +13.7% (improvement)
[---------------- invert @ torchvision==0.15.0a0+1098dad ----------------]
                                          |       v1       |       v2     
1 threads: ---------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |  154 (+-  1)   |   19 (+-  0) 
      (3, 400, 400) / uint8 / cuda        |   13 (+-  0)   |    7 (+-  0) 
      (3, 400, 400) / PIL                 |  309 (+-  1)   |  306 (+-  1) 
      (3, 400, 400) / float32 / cpu       |  175 (+-  1)   |  166 (+-  1) 
      (3, 400, 400) / float32 / cuda      |   13 (+-  0)   |   10 (+-  0) 
      (16, 3, 400, 400) / uint8 / cpu     |  2269 (+-  6)  |  596 (+-  2) 
      (16, 3, 400, 400) / uint8 / cuda    |   14 (+-  0)   |   14 (+-  0) 
      (16, 3, 400, 400) / float32 / cpu   |  3982 (+-104)  |  4062 (+-107)
      (16, 3, 400, 400) / float32 / cuda  |   49 (+-  0)   |   49 (+-  0) 
6 threads: ---------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |  196 (+-  2)   |   58 (+-  0) 
      (3, 400, 400) / float32 / cpu       |  220 (+-  4)   |  210 (+-  3) 
      (16, 3, 400, 400) / uint8 / cpu     |  2330 (+- 28)  |  649 (+-  4) 
      (16, 3, 400, 400) / float32 / cpu   |  4142 (+- 80)  |  4001 (+- 42)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +23.3% (improvement)
[-------------- normalize @ torchvision==0.15.0a0+1098dad ---------------]
                                          |       v1       |       v2     
1 threads: ---------------------------------------------------------------
      (3, 400, 400) / float32 / cpu       |  376 (+-  1)   |  271 (+-  1) 
      (3, 400, 400) / float32 / cuda      |  116 (+-  0)   |   62 (+-  0) 
      (16, 3, 400, 400) / float32 / cpu   |  6698 (+- 15)  |  5207 (+- 19)
      (16, 3, 400, 400) / float32 / cuda  |  224 (+-  1)   |  139 (+-  0) 
6 threads: ---------------------------------------------------------------
      (3, 400, 400) / float32 / cpu       |  510 (+-  5)   |  359 (+-  4) 
      (16, 3, 400, 400) / float32 / cpu   |  7015 (+-227)  |  5805 (+- 43)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +33.6% (improvement)
[--------------- perspective @ torchvision==0.15.0a0+1098dad ----------------]
                                          |        v1        |        v2      
1 threads: -------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   4261 (+-148)   |   4000 (+-  8) 
      (3, 400, 400) / uint8 / cuda        |   502 (+- 49)    |   457 (+-  1)  
      (3, 400, 400) / PIL                 |   1819 (+- 28)   |   1809 (+-  7) 
      (3, 400, 400) / float32 / cpu       |   3763 (+- 11)   |   3644 (+-  7) 
      (3, 400, 400) / float32 / cuda      |   441 (+-  1)    |   426 (+-  1)  
      (16, 3, 400, 400) / uint8 / cpu     |  65562 (+-1493)  |  60287 (+-469) 
      (16, 3, 400, 400) / uint8 / cuda    |   475 (+-  1)    |   458 (+-  2)  
      (16, 3, 400, 400) / float32 / cpu   |  51973 (+-409)   |  51620 (+-473) 
      (16, 3, 400, 400) / float32 / cuda  |   438 (+-  1)    |   425 (+-  1)  
6 threads: -------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   4728 (+- 30)   |   4509 (+- 38) 
      (3, 400, 400) / float32 / cpu       |   4072 (+- 47)   |   4000 (+- 42) 
      (16, 3, 400, 400) / uint8 / cpu     |  67120 (+-1483)  |  61356 (+-1763)
      (16, 3, 400, 400) / float32 / cpu   |  51711 (+-546)   |  51035 (+-1607)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +3.7% (improvement)
[-------------- posterize @ torchvision==0.15.0a0+1098dad ---------------]
                                          |       v1       |       v2     
1 threads: ---------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |  116 (+-  1)   |  109 (+-  0) 
      (3, 400, 400) / uint8 / cuda        |   14 (+-  0)   |   10 (+-  0) 
      (3, 400, 400) / PIL                 |  318 (+-  1)   |  313 (+-  1) 
      (16, 3, 400, 400) / uint8 / cpu     |  1621 (+-  6)  |  1609 (+-  6)
      (16, 3, 400, 400) / uint8 / cuda    |   14 (+-  0)   |   14 (+-  0) 
      (3, 400, 400) / float32 / cpu       |                |  410 (+-  1) 
      (3, 400, 400) / float32 / cuda      |                |   26 (+-  0) 
      (16, 3, 400, 400) / float32 / cpu   |                |  8108 (+- 48)
      (16, 3, 400, 400) / float32 / cuda  |                |  178 (+-  0) 
6 threads: ---------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |  157 (+-  2)   |  148 (+-  3) 
      (16, 3, 400, 400) / uint8 / cpu     |  1706 (+- 24)  |  1673 (+- 25)
      (3, 400, 400) / float32 / cpu       |                |  579 (+-  4) 
      (16, 3, 400, 400) / float32 / cpu   |                |  8794 (+-202)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +6.9% (improvement)
[------------------- resize @ torchvision==0.15.0a0+1098dad -------------------]
                                          |         v1        |         v2      
1 threads: ---------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    1987 (+-706)   |    1945 (+-716) 
      (3, 400, 400) / uint8 / cuda        |     49 (+-  0)    |     43 (+-  0)  
      (3, 400, 400) / PIL                 |    1235 (+-427)   |    1228 (+-435) 
      (3, 400, 400) / float32 / cpu       |    1649 (+-744)   |    1638 (+-743) 
      (3, 400, 400) / float32 / cuda      |     21 (+-  0)    |     17 (+-  0)  
      (16, 3, 400, 400) / uint8 / cpu     |   9739 (+-9872)   |   8027 (+-11715)
      (16, 3, 400, 400) / uint8 / cuda    |     90 (+- 16)    |     89 (+- 16)  
      (16, 3, 400, 400) / float32 / cpu   |  26834 (+-12079)  |  26845 (+-12087)
      (16, 3, 400, 400) / float32 / cuda  |     23 (+- 13)    |     23 (+- 13)  
6 threads: ---------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    1527 (+-355)   |    1482 (+-359) 
      (3, 400, 400) / float32 / cpu       |    1073 (+-398)   |    1066 (+-403) 
      (16, 3, 400, 400) / uint8 / cpu     |   10123 (+-5967)  |   8402 (+-5907) 
      (16, 3, 400, 400) / float32 / cpu   |   16026 (+-6296)  |   16022 (+-6289)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +5.0% (improvement)
[----------------- resized_crop @ torchvision==0.15.0a0+1098dad -----------------]
                                          |         v1         |         v2       
1 threads: -----------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   1244 (+-1802)    |   1099 (+-1769)  
      (3, 400, 400) / uint8 / cuda        |     62 (+-  1)     |     52 (+-  0)   
      (3, 400, 400) / PIL                 |    2250 (+-979)    |    2236 (+-979)  
      (3, 400, 400) / float32 / cpu       |   2902 (+-2412)    |    531 (+-2408)  
      (3, 400, 400) / float32 / cuda      |     31 (+-  5)     |     24 (+-  5)   
      (16, 3, 400, 400) / uint8 / cpu     |  106292 (+-38242)  |  104513 (+-40684)
      (16, 3, 400, 400) / uint8 / cuda    |    218 (+- 25)     |    214 (+- 26)   
      (16, 3, 400, 400) / float32 / cpu   |   9466 (+-41424)   |   9384 (+-41517) 
      (16, 3, 400, 400) / float32 / cuda  |     81 (+- 16)     |     81 (+- 16)   
6 threads: -----------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   1431 (+-1432)    |   1283 (+-1406)  
      (3, 400, 400) / float32 / cpu       |   3951 (+-1671)    |   3945 (+-1684)  
      (16, 3, 400, 400) / uint8 / cpu     |  77433 (+-22907)   |  73118 (+-23566) 
      (16, 3, 400, 400) / float32 / cpu   |  61734 (+-26037)   |  61725 (+-26093) 

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +5.8% (improvement)
[------------------- rotate @ torchvision==0.15.0a0+1098dad -------------------]
                                          |         v1        |         v2      
1 threads: ---------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    2112 (+-854)   |    2026 (+-852) 
      (3, 400, 400) / uint8 / cuda        |    263 (+- 33)    |    217 (+-  3)  
      (3, 400, 400) / PIL                 |   3820 (+-1719)   |   3815 (+-1726) 
      (3, 400, 400) / float32 / cpu       |    1604 (+-815)   |    1588 (+-822) 
      (3, 400, 400) / float32 / cuda      |    204 (+-  1)    |    188 (+-  2)  
      (16, 3, 400, 400) / uint8 / cpu     |  81554 (+-16016)  |  78530 (+-15963)
      (16, 3, 400, 400) / uint8 / cuda    |    382 (+- 28)    |    371 (+- 28)  
      (16, 3, 400, 400) / float32 / cpu   |  66279 (+-14933)  |  66624 (+-15357)
      (16, 3, 400, 400) / float32 / cuda  |    255 (+- 28)    |    247 (+- 29)  
6 threads: ---------------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |    2434 (+-843)   |    2356 (+-828) 
      (3, 400, 400) / float32 / cpu       |    1791 (+-833)   |    1783 (+-834) 
      (16, 3, 400, 400) / uint8 / cpu     |  81143 (+-16182)  |  80019 (+-16450)
      (16, 3, 400, 400) / float32 / cpu   |  66174 (+-15374)  |  66373 (+-15422)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +3.6% (improvement)
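
Several rows in the resize, resized_crop and rotate tables report a spread that is larger than the gap between v1 and v2, so small differences there should be read with care. A rough way to flag such rows, assuming the `+-` figure is some measure of spread (this section does not say whether it is an interquartile range, a standard deviation, or something else); `within_noise` is a hypothetical helper.

```python
def within_noise(v1_us: float, v1_spread: float, v2_us: float, v2_spread: float) -> bool:
    """True if the v1/v2 gap is no larger than the bigger of the two spreads."""
    return abs(v1_us - v2_us) <= max(v1_spread, v2_spread)


# Values copied from the rotate table above (1 thread).
print(within_noise(2112, 854, 2026, 852))  # (3, 400, 400) / uint8 / cpu
print(within_noise(263, 33, 217, 3))       # (3, 400, 400) / uint8 / cuda
```

With this crude criterion the CPU row is flagged as within noise, while the CUDA row is not.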
[---------------- solarize @ torchvision==0.15.0a0+1098dad ----------------]
                                          |        v1       |        v2     
1 threads: -----------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   1088 (+-  5)  |   962 (+-  5) 
      (3, 400, 400) / uint8 / cuda        |    32 (+-  0)   |    23 (+-  0) 
      (3, 400, 400) / PIL                 |   316 (+-  1)   |   314 (+-  1) 
      (3, 400, 400) / float32 / cpu       |   2357 (+- 39)  |   2344 (+-  5)
      (3, 400, 400) / float32 / cuda      |    33 (+-  0)   |    27 (+-  0) 
      (16, 3, 400, 400) / uint8 / cpu     |  18286 (+- 67)  |  16803 (+- 67)
      (16, 3, 400, 400) / uint8 / cuda    |    59 (+-  0)   |    62 (+-  0) 
      (16, 3, 400, 400) / float32 / cpu   |  38606 (+- 85)  |  38526 (+-217)
      (16, 3, 400, 400) / float32 / cuda  |   157 (+-  0)   |   158 (+-  0) 
6 threads: -----------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |   1247 (+- 22)  |   1119 (+- 18)
      (3, 400, 400) / float32 / cpu       |   2502 (+- 30)  |   2493 (+- 59)
      (16, 3, 400, 400) / uint8 / cpu     |  18600 (+-399)  |  17198 (+-256)
      (16, 3, 400, 400) / float32 / cpu   |  38938 (+-405)  |  38945 (+-370)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +6.2% (improvement)
[--------------- ten_crop @ torchvision==0.15.0a0+1098dad ---------------]
                                          |       v1       |       v2     
1 threads: ---------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |  381 (+-  1)   |  337 (+-  1) 
      (3, 400, 400) / uint8 / cuda        |   97 (+-  0)   |   56 (+-  0) 
      (3, 400, 400) / PIL                 |  346 (+-  1)   |  309 (+-  1) 
      (3, 400, 400) / float32 / cpu       |  318 (+-  2)   |  273 (+-  1) 
      (3, 400, 400) / float32 / cuda      |   98 (+-  1)   |   56 (+-  0) 
      (16, 3, 400, 400) / uint8 / cpu     |  4648 (+- 12)  |  4602 (+- 18)
      (16, 3, 400, 400) / uint8 / cuda    |   99 (+-  0)   |   57 (+-  0) 
      (16, 3, 400, 400) / float32 / cpu   |  5918 (+- 81)  |  5887 (+- 60)
      (16, 3, 400, 400) / float32 / cuda  |   98 (+-  1)   |   56 (+-  0) 
6 threads: ---------------------------------------------------------------
      (3, 400, 400) / uint8 / cpu         |  426 (+-  5)   |  382 (+-  4) 
      (3, 400, 400) / float32 / cpu       |  367 (+-  4)   |  322 (+-  3) 
      (16, 3, 400, 400) / uint8 / cpu     |  4731 (+- 31)  |  4685 (+- 48)
      (16, 3, 400, 400) / float32 / cpu   |  5951 (+- 67)  |  5892 (+- 62)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +21.7% (improvement)

pmeier added the module: transforms, Perf, and prototype labels on Oct 24, 2022

This was referenced Oct 24, 2022

improve performance of {invert, solarize}_image_tensor #6819

Merged

remove unnecessary checks from posterize_image_tensor #6823

Closed

datumbox mentioned this issue Oct 28, 2022

[prototype] Add support of inplace on convert_format_bounding_box #6858

Merged

This was referenced Nov 1, 2022

[prototype] Optimize Center Crop performance #6880

Merged

[prototype] Gaussian Blur clean up #6888

Merged

pmeier mentioned this issue Nov 2, 2022

remove unnecessary checks from pad_image_tensor #6894

Merged

This was referenced Nov 10, 2022

[prototype] Port elastic and minor cleanups #6942

Merged

[prototype] Optimize and clean up all affine methods #6945

Merged

datumbox mentioned this issue Nov 28, 2022

[prototype] Remove _FT aliases from functional #6983

Merged

datumbox closed this as completed in #6983 Nov 28, 2022

pmeier mentioned this issue Dec 12, 2022

Enforce contiguous outputs on the transforms v2 kernels? #6839

Open

pmeier mentioned this issue Feb 27, 2023

A request: generalizing the design of affine transforms #7240

Open

pmeier commented Oct 24, 2022 (edited by datumbox)

datumbox commented Oct 24, 2022 (edited)

pmeier commented Oct 26, 2022

vfdev-5 commented Oct 28, 2022

vadimkantorov commented Oct 28, 2022

pmeier commented Oct 31, 2022 (edited)

datumbox commented Nov 10, 2022 (edited)

vadimkantorov commented Nov 10, 2022 (edited)

datumbox commented Nov 15, 2022 (edited)

Speed Benchmarks V1 vs V2

Summary

Speed Benchmarks

Training

Transform Classes

Functional Kernels
