Performance improvements for transforms v2 vs. v1 #6818
I did another deep dive into the ops in the second paragraph of #6818 (comment) and I'm fairly confident that there is little we can do to improve on our side. The only two things I found are:

1. Fixing this, we would get speed-ups for padding modes.
2. In there the image is guaranteed to be float and thus would not get any performance boost.

While I think both things mentioned above would be good to have in general, I don't think we should prioritize them.
Checking various options with

About the non-vectorized bitwise shifts, is there an issue in core?

I don't think so, but @alexsamardzic wanted to have a look at it. Edit: pytorch/pytorch#88607

@pmeier I'm keeping the list up-to-date with all linked PRs. I'm marking as
An interesting question is whether a sequence of these transformations can be fused with Inductor/Dynamo (or something else) to produce a fused, low-memory-access CPU kernel (working with uint8 or fp32?), and how that interacts with the randomness of whether or not a transform is applied.
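A minimal sketch of that idea, assuming a PyTorch build with torch.compile and a torchvision where the v2 functional ops are importable as torchvision.transforms.v2.functional (in the nightlies this issue targets they still lived under torchvision.prototype.transforms.functional); the pipeline, the chosen ops and the probability are illustrative only:

```python
import torch
from torchvision.transforms.v2 import functional as F  # assumed import path


def augment(img: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    # Toy pipeline: a randomly applied brightness jitter followed by a
    # uint8 -> float32 rescale. The data-dependent branch on torch.rand(())
    # is exactly the randomness mentioned above that a compiler has to
    # handle (today typically via a graph break).
    if torch.rand(()) < p:
        img = F.adjust_brightness(img, brightness_factor=1.2)
    return img.to(torch.float32) / 255.0


# Let Inductor trace the pipeline and try to fuse the element-wise work.
compiled_augment = torch.compile(augment)

img = torch.randint(0, 256, (3, 224, 224), dtype=torch.uint8)
out = compiled_augment(img)
```

Whether Inductor can actually emit a single low-memory-access kernel here, and how the random branch is handled, is exactly the open question raised above.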
Speed Benchmarks V1 vs V2

Summary

The Transforms V2 API is faster than V1 (stable) because it introduces several optimizations on the Transform Classes and Functional kernels. Summarizing the performance gains in a single number should be taken with a grain of salt because:

With the above in mind, here are some statistics that summarize the performance of the new API:

To estimate the above aggregate statistics we used this script on top of the detailed benchmarks: Aggregate Statistics

Speed Benchmarks

For all benchmarks below we use PyTorch nightly.

Training

To assess the performance in real-world applications, we trained a ResNet50 using TorchVision's SoTA recipe for a reduced number of 10 epochs across different setups:
Detailed Benchmarks

V1 using ad128b7 of main branch (PIL):
V1 using 46bd6d9 of #6952 (Tensor uint8):
V2 using 8b53036 of #6433 (PIL). Marginal median improvement of 1.64%:
V2 using bda072d of #6433 (Tensor uint8). Median improvement of 18.27%:
V2 using 8f07159 of #6433 (Tensor float32). Note that this configuration wasn't supported in V1 because not all kernels and augmentations supported floats:
Transform Classes

Generated using the following script, inspired by earlier iterations from @vfdev-5 and amended by @pmeier. We compare V1 against V2 for all kernels across many configurations (batch size, dtype, device, number of threads, etc.) and then estimate the average performance improvement across all configurations to summarize the end result. Detailed Benchmarks

Functional Kernels

Generated using @pmeier's script. We compare V1 against V2 for all kernels across many configurations (batch size, dtype, device, number of threads, etc.) and then estimate the average performance improvement across all configurations to summarize the end result. Detailed Benchmarks
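The aggregation script itself is the one linked above; the snippet below is only a rough sketch, with made-up numbers, of how per-configuration timings can be reduced to the kind of aggregate statistics quoted in the summary:

```python
import numpy as np

# Hypothetical median runtimes in microseconds, keyed by (kernel, configuration).
# The real benchmarks cover many more combinations of batch size, dtype,
# device and number of threads.
v1 = {("resize", "uint8/cpu/1 thread"): 1250.0, ("normalize", "float32/cpu/1 thread"): 310.0}
v2 = {("resize", "uint8/cpu/1 thread"): 1020.0, ("normalize", "float32/cpu/1 thread"): 295.0}

# Improvement of V2 over V1 per configuration, expressed as a percentage.
improvements = np.array([100.0 * (v1[k] - v2[k]) / v1[k] for k in v1])

print(f"mean improvement:    {improvements.mean():.2f}%")
print(f"median improvement:  {np.median(improvements):.2f}%")
print(f"worst configuration: {improvements.min():.2f}%")
```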
In addition to a lot of other goodies that transforms v2 will bring, we are also actively working on improving the performance. This is a tracker / overview issue of our progress.
Performance was measured with this benchmark script. Unless noted otherwise, the performance improvements reported below were computed on uint8 RGB images and videos while running single-threaded on CPU. You can find the full benchmark results alongside the benchmark script. The results will be updated continuously as new PRs that affect the kernels get merged.
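The numbers in this issue come from the linked benchmark script; the snippet below is only a hedged sketch of the same style of measurement (single-threaded, uint8 RGB tensor on CPU) using torch.utils.benchmark, and it assumes a torchvision where the v2 kernels are importable as torchvision.transforms.v2.functional:

```python
import torch
import torch.utils.benchmark as benchmark
from torchvision.transforms import functional as F_v1
from torchvision.transforms.v2 import functional as F_v2  # assumed import path

img = torch.randint(0, 256, (3, 400, 400), dtype=torch.uint8)  # uint8 RGB image

results = []
for sub_label, fn in [("v1", F_v1.resize), ("v2", F_v2.resize)]:
    timer = benchmark.Timer(
        stmt="fn(img, size, antialias=True)",
        globals={"fn": fn, "img": img, "size": [224, 224]},
        num_threads=1,  # the reported improvements are single-threaded CPU runs
        label="resize on uint8",
        sub_label=sub_label,
    )
    results.append(timer.blocked_autorange(min_run_time=1))

benchmark.Compare(results).print()
```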
Kernels:
- `adjust_brightness`: [proto] Speed up adjust color ops #6784
- `adjust_contrast`: [proto] Speed up adjust color ops #6784, [prototype] Speed up `adjust_contrast_image_tensor` #6933
- `adjust_gamma`: [prototype] Speed improvement for adjust gamma op #6820, replace tensor division with scalar division and tensor multiplication #6903
- `adjust_hue`: [proto] Speed improvements for adjust hue op #6805, replace tensor division with scalar division and tensor multiplication #6903, [prototype] Speed up `adjust_hue_image_tensor` #6938
- `adjust_saturation`: [proto] Speed up adjust color ops #6784, [prototype] Minor change on `adjust_saturation_image_tensor` uint8 #6940
- `adjust_sharpness`: [proto] Speed up adjust color ops #6784, [prototype] Speed up `adjust_sharpness_image_tensor` #6930
- `autocontrast`: [proto] Speed improvement for autocontrast op #6811, [prototype] Speed up `autocontrast_image_tensor` #6935, [prototype] Port elastic and minor cleanups #6942
- `equalize`: [proto] Small improvement for tensor equalize op #6738, [proto] Performance improvements for equalize op #6757, another round of perf improvements for equalize #6776
- `invert`: improve performance of {invert, solarize}_image_tensor #6819
- `posterize`: remove unnecessary checks from posterize_image_tensor #6823, extend support of posterize to all integer and floating dtypes #6847
- `solarize`: improve performance of {invert, solarize}_image_tensor #6819
- `affine`: Fix bug on prototype `pad` #6949
- `center_crop`: [prototype] Optimize Center Crop performance #6880, Fix bug on prototype `pad` #6949
- `crop`: Fix bug on prototype `pad` #6949
- `elastic`: [prototype] Port elastic and minor cleanups #6942
- `erase`: [prototype] Remove `_FT` aliases from functional #6983
- `five_crop`: composite kernel; Fix bug on prototype `pad` #6949
- `pad`: Fix bug on prototype `pad` #6949
- `perspective`: [proto] Small optim for perspective op on images #6907, Fix bug on prototype `pad` #6949
- `resize`: [prototype] Clean up and port the resize kernel in V2 #6892
- `resized_crop`: composite kernel; [prototype] Clean up and port the resize kernel in V2 #6892, Fix bug on prototype `pad` #6949
- `rotate`: Fix bug on prototype `pad` #6949
- `ten_crop`: composite kernel; Fix bug on prototype `pad` #6949
- `convert_color_space`: [proto] Speed up adjust color ops #6784, [prototype] Minor improvements on functional #6832
- `convert_dtype`: improve perf on convert_image_dtype and add tests #6795, replace tensor division with scalar division and tensor multiplication #6903. For `int` to `int` conversion we currently use a multiplication, but theoretically bit shifts are faster. However, the CPU kernels for bit shifts in PyTorch core are not vectorized, which makes them slower than a multiplication for regular-sized images (see the sketch after this list). Vectorized CPU code implementing left shift operator pytorch#88607
- `gaussian_blur`: [proto] Small optimization for gaussian_blur functional op #6762, [prototype] Gaussian Blur clean up #6888
- `normalize`: [prototype] Speed improvement for normalize op #6821

Transform Classes:

C++ (PyTorch core):

- `vertical_flip`: [prototype] Remove `_FT` aliases from functional #6983, Optimized vertical flip using memcpy pytorch#89414
- `horizontal_flip`: [prototype] Remove `_FT` aliases from functional #6983, Vectorized horizontal flip implementation pytorch#88989, Optimized vertical flip using memcpy pytorch#89414

cc @vfdev-5 @datumbox @bjuncek
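To make the `convert_dtype` note above concrete: for int-to-int up-conversion, scaling by a power of two can be written either as a multiplication or as a left shift. This is only a rough sketch on a typical image-sized tensor; the factor 2**23 (uint8 → int32 range) is illustrative, and the relative timings depend on whether the left-shift CPU kernel is vectorized in your build (tracked in pytorch#88607):

```python
import torch
import torch.utils.benchmark as benchmark

# A uint8 image promoted to int32; rescaling its value range by a power of
# two can be expressed as a multiplication or as a bitwise left shift.
img = torch.randint(0, 256, (3, 400, 400), dtype=torch.uint8).to(torch.int32)
factor, shift = 2 ** 23, 23  # illustrative uint8 -> int32 scaling

# Both formulations produce identical results for non-negative values.
assert torch.equal(img * factor, img << shift)

for name, stmt in [("multiplication", "img * factor"), ("left shift", "img << shift")]:
    measurement = benchmark.Timer(
        stmt=stmt,
        globals={"img": img, "factor": factor, "shift": shift},
        num_threads=1,
        sub_label=name,
    ).blocked_autorange(min_run_time=1)
    print(measurement)
```

As long as the left-shift kernel is not vectorized on CPU, the multiplication tends to win for regular-sized images, which is why the kernels keep using it for now.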