WIP - Speed improvements to resize convolution (no vpermps w/ FMA) #2793
public static Vector512<float> MultiplyAddEstimate(Vector512<float> a, Vector512<float> b, Vector512<float> c)
// Don't actually use FMA as it requires many more instructions to extract the
// upper and lower parts of the vector and then recombine them.
What's this about? When inlined, a helper wrapping Avx512F.FusedMultiplyAdd
will compile to a single instruction. One of:
Instruction: vfmadd132ps zmm, zmm, zmm
vfmadd213ps zmm, zmm, zmm
vfmadd231ps zmm, zmm, zmm
CPUID Flags: AVX512F
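A minimal sketch of the kind of helper described above, for illustration only. The class name `FmaSketch`, the method name `MultiplyAdd`, and the assumption that the result should be `(a * b) + c` are hypothetical and not taken from the PR; only `Avx512F.FusedMultiplyAdd` and `Avx512F.IsSupported` are real runtime APIs.

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper class; the name is an assumption for this sketch.
public static class FmaSketch
{
    // Assumed semantics: returns (a * b) + c for each lane.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector512<float> MultiplyAdd(Vector512<float> a, Vector512<float> b, Vector512<float> c)
    {
        if (Avx512F.IsSupported)
        {
            // Operates on the full 512-bit vector directly, so there is no
            // need to split it into upper and lower halves.
            return Avx512F.FusedMultiplyAdd(a, b, c);
        }

        // Non-fused fallback: rounds after the multiply and again after the add.
        return (a * b) + c;
    }
}
```

The `IsSupported` check is treated as a JIT-time constant, so only one branch should remain in the generated code. The fused and non-fused paths can differ in the last bit of the result because the fused form rounds once.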
Urgh... I didn't even think of Avx512F.FusedMultiplyAdd. I saw the equivalent method in the runtime PR exposing the functionality and blindly copied.
src/ImageSharp/Processing/Processors/Transforms/Resize/ResizeKernel.cs
Brain's not fully awake yet today, but I'll give the maths a look soon.
…ernel.cs Co-authored-by: Clinton Ingram <clinton.ingram@outlook.com>
Co-authored-by: Clinton Ingram <clinton.ingram@outlook.com>
Thanks for the review so far. I still haven't figured out what is going on with
It looks to me like the only differences are due to the change to single precision for kernel normalization and for calculation of the distances passed to the interpolation function. You'll definitely give up some accuracy there, and I'm not sure it's worth it since the kernels only have to be built once per resize. You can see here that @antonfirsov changed the precision to double from the initial implementation some years back. Since the periodic kernel map relies on each repetition of the kernel weights being exact, I can see how precision loss might lead to some differences when compared with a separate calculation per interval.

I've actually never looked at your implementation of the kernel map before, and now my curiosity is piqued because I arrived at something similar myself, but my implementation calculates each kernel window separately, and only replaces that interval with the periodic version if they match exactly. Part of this was due to a lack of confidence in the maths on my part, as I only discovered the periodicity of the kernel weights by observation and kind of intuited my way to a solution.

@antonfirsov would you mind filling in some gaps on the theory behind your periodic kernel map implementation? Did you use some paper or other implementation as inspiration, or did you arrive at it observationally like I did?
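To make the precision point concrete, here is a rough sketch that builds one kernel window twice, once with `double` intermediates and once with `float` intermediates, so the resulting weights can be compared. Everything in it is an assumption for illustration: the class `KernelPrecisionSketch`, the triangle filter, and the parameter names are hypothetical and do not reflect the actual `ResizeKernelMap` code.

```csharp
using System;

// Hypothetical sketch; not the ImageSharp implementation.
public static class KernelPrecisionSketch
{
    // Assumed interpolation function: a simple triangle (bilinear) filter.
    private static double Triangle(double x)
    {
        x = Math.Abs(x);
        return x < 1 ? 1 - x : 0;
    }

    // Distances and normalization computed in double, stored as float.
    public static float[] NormalizeDouble(double center, int left, int length, double scale)
    {
        double[] weights = new double[length];
        double sum = 0;
        for (int i = 0; i < length; i++)
        {
            weights[i] = Triangle((left + i - center) / scale);
            sum += weights[i];
        }

        float[] result = new float[length];
        for (int i = 0; i < length; i++)
        {
            result[i] = (float)(weights[i] / sum);
        }

        return result;
    }

    // Same calculation with every intermediate rounded to single precision.
    public static float[] NormalizeSingle(float center, int left, int length, float scale)
    {
        float[] result = new float[length];
        float sum = 0;
        for (int i = 0; i < length; i++)
        {
            result[i] = (float)Triangle((left + i - center) / scale);
            sum += result[i];
        }

        for (int i = 0; i < length; i++)
        {
            result[i] /= sum;
        }

        return result;
    }
}
```

Comparing the two arrays element by element for a range of centers gives a feel for how often the single precision path produces weights that are not bit-identical, which is the property an exact-match periodic map would depend on.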
Prerequisites
Description
Fixes #1515
This is a replacement for #1518 by @Sergio0694, with most of the work based upon his implementation. I've modernized some of the code and also added Vector512 support.

The Resize tests currently have four failing tests with minor differences, while the ResizeKernelMap tests have three individual failing tests. Turning off the periodic kernel map fixes the failing kernel map tests, so that is somehow related (I have no idea why).
I would like to get these issues fixed and merge this, because performance in the Playground Benchmarks looks really, really good. If anyone can spare some time to either provide assistance or point me in the right direction, please let me know.
CC @antonfirsov @saucecontrol