[WIP] SIMD-accelerated IDCT #146
Conversation
@fintelia, @HeroicKatora: This PR is not finalized, but I'd love to get your feedback early. @lilith: This may be of some interest to you.
I like this direction! We need to figure out a way to handle cases where packed_simd is unavailable, but perhaps just making it an optional feature and duplicating the dequantize_and_idct_block_* functions would be enough?
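For the record, one way the optional-feature idea could look: gate the SIMD version behind a Cargo feature and keep a scalar fallback with the same signature. This is only a sketch; the function body below is simplified to the dequantization step (the IDCT itself is omitted), and the feature name is assumed.

```rust
// Sketch only: an optional `packed_simd` Cargo feature selecting between two
// implementations of the same function. The body is simplified to
// dequantization (coefficient * quantization value); the real IDCT is omitted.

#[cfg(feature = "packed_simd")]
fn dequantize_block(coeffs: &[i16; 64], quant: &[u16; 64]) -> [i32; 64] {
    // The SIMD path using packed_simd vectors would live here.
    unimplemented!()
}

#[cfg(not(feature = "packed_simd"))]
fn dequantize_block(coeffs: &[i16; 64], quant: &[u16; 64]) -> [i32; 64] {
    // Plain scalar fallback: one multiply per coefficient.
    let mut out = [0i32; 64];
    for i in 0..64 {
        out[i] = coeffs[i] as i32 * quant[i] as i32;
    }
    out
}

fn main() {
    let coeffs = [2i16; 64];
    let mut quant = [3u16; 64];
    quant[0] = 16;
    let out = dequantize_block(&coeffs, &quant);
    assert_eq!(out[0], 32); // 2 * 16
    assert_eq!(out[1], 6);  // 2 * 3
    println!("ok");
}
```

Callers would not need to know which variant was compiled in, since both paths expose the same signature.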
Yes, we can do that. But ideally, I would have wanted even non-SIMD targets to benefit from this change, by using simulated SIMD, which is still faster than plain loops. Unfortunately, the API of ssimd seems to be out of date, and the crate seems to be unmaintained...
I have started to work on adapting ssimd to the latest packed-simd interface, so that we can use it with stable compilers here. The amount of work to do is moderate, since we use only a few SIMD types and operators.
This commit is a first step towards SIMD-accelerated IDCT. It optimizes only the first part of the IDCT, and no fallback has been implemented for non-nightly compilers.

Benchmark results:

    Benchmarking decode a 512x512 JPEG: Warming up for 3.0000 s
    Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 11.3s or reduce sample count to 40
    decode a 512x512 JPEG   time:   [2.2417 ms 2.2633 ms 2.2846 ms]
                            change: [-26.302% -23.320% -20.299%] (p = 0.00 < 0.05)
                            Performance has improved.
Benchmark:

    decode a 512x512 JPEG   time:   [2.1719 ms 2.1900 ms 2.2063 ms]
    decode a 512x512 JPEG   time:   [2.1308 ms 2.1523 ms 2.1742 ms]
The packed_simd feature enables SIMD. If the feature isn't enabled, ssimd is used, which uses plain structs to emulate SIMD vectors. When optimizations are enabled and the target architecture supports it, the compiler should still emit SIMD instructions for the ssimd code.
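To illustrate the idea (this is not the actual ssimd API, just a minimal sketch): a plain struct wrapping a fixed-size array, with element-wise operators, gives the compiler loops of known length that the optimizer can auto-vectorize on targets with SIMD support.

```rust
// Minimal sketch of emulating a SIMD vector with a plain struct.
// Element-wise loops over [i32; 8] are easily auto-vectorized by LLVM
// when the target supports it; on other targets they stay scalar.

use std::ops::Add;

#[derive(Clone, Copy, Debug, PartialEq)]
struct I32x8([i32; 8]);

impl I32x8 {
    // Broadcast one value into all lanes.
    fn splat(v: i32) -> I32x8 {
        I32x8([v; 8])
    }
}

impl Add for I32x8 {
    type Output = I32x8;
    fn add(self, rhs: I32x8) -> I32x8 {
        let mut out = [0i32; 8];
        for i in 0..8 {
            out[i] = self.0[i] + rhs.0[i];
        }
        I32x8(out)
    }
}

fn main() {
    let a = I32x8([0, 1, 2, 3, 4, 5, 6, 7]);
    let b = I32x8::splat(10);
    assert_eq!(a + b, I32x8([10, 11, 12, 13, 14, 15, 16, 17]));
    println!("ok");
}
```

The real ssimd code would of course cover more types (e.g. i16 and f32 lanes) and more operators, but the pattern is the same.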
This necessitates transposing the result matrix at the end of the IDCT, but benchmarking still shows a ~5% performance improvement with this (on the 512x512 JPEG decode benchmark, when compiled on Rust nightly with target-cpu=native and the packed_simd feature).
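For anyone following along, the transpose in question is just an 8x8 swap of rows and columns; a straightforward version (hypothetical helper, not the PR's actual code) looks like this:

```rust
// Illustrative 8x8 transpose over a row-major [i32; 64] block, as needed
// at the end of a row-wise SIMD IDCT. Not the PR's implementation, which
// may do this with vector shuffles instead.

fn transpose8x8(m: &[i32; 64]) -> [i32; 64] {
    let mut out = [0i32; 64];
    for r in 0..8 {
        for c in 0..8 {
            out[c * 8 + r] = m[r * 8 + c];
        }
    }
    out
}

fn main() {
    let mut m = [0i32; 64];
    for i in 0..64 {
        m[i] = i as i32;
    }
    let t = transpose8x8(&m);
    assert_eq!(t[1 * 8 + 0], m[0 * 8 + 1]); // element (0,1) moved to (1,0)
    assert_eq!(transpose8x8(&t), m);        // transposing twice is the identity
    println!("ok");
}
```

A SIMD implementation would typically replace the scalar loops with lane shuffles, but the data movement is the same.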
This greatly improves the benchmark times on non-simd targets
Final benchmark:
- Before this PR
- After this PR, all optimizations
Is there something you still wanted to do here? It's still marked as a draft, and the comment above alludes to a partial regression.
Hi @HeroicKatora! I wanted to switch from packed_simd to simdeez, but then I found an inconsistency in simdeez, for which I proposed a PR. It took some time, but it is now merged, so there is no longer a blocker. Are you interested in working on this, @HeroicKatora?
I see, and can totally feel how that would turn off some motivation. Yeah, the total of 20% performance is significant. Switching to a stable crate with … I'm looking for what project to focus on next after …
Hi all!
Yes, please pick this up!
Closes: #79
Please merge #144 first