-
-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance of SSE2 no_surf_alpha_opaque_dst blitter #2601
Improve performance of SSE2 no_surf_alpha_opaque_dst blitter #2601
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
My test results using the test program:
Before PR Test Results (multiple runs) - 799x100
--------------------------------------
9.54190182685852
9.535496711730957
9.523637771606445
9.511376142501831
After PR Test Results (multiple runs) - 799x100
--------------------------------------
4.210539102554321
4.189740419387817
4.187772512435913
4.1871418952941895
Before PR Test Results (multiple runs) - 800x600
--------------------------------------
5.045648813247681
5.021234035491943
5.04343843460083
5.013594388961792
After PR Test Results (multiple runs) - 800x600
--------------------------------------
4.173273324966431
4.166048526763916
4.1809751987457275
4.171604871749878
A very good speedup on the odd/large number of trailing pixels per row case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Major performance boosts on the Apple M2 Max
Before
<799x100>
8.609078884124756
8.535655975341797
8.516473054885864
<800x100>
4.360736131668091
4.322415113449097
4.34481406211853
After
<799x100>
3.417734146118164
3.400517702102661
3.3988189697265625
<800x100>
3.435307025909424
3.3835179805755615
3.361617088317871
I have the reviews necessary to merge, but I plan on leaving this open a few more days in case @itzpr3d4t0r wants to take a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside from a minor nitpick and a question, this looks good to me, good job!!
sub_dst = _mm_and_si128(sub_dst, mm_rgb_mask); | ||
/* reset alpha to 0 */ | ||
*dstp32 = _mm_cvtsi128_si32(sub_dst); | ||
/*[00][00][00][00][0A][0A][0A][0A] -> mm_src_alpha*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, could there be a way to reduce the amount of unpacks we do here? like with a shuffle? maybe like get the alpha in a register, shuffle that value in all other positions, and then right shift that? Could that save an instruction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah actually when experimenting with premul_alpha() i found out a shuffle is straight up better, replaces multiple instructions with one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course a shuffle would be better, the problem is that SSE2 does not have a byte-level shuffle instruction. SSSE3 does, AVX or AVX2 does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also the alpha fetching here is not dynamic to the actual location of the alpha channel. (Same as it's always been in this function).
I'd be willing to trade performance for making this work correctly on different pixelformats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question is "what amount of ARM systems justify a 3rd optimization level," where we put in an SSE4.2 level of SIMD for translation of advanced features like byte-level shuffles to NEON.
Benchmarks show a small decline on even surfaces, big (nearly 2x) improvement on odd surfaces.
97a6a7e
to
1f1ebf0
Compare
LOOP_UNROLLED4, "Duff's Device", doesn't handle looping 0 times properly, it needs a pre check before entering that there are loops to be had. I forgot to add this while porting the no_surf_alpha_opaque_dst to use the modern stride switching conventions (see pygame-community#2601)
Previously, this blitter used a 2x path if the blit had an even width, and a 1x path if the blit had an odd width. Now it uses a 4x path in all cases, with stride switching to handle excess pixels. ("x" = "pixels at a time" in this parlance, 2x = 2 pixels at a time, etc.)
The 4x strategy was inspired by the AVX blitters. There are operations it can do on 4 pixels at a time (load, store, alpha masking), then it goes into sub batches to process the first 2 pixels and the second 2 pixels.
I measure a 17% time reduction for an 800x100 blit, and a 56% time reduction for a 799x100 blit.
Why are the SSE blitters important? They are our cross platform accelerators. ARM uses them, using sse2neon. If we ever get WASM SIMD working it will use them. AVX2 is still way faster and shinier for x86, but SSE is important. In fact, it might become more important in coming years, such that we might want to raise our target to SSE3/SSE4 to take advantage of more advanced instructions.
Testing
#undef __AVX2__
to simd_blitters_avx2.c to make sure you're testing SSE2 blittersPerformance test script
Future things