-
-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimized SSE2 alpha blit algorithm #2766
base: main
Are you sure you want to change the base?
Optimized SSE2 alpha blit algorithm #2766
Conversation
Hey, this is a great avenue for a PR, the SSE alpha blitters are some of the oldest SIMD code we have, and we've learned new strategies since they were written. In the old algorithm, since it was going one pixel at a time in the odd path, it had a fast path where it would just copy over the pixel data if certain conditions were met. How does the performance now stack up against that fast path? Also, tell me more about the changes. Did you do any algorithmic changes or is it just changes to increase the parallelism? |
Anyway about this:
The changes are mainly focusing on improving parallelism as I'm following the sdl formula exactly. I tried to save a few unpack here and there but it's really not much different from before. |
Test program: from sys import stdout
from pstats import Stats
from cProfile import Profile
import pygame
pygame.init()
def speed_test_blits():
# chosen because they contain edge cases.
nums = [0, 1, 65, 126, 127, 199, 254, 255]
results = {}
for iterations in range(0, 10000):
for dst_r, dst_b, dst_a in zip(nums, reversed(nums), reversed(nums)):
for src_r, src_b, src_a in zip(nums, reversed(nums), nums):
src_surf = pygame.Surface((66, 66), pygame.SRCALPHA, 32)
src_surf.fill((src_r, 255, src_b, src_a))
dest_surf = pygame.Surface((66, 66), pygame.SRCALPHA, 32)
dest_surf.fill((dst_r, 255, dst_b, dst_a))
dest_surf.blit(src_surf, (0, 0))
key = ((dst_r, dst_b, dst_a), (src_r, src_b, src_a))
results[key] = dest_surf.get_at((65, 33))
if __name__ == "__main__":
profiler = Profile()
profiler.runcall(speed_test_blits)
stats = Stats(profiler, stream=stdout)
stats.strip_dirs()
stats.sort_stats("cumulative")
stats.print_stats() Results on my Raspberry Pi:
Looks like a solid improvement here when the SSE2 code is converted to NEON code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, LGTM 👍
I see a solid performance improvement heading down this code path and it passes all the unit tests. The SIMD code is a lot simpler now and more in tune with what we've been doing in recent blitters rather than these older versions. Much has been learned since!
Good job! 👍 🎈 🍾
Thanks for the ping on discord, this had slipped off my radar. Every time I write or review one of our blitter functions I write a fuzzer test to make sure the output is exactly the same before and after. import random
import hashlib
import pygame
pygame.init()
random.seed(36)
def populate_surf(surf):
for y in range(surf.get_height()):
for x in range(surf.get_width()):
surf.set_at(
(x, y),
(
random.randint(0, 255),
random.randint(0, 255),
random.randint(0, 255),
random.randint(0, 255),
),
)
sizes = [(random.randint(1, 50), random.randint(1, 50)) for _ in range(30)]
offsets = [(random.randint(1, 20), random.randint(1, 20)) for _ in range(30)]
expected_hashes = [
"d35958573201ec5788755bc9e3e471bb5415203144671643ad9bfb908f7df646",
"4f5102ac56cf56462c1fc66d395d9c44777a3b756b82a28b9beb8342f0d0287c",
"f59a078f67688d66b65a542ed8faa7eacefc91a27db2a5dc85ebc736324100a5",
"94bc01747d7a182bfc8adf39b43b00ec46c4733a348181529c64cdfbea615c39",
"e14cffeaeef21879ed6ec6c29171e9645ecd89dd95449914994f1a7ec9b9ccfd",
"a6d5f758b65088347f896043a92c084d000c1d9bb66319d8dda4f1f2f5f7757f",
"9d3940ae03ed0f985625c351736ac5ee07dec7e170f76370fdbd8b9257541a1f",
"83985622f2f39aebe2efa7b8765e567e832e79d7b42d842afb8cc064596ed5eb",
"3eb6988b34dd22689bb5a0b898b789dc47fd34eb4fe113831564bec02d9f4eb6",
"66b3c4d65ecb300f12f603a778841c1fa87506ca9923bf72f51f5773566446f3",
"50c8c736cc2bcf6197ecb5929b044bfdbde393f86e5e0373538c87e4e71ba645",
"04d04558ad1efe6432c3940293ebe614c68935e94a7eed1577d2cd1246a7e043",
"240c1db632a0f3c7d0804aca4d2a0937bf428a25bc9517686d0b313035283d0a",
"a70d794fde1ccbec2addbccd87566b3dc6d89d22969156184e68cfca4b26639b",
"1f7c02681cbb4ca7c4d6e21bc05b28ad3011984d77e0928960dfc4454573d97f",
"c421f931cb6cbbeb13aa194967e2575d4f94a9dcfd9d3db197fc8cf7de1cd3f6",
"6942e0a43afba737ff20ad0418a6a6700a166566f347707dbd80c28c9383e720",
"93a4be6d2aa71396fe6c8fcb78fad62c8d4e72a22df76c3d1dd9273123dbb15d",
"867551157ab0c352a5b730664087322e5757c4c50c01ce5778c55e520852c325",
"15b6c9a4051dd8f2499cd8244c4f0d3effab225ce6eba254e2e6f83da3156a82",
"fc280abba3b0c05cf8f3a3d67c8bf946a9509f15d00538c41ce9191bb2849dce",
"22c14fb6f991056287b2aceab9e9a3630bbacc00529564b91a85183cb2559c3c",
"31b70c1c6da2a55036d0139da7ee1d0171bc87ca80f5e8d3f8b60ef3d714fb16",
"e2e87c6cc6b07b64e8d9367c28c19d6cd4b712db2fd0514ac941def98470bcf7",
"a67aa0f76b96ac26c5bfa2f558431b5244a84fe40843d17a85d1bfdb80991a6f",
"488d7f0cc883e498575f261f52df7cc142cb0f0244d7722f28018c6e87542b31",
"c6ad14ba201acaba673381093fc7f04764cae6ab1b4c724cd4336dd25341ed7c",
"47573d029dd21db29351074501033a5c029f47e34d625d39e1be166d5c4ed537",
"1c3e54146c2a0d4918f64047525dc79fd45c2c3db74d9dea5dd4a9c862e32a87",
"1940432ae8813a746d4201c14ccc6456d93fb65680c3a69861967579ab14a03e",
]
hashes = []
for size, offset in zip(sizes, offsets):
source = pygame.Surface(size, pygame.SRCALPHA)
populate_surf(source)
dest = pygame.Surface(
(random.randint(1, 50), (random.randint(1, 50))), pygame.SRCALPHA
)
populate_surf(dest)
dest.blit(source, offset)
sha256 = hashlib.sha256()
sha256.update(pygame.image.tobytes(dest, "RGBA"))
digest = sha256.hexdigest()
assert digest == expected_hashes.pop(0)
hashes.append(digest)
print(digest)
print(hashes)
print("Success") This script passes successfully on both SSE2 and AVX2 implementations before this PR, but raises an exception in the SSE2 implementation in this PR. I have not tracked down exactly what is different, but this shows that the new implementation is not producing the same results as the old one. If I adjust the code to print instead of assert I can see that 22/30 cases are the same, it's just 8/30 doing something different. |
dstp = (Uint32 *)dstp128; | ||
|
||
for (i = 0; i < pxl_excess; i++) { | ||
if (!(*dstp & amask)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line definitely needs a comment.
With that in mind I noticed the old implementation had a lot more comments that linked lines of intrinsics to operations in the formula. It would be nice to preserve something like that if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree on this, but your program seems to pass on my machine, i got this output:
d35958573201ec5788755bc9e3e471bb5415203144671643ad9bfb908f7df646
4f5102ac56cf56462c1fc66d395d9c44777a3b756b82a28b9beb8342f0d0287c
f59a078f67688d66b65a542ed8faa7eacefc91a27db2a5dc85ebc736324100a5
94bc01747d7a182bfc8adf39b43b00ec46c4733a348181529c64cdfbea615c39
e14cffeaeef21879ed6ec6c29171e9645ecd89dd95449914994f1a7ec9b9ccfd
a6d5f758b65088347f896043a92c084d000c1d9bb66319d8dda4f1f2f5f7757f
9d3940ae03ed0f985625c351736ac5ee07dec7e170f76370fdbd8b9257541a1f
83985622f2f39aebe2efa7b8765e567e832e79d7b42d842afb8cc064596ed5eb
3eb6988b34dd22689bb5a0b898b789dc47fd34eb4fe113831564bec02d9f4eb6
66b3c4d65ecb300f12f603a778841c1fa87506ca9923bf72f51f5773566446f3
50c8c736cc2bcf6197ecb5929b044bfdbde393f86e5e0373538c87e4e71ba645
04d04558ad1efe6432c3940293ebe614c68935e94a7eed1577d2cd1246a7e043
240c1db632a0f3c7d0804aca4d2a0937bf428a25bc9517686d0b313035283d0a
a70d794fde1ccbec2addbccd87566b3dc6d89d22969156184e68cfca4b26639b
1f7c02681cbb4ca7c4d6e21bc05b28ad3011984d77e0928960dfc4454573d97f
c421f931cb6cbbeb13aa194967e2575d4f94a9dcfd9d3db197fc8cf7de1cd3f6
6942e0a43afba737ff20ad0418a6a6700a166566f347707dbd80c28c9383e720
93a4be6d2aa71396fe6c8fcb78fad62c8d4e72a22df76c3d1dd9273123dbb15d
867551157ab0c352a5b730664087322e5757c4c50c01ce5778c55e520852c325
15b6c9a4051dd8f2499cd8244c4f0d3effab225ce6eba254e2e6f83da3156a82
fc280abba3b0c05cf8f3a3d67c8bf946a9509f15d00538c41ce9191bb2849dce
22c14fb6f991056287b2aceab9e9a3630bbacc00529564b91a85183cb2559c3c
31b70c1c6da2a55036d0139da7ee1d0171bc87ca80f5e8d3f8b60ef3d714fb16
e2e87c6cc6b07b64e8d9367c28c19d6cd4b712db2fd0514ac941def98470bcf7
a67aa0f76b96ac26c5bfa2f558431b5244a84fe40843d17a85d1bfdb80991a6f
488d7f0cc883e498575f261f52df7cc142cb0f0244d7722f28018c6e87542b31
c6ad14ba201acaba673381093fc7f04764cae6ab1b4c724cd4336dd25341ed7c
47573d029dd21db29351074501033a5c029f47e34d625d39e1be166d5c4ed537
1c3e54146c2a0d4918f64047525dc79fd45c2c3db74d9dea5dd4a9c862e32a87
1940432ae8813a746d4201c14ccc6456d93fb65680c3a69861967579ab14a03e
['d35958573201ec5788755bc9e3e471bb5415203144671643ad9bfb908f7df646', '4f5102ac56cf56462c1fc66d395d9c44777a3b756b82a28b9beb8342f0d0287c', 'f59a078f67688d66b65a542ed8faa7eacefc91a27db2a5dc85ebc736324100a5', '94bc01747d7a182bfc8adf39b43b00ec46c4733a348181529c64cdfb
ea615c39', 'e14cffeaeef21879ed6ec6c29171e9645ecd89dd95449914994f1a7ec9b9ccfd', 'a6d5f758b65088347f896043a92c084d000c1d9bb66319d8dda4f1f2f5f7757f', '9d3940ae03ed0f985625c351736ac5ee07dec7e170f76370fdbd8b9257541a1f', '83985622f2f39aebe2efa7b8765e567e832e79d7b42d84
2afb8cc064596ed5eb', '3eb6988b34dd22689bb5a0b898b789dc47fd34eb4fe113831564bec02d9f4eb6', '66b3c4d65ecb300f12f603a778841c1fa87506ca9923bf72f51f5773566446f3', '50c8c736cc2bcf6197ecb5929b044bfdbde393f86e5e0373538c87e4e71ba645', '04d04558ad1efe6432c3940293ebe614c689
35e94a7eed1577d2cd1246a7e043', '240c1db632a0f3c7d0804aca4d2a0937bf428a25bc9517686d0b313035283d0a', 'a70d794fde1ccbec2addbccd87566b3dc6d89d22969156184e68cfca4b26639b', '1f7c02681cbb4ca7c4d6e21bc05b28ad3011984d77e0928960dfc4454573d97f', 'c421f931cb6cbbeb13aa194967
e2575d4f94a9dcfd9d3db197fc8cf7de1cd3f6', '6942e0a43afba737ff20ad0418a6a6700a166566f347707dbd80c28c9383e720', '93a4be6d2aa71396fe6c8fcb78fad62c8d4e72a22df76c3d1dd9273123dbb15d', '867551157ab0c352a5b730664087322e5757c4c50c01ce5778c55e520852c325', '15b6c9a4051dd8f2
499cd8244c4f0d3effab225ce6eba254e2e6f83da3156a82', 'fc280abba3b0c05cf8f3a3d67c8bf946a9509f15d00538c41ce9191bb2849dce', '22c14fb6f991056287b2aceab9e9a3630bbacc00529564b91a85183cb2559c3c', '31b70c1c6da2a55036d0139da7ee1d0171bc87ca80f5e8d3f8b60ef3d714fb16', 'e2e87c
6cc6b07b64e8d9367c28c19d6cd4b712db2fd0514ac941def98470bcf7', 'a67aa0f76b96ac26c5bfa2f558431b5244a84fe40843d17a85d1bfdb80991a6f', '488d7f0cc883e498575f261f52df7cc142cb0f0244d7722f28018c6e87542b31', 'c6ad14ba201acaba673381093fc7f04764cae6ab1b4c724cd4336dd25341ed7c', '47573d029dd21db29351074501033a5c029f47e34d625d39e1be166d5c4ed537', '1c3e54146c2a0d4918f64047525dc79fd45c2c3db74d9dea5dd4a9c862e32a87', '1940432ae8813a746d4201c14ccc6456d93fb65680c3a69861967579ab14a03e']
Success
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh and the pugame version:
pygame-ce 2.5.0.dev1 (SDL 2.30.1, Python 3.12.0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure you’re running the SSE2 codepath?
I also was testing on the branch rebased against develop, if it makes a difference.
e71d400
to
7f5ac33
Compare
@@ -392,6 +400,9 @@ alphablit_alpha_sse2_argb_no_surf_alpha(SDL_BlitInfo *info) | |||
mm_res_pixels = _mm_and_si128(mm_res_pixels, mm_rgb_mask); | |||
mm_res_pixels = _mm_or_si128(mm_res_pixels, mm_res_a); | |||
|
|||
mm_res_pixels = | |||
pg_mm_blendv_epi8(mm_res_pixels, mm_src, mask); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've figured out that this the equivalent of the if (!(*dstp & amask)) {
below, but when reading through this the first time I was very confused about why you added this to one codepath and not the other. This should be explained in a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes absolutely, I'll add a lot of comments but it will take a little while.
Retested the correctness with my script and with an expanded version that checks more random cases, I'm now highly confident that this PR exhibits identical behavior to the preceding implementation. |
Remake of #2680. Sort of a continuation of #2601. This should solve some issues with CI and clear the mess that was the previous PR.
The old strategy of processing 2 pixels at once for even widths or 1 pixel at a time for odd widths has been replaced. Now, it processes 4 pixels at once for (width // 4) iterations, and 1, 2, or 3 pixels as needed.
This change speeds up the algorithm by approximately 55% for even blits and 170% for odd blits.