Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of SSE2 no_surf_alpha_opaque_dst blitter #2601

Conversation

Starbuck5
Copy link
Member

@Starbuck5 Starbuck5 commented Dec 13, 2023

Previously, this blitter used a 2x path if the blit had an even width, and a 1x path if the blit had an odd width. Now it uses a 4x path in all cases, with stride switching to handle excess pixels. ("x" = "pixels at a time" in this parlance, 2x = 2 pixels at a time, etc.)

The 4x strategy was inspired by the AVX blitters. There are operations it can do on 4 pixels at a time (load, store, alpha masking), then it goes into sub batches to process the first 2 pixels and the second 2 pixels.

I measure a 17% time reduction for an 800x100 blit, and a 56% time reduction for a 799x100 blit.

Why are the SSE blitters important? They are our cross platform accelerators. ARM uses them, using sse2neon. If we ever get WASM SIMD working it will use them. AVX2 is still way faster and shinier for x86, but SSE is important. In fact, it might become more important in coming years, such that we might want to raise our target to SSE3/SSE4 to take advantage of more advanced instructions.

Testing

  • Add #undef __AVX2__ to simd_blitters_avx2.c to make sure you're testing SSE2 blitters
Performance test script
import pygame
import time

pygame.init()

width = 799

screen = pygame.Surface((width,100), depth=32)
screen.fill((60, 70, 80))

surf = pygame.Surface((width,100), pygame.SRCALPHA)
surf.fill((5,216,77,120))

print(screen, screen.get_shifts(), screen.get_alpha(), screen.get_colorkey())
print(surf, surf.get_shifts(), surf.get_alpha(), surf.get_colorkey())

# blit once before to get rid of any spurious warnings
screen.blit(surf, (0, 0))  

start = time.time()
for _ in range(100000):
    screen.blit(surf, (0, 0))    
print(time.time() - start)

Future things

  • It's very scary to me that this function is hardcoded to support ARGB and it seems like it could be called by any 4 byte alpha blit. This PR doesn't change the behavior at all, but I'd like this blitter to support varied alpha channel locations like the modern AVX2 ones.
  • This 4x strategy should be relevant to other things in simd_blitters_sse2, like the blend modes.
  • In the future this could be more macro-ed up, like our modern AVX2 blitters and the filler codes. It's not ready for that yet. This gets it closer. The other alpha blitters need significant work.

@Starbuck5 Starbuck5 requested a review from a team as a code owner December 13, 2023 07:59
@Starbuck5 Starbuck5 marked this pull request as draft December 13, 2023 07:59
@Starbuck5 Starbuck5 changed the title Improve alphablit alpha sse2 argb no surf alpha opaque dst Improve performance of SSE2 no_surf_alpha_opaque_dst blitter Dec 13, 2023
@Starbuck5 Starbuck5 added ARM ARM stuff Performance Related to the speed or resource usage of the project SIMD Surface pygame.Surface labels Dec 13, 2023
@Starbuck5 Starbuck5 marked this pull request as ready for review December 13, 2023 08:25
Copy link
Member

@MyreMylar MyreMylar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

My test results using the test program:

Before PR Test Results (multiple runs) - 799x100
--------------------------------------
9.54190182685852
9.535496711730957
9.523637771606445
9.511376142501831

After PR Test Results (multiple runs) - 799x100
--------------------------------------
4.210539102554321
4.189740419387817
4.187772512435913
4.1871418952941895



Before PR Test Results (multiple runs) - 800x600
--------------------------------------
5.045648813247681
5.021234035491943
5.04343843460083
5.013594388961792

After PR Test Results (multiple runs) - 800x600
--------------------------------------
4.173273324966431
4.166048526763916
4.1809751987457275
4.171604871749878

A very good speedup on the odd/large number of trailing pixels per row case.

Copy link
Member

@novialriptide novialriptide left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major performance boosts on the Apple M2 Max

Before

<799x100>
8.609078884124756
8.535655975341797
8.516473054885864

<800x100>
4.360736131668091
4.322415113449097
4.34481406211853

After

<799x100>
3.417734146118164
3.400517702102661
3.3988189697265625

<800x100>
3.435307025909424
3.3835179805755615
3.361617088317871

@Starbuck5
Copy link
Member Author

I have the reviews necessary to merge, but I plan on leaving this open a few more days in case @itzpr3d4t0r wants to take a look.

Copy link
Member

@itzpr3d4t0r itzpr3d4t0r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from a minor nitpick and a question, this looks good to me, good job!!

sub_dst = _mm_and_si128(sub_dst, mm_rgb_mask);
/* reset alpha to 0 */
*dstp32 = _mm_cvtsi128_si32(sub_dst);
/*[00][00][00][00][0A][0A][0A][0A] -> mm_src_alpha*/
Copy link
Member

@itzpr3d4t0r itzpr3d4t0r Dec 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, could there be a way to reduce the amount of unpacks we do here? like with a shuffle? maybe like get the alpha in a register, shuffle that value in all other positions, and then right shift that? Could that save an instruction?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah actually when experimenting with premul_alpha() i found out a shuffle is straight up better, replaces multiple instructions with one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course a shuffle would be better, the problem is that SSE2 does not have a byte-level shuffle instruction. SSSE3 does, AVX or AVX2 does.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the alpha fetching here is not dynamic to the actual location of the alpha channel. (Same as it's always been in this function).

I'd be willing to trade performance for making this work correctly on different pixelformats.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is "what amount of ARM systems justify a 3rd optimization level," where we put in an SSE4.2 level of SIMD for translation of advanced features like byte-level shuffles to NEON.

src_c/simd_blitters_sse2.c Show resolved Hide resolved
@Starbuck5 Starbuck5 force-pushed the improve-alphablit_alpha_sse2_argb_no_surf_alpha_opaque_dst branch from 97a6a7e to 1f1ebf0 Compare January 4, 2024 08:43
@Starbuck5 Starbuck5 merged commit 717d032 into pygame-community:main Jan 4, 2024
30 checks passed
@Starbuck5 Starbuck5 deleted the improve-alphablit_alpha_sse2_argb_no_surf_alpha_opaque_dst branch January 4, 2024 09:24
@itzpr3d4t0r itzpr3d4t0r added this to the 2.5.0 milestone Jan 4, 2024
Starbuck5 added a commit to Starbuck5/pygame-ce that referenced this pull request Jun 1, 2024
LOOP_UNROLLED4, "Duff's Device", doesn't handle looping 0 times properly, it needs a pre check before entering that there are loops to be had. I forgot to add this while porting the no_surf_alpha_opaque_dst to use the modern stride switching conventions (see pygame-community#2601)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ARM ARM stuff Performance Related to the speed or resource usage of the project SIMD Surface pygame.Surface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants