Improve performance of SSE2 no_surf_alpha_opaque_dst blitter #2601

Starbuck5 · 2023-12-13T07:59:03Z

Previously, this blitter used a 2x path if the blit had an even width, and a 1x path if the blit had an odd width. Now it uses a 4x path in all cases, with stride switching to handle excess pixels. ("x" = "pixels at a time" in this parlance, 2x = 2 pixels at a time, etc.)

The 4x strategy was inspired by the AVX blitters. There are operations it can do on 4 pixels at a time (load, store, alpha masking), then it goes into sub batches to process the first 2 pixels and the second 2 pixels.

I measure a 17% time reduction for an 800x100 blit, and a 56% time reduction for a 799x100 blit.

Why are the SSE blitters important? They are our cross platform accelerators. ARM uses them, using sse2neon. If we ever get WASM SIMD working it will use them. AVX2 is still way faster and shinier for x86, but SSE is important. In fact, it might become more important in coming years, such that we might want to raise our target to SSE3/SSE4 to take advantage of more advanced instructions.

Testing

Add #undef __AVX2__ to simd_blitters_avx2.c to make sure you're testing SSE2 blitters

Performance test script

import pygame
import time

pygame.init()

width = 799

screen = pygame.Surface((width,100), depth=32)
screen.fill((60, 70, 80))

surf = pygame.Surface((width,100), pygame.SRCALPHA)
surf.fill((5,216,77,120))

print(screen, screen.get_shifts(), screen.get_alpha(), screen.get_colorkey())
print(surf, surf.get_shifts(), surf.get_alpha(), surf.get_colorkey())

# blit once before to get rid of any spurious warnings
screen.blit(surf, (0, 0))  

start = time.time()
for _ in range(100000):
    screen.blit(surf, (0, 0))    
print(time.time() - start)

Future things

It's very scary to me that this function is hardcoded to support ARGB and it seems like it could be called by any 4 byte alpha blit. This PR doesn't change the behavior at all, but I'd like this blitter to support varied alpha channel locations like the modern AVX2 ones.
This 4x strategy should be relevant to other things in simd_blitters_sse2, like the blend modes.
In the future this could be more macro-ed up, like our modern AVX2 blitters and the filler codes. It's not ready for that yet. This gets it closer. The other alpha blitters need significant work.

MyreMylar

LGTM 👍

My test results using the test program:

Before PR Test Results (multiple runs) - 799x100
--------------------------------------
9.54190182685852
9.535496711730957
9.523637771606445
9.511376142501831

After PR Test Results (multiple runs) - 799x100
--------------------------------------
4.210539102554321
4.189740419387817
4.187772512435913
4.1871418952941895



Before PR Test Results (multiple runs) - 800x600
--------------------------------------
5.045648813247681
5.021234035491943
5.04343843460083
5.013594388961792

After PR Test Results (multiple runs) - 800x600
--------------------------------------
4.173273324966431
4.166048526763916
4.1809751987457275
4.171604871749878

A very good speedup on the odd/large number of trailing pixels per row case.

novialriptide

Major performance boosts on the Apple M2 Max

Before

<799x100>
8.609078884124756
8.535655975341797
8.516473054885864

<800x100>
4.360736131668091
4.322415113449097
4.34481406211853

After

<799x100>
3.417734146118164
3.400517702102661
3.3988189697265625

<800x100>
3.435307025909424
3.3835179805755615
3.361617088317871

Starbuck5 · 2023-12-15T05:47:48Z

I have the reviews necessary to merge, but I plan on leaving this open a few more days in case @itzpr3d4t0r wants to take a look.

itzpr3d4t0r

Aside from a minor nitpick and a question, this looks good to me, good job!!

itzpr3d4t0r · 2023-12-17T13:24:26Z

src_c/simd_blitters_sse2.c

-                    sub_dst = _mm_and_si128(sub_dst, mm_rgb_mask);
-                    /* reset alpha to 0 */
-                    *dstp32 = _mm_cvtsi128_si32(sub_dst);
+            /*[00][00][00][00][0A][0A][0A][0A] -> mm_src_alpha*/


Just curious, could there be a way to reduce the amount of unpacks we do here? like with a shuffle? maybe like get the alpha in a register, shuffle that value in all other positions, and then right shift that? Could that save an instruction?

Yeah actually when experimenting with premul_alpha() i found out a shuffle is straight up better, replaces multiple instructions with one.

Of course a shuffle would be better, the problem is that SSE2 does not have a byte-level shuffle instruction. SSSE3 does, AVX or AVX2 does.

Also the alpha fetching here is not dynamic to the actual location of the alpha channel. (Same as it's always been in this function).

I'd be willing to trade performance for making this work correctly on different pixelformats.

The question is "what amount of ARM systems justify a 3rd optimization level," where we put in an SSE4.2 level of SIMD for translation of advanced features like byte-level shuffles to NEON.

src_c/simd_blitters_sse2.c

Benchmarks show a small decline on even surfaces, big (nearly 2x) improvement on odd surfaces.

LOOP_UNROLLED4, "Duff's Device", doesn't handle looping 0 times properly, it needs a pre check before entering that there are loops to be had. I forgot to add this while porting the no_surf_alpha_opaque_dst to use the modern stride switching conventions (see pygame-community#2601)

Starbuck5 requested a review from a team as a code owner December 13, 2023 07:59

Starbuck5 marked this pull request as draft December 13, 2023 07:59

Starbuck5 changed the title ~~Improve alphablit alpha sse2 argb no surf alpha opaque dst~~ Improve performance of SSE2 no_surf_alpha_opaque_dst blitter Dec 13, 2023

Starbuck5 added ARM ARM stuff Performance Related to the speed or resource usage of the project SIMD Surface pygame.Surface labels Dec 13, 2023

Starbuck5 marked this pull request as ready for review December 13, 2023 08:25

MyreMylar approved these changes Dec 14, 2023

View reviewed changes

novialriptide approved these changes Dec 15, 2023

View reviewed changes

Starbuck5 requested a review from itzpr3d4t0r December 15, 2023 05:47

itzpr3d4t0r approved these changes Dec 17, 2023

View reviewed changes

Starbuck5 added 4 commits January 4, 2024 00:43

Unify even/odd width blit path

34c0b6d

Benchmarks show a small decline on even surfaces, big (nearly 2x) improvement on odd surfaces.

Upgrade alpha blitter from 2px to 4px batches

c43140e

Move core alpha_opaque_dst op to macro

8ee4f9b

Comments pass for alpha_opaque_dst blitter

1f1ebf0

Starbuck5 force-pushed the improve-alphablit_alpha_sse2_argb_no_surf_alpha_opaque_dst branch from 97a6a7e to 1f1ebf0 Compare January 4, 2024 08:43

Starbuck5 merged commit 717d032 into pygame-community:main Jan 4, 2024
30 checks passed

Starbuck5 deleted the improve-alphablit_alpha_sse2_argb_no_surf_alpha_opaque_dst branch January 4, 2024 09:24

itzpr3d4t0r added this to the 2.5.0 milestone Jan 4, 2024

itzpr3d4t0r mentioned this pull request Mar 24, 2024

Optimized SSE2 alpha blit algorithm #2766

Open

Starbuck5 mentioned this pull request Jun 1, 2024

Add missing check in SSE2 alpha blitter #2896

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance of SSE2 no_surf_alpha_opaque_dst blitter #2601

Improve performance of SSE2 no_surf_alpha_opaque_dst blitter #2601

Starbuck5 commented Dec 13, 2023 •

edited

Loading

MyreMylar left a comment

novialriptide left a comment •

edited

Loading

Starbuck5 commented Dec 15, 2023

itzpr3d4t0r left a comment •

edited

Loading

itzpr3d4t0r Dec 17, 2023 •

edited

Loading

itzpr3d4t0r Dec 23, 2023

Starbuck5 Jan 4, 2024

Starbuck5 Jan 4, 2024

Starbuck5 Jan 4, 2024

Improve performance of SSE2 no_surf_alpha_opaque_dst blitter #2601

Improve performance of SSE2 no_surf_alpha_opaque_dst blitter #2601

Conversation

Starbuck5 commented Dec 13, 2023 • edited Loading

MyreMylar left a comment

Choose a reason for hiding this comment

novialriptide left a comment • edited Loading

Choose a reason for hiding this comment

Before

After

Starbuck5 commented Dec 15, 2023

itzpr3d4t0r left a comment • edited Loading

Choose a reason for hiding this comment

itzpr3d4t0r Dec 17, 2023 • edited Loading

Choose a reason for hiding this comment

itzpr3d4t0r Dec 23, 2023

Choose a reason for hiding this comment

Starbuck5 Jan 4, 2024

Choose a reason for hiding this comment

Starbuck5 Jan 4, 2024

Choose a reason for hiding this comment

Starbuck5 Jan 4, 2024

Choose a reason for hiding this comment

Starbuck5 commented Dec 13, 2023 •

edited

Loading

novialriptide left a comment •

edited

Loading

itzpr3d4t0r left a comment •

edited

Loading

itzpr3d4t0r Dec 17, 2023 •

edited

Loading