Add a SSE2 fast path for AMD GPU #827

Draft
wants to merge 3 commits into
base: master

WickedShell

This does a couple of things:

  • Rearranges the core structure to be a structure of arrays rather than an array of structures. This improves cache hits when summarizing the results, saves 400 bytes of stack space due to better alignment, and is a speedup by itself.
  • Moves to storing the result directly in the GPU structure and memcpying it. This saves us from some handling of fields that aren't actually exported, and is a bit less future maintenance.
  • Adds support for using SSE2 to summarize the results. More could be made faster, particularly if we raised the minimum target above SSE2, but SSE2 is guaranteed on any 64-bit x86 build, which seemed like a reasonable minimum. Loose benchmarking on my machine shows this is faster; I need to formalize the results now that I've pushed an actual coherent branch rather than just the experiments.
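
To illustrate the layout change in the first bullet, here is a minimal sketch of an array of structures versus a structure of arrays. The field names (`temp_c`, `power_w`, `busy_pct`), `SAMPLE_COUNT`, and `average_power` are hypothetical and are not taken from this branch; the point is only that per-sample padding disappears and that a summarizing loop walks one contiguous array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SAMPLE_COUNT 64 /* hypothetical number of samples per summary */

/* Array of structures: every sample carries its own alignment padding.
 * On typical ABIs the float forces 2 bytes of padding after temp_c and
 * 2 bytes of tail padding after busy_pct. */
struct sample_aos {
    uint16_t temp_c;
    float    power_w;
    uint16_t busy_pct;
};

/* Structure of arrays: each field is stored contiguously, so there is no
 * per-sample padding and a loop over one field touches only that field. */
struct samples_soa {
    float    power_w[SAMPLE_COUNT];
    uint16_t temp_c[SAMPLE_COUNT];
    uint16_t busy_pct[SAMPLE_COUNT];
};

/* Average one field over the first `count` samples. */
static float average_power(const struct samples_soa *s, size_t count)
{
    assert(count > 0 && count <= SAMPLE_COUNT);
    float sum = 0.0f;
    for (size_t i = 0; i < count; i++)
        sum += s->power_w[i];
    return sum / (float)count;
}
```

Comparing `sizeof(struct samples_soa)` with `SAMPLE_COUNT * sizeof(struct sample_aos)` shows the padding saved for this toy layout; the figure for the real structure depends on its actual fields.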

Open questions with this work:

  1. Do we want any manually written SIMD loops included in the build? It makes readability a bit worse, but since it is hidden in a macro it might not be too bad.
  2. Verify the timings to justify it. (Ideally on something with an APU, such as a Steam Deck.)
  3. I've never worked with Meson before. The SIMD detection appears to be working, but I think what I've currently presented doesn't actually allow you to disable SSE2 if the build machine supports it. Is there a better way I should be probing for SSE2 support?
  4. Remove the debug timing commits.
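
For reference on the readability question, a hand-written SSE2 loop with a scalar fallback might look roughly like the sketch below. It is not the code in this branch: `sum_u16` is a hypothetical helper, and the only name borrowed from the PR is the `USE_SSE2` define. It sums 16-bit samples by widening them to 32-bit lanes, using only SSE2 integer intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef USE_SSE2
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Sum n 16-bit samples into a 32-bit total (hypothetical helper). */
static uint32_t sum_u16(const uint16_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    uint32_t total = 0;
    size_t i = 0;
#ifdef USE_SSE2
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();
    /* Process 8 samples per iteration. */
    for (; i + 8 <= n; i += 8) {
        __m128i x = _mm_loadu_si128((const __m128i *)(v + i));
        /* Interleaving with zero widens each 16-bit value to 32 bits. */
        acc = _mm_add_epi32(acc, _mm_unpacklo_epi16(x, zero));
        acc = _mm_add_epi32(acc, _mm_unpackhi_epi16(x, zero));
    }
    /* Horizontal sum of the four 32-bit lanes. */
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    /* Scalar tail; this is also the whole loop when USE_SSE2 is not defined. */
    for (; i < n; i++)
        total += v[i];
    return total;
}
```

Because the scalar loop doubles as the fallback, building without `-DUSE_SSE2` gives the same results, which makes it straightforward to time both variants against each other.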

This saves memory because of the difference in structure padding. As a
side effect storage of unused fields has been removed from this to
save more time and effort.
Contributor

misyltoad commented Aug 23, 2022

I can see moving from AOS to SOA making a difference, but.

Does the SSE2 stuff actually make any difference? I am guessing not? Compilers are good at vectorizing in $CURRENT_YEAR

(pls provide numbers + compile flags you used)

uint16_t soc_temp_c;
uint16_t gpu_temp_c;
uint16_t apu_cpu_temp_c;
#ifdef AMG_GPU_TEMP_MONITORING
Contributor

AMG? Did you mean AMD?

Comment on lines +62 to +76
# Check for SSE2
if cc.compiles('''#include <emmintrin.h>
    int main() {
      __m128 v1 = _mm_set1_ps(-1.0f);
      __m128 v2 = _mm_set1_ps(1.0f);
      v1 = _mm_add_ps(v1, v2);
      float sum[4];
      _mm_store_ps(sum, v1);
      return (int)sum[0];
    }''',
    name : 'SSE2 support',
    args : '-msse2')
  pre_args += '-DUSE_SSE2'
  pre_args += '-msse2'
endif
Contributor

@stephanlachnit stephanlachnit Oct 11, 2022

Maybe use the SIMD module?
