
[Proof of Concept] GPU kernel using block-local shared memory #73

Draft · wants to merge 6 commits into main
Conversation

efaulhaber (Member)
This is a proof-of-concept implementation of a more advanced kernel that manually makes use of block-local shared memory.
Each GPU block is associated with a neighborhood search (NHS) cell. All threads in a block first cooperatively load the data of a neighboring cell into shared memory before working on it. This allows for coalesced memory accesses.
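For illustration, the pattern described above could be sketched with KernelAbstractions.jl roughly as follows. This is a minimal, simplified sketch, not the code from this PR: `BLOCK_SIZE`, `cell_of_block`, `neighboring_cells`, `points_in_cell` and `interact!` are hypothetical placeholder names, and the real kernel handles more details (dimensions, cell sizes larger than a block, etc.).

```julia
using KernelAbstractions

const BLOCK_SIZE = 256

@kernel function shared_memory_kernel!(result, coords, cells)
    tid = @index(Local, Linear)    # thread index within the block
    block = @index(Group, Linear)  # one block per NHS cell (assumption)

    # Block-local shared memory buffering the coordinates of one neighboring cell
    neighbor_coords = @localmem eltype(coords) (3, BLOCK_SIZE)

    cell = cell_of_block(cells, block)  # hypothetical helper
    for neighbor_cell in neighboring_cells(cells, cell)  # hypothetical helper
        points = points_in_cell(cells, neighbor_cell)    # hypothetical helper

        # Coalesced load: consecutive threads read consecutive neighbors
        if tid <= length(points)
            for dim in 1:3
                neighbor_coords[dim, tid] = coords[dim, points[tid]]
            end
        end
        @synchronize  # wait until the whole cell is staged in shared memory

        # Every thread now interacts with the cached neighbor data
        if tid <= length(points)
            interact!(result, coords, neighbor_coords, tid)  # hypothetical helper
        end
        @synchronize  # don't overwrite the buffer before all threads are done
    end
end
```

The two `@synchronize` barriers are the essential part of the pattern: the first makes the staged data visible to all threads in the block, the second prevents the next iteration from overwriting shared memory while it is still being read.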

Unfortunately, this kernel is almost 2x slower than the original simple implementation on an H100.

Thanks to @vchuravy for developing this kernel with me.

efaulhaber (Member, Author) commented Nov 8, 2024

[Benchmark plot] On AMD GPUs, this kernel performs much better than the simple default. On Nvidia GPUs, it's much worse.


codecov bot commented Nov 8, 2024

Codecov Report

Attention: Patch coverage is 0% with 127 lines in your changes missing coverage. Please review.

Project coverage is 70.24%. Comparing base (fde9a79) to head (80cf700).
Report is 1 commit behind head on main.

Files with missing lines   Patch %   Lines
src/nhs_grid.jl            0.00%     127 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main      #73       +/-   ##
===========================================
- Coverage   88.15%   70.24%   -17.92%     
===========================================
  Files          15       15               
  Lines         498      625      +127     
===========================================
  Hits          439      439               
- Misses         59      186      +127     
Flag Coverage Δ
unit 70.24% <0.00%> (-17.92%) ⬇️


@efaulhaber efaulhaber changed the title [PoC] Add GPU kernel using shared memory [Proof of Concept] Add GPU kernel using shared memory Nov 12, 2024
@efaulhaber efaulhaber self-assigned this Nov 14, 2024
@efaulhaber efaulhaber changed the title [Proof of Concept] Add GPU kernel using shared memory [Proof of Concept] GPU kernel using block-local shared memory Nov 22, 2024
vchuravy (Member) commented Dec 4, 2024

> On AMD GPUs, this kernel performs much better than the simple default. On Nvidia GPUs, it's much worse.

Looking at the plot again, this seems to be true only for the 3090? The H100 also got faster?

efaulhaber (Member, Author) commented Dec 4, 2024

No, the H100 is also ~2x slower with this kernel. The colors are not great in this plot.

efaulhaber (Member, Author)
With JuliaGPU/Metal.jl#480, JuliaGPU/Metal.jl#487 and JuliaGPU/Metal.jl#488, all kernels now work on Apple Silicon GPUs.
While the original kernel takes 9 s for the largest problem size, the localmem variant takes 68 s and the double-buffer variant 85 s.
Why is this kernel so slow on Nvidia and Apple GPUs?

[Benchmark plot]
