
[Proof of Concept] GPU kernel using block-local shared memory #73

Draft · wants to merge 6 commits into main
Conversation

efaulhaber (Member)
This is a proof-of-concept implementation of a more advanced kernel that manually makes use of block-local shared memory.
Each GPU block is associated with a neighborhood search (NHS) cell. All threads in a block first cooperatively load the data of a neighboring cell into shared memory before working on it. This allows for coalesced memory accesses.
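For illustration, the pattern described above could be sketched with KernelAbstractions.jl roughly as follows. This is a minimal, simplified sketch, not the code from this PR: `BLOCK_SIZE`, `cell_of_block`, `neighboring_cells`, `points_in_cell` and `interact!` are hypothetical placeholder names, and the real kernel handles more details (dimensions, cell sizes larger than a block, etc.).

```julia
using KernelAbstractions

const BLOCK_SIZE = 256

@kernel function shared_memory_kernel!(result, coords, cells)
    tid = @index(Local, Linear)    # thread index within the block
    block = @index(Group, Linear)  # one block per NHS cell (assumption)

    # Block-local shared memory buffering the coordinates of one neighboring cell
    neighbor_coords = @localmem eltype(coords) (3, BLOCK_SIZE)

    cell = cell_of_block(cells, block)  # hypothetical helper
    for neighbor_cell in neighboring_cells(cells, cell)  # hypothetical helper
        points = points_in_cell(cells, neighbor_cell)    # hypothetical helper

        # Coalesced load: consecutive threads read consecutive neighbors
        if tid <= length(points)
            for dim in 1:3
                neighbor_coords[dim, tid] = coords[dim, points[tid]]
            end
        end
        @synchronize  # wait until the whole cell is staged in shared memory

        # Every thread now interacts with the cached neighbor data
        if tid <= length(points)
            interact!(result, coords, neighbor_coords, tid)  # hypothetical helper
        end
        @synchronize  # don't overwrite the buffer before all threads are done
    end
end
```

The two `@synchronize` barriers are the essential part of the pattern: the first makes the staged data visible to all threads in the block, the second prevents the next iteration from overwriting shared memory while it is still being read.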

Unfortunately, this kernel is almost 2x slower than the original simple implementation on an H100.

Thanks to @vchuravy for developing this kernel with me.

efaulhaber (Member, Author) commented Nov 8, 2024

[Benchmark plot] On AMD GPUs, this kernel performs much better than the simple default. On Nvidia GPUs, it's much worse.


codecov bot commented Nov 8, 2024

Codecov Report

Attention: Patch coverage is 0% with 127 lines in your changes missing coverage. Please review.

Project coverage is 70.24%. Comparing base (fde9a79) to head (80cf700).
Report is 1 commit behind head on main.

Files with missing lines   Patch %   Lines
src/nhs_grid.jl            0.00%     127 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main      #73       +/-   ##
===========================================
- Coverage   88.15%   70.24%   -17.92%     
===========================================
  Files          15       15               
  Lines         498      625      +127     
===========================================
  Hits          439      439               
- Misses         59      186      +127     
Flag Coverage Δ
unit 70.24% <0.00%> (-17.92%) ⬇️


@efaulhaber efaulhaber changed the title [PoC] Add GPU kernel using shared memory [Proof of Concept] Add GPU kernel using shared memory Nov 12, 2024
@efaulhaber efaulhaber self-assigned this Nov 14, 2024
@efaulhaber efaulhaber changed the title [Proof of Concept] Add GPU kernel using shared memory [Proof of Concept] GPU kernel using block-local shared memory Nov 22, 2024
vchuravy (Member) commented Dec 4, 2024

> On AMD GPUs, this kernel performs much better than the simple default. On Nvidia GPUs, it's much worse.

Looking at the plot again, this seems to be true only for the 3090? The H100 also got faster?

efaulhaber (Member, Author) commented Dec 4, 2024

No, the H100 is also ~2x slower with this kernel. The colors are not great in this plot.

efaulhaber (Member, Author)
With JuliaGPU/Metal.jl#480, JuliaGPU/Metal.jl#487 and JuliaGPU/Metal.jl#488, all kernels now work on Apple Silicon GPUs.
While the original kernel takes 9 s for the largest problem size, the localmem variant takes 68 s and the double-buffer variant 85 s.
Why is this kernel so slow on Nvidia and Apple GPUs?

[Benchmark plot]
