Move strided batch pointer conversion to GPU #2601

Open
THargreaves wants to merge 2 commits into master

Conversation

THargreaves

Relates to #2592

Modifications and comments welcome.
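For reference, the gist of the change in sketch form (illustrative only; the helper name and launch configuration below are made up, not the exact code in lib/cublas/wrappers.jl):

```julia
using CUDA

# Minimal sketch of the idea: call `pointer` on the host only once, for the
# base address, and compute the per-batch pointers with a small kernel.
function strided_batch_pointers(strided::CuArray{T}, stride::Integer,
                                nbatch::Integer) where {T}
    ptrs = CuArray{CuPtr{T}}(undef, nbatch)
    base = pointer(strided)  # the one expensive host-side pointer call

    function kernel(ptrs, base, stride)
        i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
        if i <= length(ptrs)
            # CuPtr arithmetic works in bytes, like Ptr arithmetic
            @inbounds ptrs[i] = base + (i - 1) * stride * sizeof(T)
        end
        return
    end

    @cuda threads=256 blocks=cld(nbatch, 256) kernel(ptrs, base, stride)
    return ptrs
end
```

The resulting `ptrs` array can then be passed to the CUBLAS batched API as before.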


codecov bot commented Dec 20, 2024

Codecov Report

Attention: Patch coverage is 46.15385% with 7 lines in your changes missing coverage. Please review.

Project coverage is 73.50%. Comparing base (830c49b) to head (953c796).
Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
lib/cublas/wrappers.jl 46.15% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2601      +/-   ##
==========================================
- Coverage   73.51%   73.50%   -0.02%     
==========================================
  Files         157      157              
  Lines       15207    15218      +11     
==========================================
+ Hits        11180    11186       +6     
- Misses       4027     4032       +5     


github-actions bot (Contributor) left a comment

CUDA.jl Benchmarks

Benchmark suite Current: 953c796 Previous: 62564aa Ratio
latency/precompile 45168542210 ns 45352888677 ns 1.00
latency/ttfp 6429231913.5 ns 6365236644.5 ns 1.01
latency/import 3045177395 ns 3028318014.5 ns 1.01
integration/volumerhs 9554999 ns 9566694 ns 1.00
integration/byval/slices=1 146470 ns 146977 ns 1.00
integration/byval/slices=3 425314 ns 425622 ns 1.00
integration/byval/reference 144902 ns 144997 ns 1.00
integration/byval/slices=2 286364 ns 286278 ns 1.00
integration/cudadevrt 103469 ns 103635 ns 1.00
kernel/indexing 14403 ns 14502 ns 0.99
kernel/indexing_checked 15257 ns 15491 ns 0.98
kernel/occupancy 693.4575163398692 ns 685.9934210526316 ns 1.01
kernel/launch 2086 ns 2211.5555555555557 ns 0.94
kernel/rand 16453 ns 15495 ns 1.06
array/reverse/1d 19368 ns 19288.5 ns 1.00
array/reverse/2d 23258 ns 25407 ns 0.92
array/reverse/1d_inplace 9968 ns 11091 ns 0.90
array/reverse/2d_inplace 11558 ns 12732 ns 0.91
array/copy 20492 ns 20971 ns 0.98
array/iteration/findall/int 156404 ns 159562 ns 0.98
array/iteration/findall/bool 137197 ns 139396 ns 0.98
array/iteration/findfirst/int 153397 ns 153501 ns 1.00
array/iteration/findfirst/bool 154184.5 ns 154850 ns 1.00
array/iteration/scalar 76406 ns 75791 ns 1.01
array/iteration/logical 209655.5 ns 216533 ns 0.97
array/iteration/findmin/1d 40407 ns 41336 ns 0.98
array/iteration/findmin/2d 93014 ns 94735 ns 0.98
array/reductions/reduce/1d 34611 ns 42301 ns 0.82
array/reductions/reduce/2d 46237.5 ns 45473.5 ns 1.02
array/reductions/mapreduce/1d 32218 ns 40507 ns 0.80
array/reductions/mapreduce/2d 43278 ns 47968 ns 0.90
array/broadcast 21168.5 ns 21824 ns 0.97
array/copyto!/gpu_to_gpu 13385 ns 13701 ns 0.98
array/copyto!/cpu_to_gpu 209733 ns 213728 ns 0.98
array/copyto!/gpu_to_cpu 245916 ns 245756 ns 1.00
array/accumulate/1d 108625 ns 108365 ns 1.00
array/accumulate/2d 79320.5 ns 80195 ns 0.99
array/construct 1204.95 ns 1204.7 ns 1.00
array/random/randn/Float32 42140 ns 43838 ns 0.96
array/random/randn!/Float32 26056 ns 26350 ns 0.99
array/random/rand!/Int64 27164 ns 27179 ns 1.00
array/random/rand!/Float32 8737.333333333334 ns 8916.333333333334 ns 0.98
array/random/rand/Int64 29715 ns 30059 ns 0.99
array/random/rand/Float32 12729 ns 12950 ns 0.98
array/permutedims/4d 66562 ns 67259 ns 0.99
array/permutedims/2d 55897 ns 56558.5 ns 0.99
array/permutedims/3d 58481 ns 59379 ns 0.98
array/sorting/1d 2931270.5 ns 2932110 ns 1.00
array/sorting/by 3497571 ns 3498470 ns 1.00
array/sorting/2d 1084182 ns 1085408 ns 1.00
cuda/synchronization/stream/auto 1029.7 ns 1025.5 ns 1.00
cuda/synchronization/stream/nonblocking 6521.6 ns 6545.6 ns 1.00
cuda/synchronization/stream/blocking 811.1630434782609 ns 796.0645161290323 ns 1.02
cuda/synchronization/context/auto 1177.7 ns 1190.9 ns 0.99
cuda/synchronization/context/nonblocking 6703.5 ns 6739.6 ns 0.99
cuda/synchronization/context/blocking 915.9777777777778 ns 903.25 ns 1.01

This comment was automatically generated by a workflow using github-action-benchmark.

maleadt (Member) left a comment

Why not pass the array in its entirety to the kernel, so you can keep the `pointer(strided, (i-1)*stride + 1)` intact (but now device-side) instead of having to roll your own address calculation? It doesn't matter much in practice, but I'm just trying to keep the code as generic as possible (in case we ever add strides to `CuArray`, as suggested in #1298).
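Roughly something like this (a sketch, not actual code from the PR; the `reinterpret` glossing the device-to-host pointer type conversion is an assumption about how the result would be stored):

```julia
using CUDA

# Sketch: pass the array itself; inside the kernel `strided` is a
# CuDeviceArray, so `pointer(strided, i)` is plain address arithmetic.
function batch_pointers_kernel!(ptrs, strided::AbstractArray{T}, stride) where {T}
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(ptrs)
        dptr = pointer(strided, (i - 1) * stride + 1)  # device-side, cheap
        # device pointers are LLVMPtr-based; bitcast to the host-facing
        # CuPtr type that the batched CUBLAS API expects
        @inbounds ptrs[i] = reinterpret(CuPtr{T}, dptr)
    end
    return
end

# usage: 10 batches of 4×4 Float32 matrices
strided = CUDA.rand(Float32, 4, 4, 10)
stride, nbatch = 16, 10
ptrs = CuArray{CuPtr{Float32}}(undef, nbatch)
@cuda threads=256 blocks=cld(nbatch, 256) batch_pointers_kernel!(ptrs, strided, stride)
```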

THargreaves (Author)

Wouldn't that put us back at the same point as we were before, using the very slow `pointer` repeatedly, just now on the GPU?

My thought process was to perform the `pointer` call only once, since that's the expensive bit, and then leave the remaining simple logic in the kernel.

What if we restricted this fast method to dense arrays, and then used a fallback based on `pointer` or `Base._memory_offset` for all other cases?
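Something along these lines (hypothetical sketch; `batch_pointers` is a made-up name, the fast path stands in for the kernel sketched above, and the fallback mirrors the pre-PR behaviour):

```julia
using CUDA

# Fast path: dense arrays, where a single base pointer plus device-side
# offset arithmetic suffices (e.g. the kernel sketched earlier).
batch_pointers(strided::CUDA.DenseCuArray, stride, nbatch) =
    strided_batch_pointers(strided, stride, nbatch)

# Fallback for everything else, equivalent to the pre-PR behaviour:
# one host-side `pointer` call per batch (slow, but general).
batch_pointers(strided::AbstractArray{T}, stride, nbatch) where {T} =
    CuArray([pointer(strided, (i - 1) * stride + 1) for i in 1:nbatch])
```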

Am I correct in thinking that, even with strided arrays, we'd still need the more general case, in case the strided batched operation is performed on a non-contiguous view?

maleadt (Member) commented Dec 21, 2024

> Wouldn't that put us back at the same point as we were before, using the very slow `pointer` repeatedly, just now on the GPU?

No. You're dealing with different array types on the CPU and the GPU. On the CPU, `pointer` calls into an expensive operation that performs lots of bookkeeping (which I linked to in the issue: configuring memory pools, verifying accessibility, etc.). On the GPU, you're dealing with `CuDeviceArray`s, for which `pointer` is a simple memory address calculation.
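Concretely (a toy illustration of the distinction, not code from the PR):

```julia
using CUDA

A = CUDA.rand(Float32, 1024)

# Host side: `A` is a CuArray; `pointer(A, i)` goes through CUDA.jl's host
# machinery before yielding a CuPtr.
p_host = pointer(A, 5)

# Device side: inside the kernel `A` has become a CuDeviceArray, and the
# same call lowers to base-address-plus-offset arithmetic.
function kernel(A)
    p_dev = pointer(A, 5)
    return
end
@cuda kernel(A)
```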
