Move strided batch pointer conversion to GPU #2601

Open
THargreaves wants to merge 2 commits into master

Conversation

THargreaves

Relates to #2592

Modifications and comments welcome.
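For reference, the gist of the change in sketch form (illustrative only; the helper name and launch configuration below are made up, not the exact code in lib/cublas/wrappers.jl):

```julia
using CUDA

# Minimal sketch of the idea: call `pointer` on the host only once, for the
# base address, and compute the per-batch pointers with a small kernel.
function strided_batch_pointers(strided::CuArray{T}, stride::Integer,
                                nbatch::Integer) where {T}
    ptrs = CuArray{CuPtr{T}}(undef, nbatch)
    base = pointer(strided)  # the one expensive host-side pointer call

    function kernel(ptrs, base, stride)
        i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
        if i <= length(ptrs)
            # CuPtr arithmetic works in bytes, like Ptr arithmetic
            @inbounds ptrs[i] = base + (i - 1) * stride * sizeof(T)
        end
        return
    end

    @cuda threads=256 blocks=cld(nbatch, 256) kernel(ptrs, base, stride)
    return ptrs
end
```

The resulting `ptrs` array can then be passed to the CUBLAS batched API as before.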


codecov bot commented Dec 20, 2024

Codecov Report

Attention: Patch coverage is 46.15385% with 7 lines in your changes missing coverage. Please review.

Project coverage is 73.50%. Comparing base (830c49b) to head (953c796).
Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
lib/cublas/wrappers.jl 46.15% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2601      +/-   ##
==========================================
- Coverage   73.51%   73.50%   -0.02%     
==========================================
  Files         157      157              
  Lines       15207    15218      +11     
==========================================
+ Hits        11180    11186       +6     
- Misses       4027     4032       +5     


github-actions bot (Contributor) left a comment

CUDA.jl Benchmarks

Benchmark suite Current: 953c796 Previous: 62564aa Ratio
latency/precompile 45168542210 ns 45352888677 ns 1.00
latency/ttfp 6429231913.5 ns 6365236644.5 ns 1.01
latency/import 3045177395 ns 3028318014.5 ns 1.01
integration/volumerhs 9554999 ns 9566694 ns 1.00
integration/byval/slices=1 146470 ns 146977 ns 1.00
integration/byval/slices=3 425314 ns 425622 ns 1.00
integration/byval/reference 144902 ns 144997 ns 1.00
integration/byval/slices=2 286364 ns 286278 ns 1.00
integration/cudadevrt 103469 ns 103635 ns 1.00
kernel/indexing 14403 ns 14502 ns 0.99
kernel/indexing_checked 15257 ns 15491 ns 0.98
kernel/occupancy 693.4575163398692 ns 685.9934210526316 ns 1.01
kernel/launch 2086 ns 2211.5555555555557 ns 0.94
kernel/rand 16453 ns 15495 ns 1.06
array/reverse/1d 19368 ns 19288.5 ns 1.00
array/reverse/2d 23258 ns 25407 ns 0.92
array/reverse/1d_inplace 9968 ns 11091 ns 0.90
array/reverse/2d_inplace 11558 ns 12732 ns 0.91
array/copy 20492 ns 20971 ns 0.98
array/iteration/findall/int 156404 ns 159562 ns 0.98
array/iteration/findall/bool 137197 ns 139396 ns 0.98
array/iteration/findfirst/int 153397 ns 153501 ns 1.00
array/iteration/findfirst/bool 154184.5 ns 154850 ns 1.00
array/iteration/scalar 76406 ns 75791 ns 1.01
array/iteration/logical 209655.5 ns 216533 ns 0.97
array/iteration/findmin/1d 40407 ns 41336 ns 0.98
array/iteration/findmin/2d 93014 ns 94735 ns 0.98
array/reductions/reduce/1d 34611 ns 42301 ns 0.82
array/reductions/reduce/2d 46237.5 ns 45473.5 ns 1.02
array/reductions/mapreduce/1d 32218 ns 40507 ns 0.80
array/reductions/mapreduce/2d 43278 ns 47968 ns 0.90
array/broadcast 21168.5 ns 21824 ns 0.97
array/copyto!/gpu_to_gpu 13385 ns 13701 ns 0.98
array/copyto!/cpu_to_gpu 209733 ns 213728 ns 0.98
array/copyto!/gpu_to_cpu 245916 ns 245756 ns 1.00
array/accumulate/1d 108625 ns 108365 ns 1.00
array/accumulate/2d 79320.5 ns 80195 ns 0.99
array/construct 1204.95 ns 1204.7 ns 1.00
array/random/randn/Float32 42140 ns 43838 ns 0.96
array/random/randn!/Float32 26056 ns 26350 ns 0.99
array/random/rand!/Int64 27164 ns 27179 ns 1.00
array/random/rand!/Float32 8737.333333333334 ns 8916.333333333334 ns 0.98
array/random/rand/Int64 29715 ns 30059 ns 0.99
array/random/rand/Float32 12729 ns 12950 ns 0.98
array/permutedims/4d 66562 ns 67259 ns 0.99
array/permutedims/2d 55897 ns 56558.5 ns 0.99
array/permutedims/3d 58481 ns 59379 ns 0.98
array/sorting/1d 2931270.5 ns 2932110 ns 1.00
array/sorting/by 3497571 ns 3498470 ns 1.00
array/sorting/2d 1084182 ns 1085408 ns 1.00
cuda/synchronization/stream/auto 1029.7 ns 1025.5 ns 1.00
cuda/synchronization/stream/nonblocking 6521.6 ns 6545.6 ns 1.00
cuda/synchronization/stream/blocking 811.1630434782609 ns 796.0645161290323 ns 1.02
cuda/synchronization/context/auto 1177.7 ns 1190.9 ns 0.99
cuda/synchronization/context/nonblocking 6703.5 ns 6739.6 ns 0.99
cuda/synchronization/context/blocking 915.9777777777778 ns 903.25 ns 1.01

This comment was automatically generated by a workflow using github-action-benchmark.

maleadt (Member) left a comment

Why not pass the array in its entirety to the kernel, so you can keep the `pointer(strided, (i-1)*stride + 1)` intact (but now device-side) instead of having to roll your own address calculation? It doesn't matter much in practice, but I'm just trying to keep the code as generic as possible (in case we ever add strides to `CuArray`, as suggested in #1298).
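Roughly something like this (a sketch, not actual code from the PR; the `reinterpret` glossing the device-to-host pointer type conversion is an assumption about how the result would be stored):

```julia
using CUDA

# Sketch: pass the array itself; inside the kernel `strided` is a
# CuDeviceArray, so `pointer(strided, i)` is plain address arithmetic.
function batch_pointers_kernel!(ptrs, strided::AbstractArray{T}, stride) where {T}
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(ptrs)
        dptr = pointer(strided, (i - 1) * stride + 1)  # device-side, cheap
        # device pointers are LLVMPtr-based; bitcast to the host-facing
        # CuPtr type that the batched CUBLAS API expects
        @inbounds ptrs[i] = reinterpret(CuPtr{T}, dptr)
    end
    return
end

# usage: 10 batches of 4×4 Float32 matrices
strided = CUDA.rand(Float32, 4, 4, 10)
stride, nbatch = 16, 10
ptrs = CuArray{CuPtr{Float32}}(undef, nbatch)
@cuda threads=256 blocks=cld(nbatch, 256) batch_pointers_kernel!(ptrs, strided, stride)
```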

THargreaves (Author)

Wouldn't that put us back at the same point as we were before, using the very slow `pointer` repeatedly, just now on the GPU?

My thought process was to perform the `pointer` call only once, since that's the expensive bit, and then leave the remaining simple logic in the kernel.

What if we restricted this fast method to dense arrays, and then used a fallback based on `pointer` or `Base._memory_offset` for all other cases?
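Something along these lines (hypothetical sketch; `batch_pointers` is a made-up name, the fast path stands in for the kernel sketched above, and the fallback mirrors the pre-PR behaviour):

```julia
using CUDA

# Fast path: dense arrays, where a single base pointer plus device-side
# offset arithmetic suffices (e.g. the kernel sketched earlier).
batch_pointers(strided::CUDA.DenseCuArray, stride, nbatch) =
    strided_batch_pointers(strided, stride, nbatch)

# Fallback for everything else, equivalent to the pre-PR behaviour:
# one host-side `pointer` call per batch (slow, but general).
batch_pointers(strided::AbstractArray{T}, stride, nbatch) where {T} =
    CuArray([pointer(strided, (i - 1) * stride + 1) for i in 1:nbatch])
```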

Am I correct in thinking that, even with strided arrays, we'd still need the more general case, in case the strided batched operation is performed on a non-contiguous view?

maleadt (Member) commented Dec 21, 2024

> Wouldn't that put us back at the same point as we were before, using the very slow `pointer` repeatedly, just now on the GPU?

No. You're dealing with different array types on the CPU and the GPU. On the CPU, `pointer` calls into an expensive operation that performs lots of bookkeeping (which I linked to in the issue: configuring memory pools, verifying accessibility, etc.). On the GPU, you're dealing with `CuDeviceArray`s, for which `pointer` is a simple memory address calculation.
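Concretely (a toy illustration of the distinction, not code from the PR):

```julia
using CUDA

A = CUDA.rand(Float32, 1024)

# Host side: `A` is a CuArray; `pointer(A, i)` goes through CUDA.jl's host
# machinery before yielding a CuPtr.
p_host = pointer(A, 5)

# Device side: inside the kernel `A` has become a CuDeviceArray, and the
# same call lowers to base-address-plus-offset arithmetic.
function kernel(A)
    p_dev = pointer(A, 5)
    return
end
@cuda kernel(A)
```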
