Move strided batch pointer conversion to GPU #2601
base: master
Conversation
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #2601      +/-   ##
==========================================
- Coverage   73.51%   73.50%   -0.02%
==========================================
  Files         157      157
  Lines       15207    15218      +11
==========================================
+ Hits        11180    11186       +6
- Misses       4027     4032       +5

☔ View full report in Codecov by Sentry.
CUDA.jl Benchmarks
| Benchmark suite | Current: 953c796 | Previous: 62564aa | Ratio |
|---|---|---|---|
| latency/precompile | 45168542210 ns | 45352888677 ns | 1.00 |
| latency/ttfp | 6429231913.5 ns | 6365236644.5 ns | 1.01 |
| latency/import | 3045177395 ns | 3028318014.5 ns | 1.01 |
| integration/volumerhs | 9554999 ns | 9566694 ns | 1.00 |
| integration/byval/slices=1 | 146470 ns | 146977 ns | 1.00 |
| integration/byval/slices=3 | 425314 ns | 425622 ns | 1.00 |
| integration/byval/reference | 144902 ns | 144997 ns | 1.00 |
| integration/byval/slices=2 | 286364 ns | 286278 ns | 1.00 |
| integration/cudadevrt | 103469 ns | 103635 ns | 1.00 |
| kernel/indexing | 14403 ns | 14502 ns | 0.99 |
| kernel/indexing_checked | 15257 ns | 15491 ns | 0.98 |
| kernel/occupancy | 693.4575163398692 ns | 685.9934210526316 ns | 1.01 |
| kernel/launch | 2086 ns | 2211.5555555555557 ns | 0.94 |
| kernel/rand | 16453 ns | 15495 ns | 1.06 |
| array/reverse/1d | 19368 ns | 19288.5 ns | 1.00 |
| array/reverse/2d | 23258 ns | 25407 ns | 0.92 |
| array/reverse/1d_inplace | 9968 ns | 11091 ns | 0.90 |
| array/reverse/2d_inplace | 11558 ns | 12732 ns | 0.91 |
| array/copy | 20492 ns | 20971 ns | 0.98 |
| array/iteration/findall/int | 156404 ns | 159562 ns | 0.98 |
| array/iteration/findall/bool | 137197 ns | 139396 ns | 0.98 |
| array/iteration/findfirst/int | 153397 ns | 153501 ns | 1.00 |
| array/iteration/findfirst/bool | 154184.5 ns | 154850 ns | 1.00 |
| array/iteration/scalar | 76406 ns | 75791 ns | 1.01 |
| array/iteration/logical | 209655.5 ns | 216533 ns | 0.97 |
| array/iteration/findmin/1d | 40407 ns | 41336 ns | 0.98 |
| array/iteration/findmin/2d | 93014 ns | 94735 ns | 0.98 |
| array/reductions/reduce/1d | 34611 ns | 42301 ns | 0.82 |
| array/reductions/reduce/2d | 46237.5 ns | 45473.5 ns | 1.02 |
| array/reductions/mapreduce/1d | 32218 ns | 40507 ns | 0.80 |
| array/reductions/mapreduce/2d | 43278 ns | 47968 ns | 0.90 |
| array/broadcast | 21168.5 ns | 21824 ns | 0.97 |
| array/copyto!/gpu_to_gpu | 13385 ns | 13701 ns | 0.98 |
| array/copyto!/cpu_to_gpu | 209733 ns | 213728 ns | 0.98 |
| array/copyto!/gpu_to_cpu | 245916 ns | 245756 ns | 1.00 |
| array/accumulate/1d | 108625 ns | 108365 ns | 1.00 |
| array/accumulate/2d | 79320.5 ns | 80195 ns | 0.99 |
| array/construct | 1204.95 ns | 1204.7 ns | 1.00 |
| array/random/randn/Float32 | 42140 ns | 43838 ns | 0.96 |
| array/random/randn!/Float32 | 26056 ns | 26350 ns | 0.99 |
| array/random/rand!/Int64 | 27164 ns | 27179 ns | 1.00 |
| array/random/rand!/Float32 | 8737.333333333334 ns | 8916.333333333334 ns | 0.98 |
| array/random/rand/Int64 | 29715 ns | 30059 ns | 0.99 |
| array/random/rand/Float32 | 12729 ns | 12950 ns | 0.98 |
| array/permutedims/4d | 66562 ns | 67259 ns | 0.99 |
| array/permutedims/2d | 55897 ns | 56558.5 ns | 0.99 |
| array/permutedims/3d | 58481 ns | 59379 ns | 0.98 |
| array/sorting/1d | 2931270.5 ns | 2932110 ns | 1.00 |
| array/sorting/by | 3497571 ns | 3498470 ns | 1.00 |
| array/sorting/2d | 1084182 ns | 1085408 ns | 1.00 |
| cuda/synchronization/stream/auto | 1029.7 ns | 1025.5 ns | 1.00 |
| cuda/synchronization/stream/nonblocking | 6521.6 ns | 6545.6 ns | 1.00 |
| cuda/synchronization/stream/blocking | 811.1630434782609 ns | 796.0645161290323 ns | 1.02 |
| cuda/synchronization/context/auto | 1177.7 ns | 1190.9 ns | 0.99 |
| cuda/synchronization/context/nonblocking | 6703.5 ns | 6739.6 ns | 0.99 |
| cuda/synchronization/context/blocking | 915.9777777777778 ns | 903.25 ns | 1.01 |
This comment was automatically generated by workflow using github-action-benchmark.
Why not pass the array in its entirety to the kernel, so you can keep the `pointer(strided, (i-1)*stride + 1)` intact (but now device-side) instead of having to roll your own address calculation? It doesn't matter much in practice, but I'm just trying to keep the code as generic as possible (in case we ever add strides to `CuArray`, as suggested in #1298).
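For concreteness, here is a minimal sketch of what that suggestion could look like, assuming a hand-written helper kernel; `batch_pointers_kernel!` and `gpu_batch_pointers` are hypothetical names (not code from this PR), and the `reinterpret` to `CuPtr` is an assumed way of storing the device-side pointer value:

```julia
using CUDA

# Each thread computes the pointer to the start of one batch slice by indexing
# into the device-side array, instead of doing raw address arithmetic against a
# host-computed base pointer.
function batch_pointers_kernel!(ptrs, strided, stride)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(ptrs)
        # Inside the kernel `strided` is a CuDeviceArray, so `pointer` performs
        # the address calculation device-side; reinterpreting to the CuPtr
        # element type of `ptrs` is an assumption of this sketch.
        @inbounds ptrs[i] = reinterpret(eltype(ptrs), pointer(strided, (i - 1) * stride + 1))
    end
    return nothing
end

# Hypothetical host-side wrapper.
function gpu_batch_pointers(strided::CuArray{T}, batchsize::Integer, stride::Integer) where {T}
    ptrs = CuArray{CuPtr{T}}(undef, batchsize)
    threads = 256
    @cuda threads=threads blocks=cld(batchsize, threads) batch_pointers_kernel!(ptrs, strided, stride)
    return ptrs
end
```

Because the kernel only touches the array through `pointer(strided, i)`, the same code would keep working if `CuArray` ever gained strides, which seems to be the genericity argument being made here.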
Wouldn't that put us back at the same point as we were before, using the very slow …?

My thought process was to perform the …

What if we restricted this fast method to just dense arrays, and then used a fallback based on …?

Am I correct in thinking that, even with strided arrays, we'd still need the more general case, in case the strided batched operation is performed on a non-contiguous view?
No. You're dealing with different array types on the CPU and the GPU. On the CPU, …
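To make the dense-fast-path-plus-fallback idea above concrete, here is a rough sketch under stated assumptions: `batch_pointers` is a hypothetical name, `CUDA.DenseCuArray` and `CUDA.StridedCuArray` are CUDA.jl's internal array-type aliases, and the dispatch split is only illustrative of the question being asked, not the PR's actual code:

```julia
using CUDA

# Fast path: for a contiguous (dense) array the distance between batch slices is
# a plain element count, so the per-batch pointers can be produced entirely on
# the GPU by mapping over the batch indices.
function batch_pointers(strided::CUDA.DenseCuArray{T,3}) where {T}
    batchsize = size(strided, 3)
    stride = size(strided, 1) * size(strided, 2)
    base = pointer(strided)                    # CuPtr{T} to the first element
    bytes_per_batch = stride * sizeof(T)
    return map(CuArray(1:batchsize)) do i
        base + (i - 1) * bytes_per_batch       # CuPtr arithmetic is in bytes
    end
end

# General fallback, e.g. for a non-contiguous view: build the pointers on the
# CPU and upload them, mirroring the host-side construction quoted earlier.
function batch_pointers(strided::CUDA.StridedCuArray{T,3}) where {T}
    batchsize = size(strided, 3)
    stride = size(strided, 1) * size(strided, 2)
    return CuArray(map(i -> pointer(strided, (i - 1) * stride + 1), 1:batchsize))
end
```

Here the fast method is restricted to dense arrays and the fallback reuses the host-side `pointer` construction; where exactly the dispatch boundary should sit would depend on which wrapper types the batched CUBLAS entry points accept.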
Relates to #2592
Modifications and comments welcome.