Refactor distinct using static_map insert_or_apply #16484
base: branch-24.10
Conversation
auto keys = rmm::device_uvector<size_type>(map_size, stream, mr);
auto values = rmm::device_uvector<size_type>(map_size, stream, mr);

map.retrieve_all(keys.begin(), values.begin(), stream);
Can we use a discard iterator for the keys? I don't think we need to materialize them since they're not returned.
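Something like the following might work, assuming cuco's `retrieve_all` accepts an arbitrary output iterator for the keys (a sketch, not the PR's actual code; `map_size`, `stream`, and `mr` as in the excerpt above):

```cpp
// Discard the keys and materialize only the values.
#include <thrust/iterator/discard_iterator.h>

auto values = rmm::device_uvector<size_type>(map_size, stream, mr);
map.retrieve_all(thrust::make_discard_iterator(), values.begin(), stream);
```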
auto plusop = plus_op{};
map.insert_or_apply(pairs, pairs + num_rows, plusop, stream.value());
Suggested change:
- auto plusop = plus_op{};
- map.insert_or_apply(pairs, pairs + num_rows, plusop, stream.value());
+ map.insert_or_apply(pairs, pairs + num_rows, plus_op{}, stream.value());
[values = values.begin(),
 keys = keys.begin(),
Feels more natural to order these as keys, then values.
Suggested change:
- [values = values.begin(),
-  keys = keys.begin(),
+ [keys = keys.begin(),
+  values = values.begin(),
  }
});

auto const map_end = thrust::copy_if(
The `for_each` above is unnecessary afaict. I think we can use a single `copy_if` with a counting transform iterator that checks the value for a given index, and copies the key for that index if so. That avoids materializing the intermediate `output_indices`.
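A hedged sketch of that idea; `keys`, `values`, `map_size`, `output`, and the `values[i] == 1` predicate are placeholders rather than the PR's actual names and logic:

```cpp
// Iterate map slots by index and copy keys[i] only when values[i] passes the
// predicate, so no intermediate index buffer is materialized.
auto key_at  = cuda::proclaim_return_type<size_type>(
  [keys = keys.begin()] __device__(size_type i) { return keys[i]; });
auto keys_it = thrust::make_transform_iterator(
  thrust::make_counting_iterator<size_type>(0), key_at);

auto output = rmm::device_uvector<size_type>(map_size, stream, mr);
auto const out_end = thrust::copy_if(
  rmm::exec_policy_nosync(stream),
  keys_it,
  keys_it + map_size,
  thrust::make_counting_iterator<size_type>(0),  // stencil: the slot index
  output.begin(),
  cuda::proclaim_return_type<bool>(
    [values = values.begin()] __device__(size_type i) { return values[i] == 1; }));
```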
Good catch
This header is now empty. Can we delete it and remove it from `#include`s?
Likewise, this file is now empty and should be deleted and removed from CMakeLists.txt.
@bdice @PointKernel, I pushed all the cleanups and minor optimizations. The PR is ready for review. Also, I don't think we need extra optimization on keep_none; as @bdice mentioned, a counting transform iterator with `copy_if` covers it.
Reverting this commit, as the compilation time of distinct.cu is almost 6 minutes. This reverts commit 5b66240.
@srinivasyadav18 can you please share the performance comparisons before and after this PR?
Based on this gist, here are the most important performance signals: I'm comparing the 1B row case, removing
Overall, I believe this performance signature is a wash. Perhaps the shifting throughputs are due to cache locality differences. It's worth profiling the
Thanks @GregoryKimball for providing the plot and the important performance signals. I have done some profiling on 1 billion keys with 1 billion cardinality, and this is what I found. Overall, the distinct algorithm has two major steps.

Time taken for each step with base branch-24.10:

Below is the profiling of base branch-24.10 (the selected region in green shows the gather runtime).

Time taken for each step with insert_or_apply:

Below is the profiling with insert_or_apply (the selected region in green shows the gather runtime).

In summary, the gather algorithm is the main bottleneck. I tested the insert_or_apply implementation with sorting the array before returning from the distinct algorithm, and performance improved significantly, as gather now takes much less time.

Below is the profiling with insert_or_apply and sorted distinct indices (the selected region in green shows the gather runtime; red shows the sorting overhead).
In summary, the branch-24.10 implementation is already efficient, as it performs the reduction and also maintains the ordering of the reduction values (since they are row indices).
Sorting helps in the large-input cases but causes a regression in the small-input cases (see the gist for numbers).
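For reference, a minimal sketch of the sorting experiment described above, with placeholder names (the real code gathers a whole table, not a single column):

```cpp
// Sort the distinct row indices so the subsequent gather reads input rows in
// increasing memory order (better locality for large inputs).
thrust::sort(rmm::exec_policy_nosync(stream),
             distinct_indices.begin(), distinct_indices.end());

// Gather one hypothetical fixed-width column (input_col is a cudf::column_view)
// using the sorted indices.
auto out = rmm::device_uvector<size_type>(distinct_indices.size(), stream);
thrust::gather(rmm::exec_policy_nosync(stream),
               distinct_indices.begin(), distinct_indices.end(),
               input_col.begin<size_type>(),
               out.begin());
```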
Thank you @srinivasyadav18 for studying the 1B cardinality case. The gist you posted shows the benefit. It would be worth profiling and checking the 100K cardinality case, but I don't think the moderate-cardinality performance is a blocker.
Instead of sorting, how about using a scatter-gather approach?
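One possible reading of that suggestion, sketched with placeholder names (this is an assumption about what was meant, not code from the thread): scatter a keep flag to each surviving row position, then compact the row indices in their original order instead of sorting them.

```cpp
// Mark surviving rows via scatter, then copy their indices in original order.
auto keep = rmm::device_uvector<bool>(num_rows, stream);
thrust::uninitialized_fill(rmm::exec_policy_nosync(stream),
                           keep.begin(), keep.end(), false);
thrust::scatter(rmm::exec_policy_nosync(stream),
                thrust::make_constant_iterator(true),
                thrust::make_constant_iterator(true) + distinct_indices.size(),
                distinct_indices.begin(),
                keep.begin());

auto ordered = rmm::device_uvector<size_type>(distinct_indices.size(), stream);
thrust::copy_if(rmm::exec_policy_nosync(stream),
                thrust::make_counting_iterator<size_type>(0),
                thrust::make_counting_iterator<size_type>(num_rows),
                keep.begin(),        // stencil
                ordered.begin(),
                thrust::identity<bool>{});
```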
I'm starting to review this PR. Sorry for jumping in late.
auto const map_end = thrust::copy_if(
  rmm::exec_policy_nosync(stream),
  output_iter,
  output_iter + num_distinct_keys,
  output_indices.begin(),
  cuda::proclaim_return_type<bool>([] __device__(auto const idx) { return idx != -1; }));
Please add back all the explanation comments (and new comments for new code). Although I wrote the original code, now I hardly understand what it is doing without the comments.
struct plus_op {
  template <cuda::thread_scope Scope>
  __device__ void operator()(cuda::atomic_ref<size_type, Scope> ref, size_type const val)
Please do not put `__device__` code in an `hpp` file.
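For illustration, a hedged sketch of how the (truncated) `plus_op` above might look once completed and moved into a `.cuh` helper; the relaxed memory order and the file name are assumptions:

```cpp
// distinct_helpers.cuh (hypothetical file name)
#include <cuda/atomic>

struct plus_op {
  template <cuda::thread_scope Scope>
  __device__ void operator()(cuda::atomic_ref<size_type, Scope> ref, size_type const val)
  {
    // Accumulate the incoming payload into the slot already stored for this key.
    ref.fetch_add(val, cuda::memory_order_relaxed);
  }
};
```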
@@ -23,11 +23,13 @@
#include <rmm/device_uvector.hpp>
#include <rmm/resource_ref.hpp>

#include <cuco/static_map.cuh>
Now that this file includes the `cuco` header, it should be renamed to `_helper.cuh`. Otherwise, please move this header and its relevant code into `_helper.cuh` and keep this file host-only.
This PR adopts some work from @srinivasyadav18 with additional modifications. This is meant to complement #16484.
Authors:
- Bradley Dice (https://github.com/bdice)
- Srinivas Yadav (https://github.com/srinivasyadav18)
Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- Srinivas Yadav (https://github.com/srinivasyadav18)
URL: #16497
We hope that #15700 will improve the overall performance outlook of
Update: waiting for #15700 to merge to determine if the current PR can improve performance after resolving the register issues.
Description
This PR refactors `distinct` using `static_map::insert_or_apply` for the keep_first, keep_last, and keep_none options (one plausible operator mapping is sketched below the checklist).

Checklist
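For context, a hedged sketch of one plausible mapping from the keep options to `insert_or_apply` reduction operators over row indices; these operators are illustrative and not necessarily the PR's exact implementation:

```cpp
// keep_first: keep the smallest row index seen for each key.
struct min_op {
  template <cuda::thread_scope Scope>
  __device__ void operator()(cuda::atomic_ref<size_type, Scope> ref, size_type const val)
  {
    ref.fetch_min(val, cuda::memory_order_relaxed);
  }
};

// keep_last: keep the largest row index seen for each key.
struct max_op {
  template <cuda::thread_scope Scope>
  __device__ void operator()(cuda::atomic_ref<size_type, Scope> ref, size_type const val)
  {
    ref.fetch_max(val, cuda::memory_order_relaxed);
  }
};

// keep_none: count occurrences per key (e.g. with plus_op over a payload of 1);
// only keys whose count remains 1 survive the final copy_if.
```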