Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor distinct using static_map insert_or_apply #16484

Open
wants to merge 10 commits into
base: branch-24.10
Choose a base branch
from

Conversation

srinivasyadav18
Copy link
Contributor

Description

This PR refactors distinct using static_map::insert_or_apply for keep_first, keep_last and keep_none options.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Aug 2, 2024
auto keys = rmm::device_uvector<size_type>(map_size, stream, mr);
auto values = rmm::device_uvector<size_type>(map_size, stream, mr);

map.retrieve_all(keys.begin(), values.begin(), stream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use a discard iterator for the keys? I don't think we need to materialize them since they're not returned.

Comment on lines 136 to 137
auto plusop = plus_op{};
map.insert_or_apply(pairs, pairs + num_rows, plusop, stream.value());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto plusop = plus_op{};
map.insert_or_apply(pairs, pairs + num_rows, plusop, stream.value());
map.insert_or_apply(pairs, pairs + num_rows, plus_op{}, stream.value());

Comment on lines 151 to 152
[values = values.begin(),
keys = keys.begin(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels more natural to order these as keys, then values.

Suggested change
[values = values.begin(),
keys = keys.begin(),
[keys = keys.begin(),
values = values.begin(),

}
});

auto const map_end = thrust::copy_if(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The for_each above is unnecessary afaict. I think we can use a single copy_if with a counting transform iterator that checks the value for a given index, and copies the key for that index if so. That avoids materializing the intermediate output_indices.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This header is now empty. Can we delete it and remove it from #includes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, this file is now empty and should be removed and deleted from CMakeLists.txt.

@github-actions github-actions bot added the CMake CMake build issue label Aug 2, 2024
@srinivasyadav18 srinivasyadav18 added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Aug 2, 2024
@srinivasyadav18 srinivasyadav18 marked this pull request as ready for review August 2, 2024 23:24
@srinivasyadav18 srinivasyadav18 requested review from a team as code owners August 2, 2024 23:24
@srinivasyadav18
Copy link
Contributor Author

@bdice @PointKernel, I pushed all the cleanups and minor optimizations. The PR is ready for review.

Also, I don't think we might need extra optimization on keep_none. As @bdice mentioned, transform counting iterator with copy_if should be efficient enough.

reverting back this commit, as compilation time of distinct.cu
is almost 6 mins.

This reverts commit 5b66240.
@github-actions github-actions bot removed the CMake CMake build issue label Aug 4, 2024
@PointKernel
Copy link
Member

@srinivasyadav18 can you please share the performance comparisons before and after this PR?

@bdice bdice mentioned this pull request Aug 5, 2024
3 tasks
@GregoryKimball
Copy link
Contributor

Based on this gist, here are the most important performance signals:
image

I'm comparing the 1B row case, removing any because it's not impacted by this PR, and removing last because it is identical to first.

  • drop at 1B for keep=first
  • drop at medium-low cardinality (1-100K)
  • improvement at low cardinality (100) for I32
  • improvement at 1-100M

Overall, I believe this performance signature is a wash. Perhaps the shifting throughputs are due to cache locality differences. It's worth profiling the 1B rows, 1B cardinality, keep first case and the 1B rows, 100K cardinality, keep first case.

@srinivasyadav18
Copy link
Contributor Author

srinivasyadav18 commented Aug 6, 2024

Thanks @GregoryKimball for providing the plot and important performance signals.

I have done some profiling on 1 Billion keys and 1 Billion cardinality, and this what I found out.

Overall distint algorithm has 2 major steps

  • distinct_indices
    • initalize data strucutre
    • perform the reductions (reduce_by_row)
    • return the output_indices (all the distinct indices based on keep option)
  • gather

Time taken for each step with base branch-24.10:

  • distinct_indices
    • initalize data strucutre (cuco::static_set, reduction_results device_uvector) TIME = 5ms
    • perform the reductions (reduce_by_row) TIME = 65.589ms
    • return the output_indices (all the distinct indices based on keep option) TIME = 4ms
  • gather ##TIME = 15ms##

Below is the profiling of base branch-24.10 (selected region in green shade shows gather runtime)

base

Time taken for each step with insert_or_apply:

  • distinct_indices
    • initalize data strucutre TIME = 7.8ms
    • perform the reductions (insert_or_apply) (thrust::for_each with set_ref) TIME = 55.7ms
    • return the output_indices (retrieve_all) TIME = 11.8ms
  • gather ##TIME = 53.364##

Below is the profiling with insert_or_apply (selected region in green shade shows gather runtime)
insert_or_apply_gather

In summary, the gather algorithm is causing the main bottleneck.
It looks like the gather algorithm is efficient only when the row indices are already sorted (this makes sense because there is less data movement).

I tested the insert_or_apply implementtation to sort the array before returning from the distinct_algorithm, and performance improved signifcantly as now gather takes very less time.

Below is the profiling with insert_or_apply with sorting the distinct indices: (selected region in green shade shows gather runtime) (red shows the sorting overhead)

sort_insert_or_apply_gather

@srinivasyadav18
Copy link
Contributor Author

In summary, the branch-24.10 implementation is already efficient as it performs reduction and also maintains the ordering for reduction values (as they are row-indices).

insert-or-apply kernel improves performance more than 15%, but due to unordered indices as the output, the gather algorithm which does post-processing regresses the performance more than 50%, hence causing the overall performance regression in the pipeline.

Sorting helps the situation in large input cases, but causes regression in low input cases (see here in gist for numbers).

@GregoryKimball
Copy link
Contributor

Thank you @srinivasyadav18 for studying the 1B cardinality case. The gist you posted shows the benefit.
image

It would be worth a profile and check of the 100K cardinality case, but I don't think the moderate cardinality performance is a blocker

@ttnghia
Copy link
Contributor

ttnghia commented Aug 6, 2024

Instead of sorting, how about using scatter-gather approach?

  • Initialize a gather map with all 0
  • Scatter the output indices into the gather map, marking the position of these indices as 1
  • Gather using that gather map.

I'm starting to review this PR. Sorry for jumping late.

Comment on lines +78 to +83
auto const map_end = thrust::copy_if(
rmm::exec_policy_nosync(stream),
output_iter,
output_iter + num_distinct_keys,
output_indices.begin(),
cuda::proclaim_return_type<bool>([] __device__(auto const idx) { return idx != -1; }));
Copy link
Contributor

@ttnghia ttnghia Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add back all the explanation comments (and new comments for new code). Although I wrote the original code, now I hardly understand what it is doing without the comments.

Comment on lines +52 to +54
struct plus_op {
template <cuda::thread_scope Scope>
__device__ void operator()(cuda::atomic_ref<size_type, Scope> ref, size_type const val)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not put __device__ code in hpp file.

@@ -23,11 +23,13 @@
#include <rmm/device_uvector.hpp>
#include <rmm/resource_ref.hpp>

#include <cuco/static_map.cuh>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now with including the cuco header, this file should be renamed into _helper.cuh. Otherwise please move the these header and their relevant code into _helper.cuh and keep this file containing host-only code.

rapids-bot bot pushed a commit that referenced this pull request Aug 8, 2024
This PR adopts some work from @srinivasyadav18 with additional modifications. This is meant to complement #16484.

Authors:
  - Bradley Dice (https://github.com/bdice)
  - Srinivas Yadav (https://github.com/srinivasyadav18)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Srinivas Yadav (https://github.com/srinivasyadav18)

URL: #16497
@GregoryKimball
Copy link
Contributor

GregoryKimball commented Aug 16, 2024

We hope that #15700 will improve the overall performance outlook of static_map in distinct

image

@PointKernel
Copy link
Member

Update: waiting for #15700 to merge to determine if the current PR can improve performance after resolving the register issues.

@PointKernel PointKernel added Performance Performance related issue 0 - Blocked Cannot progress due to external reasons labels Oct 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Blocked Cannot progress due to external reasons improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

5 participants