
[BUG] TDIGEST_MERGE group by aggregation scales very badly #16625

Closed
revans2 opened this issue Aug 21, 2024 · 1 comment · Fixed by #16780
Labels
bug (Something isn't working), Performance (Performance related issue), Spark (Functionality that helps Spark RAPIDS)

Comments

Contributor

revans2 commented Aug 21, 2024

Describe the bug
The current implementation of TDIGEST_MERGE when used in a group by context launches separate GPU operations (kernels/memory copies) on the order of the number of groups in the output aggregation.

```
template <typename HGroupOffsetIter, typename GroupOffsetIter, typename GroupLabelIter>
std::unique_ptr<column> merge_tdigests(tdigest_column_view const& tdv,
                                       HGroupOffsetIter h_outer_offsets,
                                       GroupOffsetIter group_offsets,
                                       GroupLabelIter group_labels,
                                       size_t num_group_labels,
                                       size_type num_groups,
                                       int max_centroids,
                                       rmm::cuda_stream_view stream,
                                       rmm::device_async_resource_ref mr)
```

Specifically, the per-group work is here:

```
// generate the merged (but not yet compressed) tdigests for each group.
std::vector<std::unique_ptr<table>> tdigests;
tdigests.reserve(num_groups);
std::transform(h_outer_offsets,
               h_outer_offsets + num_groups,
               std::next(h_outer_offsets),
               std::back_inserter(tdigests),
               [&](auto tdigest_start, auto tdigest_end) {
                 // the range of tdigests in this group
                 auto const num_tdigests = tdigest_end - tdigest_start;

                 // slice each tdigest from the input
                 std::vector<table_view> unmerged_tdigests;
                 unmerged_tdigests.reserve(num_tdigests);
                 auto offset_iter = std::next(h_inner_offsets.begin(), tdigest_start);
                 std::transform(offset_iter,
                                offset_iter + num_tdigests,
                                std::next(offset_iter),
                                std::back_inserter(unmerged_tdigests),
                                [&](size_type start, size_type end) {
                                  return cudf::detail::slice(tdigests_unsliced, {start, end}, stream);
                                });

                 // merge
                 return cudf::detail::merge(unmerged_tdigests,
                                            {0},
                                            {order::ASCENDING},
                                            {},
                                            stream,
                                            rmm::mr::get_current_device_resource());
               });
```

If I run a Spark query like:

```
spark.time(spark.range(0, 1000000L, 1, 2).selectExpr("id % 500000 as k", "id").groupBy("k").agg(percentile_approx(col("id"), lit(0.95), lit(100000))).write...)
```

I can see the merge operator taking a massive amount of time, launching 500,000 kernels to merge the compacted items in the digests. We could skip all of this if we had a segmented merge. We don't, but we do have a segmented sort, which is probably good enough given how long the current code takes to run.

```
std::unique_ptr<column> segmented_sorted_order(
  table_view const& keys,
  column_view const& segment_offsets,
  std::vector<order> const& column_order         = {},
  std::vector<null_order> const& null_precedence = {},
  rmm::cuda_stream_view stream                   = cudf::get_default_stream(),
  rmm::device_async_resource_ref mr              = rmm::mr::get_current_device_resource());
```

Steps/Code to reproduce bug

```
spark.time(spark.range(0, 1000000L, 1, 2).selectExpr("id % 500000 as k", "id").groupBy("k").agg(percentile_approx(col("id"), lit(0.95), lit(100000))).orderBy("k").show())
```
  • CPU Time (single thread, median of 3 runs): 2.751 seconds
  • GPU Time (a6000 GPU with a single thread, median of 3 runs): 31.868 seconds

Sorry that this is for Spark, but it can be replicated in C++. You just want to do a group by aggregation for TDIGEST_MERGE where there is a large number of output groups and only a few items to actually merge in each group. It is still bad if all of the values are unique, but in that case it does not launch any kernels to do the merge; it just does a few memcpy calls.

Expected behavior
The GPU should destroy the CPU like it does for a reduction.

```
spark.time(spark.range(0, 50000000L, 1, 2).select(percentile_approx(col("id"), lit(0.95), lit(100000))).show())
```
  • CPU Time (single thread, median of 3 runs): 33.999 seconds
  • GPU Time (a6000 GPU with a single thread, median of 3 runs): 0.915 seconds
  • CPU Time (16 threads, median of 3 runs): 7.394 seconds
  • CPU Time (32 threads, median of 3 runs): 7.143 seconds
revans2 added the bug, Performance, and Spark labels Aug 21, 2024
@nvdbaranec
Contributor

nvdbaranec commented Aug 21, 2024

I don't even think this needs to be a segmented sort (Thrust doesn't have one in any case; the sort-by-key functionality is a bit of a misnomer). But I think if we just glommed everything into one big array and did a regular sort, we'd get the same effect. Off the top of my head, I don't know if radix sort (what Thrust uses internally) has any horrible performance problems when handed already-almost-sorted (or in this case, already-sorted) inputs. Probably not.

@rapids-bot rapids-bot bot closed this as completed in 8e78424 Sep 25, 2024
rjzamora pushed a commit to rjzamora/cudf that referenced this issue Sep 25, 2024
Fixes rapidsai#16625

This PR fixes a slow implementation of the centroid merging step during the tdigest merge aggregation. Previously it was doing a linear march over the individual tdigests per group and merging them one by one. This led to terrible performance for large numbers of groups. In principle, though, all this really was doing was a segmented sort of centroid values, so that's what this PR changes it to. Speedup for 1,000,000 input tdigests with 1,000,000 individual groups is ~1000x.

```
Old
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        7473 ms         7472 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        7433 ms         7431 ms            8
```


```
New
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        6.72 ms         6.79 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        1.24 ms         1.32 ms            8
```

Authors:
  - https://github.com/nvdbaranec
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Nghia Truong (https://github.com/ttnghia)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: rapidsai#16780
Matt711 pushed a commit to mroeschke/cudf that referenced this issue Sep 25, 2024