Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization of tdigest merge aggregation. #16780

Merged
merged 23 commits into from
Sep 25, 2024

Conversation

nvdbaranec
Copy link
Contributor

@nvdbaranec nvdbaranec commented Sep 9, 2024

Fixes #16625

This PR fixes a slow implementation of the centroid merging step during the tdigest merge aggregation. Previously it was doing a linear march over the individual tdigests per group and merging them one by one. This led to terrible performance for large numbers of groups. In principle though, all this really was doing was a segmented sort of centroid values. So that's what this PR changes it to. Speedup for 1,000,000 input tidests with 1,000,000 individual groups is ~1000x,

Old
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        7473 ms         7472 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        7433 ms         7431 ms            8
New
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        6.72 ms         6.79 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        1.24 ms         1.32 ms            8

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@nvdbaranec nvdbaranec added libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Sep 9, 2024
@nvdbaranec nvdbaranec requested a review from a team as a code owner September 9, 2024 21:26
@nvdbaranec nvdbaranec marked this pull request as draft September 9, 2024 21:26
Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran a simple performance test against the CPU. An a6000 vs 16 CPU cores to do an approximate percentile on 1,000,000,000 rows with 1,000,000 unique keys in the group by. The GPU was 48x faster than the CPU. So it looks good. I still get errors related to #16675 not being in, and I am not sure if there are any merge conflicts that would show up between the two. I have not looked deeply at the C++ code so I am not approving this, but from the results I am +1 on merging it in.

@revans2
Copy link
Contributor

revans2 commented Sep 11, 2024

I did some more testing against the CPU and it looks really good. The improvements range from 5x faster to 32x faster than 16 CPU cores. A lot of the slowness on the CPU comes from spilling/shuffle when there are lots of groups, which we don't appear to suffer from as badly.

@github-actions github-actions bot added the CMake CMake build issue label Sep 11, 2024
@nvdbaranec nvdbaranec marked this pull request as ready for review September 17, 2024 19:37
@mhaseeb123
Copy link
Member

Looks like we also need a style fix for CI to run.

Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of nits related to recent changes in rmm/cudf. This is an amazing speed up.

cpp/src/quantiles/tdigest/tdigest_aggregation.cu Outdated Show resolved Hide resolved
cpp/src/quantiles/tdigest/tdigest_aggregation.cu Outdated Show resolved Hide resolved
cpp/benchmarks/quantiles/tdigest.cu Outdated Show resolved Hide resolved
cpp/benchmarks/quantiles/tdigest.cu Outdated Show resolved Hide resolved
Co-authored-by: David Wendt <45795991+davidwendt@users.noreply.github.com>
Copy link
Member

@mhaseeb123 mhaseeb123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style fix needed for CI. LGTM otherwise!

@ttnghia
Copy link
Contributor

ttnghia commented Sep 24, 2024

If there are just a few CI pipelines broken, don't rerun everything. Instead, click on the "Details" link then rerun only the failed jobs.

@nvdbaranec
Copy link
Contributor Author

If there are just a few CI pipelines broken, don't rerun everything. Instead, click on the "Details" link then rerun only the failed jobs.

That didn't work in this case.

Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks for making that benchmark change.

@ttnghia
Copy link
Contributor

ttnghia commented Sep 25, 2024

Yeah, it seems the CI workers are not available. Maybe you need to contact devops.

@nvdbaranec
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 8e78424 into rapidsai:branch-24.10 Sep 25, 2024
100 checks passed
rjzamora pushed a commit to rjzamora/cudf that referenced this pull request Sep 25, 2024
Fixes rapidsai#16625

This PR fixes a slow implementation of the centroid merging step during the tdigest merge aggregation.  Previously it was doing a linear march over the individual tdigests per group and merging them one by one.  This led to terrible performance for large numbers of groups.  In principle though, all this really was doing was a segmented sort of centroid values. So that's what this PR changes it to.  Speedup for 1,000,000 input tidests with 1,000,000 individual groups is ~1000x,

```
Old
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        7473 ms         7472 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        7433 ms         7431 ms            8
```


```
New
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        6.72 ms         6.79 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        1.24 ms         1.32 ms            8
```

Authors:
  - https://github.com/nvdbaranec
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Nghia Truong (https://github.com/ttnghia)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: rapidsai#16780
Matt711 pushed a commit to mroeschke/cudf that referenced this pull request Sep 25, 2024
Fixes rapidsai#16625

This PR fixes a slow implementation of the centroid merging step during the tdigest merge aggregation.  Previously it was doing a linear march over the individual tdigests per group and merging them one by one.  This led to terrible performance for large numbers of groups.  In principle though, all this really was doing was a segmented sort of centroid values. So that's what this PR changes it to.  Speedup for 1,000,000 input tidests with 1,000,000 individual groups is ~1000x,

```
Old
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        7473 ms         7472 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        7433 ms         7431 ms            8
```


```
New
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        6.72 ms         6.79 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        1.24 ms         1.32 ms            8
```

Authors:
  - https://github.com/nvdbaranec
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Nghia Truong (https://github.com/ttnghia)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: rapidsai#16780
rapids-bot bot pushed a commit that referenced this pull request Oct 24, 2024
Removes unused variable that contains host copy of the group_offsets data.
This host variable appears to have been made obsolete by a combination of #16897 and #16780
Found while working on #17149

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Nghia Truong (https://github.com/ttnghia)

URL: #17151
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

[BUG] TDIGEST_MERGE group by aggregation scales very badly
7 participants