[BUG] TDIGEST_MERGE group by aggregation scales very badly #16625
Comments
I don't even think this needs to be a segmented sort (thrust doesn't have one in any case; the sort-by-key functionality is a bit of a misnomer). But I think if we just glommed everything into one big array and did a regular sort, we'd get the same effect. Off the top of my head, I don't know if radix sort (what thrust uses internally) has any horrible performance problems when handed already-almost-sorted (or in this case, already-sorted) inputs. Probably not.
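A minimal host-side sketch of the "one big array, one regular sort" idea, assuming each centroid carries a group id alongside its mean and weight. The `Centroid` struct and `sort_all_groups` function are hypothetical names for illustration; this is plain C++ with `std::stable_sort`, not thrust or the cudf implementation.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical flattened centroid: the group it belongs to, plus mean and weight.
struct Centroid {
  int group;
  double mean;
  double weight;
};

// Sort every centroid from every group in one pass. Ordering by (group, mean)
// keeps each group's centroids contiguous and ordered by mean, which is the
// same arrangement a per-group (segmented) sort would produce.
inline void sort_all_groups(std::vector<Centroid>& centroids)
{
  std::stable_sort(centroids.begin(), centroids.end(),
                   [](Centroid const& a, Centroid const& b) {
                     if (a.group != b.group) { return a.group < b.group; }
                     return a.mean < b.mean;
                   });
}
```

Because the group id is the leading part of the comparison, no separate segment boundaries are needed for the sort itself.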
Fixes rapidsai#16625

This PR fixes a slow implementation of the centroid merging step during the tdigest merge aggregation. Previously it did a linear march over the individual tdigests per group and merged them one by one. This led to terrible performance for large numbers of groups. In principle, though, all this was really doing was a segmented sort of centroid values, so that is what this PR changes it to. Speedup for 1,000,000 input tdigests with 1,000,000 individual groups is ~1000x.

Old:
```
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        7473 ms         7472 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        7433 ms         7431 ms            8
```

New:
```
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        6.72 ms         6.79 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        1.24 ms         1.32 ms            8
```

Authors:
- https://github.com/nvdbaranec
- Muhammad Haseeb (https://github.com/mhaseeb123)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Muhammad Haseeb (https://github.com/mhaseeb123)
- Nghia Truong (https://github.com/ttnghia)
- Mike Wilson (https://github.com/hyperbolic2346)

URL: rapidsai#16780
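To illustrate why the one-by-one merge and a sort give the same ordering, here is a small host-side sketch for a single group. It assumes each input tdigest's centroid means are already sorted; `merge_one_by_one` and `concat_and_sort` are hypothetical names, and neither is the code from the PR.

```cpp
#include <algorithm>
#include <iterator>
#include <utility>
#include <vector>

// Sketch of the old approach: walk the digests of one group and merge them
// one at a time. Each digest is assumed to be sorted by mean already.
inline std::vector<double> merge_one_by_one(std::vector<std::vector<double>> const& digests)
{
  std::vector<double> merged;
  for (auto const& d : digests) {
    std::vector<double> next;
    next.reserve(merged.size() + d.size());
    std::merge(merged.begin(), merged.end(), d.begin(), d.end(), std::back_inserter(next));
    merged = std::move(next);
  }
  return merged;
}

// Sketch of the new approach: concatenate the group's centroid means and sort once.
inline std::vector<double> concat_and_sort(std::vector<std::vector<double>> const& digests)
{
  std::vector<double> all;
  for (auto const& d : digests) { all.insert(all.end(), d.begin(), d.end()); }
  std::sort(all.begin(), all.end());
  return all;
}
```

Both functions return the same sequence of means for the same inputs. The difference is that the sort can be issued once over all groups, whereas the merge loop runs per group.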
Describe the bug
The current implementation of TDIGEST_MERGE when used in a group by context launches separate GPU operations (kernels/memory copies) on the order of the number of groups in the output aggregation.
cudf/cpp/src/quantiles/tdigest/tdigest_aggregation.cu
Lines 1024 to 1033 in 58799d6
Specifically the part that is per group is at
cudf/cpp/src/quantiles/tdigest/tdigest_aggregation.cu
Lines 1055 to 1086 in 58799d6
If I run a spark query like.
I can see the merge operator taking a massive amount of time and launching 500,000 kernels to merge the compacted items in the digest. We could skip all of this if we had a segmented merge. We don't, but we do have a segmented sort, which is probably good enough given how long the current code takes to run.
cudf/cpp/include/cudf/sorting.hpp
Lines 258 to 264 in 58799d6
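For reference, the semantics of a segmented sort can be sketched on the host as follows, assuming groups are delimited by an offsets array where segment `i` covers `[offsets[i], offsets[i + 1])`. The function name `segmented_sort_reference` is hypothetical and this loop is only a CPU model of the behavior, not the cudf API referenced above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for a segmented sort: sort each offset-delimited segment
// independently, never moving a value across a segment boundary.
inline void segmented_sort_reference(std::vector<double>& values,
                                     std::vector<std::size_t> const& offsets)
{
  assert(!offsets.empty() && offsets.back() <= values.size());
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    assert(offsets[i] <= offsets[i + 1]);
    auto const first = values.begin() + static_cast<std::ptrdiff_t>(offsets[i]);
    auto const last  = values.begin() + static_cast<std::ptrdiff_t>(offsets[i + 1]);
    std::sort(first, last);
  }
}
```

A GPU version would process all segments in a single operation rather than looping over them, which is the point of using it in place of a per-group merge.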
Steps/Code to reproduce bug
Sorry that this is for Spark, but it can be replicated in C++. You just want to do a group by aggregation for TDIGEST_MERGE where there are a large number of output groups and a few items to actually merge for each group. It is still bad if all of the values are unique, but then it does not launch any kernels to do the merge. It just does a few memcpy calls.
Expected behavior
The GPU should destroy the CPU like it does for a reduction.