
[BUG] TDIGEST_MERGE group by aggregation scales very badly #16625

Closed
revans2 opened this issue Aug 21, 2024 · 1 comment · Fixed by #16780
Labels
bug (Something isn't working), Performance (Performance related issue), Spark (Functionality that helps Spark RAPIDS)

Comments

Contributor

revans2 commented Aug 21, 2024

Describe the bug
The current implementation of TDIGEST_MERGE when used in a group by context launches separate GPU operations (kernels/memory copies) on the order of the number of groups in the output aggregation.

```
template <typename HGroupOffsetIter, typename GroupOffsetIter, typename GroupLabelIter>
std::unique_ptr<column> merge_tdigests(tdigest_column_view const& tdv,
                                       HGroupOffsetIter h_outer_offsets,
                                       GroupOffsetIter group_offsets,
                                       GroupLabelIter group_labels,
                                       size_t num_group_labels,
                                       size_type num_groups,
                                       int max_centroids,
                                       rmm::cuda_stream_view stream,
                                       rmm::device_async_resource_ref mr)
```

Specifically, the per-group work is here:

```
// generate the merged (but not yet compressed) tdigests for each group.
std::vector<std::unique_ptr<table>> tdigests;
tdigests.reserve(num_groups);
std::transform(h_outer_offsets,
               h_outer_offsets + num_groups,
               std::next(h_outer_offsets),
               std::back_inserter(tdigests),
               [&](auto tdigest_start, auto tdigest_end) {
                 // the range of tdigests in this group
                 auto const num_tdigests = tdigest_end - tdigest_start;

                 // slice each tdigest from the input
                 std::vector<table_view> unmerged_tdigests;
                 unmerged_tdigests.reserve(num_tdigests);
                 auto offset_iter = std::next(h_inner_offsets.begin(), tdigest_start);
                 std::transform(offset_iter,
                                offset_iter + num_tdigests,
                                std::next(offset_iter),
                                std::back_inserter(unmerged_tdigests),
                                [&](size_type start, size_type end) {
                                  return cudf::detail::slice(tdigests_unsliced, {start, end}, stream);
                                });

                 // merge
                 return cudf::detail::merge(unmerged_tdigests,
                                            {0},
                                            {order::ASCENDING},
                                            {},
                                            stream,
                                            rmm::mr::get_current_device_resource());
               });
```

If I run a Spark query like:

```
spark.time(spark.range(0, 1000000L, 1, 2).selectExpr("id % 500000 as k", "id").groupBy("k").agg(percentile_approx(col("id"), lit(0.95), lit(100000))).write...)
```

I can see the merge operator taking a massive amount of time, launching 500,000 kernels to merge the compacted items in the digests. We could skip all of this if we had a segmented merge. We don't, but we do have a segmented sort, which is probably good enough given how long the current code takes to run.

```
std::unique_ptr<column> segmented_sorted_order(
  table_view const& keys,
  column_view const& segment_offsets,
  std::vector<order> const& column_order         = {},
  std::vector<null_order> const& null_precedence = {},
  rmm::cuda_stream_view stream                   = cudf::get_default_stream(),
  rmm::device_async_resource_ref mr              = rmm::mr::get_current_device_resource());
```

Steps/Code to reproduce bug

```
spark.time(spark.range(0, 1000000L, 1, 2).selectExpr("id % 500000 as k", "id").groupBy("k").agg(percentile_approx(col("id"), lit(0.95), lit(100000))).orderBy("k").show())
```
  • CPU Time (single thread, median of 3 runs): 2.751 seconds
  • GPU Time (a6000 GPU with a single thread, median of 3 runs): 31.868 seconds

Sorry that this is for Spark, but it can be replicated in C++. You just want to do a group by aggregation for TDIGEST_MERGE where there is a large number of output groups and only a few items to actually merge in each group. It is still bad if all of the values are unique, but in that case it does not launch any kernels to do the merge; it just does a few memcpy calls.

Expected behavior
The GPU should destroy the CPU like it does for a reduction.

```
spark.time(spark.range(0, 50000000L, 1, 2).select(percentile_approx(col("id"), lit(0.95), lit(100000))).show())
```
  • CPU Time (single thread, median of 3 runs): 33.999 seconds
  • GPU Time (a6000 GPU with a single thread, median of 3 runs): 0.915 seconds
  • CPU Time (16 threads, median of 3 runs): 7.394 seconds
  • CPU Time (32 threads, median of 3 runs): 7.143 seconds
revans2 added the bug, Performance, and Spark labels Aug 21, 2024
@nvdbaranec
Contributor

nvdbaranec commented Aug 21, 2024

I don't even think this needs to be a segmented sort (Thrust doesn't have one in any case; the sort-by-key functionality is a bit of a misnomer). But I think if we just glommed everything into one big array and did a regular sort, we'd get the same effect. Off the top of my head, I don't know if radix sort (what Thrust uses internally) has any horrible performance problems when handed already-almost-sorted (or in this case, already-sorted) inputs. Probably not.

@rapids-bot rapids-bot bot closed this as completed in 8e78424 Sep 25, 2024
rjzamora pushed a commit to rjzamora/cudf that referenced this issue Sep 25, 2024
Fixes rapidsai#16625

This PR fixes a slow implementation of the centroid merging step during the tdigest merge aggregation. Previously it was doing a linear march over the individual tdigests per group and merging them one by one. This led to terrible performance for large numbers of groups. In principle, though, all this really was doing was a segmented sort of centroid values, so that's what this PR changes it to. Speedup for 1,000,000 input tdigests with 1,000,000 individual groups is ~1000x.

```
Old
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        7473 ms         7472 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        7433 ms         7431 ms            8
```


```
New
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        6.72 ms         6.79 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        1.24 ms         1.32 ms            8
```

Authors:
  - https://github.com/nvdbaranec
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Nghia Truong (https://github.com/ttnghia)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: rapidsai#16780
Matt711 pushed a commit to mroeschke/cudf that referenced this issue Sep 25, 2024