Switch fully unbounded window functions to use aggregations #13727

mythrocks · 2023-07-20T04:17:55Z

Description

A fully unbounded window function (i.e. [unbounded_preceding, unbounded_following]) need not go through the window function machinery for execution. For example, consider the following:

```c++
auto grps = { 0, 0, 0, 0, 1, 1, 1, 1, 2, 2 };
auto vals = { 3, 1, 4, 2, 6, 7, 8, 5, 9, 0 };
```

Running the MIN window function on the groups, over an [UNBOUNDED, UNBOUNDED] window should produce:

```c++
auto res = { 1, 1, 1, 1, 5, 5, 5, 5, 0, 0 };
```

This result could more easily be achieved using a grouped MIN aggregation, and replicating each group's result for every entry in the group.
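As a rough illustration of the equivalence described above, the following is a minimal host-side sketch in plain standard C++ (not libcudf code). The function name `unbounded_window_min` is hypothetical and exists only for this example. It computes one MIN per group and then replicates that value for every row of the group.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper: MIN over a fully unbounded window, computed as a
// grouped MIN followed by a broadcast back to every row of the group.
std::vector<int> unbounded_window_min(std::vector<int> const& grps,
                                      std::vector<int> const& vals)
{
  assert(grps.size() == vals.size());

  // Step 1: one MIN per group (the "grouped aggregation").
  std::map<int, int> group_min;
  for (std::size_t i = 0; i < grps.size(); ++i) {
    auto it = group_min.find(grps[i]);
    if (it == group_min.end()) {
      group_min.emplace(grps[i], vals[i]);
    } else {
      it->second = std::min(it->second, vals[i]);
    }
  }

  // Step 2: replicate each group's result for every row in that group.
  std::vector<int> res;
  res.reserve(grps.size());
  for (int g : grps) { res.push_back(group_min.at(g)); }
  return res;
}
```

Called with the `grps` and `vals` from the description, this returns the `res` sequence shown above.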

This commit adds logic to detect fully unbounded windows, and to use groupby::aggregate() (when one or more grouping keys are specified) or reduce() (when there are no grouping keys) in their place.
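The choice between the two paths can be sketched as below. This is a simplified model in plain C++ under stated assumptions, not the code in this PR: `bound`, `path`, `choose_path` and `min_by_reduce` are all hypothetical names made up for the illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a window bound: unbounded, or a finite row count.
struct bound {
  bool unbounded;
  int rows;  // ignored when unbounded is true
};

// The three execution paths discussed in the description.
enum class path { reduce, groupby_aggregate, rolling_window };

// Choose a path from the window bounds and the number of grouping keys.
path choose_path(bound preceding, bound following, std::size_t num_keys)
{
  bool const fully_unbounded = preceding.unbounded && following.unbounded;
  if (!fully_unbounded) { return path::rolling_window; }
  return num_keys == 0 ? path::reduce : path::groupby_aggregate;
}

// The no-keys case: a single MIN reduction, replicated for every row.
std::vector<int> min_by_reduce(std::vector<int> const& vals)
{
  if (vals.empty()) { return {}; }
  int const m = *std::min_element(vals.begin(), vals.end());
  return std::vector<int>(vals.size(), m);
}
```

In this model, any window with at least one finite bound still takes the regular rolling-window path.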

Tangentially, this change also adds the following:

- A new overload of cudf::groupby::groupby::aggregate() that takes a stream parameter.
- A detail header to declare the (pre-existing) cudf::reduction::detail::reduce() function.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

mythrocks · 2023-07-20T22:04:11Z

A couple of tangential changes were necessitated for this one:

- The "no grouping keys" case requires access to cudf::reduction::detail::reduce() for the stream version of the function. A declaration was added in a detail header.
- The grouped case requires cudf::groupby::groupby::aggregate() to take a stream. An overload was added, per @vyasr's advice. (This would imply that we might've been syncing on the default stream for grouped aggregations before?)

cpp/src/rolling/grouped_rolling.cu

mythrocks · 2023-07-21T09:42:24Z

=================================== FAILURES ===================================
______________________ test_jaccard_index_random_strings _______________________
...
>       assert_eq(expected, actual)
...
AssertionError: Series are different

These failures aren't pertinent to this change. I'll work on the JNI piece of this, and come back to the CI failure.

mythrocks · 2023-07-21T11:08:22Z

It turns out that the JNI side of unbounded row-based window functions wasn't wired up properly: unbounded windows were treated as "very large, finite windows".

This was corrected, and an appropriate test was added.
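To illustrate why a "very large, finite" bound is not enough, here is a toy model in plain C++. It is a simplification under stated assumptions, not the JNI or libcudf code: `bound`, `extent`, `window_extent` and `is_fully_unbounded` are hypothetical names, and the convention that `preceding` excludes the current row is chosen only for this sketch. A bound of `INT_MAX` rows selects exactly the same rows as an unbounded one, yet it is not marked unbounded, so a check for fully unbounded windows would not fire.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstdint>

// Hypothetical stand-in for a row-based window bound.
struct bound {
  bool unbounded;
  int rows;  // ignored when unbounded is true
  static bound finite(int n) { return bound{false, n}; }
  static bound none() { return bound{true, 0}; }
};

// Half-open row range [begin, end) covered by the window at a given row.
struct extent {
  std::int64_t begin;
  std::int64_t end;
};

// Clamp the window to the group, so huge finite bounds cover the whole group.
extent window_extent(std::int64_t row, std::int64_t group_size, bound preceding, bound following)
{
  std::int64_t const begin =
    preceding.unbounded ? 0 : std::max<std::int64_t>(0, row - preceding.rows);
  std::int64_t const end =
    following.unbounded ? group_size
                        : std::min<std::int64_t>(group_size, row + following.rows + 1);
  return extent{begin, end};
}

// Only bounds explicitly marked unbounded qualify for the aggregation path.
bool is_fully_unbounded(bound preceding, bound following)
{
  return preceding.unbounded && following.unbounded;
}
```

Both forms produce identical extents for every row, which is why the results were already correct before the fix and only the fast path was being missed.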

mythrocks · 2023-07-21T11:34:02Z

Running correctness tests in Spark right now, but it looks like there's a reasonable speedup.

A simulation of the user query (MIN aggregation on unbounded windows for 10K groups of about 150K rows each) went from 730 seconds to 53 seconds on an RTX 6000.

revans2

Java code looks good.

bdice

This is a nice optimization. I have a few comments on its implementation, some of which might require a bit of separate work to prepare the pieces used by this PR.

cpp/include/cudf/groupby.hpp

cpp/src/rolling/detail/optimized_unbounded_window.cpp

cpp/src/rolling/grouped_rolling.cu

cpp/tests/groupby/count_tests.cpp

bdice · 2023-07-21T20:47:31Z

cpp/tests/groupby/groupby_test_util.hpp

@@ -33,7 +34,8 @@ void test_single_agg(cudf::column_view const& keys,
                     cudf::sorted keys_are_sorted                 = cudf::sorted::NO,
                     std::vector<cudf::order> const& column_order = {},
                     std::vector<cudf::null_order> const& null_precedence = {},
-                     cudf::sorted reference_keys_are_sorted               = cudf::sorted::NO);
+                     cudf::sorted reference_keys_are_sorted               = cudf::sorted::NO,
+                     rmm::cuda_stream_view test_stream = cudf::get_default_stream());


Note that the other tests in https://github.com/rapidsai/cudf/tree/branch-23.08/cpp/tests/streams are using cudf::test::get_default_stream().

Line#166 does use cudf::test::get_default_stream().

That said, I have moved the streams test to a separate file under tests/streams. This test uses cudf::test::get_default_stream().

Essentially the requirement for streams testing is that in every code path it must be possible to pass cudf::test::get_default_stream() through. #13506 contains a number of changes to existing cudf test utilities where there is no input stream so the code had to be changed to always use cudf::test::get_default_stream. Since this util is now supporting passing a stream it's fine for this to be cudf::get_default_stream().

That said, it would probably be easier to remove the stream parameter here and just always use cudf::test::get_default_stream() internally in this util. cudf::test::get_default_stream() is just an alias for cudf::get_default_stream() under normal circumstances. It exists to provide a symbol that can be overridden to use a different stream when STREAM_MODE testing is set in the CMake configuration for a particular test. Using cudf::test::get_default_stream() will therefore "just work" in all cases.
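The suggestion above can be modeled with a small toy in plain C++. Everything in it is an assumption made for illustration: the `toy` namespace, the `stream_id` type (standing in for `rmm::cuda_stream_view`), the `g_override` function pointer, and `run_single_agg_check` are hypothetical. In particular, the function pointer is only a stand-in for however the real symbol gets overridden under STREAM_MODE testing, which this sketch does not attempt to reproduce.

```cpp
#include <cassert>

namespace toy {

using stream_id = int;  // hypothetical stand-in for a CUDA stream handle

// Stand-in for the library's default stream getter.
stream_id get_default_stream() { return 0; }

namespace test {

using stream_getter = stream_id (*)();

// Stand-in for the override mechanism; null means "no override installed".
stream_getter g_override = nullptr;

// Under normal circumstances this simply forwards to the library getter.
stream_id get_default_stream()
{
  return g_override != nullptr ? g_override() : toy::get_default_stream();
}

}  // namespace test

// A test utility with no stream parameter: it always asks the test getter,
// so it picks up an override without any change at the call sites.
stream_id run_single_agg_check() { return test::get_default_stream(); }

}  // namespace toy
```

With no override installed, the utility sees the ordinary default stream. Once an override is installed, the same call sees the substituted stream, which is the "just work" behavior described above.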

vyasr

Looks like a good optimization here. Would love to see more perf numbers if you have them (I see one comment indicating that one particular case went from 730 to 53 seconds?). A few minor suggestions for improvement, but in general the stream piece has been handled correctly now (looks like some changes were made after our last discussion). Regarding the detail reduction header I don't think that's critical, but it's a nice-to-have now that it's there.

cpp/src/rolling/detail/optimized_unbounded_window.cpp

cpp/src/rolling/detail/optimized_unbounded_window.hpp

cpp/src/rolling/detail/optimized_unbounded_window.cpp

cpp/include/cudf/reduction/detail/reduction.hpp

cpp/tests/CMakeLists.txt

vyasr · 2023-07-24T23:52:20Z

cpp/tests/groupby/groupby_test_util.hpp

cpp/tests/rolling/grouped_rolling_range_test.cpp

cpp/tests/streams/groupby_test.cpp

mythrocks · 2023-07-25T21:23:59Z

Thanks for the review, @vyasr. I just ran an updated test with spark-rapids, to confirm that things are in shape.

@bdice, I should now have addressed the concerns you'd raised in your review. I'd very much appreciate it if you'd have another look.

bdice

Thanks @mythrocks. Two comments for forward-looking changes, otherwise approving.

cpp/include/cudf/rolling.hpp

cpp/src/rolling/detail/optimized_unbounded_window.cpp

Follow-up to rapidsai/cudf#13727.

This change addresses the slowness in window aggregations for windows defined as `[UNBOUNDED PRECEDING, UNBOUNDED FOLLOWING]`.

Before this change, unbounded row window bounds were interpreted as finite values, e.g. `[MAX_INT, MAX_INT]`. While this might be technically indistinguishable from a fully unbounded window, it causes the optimization in rapidsai/cudf/pull/13727 not to be triggered, because the window bounds are still finite.

The change in this PR allows the plugin to detect unbounded windows, and mark them as such for `libcudf`. The `libcudf` window function primitives can then detect fully unbounded windows, and use a faster/optimized path for execution.

Preliminary test results indicate that `[UNBOUNDED PRECEDING, UNBOUNDED FOLLOWING]` window function computations over 1B rows and thousands of groups are sped up by a factor of 10-14x over the previous/naive GPU implementation.

Signed-off-by: MithunR <mythrocks@gmail.com>

mythrocks · 2023-07-26T08:32:52Z

/merge

mythrocks · 2023-07-26T08:33:13Z

This change has been merged. Thank you all for your reviews and advice.

mythrocks requested review from a team as code owners July 20, 2023 04:17

mythrocks requested review from harrism and PointKernel July 20, 2023 04:17

mythrocks marked this pull request as draft July 20, 2023 04:18

github-actions bot added the libcudf and CMake labels Jul 20, 2023

mythrocks self-assigned this Jul 20, 2023

mythrocks added the 2 - In Progress, Spark, improvement, and non-breaking labels Jul 20, 2023

mythrocks added 4 commits July 19, 2023 21:19

Removed debug logging.

3a10e48

Merge remote-tracking branch 'origin/branch-23.08' into optimize-full…

0621b5e

…y-unbounded-windows

Integrated groupby with custom stream.

70ba471

Switch to using cudf::reduction::detail::reduce().

c99e79e

github-actions bot added the conda label Jul 20, 2023

Merge remote-tracking branch 'origin/branch-23.08' into optimize-full…

098d2d6

…y-unbounded-windows

mythrocks marked this pull request as ready for review July 20, 2023 21:53

mythrocks requested a review from a team as a code owner July 20, 2023 21:53

Removed debug prints.

19672ed

JNI changes.

21eeb92

mythrocks requested a review from a team as a code owner July 21, 2023 11:04

github-actions bot added the Java label Jul 21, 2023

revans2 approved these changes Jul 21, 2023

bdice requested changes Jul 21, 2023

bdice mentioned this pull request Jul 24, 2023

Require streams in public API for groupby aggregate. #13737

Closed

Review changes:

4bad802

1. New header for optimized_unbounded_window.
2. Removed static_cast from CUDF_FAIL.
3. [[fallthrough]] in switch case.

GregoryKimball requested a review from vyasr July 24, 2023 20:07

raydouglass approved these changes Jul 24, 2023

mythrocks added 2 commits July 24, 2023 13:55

Review fixes:

c786f3e

1. Moved utility functions to separate header.
2. Changed window_bounds::is_unbounded, and value to member functions.

Review fixes:

e68f46f

Moved groupby stream tests into its own translation unit.

vyasr requested changes Jul 24, 2023

mythrocks added 2 commits July 25, 2023 09:54

Formatting.

ea11930

Review fixes.

a746cc8

1. Removed unnecessary headers.
2. Adjusted the use of cudf::test::get_default_stream() in groupby_test_util, and streams/groupby_test.
3. Documented aggregation_based_rolling_window() and reduction_based_rolling_window().

mythrocks requested review from bdice and vyasr July 25, 2023 18:30

Merge remote-tracking branch 'origin/branch-23.08' into optimize-full…

c60f761

…y-unbounded-windows

vyasr approved these changes Jul 25, 2023

bdice approved these changes Jul 25, 2023

Remove expired TODO. Obviated by range_window_bounds.

48aef1e

This was referenced Jul 25, 2023

Treat unbounded windows as truly non-finite. NVIDIA/spark-rapids#8802

Merged

[FEA] Support COUNT_ALL and COUNT_VALID as reduce aggregations #13756

Open

rapids-bot bot merged commit ecb20a4 into rapidsai:branch-23.08 Jul 26, 2023
