Switch fully unbounded window functions to use aggregations #13727

Merged

@mythrocks mythrocks commented Jul 20, 2023

Description

A fully unbounded window function (i.e. [unbounded_preceding, unbounded_following]) need not go through the window function machinery for execution. For example, consider the following:

auto grps = { 0, 0, 0, 0, 1, 1, 1, 1, 2, 2 };
auto vals = { 3, 1, 4, 2, 6, 7, 8, 5, 9, 0 };

Running the MIN window function on the groups, over an [UNBOUNDED, UNBOUNDED] window should produce:

auto res = { 1, 1, 1, 1, 5, 5, 5, 5, 0, 0 };

This result could more easily be achieved using a grouped MIN aggregation, and replicating each group's result for every entry in the group.
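
As a rough host-side illustration of that equivalence (a sketch only, not the libcudf implementation), the two steps can be written with standard containers. The function name `unbounded_window_min` and the use of `std::map` as a stand-in for the grouped aggregation are assumptions made for this example:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper: MIN over a fully unbounded window, computed as a
// grouped MIN followed by a broadcast of each group's result to its rows.
std::vector<int> unbounded_window_min(std::vector<int> const& grps,
                                      std::vector<int> const& vals)
{
  assert(grps.size() == vals.size());

  // Step 1: grouped MIN aggregation, one result per distinct key.
  std::map<int, int> group_min;
  for (std::size_t i = 0; i < grps.size(); ++i) {
    auto it = group_min.find(grps[i]);
    if (it == group_min.end()) {
      group_min.emplace(grps[i], vals[i]);
    } else {
      it->second = std::min(it->second, vals[i]);
    }
  }

  // Step 2: replicate each group's result for every row in that group.
  std::vector<int> out;
  out.reserve(grps.size());
  for (int g : grps) { out.push_back(group_min.at(g)); }
  return out;
}
```

Called with the `grps` and `vals` shown above, this returns the same values as `res`.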

This commit adds logic to detect fully unbounded windows, and use groupby::aggregate() (when one or more grouping keys are specified), or reduce() (when there are no grouping keys).
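
The dispatch described above might be sketched as follows. This is a simplified host-side model under stated assumptions: `bound`, `exec_path`, `choose_path` and `reduce_min_broadcast` are hypothetical names invented for illustration and are not part of the libcudf API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a window bound: a finite row count, or unbounded.
struct bound {
  bool unbounded;
  int rows;  // only meaningful when unbounded == false
};

enum class exec_path { GROUPBY_AGGREGATE, REDUCE, ROLLING_WINDOW };

// Pick an execution path: only windows unbounded on BOTH sides skip the
// rolling-window machinery.
exec_path choose_path(bound preceding, bound following, std::size_t num_group_keys)
{
  bool const fully_unbounded = preceding.unbounded && following.unbounded;
  if (!fully_unbounded) { return exec_path::ROLLING_WINDOW; }
  return num_group_keys > 0 ? exec_path::GROUPBY_AGGREGATE : exec_path::REDUCE;
}

// The "no grouping keys" case: one reduction over the whole column,
// with the single result repeated for every row.
std::vector<int> reduce_min_broadcast(std::vector<int> const& vals)
{
  if (vals.empty()) { return {}; }
  int const m = *std::min_element(vals.begin(), vals.end());
  return std::vector<int>(vals.size(), m);
}
```

A window that is unbounded on only one side still takes the regular rolling-window path in this sketch, matching the "fully unbounded" condition in the description.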

Tangentially, this change also adds the following:

  1. A new overload of cudf::groupby::groupby::aggregate() that takes a stream parameter.
  2. A detail header to declare the (pre-existing) cudf::reduction::detail::reduce() function.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mythrocks mythrocks requested review from a team as code owners July 20, 2023 04:17
@mythrocks mythrocks requested review from harrism and PointKernel July 20, 2023 04:17
@mythrocks mythrocks marked this pull request as draft July 20, 2023 04:18
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Jul 20, 2023
@mythrocks mythrocks self-assigned this Jul 20, 2023
@mythrocks mythrocks added 2 - In Progress Currently a work in progress Spark Functionality that helps Spark RAPIDS improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jul 20, 2023
@github-actions github-actions bot added the conda label Jul 20, 2023
@mythrocks mythrocks marked this pull request as ready for review July 20, 2023 21:53
@mythrocks mythrocks requested a review from a team as a code owner July 20, 2023 21:53

mythrocks commented Jul 20, 2023

A couple of tangential changes were necessitated for this one:

  1. The "no grouping keys" case requires access to cudf::reduction::detail::reduce() for the stream version of the function. A declaration was added in a detail header.
  2. The grouped case requires cudf::groupby::groupby::aggregate() to take a stream. An overload was added, per @vyasr's advice. (This would imply that we might've been syncing on the default stream for grouped aggregations before?)


mythrocks commented Jul 21, 2023

=================================== FAILURES ===================================
______________________ test_jaccard_index_random_strings _______________________
...
>       assert_eq(expected, actual)
...
AssertionError: Series are different

These failures aren't pertinent to this change. I'll work on the JNI piece of this, and come back to the CI failure.

@mythrocks mythrocks requested a review from a team as a code owner July 21, 2023 11:04
@github-actions github-actions bot added the Java Affects Java cuDF API. label Jul 21, 2023
@mythrocks

It turns out that the JNI side of unbounded row-based window functions wasn't wired properly: Unbounded windows were treated as "very large, finite windows".

This was corrected, and an appropriate test was added.
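
To illustrate why the distinction matters, here is a minimal sketch assuming a hypothetical `row_bound` type with an explicit unbounded flag (the names below are invented for this example and are not the cuDF or JNI API). A very large finite bound produces the same window extents once clamped to the group, but it does not carry the flag that the fast-path check relies on:

```cpp
#include <algorithm>
#include <cassert>
#include <climits>

// Hypothetical bound type: a row count plus an explicit "unbounded" flag.
struct row_bound {
  bool is_unbounded;
  int value;
};

row_bound finite(int rows) { return {false, rows}; }
row_bound unbounded() { return {true, INT_MAX}; }

// Rows actually available before the current row, after clamping the
// preceding bound to the start of the group.
int clamped_preceding(row_bound b, int row_index_in_group)
{
  return std::min(b.value, row_index_in_group);
}

// The optimized path is only taken when both sides are flagged unbounded.
bool takes_fast_path(row_bound preceding, row_bound following)
{
  return preceding.is_unbounded && following.is_unbounded;
}
```

In this model, `finite(INT_MAX)` and `unbounded()` clamp to the same extents, yet only the latter satisfies `takes_fast_path`.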

@mythrocks

Running correctness tests in Spark right now, but it looks like there's reasonable speedup.

A simulation of the user query (MIN aggregation on unbounded windows for 10K groups of about 150K rows each) went from 730 seconds to 53 seconds on an RTX 6000.

@revans2 revans2 left a comment

Java code looks good.

@bdice bdice left a comment

This is a nice optimization. I have a few comments on its implementation, some of which might require a bit of separate work to prepare the pieces used by this PR.

cpp/include/cudf/groupby.hpp
cpp/src/rolling/detail/optimized_unbounded_window.cpp
cpp/src/rolling/detail/optimized_unbounded_window.cpp
cpp/src/rolling/grouped_rolling.cu
cpp/src/rolling/grouped_rolling.cu
cpp/tests/groupby/count_tests.cpp
@@ -33,7 +34,8 @@ void test_single_agg(cudf::column_view const& keys,
 cudf::sorted keys_are_sorted = cudf::sorted::NO,
 std::vector<cudf::order> const& column_order = {},
 std::vector<cudf::null_order> const& null_precedence = {},
-cudf::sorted reference_keys_are_sorted = cudf::sorted::NO);
+cudf::sorted reference_keys_are_sorted = cudf::sorted::NO,
+rmm::cuda_stream_view test_stream = cudf::get_default_stream());
Contributor

Note that the other tests in https://github.com/rapidsai/cudf/tree/branch-23.08/cpp/tests/streams are using cudf::test::get_default_stream().

Contributor Author

Line#166 does use cudf::test::get_default_stream().

That said, I have moved the streams test to a separate file under tests/streams. This test uses cudf::test::get_default_stream().

Contributor

Essentially the requirement for streams testing is that in every code path it must be possible to pass cudf::test::get_default_stream() through. #13506 contains a number of changes to existing cudf test utilities where there is no input stream so the code had to be changed to always use cudf::test::get_default_stream. Since this util is now supporting passing a stream it's fine for this to be cudf::get_default_stream().

That said, it would probably be easier to remove the stream parameter here and just always use cudf::test::get_default_stream() internally in this util. cudf::test::get_default_stream() is just an alias for cudf::get_default_stream() under normal circumstances. It exists to provide a symbol that can be overridden to use a different stream when STREAM_MODE testing is set in the CMake configuration for a particular test. Using cudf::test::get_default_stream() will therefore "just work" in all cases.

1. New header for optimized_unbounded_window.
2. Removed static_cast from CUDF_FAIL.
3. [[fallthrough]] in switch case.
@GregoryKimball GregoryKimball requested a review from vyasr July 24, 2023 20:07
1. Moved utility functions to separate header.
2. Changed window_bounds::is_unbounded, and value to member functions.
Moved groupby stream tests into their own translation unit.
@vyasr vyasr left a comment

Looks like a good optimization here. Would love to see more perf numbers if you have them (I see one comment indicating that one particular case went from 730 to 53 seconds?). A few minor suggestions for improvement, but in general the stream piece has been handled correctly now (looks like some changes were made after our last discussion). Regarding the detail reduction header I don't think that's critical, but it's a nice-to-have now that it's there.

cpp/src/rolling/detail/optimized_unbounded_window.hpp
cpp/include/cudf/reduction/detail/reduction.hpp
cpp/tests/CMakeLists.txt
cpp/tests/streams/groupby_test.cpp
cpp/tests/streams/groupby_test.cpp
1. Removed unnecessary headers.
2. Adjusted the use of cudf::test::get_default_stream() in groupby_test_util, and streams/groupby_test.
3. Documented aggregation_based_rolling_window() and reduction_based_rolling_window().
@mythrocks mythrocks requested review from bdice and vyasr July 25, 2023 18:30
@mythrocks

Thanks for the review, @vyasr. I just ran an updated test with spark-rapids, to confirm that things are in shape.

@bdice, I should now have addressed the concerns you'd raised in your review. I'd very much appreciate it if you'd have another look.

@bdice bdice left a comment

Thanks @mythrocks. Two comments for forward-looking changes, otherwise approving.

cpp/include/cudf/rolling.hpp
mythrocks added a commit to mythrocks/spark-rapids that referenced this pull request Jul 25, 2023
Follow-up to rapidsai/cudf#13727.

This change addresses the slowness in window aggregations for windows defined as
`[UNBOUNDED PRECEDING, UNBOUNDED FOLLOWING]`. Before this change, unbounded row
window bounds were interpreted as finite values, e.g. `[MAX_INT, MAX_INT]`.
While this might be technically indistinguishable from a fully unbounded window,
it causes the optimization in rapidsai/cudf/pull/13727 not to be triggered,
because the window bounds are still finite.

The change in this PR allows the plugin to detect unbounded windows, and mark
them as such for `libcudf`. The `libcudf` window function primitives can then
detect fully unbounded windows, and use a faster/optimized path for execution.

Preliminary test results indicate that `[UNBOUNDED PRECEDING, UNBOUNDED FOLLOWING]`
window function computations over 1B rows and thousands of groups are sped up
by a factor of 10-14x over the previous/naive GPU implementation.

Signed-off-by: MithunR <mythrocks@gmail.com>
@mythrocks

/merge

@rapids-bot rapids-bot bot merged commit ecb20a4 into rapidsai:branch-23.08 Jul 26, 2023
@mythrocks

This change has been merged. Thank you all for your reviews and advice.

mythrocks added a commit to NVIDIA/spark-rapids that referenced this pull request Jul 27, 2023
Labels
2 - In Progress Currently a work in progress CMake CMake build issue improvement Improvement / enhancement to an existing function Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS