[FEA] Support set like operators intersect, union, and difference on lists #10409
Given that this is only ever relevant on lists, I'd advocate for separate APIs.
This issue has been labeled …
We had another request come in for Spark's array_overlap, which is very much like an …
When it comes to implementing this, this could be an interesting use case for per-warp, shared-memory hash maps from cuco.
I would also like to mention that …
@ttnghia I have an idea for a different algorithm that would hopefully be better load-balanced than the one-list-per-warp method. But I don't know whether it will be performant versus the latter for lists with balanced sizes.
This PR adds a small (detail) API that generates group labels from a given offset array `offsets`. The output is an array containing consecutive groups of identical labels, where the number of elements in each group `i` is defined by `offsets[i+1] - offsets[i]`. Examples:

```
offsets = [0, 4, 6, 10]
output  = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]

offsets = [5, 10, 12]
output  = [0, 0, 0, 0, 0, 1, 1]
```

Note that the label values always start from `0`. We could in fact add a parameter to allow specifying any starting value, but we don't need it for now. Several places in cudf have been updated to adopt the new API immediately. These places have been tested extensively, thus no unit tests for the new API are needed. In addition, I ran a benchmark for groupby aggregations and found no performance difference after adopting this.

Closes #10905 and unblocks #10409.

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Jake Hemstad (https://github.com/jrhemstad)
- Devavret Makkar (https://github.com/devavret)

URL: #10945
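The label-generation semantics can be sketched on the host as a minimal illustration (the actual detail API runs on the device; `labels_from_offsets` is a hypothetical name, not the real function):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side sketch: expand an offsets array into per-element
// group labels, where group i spans offsets[i]..offsets[i+1].
std::vector<int> labels_from_offsets(std::vector<int> const& offsets)
{
  std::vector<int> out(offsets.back() - offsets.front());
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    // Every element in the half-open range [offsets[i], offsets[i+1]) gets label i.
    for (int j = offsets[i]; j < offsets[i + 1]; ++j) {
      out[j - offsets.front()] = static_cast<int>(i);
    }
  }
  return out;
}
```

Note that the output always starts at label `0` even when `offsets` does not start at `0` (as in the second example above).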
@revans2 Do we have any application with maps for these set operations? Previously, for …
@ttnghia Generally, no, we don't need something similar here.
This PR adds the following APIs for set operations:

* `lists::have_overlap`
* `lists::intersect_distinct`
* `lists::union_distinct`
* `lists::difference_distinct`

### Naming Convention

Except for the first API (`lists::have_overlap`), which returns a boolean column, the suffix `_distinct` on the remaining APIs denotes that their results are lists columns in which all list rows have been post-processed to remove duplicates. As such, their results are effectively "set" columns in which each row is a "set" of distinct elements.

---

Depends on:
* #10945
* #11017
* NVIDIA/cuCollections#175
* #11052
* #11118
* #11100
* #11149

Closes #10409.

Authors:
- Nghia Truong (https://github.com/ttnghia)
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Michael Wang (https://github.com/isVoid)
- AJ Schmidt (https://github.com/ajschmidt8)
- Bradley Dice (https://github.com/bdice)
- Yunsong Wang (https://github.com/PointKernel)

URL: #11043
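A host-side sketch of the per-row semantics of the four APIs (the names mirror libcudf's, but these toy functions operate on single rows as `std::vector<int>`; the real APIs operate column-wise on the GPU, with nullability and type dispatch omitted here):

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

using Row = std::vector<int>;

// intersect_distinct: elements present in both rows, duplicates removed.
std::set<int> intersect_distinct(Row const& lhs, Row const& rhs)
{
  std::set<int> l(lhs.begin(), lhs.end()), r(rhs.begin(), rhs.end()), out;
  std::set_intersection(l.begin(), l.end(), r.begin(), r.end(),
                        std::inserter(out, out.end()));
  return out;
}

// union_distinct: elements present in either row, duplicates removed.
std::set<int> union_distinct(Row const& lhs, Row const& rhs)
{
  std::set<int> out(lhs.begin(), lhs.end());
  out.insert(rhs.begin(), rhs.end());
  return out;
}

// difference_distinct: elements in lhs but not in rhs, duplicates removed.
std::set<int> difference_distinct(Row const& lhs, Row const& rhs)
{
  std::set<int> l(lhs.begin(), lhs.end()), r(rhs.begin(), rhs.end()), out;
  std::set_difference(l.begin(), l.end(), r.begin(), r.end(),
                      std::inserter(out, out.end()));
  return out;
}

// have_overlap: true if the two rows share at least one element.
bool have_overlap(Row const& lhs, Row const& rhs)
{
  return !intersect_distinct(lhs, rhs).empty();
}
```

Returning `std::set` here is just a convenient way to express the "distinct" post-processing; the library APIs return lists columns whose output ordering is unspecified.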
Is your feature request related to a problem? Please describe.
We have had a request to support `array_intersect` in Spark, specifically on a list of strings. Because it is similar to the set union and set difference operations that Spark also supports, it would be great to support all of those at once if it is simple.

Describe the solution you'd like
Three new binary ops that would take list columns/scalars as input and do these set like operations.
I suggested binary ops just because they appear to match what we want, but separate APIs for each work too. The order of the output list does not matter.
For Intersect we want a list of elements in the intersection of lhs and rhs without duplicates.
For Union we want a list of elements in the union of lhs and rhs without duplicates.
For difference (which Spark calls "except", for whatever reason) we want a list of the elements in lhs but not in rhs, without duplicates.
For all of these, nulls count as equal to other nulls, and, oddly, NaN counts as equal to other NaNs. If you cannot in good conscience support NaNs as equal, I understand, and we can probably deal with it because we already have a config related to NaNs in Spark for similar reasons. Nulls, however, are much harder to not support.
A null in either the lhs or rhs should result in a null output.
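The requested null/NaN equality rules can be modeled on the host with `std::optional<double>` and a comparator under which all nulls compare equal to each other and all NaNs compare equal to each other (a sketch of the comparison semantics only, not of any cudf API; `SetLess` and `intersect_distinct` are illustrative names):

```cpp
#include <algorithm>
#include <cmath>
#include <iterator>
#include <optional>
#include <set>
#include <vector>

using Elem = std::optional<double>;  // std::nullopt models a null element

// Strict weak ordering: null < finite numbers < NaN, so that
// null == null and NaN == NaN under set-equivalence.
struct SetLess {
  bool operator()(Elem const& a, Elem const& b) const
  {
    if (!a || !b) return !a && b.has_value();        // null sorts first
    bool const a_nan = std::isnan(*a);
    bool const b_nan = std::isnan(*b);
    if (a_nan || b_nan) return !a_nan && b_nan;      // NaN sorts last
    return *a < *b;
  }
};

using RowSet = std::set<Elem, SetLess>;

// Per-row intersection under the null/NaN-equality rules above.
RowSet intersect_distinct(std::vector<Elem> const& lhs, std::vector<Elem> const& rhs)
{
  RowSet l(lhs.begin(), lhs.end()), r(rhs.begin(), rhs.end()), out;
  std::set_intersection(l.begin(), l.end(), r.begin(), r.end(),
                        std::inserter(out, out.end()), SetLess{});
  return out;
}
```

Under these rules, intersecting `[1.0, NaN, null]` with `[NaN, null, 2.0]` yields `{null, NaN}`, because the null and NaN elements match each other while `1.0` and `2.0` do not.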
Describe alternatives you've considered
None really. I mean, we might be able to play some games for union by doing a sequence followed by a group-by with a count, then sorting the results by the sequence again, along with a reduction to produce the offsets, but it gets really complicated really fast. For intersection I am not sure; we would probably have to do something with a left-anti join instead, but none of those would work with nulls properly.
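The union-via-group-by workaround described above can be sketched on the host: tag each element of the concatenated lhs+rhs lists with its row index (the "sequence"), sort, drop duplicate (row, value) pairs, then count per row to rebuild the offsets. All names here are hypothetical, and the real work would need device algorithms:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch: build union_distinct for all rows at once from flattened input.
// labels[i] is the row index of values[i]; both lhs and rhs elements of a
// row carry the same label. Returns (flattened values, offsets).
std::pair<std::vector<int>, std::vector<int>>
union_via_sort(std::vector<int> const& labels, std::vector<int> const& values, int num_rows)
{
  // Tag each value with its row label so sorting keeps rows together.
  std::vector<std::pair<int, int>> tagged;
  for (std::size_t i = 0; i < values.size(); ++i) {
    tagged.emplace_back(labels[i], values[i]);
  }
  std::sort(tagged.begin(), tagged.end());
  // Duplicate (row, value) pairs are now adjacent; drop them.
  tagged.erase(std::unique(tagged.begin(), tagged.end()), tagged.end());

  // Count surviving elements per row, then prefix-sum to get offsets.
  std::vector<int> out_vals;
  std::vector<int> offsets(num_rows + 1, 0);
  for (auto const& [row, val] : tagged) {
    out_vals.push_back(val);
    ++offsets[row + 1];
  }
  for (int i = 0; i < num_rows; ++i) {
    offsets[i + 1] += offsets[i];
  }
  return {out_vals, offsets};
}
```

Even this simplified form hints at the complexity: it needs a stable notion of row identity through the sort, and it silently assumes comparable elements, which is exactly where the null/NaN handling described earlier falls apart.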
Additional context
Intersect is the highest priority, specifically intersect of a list of strings.