
[FEA] Support set like operators intersect, union, and difference on lists #10409

Closed
revans2 opened this issue Mar 10, 2022 · 8 comments · Fixed by #11043
Assignees
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Contributor

revans2 commented Mar 10, 2022

Is your feature request related to a problem? Please describe.
We have had a request to support array_intersect in Spark specifically on a list of strings. Because it is similar to set union and set difference operations that Spark also supports it would be great to support all of those at once if it is simple.

Describe the solution you'd like
Three new binary ops that would take list columns/scalars as input and do these set like operations.

I want binary ops just because they appear to match with what we want, but separate APIs for each works too. The order of the output list does not matter.

For Intersect we want a list of elements in the intersection of lhs and rhs without duplicates.
For Union we want a list of elements in the union of lhs and rhs without duplicates.
For difference, which Spark calls `except` for whatever reason, we want a list of the elements in lhs but not in rhs, without duplicates.

For all of these, nulls count as equal to other nulls, and oddly NaN counts as equal to other NaNs. If you cannot in good conscience support NaNs as equal, I understand, and we can probably deal with it because we already have a config related to NaNs in Spark for similar reasons. Nulls, however, are much harder not to support.

A null in either the lhs or rhs should result in a null output.

| lhs | rhs | array_intersect(lhs, rhs) | array_union(lhs, rhs) | array_except(lhs, rhs) |
| --- | --- | --- | --- | --- |
| [NaN, 5.0, 0.0, 0.0, 0.0, 0.0, null, 0.0] | [1.0, 0.5, null, 0.0, 0.0, null, NaN] | [NaN, 0.0, null] | [NaN, 5.0, 0.0, null, 1.0, 0.5] | [5.0] |
| [NaN, 5.0, 0.0, 0.0, 0.0, 0.0, null, 1.0] | [2.0, 1.0, null, 0.0, 0.0, null] | [0.0, null, 1.0] | [NaN, 5.0, 0.0, null, 1.0, 2.0] | [NaN, 5.0] |
| null | [2.0, 1.0, null, 0.0, 0.0, null] | null | null | null |
| [NaN, 5.0, 0.0, 0.0, 0.0, 0.0, null, 1.0] | null | null | null | null |
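The semantics described above (nulls equal to nulls, NaNs equal to NaNs, a null input list yielding a null output, duplicates dropped) can be sketched in plain Python. This is only an illustration of the requested behavior, not how libcudf would implement it; the function names mirror Spark's for readability.

```python
import math

_NAN = object()  # sentinel so every NaN maps to one hashable key


def _key(x):
    """Canonical key for an element: all NaNs collapse together; None stays None."""
    if isinstance(x, float) and math.isnan(x):
        return _NAN
    return x


def _distinct(xs):
    """Drop duplicates, keeping the first occurrence of each element."""
    seen, out = set(), []
    for x in xs:
        k = _key(x)
        if k not in seen:
            seen.add(k)
            out.append(x)
    return out


def array_intersect(lhs, rhs):
    if lhs is None or rhs is None:  # a null input list yields a null output
        return None
    rkeys = {_key(x) for x in rhs}
    return _distinct(x for x in lhs if _key(x) in rkeys)


def array_union(lhs, rhs):
    if lhs is None or rhs is None:
        return None
    return _distinct(list(lhs) + list(rhs))


def array_except(lhs, rhs):
    if lhs is None or rhs is None:
        return None
    rkeys = {_key(x) for x in rhs}
    return _distinct(x for x in lhs if _key(x) not in rkeys)
```

Running this against the first row of the table above reproduces the expected results, including the NaN and null matches.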

Describe alternatives you've considered
None really. For union we might be able to play some games by doing a sequence followed by a group-by with a count, then sorting the results by the sequence again, along with a reduction to produce the offsets, but it gets really complicated really fast. For intersection I am not sure; we probably would have to do something with a left-anti join instead, but none of those would handle nulls properly.

Additional context
Intersect is the highest priority, specifically intersect of a list of strings.

@jrhemstad
Contributor

> I want binary ops just because they appear to match with what we want, but separate APIs for each works too

Given that this is only ever relevant on lists, I'd advocate for separate APIs.

@github-actions

github-actions bot commented Apr 9, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@revans2
Contributor Author

revans2 commented Apr 12, 2022

We had another request come in for Spark's `array_overlap`, which is very much like `len(array_intersect) > 0`. Eventually it would be nice to have a dedicated operator for this, but for now I am happy to implement it this way. The one thing we would need is an option to say whether nulls are considered equal or not, because `array_overlap` checks for a non-null value. We could also filter out the nulls using a segmented gather if null equality is a problem.
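The overlap check and the null-equality knob described here can be sketched in Python; `have_overlap` and its `nulls_equal` parameter are illustrative names under the semantics in this comment, not a final API.

```python
import math

_NAN = object()  # sentinel so NaN can match NaN


def _overlap_keys(xs, nulls_equal):
    """Hashable keys for the elements of xs; nulls are kept only if they may match."""
    keys = set()
    for x in xs:
        if x is None:
            if nulls_equal:
                keys.add(None)
        elif isinstance(x, float) and math.isnan(x):
            keys.add(_NAN)
        else:
            keys.add(x)
    return keys


def have_overlap(lhs, rhs, nulls_equal=False):
    """True when the two lists share at least one matching element."""
    if lhs is None or rhs is None:  # a null input list yields a null output
        return None
    return bool(_overlap_keys(lhs, nulls_equal) & _overlap_keys(rhs, nulls_equal))
```

With `nulls_equal=False`, two lists that only share a null do not overlap, matching the "checks for a non-null value" behavior.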

@jrhemstad jrhemstad added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify inactive-30d labels Apr 12, 2022
@jrhemstad
Contributor

When it comes to implementing this, this could be an interesting use case for per-warp, shared memory hash maps from cuco.

@GregoryKimball
Contributor

GregoryKimball commented Apr 12, 2022

I would also like to mention that array_intersect, array_union and array_except are also Presto functions. We should do a careful comparison as part of implementing this.

@devavret
Contributor

devavret commented May 9, 2022

@ttnghia I have an idea for a different algorithm that would hopefully be better load balanced than the one-list-per-warp method. But I don't know if it will be as performant as the latter for lists that have balanced sizes.

rapids-bot bot pushed a commit that referenced this issue Jun 1, 2022
This PR adds a small (detail) API that generates group labels from a given offset array `offsets`. The output will be an array containing consecutive groups of identical labels, the number of elements in each group `i` is defined by `offsets[i+1] - offsets[i]`.

Examples:
```
offsets = [ 0, 4, 6, 10 ]
output  = [ 0, 0, 0, 0, 1, 1, 2, 2, 2, 2 ]

offsets = [ 5, 10, 12 ]
output  = [ 0, 0, 0, 0, 0, 1, 1 ]
```

Note that the label values always start from `0`. We could in fact add a parameter to allow specifying any starting value, but we don't need it for now.
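The behavior described above can be mimicked on the host with a short Python sketch (the function name here is made up; the real API is a libcudf detail function):

```python
def group_labels_from_offsets(offsets):
    """Emit offsets[i+1] - offsets[i] copies of label i for each group i."""
    labels = []
    for i in range(len(offsets) - 1):
        labels.extend([i] * (offsets[i + 1] - offsets[i]))
    return labels
```

Both examples above follow: `[0, 4, 6, 10]` gives `[0, 0, 0, 0, 1, 1, 2, 2, 2, 2]`, and `[5, 10, 12]` gives `[0, 0, 0, 0, 0, 1, 1]`.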

Several places in cudf have been updated to adopt the new API immediately. These places have been tested extensively, thus no unit tests for the new API are needed. In addition, I ran a benchmark for groupby aggregations and found no performance difference after adopting this.

Closes #10905 and unblocks #10409.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Devavret Makkar (https://github.com/devavret)

URL: #10945
@ttnghia
Contributor

ttnghia commented Jun 2, 2022

@revans2 Do we have any application with map for these set operations?

Previously, for drop_list_duplicates with keys/values, we need a keep option for the duplicate keys (keep first/keep last). Do we still need it again here?

@revans2
Contributor Author

revans2 commented Jun 13, 2022

@ttnghia Generally no, we don't need anything similar here. drop_list_duplicates for Spark maps is a general operation needed for anything that can modify or add keys to a map.

intersect, union, and difference here are used as set like operators. Spark has no official set type so it just uses a list/array for the type and provides operators to work on them. There are no map equivalents to this.

map_zip_with might come close to it, but it is different enough that keep-first vs keep-last would not be what we want. I think we would do a union of the keys and then look up the values for those keys in each map before calling the higher order function.
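As a rough illustration of that last point, the key-union-then-lookup shape can be sketched over plain Python dicts; absent keys see None (null) on the missing side before the higher order function is applied. This is only a sketch of the semantics, not libcudf or Spark code.

```python
def map_zip_with(m1, m2, f):
    """Pair values by the union of both maps' keys, then apply f.

    A key present in only one map sees None (null) for the missing side.
    """
    keys = list(m1) + [k for k in m2 if k not in m1]  # key union, m1 keys first
    return {k: f(m1.get(k), m2.get(k)) for k in keys}
```

For example, zipping `{'a': 1, 'b': 2}` with `{'b': 10, 'c': 20}` visits keys `a`, `b`, `c`, passing a null for the side that lacks the key.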

rapids-bot bot pushed a commit that referenced this issue Jul 26, 2022
This PR adds the following APIs for set operations:
 * `lists::have_overlap`
 * `lists::intersect_distinct`
 * `lists::union_distinct`
 * `lists::difference_distinct`

### Name Convention
Except for the first API (`lists::have_overlap`), which returns a boolean column, the suffix `_distinct` on the remaining APIs denotes that their results are lists columns in which all list rows have been post-processed to remove duplicates. As such, their results are effectively "set" columns in which each row is a "set" of distinct elements.

---

Depends on:
 * #10945
 * #11017
 * NVIDIA/cuCollections#175
 * #11052
 * #11118
 * #11100
 * #11149

Closes #10409.

Authors:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #11043