
[FEA] Support set like operators intersect, union, and difference on lists #10409

Closed
revans2 opened this issue Mar 10, 2022 · 8 comments · Fixed by #11043
Assignees
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Contributor

revans2 commented Mar 10, 2022

Is your feature request related to a problem? Please describe.
We have had a request to support array_intersect in Spark specifically on a list of strings. Because it is similar to set union and set difference operations that Spark also supports it would be great to support all of those at once if it is simple.

Describe the solution you'd like
Three new binary ops that would take list columns/scalars as input and do these set like operations.

I want binary ops just because they appear to match with what we want, but separate APIs for each works too. The order of the output list does not matter.

For Intersect we want a list of elements in the intersection of lhs and rhs without duplicates.
For Union we want a list of elements in the union of lhs and rhs without duplicates.
For difference, which Spark calls `except` for whatever reason, we want a list of the elements in lhs but not in rhs, without duplicates.

For all of these, nulls count as equal to other nulls, and oddly NaN counts as equal to other NaNs. If you cannot in good conscience support NaNs as equal, I understand, and we can probably deal with it because we already have a config related to NaNs in Spark for similar reasons. Nulls, however, are much harder not to support.

A null in either the lhs or rhs should result in a null output.

| lhs | rhs | array_intersect(lhs, rhs) | array_union(lhs, rhs) | array_except(lhs, rhs) |
| --- | --- | --- | --- | --- |
| [NaN, 5.0, 0.0, 0.0, 0.0, 0.0, null, 0.0] | [1.0, 0.5, null, 0.0, 0.0, null, NaN] | [NaN, 0.0, null] | [NaN, 5.0, 0.0, null, 1.0, 0.5] | [5.0] |
| [NaN, 5.0, 0.0, 0.0, 0.0, 0.0, null, 1.0] | [2.0, 1.0, null, 0.0, 0.0, null] | [0.0, null, 1.0] | [NaN, 5.0, 0.0, null, 1.0, 2.0] | [NaN, 5.0] |
| null | [2.0, 1.0, null, 0.0, 0.0, null] | null | null | null |
| [NaN, 5.0, 0.0, 0.0, 0.0, 0.0, null, 1.0] | null | null | null | null |
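The semantics described above (nulls equal to nulls, NaNs equal to NaNs, a null input list yielding a null output, duplicates dropped) can be sketched in plain Python. This is only an illustration of the requested behavior, not how libcudf would implement it; the function names mirror Spark's for readability.

```python
import math

_NAN = object()  # sentinel so every NaN maps to one hashable key


def _key(x):
    """Canonical key for an element: all NaNs collapse together; None stays None."""
    if isinstance(x, float) and math.isnan(x):
        return _NAN
    return x


def _distinct(xs):
    """Drop duplicates, keeping the first occurrence of each element."""
    seen, out = set(), []
    for x in xs:
        k = _key(x)
        if k not in seen:
            seen.add(k)
            out.append(x)
    return out


def array_intersect(lhs, rhs):
    if lhs is None or rhs is None:  # a null input list yields a null output
        return None
    rkeys = {_key(x) for x in rhs}
    return _distinct(x for x in lhs if _key(x) in rkeys)


def array_union(lhs, rhs):
    if lhs is None or rhs is None:
        return None
    return _distinct(list(lhs) + list(rhs))


def array_except(lhs, rhs):
    if lhs is None or rhs is None:
        return None
    rkeys = {_key(x) for x in rhs}
    return _distinct(x for x in lhs if _key(x) not in rkeys)
```

Running this against the first row of the table above reproduces the expected results, including the NaN and null matches.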

Describe alternatives you've considered
None really. For union we might be able to play some games by doing a sequence followed by a group-by with a count, then sorting the results by the sequence again, along with a reduction to produce the offsets, but it gets really complicated really fast. For intersection I am not sure; we probably would have to do something with a left-anti join instead, but none of those would handle nulls properly.

Additional context
Intersect is the highest priority, specifically intersect of a list of strings.

@jrhemstad
Contributor

> I want binary ops just because they appear to match with what we want, but separate APIs for each works too

Given that this is only ever relevant on lists, I'd advocate for separate APIs.

@github-actions

github-actions bot commented Apr 9, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@revans2
Contributor Author

revans2 commented Apr 12, 2022

We had another request come in for Spark's `array_overlap`, which is very much like `len(array_intersect) > 0`. Eventually it would be nice to have a dedicated operator for this, but for now I am happy to implement it this way. The one thing we would need is an option to say whether nulls are considered equal or not, because `array_overlap` checks for a non-null value. We could also filter out the nulls using a segmented gather if null equality is a problem.
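The overlap check and the null-equality knob described here can be sketched in Python; `have_overlap` and its `nulls_equal` parameter are illustrative names under the semantics in this comment, not a final API.

```python
import math

_NAN = object()  # sentinel so NaN can match NaN


def _overlap_keys(xs, nulls_equal):
    """Hashable keys for the elements of xs; nulls are kept only if they may match."""
    keys = set()
    for x in xs:
        if x is None:
            if nulls_equal:
                keys.add(None)
        elif isinstance(x, float) and math.isnan(x):
            keys.add(_NAN)
        else:
            keys.add(x)
    return keys


def have_overlap(lhs, rhs, nulls_equal=False):
    """True when the two lists share at least one matching element."""
    if lhs is None or rhs is None:  # a null input list yields a null output
        return None
    return bool(_overlap_keys(lhs, nulls_equal) & _overlap_keys(rhs, nulls_equal))
```

With `nulls_equal=False`, two lists that only share a null do not overlap, matching the "checks for a non-null value" behavior.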

@jrhemstad jrhemstad added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify inactive-30d labels Apr 12, 2022
@jrhemstad
Contributor

When it comes to implementing this, this could be an interesting use case for per-warp, shared memory hash maps from cuco.

@GregoryKimball
Contributor

GregoryKimball commented Apr 12, 2022

I would also like to mention that array_intersect, array_union and array_except are also Presto functions. We should do a careful comparison as part of implementing this.

@devavret
Contributor

devavret commented May 9, 2022

@ttnghia I have an idea for a different algorithm that would hopefully be better load balanced than the one-list-per-warp method. But I don't know if it will be as performant as the latter for lists that have balanced sizes.

rapids-bot bot pushed a commit that referenced this issue Jun 1, 2022
This PR adds a small (detail) API that generates group labels from a given offset array `offsets`. The output will be an array containing consecutive groups of identical labels, the number of elements in each group `i` is defined by `offsets[i+1] - offsets[i]`.

Examples:
```
offsets = [ 0, 4, 6, 10 ]
output  = [ 0, 0, 0, 0, 1, 1, 2, 2, 2, 2 ]

offsets = [ 5, 10, 12 ]
output  = [ 0, 0, 0, 0, 0, 1, 1 ]
```

Note that the label values always start from `0`. We could in fact add a parameter to allow specifying any starting value, but we don't need it for now.
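The behavior described above can be mimicked on the host with a short Python sketch (the function name here is made up; the real API is a libcudf detail function):

```python
def group_labels_from_offsets(offsets):
    """Emit offsets[i+1] - offsets[i] copies of label i for each group i."""
    labels = []
    for i in range(len(offsets) - 1):
        labels.extend([i] * (offsets[i + 1] - offsets[i]))
    return labels
```

Both examples above follow: `[0, 4, 6, 10]` gives `[0, 0, 0, 0, 1, 1, 2, 2, 2, 2]`, and `[5, 10, 12]` gives `[0, 0, 0, 0, 0, 1, 1]`.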

Several places in cudf have been updated to adopt the new API immediately. These places have been tested extensively, thus no unit tests for the new API are needed. In addition, I ran a benchmark for groupby aggregations and found no performance difference after adopting this.

Closes #10905 and unblocks #10409.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Devavret Makkar (https://github.com/devavret)

URL: #10945
@ttnghia
Contributor

ttnghia commented Jun 2, 2022

@revans2 Do we have any application with map for these set operations?

Previously, for drop_list_duplicates with keys/values, we need a keep option for the duplicate keys (keep first/keep last). Do we still need it again here?

@revans2
Contributor Author

revans2 commented Jun 13, 2022

@ttnghia Generally no, we don't need anything similar here. drop_list_duplicates for Spark maps is a general operation needed for anything that can modify or add keys to a map.

intersect, union, and difference here are used as set like operators. Spark has no official set type so it just uses a list/array for the type and provides operators to work on them. There are no map equivalents to this.

map_zip_with might come close to it, but it is different enough that keep-first vs keep-last would not be what we want. I think we would do a union of the keys and then look up the values for those keys in each map before calling the higher order function.
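As a rough illustration of that last point, the key-union-then-lookup shape can be sketched over plain Python dicts; absent keys see None (null) on the missing side before the higher order function is applied. This is only a sketch of the semantics, not libcudf or Spark code.

```python
def map_zip_with(m1, m2, f):
    """Pair values by the union of both maps' keys, then apply f.

    A key present in only one map sees None (null) for the missing side.
    """
    keys = list(m1) + [k for k in m2 if k not in m1]  # key union, m1 keys first
    return {k: f(m1.get(k), m2.get(k)) for k in keys}
```

For example, zipping `{'a': 1, 'b': 2}` with `{'b': 10, 'c': 20}` visits keys `a`, `b`, `c`, passing a null for the side that lacks the key.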

rapids-bot bot pushed a commit that referenced this issue Jul 26, 2022
This PR adds the following APIs for set operations:
 * `lists::have_overlap`
 * `lists::intersect_distinct`
 * `lists::union_distinct`
 * `lists::difference_distinct`

### Name Convention
Except for the first API (`lists::have_overlap`), which returns a boolean column, the suffix `_distinct` on the remaining APIs denotes that their results are lists columns in which all list rows have been post-processed to remove duplicates. As such, their results are effectively "set" columns in which each row is a "set" of distinct elements.

---

Depends on:
 * #10945
 * #11017
 * NVIDIA/cuCollections#175
 * #11052
 * #11118
 * #11100
 * #11149

Closes #10409.

Authors:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #11043