[FEA] Share code between `left_semi_anti_join`, `cudf::contains`, and set operations
#11037
Comments
I haven't looked closely, but my suspicion is the So I'd be more inclined to use that as the basis upon which
I don't think it would need to start from scratch. I think it would just be a matter of updating the equality/hashing operators used. @vyasr wrote the code that is there today and should be able to help.
Updating the equality/hashing operators to use the solution you suggested (#10656 (comment)) is almost starting from scratch 😃. In summary, to have *-joins and
We can do these steps in parallel. Edit: I just realized that *_joins can simply work around the null equality check while building the gather map. So I'm going to push a PR refactoring
Is it? This looks awfully close to what I already described: cudf/cpp/src/join/semi_join.cu Lines 92 to 152 in fe9a4f8
I would like the join implementation to be the source of truth. The implementation I described is very similar to how
No, that implementation uses the old row comparator, which doesn't support nested types. If we want nested types, that implementation needs to be modified heavily to something like what we just discussed in the
The problem is that *-joins first check for existence (using the same result generated from If you want to keep the
So, in summary, we have 2 options:
What do you think? I'm fine with both, although I prefer the first one.
@ttnghia I see your point. The I think there are two conflicting considerations here.
Given these two considerations I think the second option that you outlined makes more sense. The function that you call
If repeated lookups are valuable to optimize, we could extend the
Okay then I'll implement Now I even think of this as an extension of the set-like operations. Semi-join is something like
To me that sounds like the last piece that I suggested: we may eventually want to implement something like a
A (left) semi-join between the left and right tables returns the set of rows in the left table that have matching rows (i.e., rows that compare equal) in the right table. As such, for each row in the left table, it needs to check whether that row has a match in the right table. This check is generic and has applications in many other places, not just in semi-join. This PR exposes that check as a new `cudf::detail::contains(table_view, table_view)` for internal usage.
Closes #11037.
Depends on:
* NVIDIA/cuCollections#175
Authors:
- Nghia Truong (https://github.com/ttnghia)
Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Yunsong Wang (https://github.com/PointKernel)
URL: #11100
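The check described in the PR text above can be sketched on the host in plain C++. This is only an illustration under stated assumptions: `Row`, `Table`, and both functions below are hypothetical stand-ins rather than cudf APIs, and the (haystack, needles) argument order is assumed. The real `cudf::detail::contains` takes `table_view`s and is not reproduced here.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical host-side stand-ins for illustration; these are NOT cudf types.
using Row   = std::vector<int>;  // one row holds one value per column
using Table = std::vector<Row>;  // row-major table

// For each row of `needles`, report whether an equal row exists in `haystack`.
std::vector<bool> contains(Table const& haystack, Table const& needles)
{
  // std::set orders rows lexicographically, which is enough for equality lookups.
  std::set<Row> const lookup(haystack.begin(), haystack.end());
  std::vector<bool> result;
  result.reserve(needles.size());
  for (auto const& row : needles) { result.push_back(lookup.count(row) > 0); }
  return result;
}

// Left semi-join gather map: indices of the left rows that have a match in `right`.
std::vector<std::size_t> left_semi_join(Table const& left, Table const& right)
{
  auto const found = contains(right, left);
  std::vector<std::size_t> gather_map;
  for (std::size_t i = 0; i < found.size(); ++i) {
    if (found[i]) { gather_map.push_back(i); }
  }
  return gather_map;
}
```

With `left = {{1, 2}, {3, 4}, {5, 6}, {1, 2}}` and `right = {{5, 6}, {1, 2}, {7, 8}}`, the mask marks rows 0, 2, and 3 of `left` as present, and the gather map is `{0, 2, 3}`.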
This extends the `cudf::contains` API to support nested types (lists + structs) with arbitrarily nested levels. As such, `cudf::contains` will work with literally any type of input data. In addition, this fixes null handling of `cudf::contains` with a structs column + struct scalar input when the structs column contains null rows at the top level while the scalar key is valid but all null at the children levels.
Closes: #8965
Depends on:
* #10730
* #10883
* #10802
* #10997
* NVIDIA/cuCollections#172
* NVIDIA/cuCollections#173
* #11037
* #11356
Authors:
- Nghia Truong (https://github.com/ttnghia)
- Devavret Makkar (https://github.com/devavret)
- Bradley Dice (https://github.com/bdice)
- Karthikeyan (https://github.com/karthikeyann)
Approvers:
- AJ Schmidt (https://github.com/ajschmidt8)
- Bradley Dice (https://github.com/bdice)
- Yunsong Wang (https://github.com/PointKernel)
URL: #10656
This extends the `lists::contains` API to support nested types (lists + structs) with arbitrarily nested levels. As such, `lists::contains` will work with literally any type of input data. In addition, the related implementation has been significantly refactored to make adding new implementations easier.
Closes #8958.
Depends on:
* #10730
* #10883
* #10999
* #11019
* #11037
Authors:
- Nghia Truong (https://github.com/ttnghia)
- Bradley Dice (https://github.com/bdice)
Approvers:
- MithunR (https://github.com/mythrocks)
- Bradley Dice (https://github.com/bdice)
URL: #10548
The semi- and anti-join operations rely on building a gather map by checking whether rows in one table exist in the other table. The same check is also used in `cudf::contains` and set operations. We should share code between them to avoid duplicate implementations.
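As a rough sketch of what sharing could look like, a single existence mask can drive semi-join, anti-join, and set-like operations. Everything here is hypothetical and host-side: `exists_in`, `gather_map`, and `set_op` are illustrative names, keys are plain `int`s, and nulls and nested types are ignored. It is not how cudf implements these operations.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical shared check (not a cudf function):
// mask[i] is true when needles[i] occurs anywhere in `haystack`.
std::vector<bool> exists_in(std::vector<int> const& haystack, std::vector<int> const& needles)
{
  std::unordered_set<int> const lookup(haystack.begin(), haystack.end());
  std::vector<bool> mask;
  mask.reserve(needles.size());
  for (int key : needles) { mask.push_back(lookup.count(key) > 0); }
  return mask;
}

// keep_matches == true gives a left semi-join gather map,
// keep_matches == false gives a left anti-join gather map.
std::vector<std::size_t> gather_map(std::vector<int> const& left,
                                    std::vector<int> const& right,
                                    bool keep_matches)
{
  auto const mask = exists_in(right, left);
  std::vector<std::size_t> indices;
  for (std::size_t i = 0; i < mask.size(); ++i) {
    if (mask[i] == keep_matches) { indices.push_back(i); }
  }
  return indices;
}

// Set intersection (keep_matches == true) or set difference (false):
// gather through the same map, then drop duplicate values.
std::vector<int> set_op(std::vector<int> const& lhs, std::vector<int> const& rhs, bool keep_matches)
{
  std::unordered_set<int> seen;
  std::vector<int> out;
  for (auto const i : gather_map(lhs, rhs, keep_matches)) {
    if (seen.insert(lhs[i]).second) { out.push_back(lhs[i]); }
  }
  return out;
}
```

The only difference between the four operations in this sketch is whether matches are kept or dropped and whether duplicates are removed afterwards, which is why one shared check is enough.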
Depends on:
* `pair_contains` in `static_map` and `static_multimap`: NVIDIA/cuCollections#173
* `pair_contains_if` in `static_map` and `static_multimap`: NVIDIA/cuCollections#176