Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Share code between left_semi_anti_join, cudf::contains, and set operations #11037

Closed
ttnghia opened this issue Jun 3, 2022 · 10 comments · Fixed by #11100
Closed

[FEA] Share code between left_semi_anti_join, cudf::contains, and set operations #11037

ttnghia opened this issue Jun 3, 2022 · 10 comments · Fixed by #11100
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@ttnghia
Copy link
Contributor

ttnghia commented Jun 3, 2022

The semi- and anti- joins operations are operations that rely on building a gather map by checking if rows in one table exist in the other table.

Such checking is also being used in cudf::contains and set operations.

We should share code between them to avoid duplicate implementation.


Depends on:

@ttnghia ttnghia added feature request New feature or request Needs Triage Need team to review and classify labels Jun 3, 2022
@jrhemstad
Copy link
Contributor

I haven't looked closely, but my suspicion is the semi_join implementation is already closer to the desired implementation I described here: #10656 (comment)

So I'd be more inclined to use that as the basis upon which contains is implemented, but I don't have strong feelings here.

@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 3, 2022

semi-join currently uses static_map which in turn uses the old row comparator that doesn't support both nested data types and strong index types. So basically we can't reuse the code in *-join implementation but have to start from scratch with the solution as you just suggested.

@jrhemstad
Copy link
Contributor

So basically we can't reuse the code in *-join implementation but have to start from scratch with the solution as you just suggested

I don't think it would need to start from scratch. I think it would just be a matter of updating the equality/hashing operators used.

@vyasr wrote the code that is there today and should be able to help.

@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 3, 2022

Updating the equality/hashing operators to use the solution you suggested (#10656 (comment)) is almost starting from scratch 😃. In summary, to have *-joins and cudf::contains share their implementation, we need:

  • Support nested types in cudf::contains (this is what cudf::contains wants), and
  • Add null_equality parameter in cudf::contains (this is what *-joins need)

We can do these steps in parallel.

Edit: Just realize that *_joins can just work around null equality check during building the gather map. So I'm going to push a PR refactoring semi_anti_join to call cudf::contains. I'll also run benchmarks comparing the performance before and after the transition.

@jrhemstad
Copy link
Contributor

Updating the equality/hashing operators to use the solution you suggested (#10656 (comment)) is almost starting from scratch

Is it? This looks awfully close to what I already described:

// Create hash table.
semi_map_type hash_table{compute_hash_table_size(right_num_rows),
std::numeric_limits<hash_value_type>::max(),
cudf::detail::JoinNoneValue,
hash_table_allocator_type{default_allocator<char>{}, stream},
stream.value()};
// Create hash table containing all keys found in right table
auto right_rows_d = table_device_view::create(right_flattened_keys, stream);
auto const right_nulls = cudf::nullate::DYNAMIC{cudf::has_nulls(right_flattened_keys)};
row_hash const hash_build{right_nulls, *right_rows_d};
row_equality equality_build{right_nulls, *right_rows_d, *right_rows_d, compare_nulls};
make_pair_fn pair_func_build{};
auto iter = cudf::detail::make_counting_transform_iterator(0, pair_func_build);
// skip rows that are null here.
if ((compare_nulls == null_equality::EQUAL) or (not nullable(right_keys))) {
hash_table.insert(iter, iter + right_num_rows, hash_build, equality_build, stream.value());
} else {
thrust::counting_iterator<size_type> stencil(0);
auto const [row_bitmask, _] = cudf::detail::bitmask_and(right_flattened_keys, stream);
row_is_valid pred{static_cast<bitmask_type const*>(row_bitmask.data())};
// insert valid rows
hash_table.insert_if(
iter, iter + right_num_rows, stencil, pred, hash_build, equality_build, stream.value());
}
// Now we have a hash table, we need to iterate over the rows of the left table
// and check to see if they are contained in the hash table
auto left_rows_d = table_device_view::create(left_flattened_keys, stream);
auto const left_nulls = cudf::nullate::DYNAMIC{cudf::has_nulls(left_flattened_keys)};
row_hash hash_probe{left_nulls, *left_rows_d};
// Note: This equality comparator violates symmetry of equality and is
// therefore relying on the implementation detail of the order in which its
// operator is invoked. If cuco makes no promises about the order of
// invocation this seems a bit unsafe.
row_equality equality_probe{left_nulls, *right_rows_d, *left_rows_d, compare_nulls};
// For semi join we want contains to be true, for anti join we want contains to be false
bool const join_type_boolean = (kind == join_kind::LEFT_SEMI_JOIN);
auto hash_table_view = hash_table.get_device_view();
auto gather_map =
std::make_unique<rmm::device_uvector<cudf::size_type>>(left_num_rows, stream, mr);
rmm::device_uvector<bool> flagged(left_num_rows, stream, mr);
auto flagged_d = flagged.data();
auto counting_iter = thrust::counting_iterator<size_type>(0);
thrust::for_each(
rmm::exec_policy(stream),
counting_iter,
counting_iter + left_num_rows,
[flagged_d, hash_table_view, join_type_boolean, hash_probe, equality_probe] __device__(
const size_type idx) {
flagged_d[idx] =
hash_table_view.contains(idx, hash_probe, equality_probe) == join_type_boolean;
});

I would like the join implementation to be the source of truth.

The implementation I described is very similar to how inner_join/left_join are also implemented. It would be incongruous to have the implementation for semi_join elsewhere when in it is so similar to the implementations for the other joins.

@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 3, 2022

Is it? This looks awfully close to what I already described:

No, that implementation uses the old row comparator, which doesn't support nested types. If we want nested types, that implementation needs to be modified heavily to something like we just talked in the cudf::contains PR (#10656 (comment)).

It would be incongruous to have the implementation for semi_join elsewhere when in it is so similar to the implementations for the other joins.

The problem is that, *-joins firstly check for existence (using the same result generated from cudf::contains) then build a gather_map from that result. So we can use cudf::contains as a component to implement *-join.

If you want to keep the contains implementation inside *-join, then we will have cudf::contains internally call an external function like detail::semi_anti_join_contains which seems a bit awkward to me.

@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 3, 2022

So, in summary, we have 2 options:

  1. Having *-joins call cudf::contains:
auto left_semi_anti_join(cudf::table_view const& left_keys, cudf::table_view const& right_keys,...) {

  auto contains_check =  cudf::contains(left_keys, right_keys,....);
  // build gather map from contains_check
  return gather_map;
}
  1. Having cudf::contains calls detail::semi_anti_join_contains (which is extracted from semi_anti_join):
std::unique_ptr<column> contains(column_view const& haystack,
                                 column_view const& needles,
                                 rmm::cuda_stream_view stream,
                                 rmm::mr::device_memory_resource* mr)
{
  return cudf::detail::semi_anti_join_contains(haystack, needles, default_stream, mr);
}

What do you think? I'm fine with both although I prefer the first one.

@vyasr
Copy link
Contributor

vyasr commented Jun 7, 2022

@ttnghia I see your point. The semi_join implementation is essentially 1) build a hash table on the right table, 2) probe the hash table with the left table entries and generate a boolean array indicating whether each row in left matches anything in right, and 3) perform a stream compaction using the boolean array to generate a gather map, which is returned.

I think there are two conflicting considerations here.

  1. Steps 1+2 from above are essentially contains. That suggests that semi_join should depend on contains.
  2. contains is a single-column search, i.e. the needles are scalars (perhaps of nested types) and the haystack is a column, whereas semi_join needs to support multiple columns. That suggests that contains should depend on semi_join.

Given these two considerations I think the second option that you outlined makes more sense. The function that you call detail::semi_anti_join_contains would need to have the signature

std::unique_ptr<column> contains(table_view const& haystack,
                                 table_view const& needles,
                                 rmm::cuda_stream_view stream,
                                 rmm::mr::device_memory_resource* mr)

If repeated lookups are valuable to optimize, we could extend the hash_join object to support holding either a static_multimap or a static_map as a way to also incorporate the semi_/anti_join logic into the same framework as the other joins. That would allow us to expose the containment check as a public API of hash_join object. That's probably premature at this stage however; implementing the above contains(table_view, table_view) helper function should be sufficient for now.

@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 7, 2022

Okay then I'll implement contains inside semi_anti_join.

Now I even think of this as an extension of the set-like operations. Semi-join is something like set_intersect and anti-join is set_difference (but all with duplicates?). We may end up in the future to refactor it again, moving the implementation to reduction set-like operations 😃 .

@vyasr
Copy link
Contributor

vyasr commented Jun 7, 2022

To me that sounds like the last piece that I suggested: we may eventually want to implement something like a static_map version of the current hash_join object (which uses static_multimap for the other types of joins) in order to expose the hash lookup component for querying just the contains check. Using it for a single column would be suitable for contains or the various set-like operations (IIRC those are also all single columns of lists) while using it for multiple columns would be for semi/anti joins. I don't remember enough of @devavret's idea for #10409 to know if that would be consistent with what he had in mind, though.

@ttnghia ttnghia changed the title [FEA] Use cudf::contains in left_semi_anti_join [FEA] Share code between cudf::contains and left_semi_anti_join Jun 10, 2022
@ttnghia ttnghia changed the title [FEA] Share code between cudf::contains and left_semi_anti_join [FEA] Share code between left_semi_anti_join, cudf::contains, and set operations Jun 11, 2022
@ttnghia ttnghia self-assigned this Jun 17, 2022
@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 29, 2022
rapids-bot bot pushed a commit that referenced this issue Jul 14, 2022
A (left) semi-join between the left and right tables returns a set of rows in the left table that has matching rows (i.e., compared equally) in the right table. As such, for each row in the left table, it needs to check if that row has a match in the right table. 

Such check is very generic and has applications in many other places, not just in semi-join. This PR exposes that check functionality as a new `cudf::detail::contains(table_view, table_view)` for internal usage.

Closes #11037.

Depends on:
 * NVIDIA/cuCollections#175

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #11100
rapids-bot bot pushed a commit that referenced this issue Aug 17, 2022
This extends the `cudf::contains` API to support nested types (lists + structs) with arbitrarily nested levels. As such, `cudf::contains` will work with literally any type of input data.

In addition, this fixes null handling of `cudf::contains` with structs column + struct scalar input when the structs column contains null rows at the top level while the scalar key is valid but all nulls at children levels.

Closes: #8965
Depends on:
 * #10730
 * #10883
 * #10802
 * #10997
 * NVIDIA/cuCollections#172
 * NVIDIA/cuCollections#173
 * #11037
 * #11356

Authors:
  - Nghia Truong (https://github.com/ttnghia)
  - Devavret Makkar (https://github.com/devavret)
  - Bradley Dice (https://github.com/bdice)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #10656
rapids-bot bot pushed a commit that referenced this issue Aug 22, 2022
This extends the `lists::contains` API to support nested types (lists + structs) with arbitrarily nested levels. As such, `lists::contains` will work with literally any type of input data.

In addition, the related implementation has been significantly refactored to facilitate adding new implementation.

Closes #8958.
Depends on:
 * #10730
 * #10883
 * #10999
 * #11019
 * #11037

Authors:
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Bradley Dice (https://github.com/bdice)

URL: #10548
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants