[FEA] Share code between `left_semi_anti_join`, `cudf::contains`, and set operations #11037

ttnghia · 2022-06-03T05:01:59Z

The semi- and anti- joins operations are operations that rely on building a gather map by checking if rows in one table exist in the other table.

Such checking is also being used in cudf::contains and set operations.

We should share code between them to avoid duplicate implementation.

Depends on:

The text was updated successfully, but these errors were encountered:

jrhemstad · 2022-06-03T14:05:05Z

I haven't looked closely, but my suspicion is the semi_join implementation is already closer to the desired implementation I described here: #10656 (comment)

So I'd be more inclined to use that as the basis upon which contains is implemented, but I don't have strong feelings here.

ttnghia · 2022-06-03T14:29:16Z

semi-join currently uses static_map which in turn uses the old row comparator that doesn't support both nested data types and strong index types. So basically we can't reuse the code in *-join implementation but have to start from scratch with the solution as you just suggested.

jrhemstad · 2022-06-03T15:32:59Z

So basically we can't reuse the code in *-join implementation but have to start from scratch with the solution as you just suggested

I don't think it would need to start from scratch. I think it would just be a matter of updating the equality/hashing operators used.

@vyasr wrote the code that is there today and should be able to help.

ttnghia · 2022-06-03T15:40:55Z

Updating the equality/hashing operators to use the solution you suggested (#10656 (comment)) is almost starting from scratch 😃. In summary, to have *-joins and cudf::contains share their implementation, we need:

Support nested types in cudf::contains (this is what cudf::contains wants), and
~~Add null_equality parameter in cudf::contains (this is what *-joins need)~~

We can do these steps in parallel.

Edit: Just realize that *_joins can just work around null equality check during building the gather map. So I'm going to push a PR refactoring semi_anti_join to call cudf::contains. I'll also run benchmarks comparing the performance before and after the transition.

jrhemstad · 2022-06-03T16:39:38Z

Updating the equality/hashing operators to use the solution you suggested (#10656 (comment)) is almost starting from scratch

Is it? This looks awfully close to what I already described:

cudf/cpp/src/join/semi_join.cu

Lines 92 to 152 in fe9a4f8

    
           // Create hash table. 
        
           semi_map_type hash_table{compute_hash_table_size(right_num_rows), 
        
                                    std::numeric_limits<hash_value_type>::max(), 
        
                                    cudf::detail::JoinNoneValue, 
        
                                    hash_table_allocator_type{default_allocator<char>{}, stream}, 
        
                                    stream.value()}; 
        
           // Create hash table containing all keys found in right table 
        
           auto right_rows_d      = table_device_view::create(right_flattened_keys, stream); 
        
           auto const right_nulls = cudf::nullate::DYNAMIC{cudf::has_nulls(right_flattened_keys)}; 
        
           row_hash const hash_build{right_nulls, *right_rows_d}; 
        
           row_equality equality_build{right_nulls, *right_rows_d, *right_rows_d, compare_nulls}; 
        
           make_pair_fn pair_func_build{}; 
        
           auto iter = cudf::detail::make_counting_transform_iterator(0, pair_func_build); 
        
           // skip rows that are null here. 
        
           if ((compare_nulls == null_equality::EQUAL) or (not nullable(right_keys))) { 
        
             hash_table.insert(iter, iter + right_num_rows, hash_build, equality_build, stream.value()); 
        
           } else { 
        
             thrust::counting_iterator<size_type> stencil(0); 
        
             auto const [row_bitmask, _] = cudf::detail::bitmask_and(right_flattened_keys, stream); 
        
             row_is_valid pred{static_cast<bitmask_type const*>(row_bitmask.data())}; 
        
             // insert valid rows 
        
             hash_table.insert_if( 
        
               iter, iter + right_num_rows, stencil, pred, hash_build, equality_build, stream.value()); 
        
           } 
        
           // Now we have a hash table, we need to iterate over the rows of the left table 
        
           // and check to see if they are contained in the hash table 
        
           auto left_rows_d      = table_device_view::create(left_flattened_keys, stream); 
        
           auto const left_nulls = cudf::nullate::DYNAMIC{cudf::has_nulls(left_flattened_keys)}; 
        
           row_hash hash_probe{left_nulls, *left_rows_d}; 
        
           // Note: This equality comparator violates symmetry of equality and is 
        
           // therefore relying on the implementation detail of the order in which its 
        
           // operator is invoked. If cuco makes no promises about the order of 
        
           // invocation this seems a bit unsafe. 
        
           row_equality equality_probe{left_nulls, *right_rows_d, *left_rows_d, compare_nulls}; 
        
           // For semi join we want contains to be true, for anti join we want contains to be false 
        
           bool const join_type_boolean = (kind == join_kind::LEFT_SEMI_JOIN); 
        
           auto hash_table_view = hash_table.get_device_view(); 
        
           auto gather_map = 
        
             std::make_unique<rmm::device_uvector<cudf::size_type>>(left_num_rows, stream, mr); 
        
           rmm::device_uvector<bool> flagged(left_num_rows, stream, mr); 
        
           auto flagged_d = flagged.data(); 
        
           auto counting_iter = thrust::counting_iterator<size_type>(0); 
        
           thrust::for_each( 
        
             rmm::exec_policy(stream), 
        
             counting_iter, 
        
             counting_iter + left_num_rows, 
        
             [flagged_d, hash_table_view, join_type_boolean, hash_probe, equality_probe] __device__( 
        
               const size_type idx) { 
        
               flagged_d[idx] = 
        
                 hash_table_view.contains(idx, hash_probe, equality_probe) == join_type_boolean; 
        
             });

I would like the join implementation to be the source of truth.

The implementation I described is very similar to how inner_join/left_join are also implemented. It would be incongruous to have the implementation for semi_join elsewhere when in it is so similar to the implementations for the other joins.

ttnghia · 2022-06-03T16:47:25Z

Is it? This looks awfully close to what I already described:

No, that implementation uses the old row comparator, which doesn't support nested types. If we want nested types, that implementation needs to be modified heavily to something like we just talked in the cudf::contains PR (#10656 (comment)).

It would be incongruous to have the implementation for semi_join elsewhere when in it is so similar to the implementations for the other joins.

The problem is that, *-joins firstly check for existence (using the same result generated from cudf::contains) then build a gather_map from that result. So we can use cudf::contains as a component to implement *-join.

If you want to keep the contains implementation inside *-join, then we will have cudf::contains internally call an external function like detail::semi_anti_join_contains which seems a bit awkward to me.

ttnghia · 2022-06-03T16:51:13Z

So, in summary, we have 2 options:

Having *-joins call cudf::contains:

auto left_semi_anti_join(cudf::table_view const& left_keys, cudf::table_view const& right_keys,...) {

  auto contains_check =  cudf::contains(left_keys, right_keys,....);
  // build gather map from contains_check
  return gather_map;
}

Having cudf::contains calls detail::semi_anti_join_contains (which is extracted from semi_anti_join):

std::unique_ptr<column> contains(column_view const& haystack,
                                 column_view const& needles,
                                 rmm::cuda_stream_view stream,
                                 rmm::mr::device_memory_resource* mr)
{
  return cudf::detail::semi_anti_join_contains(haystack, needles, default_stream, mr);
}

What do you think? I'm fine with both although I prefer the first one.

vyasr · 2022-06-07T17:58:28Z

@ttnghia I see your point. The semi_join implementation is essentially 1) build a hash table on the right table, 2) probe the hash table with the left table entries and generate a boolean array indicating whether each row in left matches anything in right, and 3) perform a stream compaction using the boolean array to generate a gather map, which is returned.

I think there are two conflicting considerations here.

Steps 1+2 from above are essentially contains. That suggests that semi_join should depend on contains.
contains is a single-column search, i.e. the needles are scalars (perhaps of nested types) and the haystack is a column, whereas semi_join needs to support multiple columns. That suggests that contains should depend on semi_join.

Given these two considerations I think the second option that you outlined makes more sense. The function that you call detail::semi_anti_join_contains would need to have the signature

std::unique_ptr<column> contains(table_view const& haystack,
                                 table_view const& needles,
                                 rmm::cuda_stream_view stream,
                                 rmm::mr::device_memory_resource* mr)

If repeated lookups are valuable to optimize, we could extend the hash_join object to support holding either a static_multimap or a static_map as a way to also incorporate the semi_/anti_join logic into the same framework as the other joins. That would allow us to expose the containment check as a public API of hash_join object. That's probably premature at this stage however; implementing the above contains(table_view, table_view) helper function should be sufficient for now.

ttnghia · 2022-06-07T18:08:16Z

Okay then I'll implement contains inside semi_anti_join.

Now I even think of this as an extension of the set-like operations. Semi-join is something like set_intersect and anti-join is set_difference (but all with duplicates?). We may end up in the future to refactor it again, moving the implementation to reduction set-like operations 😃 .

vyasr · 2022-06-07T18:43:58Z

To me that sounds like the last piece that I suggested: we may eventually want to implement something like a static_map version of the current hash_join object (which uses static_multimap for the other types of joins) in order to expose the hash lookup component for querying just the contains check. Using it for a single column would be suitable for contains or the various set-like operations (IIRC those are also all single columns of lists) while using it for multiple columns would be for semi/anti joins. I don't remember enough of @devavret's idea for #10409 to know if that would be consistent with what he had in mind, though.

A (left) semi-join between the left and right tables returns a set of rows in the left table that has matching rows (i.e., compared equally) in the right table. As such, for each row in the left table, it needs to check if that row has a match in the right table. Such check is very generic and has applications in many other places, not just in semi-join. This PR exposes that check functionality as a new `cudf::detail::contains(table_view, table_view)` for internal usage. Closes #11037. Depends on: * NVIDIA/cuCollections#175 Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Yunsong Wang (https://github.com/PointKernel) URL: #11100

This extends the `cudf::contains` API to support nested types (lists + structs) with arbitrarily nested levels. As such, `cudf::contains` will work with literally any type of input data. In addition, this fixes null handling of `cudf::contains` with structs column + struct scalar input when the structs column contains null rows at the top level while the scalar key is valid but all nulls at children levels. Closes: #8965 Depends on: * #10730 * #10883 * #10802 * #10997 * NVIDIA/cuCollections#172 * NVIDIA/cuCollections#173 * #11037 * #11356 Authors: - Nghia Truong (https://github.com/ttnghia) - Devavret Makkar (https://github.com/devavret) - Bradley Dice (https://github.com/bdice) - Karthikeyan (https://github.com/karthikeyann) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) URL: #10656

This extends the `lists::contains` API to support nested types (lists + structs) with arbitrarily nested levels. As such, `lists::contains` will work with literally any type of input data. In addition, the related implementation has been significantly refactored to facilitate adding new implementation. Closes #8958. Depends on: * #10730 * #10883 * #10999 * #11019 * #11037 Authors: - Nghia Truong (https://github.com/ttnghia) - Bradley Dice (https://github.com/bdice) Approvers: - MithunR (https://github.com/mythrocks) - Bradley Dice (https://github.com/bdice) URL: #10548

ttnghia added feature request New feature or request Needs Triage Need team to review and classify labels Jun 3, 2022

ttnghia mentioned this issue Jun 3, 2022

Fully support nested types in cudf::contains #10656

Merged

ttnghia mentioned this issue Jun 3, 2022

[FEA] Support null_equality parameter in cudf::contains #11041

Closed

ttnghia changed the title ~~[FEA] Use cudf::contains in left_semi_anti_join~~ [FEA] Share code between cudf::contains and left_semi_anti_join Jun 10, 2022

ttnghia changed the title ~~[FEA] Share code between cudf::contains and left_semi_anti_join~~ [FEA] Share code between left_semi_anti_join, cudf::contains, and set operations Jun 11, 2022

This was referenced Jun 11, 2022

Support set operations #11043

Merged

Support nested types in lists::contains #10548

Merged

Refactor semi_anti_join #11100

Merged

ttnghia self-assigned this Jun 17, 2022

GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 29, 2022

rapids-bot bot closed this as completed in #11100 Jul 14, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Share code between `left_semi_anti_join`, `cudf::contains`, and set operations #11037

[FEA] Share code between `left_semi_anti_join`, `cudf::contains`, and set operations #11037

ttnghia commented Jun 3, 2022 •

edited

Loading

jrhemstad commented Jun 3, 2022

ttnghia commented Jun 3, 2022 •

edited

Loading

jrhemstad commented Jun 3, 2022

ttnghia commented Jun 3, 2022 •

edited

Loading

jrhemstad commented Jun 3, 2022

ttnghia commented Jun 3, 2022 •

edited

Loading

ttnghia commented Jun 3, 2022 •

edited

Loading

vyasr commented Jun 7, 2022 •

edited

Loading

ttnghia commented Jun 7, 2022 •

edited

Loading

vyasr commented Jun 7, 2022

[FEA] Share code between left_semi_anti_join, cudf::contains, and set operations #11037

[FEA] Share code between left_semi_anti_join, cudf::contains, and set operations #11037

Comments

ttnghia commented Jun 3, 2022 • edited Loading

jrhemstad commented Jun 3, 2022

ttnghia commented Jun 3, 2022 • edited Loading

jrhemstad commented Jun 3, 2022

ttnghia commented Jun 3, 2022 • edited Loading

jrhemstad commented Jun 3, 2022

ttnghia commented Jun 3, 2022 • edited Loading

ttnghia commented Jun 3, 2022 • edited Loading

vyasr commented Jun 7, 2022 • edited Loading

ttnghia commented Jun 7, 2022 • edited Loading

vyasr commented Jun 7, 2022

[FEA] Share code between `left_semi_anti_join`, `cudf::contains`, and set operations #11037

[FEA] Share code between `left_semi_anti_join`, `cudf::contains`, and set operations #11037

ttnghia commented Jun 3, 2022 •

edited

Loading

ttnghia commented Jun 3, 2022 •

edited

Loading

ttnghia commented Jun 3, 2022 •

edited

Loading

ttnghia commented Jun 3, 2022 •

edited

Loading

ttnghia commented Jun 3, 2022 •

edited

Loading

vyasr commented Jun 7, 2022 •

edited

Loading

ttnghia commented Jun 7, 2022 •

edited

Loading