
Handle arbitrarily different data in null list column rows when checking for equivalency. #8666

Merged: 9 commits into branch-21.08, Jul 20, 2021

Conversation


@nvdbaranec nvdbaranec commented Jul 6, 2021

The column equivalency checking code was not handling a particular corner case properly. Fundamentally, there is no requirement that the offsets or child data for null rows in two list columns be the same. An example:

List<int32_t>:
Length    : 7
Offsets   : 0, 3, 6, 8, 11, 14, 16, 19
Null count: 5
Validity  : 0010100
Values    : 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 7, 7

List<int32_t>:
Length    : 7
Offsets   : 0, 0, 0, 2, 2, 5, 5, 5
Null count: 5
Validity  : 0010100
Values    : 3, 3, 5, 5, 5

At first glance, these columns do not seem equivalent. However, the only two non-null rows (2 and 4) are identical:
[[3, 3], [5, 5, 5]]

The comparison code was expecting row positions to always be the same inside of child rows, but that does not have to be the case. For example, in the first column, the child row indices that we care about are [6, 7, 11, 12, 13], whereas in the second column they are [0, 1, 2, 3, 4].
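That index remapping can be sketched outside of libcudf like this (a hypothetical host-side helper, not the actual cudf implementation; plain `std::vector` stands in for device columns):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: given a list column's offsets and per-row validity,
// expand only the non-null rows into the child-row indices that an
// equivalency check actually needs to compare.
std::vector<int> child_row_indices(std::vector<int> const& offsets,
                                   std::vector<bool> const& valid)
{
  std::vector<int> out;
  for (std::size_t row = 0; row < valid.size(); ++row) {
    if (!valid[row]) { continue; }  // null rows contribute no child indices
    for (int i = offsets[row]; i < offsets[row + 1]; ++i) { out.push_back(i); }
  }
  return out;
}
```

For the two example columns above (validity 0010100), this yields {6, 7, 11, 12, 13} and {0, 1, 2, 3, 4} respectively, and gathering the child values through those index lists produces {3, 3, 5, 5, 5} from both columns.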

The fix for this is to fundamentally change how the comparison code works so that instead of simply iterating from 0 to size for each column, we instead provide an explicit list of column indices that should be compared. The various compare functors now take additional lhs_row_indices and rhs_row_indices columns to reflect this.

For flat hierarchies, this input is always just [0, 1, 2, 3... size]. However, every time we encounter a list column in the hierarchy, the rows that need to be considered for both columns can be completely and arbitrarily changed.

I'm leaving this as a draft because there is a discussion point in the column property comparisons that I think is worth having. Like the data values, the column property comparison previously just compared lhs.size() to rhs.size(). But as the leaf columns in the above case show, those sizes can be totally different; when we are only checking for equivalency, what matters is that the number of rows we are actually going to compare is the same. Similarly, the null counts cannot be compared directly, only the null counts of the rows we are explicitly comparing. As far as I can tell, this is the only way to do it, but I'm not sure it's 100% in the spirit of what the column properties are, since we are really checking the properties of a subset of the overall column.
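The subset-property idea can be illustrated with a small sketch (a hypothetical helper, not the cudf code): under equivalency checking, null count is computed over the rows selected for comparison, not over the whole column.

```cpp
#include <vector>

// Hypothetical sketch: count nulls among only the rows selected for
// comparison, rather than using the column-wide null count.
int null_count_of(std::vector<bool> const& valid, std::vector<int> const& rows)
{
  int nulls = 0;
  for (int r : rows) {
    if (!valid[r]) { ++nulls; }  // only selected rows contribute
  }
  return nulls;
}
```

In the example, the two leaf columns hold 19 and 5 child rows overall, but the compared subsets both contain 5 rows and 0 nulls, so the property check can still pass.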

I left a couple of comments in the property comparator code labelled
// DISCUSSION

Note: I haven't added tests yet.

…itrarily different data (offsets, values) in null rows.
@nvdbaranec nvdbaranec added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Jul 6, 2021
@nvdbaranec nvdbaranec requested a review from a team as a code owner July 6, 2021 23:06
@nvdbaranec nvdbaranec marked this pull request as draft July 6, 2021 23:08

gerashegalov commented Jul 7, 2021

Thanks @nvdbaranec. I cherry-picked your commits into my PR #8588 to test.

The previously failing ScalarListBothInvalid test cases now succeed.

The ScalarStructBothValid test cases still fail:

C++ exception with description "cuDF failure at: ../include/cudf/detail/scatter.cuh:271: Scatter source and target are not of the same type." thrown in the test body

UPDATE: it turned out to be a bug in the test in #8588

@karthikeyann (Contributor)

rerun tests


codecov bot commented Jul 8, 2021

Codecov Report

No coverage uploaded for the pull request base (branch-21.08@2a8d202). The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.08    #8666   +/-   ##
===============================================
  Coverage                ?   10.53%           
===============================================
  Files                   ?      116           
  Lines                   ?    18916           
  Branches                ?        0           
===============================================
  Hits                    ?     1993           
  Misses                  ?    16923           
  Partials                ?        0           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 2a8d202...dbf338e.

…o have the various expect_columns_* functions throw instead of print upon failure,

allowing for use of EXPECT_THROW(...).  Add tests.  Couple of small fixes.
@nvdbaranec nvdbaranec marked this pull request as ready for review July 8, 2021 19:52
@jrhemstad (Contributor)

Does the Arrow spec allow for null lists to have non-zero lengths?

@nvdbaranec (Author)

Does the Arrow spec allow for null lists to have non-zero lengths?

I believe so.

https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout
Similar to the layout of variable-size binary, a null value may correspond to a non-empty segment in the child array. When this is true, the content of the corresponding segment can be arbitrary.

Review threads (outdated): cpp/include/cudf_test/column_utilities.hpp, cpp/tests/utilities/column_utilities.cu

// if the row is valid, check that the length of the list is the same. do this
// for both the equivalency and exact equality checks.
if (lhs_valids[lhs_index] && ((lhs_offsets[lhs_index + 1] - lhs_offsets[lhs_index]) !=
@ttnghia (Contributor) commented Jul 13, 2021

I wonder if we can do something simpler, like gathering the full contents of the original columns (using a gather map that is simply the sequence [0, 1, 2, 3, ..., size - 1]). Then we just use the existing code to compare the resulting columns (which have zero-size segments for the nulls after gathering).

Of course, we only gather if the column has nulls. I think the total overhead here is very small, and since we are running this in tests, not in production, we don't have to worry much about the performance penalty.
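The suggestion can be modeled with a small host-side sketch (toy types, not the cudf gather implementation): gathering a list column through the identity map rebuilds the offsets so that every null row becomes a zero-length segment, after which the two example columns from the PR description compare equal directly.

```cpp
#include <cstddef>
#include <vector>

// Toy host-side model of a list column (hypothetical, not a cudf type).
struct list_column {
  std::vector<int>  offsets;  // size() == row count + 1
  std::vector<bool> valid;    // per-row validity
  std::vector<int>  child;    // flattened child values
};

// Model of an identity-map gather: valid rows keep their values, null rows
// collapse to zero-length segments, normalizing the layout.
list_column normalize(list_column const& in)
{
  list_column out;
  out.offsets.push_back(0);
  out.valid = in.valid;
  for (std::size_t row = 0; row < in.valid.size(); ++row) {
    if (in.valid[row]) {
      for (int i = in.offsets[row]; i < in.offsets[row + 1]; ++i) {
        out.child.push_back(in.child[i]);
      }
    }
    out.offsets.push_back(static_cast<int>(out.child.size()));
  }
  return out;
}
```

Running both example columns through this normalization produces identical offsets {0, 0, 0, 2, 2, 5, 5, 5} and child values {3, 3, 5, 5, 5}, so the pre-existing element-by-element comparison would then suffice.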

@nvdbaranec (Author)

This is a bit of a large change, and one thing I don't particularly like about it is that it uses a function (gather()) which would in turn be using this function to verify itself.

namespace test {

namespace {

// expand all non-null rows in a list column into a column of child row indices.
std::unique_ptr<column> generate_child_row_indices(lists_column_view const& c,
column_view const& row_indices)

Do we need an assertion for row_indices not having nulls? Overkill for cudf::test code?


P.S. This was an informative read.

@nvdbaranec (Author)

I think it's overkill here. row_indices is never something handed to us by the external user. It's purely internal.

auto const rhs_index = rhs_indices[index];

// check for validity match
if (lhs_valids[lhs_index] != rhs_valids[rhs_index]) { return true; }

I might have lost the plot here: Should we not consider the offset of lhs and rhs at this point, because they might be sliced columns?

@nvdbaranec (Author)

lhs_valids and rhs_valids are iterators that take care of the offset inline.

@mythrocks (Contributor) left a comment

Took a while to review. Generally +1. A couple of minor nitpicks. One spot where I wonder if we've handled sliced columns.

@firestarman (Contributor)

rerun tests

… void. Change the print_all_differences parameter to be an

enum with 3 values: FIRST_ERROR, ALL_ERRORS and QUIET.
@nvdbaranec (Author)

I've changed things so that the comparison functions now return a true/false value and the print_all_differences parameter is now an enum value. This makes it possible to have tests that expect a comparison to fail without causing any debug spew:

EXPECT_EQ(cudf::test::expect_columns_equal(s_col0, s_col1, cudf::test::debug_output_level::QUIET), false);

@nvdbaranec (Author)

rerun tests

Review threads: cpp/include/cudf_test/column_utilities.hpp
@nvdbaranec nvdbaranec requested a review from ttnghia July 19, 2021 17:04
@nvdbaranec (Author)

@gpucibot merge

@rapids-bot rapids-bot bot merged commit f1fa694 into rapidsai:branch-21.08 Jul 20, 2021
rapids-bot bot pushed a commit that referenced this pull request Jul 20, 2021
Uses scalar-vector-based scatter API to provide support for copy_if_else involving scalar columns. 

Other changes:
- removes some dead code
- refactoring into overloaded functions

Closes #8361, depends on #8630, #8666

Authors:
  - Gera Shegalov (https://github.com/gerashegalov)

Approvers:
  - https://github.com/nvdbaranec
  - MithunR (https://github.com/mythrocks)

URL: #8588
rapids-bot bot pushed a commit to rapidsai/cuspatial that referenced this pull request Jul 21, 2021
rapidsai/cudf#8666 modified `cudf::test` APIs to accept a verbosity enum as a parameter to control output, which is backwards incompatible with the previous boolean parameter.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Paul Taylor (https://github.com/trxcllnt)

URL: #433