
Handle arbitrarily different data in null list column rows when checking for equivalency. #8666

Merged: 9 commits into branch-21.08, Jul 20, 2021

Conversation


@nvdbaranec nvdbaranec commented Jul 6, 2021

The column equivalency checking code was not handling a particular corner case properly. Fundamentally, there is no requirement that the offsets or child data for null rows in two list columns be the same. An example:

List<int32_t>:
Length    : 7
Offsets   : 0, 3, 6, 8, 11, 14, 16, 19
Null count: 5
Validity  : 0010100
Values    : 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 7, 7

List<int32_t>:
Length    : 7
Offsets   : 0, 0, 0, 2, 2, 5, 5, 5
Null count: 5
Validity  : 0010100
Values    : 3, 3, 5, 5, 5

At first glance, these columns do not seem equivalent. However, the only two non-null rows (2 and 4) are identical:
[[3, 3], [5, 5, 5]]

The comparison code was expecting row positions to always be the same inside of child rows, but that does not have to be the case. For example, in the first column, the child row indices that we care about are [6, 7, 11, 12, 13], whereas in the second column they are [0, 1, 2, 3, 4].
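That index remapping can be sketched outside of libcudf like this (a hypothetical host-side helper, not the actual cudf implementation; plain `std::vector` stands in for device columns):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: given a list column's offsets and per-row validity,
// expand only the non-null rows into the child-row indices that an
// equivalency check actually needs to compare.
std::vector<int> child_row_indices(std::vector<int> const& offsets,
                                   std::vector<bool> const& valid)
{
  std::vector<int> out;
  for (std::size_t row = 0; row < valid.size(); ++row) {
    if (!valid[row]) { continue; }  // null rows contribute no child indices
    for (int i = offsets[row]; i < offsets[row + 1]; ++i) { out.push_back(i); }
  }
  return out;
}
```

For the two example columns above (validity 0010100), this yields {6, 7, 11, 12, 13} and {0, 1, 2, 3, 4} respectively, and gathering the child values through those index lists produces {3, 3, 5, 5, 5} from both columns.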

The fix for this is to fundamentally change how the comparison code works so that instead of simply iterating from 0 to size for each column, we instead provide an explicit list of column indices that should be compared. The various compare functors now take additional lhs_row_indices and rhs_row_indices columns to reflect this.

For flat hierarchies, this input is always just [0, 1, 2, 3... size]. However, every time we encounter a list column in the hierarchy, the rows that need to be considered for both columns can be completely and arbitrarily changed.

I'm leaving this as a draft because there is a discussion point in the column property comparisons that I think is worth having. Like the data values, the column property comparison previously just compared lhs.size() to rhs.size(). But as the leaf columns in the above case show, those sizes can be totally different; when we are only checking for equivalency, what matters is that the number of rows we are actually going to compare is the same. Similarly, the null counts cannot be compared directly, only the null counts of the rows we are explicitly comparing. As far as I can tell, this is the only way to do it, but I'm not sure it's 100% in the spirit of what the column properties are, since we are really checking the properties of a subset of the overall column.
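The subset-property idea can be illustrated with a small sketch (a hypothetical helper, not the cudf code): under equivalency checking, null count is computed over the rows selected for comparison, not over the whole column.

```cpp
#include <vector>

// Hypothetical sketch: count nulls among only the rows selected for
// comparison, rather than using the column-wide null count.
int null_count_of(std::vector<bool> const& valid, std::vector<int> const& rows)
{
  int nulls = 0;
  for (int r : rows) {
    if (!valid[r]) { ++nulls; }  // only selected rows contribute
  }
  return nulls;
}
```

In the example, the two leaf columns hold 19 and 5 child rows overall, but the compared subsets both contain 5 rows and 0 nulls, so the property check can still pass.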

I left a couple of comments in the property comparator code labelled
// DISCUSSION

Note: I haven't added tests yet.

…itrarily different data (offsets, values) in null rows.
@nvdbaranec nvdbaranec added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Jul 6, 2021
@nvdbaranec nvdbaranec requested a review from a team as a code owner July 6, 2021 23:06
@nvdbaranec nvdbaranec marked this pull request as draft July 6, 2021 23:08

gerashegalov commented Jul 7, 2021

Thanks @nvdbaranec. I cherry-picked your commits into my PR #8588 to test.

The previously failing ScalarListBothInvalid test cases now succeed.

The ScalarStructBothValid test cases still fail:

C++ exception with description "cuDF failure at: ../include/cudf/detail/scatter.cuh:271: Scatter source and target are not of the same type." thrown in the test body

UPDATE: it turned out to be a bug in the test in #8588

@karthikeyann (Contributor)

rerun tests


codecov bot commented Jul 8, 2021

Codecov Report

No coverage uploaded for the pull request base (branch-21.08@2a8d202). The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.08    #8666   +/-   ##
===============================================
  Coverage                ?   10.53%           
===============================================
  Files                   ?      116           
  Lines                   ?    18916           
  Branches                ?        0           
===============================================
  Hits                    ?     1993           
  Misses                  ?    16923           
  Partials                ?        0           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 2a8d202...dbf338e.

…o have the various expect_columns_* functions throw instead of print upon failure,

allowing for use of EXPECT_THROW(...).  Add tests.  Couple of small fixes.
@nvdbaranec nvdbaranec marked this pull request as ready for review July 8, 2021 19:52
@jrhemstad (Contributor)

Does the Arrow spec allow for null lists to have non-zero lengths?

@nvdbaranec (Author)

Does the Arrow spec allow for null lists to have non-zero lengths?

I believe so.

https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout
Similar to the layout of variable-size binary, a null value may correspond to a non-empty segment in the child array. When this is true, the content of the corresponding segment can be arbitrary.

Review threads (outdated): cpp/include/cudf_test/column_utilities.hpp, cpp/tests/utilities/column_utilities.cu

// if the row is valid, check that the length of the list is the same. do this
// for both the equivalency and exact equality checks.
if (lhs_valids[lhs_index] && ((lhs_offsets[lhs_index + 1] - lhs_offsets[lhs_index]) !=
@ttnghia (Contributor) commented Jul 13, 2021

I wonder if we can do something simpler, like gathering the full contents of the original columns (using a gather map that is simply the sequence [0, 1, 2, 3, ..., size - 1]). Then we just use the existing code to compare the resulting columns (which have zero-size segments for the nulls after gathering).

Of course, we only gather if the column has nulls. I think the total overhead here is very small, and since we are running this in tests, not in production, we don't have to worry much about the performance penalty.
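The suggestion can be modeled with a small host-side sketch (toy types, not the cudf gather implementation): gathering a list column through the identity map rebuilds the offsets so that every null row becomes a zero-length segment, after which the two example columns from the PR description compare equal directly.

```cpp
#include <cstddef>
#include <vector>

// Toy host-side model of a list column (hypothetical, not a cudf type).
struct list_column {
  std::vector<int>  offsets;  // size() == row count + 1
  std::vector<bool> valid;    // per-row validity
  std::vector<int>  child;    // flattened child values
};

// Model of an identity-map gather: valid rows keep their values, null rows
// collapse to zero-length segments, normalizing the layout.
list_column normalize(list_column const& in)
{
  list_column out;
  out.offsets.push_back(0);
  out.valid = in.valid;
  for (std::size_t row = 0; row < in.valid.size(); ++row) {
    if (in.valid[row]) {
      for (int i = in.offsets[row]; i < in.offsets[row + 1]; ++i) {
        out.child.push_back(in.child[i]);
      }
    }
    out.offsets.push_back(static_cast<int>(out.child.size()));
  }
  return out;
}
```

Running both example columns through this normalization produces identical offsets {0, 0, 0, 2, 2, 5, 5, 5} and child values {3, 3, 5, 5, 5}, so the pre-existing element-by-element comparison would then suffice.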

@nvdbaranec (Author)

This is a bit of a large change, and one thing I don't particularly like about it is that it uses a function (gather()) which would in turn be using this function to verify itself.

namespace test {

namespace {

// expand all non-null rows in a list column into a column of child row indices.
std::unique_ptr<column> generate_child_row_indices(lists_column_view const& c,
column_view const& row_indices)

Do we need an assertion for row_indices not having nulls? Overkill for cudf::test code?


P.S. This was an informative read.

@nvdbaranec (Author)

I think it's overkill here. row_indices is never something handed to us by the external user. It's purely internal.

auto const rhs_index = rhs_indices[index];

// check for validity match
if (lhs_valids[lhs_index] != rhs_valids[rhs_index]) { return true; }

I might have lost the plot here: Should we not consider the offset of lhs and rhs at this point, because they might be sliced columns?

@nvdbaranec (Author)

lhs_valids and rhs_valids are iterators that take care of the offset inline.

@mythrocks (Contributor) left a comment

Took a while to review. Generally +1. A couple of minor nitpicks. One spot where I wonder if we've handled sliced columns.

@firestarman (Contributor)

rerun tests

… void. Change the print_all_differences parameter to be an

enum with 3 values: FIRST_ERROR, ALL_ERRORS and QUIET.
@nvdbaranec (Author)

I've changed things so that the comparison functions now return a true/false value and the print_all_differences parameter is now an enum value. This makes it possible to have tests that expect a comparison to fail without causing any debug spew:

EXPECT_EQ(cudf::test::expect_columns_equal(s_col0, s_col1, cudf::test::debug_output_level::QUIET), false);

@nvdbaranec (Author)

rerun tests

Review threads: cpp/include/cudf_test/column_utilities.hpp
@nvdbaranec nvdbaranec requested a review from ttnghia July 19, 2021 17:04
@nvdbaranec (Author)

@gpucibot merge

@rapids-bot rapids-bot bot merged commit f1fa694 into rapidsai:branch-21.08 Jul 20, 2021
rapids-bot bot pushed a commit that referenced this pull request Jul 20, 2021
Uses scalar-vector-based scatter API to provide support for copy_if_else involving scalar columns. 

Other changes:
- removes some dead code
- refactoring into overloaded functions

Closes #8361, depends on #8630, #8666

Authors:
  - Gera Shegalov (https://github.com/gerashegalov)

Approvers:
  - https://github.com/nvdbaranec
  - MithunR (https://github.com/mythrocks)

URL: #8588
rapids-bot bot pushed a commit to rapidsai/cuspatial that referenced this pull request Jul 21, 2021
rapidsai/cudf#8666 modified `cudf::test` APIs to accept a verbosity enum as a parameter to control output, which is backwards incompatible with the previous boolean parameter.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Paul Taylor (https://github.com/trxcllnt)

URL: #433