-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded #34408
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The skeleton looks good to me
@jorisvandenbossche @westonpace @zeroshade this now has the implementation of all the new functions, changes in behavior, and tests. Ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor nits but this looks pretty well thought out
@westonpace I'm a bit lost on how these labels change and what they actully mean, but the PR is ready for review again. |
@ursabot please benchmark |
Benchmark runs are scheduled for baseline = fc95019 and contender = ceacb2e. Results will be available as each benchmark for each run completes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more questions. I also (hopefully) triggered conbench just to double check impact (though I don't know if we have any null-specific benchmarks).
ASSERT_TRUE(arr_default_null_count->data()->MayHaveNulls()); | ||
ASSERT_TRUE(arr_default_null_count->data()->MayHaveLogicalNulls()); | ||
|
||
RunEndEncodedBuilder ree_builder(pool_, std::make_shared<Int32Builder>(pool_), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have existing tests like this for sparse and dense union? (given we now have distinct paths for those)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mentioned you in PR comments pointing to code that tests IsNull/Valid
for unions.
// ---------------------------------------------------------------------- | ||
// Null handling for types without a validity bitmap | ||
|
||
ARROW_EXPORT bool IsNullSparseUnion(const ArrayData& data, int64_t i); | ||
ARROW_EXPORT bool IsNullDenseUnion(const ArrayData& data, int64_t i); | ||
ARROW_EXPORT bool IsNullRunEndEncoded(const ArrayData& data, int64_t i); | ||
|
||
ARROW_EXPORT bool UnionMayHaveLogicalNulls(const ArrayData& data); | ||
ARROW_EXPORT bool RunEndEncodedMayHaveLogicalNulls(const ArrayData& data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What prevents these functions from being .cc
only function in an anonymous namespace? It seems like the only need is because we have functions later in data.h
that could be in a .cc
file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are called from IsNull
and IsValid
which are inline and having the switch on the type ID inlined allows the compiler to inline the highly predictable branches into the loop bodies from which these functions are called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could introduce an IsValidFallback
function that handles the case when the validity buffer is NULLPTR
, then all these could become .cc
only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's ok how it is. I just don't ever have a good sense for when it's justified for leaving something in the header file for performance reasons. Typically I think it would be nice for there to be some benchmark to point to. Otherwise "it seems like this is important for performance" becomes too subjective of a criteria to apply.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing benchmarks indicated some regression (maybe due to union arrays being part of IsNull
benchmarks) so I will investigate more carefully.
- Add REE to TestNullCount tests - Test new null handling functions in TestMakeArrayOfNull
Benchmark runs are scheduled for baseline = 84e5430 and contender = fde31ed. fde31ed is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
…rvers Now that all the benchmark builds are running again. :fingers-crossed: This reverts commit fde31ed.
const int8_t child_id = sparse_union_type->child_ids()[types[span.offset + i]]; | ||
|
||
null_count += span.child_data[child_id].IsNull(i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you test this with a sliced array? (don't directly see it in the tests)
Because in #35036, I based my implementation on this function, and I am finding that I need to remove/add the offset like so:
const int8_t child_id = sparse_union_type->child_ids()[types[span.offset + i]]; | |
null_count += span.child_data[child_id].IsNull(i); | |
const int8_t child_id = sparse_union_type->child_ids()[types[i]]; | |
null_count += span.child_data[child_id].IsNull(span.offset + i); |
So types
, which is the result of Span.GetValues()
already takes into account the span's offset, but IsNull()
is called on the child_data, and accessing child_data gives the original array data without offset already taken into account.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a high chance the existing tests didn't cover sliced union arrays and an even higher chance I messed up here because I had just read the Arrow spec on Unions to fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not so much about the Arrow spec on Unions, but rather about the small details of our APIs (Span::GetValues returning sliced values or not, Span::child_data returning sliced data or not?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GetValues
applies slices. Direct access doesn't. I think that makes sense, but it's hard to keep in mind at all times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
child_ids()
is unclear to me because it's a cache if I recall correctly.
…rvers Now that all the benchmark builds are running again. :fingers-crossed: This reverts commit fde31ed.
…hout bitmaps like Unions and Run-End Encoded (apache#34408) Bonus: add `ArrayData::IsValid()` to make it consistent with `Array` and `ArraySpan`. ### Rationale for this change This is the proposed fix to apache#34361 plus the addition of more APIs dealing with validity/nullity. ### What changes are included in this PR? This PR changes the behavior of `IsNull` and `IsValid` in `Array`, `ArrayData`, and `ArraySpan`. It preserves the behavior of `MayHaveNulls`, `GetNullCount` and introduces new member functions to `Array`, `ArrayData`, and `ArraySpan`: - `bool HasValidityBitmap() const` - `bool MayHaveLogicalNulls() const` - `int64_t ComputeLogicalNullCount() const` ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. See above. Breakage with these changes can only happen if users rely on `IsNull(i)` always returning `true` for union types, but we have users reporting that the current behavior or broken apache#34315. This is why the behavior of `IsNull` and `IsValid` is changing., **This PR contains a "Critical Fix".** * Closes: apache#34361 Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com> Signed-off-by: Matt Topol <zotthewizard@gmail.com>
…hout bitmaps like Unions and Run-End Encoded (apache#34408) Bonus: add `ArrayData::IsValid()` to make it consistent with `Array` and `ArraySpan`. ### Rationale for this change This is the proposed fix to apache#34361 plus the addition of more APIs dealing with validity/nullity. ### What changes are included in this PR? This PR changes the behavior of `IsNull` and `IsValid` in `Array`, `ArrayData`, and `ArraySpan`. It preserves the behavior of `MayHaveNulls`, `GetNullCount` and introduces new member functions to `Array`, `ArrayData`, and `ArraySpan`: - `bool HasValidityBitmap() const` - `bool MayHaveLogicalNulls() const` - `int64_t ComputeLogicalNullCount() const` ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. See above. Breakage with these changes can only happen if users rely on `IsNull(i)` always returning `true` for union types, but we have users reporting that the current behavior or broken apache#34315. This is why the behavior of `IsNull` and `IsValid` is changing., **This PR contains a "Critical Fix".** * Closes: apache#34361 Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com> Signed-off-by: Matt Topol <zotthewizard@gmail.com>
…hout bitmaps like Unions and Run-End Encoded (apache#34408) Bonus: add `ArrayData::IsValid()` to make it consistent with `Array` and `ArraySpan`. ### Rationale for this change This is the proposed fix to apache#34361 plus the addition of more APIs dealing with validity/nullity. ### What changes are included in this PR? This PR changes the behavior of `IsNull` and `IsValid` in `Array`, `ArrayData`, and `ArraySpan`. It preserves the behavior of `MayHaveNulls`, `GetNullCount` and introduces new member functions to `Array`, `ArrayData`, and `ArraySpan`: - `bool HasValidityBitmap() const` - `bool MayHaveLogicalNulls() const` - `int64_t ComputeLogicalNullCount() const` ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. See above. Breakage with these changes can only happen if users rely on `IsNull(i)` always returning `true` for union types, but we have users reporting that the current behavior or broken apache#34315. This is why the behavior of `IsNull` and `IsValid` is changing., **This PR contains a "Critical Fix".** * Closes: apache#34361 Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com> Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Bonus: add
ArrayData::IsValid()
to make it consistent withArray
andArraySpan
.Rationale for this change
This is the proposed fix to #34361 plus the addition of more APIs dealing with validity/nullity.
What changes are included in this PR?
This PR changes the behavior of
IsNull
andIsValid
inArray
,ArrayData
, andArraySpan
.It preserves the behavior of
MayHaveNulls
,GetNullCount
and introduces new member functions toArray
,ArrayData
, andArraySpan
:bool HasValidityBitmap() const
bool MayHaveLogicalNulls() const
int64_t ComputeLogicalNullCount() const
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes. See above.
Breakage with these changes can only happen if users rely on
IsNull(i)
always returningtrue
for union types, but we have users reporting that the current behavior or broken #34315. This is why the behavior ofIsNull
andIsValid
is changing.,This PR contains a "Critical Fix".