GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded #34408

Merged (11 commits) on Apr 6, 2023

Contributor

@felipecrv felipecrv commented Mar 1, 2023

Bonus: add ArrayData::IsValid() to make it consistent with Array and ArraySpan.

Rationale for this change

This is the proposed fix to #34361 plus the addition of more APIs dealing with validity/nullity.

What changes are included in this PR?

This PR changes the behavior of IsNull and IsValid in Array, ArrayData, and ArraySpan.

It preserves the behavior of MayHaveNulls and GetNullCount, and introduces new member functions to Array, ArrayData, and ArraySpan:

  • bool HasValidityBitmap() const
  • bool MayHaveLogicalNulls() const
  • int64_t ComputeLogicalNullCount() const

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes. See above.

Breakage with these changes can only happen if users rely on IsNull(i) always returning true for union types, but we have users reporting that the current behavior is broken (#34315). This is why the behavior of IsNull and IsValid is changing.

This PR contains a "Critical Fix".

Member

@jorisvandenbossche jorisvandenbossche left a comment

The skeleton looks good to me

cpp/src/arrow/array/data.h (resolved)
@github-actions github-actions bot added awaiting changes and removed awaiting review labels Mar 2, 2023
@github-actions github-actions bot added awaiting change review and removed awaiting changes labels Mar 2, 2023
@felipecrv
Contributor Author

@jorisvandenbossche @westonpace @zeroshade this now has the implementation of all the new functions, changes in behavior, and tests. Ready for review.

@felipecrv felipecrv changed the title GH-34361: [C++] work-in-progress: Add skeleton of the new APIs for handling null checks correctly for all types GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded Mar 3, 2023
Member

@westonpace westonpace left a comment

A few minor nits but this looks pretty well thought out

cpp/src/arrow/array/array_test.cc (outdated, resolved)
cpp/src/arrow/array/data.h (outdated, resolved)
cpp/src/arrow/util/union_util.cc (resolved)
@github-actions github-actions bot added awaiting changes, awaiting change review and removed awaiting change review, awaiting changes labels Mar 3, 2023
@github-actions github-actions bot added awaiting changes and removed awaiting change review labels Mar 7, 2023
@westonpace
Member

Is this ready for another review? Please check "Re-request review" if so. Although I notice it has the awaiting change review label. I wonder if that is automatic anytime a commit is pushed.

@felipecrv felipecrv requested review from westonpace and jorisvandenbossche and removed request for westonpace and jorisvandenbossche March 7, 2023 23:15
@felipecrv
Contributor Author

@westonpace I'm a bit lost on how these labels change and what they actually mean, but the PR is ready for review again.

@westonpace
Member

@ursabot please benchmark

@ursabot

ursabot commented Mar 14, 2023

Benchmark runs are scheduled for baseline = fc95019 and contender = ceacb2e. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️2.64% ⬆️0.64%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️2.55% ⬆️0.64%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] ceacb2ef ec2-t3-xlarge-us-east-2
[Finished] ceacb2ef test-mac-arm
[Finished] ceacb2ef ursa-i9-9960x
[Finished] ceacb2ef ursa-thinkcentre-m75q
[Finished] fc950192 ec2-t3-xlarge-us-east-2
[Failed] fc950192 test-mac-arm
[Finished] fc950192 ursa-i9-9960x
[Finished] fc950192 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Member

@westonpace westonpace left a comment

A few more questions. I also (hopefully) triggered conbench just to double check impact (though I don't know if we have any null-specific benchmarks).

cpp/src/arrow/array/array_test.cc (outdated, resolved)
ASSERT_TRUE(arr_default_null_count->data()->MayHaveNulls());
ASSERT_TRUE(arr_default_null_count->data()->MayHaveLogicalNulls());

RunEndEncodedBuilder ree_builder(pool_, std::make_shared<Int32Builder>(pool_),
Member

Do we have existing tests like this for sparse and dense union? (given we now have distinct paths for those)

Contributor Author

I mentioned you in PR comments pointing to code that tests IsNull/Valid for unions.

Comment on lines +39 to +47
// ----------------------------------------------------------------------
// Null handling for types without a validity bitmap

ARROW_EXPORT bool IsNullSparseUnion(const ArrayData& data, int64_t i);
ARROW_EXPORT bool IsNullDenseUnion(const ArrayData& data, int64_t i);
ARROW_EXPORT bool IsNullRunEndEncoded(const ArrayData& data, int64_t i);

ARROW_EXPORT bool UnionMayHaveLogicalNulls(const ArrayData& data);
ARROW_EXPORT bool RunEndEncodedMayHaveLogicalNulls(const ArrayData& data);
Member

What prevents these functions from being .cc only function in an anonymous namespace? It seems like the only need is because we have functions later in data.h that could be in a .cc file.

Contributor Author

@felipecrv felipecrv Mar 15, 2023

They are called from IsNull and IsValid, which are inline. Having the switch on the type ID inlined allows the compiler to inline the highly predictable branches into the loop bodies from which these functions are called.
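
A rough sketch of the shape being described: an inline validity check whose fast path is a bitmap test and whose fallback switches on the type ID before calling a per-type helper. Everything here is a toy under stated assumptions. `ToyTypeId`, `ToyArrayData`, `ToyIsValid`, and the `ToyIsNull*` helpers are hypothetical names, and the `logical_nulls` array is a stand-in for the real work of walking union children or run ends.

```cpp
#include <cassert>
#include <cstdint>

// Toy type tags; hypothetical names, not Arrow's type enum.
enum class ToyTypeId { kPrimitive, kSparseUnion, kDenseUnion, kRunEndEncoded };

// Toy array descriptor, NOT arrow::ArrayData.
struct ToyArrayData {
  ToyTypeId type_id;
  const uint8_t* validity;    // nullptr when the array has no validity bitmap
  const bool* logical_nulls;  // stand-in for inspecting children / run ends
  int64_t length;
  int64_t offset;
  int64_t null_count;
};

// Slow paths. In a real library these would be declared in the header and
// defined in a .cc file; here they only read the precomputed stand-in array.
inline bool ToyIsNullSparseUnion(const ToyArrayData& d, int64_t i) {
  return d.logical_nulls[d.offset + i];
}
inline bool ToyIsNullDenseUnion(const ToyArrayData& d, int64_t i) {
  return d.logical_nulls[d.offset + i];
}
inline bool ToyIsNullRunEndEncoded(const ToyArrayData& d, int64_t i) {
  return d.logical_nulls[d.offset + i];
}

// Inline entry point: callers looping over i get the bitmap test and the
// type switch inlined into the loop body.
inline bool ToyIsValid(const ToyArrayData& data, int64_t i) {
  assert(i >= 0 && i < data.length);
  // Fast path: the array has a bitmap, so test one bit (LSB-first order).
  if (data.validity != nullptr) {
    const int64_t bit = data.offset + i;
    return ((data.validity[bit / 8] >> (bit % 8)) & 1) != 0;
  }
  // Fallback: no bitmap, so nullness depends on the type.
  switch (data.type_id) {
    case ToyTypeId::kSparseUnion:
      return !ToyIsNullSparseUnion(data, i);
    case ToyTypeId::kDenseUnion:
      return !ToyIsNullDenseUnion(data, i);
    case ToyTypeId::kRunEndEncoded:
      return !ToyIsNullRunEndEncoded(data, i);
    default:
      // No bitmap and no special layout: either all null or all valid.
      return data.null_count != data.length;
  }
}

inline bool ToyIsNull(const ToyArrayData& data, int64_t i) {
  return !ToyIsValid(data, i);
}
```

The type ID is the same for every element of one array, so inside a loop the switch always takes the same branch, which is what makes it cheap to predict.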

Contributor Author

I could introduce an IsValidFallback function that handles the case when the validity buffer is NULLPTR, then all these could become .cc only.

Member

I think it's ok how it is. I just don't ever have a good sense for when it's justified to leave something in the header file for performance reasons. Typically I think it would be nice for there to be some benchmark to point to. Otherwise "it seems like this is important for performance" becomes too subjective a criterion to apply.

Contributor Author

The existing benchmarks indicated some regression (maybe due to union arrays being part of IsNull benchmarks) so I will investigate more carefully.

@felipecrv felipecrv marked this pull request as ready for review April 6, 2023 14:27
@zeroshade zeroshade merged commit fde31ed into apache:main Apr 6, 2023
@github-actions github-actions bot added awaiting merge and removed awaiting change review labels Apr 6, 2023
@felipecrv felipecrv deleted the null_fix branch April 6, 2023 15:37
@ursabot

ursabot commented Apr 7, 2023

Benchmark runs are scheduled for baseline = 84e5430 and contender = fde31ed. fde31ed is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.56% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.26% ⬆️0.0%] ursa-i9-9960x
[Failed ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] fde31ed1 ec2-t3-xlarge-us-east-2
[Failed] fde31ed1 test-mac-arm
[Finished] fde31ed1 ursa-i9-9960x
[Failed] fde31ed1 ursa-thinkcentre-m75q
[Finished] 84e54308 ec2-t3-xlarge-us-east-2
[Failed] 84e54308 test-mac-arm
[Finished] 84e54308 ursa-i9-9960x
[Failed] 84e54308 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

felipecrv added a commit to felipecrv/arrow that referenced this pull request Apr 11, 2023
…rvers

Now that all the benchmark builds are running again. :fingers-crossed:

This reverts commit fde31ed.
Comment on lines +36 to +38
const int8_t child_id = sparse_union_type->child_ids()[types[span.offset + i]];

null_count += span.child_data[child_id].IsNull(i);
Member

Did you test this with a sliced array? (don't directly see it in the tests)
Because in #35036, I based my implementation on this function, and I am finding that I need to remove/add the offset like so:

Suggested change
- const int8_t child_id = sparse_union_type->child_ids()[types[span.offset + i]];
- null_count += span.child_data[child_id].IsNull(i);
+ const int8_t child_id = sparse_union_type->child_ids()[types[i]];
+ null_count += span.child_data[child_id].IsNull(span.offset + i);

So types, which is the result of Span::GetValues(), already takes the span's offset into account, but IsNull() is called on the child_data, and accessing child_data gives the original array data without the offset applied.

Contributor Author

There is a high chance the existing tests didn't cover sliced union arrays and an even higher chance I messed up here because I had just read the Arrow spec on Unions to fix this.

Member

It's not so much about the Arrow spec on Unions, but rather about the small details of our APIs (Span::GetValues returning sliced values or not, Span::child_data returning sliced data or not?)

Contributor Author

GetValues applies slices. Direct access doesn't. I think that makes sense, but it's hard to keep in mind at all times.

Contributor Author

child_ids() is unclear to me because it's a cache if I recall correctly.

@github-actions github-actions bot added awaiting changes and removed awaiting merge labels Apr 12, 2023
felipecrv added a commit to felipecrv/arrow that referenced this pull request May 2, 2023
…rvers

Now that all the benchmark builds are running again. :fingers-crossed:

This reverts commit fde31ed.
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this pull request May 11, 2023
…hout bitmaps like Unions and Run-End Encoded (apache#34408)

Bonus: add `ArrayData::IsValid()` to make it consistent with `Array` and `ArraySpan`.

### Rationale for this change

This is the proposed fix to apache#34361 plus the addition of more APIs dealing with validity/nullity.

### What changes are included in this PR?

This PR changes the behavior of `IsNull` and `IsValid` in `Array`, `ArrayData`, and `ArraySpan`.

It preserves the behavior of `MayHaveNulls` and `GetNullCount`, and introduces new member functions to `Array`, `ArrayData`, and `ArraySpan`:

 - `bool HasValidityBitmap() const`
 - `bool MayHaveLogicalNulls() const`
 - `int64_t ComputeLogicalNullCount() const`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes. See above.

Breakage with these changes can only happen if users rely on `IsNull(i)` always returning `true` for union types, but we have users reporting that the current behavior is broken (apache#34315). This is why the behavior of `IsNull` and `IsValid` is changing.

**This PR contains a "Critical Fix".**
* Closes: apache#34361

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
rtpsw pushed a commit to rtpsw/arrow that referenced this pull request May 16, 2023
Successfully merging this pull request may close these issues.

[C++] Handling "logical" nulls: add GetLogicalNullCount / update IsNull() to check for logical null
6 participants