Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check the length of null_bit_buffer in ArrayData::try_new() #1714

Merged
merged 1 commit into from
May 21, 2022

Conversation

HaoYang670
Copy link
Contributor

Signed-off-by: remzi 13716567376yh@gmail.com

Which issue does this PR close?

Closes #1707.

Rationale for this change

The ArrayData::try_new will panic if null_bit_buffer is too short.

What changes are included in this PR?

Check the length of null_bit_buffer before we call new_unchecked in ArrayData::try_new()

Are there any user-facing changes?

No.

Signed-off-by: remzi <13716567376yh@gmail.com>
@github-actions github-actions bot added the arrow Changes to the arrow crate label May 19, 2022
@HaoYang670
Copy link
Contributor Author

HaoYang670 commented May 19, 2022

I am not sure if this can be removed: https://github.com/apache/arrow-rs/blob/master/arrow/src/array/data.rs#L683-L693
as validate() is a public function.

But if we don't, we will check the length of null_bit_buffer twice.
Have no good idea on it.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you 👍

@HaoYang670
Copy link
Contributor Author

But if we don't, we will check the length of null_bit_buffer twice.

Maybe in the long term, we should try to remove the null_count field from ArrayData. I don't find why we need this field except making fn null_count() as a const function (Who will call null_count() multiple times on same array?).

@tustvold
Copy link
Contributor

we should try to remove the null_count field from ArrayData

Often faster kernel implementations are possible if one can ignore nulls, as such null_count helps to inform this selection. It is also fairly common for a given array to be fed to more than one kernel, e.g. cmp then filter, and as such caching the null count on ArrayData avoids computing it multiple times

@HaoYang670
Copy link
Contributor Author

Thank you for your explanation @tustvold. I will do some experiments to test how much performance null_count could contribute.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks @HaoYang670

@viirya viirya merged commit 8855d22 into apache:master May 21, 2022
@HaoYang670 HaoYang670 deleted the try_new_bug branch May 21, 2022 23:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ArrayData::try_new cannot always return expected error.
3 participants