Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add RunEndEncodedArray #3553

Merged
merged 21 commits into from
Jan 25, 2023
Merged

feat: Add RunEndEncodedArray #3553

merged 21 commits into from
Jan 25, 2023

Conversation

askoa
Copy link
Contributor

@askoa askoa commented Jan 18, 2023

Which issue does this PR close?

Part of #3520.

Included changes from unmerged PR #3534

Rationale for this change

See issue description.

What changes are included in this PR?

  • Added new array RunEndEncodedArray.
  • Add validations for the array
  • Added builders for GenericByte and ArrowPrimitiveType

Are there any user-facing changes?

Users will get brand new encoder in Arrow format!
No breaking changes.

@github-actions github-actions bot added arrow Changes to the arrow crate parquet Changes to the parquet crate labels Jan 18, 2023
@askoa askoa marked this pull request as draft January 18, 2023 13:49
@askoa askoa marked this pull request as ready for review January 18, 2023 16:35
Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Firstly, so many style issues on code comment. Please leave a space after //. I notes some above, but there are more.

@askoa
Copy link
Contributor Author

askoa commented Jan 19, 2023

@viirya Can you please approve this PR if you don't have further comments?

@tustvold @alamb Can you please take a look when you have some time?

@viirya
Copy link
Member

viirya commented Jan 19, 2023

@askoa I've not finished reviewing this.

@tustvold
Copy link
Contributor

tustvold commented Jan 19, 2023

I intend to review this tomorrow.

This is also a breaking change as it adds a new DataType variant

Edit: ran out of time today, will try to find time over the weekend

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this, and sorry it has taken so long to review, I've left a few comments and hope to review this again prior to review - whilst painful it is important we get this right 😄

Some broad comments

  • The change to GenericByteBuilder is unsound and should not be merged

  • The handling of null counts in the specification is a little unclear to me, I've asked for clarification on the mailing list - https://lists.apache.org/thread/4x14b0h3fcfwzk68jpoq3n5xvr241qz5

  • I wonder if we should use RunArray / RunBuilder instead of either RunEndEncoded or REE . The latter is not a standard initialism, and the former is a little odd imo, not to mention verbose, given it isn't DictionaryEncodedArray.

Overall I really like where this is headed, nice work 👍

@askoa
Copy link
Contributor Author

askoa commented Jan 22, 2023

  • I wonder if we should use RunArray / RunBuilder instead of either RunEndEncoded or REE . The latter is not a standard initialism, and the former is a little odd imo, not to mention verbose, given it isn't DictionaryEncodedArray.

I'll change it to RunEndsArray. It is more specific as we use Run End Encoding and not Run Length Encoding.

@tustvold
Copy link
Contributor

tustvold commented Jan 22, 2023

To me that makes me think it contains just run ends and not values 😅

Perhaps wait for some others to weigh in before changing anything, thoughts @viirya @alamb ?

Edit: At least imo the fact it stores run ends and not run lengths is immaterial from a user-perspective, what matters is it stores runs of data. It isn't DictionaryEncodedArray or DictionaryOffsetArray or DictionaryOffsetEncodedArray, it is just DictionaryArray. Similarly it isn't ListOffsetArray but ListArray. I personally think RunArray is sufficient, but would definitely welcome other feedback. I'm mainly trying to avoid having types like GenericByteRunEndBuilder which is kind of sad 😆

@askoa
Copy link
Contributor Author

askoa commented Jan 22, 2023

  • I wonder if we should use RunArray / RunBuilder instead of either RunEndEncoded or REE . The latter is not a standard initialism, and the former is a little odd imo, not to mention verbose, given it isn't DictionaryEncodedArray.

I'll change it to RunEndsArray. It is more specific as we use Run End Encoding and not Run Length Encoding.

Just to give some additional context. The name RunEndEncoded is defined as a Type in fbs file. Whereas Dictionary is not defined as a Type.

So when defining and array for the Type, RunEndEncodedArray is consistent with other Types like LargeBinary whose array is LargeBinaryArray. Using RunArray is inconsistent on how we name array for a type.

Does the suggestion include changing the name of data type to Run instead of RunEndEncoded?

https://github.com/apache/arrow/blob/cb0f0178dacd922f355255a16c85820fb4e390c8/format/Schema.fbs#L426-L430

@viirya
Copy link
Member

viirya commented Jan 22, 2023

I have no strong option on the naming. To me I think RunArray(or RunsArray?) sounds good to me. Although as a type name RunEndEncoded, to name it as RunEndEncodedArray also reasonable, but it is a bit verbose actually.

@askoa
Copy link
Contributor Author

askoa commented Jan 23, 2023

(or RunsArray`?)

There is no plural names so far. The lists are not named Lists array. So I'll change the name to RunArray

@askoa askoa marked this pull request as draft January 23, 2023 11:30
@askoa askoa marked this pull request as ready for review January 23, 2023 14:30
@askoa
Copy link
Contributor Author

askoa commented Jan 23, 2023

@tustvold I resolved all your comments. Please take a look when you have some time.

@askoa askoa marked this pull request as ready for review January 23, 2023 19:55
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, thank you. Just a minor nit.

I intend to leave this open for a bit longer to give others a chance to review, and will then get it merged likely tomorrow

As an aside, for future it helps reviewers if you avoid rebasing once reviews have started, it makes it hard to see what has changed, especially for large PRs like this one. All commits get squashed on merge, so the branch history can be as messy as you like 😅

@askoa
Copy link
Contributor Author

askoa commented Jan 24, 2023

_As an aside, for future it helps reviewers if you avoid rebasing once reviews have started, _

Will do. Thanks for your reviews.

@tustvold
Copy link
Contributor

In the interests of unblocking follow on work I'm going to merge this, if there is further feedback it can be handled in follow up PRs.

Thank you @askoa for all your work on this 💪

@tustvold tustvold merged commit f0be9da into apache:master Jan 25, 2023
@ursabot
Copy link

ursabot commented Jan 25, 2023

Benchmark runs are scheduled for baseline = 98d35d3 and contender = f0be9da. f0be9da is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants