feat: Add RunEndEncodedArray
#3553
First, there are many style issues in the code comments. Please leave a space after //. I noted some above, but there are more.
@askoa I've not finished reviewing this.
I intend to review this tomorrow. This is also a breaking change, as it adds a new DataType variant. Edit: ran out of time today, will try to find time over the weekend.
Thank you for working on this, and sorry it has taken so long to review. I've left a few comments and hope to review this again prior to review. Whilst painful, it is important we get this right 😄
Some broad comments:
- The change to GenericByteBuilder is unsound and should not be merged.
- The handling of null counts in the specification is a little unclear to me. I've asked for clarification on the mailing list - https://lists.apache.org/thread/4x14b0h3fcfwzk68jpoq3n5xvr241qz5
- I wonder if we should use RunArray / RunBuilder instead of either RunEndEncoded or REE. The latter is not a standard initialism, and the former is a little odd imo, not to mention verbose, given it isn't DictionaryEncodedArray.
Overall I really like where this is headed, nice work 👍
I'll change it to |
To me that makes me think it contains just run ends and not values 😅 Perhaps wait for some others to weigh in before changing anything. Thoughts, @viirya @alamb? Edit: At least imo, the fact that it stores run ends and not run lengths is immaterial from a user perspective; what matters is that it stores runs of data. It isn't |
Just to give some additional context. The name So when defining an array for the Type, Does the suggestion include changing the name of the data type to |
I have no strong opinion on the naming. To me I think |
There are no plural names so far. The lists are not named |
@tustvold I resolved all your comments. Please take a look when you have some time.
This looks good to me, thank you. Just a minor nit.
I intend to leave this open for a bit longer to give others a chance to review, and will then get it merged, likely tomorrow.
As an aside, for the future, it helps reviewers if you avoid rebasing once reviews have started. It makes it hard to see what has changed, especially for large PRs like this one. All commits get squashed on merge, so the branch history can be as messy as you like 😅
Will do. Thanks for your reviews.
In the interests of unblocking follow-on work I'm going to merge this. If there is further feedback, it can be handled in follow-up PRs. Thank you @askoa for all your work on this 💪
Benchmark runs are scheduled for baseline = 98d35d3 and contender = f0be9da. f0be9da is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Part of #3520.
Included changes from unmerged PR #3534
Rationale for this change
See issue description.
What changes are included in this PR?
- RunEndEncodedArray
- GenericByte and ArrowPrimitiveType
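For readers unfamiliar with the encoding discussed in the review, here is a minimal sketch of what "stores run ends and not run lengths" means. It is plain Rust and does not use the arrow-rs API added by this PR; the function names run_end_encode and get are hypothetical helpers for illustration, and using i32 for run ends is an assumption.

```rust
/// Hypothetical helper (not part of arrow-rs): run-end encode a slice.
/// Returns (run_ends, values), where run_ends[i] is the exclusive logical
/// index at which run i ends, i.e. a cumulative count rather than a length.
fn run_end_encode<T: PartialEq + Clone>(input: &[T]) -> (Vec<i32>, Vec<T>) {
    let mut run_ends: Vec<i32> = Vec::new();
    let mut values: Vec<T> = Vec::new();
    for (i, item) in input.iter().enumerate() {
        if values.last() == Some(item) {
            // Same value as the current run: extend the run by moving its end.
            *run_ends.last_mut().unwrap() = (i + 1) as i32;
        } else {
            // A new run starts here and, so far, ends right after this element.
            values.push(item.clone());
            run_ends.push((i + 1) as i32);
        }
    }
    (run_ends, values)
}

/// Hypothetical helper: fetch the logical element at `index`.
/// Because run ends are sorted, a binary search finds the physical run.
fn get<'a, T>(run_ends: &[i32], values: &'a [T], index: usize) -> Option<&'a T> {
    // Position of the first run whose end is strictly greater than `index`.
    let physical = run_ends.partition_point(|&end| (end as usize) <= index);
    values.get(physical)
}
```

Under these assumptions, encoding [1, 1, 2, 2, 2, 3] gives run ends [2, 5, 6] and values [1, 2, 3]. Storing cumulative ends rather than lengths is what allows the lookup by binary search instead of a linear scan.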
Are there any user-facing changes?
Users will get a brand new encoder in the Arrow format!
No breaking changes.