-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-6111: [Java] Support LargeVarChar and LargeBinary types #6425
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@siddharthteotia do you have time to review? |
|
@BryanCutler if you have time to review it would be appreciated otherwise I can take a look soon. |
|
I will review this tomorrow. Sorry for the delay. |
|
@siddharthteotia do you still intend to review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be long?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad -- It's the offset that has to be long, not this index since this is for the number of values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@siddharthteotia Thanks a lot for your attention. Currently, we are using int32 for all vector indices, and int64 for all ArrowBuf indices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense for this to extend BaseVariableWidthVector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might be able to reuse quite some functionality. However, generally in Arrow we have been okay with some code duplication to avoid the performance overhead of having a deep inheritance hierarchy. So may be the current approach is fine. Just wondering if we evaluated this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question.
We choose to have a separate BaseLargeVariableWidthVector class after careful thoughts and evaluations.
It appears that BaseVariableWidthVector and BaseLargeVariableWidthVector have many similarities. However, they are different in details, and the differences cannot be overcome by method overriding/overloading.
For example, in BaseVariableWidthVector, we have
protected final int getStartOffset(int index)
while in BaseLargeVariableWidthVector, we have
protected final long getStartOffset(int index)
These two methods differ only in return type, so we cannot use method overloading/overriding.
Another concern is performance. For example, handleSafe is a performance critical operation, because it may be called in each setSafe method call. So it is declared as a final method. If we share the same base class (BaseVariableWidthVector), we must provide two implementations for handleSafe, and it will be virtual instead of being final. This may lead to performance degradation.
|
@liyafan82 what's the status of this? Are you intending to get this into 0.17? |
@nealrichardson Thanks for your attention. |
941281d to
820b0a9
Compare
|
@siddharthteotia Do you have any more comments to this PR? |
BryanCutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made a quick pass and mostly looks good, I'll try to look at more detail soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couldn't you combine this with the "VarChar" block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised. Thanks for the good suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised. Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this meant to be implemented later? maybe add some comment and a message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Added some comments here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments added. Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So do we need #6323 to properly test out the 64bit offset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I will add integration test later, when this or that PR finishes.
|
@BryanCutler @liyafan82 can this be merged now? |
@emkornfield Since ARROW-7610 is finished, we need to add an integration test for LargeVarChar/LargeBinary in this PR (or in another PR)? |
855eae1 to
819c306
Compare
|
I have added integration test for the large varchar vector. |
|
If you're doing integration tests as part of this patch, please remove this skip: https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/datagen.py#L1446 |
|
Yes, please enable integration tests and lets make sure it passes before merging this. |
abf9ac4 to
2f72713
Compare
|
@nealrichardson and @BryanCutler Thanks for your good suggestion. I have revised the comment in |
|
@BryanCutler @siddharthteotia I think I'm OK merging if this is you are happy with the code and following up on integation tests as part of ARROW-6110? What do you two think? |
|
I think you should be removing the skip Java here https://github.com/apache/arrow/blob/2f72713446b04f8979b04f907e7185985028b0a8/dev/archery/archery/integration/datagen.py#L1480 to enable integration testing for this. I'll try to take another review pass early next week. |
|
@BryanCutler I believe the integration tests couple LargeList with LargeVarChar/LargeBinary, so both need to be implemented to enable the integration tests. |
|
The def generate_primitive_large_offsets_case(batch_sizes):
types = ['largebinary', 'largeutf8']
fields = []
for type_ in types:
fields.append(get_field(type_ + "_nullable", type_, nullable=True))
fields.append(get_field(type_ + "_nonnullable", type_, nullable=False))
return _generate_file('primitive_large_offsets', fields, batch_sizes |
|
@BryanCutler you are correct, for some reason I thought LargeList was coupled with LargeVarChar/LargeBinary. |
ada09de to
f795c56
Compare
|
@BryanCutler @emkornfield Sorry for my late response. |
|
Thanks @liyafan82 , looks like they didn't pass on this first try. Any idea what was causing the error? |
Thank you @BryanCutler . I am investigating. |
25c403f to
f429bd1
Compare
|
@BryanCutler The integration tests pass now. Please take another look when you have time. Thank you. |
BryanCutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job getting integration tests passing @liyafan82 ! I took a quick look and LGTM, will do a more thorough pass hopefully tomorrow before merging.
@BryanCutler Please take your time. Thanks a lot for your effort. |
siddharthteotia
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just did a final pass. LGTM. Thanks @BryanCutler for the review.
Thanks @liyafan82 for seeing this through. Good job
BryanCutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for taking a look also @siddharthteotia
|
|
||
| @Override | ||
| public Void visit(BaseLargeVariableWidthVector left, Void value) { | ||
| return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this another TODO? Would you mind making JIRAs for this and the other TODOs here so they can be tracked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I will fix this problem in ARROW-8402, and will open other JIRAs to track other TODOs. Thanks for your reminder.
|
merged to master, thanks @liyafan82 ! |
|
@BryanCutler @siddharthteotia @emkornfield I am aware that this PR is large, and consumes lots of effort. So thanks a lot for your effort and good comments! |
These types of vectors are also variable width vectors. However, they have a offset width of 8, so the underlying data buffer can be over 2GB in size. Closes apache#6425 from liyafan82/fly_0210_large Authored-by: liyafan82 <fan_li_ya@foxmail.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
These types of vectors are also variable width vectors. However, they have a offset width of 8, so the underlying data buffer can be over 2GB in size.