@liyafan82
Contributor

These types of vectors are also variable-width vectors. However, they have an offset width of 8 bytes, so the underlying data buffer can exceed 2GB in size.
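To make the description concrete, here is a minimal sketch of the layout it implies: one 8-byte offset per value boundary, plus a shared data buffer. The class `LargeOffsetsSketch` and its methods are hypothetical and are not the Arrow API. The toy data buffer is a small heap `ByteBuffer`, so it cannot itself exceed 2GB; only the offset width mirrors the description.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical, simplified model of a variable-width vector with 8-byte offsets. */
final class LargeOffsetsSketch {
  // Bytes per offset entry, as stated in the PR description.
  static final int OFFSET_WIDTH = 8;

  private final ByteBuffer offsets; // (maxValues + 1) entries of OFFSET_WIDTH bytes each
  private final ByteBuffer data;    // concatenated bytes of all values
  private int valueCount;

  LargeOffsetsSketch(int maxValues, int dataCapacity) {
    // allocate() zero-fills, so the first offset entry is already 0.
    this.offsets =
        ByteBuffer.allocate((maxValues + 1) * OFFSET_WIDTH).order(ByteOrder.LITTLE_ENDIAN);
    this.data = ByteBuffer.allocate(dataCapacity);
  }

  /** Appends one value and records its end position as the next 64-bit offset. */
  void append(String value) {
    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
    long start = offsets.getLong(valueCount * OFFSET_WIDTH);
    // The cast is only needed because this toy buffer is int-indexed.
    data.position((int) start);
    data.put(bytes);
    offsets.putLong((valueCount + 1) * OFFSET_WIDTH, start + bytes.length);
    valueCount++;
  }

  /** Reads value `index` from the byte range between two consecutive offsets. */
  String get(int index) {
    long start = offsets.getLong(index * OFFSET_WIDTH);
    long end = offsets.getLong((index + 1) * OFFSET_WIDTH);
    return new String(data.array(), (int) start, (int) (end - start), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    LargeOffsetsSketch vector = new LargeOffsetsSketch(3, 32);
    vector.append("arrow");
    vector.append("");
    vector.append("large");
    System.out.println(vector.get(0) + "|" + vector.get(1) + "|" + vector.get(2));
  }
}
```

With 4-byte signed offsets the largest representable end position is `Integer.MAX_VALUE`, which is where the 2GB limit comes from; storing each offset as a `long` removes that limit.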

@emkornfield
Contributor

@siddharthteotia do you have time to review?

@emkornfield
Contributor

@BryanCutler if you have time to review, it would be appreciated; otherwise I can take a look soon.

@siddharthteotia
Contributor

I will review this tomorrow. Sorry for the delay.

@emkornfield
Contributor

@siddharthteotia do you still intend to review?

Contributor

shouldn't this be long?

Contributor

My bad -- It's the offset that has to be long, not this index since this is for the number of values.

Contributor Author

@liyafan82 Mar 12, 2020

@siddharthteotia Thanks a lot for your attention. Currently, we are using int32 for all vector indices, and int64 for all ArrowBuf indices.
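A small sketch of why that split matters for 8-byte offsets: the value index stays an `int`, but the byte position derived from it has to be computed as a `long`. `IndexWidthSketch` and both method names are hypothetical, chosen only for illustration.

```java
/** Hypothetical helper contrasting int and long arithmetic for buffer positions. */
final class IndexWidthSketch {
  // Bytes per offset entry for the large variable-width vectors.
  static final int OFFSET_WIDTH = 8;

  /** Widens the int index before multiplying, so the byte position cannot overflow. */
  static long offsetBufferPosition(int index) {
    return (long) index * OFFSET_WIDTH;
  }

  /** The same multiplication done in int, shown only to illustrate the wrap-around. */
  static int overflowingPosition(int index) {
    return index * OFFSET_WIDTH;
  }

  public static void main(String[] args) {
    int index = 300_000_000; // a valid int32 vector index
    System.out.println(offsetBufferPosition(index));
    System.out.println(overflowingPosition(index));
  }
}
```

For index 300,000,000 the correct byte position is 2,400,000,000, which is above `Integer.MAX_VALUE`, so the int version wraps to a negative number.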

Contributor

Does it make sense for this to extend BaseVariableWidthVector?

Contributor

We might be able to reuse quite a bit of functionality. However, in Arrow we have generally been okay with some code duplication to avoid the performance overhead of a deep inheritance hierarchy. So maybe the current approach is fine. Just wondering whether we evaluated this.

Contributor Author

@liyafan82 Mar 12, 2020

Good question.
We chose to have a separate BaseLargeVariableWidthVector class after careful thought and evaluation.

BaseVariableWidthVector and BaseLargeVariableWidthVector appear to have many similarities. However, they differ in the details, and the differences cannot be overcome by method overriding/overloading.

For example, in BaseVariableWidthVector, we have

protected final int getStartOffset(int index)

while in BaseLargeVariableWidthVector, we have

protected final long getStartOffset(int index)

These two methods differ only in return type, so we cannot use method overloading/overriding.

Another concern is performance. For example, handleSafe is a performance-critical operation, because it may be called in each setSafe call, so it is declared as a final method. If we shared the same base class (BaseVariableWidthVector), we would have to provide two implementations of handleSafe, and it would be virtual instead of final. This may lead to performance degradation.
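The return-type argument can be sketched with two stand-in classes. `SmallOffsets` and `LargeOffsets` below are hypothetical simplifications, not the actual Arrow base classes: each keeps its own `final` accessor, which is only possible because neither extends the other.

```java
/** Hypothetical stand-ins for the two base classes discussed above. */
final class ReturnTypeSketch {

  /** 4-byte offsets: the start offset is an int. */
  static class SmallOffsets {
    private final int[] offsets;

    SmallOffsets(int[] offsets) {
      this.offsets = offsets;
    }

    protected final int getStartOffset(int index) {
      return offsets[index];
    }
  }

  /**
   * 8-byte offsets: the start offset is a long. If this class extended SmallOffsets,
   * declaring `long getStartOffset(int)` would not compile: the parameter list is
   * identical, so it is not an overload, and an override of a method with a
   * primitive return type must return exactly the same type.
   */
  static class LargeOffsets {
    private final long[] offsets;

    LargeOffsets(long[] offsets) {
      this.offsets = offsets;
    }

    protected final long getStartOffset(int index) {
      return offsets[index];
    }
  }

  public static void main(String[] args) {
    SmallOffsets small = new SmallOffsets(new int[] {0, 5});
    LargeOffsets large = new LargeOffsets(new long[] {0L, 3_000_000_000L});
    System.out.println(small.getStartOffset(1));
    System.out.println(large.getStartOffset(1));
  }
}
```

Keeping both accessors `final` also means a call site always has a single known target, which is the property the comment above wants to preserve for handleSafe.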

@nealrichardson
Member

@liyafan82 what's the status of this? Are you intending to get this into 0.17?

@liyafan82
Contributor Author

> @liyafan82 what's the status of this? Are you intending to get this into 0.17?

@nealrichardson Thanks for your attention.
This PR is being reviewed. I am not sure whether the review can finish in time.

@liyafan82
Contributor Author

liyafan82 commented Apr 10, 2020

@siddharthteotia Do you have any more comments on this PR?

Member

@BryanCutler left a comment

Made a quick pass and it mostly looks good. I'll try to look in more detail soon.

Member

couldn't you combine this with the "VarChar" block?

Contributor Author

Revised. Thanks for the good suggestion.

Member

same here?

Contributor Author

Revised. Thank you.

Member

Is this meant to be implemented later? Maybe add a comment and a message.

Contributor Author

Sounds good. Added some comments here.

Member

same here

Contributor Author

Comments added. Thank you.

Member

So do we need #6323 to properly test out the 64bit offset?

Contributor Author

Sure. I will add an integration test later, once either this PR or that one is finished.

@emkornfield
Contributor

@BryanCutler @liyafan82 can this be merged now?

@liyafan82
Contributor Author

> @BryanCutler @liyafan82 can this be merged now?

@emkornfield Since ARROW-7610 is finished, should we add an integration test for LargeVarChar/LargeBinary in this PR, or in a separate one?

@liyafan82
Contributor Author

I have added an integration test for the large varchar vector.

@nealrichardson
Member

If you're doing integration tests as part of this patch, please remove this skip: https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/datagen.py#L1446

@BryanCutler
Member

Yes, please enable the integration tests and let's make sure they pass before merging this.

@liyafan82 force-pushed the fly_0210_large branch 2 times, most recently from abf9ac4 to 2f72713 (May 6, 2020 03:43)

@liyafan82
Contributor Author

liyafan82 commented May 6, 2020

@nealrichardson and @BryanCutler Thanks for your good suggestions.
I have removed the skip in datagen.py, but the tests failed. The reason is that we do not support the LargeList type yet:

2020-05-06T03:03:50.3032543Z com.fasterxml.jackson.databind.exc.InvalidTypeIdException: Could not resolve type id 'largelist' as a subtype of [simple type, class org.apache.arrow.vector.types.pojo.ArrowType]: known type ids = [binary, bool, date, decimal, duration, fixedsizebinary, fixedsizelist, floatingpoint, int, interval, largebinary, largeutf8, list, map, null, struct, time, timestamp, union, utf8] (for POJO property 'type')
2020-05-06T03:03:50.3033308Z  at [Source: (File); line: 7, column: 19] (through reference chain: org.apache.arrow.vector.types.pojo.Schema["fields"]->java.util.ArrayList[0]->org.apache.arrow.vector.types.pojo.Field["type"])

I have revised the comment in datagen.py accordingly.

@emkornfield
Contributor

@BryanCutler @siddharthteotia I think I'm OK with merging this if you are happy with the code, and following up on the integration tests as part of ARROW-6110. What do you two think?

@BryanCutler
Member

I think you should remove the Java skip here https://github.com/apache/arrow/blob/2f72713446b04f8979b04f907e7185985028b0a8/dev/archery/archery/integration/datagen.py#L1480 to enable integration testing for this. I'll try to take another review pass early next week.

@emkornfield
Contributor

@BryanCutler I believe the integration tests couple LargeList with LargeVarChar/LargeBinary, so both need to be implemented to enable the integration tests.

@BryanCutler
Member

generate_primitive_large_offsets_case looks like it only tests 'largebinary' and 'largeutf8'. Are large lists somehow part of that?

def generate_primitive_large_offsets_case(batch_sizes):
    types = ['largebinary', 'largeutf8']

    fields = []

    for type_ in types:
        fields.append(get_field(type_ + "_nullable", type_, nullable=True))
        fields.append(get_field(type_ + "_nonnullable", type_, nullable=False))

    return _generate_file('primitive_large_offsets', fields, batch_sizes)

@emkornfield
Contributor

@BryanCutler you are correct, for some reason I thought LargeList was coupled with LargeVarChar/LargeBinary.

@liyafan82 force-pushed the fly_0210_large branch 2 times, most recently from ada09de to f795c56 (May 22, 2020 09:47)

@liyafan82
Contributor Author

@BryanCutler @emkornfield Sorry for my late response.
I have removed the skip. Let's see if the integration tests can pass this time.

@BryanCutler
Member

BryanCutler commented May 22, 2020

Thanks @liyafan82, it looks like they didn't pass on this first try. Any idea what was causing the error?

Error accessing files
Current token (VALUE_STRING) not numeric, can not use numeric value accessors
 at [Source: (File); line: 65, column: 14]
12:46:37.776 [main] ERROR org.apache.arrow.tools.Integration - Error accessing files
com.fasterxml.jackson.core.JsonParseException: Current token (VALUE_STRING) not numeric, can not use numeric value accessors
 at [Source: (File); line: 65, column: 14]
	at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1804)
	at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:698)
	at com.fasterxml.jackson.core.base.ParserBase._parseNumericValue(ParserBase.java:781)
	at com.fasterxml.jackson.core.base.ParserBase._parseIntValue(ParserBase.java:799)
	at com.fasterxml.jackson.core.base.ParserBase.getIntValue(ParserBase.java:645)
	at org.apache.arrow.vector.ipc.JsonFileReader$BufferHelper$5.read(JsonFileReader.java:306)
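The thread does not record what the eventual fix was. One plausible reading of the trace, assuming the integration JSON files encode 64-bit offsets as string-formatted integers while 32-bit offsets are bare numbers, is that the reader called an int accessor on a string token. The sketch below shows only that idea in plain Java; `OffsetTokenSketch` and `parseOffset` are hypothetical and do not use or mirror the Jackson or Arrow APIs.

```java
/** Hypothetical helper: reads an offset that may be a bare number or a quoted string. */
final class OffsetTokenSketch {

  /**
   * Accepts either a bare JSON number such as 42 or a string-formatted
   * integer such as "3000000000" and returns it as a 64-bit value.
   */
  static long parseOffset(String rawToken) {
    String token = rawToken.trim();
    // Strip one pair of surrounding double quotes if the token is a JSON string.
    if (token.length() >= 2 && token.startsWith("\"") && token.endsWith("\"")) {
      token = token.substring(1, token.length() - 1);
    }
    return Long.parseLong(token);
  }

  public static void main(String[] args) {
    System.out.println(parseOffset("42"));
    System.out.println(parseOffset("\"3000000000\""));
  }
}
```

A reader built on a JSON library would make the same distinction by checking the token type before choosing a numeric or text accessor.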

@liyafan82
Contributor Author

> Thanks @liyafan82, it looks like they didn't pass on this first try. Any idea what was causing the error?

Thank you @BryanCutler . I am investigating.

@liyafan82 force-pushed the fly_0210_large branch 3 times, most recently from 25c403f to f429bd1 (May 26, 2020 07:08)
@liyafan82 closed this May 26, 2020
@liyafan82 reopened this May 26, 2020

@liyafan82
Contributor Author

@BryanCutler The integration tests pass now. Please take another look when you have time. Thank you.

Member

@BryanCutler left a comment

Great job getting the integration tests passing, @liyafan82! I took a quick look and it LGTM. I will do a more thorough pass, hopefully tomorrow, before merging.

@liyafan82
Contributor Author

> Great job getting the integration tests passing, @liyafan82! I took a quick look and it LGTM. I will do a more thorough pass, hopefully tomorrow, before merging.

@BryanCutler Please take your time. Thanks a lot for your effort.

Contributor

@siddharthteotia left a comment

Just did a final pass. LGTM. Thanks @BryanCutler for the review.
Thanks @liyafan82 for seeing this through. Good job.

Member

@BryanCutler left a comment

LGTM, thanks for also taking a look, @siddharthteotia.


@Override
public Void visit(BaseLargeVariableWidthVector left, Void value) {
return null;
Member

Is this another TODO? Would you mind making JIRAs for this and the other TODOs here so they can be tracked?

Contributor Author

Sure. I will fix this problem in ARROW-8402, and will open other JIRAs to track other TODOs. Thanks for your reminder.

@BryanCutler
Member

Merged to master, thanks @liyafan82!

@liyafan82
Contributor Author

@BryanCutler @siddharthteotia @emkornfield I am aware that this PR is large and took a lot of effort to review, so thank you for your time and helpful comments!

pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025

These types of vectors are also variable width vectors. However, they have an offset width of 8, so the underlying data buffer can be over 2GB in size.

Closes apache#6425 from liyafan82/fly_0210_large

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>