GH-41569: [Java] ListViewVector Implementation for UnionListViewReader #43077

Merged
merged 37 commits on Aug 1, 2024

Collaborator

@vibhatha vibhatha commented Jun 28, 2024

Rationale for this change

This PR contains multiple components that are mainly required to add C Data interface support for ListViewVector. It addresses the following major issues associated with that work.

What changes are included in this PR?

Apart from that, the following features have also been added:

  • JSON Writer/Reader
  • Complex Writer functionality

Are these changes tested?

Yes

Are there any user-facing changes?

Yes. For the ListView-related APIs used to build a ListViewVector, we are introducing listview instead of list, startListView instead of startList, and endListView instead of endList.

@vibhatha
Collaborator Author

vibhatha commented Jul 22, 2024

@github-actions crossbow submit -g java-jars

Invalid group(s) {'java-jars'}. Must be one of {'verify-rc-source', 'go', 'nightly-packaging', 'example', 'linux-amd64', 'nightly-tests', 'conan', 'ruby', 'verify-rc-binaries', 'example-python', 'verify-rc-source-macos', 'integration', 'r', 'homebrew', 'java', 'nightly-release', 'example-cpp', 'linux', 'linux-arm64', 'verify-rc', 'c-glib', 'verify-rc-source-linux', 'packaging', 'test', 'verify-rc-jars', 'verify-rc-wheels', 'python', 'cpp', 'wheel', 'nightly', 'fuzz', 'vcpkg', 'conda'}
The Archery job run can be found at: https://github.com/apache/arrow/actions/runs/10038941988

@vibhatha
Collaborator Author

@github-actions crossbow submit -g java

Revision: 955e9b9

Submitted crossbow builds: ursacomputing/crossbow @ actions-9c744c4dc4

Task Status
java-jars GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
verify-rc-source-java-linux-almalinux-8-amd64 GitHub Actions
verify-rc-source-java-linux-conda-latest-amd64 GitHub Actions
verify-rc-source-java-linux-ubuntu-20.04-amd64 GitHub Actions
verify-rc-source-java-linux-ubuntu-22.04-amd64 GitHub Actions
verify-rc-source-java-macos-amd64 GitHub Actions

@vibhatha vibhatha marked this pull request as ready for review July 23, 2024 06:42
Member

@lidavidm lidavidm left a comment

Can we split up the C Data changes from the writer/reader changes?

java/c/src/test/java/org/apache/arrow/c/RoundtripTest.java (outdated, resolved)
java/c/src/test/python/integration_tests.py (outdated, resolved)
@@ -257,70 +268,6 @@ public void write(${name}Holder holder) {
public void writeNull() {
}

@Override
Member

Why did these have to be moved out?

Collaborator Author

You're correct, I moved some methods unnecessarily, and I have now moved them back. The split now takes care of the ListView-related things in the view class and leaves the rest as usual in the non-view class.

this.writer = writer;
}

public PromotableViewWriter promote() {
Member

Naming this promote is very confusing because it really just converts to a ViewWriter instead of a PromotableWriter. Also, it's missing a docstring.

Collaborator Author

Right, would toViewWriter() be better?

Comment on lines 220 to 226
if (writer instanceof PromotableViewWriter) {
  // ensure writers are initialized
  ((PromotableViewWriter) writer).getWriter(MinorType.LISTVIEW);
} else {
  writer = ((PromotableWriter) writer).promote();
  ((PromotableViewWriter) writer).getWriter(MinorType.LISTVIEW);
}
Member

When would the 'else' case come up in the first place?

Collaborator Author

By the time we reach this call, a PromotableWriter already exists; it gets created when we first start writing through the StructWriter (it could be an intWriter, a floatWriter, etc.). The else branch is the one that gets called when a vector type other than ListViewVector was written first. So, to make sure we get the correct writer (UnionListViewWriter in our case), we need to make sure this cast happens.

Collaborator Author

At first it was confusing to me how the original logic works: it holds only a single instance of PromotableWriter, and all the writing goes through it (even though multiple writer types are used via the StructWriter). We needed a differentiation in the first place to make sure the ListView types can be accommodated in the UnionVector maintained underneath these writers.
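
The branch discussed in this thread can be sketched roughly as follows. This is a toy model, not the Arrow implementation: SketchWriter, SketchViewWriter, and ensureViewWriter are hypothetical names standing in for PromotableWriter, PromotableViewWriter, and the instanceof check quoted above, and the assumption that already-initialized state is carried over is mine.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the "convert to a view writer if needed" branch; not the Arrow classes. */
final class ViewWriterSketch {

  /** Hypothetical stand-in for a writer created before any ListView was written. */
  static class SketchWriter {
    final List<String> initializedTypes = new ArrayList<>();

    /** Records that a writer for the given type has been initialized (idempotent). */
    void getWriter(String minorType) {
      if (!initializedTypes.contains(minorType)) {
        initializedTypes.add(minorType);
      }
    }

    /** Converts to the view-aware writer, carrying over already-initialized state. */
    SketchViewWriter toViewWriter() {
      SketchViewWriter view = new SketchViewWriter();
      view.initializedTypes.addAll(initializedTypes);
      return view;
    }
  }

  /** Hypothetical stand-in for the view-aware writer. */
  static class SketchViewWriter extends SketchWriter {}

  /** Mirrors the if/else in the review: convert only when the writer is not view-aware yet. */
  static SketchWriter ensureViewWriter(SketchWriter writer) {
    if (!(writer instanceof SketchViewWriter)) {
      // The "else" case: some other type (e.g. INT) was written first.
      writer = writer.toViewWriter();
    }
    writer.getWriter("LISTVIEW");
    return writer;
  }

  public static void main(String[] args) {
    SketchWriter first = new SketchWriter();
    first.getWriter("INT");
    SketchWriter converted = ensureViewWriter(first);
    System.out.println(converted instanceof SketchViewWriter);
    System.out.println(converted.initializedTypes);
  }
}
```

In this sketch, calling ensureViewWriter on a writer that is already view-aware returns the same instance, which corresponds to the first branch of the quoted code.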

Comment on lines 284 to 291
} else if (bufferType.equals(SIZE)
    && vector.getValueCount() == 0
    && vector.getMinorType() == MinorType.LISTVIEW) {
  // Empty vectors may not have allocated a sizes buffer
  try (ArrowBuf vectorBufferTmp = vector.getAllocator().buffer(4)) {
    vectorBufferTmp.setInt(0, 0);
    writeValueToGenerator(bufferType, vectorBufferTmp, null, vector, i);
  }
Member

It should be OK to have an empty sizes buffer? Why are we setting a single value?

Collaborator Author

Need to test this.

Collaborator Author

Gathering my thoughts, I think the reason for doing this was basically to mirror what was done for the offset buffer. There, the current logic sets a buffer with a single value, and I think I naively followed the same approach. You have a point here; let me test this further.

Member

I think it should not apply for list view. We shouldn't blindly apply the same hacks since that may cover up bugs.
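
To illustrate why the single-value workaround is tied to offsets and not to sizes, here is a minimal sketch of the two layouts. It assumes the buffer-length rules of the Arrow columnar format (n + 1 offsets for List; n offsets and n sizes for ListView). The class and method names are hypothetical, and the buffers are modeled with plain int arrays rather than the Arrow Java API.

```java
/** Toy model of List and ListView buffer lengths; not the Arrow Java API. */
final class ListViewLayoutSketch {

  /** List layout: n lists need n + 1 offsets, so an empty vector still holds one offset (0). */
  static int[] listOffsets(int[][] lists) {
    int[] offsets = new int[lists.length + 1];
    for (int i = 0; i < lists.length; i++) {
      offsets[i + 1] = offsets[i] + lists[i].length;
    }
    return offsets;
  }

  /** ListView layout: exactly one offset per list, so an empty vector has none. */
  static int[] listViewOffsets(int[][] lists) {
    int[] offsets = new int[lists.length];
    int next = 0;
    for (int i = 0; i < lists.length; i++) {
      offsets[i] = next;
      next += lists[i].length;
    }
    return offsets;
  }

  /** ListView layout: exactly one size per list, so an empty vector has none. */
  static int[] listViewSizes(int[][] lists) {
    int[] sizes = new int[lists.length];
    for (int i = 0; i < lists.length; i++) {
      sizes[i] = lists[i].length;
    }
    return sizes;
  }

  public static void main(String[] args) {
    int[][] empty = new int[0][];
    System.out.println("list offsets for empty vector: " + listOffsets(empty).length);
    System.out.println("list view sizes for empty vector: " + listViewSizes(empty).length);
  }
}
```

Under these assumptions, an empty List vector still has one offset entry, while an empty ListView vector has zero offsets and zero sizes, which is consistent with the suggestion that an empty sizes buffer should be acceptable.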

@github-actions github-actions bot added the awaiting changes label and removed the awaiting review label Jul 24, 2024
@vibhatha
Collaborator Author

Can we split up the C Data changes from the writer/reader changes?

Mmm, we could. Although the PR title says otherwise, I started this work with the C Data interface in mind, and the required components overlap significantly when it comes to testing them properly, so I thought I would add the C Data interface here. That code is very small compared to the rest. Would you prefer another PR for the C Data component?

@lidavidm
Member

It's already hard to review, and unless the C Data, JSON, etc. changes are strictly required, I'd prefer to separate them.

@vibhatha
Collaborator Author

I will move the C Data components out of this PR today. Sorry about that.

@github-actions github-actions bot added the awaiting change review and awaiting changes labels and removed the awaiting changes and awaiting change review labels Jul 24, 2024
@assignUser assignUser removed their request for review July 24, 2024 15:34
@vibhatha vibhatha requested a review from lidavidm July 24, 2024 23:34
@vibhatha
Collaborator Author

@lidavidm I moved the C Data and Visitor components to a follow-up PR.

@github-actions github-actions bot removed the awaiting change review label Jul 26, 2024
@github-actions github-actions bot added the awaiting change review label and removed the awaiting changes label Aug 1, 2024
@vibhatha
Collaborator Author

vibhatha commented Aug 1, 2024

@lidavidm updated the PR.

<@pp.dropOutputFile />
<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/UnionViewWriter.java" />

package org.apache.arrow.vector.complex.impl;
Member

This doesn't seem fixed?

Also overall, can we get the imports grouped together at least?

Collaborator Author

@lidavidm sorry about the confusion. I took a look at the previous templates and they all have these issues; apologies, I just adopted that pattern without giving it much thought.

Could you please check whether it is resolved now?

@github-actions github-actions bot added the awaiting changes label and removed the awaiting change review label Aug 1, 2024
@github-actions github-actions bot added the awaiting change review label and removed the awaiting changes label Aug 1, 2024
@github-actions github-actions bot added the awaiting merge label and removed the awaiting change review label Aug 1, 2024
@lidavidm lidavidm merged commit d4d92e4 into apache:main Aug 1, 2024
16 checks passed
@lidavidm lidavidm removed the awaiting merge label Aug 1, 2024
@vibhatha vibhatha deleted the gh-41569 branch August 1, 2024 11:25
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit d4d92e4.

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 606 possible false positives for unstable benchmarks that are known to sometimes produce them.
