build: Use specified branch of arrow-rs with workaround to invalid offset buffers from Java Arrow #239

Merged on Apr 8, 2024 (6 commits)

Conversation

viirya
Member

@viirya viirya commented Mar 31, 2024

Which issue does this PR close?

Closes #236.

Rationale for this change

What changes are included in this PR?

How are these changes tested?

@viirya viirya changed the title from "feat: Change default value of columnar shuffle config" to "feat: Use specified branch of arrow-rs with workaround to invalid offset buffers from Java Arrow" on Mar 31, 2024
@viirya viirya force-pushed the columnar_shuffle_default2 branch 4 times, most recently from e925ace to 9e6f9b2, on March 31, 2024 04:40
@viirya viirya force-pushed the columnar_shuffle_default2 branch from 9e6f9b2 to 36c2f12 on March 31, 2024 05:55
@codecov-commenter

codecov-commenter commented Mar 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 33.48%. Comparing base (aa6ddc5) to head (6161601).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##               main     #239   +/-   ##
=========================================
  Coverage     33.48%   33.48%           
  Complexity      776      776           
=========================================
  Files           108      108           
  Lines         37178    37178           
  Branches       8146     8146           
=========================================
  Hits          12448    12448           
  Misses        22107    22107           
  Partials       2623     2623           

@viirya
Member Author

viirya commented Apr 1, 2024

With the forked arrow-rs and DataFusion repos, which include a hacky workaround, I no longer see the following error from TPCDSQuerySuite:

General execution error with reason org.apache.comet.CometNativeException: Fail to process Arrow array with reason C Data interface error: The external buffer at position 1 is null...

There are still a few query failures, but they are unrelated to this error.

-parquet = { version = "~50.0.0", default-features = false, features = ["experimental"] }
+arrow = { git = "https://github.com/viirya/arrow-rs.git", rev = "3f1ae0c", features = ["prettyprint", "ffi", "chrono-tz"] }
+arrow-array = { git = "https://github.com/viirya/arrow-rs.git", rev = "3f1ae0c" }
+arrow-data = { git = "https://github.com/viirya/arrow-rs.git", rev = "3f1ae0c" }
+arrow-schema = { git = "https://github.com/viirya/arrow-rs.git", rev = "3f1ae0c" }
+arrow-string = { git = "https://github.com/viirya/arrow-rs.git", rev = "3f1ae0c" }
+parquet = { git = "https://github.com/viirya/arrow-rs.git", rev = "3f1ae0c", default-features = false, features = ["experimental"] }
Member Author

This patch switches to a pinned revision in my forked repo, which adds a hacky workaround for the Java Arrow bug. Once the Java Arrow fix is merged and released, we can revert this change.

Comment on lines +69 to +72
datafusion-common = { git = "https://github.com/viirya/arrow-datafusion.git", rev = "111a940" }
datafusion = { default-features = false, git = "https://github.com/viirya/arrow-datafusion.git", rev = "111a940", features = ["unicode_expressions"] }
datafusion-functions = { git = "https://github.com/viirya/arrow-datafusion.git", rev = "111a940" }
datafusion-physical-expr = { git = "https://github.com/viirya/arrow-datafusion.git", rev = "111a940", default-features = false, features = ["unicode_expressions"] }
Member Author

DataFusion also needs to use the same pinned version of arrow-rs; otherwise the two arrow-rs versions will conflict.

Member Author

@viirya viirya left a comment

Apart from the arrow-rs and DataFusion crate changes, the other changes adapt to API changes in DataFusion, including the new built-in scalar resolution and the new ExecutionPlan API.

@viirya
Member Author

viirya commented Apr 1, 2024

cc @sunchao @andygrove

@viirya viirya changed the title from "feat: Use specified branch of arrow-rs with workaround to invalid offset buffers from Java Arrow" to "build: Use specified branch of arrow-rs with workaround to invalid offset buffers from Java Arrow" on Apr 1, 2024
@advancedxy
Contributor

I tried to follow all the discussions in the related issues and PRs, and I want to make sure I understand the situation correctly. Please correct me if I'm wrong:

  1. Java Arrow's C Data Interface exports an empty (null) buffer for an empty offset buffer, which will be fixed in GH-40038: [Java] Export non empty offset buffer for variable-size layout through C Data Interface arrow#40043.
  2. arrow-rs's implementation would allow an empty offset buffer, per the discussion and its follow-up PRs: Ensure there is a single zero in the offsets buffer for an empty ListArray. arrow-rs#1620 and https://github.com/apache/arrow-rs/pull/2836/files
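
To make the invariant under discussion concrete, here is a minimal sketch in plain Rust (standard library only). The names `OffsetError` and `validate_offsets` are hypothetical and are not arrow-rs APIs. It assumes 32-bit offsets and checks what the Arrow columnar format requires of a variable-size layout: `len + 1` offsets, non-decreasing, so even an empty array carries a single offset.

```rust
/// Ways a 32-bit offsets buffer can violate the variable-size layout rules.
/// Hypothetical type for illustration; not part of arrow-rs.
#[derive(Debug, PartialEq)]
pub enum OffsetError {
    /// No offsets at all, which is the shape described in point 1 above.
    Empty,
    /// The buffer does not hold exactly `len + 1` offsets.
    WrongLength { expected: usize, actual: usize },
    /// The first offset is negative.
    NegativeStart,
    /// The offset at `index` is smaller than the one before it.
    NotMonotonic { index: usize },
}

/// Checks `offsets` against an array of `len` elements.
pub fn validate_offsets(offsets: &[i32], len: usize) -> Result<(), OffsetError> {
    if offsets.is_empty() {
        return Err(OffsetError::Empty);
    }
    if offsets.len() != len + 1 {
        return Err(OffsetError::WrongLength {
            expected: len + 1,
            actual: offsets.len(),
        });
    }
    if offsets[0] < 0 {
        return Err(OffsetError::NegativeStart);
    }
    for i in 1..offsets.len() {
        if offsets[i] < offsets[i - 1] {
            return Err(OffsetError::NotMonotonic { index: i });
        }
    }
    Ok(())
}
```

Under this check, an empty array with offsets `[0]` passes, while an empty array with no offsets at all is rejected.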

If upstream arrow-rs is going to pick up the fix, could we wait a bit and use upstream arrow-rs directly? I'm a bit worried about the stability and quality of using the main branch of arrow-rs directly.

@advancedxy
Contributor

Also, do you think it's viable to special-case the offset buffer on Comet's Java side, @viirya?

I briefly browsed Arrow's export code. I think it should be doable to define a special ArrayExporter for Java Arrow's C Data exporter and have it write a single-value buffer in place of an empty offset buffer?

@viirya
Member Author

viirya commented Apr 1, 2024

> I tried to follow all the discussions in the related issues and PRs, and I want to make sure I understand the situation correctly. Please correct me if I'm wrong:
>
> 1. Java Arrow's C Data Interface exports an empty (null) buffer for an empty offset buffer, which will be fixed in GH-40038: [Java] Export non empty offset buffer for variable-size layout through C Data Interface arrow#40043.

Correct.

> 2. arrow-rs's implementation would allow an empty offset buffer, per the discussion and its follow-up PRs: Ensure there is a single zero in the offsets buffer for an empty ListArray. arrow-rs#1620 and https://github.com/apache/arrow-rs/pull/2836/files

No. An empty offset buffer is invalid, so we are not going to merge the fix into arrow-rs. It is only a temporary, hacky fix until the real fix lands in Java Arrow.
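
As a rough illustration of what an import-side workaround of this kind could look like, here is a sketch in plain Rust. It is not the code in the forked branch; `offsets_or_default` is a hypothetical helper, and modelling a null external buffer as `None` is an assumption made for the example.

```rust
use std::borrow::Cow;

/// Returns usable offsets for an imported variable-size array.
///
/// `imported` is `None` when the external buffer pointer was null.
/// A missing or empty buffer is tolerated only for a zero-length array,
/// where it is replaced by the single offset `[0]`. Any other case is
/// still reported as an error.
pub fn offsets_or_default<'a>(
    imported: Option<&'a [i32]>,
    len: usize,
) -> Result<Cow<'a, [i32]>, String> {
    match imported {
        // Normal case: borrow the buffer exactly as it was exported.
        Some(offsets) if !offsets.is_empty() => Ok(Cow::Borrowed(offsets)),
        // Workaround: synthesize the one offset an empty array needs.
        _ if len == 0 => Ok(Cow::Owned(vec![0])),
        // A non-empty array without offsets cannot be repaired.
        _ => Err(format!(
            "offsets buffer is null or empty but array length is {}",
            len
        )),
    }
}
```

The point of the sketch is that the leniency is narrow: only the zero-length case is patched, which is why such a change is a stopgap rather than something to merge upstream.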

> If upstream arrow-rs is going to pick up the fix, could we wait a bit and use upstream arrow-rs directly? I'm a bit worried about the stability and quality of using the main branch of arrow-rs directly.

We don't use the main branch of arrow-rs directly; we use a specific branch in a forked repo with a temporary fix. The branch stays frozen unless we update it.

@viirya
Member Author

viirya commented Apr 1, 2024

> Also, do you think it's viable to special-case the offset buffer on Comet's Java side, @viirya?
>
> I briefly browsed Arrow's export code. I think it should be doable to define a special ArrayExporter for Java Arrow's C Data exporter and have it write a single-value buffer in place of an empty offset buffer?

It looks feasible to have a custom ArrayExporter, but that is also a hacky fix: we would need to manually handle every array type with an offset-based layout.
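
The manual handling mentioned above can be sketched as follows. A real exporter-side fix would live in Java; this plain-Rust sketch only shows the logic, and the `Layout` enum, `offset_width`, and `pad_offsets_for_export` are hypothetical names covering a small subset of Arrow types.

```rust
/// A small, illustrative subset of Arrow layouts (not exhaustive).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Layout {
    Int32,
    Utf8,
    Binary,
    List,
    LargeUtf8,
    LargeBinary,
    LargeList,
}

/// Width in bytes of one offset, or `None` if the layout has no offsets buffer.
pub fn offset_width(layout: Layout) -> Option<usize> {
    match layout {
        // Regular variable-size layouts use 32-bit offsets.
        Layout::Utf8 | Layout::Binary | Layout::List => Some(4),
        // "Large" variants use 64-bit offsets.
        Layout::LargeUtf8 | Layout::LargeBinary | Layout::LargeList => Some(8),
        // Fixed-width primitives have no offsets buffer.
        Layout::Int32 => None,
    }
}

/// Replaces an empty offsets buffer with a single zero offset before export.
/// Buffers that are non-empty, or that belong to layouts without offsets,
/// are returned unchanged.
pub fn pad_offsets_for_export(layout: Layout, offsets: Vec<u8>) -> Vec<u8> {
    match offset_width(layout) {
        Some(width) if offsets.is_empty() => vec![0u8; width],
        _ => offsets,
    }
}
```

Every offset-based type has to be listed explicitly and with the right width, which is the maintenance burden that makes this approach hacky as well.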

@advancedxy
Contributor

> No. An empty offset buffer is invalid, so we are not going to merge the fix into arrow-rs. It is only a temporary, hacky fix until the real fix lands in Java Arrow.

I see. Thanks for the clarification. I hadn't seen that conclusion and thought arrow-rs would allow an empty buffer for compatibility reasons.

> We don't use the main branch of arrow-rs directly; we use a specific branch in a forked repo with a temporary fix. The branch stays frozen unless we update it.

I know the branch stays frozen unless we update it. But it might contain untested or unstable code, since it was cut directly from master at some point and then given a temporary fix. If we go with this approach, do you think it would be better to create the branch from a released tag, such as 50.0.0? That way, the base of the branch has already been tested for a release.

> But that is also a hacky fix: we would need to manually handle every array type with an offset-based layout.

Of course, it's hacky too. The long-term fix is to fix the export path of Java Arrow's C Data Interface. Compared with a pinned branch of arrow-rs, though, this fix is self-contained in Comet's repo, which might have a lower maintenance cost, and it doesn't depend on arrow-rs/DataFusion, which may iterate quickly with new features and fixes.

Anyway, I think the current approach is also a good way to iterate fast, as long as Java Arrow's C Data Interface is fixed soon.

@viirya
Member Author

viirya commented Apr 5, 2024

@sunchao @andygrove @parthchandra

Member

@andygrove andygrove left a comment

LGTM. Do we have an issue to track switching back to a versioned release of DataFusion/Arrow once Arrow Java 16 is released?

@viirya
Member Author

viirya commented Apr 8, 2024

> LGTM. Do we have an issue to track switching back to a versioned release of DataFusion/Arrow once Arrow Java 16 is released?

Created #248 to track it.

@viirya viirya merged commit 59f535c into apache:main Apr 8, 2024
29 checks passed
@viirya
Member Author

viirya commented Apr 8, 2024

Merged. Thanks.

himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
…fset buffers from Java Arrow (apache#239)

* feat: Use specified branch of arrow-rs with workaround to invalid offset buffers from Java Arrow

* Use FunctionRegistry

* Fix

* Update

* Restore config

* Restore plan stability