ARROW-17883: [Java] implement immutable table #14316

lwhite1 · 2022-10-04T18:48:22Z

Table is a new immutable tabular data structure based on FieldVectors.

This PR is described in detail in the included README.md file. The original design discussion can be found here, if you're interested.

Note to reviewers:

- This is a fairly large change set. Most of the code is in "getters" in the Row class. These methods are fairly well covered by tests, but it would be good to have someone look especially at the complex vector types.
- The only changes to existing classes were three new export methods added to the Data class. These use the logic for exporting VectorSchemaRoots.

github-actions · 2022-10-04T18:49:14Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

lwhite1 · 2022-10-05T12:25:32Z

If @lidavidm and @davisusanibar could do a review it would be greatly appreciated.

lidavidm

Generally looks reasonable, I left some small comments.

Some other notes:

- The core code uses FreeMarker templating to help cut down on code duplication (particularly for Row.java), though it has its own issues
- I wonder if parameterized tests might help cut down some of the test code? I'm not sure if the Java library has enough metaprogramming capabilities for that to be helpful (the C++ libraries do)
- It might be good to port the README to Sphinx (at some point) so it shows up in the actual documentation
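
For readers unfamiliar with the parameterized-test idea raised above, here is a minimal table-driven sketch in plain Java. It is an illustration only: the class `TableDrivenCheckSketch`, its `Case` type, and the sample arrays are hypothetical and do not come from this PR. JUnit 5 provides the same idea in framework form through `@ParameterizedTest`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from the PR.
class TableDrivenCheckSketch {

  /** One test case: a label, the getter call under test, and the expected value. */
  static final class Case {
    final String label;
    final Supplier<Object> getter;
    final Object expected;

    Case(String label, Supplier<Object> getter, Object expected) {
      this.label = label;
      this.getter = getter;
      this.expected = expected;
    }
  }

  /** Runs every case through the same check and returns the number of failures. */
  static int countFailures(List<Case> cases) {
    int failures = 0;
    for (Case c : cases) {
      if (!Objects.equals(c.getter.get(), c.expected)) {
        System.out.println("FAILED: " + c.label);
        failures++;
      }
    }
    return failures;
  }

  public static void main(String[] args) {
    // Stand-ins for typed columns; a real test would read from vectors.
    int[] ints = {1, 2};
    long[] longs = {3L, 4L};
    List<Case> cases = Arrays.asList(
        new Case("int column, row 0", () -> ints[0], 1),
        new Case("long column, row 1", () -> longs[1], 4L));
    System.out.println(countFailures(cases) + " failures");
  }
}
```

The checking logic is written once, and each additional getter costs one more `Case` entry instead of a new test method.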

java/vector/src/main/java/org/apache/arrow/vector/table/README.md

lidavidm · 2022-10-05T14:38:11Z

java/vector/src/main/java/org/apache/arrow/vector/table/BaseRow.java

+ * <p>
+ * This API is EXPERIMENTAL.
+ */
+public abstract class BaseRow {


For the class hierarchy: is a hypothetical MutableRow expected to inherit from BaseRow (in which case there's not a great way to abstract over immutable/mutable rows for read-only access) or from Row (in which case BaseRow seems a little redundant)?

It inherits from Row. You are right about it being a bit redundant. I was thinking there was some benefit to having the Row hierarchy mirror the Table hierarchy, but I was not completely sold on the idea either. If you feel strongly I will refactor it to remove BaseRow.

Since things are experimental, I think we can keep it for now and see how the mutable implementation actually looks

lidavidm · 2022-10-05T14:38:44Z

java/vector/src/main/java/org/apache/arrow/vector/table/BaseRow.java

+  /**
+   * Returns the standard character set to use for decoding strings. The Arrow format only supports UTF-8.
+   */
+  private final Charset defaultCharacterSet = StandardCharsets.UTF_8;


It seems this should be a static, since UTF-8 is part of the standard. (I could see there being a non-final attribute for the case of getting strings out of a binary vector, though.)

Yes. I could change that. It was created as a field before I learned that only UTF-8 was supported.
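
A minimal sketch of the suggestion in this thread, assuming a hypothetical class `Utf8RowSketch` (not the PR's actual code): the UTF-8 default becomes a `static final` constant shared by all instances, and a separate overload accepts a caller-chosen charset for the binary-vector case mentioned above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a row class; not the PR's actual code.
class Utf8RowSketch {

  /** Shared by all instances, since the Arrow format only supports UTF-8 strings. */
  private static final Charset DEFAULT_CHARACTER_SET = StandardCharsets.UTF_8;

  /** Decodes raw bytes (for example, a string value's bytes) as UTF-8. */
  static String decode(byte[] bytes) {
    return new String(bytes, DEFAULT_CHARACTER_SET);
  }

  /** Decodes bytes taken from a binary column using a caller-chosen charset. */
  static String decode(byte[] bytes, Charset charset) {
    return new String(bytes, charset);
  }

  public static void main(String[] args) {
    byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
    System.out.println(decode(utf8));

    byte[] latin1 = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
    System.out.println(decode(latin1, StandardCharsets.ISO_8859_1));
  }
}
```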

java/vector/src/main/java/org/apache/arrow/vector/table/BaseTable.java

java/vector/src/main/java/org/apache/arrow/vector/table/README.md

lwhite1 · 2022-10-05T16:25:48Z

@lidavidm Thank you for the review. I think all the comments are reasonable and will start on making the changes. LMK if you think hierarchy refactoring is necessary (or if I can get the other stuff done quickly, I may just do it).

On your other comments:

> The core code uses FreeMarker templating to help cut down on code duplication (particularly for Row.java), though it has its own issues

I looked briefly at that. It seemed like the template examples I looked at were much simpler, and TBH, I didn't have time to learn how the templating worked and apply it, even if it were easier.

> I wonder if parameterized tests might help cut down some of the test code? I'm not sure if the Java library has enough metaprogramming capabilities for that to be helpful (the C++ libraries do)

I'm not sure. Some refactoring could certainly be done to reduce the amount of code, but I don't think I would have time in this PR if it's going to make 10.0.0.

> It might be good to port the README to Sphinx (at some point) so it shows up in the actual documentation

I will port the README in a separate PR.

lidavidm · 2022-10-05T16:34:59Z

Ok, sounds good. Those comments aren't requirements, and if you already looked at FreeMarker then that's fine (it's hard to follow how it works, so I'm not a big fan of it unless it would reduce code size a lot). The main thing IMO is just being consistent about long vs int

lwhite1 · 2022-10-05T17:05:14Z

> The main thing IMO is just being consistent about long vs int

I would like to walk back committing to that change now that I've looked at the code and thought about it. There are two issues, the first being relatively trivial, but the second being important I think:

1. There are many (over 100) places where an int is required in the current code. Most can be fixed with search-and-replace, but still...
2. If you allow longs now in the API, the user will get a runtime exception (array index out of bounds) if they go outside the int range due to the int value limit in ValueVector.

Because of the second issue, I think it would be better to expand the range when longs are really supported than to expand it now and create a hazard in the code. It's a trivial API change later to go from int to long and no one will be adversely affected either then or now.
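
The hazard described in the second issue can be sketched as follows. This is an illustration under stated assumptions: `IntIndexedColumnSketch` is a hypothetical class backed by a plain Java array, standing in for int-indexed storage. It is not Arrow code.

```java
// Hypothetical sketch; IntIndexedColumnSketch is not an Arrow class.
class IntIndexedColumnSketch {
  private final int[] values;

  IntIndexedColumnSketch(int... values) {
    this.values = values;
  }

  /** int-based accessor: the compiler rejects arguments outside the int range. */
  int getByInt(int index) {
    return values[index];
  }

  /**
   * long-based accessor over int-indexed storage. Any long compiles,
   * but the narrowing cast can only be checked at runtime.
   */
  int getByLong(long index) {
    return values[(int) index];
  }

  public static void main(String[] args) {
    IntIndexedColumnSketch column = new IntIndexedColumnSketch(10, 20, 30);
    System.out.println(column.getByInt(1));

    try {
      // 3_000_000_000 does not fit in an int; the cast yields a negative index.
      column.getByLong(3_000_000_000L);
    } catch (ArrayIndexOutOfBoundsException e) {
      System.out.println("runtime failure: " + e.getMessage());
    }
  }
}
```

With the int signature the out-of-range call cannot be written at all, which is the argument for widening the API only once the underlying storage supports long indexes.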

lidavidm · 2022-10-05T17:06:26Z

Ok. It'll affect binary compatibility but I don't think that's been a concern for us so far, so no worries (and this is experimental code anyways)

lwhite1 · 2022-10-05T18:16:06Z

I removed the BaseRow class and converted the Charset variable to static.

lidavidm · 2022-10-05T18:47:21Z

@davisusanibar any comments?

lidavidm · 2022-10-05T18:50:35Z

Looking at CI failures: the manylinux failure seems to be a SIGABRT - possibly there's a bug in the C++ side of the C Data Interface module. And the Windows failure appears to be a Windows code for "heap corruption" - not sure what we can do about that

lwhite1 · 2022-10-05T20:21:41Z

> Looking at CI failures: the manylinux failure seems to be a SIGABRT - possibly there's a bug in the C++ side of the C Data Interface module. And the Windows failure appears to be a Windows code for "heap corruption" - not sure what we can do about that

Thanks for looking into that. It didn't seem like it could be related to the changes in the PR, but I wasn't sure.

lidavidm · 2022-10-05T20:31:08Z

If you want to copy paste them into Jiras - or I'll do it later (the Windows one seems iffy but I feel like I've seen it several times now) - the helpful thing would actually be to update CI to collect those core dump files when they happen so we can debug more easily

lidavidm · 2022-10-05T20:31:18Z

Regardless, not a blocker here

davisusanibar · 2022-10-06T04:34:27Z

> If you want to copy paste them into Jiras - or I'll do it later (the Windows one seems iffy but I feel like I've seen it several times now) - the helpful thing would actually be to update CI to collect those core dump files when they happen so we can debug more easily

For the Java JNI Gandiva errors, I just created this Jira ticket: https://issues.apache.org/jira/browse/ARROW-17946
I was not able to replicate the Windows issue locally, so I created an issue with the message from this PR: https://issues.apache.org/jira/browse/ARROW-17947

lwhite1 · 2022-10-06T12:44:00Z

@davisusanibar Do you approve the PR?

davisusanibar

LGTM, thank you

davisusanibar · 2022-10-06T14:30:09Z

I only have this question: given the current status for the 10.0.0 release, is it OK to only push the use of Table instead of VectorSchemaRoot for single-batch processing?

lwhite1 · 2022-10-06T14:43:32Z

> I only have this question: given the current status for the 10.0.0 release, is it OK to only push the use of Table instead of VectorSchemaRoot for single-batch processing?

Since the implementation is experimental I wouldn't 'push' people to use Table. I will update the docs which should let people know they have Table as an option and what the pros and cons are.

ursabot · 2022-10-06T17:22:36Z

Benchmark runs are scheduled for baseline = eb5f0f7 and contender = 418f115. 418f115 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.27% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.18% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 418f1155 ec2-t3-xlarge-us-east-2
[Failed] 418f1155 test-mac-arm
[Failed] 418f1155 ursa-i9-9960x
[Finished] 418f1155 ursa-thinkcentre-m75q
[Finished] eb5f0f78 ec2-t3-xlarge-us-east-2
[Failed] eb5f0f78 test-mac-arm
[Failed] eb5f0f78 ursa-i9-9960x
[Finished] eb5f0f78 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Table is a new immutable tabular data structure based on FieldVectors.

This PR is described in detail in the included README.md file. The original design discussion can be found [here](https://docs.google.com/document/d/1J77irZFWNnSID7vK71z26Nw_Pi99I9Hb9iryno8B03c/edit#heading=h.a1lebwljypq5), if you're interested.

Note to reviewers:

- This is a fairly large change set. Most of the code is in "getters" in the Row class. These methods are fairly well covered by tests, but it would be good to have someone look especially at the complex vector types.
- The only changes to existing classes were three new export methods added to the Data class. These use the logic for exporting VectorSchemaRoots.

Lead-authored-by: Larry White <ljw1001@gmail.com>
Co-authored-by: Larry White <lwhite1@users.noreply.github.com>
Signed-off-by: David Li <li.davidm96@gmail.com>

This PR bumps Apache Arrow version from 9.0.0 to 10.0.0. Main changes related to PyAmber:

## Java/Scala side:

- JDBC Driver for Arrow Flight SQL ([13800](apache/arrow#13800))
- Initial implementation of immutable Table API ([14316](apache/arrow#14316))
- Substrait, transaction, cancellation for Flight SQL ([13492](apache/arrow#13492))
- Read Arrow IPC, CSV, and ORC files by NativeDatasetFactory ([13811](apache/arrow#13811), [13973](apache/arrow#13973), [14182](apache/arrow#14182))
- Add utility to bind Arrow data to JDBC parameters ([13589](apache/arrow#13589))

## Python side:

- The batch_readahead and fragment_readahead arguments for scanning Datasets are exposed in Python ([ARROW-17299](https://issues.apache.org/jira/browse/ARROW-17299)).
- ExtensionArrays can now be created from a storage array through the pa.array(..) constructor ([ARROW-17834](https://issues.apache.org/jira/browse/ARROW-17834)).
- Converting ListArrays containing ExtensionArray values to numpy or pandas works by falling back to the storage array ([ARROW-17813](https://issues.apache.org/jira/browse/ARROW-17813)).
- Casting Tables to a new schema now honors the nullability flag in the target schema ([ARROW-16651](https://issues.apache.org/jira/browse/ARROW-16651)).

lwhite1 and others added 30 commits September 29, 2022 11:16

initial commit of Table support

f819459

Initial commit of Table support

e64843c

Update readme with description of exports

abdc234

Added "experimental API" text to class javadoc

0062d41

removed charset options, since arrow doesn't support anything but UTF-8

8b943e1

Merge branch 'apache:master' into 17883-implement-immutable-table

89c5299

Update README.md

928d672

Merge branch '17883-implement-immutable-table' of https://github.com/…

06af9fc

…lwhite1/arrow into 17883-implement-immutable-table

Added a variation to Data.exportTable

af63c9f

More tests; better documentation

8437401

cleanup and additional test for table

3ac5eff

doc and minor tweak to row.

2a99072

Merge branch 'apache:master' into 17883-implement-immutable-table

5764cdd

Extended tests to do initial coverage of UInt vectors

1fb9751

Initial set of getter tests using value holders

016f657

added tests for holders for time and timestamp vector types

879a3f6

Added tests for varbinary and varchar vectors. Some new test methods …

bdc3647

…and clean-up

more Row tests

c5fefa5

added tests for duration vectors

98ce8f2

Minor improvements in testing timestamp vectors

a34dc6f

Added tests for bitvector

8631e68

added tests for time milli vector

c225f5c

additional time vector tests

77e5a64

Merge branch 'apache:master' into 17883-implement-immutable-table

484bad5

add support for timestamp with TZ tests and some additional duration …

4439648

…testing

Merge branch '17883-implement-immutable-table' of https://github.com/…

f95413a

…lwhite1/arrow into 17883-implement-immutable-table

Added tests for interval day vectors

33b70db

tests for interval month and year vectors

bdc2111

added missing tests for complex types

07088bb

Added tests for extension type, and fixed bug in extensiontype code

fdaf4a8

Add apache license text to README.md

fc00850

lidavidm reviewed Oct 5, 2022

lwhite1 added 2 commits October 5, 2022 12:37

fix readme issues

983f311

Restore line erroneously deleted

c844047

Removes BaseRow class and converts CharSet variable to static

3887d8c

lidavidm approved these changes Oct 5, 2022

davisusanibar approved these changes Oct 6, 2022

lidavidm merged commit 418f115 into apache:master Oct 6, 2022

Yicong-Huang mentioned this pull request Dec 8, 2022

Bump Apache Arrow to 10.0.0 Texera/texera#1764

Merged

GeorgeAp mentioned this pull request Apr 27, 2023

[FEDE-6183] Upgrade arrow to 11.0.0 sirensolutions/arrow#46

Merged

asfimport mentioned this pull request Jan 11, 2023

[Java][Gandiva] Crashed tests: Error occurred in starting fork apache/arrow-java#239

Open
