ARROW-17883: [Java] implement immutable table #14316

Merged: 46 commits merged into apache:master on Oct 6, 2022

Contributor

@lwhite1 lwhite1 commented Oct 4, 2022

Table is a new immutable tabular data structure based on FieldVectors.

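This PR is described in detail in the included README.md file. The original design discussion can be found [here](https://docs.google.com/document/d/1J77irZFWNnSID7vK71z26Nw_Pi99I9Hb9iryno8B03c/edit#heading=h.a1lebwljypq5), if you're interested.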

Note to reviewers:

  • This is a fairly large change set. Most of the code is in "getters" in the Row class. These methods are fairly well covered by tests, but it would be good to have someone look especially at the complex vector types.
  • The only changes to existing classes were three new export methods added to the Data class. These use the logic for exporting VectorSchemaRoots.

lwhite1 and others added 30 commits September 29, 2022 11:16
@github-actions

github-actions bot commented Oct 4, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@lwhite1
Contributor Author

lwhite1 commented Oct 5, 2022

If @lidavidm and @davisusanibar could do a review it would be greatly appreciated.

Member

@lidavidm lidavidm left a comment

Generally looks reasonable, I left some small comments.

Some other notes:

  • The core code uses FreeMarker templating to help cut down on code duplication (particularly for Row.java), though it has its own issues
  • I wonder if parameterized tests might help cut down some of the test code? I'm not sure if the Java library has enough metaprogramming capabilities for that to be helpful (the C++ libraries do)
  • It might be good to port the README to Sphinx (at some point) so it shows up in the actual documentation

* <p>
* This API is EXPERIMENTAL.
*/
public abstract class BaseRow {
Member

For the class hierarchy: is a hypothetical MutableRow expected to inherit from BaseRow (in which case there's not a great way to abstract over immutable/mutable rows for read-only access) or from Row (in which case BaseRow seems a little redundant)?

Contributor Author

It inherits from Row. You are right about it being a bit redundant. I was thinking there was some benefit to having the Row hierarchy mirror the Table hierarchy, but I was not completely sold on the idea either. If you feel strongly I will refactor it to remove BaseRow.

Member

Since things are experimental, I think we can keep it for now and see how the mutable implementation actually looks
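
To make the hierarchy question concrete, here is a minimal, self-contained sketch of the arrangement described above, in which a mutable row extends the read-only `Row`. The class bodies, the `int[]` backing store, and the names `RowHierarchySketch`, `setInt`, and `sum` are illustrative assumptions, not the classes in this PR.

```java
public class RowHierarchySketch {

  /** Read-only cursor over a single int column (hypothetical stand-in for Row). */
  public static class Row {
    protected final int[] column;
    protected int rowNumber = 0;

    public Row(int[] column) {
      this.column = column;
    }

    /** Moves the cursor to the given row and returns this row for chaining. */
    public Row setPosition(int rowNumber) {
      this.rowNumber = rowNumber;
      return this;
    }

    /** Reads the value at the current cursor position. */
    public int getInt() {
      return column[rowNumber];
    }
  }

  /** Hypothetical mutable variant: inherits every getter and adds a setter. */
  public static class MutableRow extends Row {
    public MutableRow(int[] column) {
      super(column);
    }

    /** Overwrites the value at the current cursor position. */
    public void setInt(int value) {
      column[rowNumber] = value;
    }
  }

  /** Read-only code is written against Row, so it accepts either kind of row. */
  public static int sum(Row row, int rowCount) {
    int total = 0;
    for (int i = 0; i < rowCount; i++) {
      total += row.setPosition(i).getInt();
    }
    return total;
  }
}
```

With this shape, `sum` works for both `Row` and `MutableRow` without a separate base class, which is the sense in which `BaseRow` looks redundant.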

/**
* Returns the standard character set to use for decoding strings. The Arrow format only supports UTF-8.
*/
private final Charset defaultCharacterSet = StandardCharsets.UTF_8;
Member

It seems this should be a static, since UTF-8 is part of the standard. (I could see there being a non-final attribute for the case of getting strings out of a binary vector, though.)

Contributor Author

Yes. I could change that. It was created as a field before I learned that only UTF-8 was supported.
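
A minimal sketch of the suggested change, assuming the field is only ever used for decoding: the charset becomes a `static final` constant, and an overload covers the binary-vector case where the caller supplies a charset. The class name `CharsetConstantSketch` and both `decode` methods are hypothetical and are not taken from the PR.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetConstantSketch {

  /** Shared by every instance, since the Arrow format only supports UTF-8 strings. */
  private static final Charset DEFAULT_CHARACTER_SET = StandardCharsets.UTF_8;

  /** Decodes bytes using the shared UTF-8 constant. */
  public static String decode(byte[] bytes) {
    return new String(bytes, DEFAULT_CHARACTER_SET);
  }

  /** Overload for bytes read from a binary vector, where the caller names the charset. */
  public static String decode(byte[] bytes, Charset charset) {
    return new String(bytes, charset);
  }
}
```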

@lwhite1
Contributor Author

lwhite1 commented Oct 5, 2022

@lidavidm Thank you for the review. I think all the comments are reasonable and will start on making the changes. LMK if you think hierarchy refactoring is necessary (or if I can get the other stuff done quickly, I may just do it).

On your other comments:

> The core code uses FreeMarker templating to help cut down on code duplication (particularly for Row.java), though it has its own issues

I looked briefly at that. It seemed like the template examples I looked at were much simpler, and TBH, I didn't have time to learn how the templating worked and apply it, even if it were easier.

> I wonder if parameterized tests might help cut down some of the test code? I'm not sure if the Java library has enough metaprogramming capabilities for that to be helpful (the C++ libraries do)

I'm not sure. I'm sure some refactoring could be done to reduce the amount of code, but I don't think I would have time in this PR if it's going to make 10.0.0.

> It might be good to port the README to Sphinx (at some point) so it shows up in the actual documentation

I will port the README as a separate PR
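
On the parameterized-test idea, one way to cut repetition with nothing beyond the JDK is a table of cases run through a single check. The sketch below is a plain table-driven stand-in for that idea, not a framework feature and not code from this PR. The names `TableDrivenGetterTestSketch`, `Case`, `run`, and `exampleCases` are hypothetical, and the arrays stand in for typed vectors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.function.Supplier;

public class TableDrivenGetterTestSketch {

  /** One row of the test table: a label, the expected value, and how to get the actual one. */
  public static final class Case {
    final String name;
    final Object expected;
    final Supplier<Object> actual;

    public Case(String name, Object expected, Supplier<Object> actual) {
      this.name = name;
      this.expected = expected;
      this.actual = actual;
    }
  }

  /** Runs every case through the same comparison and returns the names of the failures. */
  public static List<String> run(List<Case> cases) {
    List<String> failures = new ArrayList<>();
    for (Case c : cases) {
      if (!Objects.equals(c.expected, c.actual.get())) {
        failures.add(c.name);
      }
    }
    return failures;
  }

  /** Example table; each array is an assumed stand-in for one typed vector. */
  public static List<Case> exampleCases() {
    int[] ints = {1, 2};
    long[] longs = {3L, 4L};
    String[] strings = {"a", "b"};
    List<Case> cases = new ArrayList<>();
    cases.add(new Case("int", 2, () -> ints[1]));
    cases.add(new Case("long", 4L, () -> longs[1]));
    cases.add(new Case("string", "b", () -> strings[1]));
    return cases;
  }
}
```

Adding coverage for another type then costs one table entry rather than one more test method. JUnit 5's `@ParameterizedTest` expresses the same pattern inside a test framework.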

@lidavidm
Member

lidavidm commented Oct 5, 2022

Ok, sounds good. Those comments aren't requirements, and if you already looked at FreeMarker then that's fine (it's hard to follow how it works, so I'm not a big fan of it unless it would reduce code size a lot). The main thing IMO is just being consistent about long vs int

@lwhite1
Contributor Author

lwhite1 commented Oct 5, 2022

> The main thing IMO is just being consistent about long vs int

I would like to walk back committing to that change now that I've looked at the code and thought about it. There are two issues, the first being relatively trivial, but the second being important I think:

  1. There are many (over 100) places where an int is required in the current code. Most can be fixed with search-and-replace, but still...
  2. If you allow longs now in the API, the user will get a runtime exception (array index out of bounds) if they go outside the int range due to the int value limit in ValueVector.

Because of the second issue, I think it would be better to expand the range when longs are really supported than to expand it now and create a hazard in the code. It's a trivial API change later to go from int to long and no one will be adversely affected either then or now.
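
The hazard in point 2 can be illustrated with a small, self-contained sketch. The class `LongIndexHazardSketch` and its `int[]` storage are hypothetical stand-ins for int-indexed vector storage, and none of the method names come from Arrow.

```java
public class LongIndexHazardSketch {

  // Hypothetical stand-in for storage that can only be indexed by int.
  private final int[] values;

  public LongIndexHazardSketch(int[] values) {
    this.values = values;
  }

  /** int-based getter: the compiler rejects a long argument unless the caller casts it. */
  public int get(int rowNumber) {
    return values[rowNumber];
  }

  /** long-based getter that narrows internally: compiles for any long, misbehaves at run time. */
  public int getWithLong(long rowNumber) {
    return values[(int) rowNumber]; // the narrowing cast keeps only the low 32 bits
  }

  /** Checked alternative: Math.toIntExact throws ArithmeticException on overflow. */
  public int getChecked(long rowNumber) {
    return values[Math.toIntExact(rowNumber)];
  }
}
```

In this sketch, `getWithLong(Integer.MAX_VALUE + 1L)` narrows to a negative index and throws `ArrayIndexOutOfBoundsException`, while `getWithLong(1L << 32)` narrows to index 0 and silently returns the wrong row. An `int` parameter moves the problem to compile time, which is the argument for widening the API only once longs are supported end to end.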

@lidavidm
Member

lidavidm commented Oct 5, 2022

Ok. It'll affect binary compatibility but I don't think that's been a concern for us so far, so no worries (and this is experimental code anyways)

@lwhite1
Contributor Author

lwhite1 commented Oct 5, 2022

I removed the BaseRow class and converted the Charset variable to static.

@lidavidm
Member

lidavidm commented Oct 5, 2022

@davisusanibar any comments?

@lidavidm
Member

lidavidm commented Oct 5, 2022

Looking at CI failures: the manylinux failure seems to be a SIGABRT - possibly there's a bug in the C++ side of the C Data Interface module. And the Windows failure appears to be a Windows code for "heap corruption" - not sure what we can do about that

@lwhite1
Contributor Author

lwhite1 commented Oct 5, 2022

> Looking at CI failures: the manylinux failure seems to be a SIGABRT - possibly there's a bug in the C++ side of the C Data Interface module. And the Windows failure appears to be a Windows code for "heap corruption" - not sure what we can do about that

Thanks for looking into that. It didn't seem like it could be related to the changes in the PR, but I wasn't sure.

@lidavidm
Member

lidavidm commented Oct 5, 2022

If you want to copy paste them into Jiras - or I'll do it later (the Windows one seems iffy but I feel like I've seen it several times now) - the helpful thing would actually be to update CI to collect those core dump files when they happen so we can debug more easily

@lidavidm
Member

lidavidm commented Oct 5, 2022

Regardless, not a blocker here

@davisusanibar
Contributor

> If you want to copy paste them into Jiras - or I'll do it later (the Windows one seems iffy but I feel like I've seen it several times now) - the helpful thing would actually be to update CI to collect those core dump files when they happen so we can debug more easily

@lwhite1
Contributor Author

lwhite1 commented Oct 6, 2022

@davisusanibar Do you approve the PR?

Contributor

@davisusanibar davisusanibar left a comment

LGTM, thank you

@davisusanibar
Contributor

I only have this question: as of the 10.0.0 release, is it OK to push the use of Table instead of VectorSchemaRoot for single-batch processing?

@lwhite1
Contributor Author

lwhite1 commented Oct 6, 2022

> I only have this question: as of the 10.0.0 release, is it OK to push the use of Table instead of VectorSchemaRoot for single-batch processing?

Since the implementation is experimental I wouldn't 'push' people to use Table. I will update the docs which should let people know they have Table as an option and what the pros and cons are.

@lidavidm lidavidm merged commit 418f115 into apache:master Oct 6, 2022
@ursabot

ursabot commented Oct 6, 2022

Benchmark runs are scheduled for baseline = eb5f0f7 and contender = 418f115. 418f115 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.27% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.18% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 418f1155 ec2-t3-xlarge-us-east-2
[Failed] 418f1155 test-mac-arm
[Failed] 418f1155 ursa-i9-9960x
[Finished] 418f1155 ursa-thinkcentre-m75q
[Finished] eb5f0f78 ec2-t3-xlarge-us-east-2
[Failed] eb5f0f78 test-mac-arm
[Failed] eb5f0f78 ursa-i9-9960x
[Finished] eb5f0f78 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Oct 17, 2022
Table is a new immutable tabular data structure based on FieldVectors.

This PR is described in detail in the included README.md file. The original design discussion can be found [here](https://docs.google.com/document/d/1J77irZFWNnSID7vK71z26Nw_Pi99I9Hb9iryno8B03c/edit#heading=h.a1lebwljypq5), if you're interested.

Note to reviewers:
- This is a fairly large change set. Most of the code is in "getters" in the Row class. These methods are fairly well covered by tests, but it would be good to have someone look especially at the complex vector types. 
- The only changes to existing classes were three new export methods added to the Data class. These use the logic for exporting VectorSchemaRoots. 

Lead-authored-by: Larry White <ljw1001@gmail.com>
Co-authored-by: Larry White <lwhite1@users.noreply.github.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Yicong-Huang added a commit to Texera/texera that referenced this pull request Dec 13, 2022
This PR bumps Apache Arrow version from 9.0.0 to 10.0.0.

Main changes related to PyAmber:

## Java/Scala side:

- JDBC Driver for Arrow Flight SQL
([13800](apache/arrow#13800))
- Initial implementation of immutable Table API
([14316](apache/arrow#14316))
- Substrait, transaction, cancellation for Flight SQL
([13492](apache/arrow#13492))
- Read Arrow IPC, CSV, and ORC files by NativeDatasetFactory
([13811](apache/arrow#13811),
[13973](apache/arrow#13973),
[14182](apache/arrow#14182))
- Add utility to bind Arrow data to JDBC parameters
([13589](apache/arrow#13589))

## Python side:

- The batch_readahead and fragment_readahead arguments for scanning
Datasets are exposed in Python
([ARROW-17299](https://issues.apache.org/jira/browse/ARROW-17299)).
- ExtensionArrays can now be created from a storage array through the
pa.array(..) constructor
([ARROW-17834](https://issues.apache.org/jira/browse/ARROW-17834)).
- Converting ListArrays containing ExtensionArray values to numpy or
pandas works by falling back to the storage array
([ARROW-17813](https://issues.apache.org/jira/browse/ARROW-17813)).
- Casting Tables to a new schema now honors the nullability flag in the
target schema
([ARROW-16651](https://issues.apache.org/jira/browse/ARROW-16651)).