-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-17883: [Java] implement immutable table #14316
ARROW-17883: [Java] implement immutable table #14316
Conversation
…lwhite1/arrow into 17883-implement-immutable-table
…lwhite1/arrow into 17883-implement-immutable-table
|
If @lidavidm and @davisusanibar could do a review it would be greatly appreciated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks reasonable, I left some small comments.
Some other notes:
- The core code uses FreeMarker templating to help cut down on code duplication (particularly for Row.java), though it has its own issues
- I wonder if parameterized tests might help cut down some of the test code? I'm not sure if the Java library has enough metaprogramming capabilities for that to be helpful (the C++ libraries do)
- It might be good to port the README to Sphinx (at some point) so it shows up in the actual documentation
java/vector/src/main/java/org/apache/arrow/vector/table/README.md
Outdated
Show resolved
Hide resolved
* <p> | ||
* This API is EXPERIMENTAL. | ||
*/ | ||
public abstract class BaseRow { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the class hierarchy: is a hypothetical MutableRow expected to inherit from BaseRow (in which case there's not a great way to abstract over immutable/mutable rows for read-only access) or from Row (in which case BaseRow seems a little redundant)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It inherits from Row. You are right about it being a bit redundant. I was thinking there was some benefit to having the Row hierarchy mirror the Table hierarchy, but I was not completely sold on the idea either. If you feel strongly I will refactor it to remove BaseRow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since things are experimental, I think we can keep it for now and see how the mutable implementation actually looks
/** | ||
* Returns the standard character set to use for decoding strings. The Arrow format only supports UTF-8. | ||
*/ | ||
private final Charset defaultCharacterSet = StandardCharsets.UTF_8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this should be a static, since UTF-8 is part of the standard. (I could see there being a non-final attribute for the case of getting strings out of a binary vector, though.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I could change that. It was created as a field before I learned that only UTF-8 was supported.
java/vector/src/main/java/org/apache/arrow/vector/table/README.md
Outdated
Show resolved
Hide resolved
@lidavidm Thank you for the review. I think all the comments are reasonable and will start on making the changes. LMK if you think hierarchy refactoring is necessary (or if I can get the other stuff done quickly, I may just do it). On your other comments:
I looked briefly at that. It seemed like the template examples I looked at were much simpler, and TBH, I didn't have time to learn how the templating worked and apply it, even if it were easier.
I'm not sure. Some refactoring could be done to reduce the amount of code I'm sure. I don't think I would have time in this PR if it's going to make 10.0.0
I will port the README as a separate PR |
Ok, sounds good. Those comments aren't requirements, and if you already looked at FreeMarker then that's fine (it's hard to follow how it works, so I'm not a big fan of it unless it would reduce code size a lot). The main thing IMO is just being consistent about |
I would like to walk back committing to that change now that I've looked at the code and thought about it. There are two issues, the first being relatively trivial, but the second being important I think:
Because of the second issue, I think it would be better to expand the range when longs are really supported than to expand it now and create a hazard in the code. It's a trivial API change later to go from int to long and no one will be adversely affected either then or now. |
Ok. It'll affect binary compatibility but I don't think that's been a concern for us so far, so no worries (and this is experimental code anyways) |
I removed the BaseRow class and converted the Charset variable to static. |
@davisusanibar any comments? |
Looking at CI failures: the manylinux failure seems to be a SIGABRT - possibly there's a bug in the C++ side of the C Data Interface module. And the Windows failure appears to be a Windows code for "heap corruption" - not sure what we can do about that |
Thanks for looking into that. It didn't seem like it could be related to the changes in the PR, but I wasn't sure. |
If you want to copy paste them into Jiras - or I'll do it later (the Windows one seems iffy but I feel like I've seen it several times now) - the helpful thing would actually be to update CI to collect those core dump files when they happen so we can debug more easily |
Regardless, not a blocker here |
|
@davisusanibar Do you approve the PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you
I only have this question: At current status for 10.0.0 release: It is ok to only push to use of Table instead of VectorSchemaRoot for single-batch processing? |
Since the implementation is experimental I wouldn't 'push' people to use Table. I will update the docs which should let people know they have Table as an option and what the pros and cons are. |
Benchmark runs are scheduled for baseline = eb5f0f7 and contender = 418f115. 418f115 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Table is a new immutable tabular data structure based on FieldVectors. This PR is described in detail in the included README.md file. The original design discussion can be found [here](https://docs.google.com/document/d/1J77irZFWNnSID7vK71z26Nw_Pi99I9Hb9iryno8B03c/edit#heading=h.a1lebwljypq5), if you're interested. Note to reviewers: - This is a fairly large change set. Most of the code is in "getters" in the Row class. These methods are fairly well covered by tests, but it would be good to have someone look especially at the complex vector types. - The only changes to existing classes were three new export methods added to the Data class. These use the logic for exporting VectorSchemaRoots. Lead-authored-by: Larry White <ljw1001@gmail.com> Co-authored-by: Larry White <lwhite1@users.noreply.github.com> Signed-off-by: David Li <li.davidm96@gmail.com>
This PR bumps Apache Arrow version from 9.0.0 to 10.0.0. Main changes related to PyAmber: ## Java/Scala side: - JDBC Driver for Arrow Flight SQL ([13800](apache/arrow#13800)) - Initial implementation of immutable Table API ([14316](apache/arrow#14316)) - Substrait, transaction, cancellation for Flight SQL ([13492](apache/arrow#13492)) - Read Arrow IPC, CSV, and ORC files by NativeDatasetFactory ([13811](apache/arrow#13811), [13973](apache/arrow#13973), [14182](apache/arrow#14182)) - Add utility to bind Arrow data to JDBC parameters ([13589](apache/arrow#13589)) ## Python side: - The batch_readahead and fragment_readahead arguments for scanning Datasets are exposed in Python ([ARROW-17299](https://issues.apache.org/jira/browse/ARROW-17299)). - ExtensionArrays can now be created from a storage array through the pa.array(..) constructor ([ARROW-17834](https://issues.apache.org/jira/browse/ARROW-17834)). - Converting ListArrays containing ExtensionArray values to numpy or pandas works by falling back to the storage array ([ARROW-17813](https://issues.apache.org/jira/browse/ARROW-17813)). - Casting Tables to a new schema now honors the nullability flag in the target schema ([ARROW-16651](https://issues.apache.org/jira/browse/ARROW-16651)).
Table is a new immutable tabular data structure based on FieldVectors.
This PR is described in detail in the included README.md file. The original design discussion can be found here, if you're interested.
Note to reviewers: