ARROW-17525: [Java] Read ORC files using NativeDatasetFactory #13973
Conversation
@davisusanibar or @lwhite1, would either of you like to take a look here?
Hi @davisusanibar or @lwhite1
Hi @igor-suhorukov, let me review this today. Sorry for the delay.
String dataName = "test-orc"; | ||
String basePath = TMP.getRoot().getAbsolutePath(); | ||
|
||
TypeDescription orcSchema = TypeDescription.fromString("struct<ints:int>"); | ||
Writer writer = OrcFile.createWriter(new Path(basePath, dataName), | ||
OrcFile.writerOptions(new Configuration()).setSchema(orcSchema)); | ||
VectorizedRowBatch batch = orcSchema.createRowBatch(); | ||
LongColumnVector longColumnVector = (LongColumnVector) batch.cols[0]; | ||
longColumnVector.vector[0] = Integer.MIN_VALUE; | ||
longColumnVector.vector[1] = Integer.MAX_VALUE; | ||
batch.size = 2; | ||
writer.addRowBatch(batch); | ||
writer.close(); |
Would it be possible to extract this into a common method? Something like OrcWriteSupport.writeTempFile
Thanks @davisusanibar, sure! Extracted it into OrcWriteSupport.writeTempFile
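For reference, a minimal sketch of what such a helper could look like, built from the test code above (the parameter list here is illustrative; the actual method signature in the PR may differ):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcWriteSupport {
  /**
   * Writes the given int values as a single-column ORC file at the given path.
   * Assumes a schema like struct<ints:int> and at most one row batch of values.
   */
  public static void writeTempFile(TypeDescription orcSchema, Path path, Integer[] values)
      throws IOException {
    Writer writer = OrcFile.createWriter(path,
        OrcFile.writerOptions(new Configuration()).setSchema(orcSchema));
    VectorizedRowBatch batch = orcSchema.createRowBatch();
    LongColumnVector longColumnVector = (LongColumnVector) batch.cols[0];
    for (int i = 0; i < values.length; i++) {
      longColumnVector.vector[i] = values[i];
    }
    batch.size = values.length;
    writer.addRowBatch(batch);
    writer.close();
  }
}
```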
Thank you
writer.close();

String orcDatasetUri = new File(basePath, dataName).toURI().toString();
FileSystemDatasetFactory factory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
    FileFormat.ORC, orcDatasetUri);
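For readers following along, here is a self-contained sketch of the read path this test exercises, based on the public Java Dataset API (FileFormat.ORC is the enum value this PR adds; scanBatches() reflects the newer scanning API, so minor details may differ from the test itself, and the file path is hypothetical):

```java
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowReader;

public class OrcDatasetExample {
  public static void main(String[] args) throws Exception {
    String uri = "file:///tmp/test-orc"; // hypothetical path to an ORC file
    try (BufferAllocator allocator = new RootAllocator();
         // FileFormat.ORC routes the native factory to the ORC reader
         DatasetFactory factory = new FileSystemDatasetFactory(
             allocator, NativeMemoryPool.getDefault(), FileFormat.ORC, uri);
         Dataset dataset = factory.finish();
         Scanner scanner = dataset.newScan(new ScanOptions(/*batchSize*/ 32768));
         ArrowReader reader = scanner.scanBatches()) {
      // The reader owns the VectorSchemaRoot; each loadNextBatch() refills it.
      while (reader.loadNextBatch()) {
        VectorSchemaRoot root = reader.getVectorSchemaRoot();
        System.out.println(root.contentToTSVString());
      }
    }
  }
}
```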
LGTM, I only have this comment:
- There is a Jira ticket about using NativeMemoryPool.createListenable for large data. Do you know whether there are similar limitations/restrictions for big ORC files as well?
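For reference, a minimal sketch of the listenable pool being referenced, assuming the org.apache.arrow.dataset.jni API:

```java
import org.apache.arrow.dataset.jni.DirectReservationListener;
import org.apache.arrow.dataset.jni.NativeMemoryPool;

// Create a pool that reports every native allocation to a ReservationListener.
// DirectReservationListener charges allocations against the JVM direct-memory
// limit (-XX:MaxDirectMemorySize), so oversized native reads fail fast instead
// of silently exhausting process memory.
NativeMemoryPool pool =
    NativeMemoryPool.createListenable(DirectReservationListener.instance());

// This pool can then be passed to FileSystemDatasetFactory in place of
// NativeMemoryPool.getDefault().
```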
I don't know anything about NativeMemoryPool.createListenable. It looks like a general, separate Java Dataset API issue, not something introduced by this PR.
That ticket is entirely unrelated, I agree. (I added some comments on Jira.)
Thanks for the clarification
@@ -109,6 +109,38 @@
      <artifactId>jackson-databind</artifactId>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.arrow.orc</groupId>
      <artifactId>arrow-orc</artifactId>
Is this dependency actually used? I see the ORC library but not the Arrow ORC adapter
@lidavidm sure! If I comment out this dependency, the tests fail with: java.lang.NoClassDefFoundError: org/apache/hadoop/hdfs/client/HdfsDataOutputStream$SyncFlag
It seems like that's just a transitive dependency that needs to get pulled in… but since this is a test-only dependency, that's fine.
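For completeness, the test-scoped declaration in the module's pom.xml would look roughly like this (a sketch, not the exact diff; the version property is an assumption):

```xml
<!-- Test-only dependency: pulls in the Hadoop/ORC classes needed to write
     ORC fixtures; version shown as a property is assumed, not from the diff. -->
<dependency>
  <groupId>org.apache.arrow.orc</groupId>
  <artifactId>arrow-orc</artifactId>
  <version>${project.version}</version>
  <scope>test</scope>
</dependency>
```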
Integration test failure should be unrelated, see #14069
Thank you @lidavidm and @davisusanibar
Benchmark runs are scheduled for baseline = c586b9f and contender = 21491ec. 21491ec is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
…#13973)

Support ORC file format in Java Dataset API

Authored-by: igor.suhorukov <igor.suhorukov@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
This PR bumps Apache Arrow version from 9.0.0 to 10.0.0. Main changes related to PyAmber:

## Java/Scala side:
- JDBC Driver for Arrow Flight SQL ([13800](apache/arrow#13800))
- Initial implementation of immutable Table API ([14316](apache/arrow#14316))
- Substrait, transaction, cancellation for Flight SQL ([13492](apache/arrow#13492))
- Read Arrow IPC, CSV, and ORC files by NativeDatasetFactory ([13811](apache/arrow#13811), [13973](apache/arrow#13973), [14182](apache/arrow#14182))
- Add utility to bind Arrow data to JDBC parameters ([13589](apache/arrow#13589))

## Python side:
- The batch_readahead and fragment_readahead arguments for scanning Datasets are exposed in Python ([ARROW-17299](https://issues.apache.org/jira/browse/ARROW-17299)).
- ExtensionArrays can now be created from a storage array through the pa.array(..) constructor ([ARROW-17834](https://issues.apache.org/jira/browse/ARROW-17834)).
- Converting ListArrays containing ExtensionArray values to numpy or pandas works by falling back to the storage array ([ARROW-17813](https://issues.apache.org/jira/browse/ARROW-17813)).
- Casting Tables to a new schema now honors the nullability flag in the target schema ([ARROW-16651](https://issues.apache.org/jira/browse/ARROW-16651)).
Support ORC file format in Java Dataset API