ARROW-17525: [Java] Read ORC files using NativeDatasetFactory #13973
Conversation
@davisusanibar or @lwhite1, would either of you like to take a look here?
Hi @davisusanibar or @lwhite1
Hi @igor-suhorukov, let me review this today. Sorry for the delay.
String dataName = "test-orc"; | ||
String basePath = TMP.getRoot().getAbsolutePath(); | ||
|
||
TypeDescription orcSchema = TypeDescription.fromString("struct<ints:int>"); | ||
Writer writer = OrcFile.createWriter(new Path(basePath, dataName), | ||
OrcFile.writerOptions(new Configuration()).setSchema(orcSchema)); | ||
VectorizedRowBatch batch = orcSchema.createRowBatch(); | ||
LongColumnVector longColumnVector = (LongColumnVector) batch.cols[0]; | ||
longColumnVector.vector[0] = Integer.MIN_VALUE; | ||
longColumnVector.vector[1] = Integer.MAX_VALUE; | ||
batch.size = 2; | ||
writer.addRowBatch(batch); | ||
writer.close(); |
Would it be possible to extract this into a common method? Something like OrcWriteSupport.writeTempFile
Thanks @davisusanibar, sure! Extracted it into OrcWriteSupport.writeTempFile
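For reference, a minimal sketch of what such a helper could look like, built from the test code above (the parameter list here is illustrative; the actual method signature in the PR may differ):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcWriteSupport {
  /**
   * Writes the given int values as a single-column ORC file at the given path.
   * Assumes a schema like struct<ints:int> and at most one row batch of values.
   */
  public static void writeTempFile(TypeDescription orcSchema, Path path, Integer[] values)
      throws IOException {
    Writer writer = OrcFile.createWriter(path,
        OrcFile.writerOptions(new Configuration()).setSchema(orcSchema));
    VectorizedRowBatch batch = orcSchema.createRowBatch();
    LongColumnVector longColumnVector = (LongColumnVector) batch.cols[0];
    for (int i = 0; i < values.length; i++) {
      longColumnVector.vector[i] = values[i];
    }
    batch.size = values.length;
    writer.addRowBatch(batch);
    writer.close();
  }
}
```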
Thank you
writer.close();

String orcDatasetUri = new File(basePath, dataName).toURI().toString();
FileSystemDatasetFactory factory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
    FileFormat.ORC, orcDatasetUri);
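For readers following along, here is a self-contained sketch of the read path this test exercises, based on the public Java Dataset API (FileFormat.ORC is the enum value this PR adds; scanBatches() reflects the newer scanning API, so minor details may differ from the test itself, and the file path is hypothetical):

```java
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowReader;

public class OrcDatasetExample {
  public static void main(String[] args) throws Exception {
    String uri = "file:///tmp/test-orc"; // hypothetical path to an ORC file
    try (BufferAllocator allocator = new RootAllocator();
         // FileFormat.ORC routes the native factory to the ORC reader
         DatasetFactory factory = new FileSystemDatasetFactory(
             allocator, NativeMemoryPool.getDefault(), FileFormat.ORC, uri);
         Dataset dataset = factory.finish();
         Scanner scanner = dataset.newScan(new ScanOptions(/*batchSize*/ 32768));
         ArrowReader reader = scanner.scanBatches()) {
      // The reader owns the VectorSchemaRoot; each loadNextBatch() refills it.
      while (reader.loadNextBatch()) {
        VectorSchemaRoot root = reader.getVectorSchemaRoot();
        System.out.println(root.contentToTSVString());
      }
    }
  }
}
```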
LGTM, I only have this comment:
- There is a Jira ticket about using NativeMemoryPool.createListenable for large data. Do you know whether there are similar limitations/restrictions for big ORC files as well?
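For reference, a minimal sketch of the listenable pool being referenced, assuming the org.apache.arrow.dataset.jni API:

```java
import org.apache.arrow.dataset.jni.DirectReservationListener;
import org.apache.arrow.dataset.jni.NativeMemoryPool;

// Create a pool that reports every native allocation to a ReservationListener.
// DirectReservationListener charges allocations against the JVM direct-memory
// limit (-XX:MaxDirectMemorySize), so oversized native reads fail fast instead
// of silently exhausting process memory.
NativeMemoryPool pool =
    NativeMemoryPool.createListenable(DirectReservationListener.instance());

// This pool can then be passed to FileSystemDatasetFactory in place of
// NativeMemoryPool.getDefault().
```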
I don't know anything about NativeMemoryPool.createListenable. It looks like a general, separate Java Dataset API issue, not something introduced by this PR.
That ticket is entirely unrelated, I agree. (I added some comments on Jira.)
Thanks for the clarification
@@ -109,6 +109,38 @@
      <artifactId>jackson-databind</artifactId>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.arrow.orc</groupId>
      <artifactId>arrow-orc</artifactId>
Is this dependency actually used? I see the ORC library but not the Arrow ORC adapter
@lidavidm sure! If I comment out this dependency, the tests fail with: java.lang.NoClassDefFoundError: org/apache/hadoop/hdfs/client/HdfsDataOutputStream$SyncFlag
It seems like that's just a transitive dependency that needs to get pulled in… but since this is a test-only dependency, that's fine.
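For completeness, the test-scoped declaration in the module's pom.xml would look roughly like this (a sketch, not the exact diff; the version property is an assumption):

```xml
<!-- Test-only dependency: pulls in the Hadoop/ORC classes needed to write
     ORC fixtures; version shown as a property is assumed, not from the diff. -->
<dependency>
  <groupId>org.apache.arrow.orc</groupId>
  <artifactId>arrow-orc</artifactId>
  <version>${project.version}</version>
  <scope>test</scope>
</dependency>
```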
Integration test failure should be unrelated, see #14069
Thank you @lidavidm and @davisusanibar
Benchmark runs are scheduled for baseline = c586b9f and contender = 21491ec. 21491ec is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
…#13973)

Support ORC file format in Java Dataset API

Authored-by: igor.suhorukov <igor.suhorukov@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
This PR bumps Apache Arrow version from 9.0.0 to 10.0.0. Main changes related to PyAmber:

## Java/Scala side:
- JDBC Driver for Arrow Flight SQL ([13800](apache/arrow#13800))
- Initial implementation of immutable Table API ([14316](apache/arrow#14316))
- Substrait, transaction, cancellation for Flight SQL ([13492](apache/arrow#13492))
- Read Arrow IPC, CSV, and ORC files by NativeDatasetFactory ([13811](apache/arrow#13811), [13973](apache/arrow#13973), [14182](apache/arrow#14182))
- Add utility to bind Arrow data to JDBC parameters ([13589](apache/arrow#13589))

## Python side:
- The batch_readahead and fragment_readahead arguments for scanning Datasets are exposed in Python ([ARROW-17299](https://issues.apache.org/jira/browse/ARROW-17299)).
- ExtensionArrays can now be created from a storage array through the pa.array(..) constructor ([ARROW-17834](https://issues.apache.org/jira/browse/ARROW-17834)).
- Converting ListArrays containing ExtensionArray values to numpy or pandas works by falling back to the storage array ([ARROW-17813](https://issues.apache.org/jira/browse/ARROW-17813)).
- Casting Tables to a new schema now honors the nullability flag in the target schema ([ARROW-16651](https://issues.apache.org/jira/browse/ARROW-16651)).
Support ORC file format in Java Dataset API