
ARROW-17525: [Java] Read ORC files using NativeDatasetFactory #13973

Merged · 2 commits merged into apache:master on Sep 7, 2022

Conversation

igor-suhorukov
Contributor

Support ORC file format in the Java Dataset API
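
For context, reading an ORC file through the Java Dataset API after this change looks roughly like the sketch below. This is a minimal, illustrative example, assuming the `FileFormat.ORC` value added by this PR together with the existing `FileSystemDatasetFactory` and `NativeMemoryPool` classes; the class name and file URI in the driver are placeholders, not part of the PR.

```java
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.types.pojo.Schema;

public class OrcDatasetExample {
  public static void main(String[] args) throws Exception {
    // Placeholder URI; point this at a real ORC file.
    String orcUri = "file:///tmp/example.orc";

    try (BufferAllocator allocator = new RootAllocator()) {
      // FileFormat.ORC is the format value this PR wires into the native factory.
      DatasetFactory factory = new FileSystemDatasetFactory(
          allocator, NativeMemoryPool.getDefault(), FileFormat.ORC, orcUri);

      // Schema discovery works the same way as for the other supported formats.
      Schema schema = factory.inspect();
      System.out.println("ORC schema: " + schema);

      // Scanning then proceeds through dataset.newScan(new ScanOptions(...)),
      // exactly as for Parquet or Arrow IPC datasets.
      try (Dataset dataset = factory.finish()) {
        // ... create a scanner and read record batches here ...
      }
    }
  }
}
```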

@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@lidavidm
Member

@davisusanibar or @lwhite1, would either of you like to take a look here?

@igor-suhorukov
Contributor Author

Hi @davisusanibar or @lwhite1,
Any comments? I am ready to work on the next PRs related to #14039 once the current PR is merged into the master branch.

@davisusanibar
Contributor

> Hi @davisusanibar or @lwhite1, any comments? I am ready to work on the next PRs related to #14039 once the current PR is merged into the master branch.

Hi @igor-suhorukov, let me review this today; sorry for the long wait.

Comment on lines 369 to 381
String dataName = "test-orc";
String basePath = TMP.getRoot().getAbsolutePath();

TypeDescription orcSchema = TypeDescription.fromString("struct<ints:int>");
Writer writer = OrcFile.createWriter(new Path(basePath, dataName),
OrcFile.writerOptions(new Configuration()).setSchema(orcSchema));
VectorizedRowBatch batch = orcSchema.createRowBatch();
LongColumnVector longColumnVector = (LongColumnVector) batch.cols[0];
longColumnVector.vector[0] = Integer.MIN_VALUE;
longColumnVector.vector[1] = Integer.MAX_VALUE;
batch.size = 2;
writer.addRowBatch(batch);
writer.close();
Contributor

Would it be possible to externalize this as a common method? Something like OrcWriteSupport.writeTempFile.

Contributor Author

Thanks @davisusanibar, sure! I extracted it into OrcWriteSupport.writeTempFile.

Contributor

Thank you
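
For reference, the extracted helper discussed above could look something like the sketch below. It is based on the quoted test code; the exact package, class location, and signature of OrcWriteSupport.writeTempFile in the merged change may differ, and the vectorized column classes may live under org.apache.orc.storage.* if the nohive ORC artifact is used.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public final class OrcWriteSupport {
  private OrcWriteSupport() {
  }

  /** Writes the given values as a single-column ORC file at {@code path}. */
  public static void writeTempFile(TypeDescription orcSchema, Path path, Integer[] values)
      throws IOException {
    Writer writer = OrcFile.createWriter(path,
        OrcFile.writerOptions(new Configuration()).setSchema(orcSchema));
    VectorizedRowBatch batch = orcSchema.createRowBatch();
    LongColumnVector longColumnVector = (LongColumnVector) batch.cols[0];
    for (int i = 0; i < values.length; i++) {
      longColumnVector.vector[i] = values[i];
    }
    batch.size = values.length;
    writer.addRowBatch(batch);
    writer.close();
  }
}
```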

writer.close();

String orcDatasetUri = new File(basePath, dataName).toURI().toString();
FileSystemDatasetFactory factory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
Contributor

LGTM, I only have this comment:

  1. There is a Jira ticket about using NativeMemoryPool.createListenable for large data. Do you know whether there are similar limitations/restrictions for big ORC files as well?

Contributor Author

I am not aware of anything related to NativeMemoryPool.createListenable. It looks like a general, separate Java Dataset API issue, not one introduced by this PR.

Member

That ticket is entirely unrelated, I agree. (I added some comments on Jira.)

Contributor

Thanks for the clarification
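
As a side note for readers of this thread, the listenable pool mentioned above is built from a ReservationListener. A minimal sketch, assuming the DirectReservationListener shipped with the dataset module, might look like this:

```java
import org.apache.arrow.dataset.jni.DirectReservationListener;
import org.apache.arrow.dataset.jni.NativeMemoryPool;

public class ListenablePoolExample {
  public static void main(String[] args) {
    // DirectReservationListener accounts native scan allocations against the
    // JVM's direct-memory limit, so oversized reads fail fast instead of
    // growing untracked native memory.
    NativeMemoryPool pool =
        NativeMemoryPool.createListenable(DirectReservationListener.instance());

    // The pool can be passed to the factory in place of
    // NativeMemoryPool.getDefault(), e.g.
    // new FileSystemDatasetFactory(allocator, pool, FileFormat.ORC, uri);
    System.out.println("Created listenable native memory pool: " + pool);
  }
}
```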

@@ -109,6 +109,38 @@
<artifactId>jackson-databind</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.arrow.orc</groupId>
<artifactId>arrow-orc</artifactId>
Member

Is this dependency actually used? I see the ORC library but not the Arrow ORC adapter

Contributor Author

@lidavidm sure! If I comment out this dependency, the test fails with java.lang.NoClassDefFoundError: org/apache/hadoop/hdfs/client/HdfsDataOutputStream$SyncFlag

Member

It seems like that's just a transitive dependency that needs to get pulled in…but since this is a test-only dependency it's ok then.

@lidavidm
Member

lidavidm commented Sep 7, 2022

Integration test failure should be unrelated, see #14069

lidavidm merged commit 21491ec into apache:master on Sep 7, 2022
@igor-suhorukov
Contributor Author

Thank you @lidavidm and @davisusanibar
Yes, that is what the CI log details suggest.

@ursabot

ursabot commented Sep 7, 2022

Benchmark runs are scheduled for baseline = c586b9f and contender = 21491ec. 21491ec is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.17% ⬆️0.03%] test-mac-arm
[Failed ⬇️1.1% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.21% ⬆️0.25%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 21491ec0 ec2-t3-xlarge-us-east-2
[Finished] 21491ec0 test-mac-arm
[Failed] 21491ec0 ursa-i9-9960x
[Finished] 21491ec0 ursa-thinkcentre-m75q
[Finished] c586b9fe ec2-t3-xlarge-us-east-2
[Finished] c586b9fe test-mac-arm
[Failed] c586b9fe ursa-i9-9960x
[Finished] c586b9fe ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

drin pushed a commit to drin/arrow that referenced this pull request Sep 7, 2022
ARROW-17525: [Java] Read ORC files using NativeDatasetFactory (#13973)

Support ORC file format in java Dataset API

Authored-by: igor.suhorukov <igor.suhorukov@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
zagto pushed a commit to zagto/arrow that referenced this pull request Oct 7, 2022
ARROW-17525: [Java] Read ORC files using NativeDatasetFactory (#13973)

Support ORC file format in java Dataset API

Authored-by: igor.suhorukov <igor.suhorukov@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Oct 17, 2022
ARROW-17525: [Java] Read ORC files using NativeDatasetFactory (#13973)

Support ORC file format in java Dataset API

Authored-by: igor.suhorukov <igor.suhorukov@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Yicong-Huang added a commit to Texera/texera that referenced this pull request Dec 13, 2022
This PR bumps Apache Arrow version from 9.0.0 to 10.0.0.

Main changes related to PyAmber:

## Java/Scala side:

- JDBC Driver for Arrow Flight SQL
([13800](apache/arrow#13800))
- Initial implementation of immutable Table API
([14316](apache/arrow#14316))
- Substrait, transaction, cancellation for Flight SQL
([13492](apache/arrow#13492))
- Read Arrow IPC, CSV, and ORC files by NativeDatasetFactory
([13811](apache/arrow#13811),
[13973](apache/arrow#13973),
[14182](apache/arrow#14182))
- Add utility to bind Arrow data to JDBC parameters
([13589](apache/arrow#13589))

## Python side:

- The batch_readahead and fragment_readahead arguments for scanning
Datasets are exposed in Python
([ARROW-17299](https://issues.apache.org/jira/browse/ARROW-17299)).
- ExtensionArrays can now be created from a storage array through the
pa.array(..) constructor
([ARROW-17834](https://issues.apache.org/jira/browse/ARROW-17834)).
- Converting ListArrays containing ExtensionArray values to numpy or
pandas works by falling back to the storage array
([ARROW-17813](https://issues.apache.org/jira/browse/ARROW-17813)).
- Casting Tables to a new schema now honors the nullability flag in the
target schema
([ARROW-16651](https://issues.apache.org/jira/browse/ARROW-16651)).