Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-17786: [Java] Read CSV files using org.apache.arrow.dataset.jni.NativeDatasetFactory #14182

Merged
merged 7 commits into from
Oct 11, 2022

Conversation

davisusanibar
Copy link
Contributor

Support CSV file format in java Dataset API

@github-actions
Copy link


public CsvWriteSupport(File outputFolder) {
path = outputFolder.getPath() + File.separator + "generated-" + random.nextLong() + ".csv";
uri = "file://" + path;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, changed

@pitrou
Copy link
Member

pitrou commented Sep 22, 2022

@lwhite1 Would you like to review this?

@pitrou
Copy link
Member

pitrou commented Sep 22, 2022

@davisusanibar Why is only write support tested? Shouldn't you test support for dataset reads as well?

@davisusanibar
Copy link
Contributor Author

@davisusanibar Why is only write support tested? Shouldn't you test support for dataset reads as well?

Hi, write support is doing by Java native libraries, then Dataset module is reading that CSV file with (FileFormat.CSV): new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(), FileFormat.CSV, writeSupport.getOutputURI());

Currently we are only offering Read support for Parquet. ORC and now CSV, ... write support it not implemented

@pitrou
Copy link
Member

pitrou commented Sep 22, 2022

@davisusanibar Why is only write support tested? Shouldn't you test support for dataset reads as well?

Hi, write support is doing by Java native libraries, then Dataset module is reading that CSV file with (FileFormat.CSV): new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(), FileFormat.CSV, writeSupport.getOutputURI());

Currently we are only offering Read support for Parquet. ORC and now CSV, ... write support it not implemented

Oops, sorry, I had misread the tests.

@davisusanibar
Copy link
Contributor Author

Related to update Java Dataset module documentation, this Jira will cover that: https://issues.apache.org/jira/browse/ARROW-17789

@davisusanibar
Copy link
Contributor Author

This line

if (ARROW_PREDICT_TRUE(SetFieldBuilder(string_view(key, len), &duplicate_keys))) {
was not changed long time ago, I don't know what is the error:

[1/1] cd /tmp/arrow-lint-5g90mrbe/cpp-build && /usr/local/bin/python /arrow/cpp/build-support/run_cpplint.py --cpplint_binary /arrow/cpp/build-support/cpplint.py --exclude_globs /arrow/cpp/build-support/lint_exclusions.txt --source_dir /arrow/cpp/src --source_dir /arrow/cpp/examples --source_dir /arrow/cpp/tools --quiet
FAILED: CMakeFiles/lint 
cd /tmp/arrow-lint-5g90mrbe/cpp-build && /usr/local/bin/python /arrow/cpp/build-support/run_cpplint.py --cpplint_binary /arrow/cpp/build-support/cpplint.py --exclude_globs /arrow/cpp/build-support/lint_exclusions.txt --source_dir /arrow/cpp/src --source_dir /arrow/cpp/examples --source_dir /arrow/cpp/tools --quiet
/arrow/cpp/src/arrow/json/parser.cc:1030:  Lines should be <= 90 characters long  [whitespace/line_length] [2]
Done processing /arrow/cpp/src/arrow/json/parser.cc
Total errors found: 1

@davisusanibar
Copy link
Contributor Author

This line

if (ARROW_PREDICT_TRUE(SetFieldBuilder(string_view(key, len), &duplicate_keys))) {

was not changed long time ago, I don't know what is the error:

[1/1] cd /tmp/arrow-lint-5g90mrbe/cpp-build && /usr/local/bin/python /arrow/cpp/build-support/run_cpplint.py --cpplint_binary /arrow/cpp/build-support/cpplint.py --exclude_globs /arrow/cpp/build-support/lint_exclusions.txt --source_dir /arrow/cpp/src --source_dir /arrow/cpp/examples --source_dir /arrow/cpp/tools --quiet
FAILED: CMakeFiles/lint 
cd /tmp/arrow-lint-5g90mrbe/cpp-build && /usr/local/bin/python /arrow/cpp/build-support/run_cpplint.py --cpplint_binary /arrow/cpp/build-support/cpplint.py --exclude_globs /arrow/cpp/build-support/lint_exclusions.txt --source_dir /arrow/cpp/src --source_dir /arrow/cpp/examples --source_dir /arrow/cpp/tools --quiet
/arrow/cpp/src/arrow/json/parser.cc:1030:  Lines should be <= 90 characters long  [whitespace/line_length] [2]
Done processing /arrow/cpp/src/arrow/json/parser.cc
Total errors found: 1

Just rebase and working fine now

@davisusanibar
Copy link
Contributor Author

Please, when you have some time if you could review this, thank you

@davisusanibar
Copy link
Contributor Author

Please if you could help me with a review @lwhite1 @lidavidm

java/dataset/src/main/cpp/jni_wrapper.cc Show resolved Hide resolved

checkParquetReadResult(schema, expectedJsonUnordered, datum);

AutoCloseables.close(datum);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use try-with-resources?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added try-with for FileSystemDatasetFactory but don't know how to add that forList<ArrowRecordBatch> that is an input for another function.

@lwhite1
Copy link
Contributor

lwhite1 commented Oct 6, 2022

Sorry for not getting to this sooner. Is this for release 10?

@davisusanibar
Copy link
Contributor Author

Sorry for not getting to this sooner. Is this for release 10?

Not problem for that. Yes, trying to merge for 10 to then could create cookbooks for this CSV Reader case

@lidavidm
Copy link
Member

lidavidm commented Oct 7, 2022

@github-actions crossbow submit java

@github-actions
Copy link

github-actions bot commented Oct 7, 2022

Revision: 98c6d1d

Submitted crossbow builds: ursacomputing/crossbow @ actions-e54f92cb9c

Task Status
java-jars Github Actions
verify-rc-source-java-linux-almalinux-8-amd64 Github Actions
verify-rc-source-java-linux-conda-latest-amd64 Github Actions
verify-rc-source-java-linux-ubuntu-18.04-amd64 Github Actions
verify-rc-source-java-linux-ubuntu-20.04-amd64 Github Actions
verify-rc-source-java-linux-ubuntu-22.04-amd64 Github Actions
verify-rc-source-java-macos-amd64 Github Actions

@lidavidm
Copy link
Member

lidavidm commented Oct 7, 2022

CI shows a flake in testTable:

Error:  org.apache.arrow.c.RoundtripTest.testTable  Time elapsed: 0.031 s  <<< ERROR!
java.lang.IllegalStateException: Cannot import released ArrowSchema
	at org.apache.arrow.util.Preconditions.checkState(Preconditions.java:458)
	at org.apache.arrow.c.SchemaImporter.importField(SchemaImporter.java:63)
	at org.apache.arrow.c.SchemaImporter.importField(SchemaImporter.java:56)
	at org.apache.arrow.c.Data.importField(Data.java:246)
	at org.apache.arrow.c.Data.importSchema(Data.java:266)
	at org.apache.arrow.c.Data.importVectorSchemaRoot(Data.java:377)
	at org.apache.arrow.c.RoundtripTest.testTable(RoundtripTest.java:683)

@lidavidm
Copy link
Member

lidavidm commented Oct 7, 2022

CI shows a flake in testTable:

Error:  org.apache.arrow.c.RoundtripTest.testTable  Time elapsed: 0.031 s  <<< ERROR!
java.lang.IllegalStateException: Cannot import released ArrowSchema
	at org.apache.arrow.util.Preconditions.checkState(Preconditions.java:458)
	at org.apache.arrow.c.SchemaImporter.importField(SchemaImporter.java:63)
	at org.apache.arrow.c.SchemaImporter.importField(SchemaImporter.java:56)
	at org.apache.arrow.c.Data.importField(Data.java:246)
	at org.apache.arrow.c.Data.importSchema(Data.java:266)
	at org.apache.arrow.c.Data.importVectorSchemaRoot(Data.java:377)
	at org.apache.arrow.c.RoundtripTest.testTable(RoundtripTest.java:683)

@lwhite1 have you seen this? Might be worth some investigation

log link: https://github.com/apache/arrow/actions/runs/3200943136/jobs/5229379128 (I'm going to restart the job now)

@lidavidm
Copy link
Member

lidavidm commented Oct 7, 2022

@davisusanibar it may be worth investigating if we can not build/run tests in java-jars? It seems rather redundant (or maybe that's on purpose)

@davisusanibar
Copy link
Contributor Author

@davisusanibar it may be worth investigating if we can not build/run tests in java-jars? It seems rather redundant (or maybe that's on purpose)

Yes, sound redundant at first review, but, let me check what part could be improved

@lwhite1
Copy link
Contributor

lwhite1 commented Oct 7, 2022

CI shows a flake in testTable:

Error:  org.apache.arrow.c.RoundtripTest.testTable  Time elapsed: 0.031 s  <<< ERROR!
java.lang.IllegalStateException: Cannot import released ArrowSchema
	at org.apache.arrow.util.Preconditions.checkState(Preconditions.java:458)
	at org.apache.arrow.c.SchemaImporter.importField(SchemaImporter.java:63)
	at org.apache.arrow.c.SchemaImporter.importField(SchemaImporter.java:56)
	at org.apache.arrow.c.Data.importField(Data.java:246)
	at org.apache.arrow.c.Data.importSchema(Data.java:266)
	at org.apache.arrow.c.Data.importVectorSchemaRoot(Data.java:377)
	at org.apache.arrow.c.RoundtripTest.testTable(RoundtripTest.java:683)

@lwhite1 have you seen this? Might be worth some investigation

I haven't noticed it before but there are so many failures with any given commit I probably don't pay as much attention as I should to something that doesn't look related to the changes. I will take a look now.

@lwhite1
Copy link
Contributor

lwhite1 commented Oct 7, 2022

CI shows a flake in testTable:

Error:  org.apache.arrow.c.RoundtripTest.testTable  Time elapsed: 0.031 s  <<< ERROR!
java.lang.IllegalStateException: Cannot import released ArrowSchema
	at org.apache.arrow.util.Preconditions.checkState(Preconditions.java:458)
	at org.apache.arrow.c.SchemaImporter.importField(SchemaImporter.java:63)
	at org.apache.arrow.c.SchemaImporter.importField(SchemaImporter.java:56)
	at org.apache.arrow.c.Data.importField(Data.java:246)
	at org.apache.arrow.c.Data.importSchema(Data.java:266)
	at org.apache.arrow.c.Data.importVectorSchemaRoot(Data.java:377)
	at org.apache.arrow.c.RoundtripTest.testTable(RoundtripTest.java:683)

@lwhite1 have you seen this? Might be worth some investigation

I haven't noticed it before but there are so many failures with any given commit I probably don't pay as much attention as I should to something that doesn't look related to the changes. I will take a look now.

It looks like a bug in the test. I will push a fix. IDK why it passes locally.

@lidavidm
Copy link
Member

lidavidm commented Oct 7, 2022

It looks like the test crashes on another attempt.

@lidavidm
Copy link
Member

@davisusanibar can you rebase here to see if Larry's fix makes CI pass?

@lidavidm
Copy link
Member

Ok, looks all good. Thanks!

@lidavidm lidavidm merged commit a39f219 into apache:master Oct 11, 2022
@ursabot
Copy link

ursabot commented Oct 12, 2022

Benchmark runs are scheduled for baseline = fa3cf78 and contender = a39f219. a39f219 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.67% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.27% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.18% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] a39f2197 ec2-t3-xlarge-us-east-2
[Failed] a39f2197 test-mac-arm
[Failed] a39f2197 ursa-i9-9960x
[Finished] a39f2197 ursa-thinkcentre-m75q
[Finished] fa3cf78e ec2-t3-xlarge-us-east-2
[Failed] fa3cf78e test-mac-arm
[Failed] fa3cf78e ursa-i9-9960x
[Finished] fa3cf78e ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Oct 12, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

Yicong-Huang added a commit to Texera/texera that referenced this pull request Dec 13, 2022
This PR bumps Apache Arrow version from 9.0.0 to 10.0.0.

Main changes related to PyAmber:

## Java/Scala side:

- JDBC Driver for Arrow Flight SQL
([13800](apache/arrow#13800))
- Initial implementation of immutable Table API
([14316](apache/arrow#14316))
- Substrait, transaction, cancellation for Flight SQL
([13492](apache/arrow#13492))
- Read Arrow IPC, CSV, and ORC files by NativeDatasetFactory
([13811](apache/arrow#13811),
[13973](apache/arrow#13973),
[14182](apache/arrow#14182))
- Add utility to bind Arrow data to JDBC parameters
([13589](apache/arrow#13589))

## Python side:

- The batch_readahead and fragment_readahead arguments for scanning
Datasets are exposed in Python
([ARROW-17299](https://issues.apache.org/jira/browse/ARROW-17299)).
- ExtensionArrays can now be created from a storage array through the
pa.array(..) constructor
([ARROW-17834](https://issues.apache.org/jira/browse/ARROW-17834)).
- Converting ListArrays containing ExtensionArray values to numpy or
pandas works by falling back to the storage array
([ARROW-17813](https://issues.apache.org/jira/browse/ARROW-17813)).
- Casting Tables to a new schema now honors the nullability flag in the
target schema
([ARROW-16651](https://issues.apache.org/jira/browse/ARROW-16651)).
lidavidm pushed a commit that referenced this pull request Mar 2, 2023
…upport (as per ARROW-17786) (#34390)

Update status documentation: CSV read in Java from #14182

Authored-by: Igor Suhorukov <igor.suhorukov@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants