ARROW-17786: [Java] Read CSV files using org.apache.arrow.dataset.jni.NativeDatasetFactory #14182

davisusanibar · 2022-09-20T19:44:07Z

Support CSV file format in java Dataset API

github-actions · 2022-09-20T19:44:26Z

https://issues.apache.org/jira/browse/ARROW-17786

pitrou · 2022-09-22T14:17:05Z

java/dataset/src/test/java/org/apache/arrow/dataset/CsvWriteSupport.java

+
+  public CsvWriteSupport(File outputFolder) {
+    path = outputFolder.getPath() + File.separator + "generated-" + random.nextLong() + ".csv";
+    uri = "file://" + path;


Shouldn't you use a URI constructor instead?
https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String)

Thanks, changed

pitrou · 2022-09-22T14:18:42Z

@lwhite1 Would you like to review this?

pitrou · 2022-09-22T14:19:08Z

@davisusanibar Why is only write support tested? Shouldn't you test support for dataset reads as well?

davisusanibar · 2022-09-22T14:33:06Z

@davisusanibar Why is only write support tested? Shouldn't you test support for dataset reads as well?

Hi, write support is doing by Java native libraries, then Dataset module is reading that CSV file with (FileFormat.CSV): new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(), FileFormat.CSV, writeSupport.getOutputURI());

Currently we are only offering Read support for Parquet. ORC and now CSV, ... write support it not implemented

pitrou · 2022-09-22T14:33:44Z

@davisusanibar Why is only write support tested? Shouldn't you test support for dataset reads as well?

Hi, write support is doing by Java native libraries, then Dataset module is reading that CSV file with (FileFormat.CSV): new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(), FileFormat.CSV, writeSupport.getOutputURI());

Currently we are only offering Read support for Parquet. ORC and now CSV, ... write support it not implemented

Oops, sorry, I had misread the tests.

davisusanibar · 2022-09-22T18:29:21Z

Related to update Java Dataset module documentation, this Jira will cover that: https://issues.apache.org/jira/browse/ARROW-17789

davisusanibar · 2022-09-22T22:34:35Z

This line

arrow/cpp/src/arrow/json/parser.cc

Line 1030 in 91e3ac5

    
           if (ARROW_PREDICT_TRUE(SetFieldBuilder(string_view(key, len), &duplicate_keys))) {

was not changed long time ago, I don't know what is the error:

[1/1] cd /tmp/arrow-lint-5g90mrbe/cpp-build && /usr/local/bin/python /arrow/cpp/build-support/run_cpplint.py --cpplint_binary /arrow/cpp/build-support/cpplint.py --exclude_globs /arrow/cpp/build-support/lint_exclusions.txt --source_dir /arrow/cpp/src --source_dir /arrow/cpp/examples --source_dir /arrow/cpp/tools --quiet
FAILED: CMakeFiles/lint 
cd /tmp/arrow-lint-5g90mrbe/cpp-build && /usr/local/bin/python /arrow/cpp/build-support/run_cpplint.py --cpplint_binary /arrow/cpp/build-support/cpplint.py --exclude_globs /arrow/cpp/build-support/lint_exclusions.txt --source_dir /arrow/cpp/src --source_dir /arrow/cpp/examples --source_dir /arrow/cpp/tools --quiet
/arrow/cpp/src/arrow/json/parser.cc:1030:  Lines should be <= 90 characters long  [whitespace/line_length] [2]
Done processing /arrow/cpp/src/arrow/json/parser.cc
Total errors found: 1

davisusanibar · 2022-09-23T17:32:54Z

This line

arrow/cpp/src/arrow/json/parser.cc

Line 1030 in 91e3ac5

    
           if (ARROW_PREDICT_TRUE(SetFieldBuilder(string_view(key, len), &duplicate_keys))) {

was not changed long time ago, I don't know what is the error:

[1/1] cd /tmp/arrow-lint-5g90mrbe/cpp-build && /usr/local/bin/python /arrow/cpp/build-support/run_cpplint.py --cpplint_binary /arrow/cpp/build-support/cpplint.py --exclude_globs /arrow/cpp/build-support/lint_exclusions.txt --source_dir /arrow/cpp/src --source_dir /arrow/cpp/examples --source_dir /arrow/cpp/tools --quiet
FAILED: CMakeFiles/lint 
cd /tmp/arrow-lint-5g90mrbe/cpp-build && /usr/local/bin/python /arrow/cpp/build-support/run_cpplint.py --cpplint_binary /arrow/cpp/build-support/cpplint.py --exclude_globs /arrow/cpp/build-support/lint_exclusions.txt --source_dir /arrow/cpp/src --source_dir /arrow/cpp/examples --source_dir /arrow/cpp/tools --quiet
/arrow/cpp/src/arrow/json/parser.cc:1030:  Lines should be <= 90 characters long  [whitespace/line_length] [2]
Done processing /arrow/cpp/src/arrow/json/parser.cc
Total errors found: 1

Just rebase and working fine now

davisusanibar · 2022-10-04T17:29:25Z

Please, when you have some time if you could review this, thank you

davisusanibar · 2022-10-06T15:05:17Z

Please if you could help me with a review @lwhite1 @lidavidm

java/dataset/src/main/cpp/jni_wrapper.cc

lidavidm · 2022-10-06T15:28:33Z

java/dataset/src/test/java/org/apache/arrow/dataset/file/TestFileSystemDataset.java

+
+    checkParquetReadResult(schema, expectedJsonUnordered, datum);
+
+    AutoCloseables.close(datum);


Use try-with-resources?

Added try-with for FileSystemDatasetFactory but don't know how to add that forList<ArrowRecordBatch> that is an input for another function.

lwhite1 · 2022-10-06T18:48:34Z

Sorry for not getting to this sooner. Is this for release 10?

…uild by maven

davisusanibar · 2022-10-06T22:44:04Z

Sorry for not getting to this sooner. Is this for release 10?

Not problem for that. Yes, trying to merge for 10 to then could create cookbooks for this CSV Reader case

lidavidm · 2022-10-07T12:51:22Z

@github-actions crossbow submit java

github-actions · 2022-10-07T12:52:50Z

Revision: 98c6d1d

Submitted crossbow builds: ursacomputing/crossbow @ actions-e54f92cb9c

Task	Status
java-jars
verify-rc-source-java-linux-almalinux-8-amd64
verify-rc-source-java-linux-conda-latest-amd64
verify-rc-source-java-linux-ubuntu-18.04-amd64
verify-rc-source-java-linux-ubuntu-20.04-amd64
verify-rc-source-java-linux-ubuntu-22.04-amd64
verify-rc-source-java-macos-amd64

lidavidm · 2022-10-07T12:54:07Z

CI shows a flake in testTable:

Error:  org.apache.arrow.c.RoundtripTest.testTable  Time elapsed: 0.031 s  <<< ERROR!
java.lang.IllegalStateException: Cannot import released ArrowSchema
	at org.apache.arrow.util.Preconditions.checkState(Preconditions.java:458)
	at org.apache.arrow.c.SchemaImporter.importField(SchemaImporter.java:63)
	at org.apache.arrow.c.SchemaImporter.importField(SchemaImporter.java:56)
	at org.apache.arrow.c.Data.importField(Data.java:246)
	at org.apache.arrow.c.Data.importSchema(Data.java:266)
	at org.apache.arrow.c.Data.importVectorSchemaRoot(Data.java:377)
	at org.apache.arrow.c.RoundtripTest.testTable(RoundtripTest.java:683)

lidavidm · 2022-10-07T15:21:55Z

CI shows a flake in testTable:

Error:  org.apache.arrow.c.RoundtripTest.testTable  Time elapsed: 0.031 s  <<< ERROR!
java.lang.IllegalStateException: Cannot import released ArrowSchema
	at org.apache.arrow.util.Preconditions.checkState(Preconditions.java:458)
	at org.apache.arrow.c.SchemaImporter.importField(SchemaImporter.java:63)
	at org.apache.arrow.c.SchemaImporter.importField(SchemaImporter.java:56)
	at org.apache.arrow.c.Data.importField(Data.java:246)
	at org.apache.arrow.c.Data.importSchema(Data.java:266)
	at org.apache.arrow.c.Data.importVectorSchemaRoot(Data.java:377)
	at org.apache.arrow.c.RoundtripTest.testTable(RoundtripTest.java:683)

@lwhite1 have you seen this? Might be worth some investigation

log link: https://github.com/apache/arrow/actions/runs/3200943136/jobs/5229379128 (I'm going to restart the job now)

lidavidm · 2022-10-07T15:23:34Z

@davisusanibar it may be worth investigating if we can not build/run tests in java-jars? It seems rather redundant (or maybe that's on purpose)

davisusanibar · 2022-10-07T16:15:46Z

@davisusanibar it may be worth investigating if we can not build/run tests in java-jars? It seems rather redundant (or maybe that's on purpose)

Yes, sound redundant at first review, but, let me check what part could be improved

lwhite1 · 2022-10-07T16:23:10Z

CI shows a flake in testTable:

Error:  org.apache.arrow.c.RoundtripTest.testTable  Time elapsed: 0.031 s  <<< ERROR!
java.lang.IllegalStateException: Cannot import released ArrowSchema
	at org.apache.arrow.util.Preconditions.checkState(Preconditions.java:458)
	at org.apache.arrow.c.SchemaImporter.importField(SchemaImporter.java:63)
	at org.apache.arrow.c.SchemaImporter.importField(SchemaImporter.java:56)
	at org.apache.arrow.c.Data.importField(Data.java:246)
	at org.apache.arrow.c.Data.importSchema(Data.java:266)
	at org.apache.arrow.c.Data.importVectorSchemaRoot(Data.java:377)
	at org.apache.arrow.c.RoundtripTest.testTable(RoundtripTest.java:683)

@lwhite1 have you seen this? Might be worth some investigation

I haven't noticed it before but there are so many failures with any given commit I probably don't pay as much attention as I should to something that doesn't look related to the changes. I will take a look now.

lwhite1 · 2022-10-07T16:29:16Z

CI shows a flake in testTable:
Error:  org.apache.arrow.c.RoundtripTest.testTable  Time elapsed: 0.031 s  <<< ERROR!
java.lang.IllegalStateException: Cannot import released ArrowSchema
	at org.apache.arrow.util.Preconditions.checkState(Preconditions.java:458)
	at org.apache.arrow.c.SchemaImporter.importField(SchemaImporter.java:63)
	at org.apache.arrow.c.SchemaImporter.importField(SchemaImporter.java:56)
	at org.apache.arrow.c.Data.importField(Data.java:246)
	at org.apache.arrow.c.Data.importSchema(Data.java:266)
	at org.apache.arrow.c.Data.importVectorSchemaRoot(Data.java:377)
	at org.apache.arrow.c.RoundtripTest.testTable(RoundtripTest.java:683)
@lwhite1 have you seen this? Might be worth some investigation
I haven't noticed it before but there are so many failures with any given commit I probably don't pay as much attention as I should to something that doesn't look related to the changes. I will take a look now.

It looks like a bug in the test. I will push a fix. IDK why it passes locally.

lidavidm · 2022-10-07T16:30:47Z

It looks like the test crashes on another attempt.

lidavidm · 2022-10-11T11:41:00Z

@davisusanibar can you rebase here to see if Larry's fix makes CI pass?

lidavidm · 2022-10-11T19:34:04Z

Ok, looks all good. Thanks!

ursabot · 2022-10-12T11:32:18Z

Benchmark runs are scheduled for baseline = fa3cf78 and contender = a39f219. a39f219 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.67% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.27% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.18% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] a39f2197 ec2-t3-xlarge-us-east-2
[Failed] a39f2197 test-mac-arm
[Failed] a39f2197 ursa-i9-9960x
[Finished] a39f2197 ursa-thinkcentre-m75q
[Finished] fa3cf78e ec2-t3-xlarge-us-east-2
[Failed] fa3cf78e test-mac-arm
[Failed] fa3cf78e ursa-i9-9960x
[Finished] fa3cf78e ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot · 2022-10-12T11:32:36Z

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

This PR bumps Apache Arrow version from 9.0.0 to 10.0.0. Main changes related to PyAmber: ## Java/Scala side: - JDBC Driver for Arrow Flight SQL ([13800](apache/arrow#13800)) - Initial implementation of immutable Table API ([14316](apache/arrow#14316)) - Substrait, transaction, cancellation for Flight SQL ([13492](apache/arrow#13492)) - Read Arrow IPC, CSV, and ORC files by NativeDatasetFactory ([13811](apache/arrow#13811), [13973](apache/arrow#13973), [14182](apache/arrow#14182)) - Add utility to bind Arrow data to JDBC parameters ([13589](apache/arrow#13589)) ## Python side: - The batch_readahead and fragment_readahead arguments for scanning Datasets are exposed in Python ([ARROW-17299](https://issues.apache.org/jira/browse/ARROW-17299)). - ExtensionArrays can now be created from a storage array through the pa.array(..) constructor ([ARROW-17834](https://issues.apache.org/jira/browse/ARROW-17834)). - Converting ListArrays containing ExtensionArray values to numpy or pandas works by falling back to the storage array ([ARROW-17813](https://issues.apache.org/jira/browse/ARROW-17813)). - Casting Tables to a new schema now honors the nullability flag in the target schema ([ARROW-16651](https://issues.apache.org/jira/browse/ARROW-16651)).

…upport (as per ARROW-17786) (#34390) Update status documentation: CSV read in Java from #14182 Authored-by: Igor Suhorukov <igor.suhorukov@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>

Read CSV files using org.apache.arrow.dataset.jni.NativeDatasetFactory

eb2fecc

github-actions bot added the Component: Java label Sep 20, 2022

pitrou reviewed Sep 22, 2022

View reviewed changes

Create URI with scheme and ssp

19ca9e5

Merge branch 'master' into ARROW-17786

ed83a40

davisusanibar added 2 commits October 4, 2022 09:09

Merge branch

efbd83a

Solving error for JNI compilation failure

8b57022

lidavidm reviewed Oct 6, 2022

View reviewed changes

Close resources try-with + Disable CSV when needed + Update Dataset b…

98c6d1d

…uild by maven

lidavidm approved these changes Oct 7, 2022

View reviewed changes

Merge branch 'master' into ARROW-17786

059b6f3

lidavidm merged commit a39f219 into apache:master Oct 11, 2022

Yicong-Huang mentioned this pull request Dec 8, 2022

Bump Apache Arrow to 10.0.0 Texera/texera#1764

Merged

igor-suhorukov mentioned this pull request Mar 1, 2023

MINOR: [Docs][Java] Update status documentation for Java's CSV read support (as per ARROW-17786) #34390

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARROW-17786: [Java] Read CSV files using org.apache.arrow.dataset.jni.NativeDatasetFactory #14182

ARROW-17786: [Java] Read CSV files using org.apache.arrow.dataset.jni.NativeDatasetFactory #14182

davisusanibar commented Sep 20, 2022

github-actions bot commented Sep 20, 2022

pitrou Sep 22, 2022

davisusanibar Sep 22, 2022

pitrou commented Sep 22, 2022

pitrou commented Sep 22, 2022

davisusanibar commented Sep 22, 2022

pitrou commented Sep 22, 2022

davisusanibar commented Sep 22, 2022

davisusanibar commented Sep 22, 2022

davisusanibar commented Sep 23, 2022

davisusanibar commented Oct 4, 2022

davisusanibar commented Oct 6, 2022

lidavidm Oct 6, 2022

davisusanibar Oct 6, 2022

lwhite1 commented Oct 6, 2022 •

edited

Loading

davisusanibar commented Oct 6, 2022

lidavidm commented Oct 7, 2022

github-actions bot commented Oct 7, 2022

lidavidm commented Oct 7, 2022

lidavidm commented Oct 7, 2022 •

edited

Loading

lidavidm commented Oct 7, 2022

davisusanibar commented Oct 7, 2022

lwhite1 commented Oct 7, 2022

lwhite1 commented Oct 7, 2022

lidavidm commented Oct 7, 2022

lidavidm commented Oct 11, 2022

lidavidm commented Oct 11, 2022

ursabot commented Oct 12, 2022

ursabot commented Oct 12, 2022


		checkParquetReadResult(schema, expectedJsonUnordered, datum);

		AutoCloseables.close(datum);

ARROW-17786: [Java] Read CSV files using org.apache.arrow.dataset.jni.NativeDatasetFactory #14182

ARROW-17786: [Java] Read CSV files using org.apache.arrow.dataset.jni.NativeDatasetFactory #14182

Conversation

davisusanibar commented Sep 20, 2022

github-actions bot commented Sep 20, 2022

pitrou Sep 22, 2022

Choose a reason for hiding this comment

davisusanibar Sep 22, 2022

Choose a reason for hiding this comment

pitrou commented Sep 22, 2022

pitrou commented Sep 22, 2022

davisusanibar commented Sep 22, 2022

pitrou commented Sep 22, 2022

davisusanibar commented Sep 22, 2022

davisusanibar commented Sep 22, 2022

davisusanibar commented Sep 23, 2022

davisusanibar commented Oct 4, 2022

davisusanibar commented Oct 6, 2022

lidavidm Oct 6, 2022

Choose a reason for hiding this comment

davisusanibar Oct 6, 2022

Choose a reason for hiding this comment

lwhite1 commented Oct 6, 2022 • edited Loading

davisusanibar commented Oct 6, 2022

lidavidm commented Oct 7, 2022

github-actions bot commented Oct 7, 2022

lidavidm commented Oct 7, 2022

lidavidm commented Oct 7, 2022 • edited Loading

lidavidm commented Oct 7, 2022

davisusanibar commented Oct 7, 2022

lwhite1 commented Oct 7, 2022

lwhite1 commented Oct 7, 2022

lidavidm commented Oct 7, 2022

lidavidm commented Oct 11, 2022

lidavidm commented Oct 11, 2022

ursabot commented Oct 12, 2022

ursabot commented Oct 12, 2022

lwhite1 commented Oct 6, 2022 •

edited

Loading

lidavidm commented Oct 7, 2022 •

edited

Loading