ARROW-17786: [Java] Read CSV files using org.apache.arrow.dataset.jni.NativeDatasetFactory #14182
Conversation
public CsvWriteSupport(File outputFolder) {
  path = outputFolder.getPath() + File.separator + "generated-" + random.nextLong() + ".csv";
  uri = "file://" + path;
Shouldn't you use a URI constructor instead?
https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String)
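To illustrate the reviewer's suggestion (the folder and file names below are made up for the example), the multi-argument `java.net.URI` constructors percent-encode characters that plain string concatenation would leave as an invalid URI:

```java
import java.io.File;
import java.net.URI;
import java.net.URISyntaxException;

public class CsvUriExample {
  public static void main(String[] args) throws URISyntaxException {
    // Hypothetical output path; note the space, which is illegal in a raw URI.
    String path = "/tmp/output folder" + File.separator + "generated-42.csv";

    // Naive concatenation produces an invalid URI when the path needs escaping.
    String naive = "file://" + path;

    // The URI(scheme, host, path, fragment) constructor quotes illegal
    // characters in the path component.
    URI uri = new URI("file", null, path, null);

    System.out.println(naive); // contains a raw space
    System.out.println(uri);   // the space is percent-encoded as %20
  }
}
```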
Thanks, changed
@lwhite1 Would you like to review this?
@davisusanibar Why is only write support tested? Shouldn't you test support for dataset reads as well?
Hi, write support is done by Java native libraries; the Dataset module then reads that CSV file with FileFormat.CSV. Currently we only offer read support for Parquet, ORC, and now CSV; write support is not implemented.
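For context, reading a CSV file through the Dataset JNI bindings might look roughly like the sketch below. The file path is made up, and the exact scan API (e.g. `scanBatches`) may differ between Arrow Java versions, so treat this as an illustration rather than the PR's implementation:

```java
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;

public class CsvDatasetReadSketch {
  public static void main(String[] args) throws Exception {
    String uri = "file:///tmp/example.csv"; // hypothetical input file
    ScanOptions options = new ScanOptions(/*batchSize=*/ 32768);
    try (BufferAllocator allocator = new RootAllocator();
         DatasetFactory factory = new FileSystemDatasetFactory(
             allocator, NativeMemoryPool.getDefault(), FileFormat.CSV, uri);
         Dataset dataset = factory.finish();
         Scanner scanner = dataset.newScan(options);
         ArrowReader reader = scanner.scanBatches()) {
      while (reader.loadNextBatch()) {
        // Process each batch via the reader's VectorSchemaRoot.
        System.out.println(reader.getVectorSchemaRoot().getRowCount());
      }
    }
  }
}
```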
Oops, sorry, I had misread the tests.
Related to updating the Java Dataset module documentation, this Jira will cover that: https://issues.apache.org/jira/browse/ARROW-17789
This line: arrow/cpp/src/arrow/json/parser.cc, line 1030 (at commit 91e3ac5)
Just rebased, and it's working fine now.
Please, when you have some time, could you review this? Thank you.
checkParquetReadResult(schema, expectedJsonUnordered, datum);
AutoCloseables.close(datum);
Use try-with-resources?
Added try-with-resources for FileSystemDatasetFactory, but I don't know how to add that for List<ArrowRecordBatch>, which is an input to another function.
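One possible answer to the List<ArrowRecordBatch> question (this adapter is illustrative, not part of the Arrow API, and the `Batch` class is a made-up stand-in for ArrowRecordBatch): a list of resources cannot go directly in a try-with-resources header, but a small AutoCloseable wrapper that closes every element can:

```java
import java.util.ArrayList;
import java.util.List;

public class TryWithListSketch {
  // Hypothetical stand-in for an Arrow resource such as ArrowRecordBatch.
  static class Batch implements AutoCloseable {
    boolean closed = false;
    @Override public void close() { closed = true; }
  }

  // Wrap a list of resources in a single AutoCloseable so it fits in a
  // try-with-resources header.
  static AutoCloseable closeAll(List<? extends AutoCloseable> resources) {
    return () -> {
      for (AutoCloseable r : resources) {
        r.close();
      }
    };
  }

  public static void main(String[] args) throws Exception {
    List<Batch> batches = new ArrayList<>();
    batches.add(new Batch());
    batches.add(new Batch());
    try (AutoCloseable ignored = closeAll(batches)) {
      // Pass the batches to the consuming function here; they are
      // closed automatically when the try block exits.
    }
    System.out.println(batches.get(0).closed && batches.get(1).closed);
  }
}
```

A try/finally calling AutoCloseables.close(batches) achieves the same effect; the wrapper just keeps the cleanup in the resource header.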
Sorry for not getting to this sooner. Is this for release 10?
No problem. Yes, I'm trying to merge this for 10 so that we can then create cookbooks for this CSV reader use case.
@github-actions crossbow submit java
Revision: 98c6d1d Submitted crossbow builds: ursacomputing/crossbow @ actions-e54f92cb9c
CI shows a flake in testTable:
@lwhite1 have you seen this? Might be worth some investigation. Log link: https://github.com/apache/arrow/actions/runs/3200943136/jobs/5229379128 (I'm going to restart the job now)
@davisusanibar it may be worth investigating whether we can avoid building/running tests in java-jars? It seems rather redundant (or maybe that's on purpose).
Yes, it sounds redundant at first review, but let me check which part could be improved.
I hadn't noticed it before, but there are so many failures with any given commit that I probably don't pay as much attention as I should to something that doesn't look related to the changes. I will take a look now.
It looks like a bug in the test. I will push a fix. IDK why it passes locally.
It looks like the test crashes on another attempt.
@davisusanibar can you rebase here to see if Larry's fix makes CI pass?
Ok, looks all good. Thanks!
Benchmark runs are scheduled for baseline = fa3cf78 and contender = a39f219. a39f219 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have high level of regressions. |
This PR bumps Apache Arrow version from 9.0.0 to 10.0.0. Main changes related to PyAmber:

## Java/Scala side:

- JDBC Driver for Arrow Flight SQL ([13800](apache/arrow#13800))
- Initial implementation of immutable Table API ([14316](apache/arrow#14316))
- Substrait, transaction, cancellation for Flight SQL ([13492](apache/arrow#13492))
- Read Arrow IPC, CSV, and ORC files by NativeDatasetFactory ([13811](apache/arrow#13811), [13973](apache/arrow#13973), [14182](apache/arrow#14182))
- Add utility to bind Arrow data to JDBC parameters ([13589](apache/arrow#13589))

## Python side:

- The batch_readahead and fragment_readahead arguments for scanning Datasets are exposed in Python ([ARROW-17299](https://issues.apache.org/jira/browse/ARROW-17299)).
- ExtensionArrays can now be created from a storage array through the pa.array(..) constructor ([ARROW-17834](https://issues.apache.org/jira/browse/ARROW-17834)).
- Converting ListArrays containing ExtensionArray values to numpy or pandas works by falling back to the storage array ([ARROW-17813](https://issues.apache.org/jira/browse/ARROW-17813)).
- Casting Tables to a new schema now honors the nullability flag in the target schema ([ARROW-16651](https://issues.apache.org/jira/browse/ARROW-16651)).
…upport (as per ARROW-17786) (#34390) Update status documentation: CSV read in Java from #14182 Authored-by: Igor Suhorukov <igor.suhorukov@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
Support CSV file format in Java Dataset API