ARROW-11776: [C++][Java] Support parquet write from ArrowReader to file #14151
lidavidm merged 14 commits into apache:master
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also:
@JkSelf Can we just use the existing ticket ARROW-11776?
Is it necessary to conduct this kind of encoding?
Let's add some unit tests.
You can refer to the previous PR https://github.com/apache/arrow/pull/10201/files
Is it necessary to create this class?
IteratorImpl seems to already handle everything.
Maybe we can rename this to CRecordBatchIterator or something to make things clearer.
There might be a better way to distinguish these two variables than just naming them iter and itr?
arrowArray may need to be closed eventually, or memory may leak.
Why isn't this just using the C Data Interface instead of doing things like serializing schemas and manually adapting iterators?
In particular, you can export a stream now, and that will arrive in C++ as a RecordBatchReader.
@zhztheplayer @lidavidm
Same question: why are we manually bridging data to C++, and why do we have CRecordBatchIterator at all, when we could use ArrowArrayStream instead? This should just take an ArrowArrayStream as the data source.
We already use ArrowArrayStream to share data with the native side. Here, CRecordBatchIterator is used to iterate over the ArrowArrayStream object. I have renamed CRecordBatchIterator to CArrowArrayStreamIterator; sorry for the confusing name.
No, what I mean is that you do not need the iterator at all. Implement the ArrowReader API for Scanner or whatever the Java-side data source is, then export that once and import it once on the C++ side. The JNI side should not need to know about any new Java-side classes or methods.
Thanks for your detailed explanation. I have removed the iterator and now use the ArrowArrayStream as the data source.
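For reference, a minimal sketch of the export-once approach described above, assuming the Java C Data module's `Data.exportArrayStream` and `ArrowArrayStream.allocateNew` APIs; the native side can then import the struct at the returned address (e.g. via `arrow::ImportRecordBatchReader` in C++) as a `RecordBatchReader`:

```java
import org.apache.arrow.c.ArrowArrayStream;
import org.apache.arrow.c.Data;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;

public final class StreamExportSketch {
  private StreamExportSketch() {}

  // Export the reader once as a C Data Interface stream and return the
  // struct's memory address; that single address can cross JNI, and the
  // native side takes over consuming and releasing the stream.
  public static long exportReader(BufferAllocator allocator, ArrowReader reader) {
    ArrowArrayStream stream = ArrowArrayStream.allocateNew(allocator);
    Data.exportArrayStream(allocator, reader, stream);
    return stream.memoryAddress();
  }
}
```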
I don't think this is right. We should implement ArrowReader over Scanner, then write the entire dataset in one go. And then this method can be used to write any ArrowReader, not just Scanners.
Added ArrowScannerReader over Scanner.
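Roughly, such a reader could take the shape sketched below. This is not the actual ArrowScannerReader added in this PR; it assumes that `Scanner.scan()` yields `ScanTask`s whose `execute()` returns a per-task `ArrowReader`, and that the scanner exposes its `Schema` via `schema()`.

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.arrow.dataset.scanner.ScanTask;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
import org.apache.arrow.vector.types.pojo.Schema;

// Illustrative sketch of an ArrowReader adapter over a Scanner (not the PR code).
public class ScannerBackedReader extends ArrowReader {
  private final Scanner scanner;
  private final Iterator<? extends ScanTask> tasks;
  private ArrowReader taskReader;

  public ScannerBackedReader(Scanner scanner, BufferAllocator allocator) {
    super(allocator);
    this.scanner = scanner;
    this.tasks = scanner.scan().iterator();
  }

  @Override
  public boolean loadNextBatch() throws IOException {
    while (true) {
      if (taskReader == null) {
        if (!tasks.hasNext()) {
          return false;  // all scan tasks exhausted
        }
        taskReader = tasks.next().execute();  // assumed to return a per-task ArrowReader
      }
      if (taskReader.loadNextBatch()) {
        // Copy the task reader's current batch into this reader's root.
        VectorUnloader unloader = new VectorUnloader(taskReader.getVectorSchemaRoot());
        try (ArrowRecordBatch batch = unloader.getRecordBatch()) {
          new VectorLoader(getVectorSchemaRoot()).load(batch);
        }
        return true;
      }
      taskReader.close();  // current task drained; move on to the next one
      taskReader = null;
    }
  }

  @Override
  public long bytesRead() {
    return 0;  // not tracked in this sketch
  }

  @Override
  protected void closeReadSource() throws IOException {
    try {
      if (taskReader != null) {
        taskReader.close();
      }
      scanner.close();
    } catch (Exception e) {
      throw new IOException(e);
    }
  }

  @Override
  protected Schema readSchema() throws IOException {
    return scanner.schema();  // assumed; could also be taken from the first task
  }
}
```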
Force-pushed from 85ffbd5 to 7578bca.
Is there an issue if we potentially double-close these?
If it is already closed, the close method returns immediately and performs no further operations.
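In other words, close is made idempotent. A minimal sketch of that pattern (illustrative names, not the actual PR code):

```java
// Minimal sketch of an idempotent close: the second and later calls return
// immediately, so a potential double-close is harmless.
public class IdempotentResource implements AutoCloseable {
  private boolean closed = false;

  @Override
  public void close() {
    if (closed) {
      return;  // already closed; nothing further to do
    }
    closed = true;
    // release the underlying (e.g. native/JNI) resources here
  }
}
```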
There are lint errors https://github.com/apache/arrow/actions/runs/3290191393/jobs/5426029066#step:6:7990
@lidavidm
Can you rebase to see if the CI issue is fixed?
Force-pushed from 7988012 to 5f86c71.
Rebased.
There's still a lint error here.
Benchmark runs are scheduled for baseline = a2881a1 and contender = dddf38f. dddf38f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
The ['Python', 'R'] benchmarks have a high level of regressions.
ARROW-11776: [C++][Java] Support parquet write from ArrowReader to file (apache#14151)
This PR aims to support Parquet writes from an ArrowReader to a file.
Authored-by: Jia Ke <ke.a.jia@intel.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
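A rough end-to-end usage sketch of the feature, assuming an `ArrowScannerReader(Scanner, BufferAllocator)` constructor and a `DatasetFileWriter.write(allocator, reader, format, uri)` overload along the lines of what this PR adds (the exact signatures may differ):

```java
import org.apache.arrow.dataset.file.DatasetFileWriter;
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.scanner.ArrowScannerReader;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;

public final class WriteParquetSketch {
  private WriteParquetSketch() {}

  // Wrap a Scanner in an ArrowReader and write its batches out as Parquet.
  public static void writeToParquet(Scanner scanner, String outputUri) throws Exception {
    try (BufferAllocator allocator = new RootAllocator();
         ArrowReader reader = new ArrowScannerReader(scanner, allocator)) {
      DatasetFileWriter.write(allocator, reader, FileFormat.PARQUET, outputUri);
    }
  }
}
```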