
FileStream: Open next file in parallel while decoding #5161

Merged: 4 commits into master on Feb 7, 2023

Conversation

thinkharderdev (Contributor)

Which issue does this PR close?

Closes #5129

Rationale for this change

Opening a file is mostly IO (and may involve a fair amount of sequential IO), so it can be overlapped well with decoding. FileStream should therefore open the next file in parallel while decoding the current one.
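The pipelining idea can be sketched with plain threads (hypothetical names; the actual FileStream drives a FileOpenFuture inside its poll loop rather than spawning tasks):

```rust
use std::thread;

// Stand-in for an IO-bound file open (hypothetical helper).
fn open_file(path: String) -> Vec<u32> {
    vec![path.len() as u32] // pretend these are decoded batches
}

// Process files in order, but start opening file k+1 while still
// "decoding" file k so the open's IO overlaps with CPU work.
fn scan_all(paths: Vec<String>) -> Vec<u32> {
    let mut out = Vec::new();
    let mut iter = paths.into_iter();
    // Kick off the first open on a worker thread.
    let mut next = iter.next().map(|p| thread::spawn(move || open_file(p)));
    while let Some(handle) = next.take() {
        let batches = handle.join().unwrap();
        // Immediately begin opening the following file so its IO
        // overlaps with decoding the batches we just received.
        next = iter.next().map(|p| thread::spawn(move || open_file(p)));
        for b in batches {
            out.push(b); // decode step
        }
    }
    out
}
```

The key ordering is that the next open is launched before the current file's batches are consumed, which is the same overlap this PR achieves with futures.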

What changes are included in this PR?

Are these changes tested?

I think this should be covered by existing tests.

Are there any user-facing changes?

FileStreamMetrics.time_opening has slightly different semantics now: it no longer captures the total time spent opening files, but only the time spent opening while not concurrently decoding.

@github-actions github-actions bot added the core Core DataFusion crate label Feb 2, 2023
@alamb (Contributor) left a comment

Looks good to me @thinkharderdev -- thank you. It would be great to figure out some way to test this PR (mostly to ensure we don't break this behavior in the future). However, I don't have any clever ideas on how to do so.

I went through the logic in detail.

I left some suggestions for comments to clarify the intent, which I think would be valuable but are not necessary.

cc @tustvold

partition_values,
}
}
None => return Poll::Ready(None),

Suggested change
None => return Poll::Ready(None),
// No more input files
None => return Poll::Ready(None),

@@ -237,13 +249,34 @@ impl<F: FileOpener> FileStream<F> {
partition_values,
} => match ready!(future.poll_unpin(cx)) {
Ok(reader) => {
let partition_values = mem::take(partition_values);

let next = self.next_file().transpose();

Suggested change
let next = self.next_file().transpose();
// begin opening next file
let next = self.next_file().transpose();

@@ -98,6 +99,8 @@ enum FileStreamState {
partition_values: Vec<ScalarValue>,
/// The reader instance
reader: BoxStream<'static, Result<RecordBatch, ArrowError>>,
/// A [`FileOpenFuture`] for the next file to be processed
next: Option<(FileOpenFuture, Vec<ScalarValue>)>,
Contributor:

I wonder if we could make it future-proof by potentially prefetching n files instead of 1? I guess in cases where file opening is slower than scanning / processing, this could make a difference (e.g. small files).

Contributor:

Perhaps a follow-on PR could turn this into a stream and use StreamExt::buffered or something.
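The "prefetch n files" idea can be sketched without the futures crate by keeping up to n opens in flight on worker threads (hypothetical names; a real follow-on would more likely use StreamExt::buffered as suggested above):

```rust
use std::collections::VecDeque;
use std::thread::{self, JoinHandle};

// Stand-in for an IO-bound file open (hypothetical helper).
fn open_file(path: String) -> Vec<u32> {
    vec![path.len() as u32] // pretend these are decoded batches
}

// Like the current PR, but with up to `n` file opens in flight at once.
// Results are still consumed in order, mirroring buffered's semantics.
fn scan_buffered(paths: Vec<String>, n: usize) -> Vec<u32> {
    let mut out = Vec::new();
    let mut iter = paths.into_iter();
    let mut pending: VecDeque<JoinHandle<Vec<u32>>> = VecDeque::new();
    loop {
        // Top up the prefetch window to `n` in-flight opens.
        while pending.len() < n {
            match iter.next() {
                Some(p) => pending.push_back(thread::spawn(move || open_file(p))),
                None => break,
            }
        }
        // Consume the oldest open; its successors keep running meanwhile.
        match pending.pop_front() {
            Some(handle) => out.extend(handle.join().unwrap()),
            None => break,
        }
    }
    out
}
```

With n = 1 this degenerates to the behavior in this PR; larger n would help workloads with many small files where opening dominates decoding.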

thinkharderdev (Author):

yeah, that seems like a good idea

alamb (Contributor) commented Feb 7, 2023

Let's file a ticket for the "buffer N items at a time" idea and work on it as a follow-on PR.

@alamb alamb merged commit 816a0f8 into master Feb 7, 2023
alamb (Contributor) commented Feb 7, 2023

Thanks again @thinkharderdev

ursabot commented Feb 7, 2023

Benchmark runs are scheduled for baseline = 48732b4 and contender = 816a0f8. 816a0f8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

thinkharderdev (Author)

Added #5209

Successfully merging this pull request may close these issues: Pipeline file opening in FileStream

5 participants