refactor(rust): Add parquet source node to new streaming engine #18152
Conversation
Force-pushed from 7d22137 to 2756b43
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #18152      +/-   ##
==========================================
- Coverage   80.42%   79.83%   -0.60%
==========================================
  Files        1492     1496       +4
  Lines      198675   200238    +1563
  Branches     2841     2841
==========================================
+ Hits       159785   159854      +69
- Misses      38365    39859    +1494
  Partials      525      525

☔ View full report in Codecov by Sentry.
if let Some(predicate) = self.physical_predicate.as_deref() {
    let mask = predicate.evaluate_io(&df)?;
    let mask = mask.bool().unwrap();

    par_filter_df(&mut df, mask, cpu_runtime.as_ref()).await?;
}
I very briefly discussed this with Ritchie, but we should maybe plan a quick call for this. I think for streaming, it always makes sense to first decode the columns that are needed for the predicate. Then, depending on the selectiveness of the data (which we can estimate with `(mask ^ (mask >> 1)).popcount()`), we can decide to use `par_filter_df` or the direct `filter=Bitmask(mask)` in the parquet reader. This would lead to very large speedups, and I definitely think it is worth it.
This can maybe wait until we have a POC, though.
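For illustration, a minimal word-level sketch of this selectivity estimate, assuming the mask is available as `u64` words (the function name is hypothetical, and transitions across word boundaries are ignored):

```rust
/// Counts bit transitions in the filter mask via `mask ^ (mask >> 1)`.
/// Few transitions mean long contiguous runs of selected rows, which favours
/// pushing a bitmask filter into the parquet reader; many transitions favour
/// a parallel post-decode filter such as `par_filter_df`.
fn estimated_transitions(mask_words: &[u64]) -> u32 {
    mask_words
        .iter()
        .map(|w| (w ^ (w >> 1)).count_ones())
        .sum()
}
```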
Force-pushed from 6104f7f to b5b3153
Force-pushed from b5b3153 to 7df7be5
Force-pushed from 7e8ac05 to ef3a71b
Ai.. Can you rebase? :/
Force-pushed from 3c12bab to 8bf43ed
Force-pushed from ea5c9b3 to 4f3e788
Force-pushed from 4f3e788 to cc70ae9
        Err(e) => Err(e),
    }
})
.buffer_unordered(num_pipelines);
@orlp This changes the decode back to spawning tasks onto the executor to support splitting a single row group into multiple morsels. This lets me distribute the morsels more evenly across the pipelines - with the previous approach, I think if I added the splitting in-place, then even after I split the row group into several morsels they would still end up being sent serially across the same pipeline.
> I think if I added the splitting in-place, then even after I split the row group into several morsels they would still end up being sent serially across the same pipeline.
I don't understand why, as long as the splitting happens before going into the work distributor, everything should be fine. Is the splitting itself also computationally intensive?
It's because we can't split the raw byte-data of the row group, so instead I'm splitting the row group after it's been decoded into a DataFrame. In the previous version, the row group decoding took place after the work distributor had distributed the raw row group byte data - at that point I was no longer able to re-distribute the individual splits within a row group; they would have had to be sent serially across the pipeline they were in.
I think, maybe we can have a compute node that specifically sits in front of the parquet source to ensure that we have good morsel sizes? It would split morsels that are too big, and combine morsels that are too small? Then it could also sit in front of other operators that have unpredictable morsel sizes (e.g. other source nodes, or the filter node).
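A rough sketch of what such a morsel-size normalizer could look like, purely for illustration - the type name, thresholds, and placement are assumptions, not anything implemented in this PR:

```rust
use polars_core::prelude::*;

/// Illustrative-only: splits frames much larger than the target into
/// target-sized slices, and buffers small frames until enough rows accumulate.
struct MorselNormalizer {
    target_rows: usize,
    buffer: Option<DataFrame>,
}

impl MorselNormalizer {
    fn push(&mut self, df: DataFrame, out: &mut Vec<DataFrame>) -> PolarsResult<()> {
        if df.height() >= 2 * self.target_rows {
            // Too large: emit target-sized slices.
            let mut offset = 0i64;
            while (offset as usize) < df.height() {
                out.push(df.slice(offset, self.target_rows));
                offset += self.target_rows as i64;
            }
        } else {
            // Too small (or close enough): accumulate, flush once the target is hit.
            let acc = match self.buffer.take() {
                Some(mut acc) => {
                    acc.vstack_mut(&df)?;
                    acc
                }
                None => df,
            };
            if acc.height() >= self.target_rows {
                out.push(acc);
            } else {
                self.buffer = Some(acc);
            }
        }
        Ok(())
    }
}
```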
Hmm.. I thought about it again and I don't like what I'm currently doing either.. it's not receiving backpressure properly from the pipeline.
I think it's better to leave it as 1 morsel per row-group for now?
if self.use_par_decode && decode_fut_iter.len() > 1 {
    for handle in decode_fut_iter.map(|fut| {
        async_executor::AbortOnDropHandle::new(async_executor::spawn(
@orlp one more place where I spawn - this is for decoding the columns within a row group in parallel. I think in theory this makes sense for very wide tables, but from testing the performance was identical for 1M-row groups with 50 columns. I've currently tuned it to send a minimum of `const VALUES_PER_THREAD: usize = 8_388_608` values per thread.
Can you maybe add a comment explaining how/why that value was derived?
Added 👍 (edit: doubled the value to 16M)
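For context, a rough sketch of how such a per-thread minimum could gate the amount of decode parallelism - the helper and its exact arithmetic are illustrative, not the PR's actual code:

```rust
// Minimum number of values one decode task should handle before spawning an
// extra task is worthwhile (the value discussed above; later doubled to 16M).
const VALUES_PER_THREAD: usize = 8_388_608;

/// Illustrative helper: how many parallel decode tasks to use for a row group of
/// `n_rows` x `n_columns`, capped by the number of available pipelines.
fn n_decode_tasks(n_rows: usize, n_columns: usize, max_tasks: usize) -> usize {
    let total_values = n_rows * n_columns;
    (total_values / VALUES_PER_THREAD).clamp(1, max_tasks)
}

// e.g. a 1M-row group with 50 columns has 50M values, so roughly 5 tasks
// (if that many pipelines are available).
```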
// Early shutdown - our port state was set to `Done` by the downstream nodes. This
// also means we can discard any potential errors from the `shutdown()`. Note that
// `shutdown()` internally unwraps any critical errors.
let _ = polars_io::pl_async::get_runtime().block_on(self.shutdown());
`shutdown`'s docs state that it panics if called more than once, and `update_state` can get called more than once. I think you should check `is_finished` before calling this.
Furthermore, do we actually need to block on this? Can't we just spawn it and it'll clean itself up in the background?
> I think you should check is_finished before calling this.

I believe it should be checked by the initial `if self.is_finished.load(Ordering::Relaxed) {` 😁

> Furthermore, do we actually need to block on this? Can't we just spawn it and it'll clean itself up in the background?

I tried, but the borrow checker wasn't happy, as `shutdown()` takes `&mut self`. I think the shutdown should be fairly quick, so it should be fine?
Update - I added `shutdown_in_background()` and made it work by putting the task data behind an `Arc<Mutex<>>`.
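A hypothetical sketch of that pattern, assuming `TaskData` as a placeholder type and `tokio::spawn` as a stand-in for the streaming engine's executor (not the PR's actual code):

```rust
use std::sync::{Arc, Mutex};

struct TaskData; // placeholder for handles/channels that need tearing down

struct ParquetSourceNode {
    task_data: Arc<Mutex<Option<TaskData>>>,
}

impl ParquetSourceNode {
    /// Spawns cleanup instead of blocking: the spawned task owns a clone of the
    /// Arc, so no `&mut self` borrow has to outlive this call.
    fn shutdown_in_background(&self) {
        let task_data = Arc::clone(&self.task_data);
        tokio::spawn(async move {
            // Take the data out of the mutex and drop it in the background.
            drop(task_data.lock().unwrap().take());
        });
    }
}
```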
Force-pushed from c56fa22 to c47eb71
let mut dfs = vec![].into_iter();

'main: loop {
    let Some(mut indexed_wait_group) = wait_groups.next().await else {
I've used a new approach for applying backpressure using the wait groups here - it makes us spawn far fewer tasks than before.
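An illustrative stand-in for that kind of backpressure, using a tokio `Semaphore` in place of the PR's wait-group primitives (names and structure are assumptions):

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

/// At most `max_in_flight` morsels may be outstanding; a permit travels with
/// each morsel, so dropping the morsel downstream releases capacity again.
struct InFlightLimiter {
    permits: Arc<Semaphore>,
}

impl InFlightLimiter {
    fn new(max_in_flight: usize) -> Self {
        Self {
            permits: Arc::new(Semaphore::new(max_in_flight)),
        }
    }

    /// Suspends the producer once the in-flight limit is reached.
    async fn reserve(&self) -> OwnedSemaphorePermit {
        self.permits.clone().acquire_owned().await.unwrap()
    }
}
```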
if cols_per_task <= df.width() {
    df._filter_seq(mask)?
} else {
    let mask = mask.clone();
Added horizontal parallelism back to predicate filtering here in the row group decoder, but only when we are past `VALUES_PER_THREAD`. I want to do this here so that the predicate is applied before we potentially split the row group into multiple morsels - if we instead applied the predicate afterwards, we could end up with very small morsels.
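A hedged sketch of the kind of threshold decision described above (serial vs. column-parallel filtering); the helpers and the exact condition are illustrative only:

```rust
const VALUES_PER_THREAD: usize = 8_388_608;

/// Illustrative: number of columns per filter task so that each task covers at
/// least VALUES_PER_THREAD values for a frame of `height` rows.
fn cols_per_filter_task(height: usize) -> usize {
    VALUES_PER_THREAD.div_ceil(height.max(1))
}

/// Illustrative: if a single task would already cover the whole frame, a plain
/// sequential filter is used instead of spawning per-column-chunk tasks.
fn use_parallel_filter(height: usize, width: usize) -> bool {
    cols_per_filter_task(height) < width
}
```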
loop {
    use crate::async_primitives::connector::SendError;

    let port_index = indexed_wait_group.index;
Can you give this a different name? A `port` refers to an input or output of a node. One `port` can consist of a serial sender/receiver, or a series of parallel senders/receivers, but they all belong to the same port.
I've renamed it to `channel_index`.
let mut row_group_data_fetcher = Box::pin(self);
let current_future = Box::pin(
    unsafe {
        std::mem::transmute::<&mut RowGroupDataFetcher, &'static mut RowGroupDataFetcher>(
Instead of solving this with lifetime transmutes, which make me rather uncomfortable, can you change it to use `Arc` + `Mutex`es instead? You'll likely need to move the mutable state inside a `Mutex` and change `next()` to take `&self` instead of `&mut self`.
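A minimal sketch of that suggested pattern, with illustrative types and fields (the actual fetcher state is more involved, and the change below ended up being solved differently, without locks):

```rust
use std::sync::Mutex;

struct FetchState {
    next_row_group: usize,
}

/// Mutable state lives behind a Mutex so `next()` can take `&self`,
/// removing the need for a `'static` lifetime transmute.
struct RowGroupDataFetcher {
    state: Mutex<FetchState>,
}

impl RowGroupDataFetcher {
    async fn next(&self) -> Option<usize> {
        // Mutate the shared state while holding the lock, then do any awaiting
        // (e.g. the actual fetch) after the guard has been dropped.
        let idx = {
            let mut state = self.state.lock().unwrap();
            let idx = state.next_row_group;
            state.next_row_group += 1;
            idx
        };
        Some(idx)
    }
}
```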
I found a way to get rid of the unsafe transmute without introducing any locks 😁
Force-pushed from d325c57 to cf34379
Nice work :) There are still some changes I'd like to make to clean everything up, but we (or just me) can do that later in future PRs.
Force-pushed from 967eda8 to 76caa3f
Force-pushed from 76caa3f to b751463
Great effort to get this in. Thanks both! We can iron things out in future PRs. First get this huge PR in. :)
Enables scanning parquet files in the new streaming engine. This is done via a new parquet source
node that has been built to run natively on the new async executor for maximum performance.
Benchmarks
Setup
Dataset generation
Feature parity
The source node in this PR should fully support all existing functionality of the in-memory engine
(including slices with negative offsets, which isn't supported by the existing streaming engine).
Metadata fetching optimization
The new source node uses a metadata size estimate for async reads, which allows us to potentially save network requests. Small parquet files are also fully downloaded in one network request.
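As an illustration of the idea (not the PR's actual code), the sketch below fetches a fixed-size tail of the file in one request and only issues a second request if the parquet footer turns out to be larger than the estimate; the constant, helper name, and `fetch_range` callback are assumptions:

```rust
const ESTIMATED_METADATA_SIZE: usize = 128 * 1024;
const FOOTER_SUFFIX: usize = 8; // 4-byte little-endian footer length + "PAR1" magic

fn fetch_metadata_bytes(
    file_size: usize,
    mut fetch_range: impl FnMut(std::ops::Range<usize>) -> Vec<u8>,
) -> Vec<u8> {
    // One request for the estimated tail (assumes file_size >= FOOTER_SUFFIX).
    let tail_start = file_size.saturating_sub(ESTIMATED_METADATA_SIZE);
    let tail = fetch_range(tail_start..file_size);

    // The metadata length is stored just before the trailing magic bytes.
    let len_pos = tail.len() - FOOTER_SUFFIX;
    let footer_len =
        u32::from_le_bytes(tail[len_pos..len_pos + 4].try_into().unwrap()) as usize;

    if footer_len + FOOTER_SUFFIX <= tail.len() {
        // The estimate was large enough: no extra request needed.
        tail[tail.len() - FOOTER_SUFFIX - footer_len..].to_vec()
    } else {
        // Estimate too small: one more request for the exact metadata range.
        fetch_range(file_size - FOOTER_SUFFIX - footer_len..file_size)
    }
}
```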
Slice pushdown
Slice pushdown (negative offset)
Predicate pushdown
Byte source trait
This PR also introduces a new byte source trait that provides a unified interface to efficiently fetch byte ranges from both local and cloud files.
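A minimal sketch of what such an abstraction could look like; the trait and method names below are illustrative, not the PR's actual definitions:

```rust
use std::io;
use std::ops::Range;

/// Unified access to a local file or a cloud object.
trait ByteSource {
    /// Total size of the underlying file/object in bytes.
    fn len(&self) -> io::Result<usize>;

    /// Fetch a contiguous byte range (a positioned local read, or a cloud GET
    /// with a `Range` header).
    fn get_range(&self, range: Range<usize>) -> io::Result<Vec<u8>>;
}
```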