-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FilePartition and PartitionedFile for scanning flexibility #932
Conversation
cc @houqp @alamb @andygrove for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left couple minor comments, the rest looks good to me!
ballista/rust/scheduler/src/lib.rs
Outdated
.collect(), | ||
schema: Some(parquet_desc.schema().as_ref().into()), | ||
partitions: vec![FilePartitionMetadata { | ||
filename: vec![path], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember we discussed this in the original PR. After taking a second look at the code, I am still not fully following the change here. The old behavior has FilePartitionMetadata.filename
set to a vector of file paths returned from a directory list, while the new behavior here has the filename always set to a vector of single entry with value set to the root path of the table.
Shouldn't we use parquet_desc.descriptor.descriptor
to build the filename vector here instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to a vector of all the files.
However, after searching for a while in the project, I find this method may not be actually used, it's hard to understand this RPC's intention as well. Perhaps it's deprecated and we should remove it later?
rpc GetFileMetadata (GetFileMetadataParams) returns (GetFileMetadataResult) {}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had the same question when I was going through the code base yesterday, I noticed it's only mentioned in ballista/docs/architecture.md
. @andygrove do you know if this RPC method is still needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great refactor @yjshen !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great @yjshen -- thank you for persevering. I think this PR looks great other than the addition of filter
to the LogicalPlanBuilder::scan
(see comments on that).
I didn't review the ballista changes, but I assume they are mostly mechanical
Again, thank you so much and sorry for the long review cycle
pub file_path: String, | ||
/// Statistics of the file | ||
pub statistics: Statistics, | ||
// Values of partition columns to be appended to each row |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in order to take full advantage of partition values (which might span multiple columns, for example), more information about the partitioning scheme will be needed (e.g. what expression is used to generated partitioning values). Adding partitioning support to DataFusion's planning / execution is probably worth its own discussion
(that is to say I agree with postponing adding anything partition specific)
@@ -27,14 +27,14 @@ use crate::{ | |||
logical_plan::{Column, Expr}, | |||
physical_optimizer::pruning::{PruningPredicate, PruningStatistics}, | |||
physical_plan::{ | |||
common, DisplayFormatType, ExecutionPlan, Partitioning, SendableRecordBatchStream, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like how the statistics and schema related code has been moved out of physical_plan
and into datasource
} | ||
|
||
/// Convert a table provider into a builder with a TableScan | ||
pub fn scan( | ||
table_name: impl Into<String>, | ||
provider: Arc<dyn TableProvider>, | ||
projection: Option<Vec<usize>>, | ||
filters: Option<Vec<Expr>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this argument is likely going to be confusing to users and it should be removed.
For example as a user of LogicalPlanBuilder
I would probably assume that the following plan would return only rows where with a<5
// Build a plan that looks like it would filter out all rows with `a < 5`
let plan = builder.scan("table", provider, None, vec![col("a").lt(lit(5)));
However, I am pretty sure it could (and often would) return rows with a >= 5). This is because filters
added to a TableScan
node are optional (in the sense that the provider might not filter rows that do not pass the predicate, but is not required to). Indeed, even for the parquet provider, the filters are only used for row group pruning which may or may not be able to filter rows.
I think we could solve this with:
- Leave
scan
signature alone and rely on the predicate pushdown optimization to push filters appropriately down to the scan (my preference as it is simpler for the users) - Rename this argument to something like 'optional_filters_for_performance' and document what it does more carefully. I think it would be challenging to explain as it might/might not do anything depending on how the data was laid out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed, and keep the filters not deserialized for ballista as before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @yjshen !
Thank you @yjshen for being patient and driving through this big change step by step :) |
Which issue does this PR close?
Closes #946 .
Rationale for this change
ParquetExec:: try_from_path
to get planning-related metadata.What changes are included in this PR?
PartitionedFile -> Single file (for the moment) or part of a file (later, part of the row groups or rows), and we may even extend this to include partition value and partition schema to support partitioned tables:
/path/to/table/root/p_date=20210813/p_hour=1200/xxxxx.parquet
FilePartition -> The basic unit for parallel processing, each task is responsible for processing one FilePartition which is composed of several PartitionFiles.
Update ballista protocol as well as the serdes to use the new abstraction.
Telling apart the planning related code from
ParquetExec
Are there any user-facing changes?
No.