Prune scanned files on column stats #724
Conversation
@wjones127 - the python 3.7 builds seem to have started failing here and in other PRs. It seems it tries to build pyarrow from source again and fails to find Arrow C++. While we could install it, my understanding is this should not be the case - also, everything worked until very recently. Any ideas?
rust/src/delta_datafusion.rs
Outdated
@@ -310,6 +448,7 @@ fn to_scalar_value(stat_val: &serde_json::Value) -> Option<datafusion::scalar::S
        }
    }
    serde_json::Value::String(s) => Some(ScalarValue::from(s.as_str())),
    // TODO is it permissible to encode arrays / objects as partition values?
For the py37 build error, it's because pyarrow 9 stopped releasing manylinux2010 wheels. Compare https://pypi.org/project/pyarrow/9.0.0/#files with https://pypi.org/project/pyarrow/8.0.0/#files. We might need to bump our manylinux support to 2014 too :( @wjones127 is the manylinux2010 support removal in the arrow 9 release expected?
rust/src/delta_datafusion.rs
Outdated
.zip(files_to_prune.into_iter())
.filter_map(|(action, prune_file)| {
    if prune_file {
        return None;
    }
might be worth specializing this so we are not paying the penalty of iterating and checking the prune-file array when there is no file to prune. For example, the code below can be abstracted into a function, and then we have two kinds of iterator loops that call it: one zips with files_to_prune, the other simply iterates through get_state().files().
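To make the suggestion concrete, something like the following rough sketch - file_from_action is a hypothetical helper and the exact types are assumptions, not this PR's actual code:

// hypothetical helper: turn a delta log Add action into a datafusion PartitionedFile
fn file_from_action(action: &action::Add) -> PartitionedFile {
    unimplemented!("sketch only")
}

fn collect_files(
    actions: &[action::Add],
    files_to_prune: Option<Vec<bool>>,
) -> Vec<PartitionedFile> {
    match files_to_prune {
        // a predicate was supplied: zip against the prune mask and skip pruned files
        Some(mask) => actions
            .iter()
            .zip(mask)
            .filter(|(_, prune)| !prune)
            .map(|(action, _)| file_from_action(action))
            .collect(),
        // no predicate: plain iteration, no per-file mask check to pay for
        None => actions.iter().map(file_from_action).collect(),
    }
}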
Makes sense! Updated it, and took the opportunity to batch files by partition values. That's not going to be optimal in many cases, but by default hopefully better than each file in its own (datafusion-)partition.
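For reference, a hedged sketch of that grouping idea - the key type and the partition_key helper are illustrative assumptions (file_from_action as in the earlier sketch), not the code that landed:

use std::collections::HashMap;

// build a deterministic grouping key from a file's partition values
// (sorted so that map iteration order does not matter)
fn partition_key(action: &action::Add) -> Vec<(String, Option<String>)> {
    let mut kv: Vec<_> = action
        .partition_values
        .iter()
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect();
    kv.sort();
    kv
}

let mut groups: HashMap<Vec<(String, Option<String>)>, Vec<PartitionedFile>> = HashMap::new();
for action in actions {
    groups
        .entry(partition_key(action))
        .or_default()
        .push(file_from_action(action));
}
// each group then becomes one (datafusion-)partition of the scan
let file_partitions: Vec<Vec<PartitionedFile>> = groups.into_values().collect();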
impl ExecutionPlanVisitor for ExecutionMetricsCollector {
    type Error = DataFusionError;

    fn pre_visit(
        &mut self,
        plan: &dyn ExecutionPlan,
    ) -> std::result::Result<bool, Self::Error> {
        if let Some(exec) = plan.as_any().downcast_ref::<ParquetExec>() {
            let files = get_scanned_files(exec);
            self.scanned_files.extend(files);
        }
        Ok(true)
    }
}
Just did this for testing right now, but I would like to use something like this to collect the statistics we need for proper conflict resolution.
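For anyone following along, this is roughly how such a collector gets driven - a hedged sketch assuming plan is an Arc<dyn ExecutionPlan> and the collector derives Default:

use datafusion::physical_plan::accept;

let mut metrics = ExecutionMetricsCollector::default();
accept(plan.as_ref(), &mut metrics)?;
// metrics.scanned_files now holds the files touched by every ParquetExec in the plan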
//         Statistics::default()
//     }
// }
impl PruningStatistics for delta::DeltaTable {
Implementing this made me think of how to best store stats, which is an ongoing topic (#454)... Maybe PruningStatistics's view on the world helps?
Something along the lines of
pub struct Stats {
    files: HashMap<Path, (usize, PartitionedFile)>,
    max_values: RecordBatch,
    min_values: RecordBatch,
    ...
}
... and then implement some convenience accessors to get data per column / per file.
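Hypothetical accessors over that layout, assuming Path here is an owned, hashable path type - names and signatures are illustrative only:

impl Stats {
    /// min values for one column, one array entry per file
    fn column_min_values(&self, column: &str) -> Option<ArrayRef> {
        let idx = self.min_values.schema().index_of(column).ok()?;
        Some(self.min_values.column(idx).clone())
    }

    /// position of a file's stats within the arrays
    fn file_index(&self, path: &Path) -> Option<usize> {
        self.files.get(path).map(|(idx, _)| *idx)
    }
}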
yep, switching to a columnar format will help in many places :)
rust/src/delta_datafusion.rs
Outdated
let pruning_predicate = PruningPredicate::try_new(predicate, schema.clone())?;
let files_to_prune = pruning_predicate.prune(self)?;
I was a bit concerned about interpreting the filters properly, but luckily the hard part was done 😆.
rust/src/delta_datafusion.rs
Outdated
"Failed to evaluate table pruning predicates.".to_string(), | ||
) | ||
})?? | ||
.for_each(|f| { |
This iterates through the vector again, no? I think we should be able to perform the hashmap insertion within the filter_map callback above.
absolutely! we avoid that extra iteration now.
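For the record, the shape of that single pass - a hedged sketch reusing the hypothetical partition_key / file_from_action helpers and the boolean prune mask from above, not the exact code that landed:

let mut groups: HashMap<Vec<(String, Option<String>)>, Vec<PartitionedFile>> = HashMap::new();
actions
    .iter()
    .zip(mask)
    .for_each(|(action, prune)| {
        // pruned files are skipped and surviving files grouped in one iteration
        if !prune {
            groups
                .entry(partition_key(action))
                .or_default()
                .push(file_from_action(action));
        }
    });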
Thanks @roeap, this is a great demonstration of datafusion's extensibility :) The rest looks good to me, left a very minor comment.
Yes. Sorry, I saw that in the release notes and didn't connect the dots. Odd though, I'm unsure why it's being installed in the job that's supposed to only install PyArrow 4.x. For some reason...
LGTM!
Description
This PR deepens the integration with datafusion by leveraging the column statistics from the delta log to prune the files that need to be scanned when additional constraints are supplied. Luckily datafusion provides some excellent utilities to implement this. Specifically, we implement PruningStatistics for DeltaTable, use that with PruningPredicate, and the rest just kind of happens 😆 (rough sketch below).

I still need to do cleanup and more testing, as well as have a look at how much effort and dependency growth it would be to adopt datafusion expressions for our partition / stats handling. However, if there is feedback as to whether this is the right way to go, I'd be happy to hear it.
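A rough sketch of that shape - the trait is datafusion's physical_optimizer::pruning::PruningStatistics, but the bodies here are elided placeholders rather than the code in this diff:

use datafusion::arrow::array::ArrayRef;
use datafusion::logical_plan::Column;
use datafusion::physical_optimizer::pruning::{PruningPredicate, PruningStatistics};

impl PruningStatistics for delta::DeltaTable {
    // one array entry per file, built from the column stats in the delta log
    fn min_values(&self, _column: &Column) -> Option<ArrayRef> {
        todo!("read per-file min stats from the log")
    }
    fn max_values(&self, _column: &Column) -> Option<ArrayRef> {
        todo!("read per-file max stats from the log")
    }
    fn num_containers(&self) -> usize {
        self.get_state().files().len()
    }
    fn null_counts(&self, _column: &Column) -> Option<ArrayRef> {
        todo!("read per-file null counts from the log")
    }
}

// given a filter expression and the table's arrow schema, datafusion then
// evaluates the predicate once per file (container):
let pruning_predicate = PruningPredicate::try_new(predicate, schema.clone())?;
let keep: Vec<bool> = pruning_predicate.prune(&table)?; // one flag per file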
cc @houqp @wjones127 - maybe even @tustvold has some feedback? :)
Related Issue(s)
Especially the statistics recorded during scans should get us a lot closer to finishing #632
Documentation