
Prune scanned files on column stats #724

Merged · roeap merged 11 commits from commands into delta-io:main · Aug 8, 2022
Conversation

roeap (Collaborator) commented Aug 6, 2022:

Description

This PR deepens the integration with datafusion by leveraging the column statistics from the delta log to prune the files that need to be scanned when additional constraints are supplied. Luckily, datafusion provides some excellent utilities to implement this. Specifically, we implement `PruningStatistics` for `DeltaTable`, use that with `PruningPredicate`, and the rest just kind of happens 😆.

I still need to do some cleanup and more testing, and to look at how much effort and dependency growth it would take to adopt datafusion expressions for our partition / stats handling. However, if there is feedback on whether this is the right way to go, I'd be happy to hear it.
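For illustration, here is a minimal sketch of the wiring, not the actual delta-rs implementation: a stand-in stats holder implements datafusion's `PruningStatistics`, and `PruningPredicate` turns a filter expression into a per-file keep/skip mask. Module paths follow the datafusion releases of the time and may differ between versions.

```rust
use std::collections::HashMap;

use datafusion::arrow::array::ArrayRef;
use datafusion::arrow::datatypes::SchemaRef;
use datafusion::common::Column;
use datafusion::error::Result;
use datafusion::physical_optimizer::pruning::{PruningPredicate, PruningStatistics};
use datafusion::prelude::Expr;

/// Stand-in stats holder: one "container" per data file, and each
/// min/max array holds one entry per file, keyed by column name.
struct FileStats {
    min_values: HashMap<String, ArrayRef>,
    max_values: HashMap<String, ArrayRef>,
    num_files: usize,
}

impl PruningStatistics for FileStats {
    fn min_values(&self, column: &Column) -> Option<ArrayRef> {
        self.min_values.get(&column.name).cloned()
    }
    fn max_values(&self, column: &Column) -> Option<ArrayRef> {
        self.max_values.get(&column.name).cloned()
    }
    fn num_containers(&self) -> usize {
        self.num_files
    }
    fn null_counts(&self, _column: &Column) -> Option<ArrayRef> {
        None // "unknown" keeps pruning conservative
    }
}

/// One bool per file; `true` means the file may contain matching rows.
fn files_to_scan(stats: &FileStats, predicate: Expr, schema: SchemaRef) -> Result<Vec<bool>> {
    let pruning_predicate = PruningPredicate::try_new(predicate, schema)?;
    pruning_predicate.prune(stats)
}
```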

cc @houqp @wjones127 - maybe even @tustvold has some feedback? :)

Related Issue(s)

In particular, the statistics recorded during scans should get us a lot closer to finishing #632.

Documentation

roeap marked this pull request as a draft on August 6, 2022.

roeap (Collaborator, Author) commented Aug 7, 2022:

@wjones127 - the Python 3.7 builds seem to have started failing here and in other PRs. It seems it tries to build pyarrow from source again and fails to find Arrow C++. While we could install it, my understanding is that this should not be necessary; also, everything worked until very recently. Any ideas?

Review thread on `fn to_scalar_value` (diff `@@ -310,6 +448,7 @@`):

```rust
        serde_json::Value::String(s) => Some(ScalarValue::from(s.as_str())),
        // TODO is it permissible to encode arrays / objects as partition values?
```
houqp (Member) commented Aug 7, 2022:

For the py37 build error, it's because pyarrow 9 stopped releasing manylinux2010 wheels. Compare https://pypi.org/project/pyarrow/9.0.0/#files with https://pypi.org/project/pyarrow/8.0.0/#files. We might need to bump our manylinux support to 2014 too :(

@wjones127 is the manylinux2010 support removal in arrow 9 release expected?

Review thread on lines 365 to 369:

```rust
    .zip(files_to_prune.into_iter())
    .filter_map(|(action, prune_file)| {
        if prune_file {
            return None;
        }
```
houqp (Member):

It might be worth specializing this so we are not paying the penalty of iterating over and checking the prune-file array when there is no file to prune. For example, the code below can be abstracted into a function, and then we have two types of iterator loops that call it: one loop zips with files_to_prune, the other simply iterates through get_state().files().
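A hedged sketch of that shape, with stand-in types rather than the actual delta-rs ones:

```rust
/// Stand-in for the log action describing a data file.
struct Action {
    path: String,
}

/// The shared per-file work, pulled out into one function.
fn to_scan_file(action: &Action) -> String {
    action.path.clone()
}

/// Only zip against the prune mask when a predicate actually produced one;
/// the unfiltered scan iterates the actions directly, paying no mask cost.
fn files_for_scan(actions: &[Action], files_to_prune: Option<Vec<bool>>) -> Vec<String> {
    match files_to_prune {
        Some(mask) => actions
            .iter()
            .zip(mask)
            .filter_map(|(action, prune)| (!prune).then(|| to_scan_file(action)))
            .collect(),
        None => actions.iter().map(to_scan_file).collect(),
    }
}
```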

roeap (Collaborator, Author):

Makes sense! Updated it, and took the opportunity to batch files by partition values. That's not going to be optimal in many cases, but by default it is hopefully better than putting each file in its own (datafusion) partition.
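For illustration, the grouping could be as simple as the following sketch (stand-in types again, not the delta-rs code):

```rust
use std::collections::HashMap;

/// Group file paths that share partition values, so each group becomes one
/// (datafusion) scan partition instead of one partition per file.
fn batch_by_partition_values(
    files: Vec<(String, Vec<String>)>, // (path, partition values)
) -> Vec<Vec<String>> {
    let mut groups: HashMap<Vec<String>, Vec<String>> = HashMap::new();
    for (path, values) in files {
        groups.entry(values).or_default().push(path);
    }
    groups.into_values().collect()
}
```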

roeap marked this pull request as ready for review on August 7, 2022.
Review thread on lines +51 to +64:

```rust
impl ExecutionPlanVisitor for ExecutionMetricsCollector {
    type Error = DataFusionError;

    fn pre_visit(
        &mut self,
        plan: &dyn ExecutionPlan,
    ) -> std::result::Result<bool, Self::Error> {
        if let Some(exec) = plan.as_any().downcast_ref::<ParquetExec>() {
            let files = get_scanned_files(exec);
            self.scanned_files.extend(files);
        }
        Ok(true)
    }
}
```
roeap (Collaborator, Author):

Just did this for testing right now, but I would like to use something like this to collect the statistics we need for proper conflict resolution.
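For reference, such a visitor is driven with datafusion's plan-walking helper; a minimal usage sketch, assuming the collector above derives `Default`:

```rust
use std::sync::Arc;

use datafusion::error::DataFusionError;
use datafusion::physical_plan::{accept, ExecutionPlan};

/// Walk the executed plan tree depth-first; `pre_visit` fires on every
/// node, so every ParquetExec contributes its scanned files.
fn collect_metrics(
    plan: &Arc<dyn ExecutionPlan>,
) -> Result<ExecutionMetricsCollector, DataFusionError> {
    // Assumption: ExecutionMetricsCollector implements Default.
    let mut collector = ExecutionMetricsCollector::default();
    accept(plan.as_ref(), &mut collector)?;
    Ok(collector)
}
```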

Review thread on the new `PruningStatistics` implementation:

```rust
impl PruningStatistics for delta::DeltaTable {
```

roeap (Collaborator, Author) commented Aug 7, 2022:
Implementing this made me think about how best to store stats, which is an ongoing topic (#454)... Maybe `PruningStatistics`' view on this helps?

Something along the lines of:

```rust
pub struct Stats {
    files: HashMap<Path, (usize, PartitionedFile)>,
    max_values: RecordBatch,
    min_values: RecordBatch,
    ...
}
```

... and then implement some convenience accessors to get data per column / per file.
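A hedged sketch of one such accessor, on a trimmed version of the struct above (the `usize` is taken to be the file's row index into the stats batches, and `PartitionedFile` is omitted for brevity):

```rust
use std::collections::HashMap;

use datafusion::arrow::record_batch::RecordBatch;
use datafusion::scalar::ScalarValue;

/// Trimmed stand-in for the Stats sketch above.
struct Stats {
    files: HashMap<String, usize>, // path -> row index into the batches
    min_values: RecordBatch,
    max_values: RecordBatch,
}

impl Stats {
    /// Minimum value of `column` for the file stored under `path`.
    fn min_value(&self, path: &str, column: &str) -> Option<ScalarValue> {
        let row = *self.files.get(path)?;
        let array = self.min_values.column_by_name(column)?;
        ScalarValue::try_from_array(array, row).ok()
    }
}
```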

houqp (Member):
yep, switching to columnar format will help in many places :)

Review thread on lines 348 to 349:

```rust
let pruning_predicate = PruningPredicate::try_new(predicate, schema.clone())?;
let files_to_prune = pruning_predicate.prune(self)?;
```
roeap (Collaborator, Author):
I was a bit concerned about interpreting the filters properly, but luckily the hard part was done 😆.

roeap requested a review from houqp on August 7, 2022.
"Failed to evaluate table pruning predicates.".to_string(),
)
})??
.for_each(|f| {
houqp (Member):

This iterates through the vector again, no? I think we should be able to perform the hashmap insertion within the filter_map callback above.

roeap (Collaborator, Author):

Absolutely! We avoid that extra iteration now.
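A hedged sketch of that single-pass shape, with stand-in types:

```rust
use std::collections::HashMap;

/// Stand-in for the add action describing a data file.
struct FileAction {
    path: String,
    partition_values: Vec<String>,
}

/// Filter on the prune mask and insert into the per-partition map in the
/// same pass, instead of collecting and iterating a second time.
fn group_kept_files(
    actions: Vec<FileAction>,
    keep_mask: Vec<bool>,
) -> HashMap<Vec<String>, Vec<FileAction>> {
    let mut partitions: HashMap<Vec<String>, Vec<FileAction>> = HashMap::new();
    actions
        .into_iter()
        .zip(keep_mask)
        .filter(|(_, keep)| *keep)
        .for_each(|(file, _)| {
            partitions
                .entry(file.partition_values.clone())
                .or_default()
                .push(file)
        });
    partitions
}
```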

houqp (Member) commented Aug 8, 2022:

Thanks @roeap, this is a great demonstration of datafusion's extensibility :) The rest looks good to me; I left a very minor comment.

wjones127 (Collaborator) commented:

> @wjones127 is the manylinux2010 support removal in arrow 9 release expected?

Yes. Sorry I saw that in the release notes and didn't connect the dots.

Odd, though; I'm unsure why it's being installed in the job that's supposed to only install PyArrow 4.x. For some reason `maturin develop` is doing a `pip --force-reinstall`, and that's causing an upgrade. I need to look into this more.

houqp (Member) left a review:

LGTM!

roeap merged commit f9816b0 into delta-io:main on Aug 8, 2022, and deleted the commands branch on August 8, 2022.