feat: more economic data skipping with datafusion #2772

roeap · 2024-08-14T19:23:07Z

Description

This PR addresses some inefficiencies when creating stats for file pruning using the PruningPredicate. Rather than generating stats from file actions, which we generate ad hoc from the underlying arrow data, we now read the stats directly from the raw file data, i.e. avoiding expensive round trips through file actions.
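As a rough illustration of the idea (not the code in this PR), pruning can work directly on column-oriented min/max stats without materializing one action object per file. The sketch below uses only the standard library; `FileStatsBatch` and its methods are hypothetical names, and a single integer column stands in for the arrow data.

```rust
/// Hypothetical columnar view of per-file statistics for one integer column.
/// Index `i` in every vector describes the same data file; `None` means the
/// statistic is missing for that file.
struct FileStatsBatch {
    paths: Vec<String>,
    min_values: Vec<Option<i64>>,
    max_values: Vec<Option<i64>>,
}

impl FileStatsBatch {
    /// Keep-mask for the predicate `column > threshold`.
    /// A file may be skipped only when its max is known and `<= threshold`.
    fn prune_greater_than(&self, threshold: i64) -> Vec<bool> {
        self.max_values
            .iter()
            .map(|max| match max {
                Some(max) => *max > threshold,
                None => true, // unknown stats: the file must be kept
            })
            .collect()
    }

    /// Keep-mask for the predicate `column < threshold`.
    /// A file may be skipped only when its min is known and `>= threshold`.
    fn prune_less_than(&self, threshold: i64) -> Vec<bool> {
        self.min_values
            .iter()
            .map(|min| match min {
                Some(min) => *min < threshold,
                None => true, // unknown stats: the file must be kept
            })
            .collect()
    }

    /// Paths of the files that survive pruning for `column > threshold`.
    fn files_to_scan(&self, threshold: i64) -> Vec<&str> {
        self.paths
            .iter()
            .zip(self.prune_greater_than(threshold))
            .filter(|(_, keep)| *keep)
            .map(|(path, _)| path.as_str())
            .collect()
    }
}
```

The point of the layout is that each predicate is one pass over a stats column, and files with missing stats are always kept.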

To make this work, we needed to parse partition values during log replay (adding the partitionValues_parsed field to action data). As a follow-up, we should make more use of the parsed partition values #2771
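For readers unfamiliar with the distinction: partition values are recorded as strings in the log, so "parsed" here means converting them to typed values according to the partition schema. A minimal sketch of that conversion follows, assuming a simplified three-type schema; `PartitionType`, `PartitionValue` and `parse_partition_values` are hypothetical names and not part of this PR.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified set of partition column types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PartitionType {
    Integer,
    Boolean,
    String,
}

/// Typed counterpart of a raw string partition value.
#[derive(Debug, Clone, PartialEq)]
enum PartitionValue {
    Integer(i64),
    Boolean(bool),
    String(String),
    Null,
}

/// Converts raw string partition values into typed values, in schema order.
/// A column that is absent or explicitly null becomes `PartitionValue::Null`.
fn parse_partition_values(
    raw: &HashMap<String, Option<String>>,
    schema: &[(&str, PartitionType)],
) -> Result<Vec<(String, PartitionValue)>, String> {
    schema
        .iter()
        .map(|(name, ty)| {
            let value = match raw.get(*name).and_then(|v| v.as_deref()) {
                None => PartitionValue::Null,
                Some(s) => match ty {
                    PartitionType::Integer => PartitionValue::Integer(
                        s.parse::<i64>().map_err(|e| format!("{name}: {e}"))?,
                    ),
                    PartitionType::Boolean => PartitionValue::Boolean(
                        s.parse::<bool>().map_err(|e| format!("{name}: {e}"))?,
                    ),
                    PartitionType::String => PartitionValue::String(s.to_string()),
                },
            };
            Ok((name.to_string(), value))
        })
        .collect()
}
```

Doing this once during log replay means later consumers can compare typed values instead of re-parsing strings for every file.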

Related Issue(s)

Documentation

ion-elgreco · 2024-08-14T19:31:21Z

crates/core/src/kernel/snapshot/log_data.rs

+                    .ok()?;
+                results.push(result.record_batch().clone());
+            }
+            let batch = concat_batches(results[0].schema_ref(), &results).ok()?;


Isn't it cheaper to concat first and then run it through the Evaluator?

Probably; at the very least, that would give more opportunity for parallelism. In fact, we should already concatenate the batches on EagerSnapshot. However, the schemas of these batches are not yet normalized, so we cannot concatenate them yet.

For this we need to do either some internal casting/filtering in the log replay, or even better do column selection when reading the checkpoints ...
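To make the normalization point concrete, here is a standard-library-only sketch of projecting batches with differing columns onto one target column list before concatenating them. It is an illustration under simplifying assumptions (a `Batch` is a map of nullable integer columns; `normalize_and_concat` is a hypothetical helper), not how the log replay is implemented.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a record batch: column name -> nullable values.
/// All columns within one batch are assumed to have the same length.
type Batch = BTreeMap<String, Vec<Option<i64>>>;

/// Number of rows in a batch, taken from its first column (0 if empty).
fn num_rows(batch: &Batch) -> usize {
    batch.values().next().map_or(0, |col| col.len())
}

/// Projects every batch onto `columns` and concatenates the results.
/// Columns not listed are dropped; listed columns missing from a batch
/// are filled with nulls so that every output column has the same length.
fn normalize_and_concat(batches: &[Batch], columns: &[&str]) -> Batch {
    let mut out = Batch::new();
    for name in columns {
        let mut merged = Vec::new();
        for batch in batches {
            match batch.get(*name) {
                Some(col) => merged.extend(col.iter().copied()),
                None => merged.extend(std::iter::repeat(None).take(num_rows(batch))),
            }
        }
        out.insert(name.to_string(), merged);
    }
    out
}
```

Once every batch shares the same columns, a single concatenated batch can be evaluated in one pass instead of once per input batch.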

rtyler

manually tested, some rustdocs can come later 😄

rtyler

🙃 i'm sure it's fine 🙃

roeap requested review from wjones127, rtyler, hntd187 and ion-elgreco as code owners August 14, 2024 19:23

github-actions bot added the binding/rust Issues for the Rust crate label Aug 14, 2024

roeap force-pushed the feature/data-skipping branch from e1557b5 to cdbd392 Compare August 14, 2024 19:25

ion-elgreco reviewed Aug 14, 2024


roeap enabled auto-merge August 14, 2024 20:09

roeap added 6 commits August 14, 2024 22:21

:test: add stats parsing test

d14cc0a

test: geneate test add actions with partition values

c094f04

feat: parse partition values during log replay

bbd59e9

feat: read PruningStatistics from files batch

4e3d4fe

feat: use kernel expression evaluator

fd4ff7f

fix: allow missing file stats in log replay

e688ad4

rtyler previously approved these changes Aug 14, 2024


roeap dismissed rtyler’s stale review via e688ad4 August 14, 2024 20:23

roeap force-pushed the feature/data-skipping branch from 928b10c to e688ad4 Compare August 14, 2024 20:23

rtyler approved these changes Aug 14, 2024


rtyler disabled auto-merge August 14, 2024 20:26

rtyler enabled auto-merge August 14, 2024 20:26

rtyler added this pull request to the merge queue Aug 14, 2024

Merged via the queue into delta-io:main with commit d3a7967 Aug 14, 2024
18 checks passed

roeap deleted the feature/data-skipping branch August 14, 2024 20:54

Tom-Newton mentioned this pull request Aug 14, 2024

Transaction log parsing performance regression #2760

Closed
