feat: more economic data skipping with datafusion #2772
    .ok()?;
    results.push(result.record_batch().clone());
}
let batch = concat_batches(results[0].schema_ref(), &results).ok()?;
Isn't it cheaper to concat first and then run it through the Evaluator?
Probably; at the very least that would give more opportunity for parallelism. In fact, we should already concatenate the batches on `EagerSnapshot`. However, the schemas of these batches are not yet normalized, so we cannot concatenate them yet.
For this we need to either do some internal casting/filtering in the log replay or, even better, do column selection when reading the checkpoints ...
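To illustrate the ordering discussed here, the following is a minimal, std-only sketch of "normalize via column selection, concatenate once, then evaluate once". It is not the delta-rs or arrow API: `Batch`, `select_columns`, `concat`, and `evaluate_once` are hypothetical stand-ins, and a real implementation would work on arrow `RecordBatch`es and also handle type casting.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a record batch: column name -> values.
type Batch = BTreeMap<String, Vec<i64>>;

/// Keep only the selected columns so that every batch ends up with the same schema.
/// Returns `None` if the batch lacks one of the selected columns.
fn select_columns(batch: &Batch, columns: &[&str]) -> Option<Batch> {
    columns
        .iter()
        .map(|name| batch.get(*name).map(|v| (name.to_string(), v.clone())))
        .collect()
}

/// Concatenate batches that share a schema into a single batch.
/// Returns `None` for an empty input or on a schema mismatch.
fn concat(batches: &[Batch]) -> Option<Batch> {
    let mut out = batches.first()?.clone();
    for batch in &batches[1..] {
        if batch.len() != out.len() {
            return None;
        }
        for (name, values) in batch {
            out.get_mut(name)?.extend_from_slice(values);
        }
    }
    Some(out)
}

/// Normalize first, concatenate once, then run the evaluator a single time
/// over the combined batch instead of once per input batch.
fn evaluate_once(
    batches: &[Batch],
    columns: &[&str],
    eval: impl Fn(&Batch) -> Vec<bool>,
) -> Option<Vec<bool>> {
    let normalized: Option<Vec<Batch>> =
        batches.iter().map(|b| select_columns(b, columns)).collect();
    let combined = concat(&normalized?)?;
    Some(eval(&combined))
}
```

In this sketch, batches with differing extra columns can still be combined because the selection step drops everything the evaluator does not need, which is the role column selection at checkpoint read time would play.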
manually tested, some rustdocs can come later 😄
🙃 i'm sure it's fine 🙃
Description
This PR addresses some inefficiencies when creating stats for file pruning using the `PruningPredicate`. Rather than generating stats from file actions, which we generate ad hoc from the underlying arrow data, we now read the stats directly from the raw file data, i.e. avoiding expensive round trips through file actions.
To make this work, we needed to include parsing partition values (adding the `partitionValues_parsed` field to action data) during log replay. As a follow-up, we should make more use of the parsed partition values #2771
Related Issue(s)
Documentation
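As a rough illustration of the idea in the description (pruning files from columnar statistics rather than from per-file action objects), here is a std-only sketch. It is an assumption-laden toy, not the code in this PR and not the DataFusion `PruningPredicate` API: `FileStats` and `prune_between` are hypothetical names, and only a single integer column with a range predicate is modelled.

```rust
/// Hypothetical columnar view of per-file statistics for one column:
/// entry `i` of each vector belongs to file `i`; `None` means the stat is missing.
struct FileStats {
    min: Vec<Option<i64>>,
    max: Vec<Option<i64>>,
}

/// Build a keep-mask for the predicate `lo <= col AND col <= hi`.
/// A file is skipped only when its stats prove that no row can match;
/// files with missing stats are conservatively kept.
fn prune_between(stats: &FileStats, lo: i64, hi: i64) -> Vec<bool> {
    stats
        .min
        .iter()
        .zip(stats.max.iter())
        .map(|(min, max)| {
            // All values in the file are below the requested range.
            let below = matches!(max, Some(m) if *m < lo);
            // All values in the file are above the requested range.
            let above = matches!(min, Some(m) if *m > hi);
            !(below || above)
        })
        .collect()
}
```

The mask is computed in one pass over the stats columns, without materializing a per-file struct first, which is the kind of round trip the description says this PR avoids.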