-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Add new stats pruning helpers to allow combining partition values in file level stats #16139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@xudong963 any chance you can review this since you've already approved the same code (with less tests!) in the original PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropped some suggestion and comments.
datafusion/common/src/pruning.rs
Outdated
) -> Self { | ||
let num_containers = partition_values.len(); | ||
let partition_schema = Arc::new(Schema::new(partition_fields)); | ||
let mut partition_valeus_by_column = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let mut partition_valeus_by_column = | |
let mut partition_values_by_column = |
/// The outer vector represents the containers while the inner | ||
/// vector represents the partition values for each column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
constructor accepts partition_values as a Vec, documented as “outer vector represents the containers while the inner vector represents the partition values for each column.” In code however, each inner Vec is treated as the values for one container, then transpose that into column-major storage.
The phrasing “inner vector represents the partition values for each column” can be read as “one column’s values across containers.”
datafusion/common/src/pruning.rs
Outdated
fn min_values(&self, column: &Column) -> Option<ArrayRef> { | ||
let index = self.schema.index_of(column.name()).ok()?; | ||
if self.statistics.iter().any(|s| { | ||
s.column_statistics | ||
.get(index) | ||
.is_some_and(|stat| stat.min_value.is_exact().unwrap_or(false)) | ||
}) { | ||
match ScalarValue::iter_to_array(self.statistics.iter().map(|s| { | ||
s.column_statistics | ||
.get(index) | ||
.and_then(|stat| { | ||
if let Precision::Exact(min) = &stat.min_value { | ||
Some(min.clone()) | ||
} else { | ||
None | ||
} | ||
}) | ||
.unwrap_or(ScalarValue::Null) | ||
})) { | ||
Ok(array) => Some(array), | ||
Err(_) => { | ||
log::warn!( | ||
"Failed to convert min values to array for column {}", | ||
column.name() | ||
); | ||
None | ||
} | ||
} | ||
} else { | ||
None | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both PrunableStatistics::min_values
and max_values
walk the same steps:
- Find the column index in the schema.
- Check whether any
Statistics
entry has an “exact” value for that column. - Iterate over all
Statistics
, pulling out the exact values or substitutingScalarValue::Null
. - Call
ScalarValue::iter_to_array(...)
and log or returnNone
on error.
By lifting steps (2)–(4) into a helper, we:
- Eliminate duplicate code in each method
- Centralize error handling and logging
- Make future changes (e.g. using a different logging framework) in one place
datafusion/common/src/pruning.rs
Outdated
let mut contained = Vec::with_capacity(self.partition_values.len()); | ||
for partition_value in partition_values { | ||
let contained_value = if values.contains(partition_value) { | ||
Some(true) | ||
} else { | ||
Some(false) | ||
}; | ||
contained.push(contained_value); | ||
} | ||
let array = BooleanArray::from(contained); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of explicit loops; would simplifying to .map(...) chains followed by collect() be better?
let array = BooleanArray::from(
partition_values
.iter()
.map(|pv| Some(values.contains(pv)))
.collect::<Vec<_>>()
);
Benefits:
- Eliminates manual push logic
- More concise: transforms each pv into a boolean directly
- Clearly shows “map input → output” intent
fe0b8f1
to
8b7089f
Compare
Thank you @kosiew that was great feedback 😄 |
8b7089f
to
27cdc22
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
datafusion/common/src/pruning.rs
Outdated
fn min_values(&self, column: &Column) -> Option<ArrayRef> { | ||
let index = self.partition_schema.index_of(column.name()).ok()?; | ||
let partition_values = self.partition_values.get(index)?; | ||
match ScalarValue::iter_to_array(partition_values.iter().cloned()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is always sad to me that this API requires a clone of ScalarValue 😢 (nothing you did in this PR)
…ngStatistics for partition + file level stats pruning
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
This reverts commit 8b6f1a2.
I think you will need to do the gitbox thing with your apache account (when it is activated) and then you will be an official committer |
5094e27
to
5776221
Compare
} | ||
|
||
/// Prune a set of containers represented by their statistics. | ||
/// Each [`Statistics`] represents a container (e.g. a file or a partition of files). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does a partition of files
mean? A FileGroup
or just a collection of some files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's any collection of files. Basically it's up to the caller to define what the container is: it could be a single file or a FileGroup
. I do think it'd be helpful to link to possible sources of statistics here (e.g. link to [FileGroup
]).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will make a follow on PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow on PR: #16213
/// If multiple statistics have information for the same column, | ||
/// the first one is returned without any regard for completeness or accuracy. | ||
/// That is: if the first statistics has information for a column, even if it is incomplete, | ||
/// that is returned even if a later statistics has more complete information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious about why not prune based on all the statistics that have information, one by one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we could do that, but I'm not really sure where we'd encounter that.
The immediate use case here is having partition values for partition columns and file level statistics for the rest, so there is no overlap. But I had to choose some behavior for this implementation so I went with "first". I guess "both via AND" is reasonable, it just seems a bit harder to explain / get right. But I can give it a shot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we can also just add a todo and maintain the current status in the PR
Thanks @adriangb and @xudong963 -- this looks good to me 🚀 |
A step towards #16014