Skip to content

Conversation

adriangb
Copy link
Contributor

A step towards #16014

@github-actions github-actions bot added the common Related to common crate label May 21, 2025
@adriangb
Copy link
Contributor Author

@xudong963 any chance you can review this since you've already approved the same code (with less tests!) in the original PR?

Copy link
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped some suggestion and comments.

) -> Self {
let num_containers = partition_values.len();
let partition_schema = Arc::new(Schema::new(partition_fields));
let mut partition_valeus_by_column =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let mut partition_valeus_by_column =
let mut partition_values_by_column =

Comment on lines +160 to +161
/// The outer vector represents the containers while the inner
/// vector represents the partition values for each column.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

constructor accepts partition_values as a Vec, documented as “outer vector represents the containers while the inner vector represents the partition values for each column.” In code however, each inner Vec is treated as the values for one container, then transpose that into column-major storage.

The phrasing “inner vector represents the partition values for each column” can be read as “one column’s values across containers.”

Comment on lines 264 to 288
fn min_values(&self, column: &Column) -> Option<ArrayRef> {
let index = self.schema.index_of(column.name()).ok()?;
if self.statistics.iter().any(|s| {
s.column_statistics
.get(index)
.is_some_and(|stat| stat.min_value.is_exact().unwrap_or(false))
}) {
match ScalarValue::iter_to_array(self.statistics.iter().map(|s| {
s.column_statistics
.get(index)
.and_then(|stat| {
if let Precision::Exact(min) = &stat.min_value {
Some(min.clone())
} else {
None
}
})
.unwrap_or(ScalarValue::Null)
})) {
Ok(array) => Some(array),
Err(_) => {
log::warn!(
"Failed to convert min values to array for column {}",
column.name()
);
None
}
}
} else {
None
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both PrunableStatistics::min_values and max_values walk the same steps:

  1. Find the column index in the schema.
  2. Check whether any Statistics entry has an “exact” value for that column.
  3. Iterate over all Statistics, pulling out the exact values or substituting ScalarValue::Null.
  4. Call ScalarValue::iter_to_array(...) and log or return None on error.

By lifting steps (2)–(4) into a helper, we:

  • Eliminate duplicate code in each method
  • Centralize error handling and logging
  • Make future changes (e.g. using a different logging framework) in one place

Comment on lines 228 to 237
let mut contained = Vec::with_capacity(self.partition_values.len());
for partition_value in partition_values {
let contained_value = if values.contains(partition_value) {
Some(true)
} else {
Some(false)
};
contained.push(contained_value);
}
let array = BooleanArray::from(contained);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of explicit loops; would simplifying to .map(...) chains followed by collect() be better?

let array = BooleanArray::from(
    partition_values
        .iter()
        .map(|pv| Some(values.contains(pv)))
        .collect::<Vec<_>>()
);

Benefits:

  • Eliminates manual push logic
  • More concise: transforms each pv into a boolean directly
  • Clearly shows “map input → output” intent

@adriangb adriangb force-pushed the add-pruning-structs branch from fe0b8f1 to 8b7089f Compare May 23, 2025 13:57
@adriangb
Copy link
Contributor Author

Thank you @kosiew that was great feedback 😄

@adriangb adriangb force-pushed the add-pruning-structs branch from 8b7089f to 27cdc22 Compare May 26, 2025 16:35
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @adriangb and @kosiew -- this looks like great work to me

fn min_values(&self, column: &Column) -> Option<ArrayRef> {
let index = self.partition_schema.index_of(column.name()).ok()?;
let partition_values = self.partition_values.get(index)?;
match ScalarValue::iter_to_array(partition_values.iter().cloned()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is always sad to me that this API requires a clone of ScalarValue 😢 (nothing you did in this PR)

@xudong963
Copy link
Member

xudong963 commented May 28, 2025

Sorry for late, I'll check tomorrow (feel free to directly invite me to review by the button, then I'll notice more)

image

@xudong963 xudong963 self-requested a review May 28, 2025 16:20
@adriangb
Copy link
Contributor Author

Sorry for late, I'll check tomorrow (feel free to directly invite me to review by the button, then I'll notice more)

I'm not able to request reviews. I think only commiters can do that and I'm not a commiter (yet).

image

adriangb and others added 6 commits May 28, 2025 12:16
@alamb
Copy link
Contributor

alamb commented May 28, 2025

I'm not able to request reviews. I think only commiters can do that and I'm not a commiter (yet).

I think you will need to do the gitbox thing with your apache account (when it is activated) and then you will be an official committer

@adriangb adriangb force-pushed the add-pruning-structs branch from 5094e27 to 5776221 Compare May 28, 2025 17:24
}

/// Prune a set of containers represented by their statistics.
/// Each [`Statistics`] represents a container (e.g. a file or a partition of files).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does a partition of files mean? A FileGroup or just a collection of some files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's any collection of files. Basically it's up to the caller to define what the container is: it could be a single file or a FileGroup. I do think it'd be helpful to link to possible sources of statistics here (e.g. link to [FileGroup]).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make a follow on PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow on PR: #16213

Comment on lines +396 to +399
/// If multiple statistics have information for the same column,
/// the first one is returned without any regard for completeness or accuracy.
/// That is: if the first statistics has information for a column, even if it is incomplete,
/// that is returned even if a later statistics has more complete information.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious about why not prune based on all the statistics that have information, one by one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could do that, but I'm not really sure where we'd encounter that.
The immediate use case here is having partition values for partition columns and file level statistics for the rest, so there is no overlap. But I had to choose some behavior for this implementation so I went with "first". I guess "both via AND" is reasonable, it just seems a bit harder to explain / get right. But I can give it a shot.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can also just add a todo and maintain the current status in the PR

@alamb
Copy link
Contributor

alamb commented May 30, 2025

Thanks @adriangb and @xudong963 -- this looks good to me 🚀

@alamb alamb merged commit c6e5c91 into apache:main May 30, 2025
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants