Add Aggregate::try_new with validation checks #3286

Merged 4 commits on Aug 31, 2022

Conversation

andygrove
Member

Which issue does this PR close?

Related to #3285

Rationale for this change

I found a bug in ProjectionPushDown in another branch I am working on. Adding validation checks when we create aggregate plans exposes this and seems like a good idea in general.

What changes are included in this PR?

Add Aggregate::try_new and update existing code to call it rather than constructing the struct directly.

Are there any user-facing changes?

No

@github-actions bot added labels on Aug 29, 2022: core (Core DataFusion crate), logical-expr (Logical plan and expressions), optimizer (Optimizer rules)
Contributor

@avantgardnerio avantgardnerio left a comment

LGTM. It follows the boy scout rule.

/// Count the number of distinct exprs in a list of group by expressions. If the
/// first element is a `GroupingSet` expression then it must be the only expr.
pub fn grouping_set_expr_count(group_expr: &[Expr]) -> Result<usize> {
    if let Some(Expr::GroupingSet(grouping_set)) = group_expr.first() {
Contributor

It's weird that this function takes either a GroupingSet expression, or an arbitrary expression. I dealt with this a lot in the subquery decorrelation - it almost feels like dynamic typing. I think the root cause might be that many of the enum values have fields inside them, rather than a struct - so it's impossible to de-reference them and pass them between functions.

How would you feel about blanket converting all enum values like InSubquery, Exists, etc to point to structs?

Member Author

I also found this frustrating to work with. Maybe we could introduce another enum?

pub enum Grouping {
    Expr(Vec<Expr>),
    Set(GroupingSet)
}

and then

pub struct Aggregate {
    pub input: Arc<LogicalPlan>,
    pub group_expr: Grouping,
    pub aggr_expr: Vec<Expr>,
    pub schema: DFSchemaRef,
}
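
A minimal sketch of how the proposed `Grouping` enum might simplify counting group expressions. The `Expr` and `GroupingSet` types here are hypothetical, heavily simplified stand-ins (not DataFusion's real definitions), and `expr_count` is an illustrative name rather than an existing method:

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for DataFusion's `Expr`; only a column name is modelled.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Expr(pub String);

/// Hypothetical, simplified grouping set: each inner Vec is one set of exprs.
#[derive(Debug, Clone)]
pub struct GroupingSet(pub Vec<Vec<Expr>>);

impl GroupingSet {
    /// Distinct expressions across all sets, in first-seen order.
    pub fn distinct_expr(&self) -> Vec<Expr> {
        let mut seen = HashSet::new();
        let mut out = Vec::new();
        for e in self.0.iter().flatten() {
            if seen.insert(e.clone()) {
                out.push(e.clone());
            }
        }
        out
    }
}

/// The enum proposed in the comment above.
pub enum Grouping {
    Expr(Vec<Expr>),
    Set(GroupingSet),
}

impl Grouping {
    /// Number of output columns contributed by the grouping.
    /// The match is exhaustive, so the "GroupingSet must be the only expr"
    /// case cannot arise and needs no runtime check.
    pub fn expr_count(&self) -> usize {
        match self {
            Grouping::Expr(exprs) => exprs.len(),
            Grouping::Set(set) => set.distinct_expr().len(),
        }
    }
}
```

With this shape, a mixed list of plain expressions and a grouping set is unrepresentable, which is the property the comment is after.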

Member Author

> How would you feel about blanket converting all enum values like InSubquery, Exists, etc to point to structs?

Sounds good to me. This would be consistent with how we do things in the LogicalPlan.
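
As a rough illustration of the struct-backed variant idea, assuming hypothetical and heavily simplified `Exists` and `InSubquery` types (subqueries are plain strings here, and `describe_exists` and `describe` are made-up helpers):

```rust
/// Hypothetical payload struct for an EXISTS expression.
#[derive(Debug, Clone, PartialEq)]
pub struct Exists {
    pub subquery: String,
    pub negated: bool,
}

/// Hypothetical payload struct for an IN (subquery) expression.
#[derive(Debug, Clone, PartialEq)]
pub struct InSubquery {
    pub column: String,
    pub subquery: String,
    pub negated: bool,
}

/// Each variant wraps a named struct instead of carrying inline fields.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    Exists(Exists),
    InSubquery(InSubquery),
}

/// Accepts the struct itself, so callers cannot pass any other variant.
pub fn describe_exists(e: &Exists) -> String {
    let prefix = if e.negated { "NOT " } else { "" };
    format!("{}EXISTS ({})", prefix, e.subquery)
}

/// Matches once at the boundary, then hands the typed payload onward.
pub fn describe(expr: &Expr) -> Option<String> {
    match expr {
        Expr::Exists(exists) => Some(describe_exists(exists)),
        _ => None,
    }
}
```

The variant is matched a single time, and every function after that takes `&Exists` rather than a general `&Expr`, so the compiler enforces what would otherwise be a runtime check.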

        }
        Ok(grouping_set.distinct_expr().len())
    } else {
        Ok(group_expr.len())
Contributor

Do we need to do other validation? Should I be able to group by *?

Member Author

I expect that we could add extra validation here over time. I would need to research what is allowable.

Contributor

FWIW the sqlparser won't allow group by *

❯ select count(*) from foo group by *;
🤔 Invalid statement: sql parser error: Expected an expression:, found: *

@codecov-commenter

Codecov Report

Merging #3286 (04b44e0) into master (7aed4d6) will decrease coverage by 0.00%.
The diff coverage is 88.23%.

@@            Coverage Diff             @@
##           master    #3286      +/-   ##
==========================================
- Coverage   85.92%   85.92%   -0.01%     
==========================================
  Files         294      294              
  Lines       53469    53489      +20     
==========================================
+ Hits        45945    45959      +14     
- Misses       7524     7530       +6     
| Impacted Files | Coverage Δ |
|---|---|
| datafusion/expr/src/logical_plan/plan.rs | 78.18% <69.23%> (-0.55%) ⬇️ |
| datafusion/expr/src/utils.rs | 90.74% <81.81%> (-0.37%) ⬇️ |
| datafusion/core/src/physical_plan/planner.rs | 80.88% <100.00%> (ø) |
| datafusion/expr/src/logical_plan/builder.rs | 90.35% <100.00%> (ø) |
| ...tafusion/optimizer/src/common_subexpr_eliminate.rs | 94.30% <100.00%> (ø) |
| datafusion/optimizer/src/projection_push_down.rs | 98.06% <100.00%> (ø) |
| ...fusion/optimizer/src/single_distinct_to_groupby.rs | 98.81% <100.00%> (ø) |
| datafusion/core/src/physical_plan/metrics/value.rs | 87.43% <0.00%> (+0.50%) ⬆️ |
| datafusion/expr/src/window_frame.rs | 93.27% <0.00%> (+0.84%) ⬆️ |

@andygrove
Member Author

@alamb @tustvold @jdye64 PTAL when you can

Contributor

@alamb alamb left a comment

LGTM -- thanks @andygrove

Comment on lines 1331 to 1333
    return Err(DataFusionError::Plan(
        "Aggregate schema has wrong number of fields".to_string(),
    ));
Contributor

Suggested change
- return Err(DataFusionError::Plan(
-     "Aggregate schema has wrong number of fields".to_string(),
- ));
+ return Err(DataFusionError::Plan(
+     format!("Aggregate schema has wrong number of fields. Expected {} got {}",
+         schema.fields().len(), group_expr_count + aggr_expr.len()
+     )
+ ));
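
For context, a minimal sketch of the kind of check a validating constructor could perform, using the error text discussed in this thread. The `Aggregate` type below is a hypothetical stand-in that models expressions and schema fields as plain strings and ignores grouping sets, so it is not DataFusion's actual `Aggregate::try_new`. The sketch reports the expression count as the expected value and the schema width as the actual value, which is one possible reading of the message:

```rust
/// Hypothetical, simplified stand-in for DataFusion's `Aggregate` plan node.
/// Expressions and schema fields are modelled as plain strings.
#[derive(Debug, PartialEq)]
pub struct Aggregate {
    pub group_expr: Vec<String>,
    pub aggr_expr: Vec<String>,
    pub schema: Vec<String>,
}

impl Aggregate {
    /// Build an aggregate, rejecting a schema whose width does not match
    /// the number of grouping plus aggregate expressions.
    pub fn try_new(
        group_expr: Vec<String>,
        aggr_expr: Vec<String>,
        schema: Vec<String>,
    ) -> Result<Self, String> {
        // Assumption: no grouping sets, so each group expr yields one field.
        let expected = group_expr.len() + aggr_expr.len();
        if schema.len() != expected {
            return Err(format!(
                "Aggregate schema has wrong number of fields. Expected {} got {}",
                expected,
                schema.len()
            ));
        }
        Ok(Self {
            group_expr,
            aggr_expr,
            schema,
        })
    }
}
```

Routing every construction through a fallible constructor like this means an optimizer rule that drops or duplicates a column fails at plan-building time instead of producing a mismatched plan.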

@andygrove andygrove merged commit 3d37016 into apache:master Aug 31, 2022
@andygrove andygrove deleted the aggregate-try-new branch August 31, 2022 14:58
@ursabot

ursabot commented Aug 31, 2022

Benchmark runs are scheduled for baseline = 516ad0d and contender = 3d37016. 3d37016 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

kmitchener pushed a commit to kmitchener/arrow-datafusion that referenced this pull request Sep 4, 2022
* Add Aggregate::try_new with validation checks

* fix calculation of number of grouping expressions

* use suggested error message
5 participants