-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Reuse last projection layer when renaming columns #14684
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Quite a speed improvement! I'll look at the code in a bit |
.with_column_renamed("c4", "boom")?; | ||
|
||
assert_eq!("\ | ||
Projection: aggregate_test_100.c1 AS one, aggregate_test_100.c2 AS two, aggregate_test_100.c3, aggregate_test_100.c2 + aggregate_test_100.c3 AS sum AS total\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AS sum AS total is valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the thorough reading 🙏 AS sum AS total
happens because there's an alias over an alias - I believe it's harmless since it just wraps the original expression.
There's a function unalias_nested
that can resolve this, but I don't want to call it because it's recursive and expensive. A potentially better option IMO is to reuse the existing alias in alias_qualified
and alias
if it's already there (similar to this PR). This feels like a separate improvement from this PR though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just worry that another column referenced 'sum' and then it is aliased to 'total' and what the behaviour of that would be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I don't think that's possible? "total" cannot be referenced within the same projection, because for that projection it is not in its input. But to display it in a nicer way, I'll do #14781
SchemaError::FieldNotFound { .. }, | ||
_, | ||
)) => { | ||
self.plan = LogicalPlan::Projection( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the field is not found why are we not just returning Ok(self)? I am unsure why we would need to add a projection for the plan in this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to do that because self
was destructed and so I need to put plan
back. try_new_with_schema
is quite cheap and this shouldn't be happening often anyway
}) | ||
.collect(); | ||
LogicalPlan::Projection(Projection::try_new(expr, input)?) | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be a lot of duplication here. I wonder if there is a way we could reduce that? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's very fair! I also dislike repeating qualifier_rename, field_rename
twice, but it's happening because they reference plan
and must come after its destruction (hence needed twice)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, I still want to merge this one. #14653 has indeed made the benches much faster, but I think it's still good to make the logical plan smaller if we can |
Looks to me like there are a few outstanding items:
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
d56e5f9
to
2fd800c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contribution @blaginin
let project_plan = LogicalPlanBuilder::from(self.plan) | ||
.project_with_validation(projection)? | ||
.build()?; | ||
let project_plan = if let LogicalPlan::Projection(Projection { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be clear, the projections are flattened as part of the planning process right? So this PR mostly is about making logical EXPLAIN plans look nicer with with_column_renamed
?
If so it seems like a lot of complexity for (just) that feature. Maybe we can switch to showing optimized explain plans or something for people who want to see a simpler version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it may have been related to trying to improve planning performance by reducing the number of projections up front so planning doesn't need to do it. Improving the logical plan output is definitely nice though :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is indeed removed at the optimisation step! The idea behind that PR was that sometimes you might want to call functions involving expr tree traversal before performing optimisations. For example:
- Create a df
- Rename a lot of columns
- Call something like
replace_params_with_values
← then, it'll iterate over a lot of projection layers
However, after revisiting the PR half a year later 😁, I feel like:
- this isn't actually a problem (spent 30 mins and couldn't easily reproduce the problem)
- it's very niche
- it does make the code more complicated
So I'll close the PR for now, but will keep an eye on related issues and resubmit if it turns out to be a real problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it may not be a problem exactly I do think it may be worth it for cases where a large number of with_column (thus projections) are happening.
You can see from a sample instrumentation for a dataframe operation I do that handling the projections is take a long time:
Optimization for rule optimize_projections took > 50ms: 22821 ms
The logical plan is 31MB in text (.display_indent()
) and you can see just how nested things can get just from the document map in Notepad++
I posted about improving the planning performance via multithreading in the discord channel but from a quick look I'm not sure that approach will necessarily be a simple change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh wait, this was with_column_renamed, not with_column. never mind :(
Which issue does this PR close?
Related to #14563
Rationale for this change
Currently, when
with_column_renamed
is called, a new Projection layer is always created. For example, in thedataframe
benchmark, logical plan looks like this:These layers do not affect the query itself (as they are removed by
optimize_projections
), but they make renaming new columns (and optimization itself) quite slow.What changes are included in this PR?
I added an optimization for one edge case - when there's already a projection on top and we're adding a new one. This is a common scenario, especially when renaming many columns (as in the
dataframe
benchmark). Instead of adding a new projection layer on top, we replace the existing one if possible.Are these changes tested?
Extended test case
Are there any user-facing changes?
No