Skip to content

Conversation

@dvogelbacher
Copy link
Contributor

What changes were proposed in this pull request?

Add a configuration option for the maximum number of leaf expressions in collapsed project nodes. If a collapsed project node (result of the CollapseProject optimizer rule) would have more leaf expressions than the configured maximum number we don't collapse. This is to protect against an exponentially exploding number of leaf expressions when collapsing many (binary) expression that refer to the same columns (see https://issues.apache.org/jira/browse/SPARK-24983).

How was this patch tested?

Add a new unit test.

@dvogelbacher dvogelbacher changed the title [SPARK-24983][SQL] Add configuration for maximum number of leaf expressions in collapsed project nodes [SPARK-24983][Catalyst] Add configuration for maximum number of leaf expressions in collapsed project nodes Aug 3, 2018
@dvogelbacher dvogelbacher changed the title [SPARK-24983][Catalyst] Add configuration for maximum number of leaf expressions in collapsed project nodes [SPARK-24983][Catalyst] Add configuration for maximum number of leaf expressions in collapsed projects Aug 3, 2018
@dvogelbacher
Copy link
Contributor Author

@HyukjinKwon can you help with finding reviewers for this PR?

@HyukjinKwon
Copy link
Member

Usually @gatorsmile and @cloud-fan.

@gatorsmile
Copy link
Member

Let us blacklist CASE WHEN in CollapseProject, instead of introducing this new conf.

@gatorsmile
Copy link
Member

@dvogelbacher Currently, in the master branch (2.4 release), you have a workaround. Add CollapseProject to spark.sql.optimizer.excludedRules before such queries.

@dvogelbacher
Copy link
Contributor Author

dvogelbacher commented Aug 4, 2018

@gatorsmile yes, I found that workaround. Very useful :)
I think it would still be good to handle this better by default. I can see that introducing such an arbitrary configuration param doesn't seem optimal and am open to better suggestions.

Not sure if blacklisting case-when statements outright is the right way to go. That could have negative perf impacts as well? And it wouldn't handle the case in the unit test where we have exponential growth when adding/subtracting columns (though that example might be somewhat contrived).

Maybe we should just not collapse if the number of leaf expressions for changed expressions in the collapsed project is higher than the sum of the number of corresponding leaf expressions in the original project lists?

} else {
p2.copy(projectList = buildCleanedProjectList(p1.projectList, p2.projectList))
}
case p1@Project(_, p2: Project) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Do we need to change this line? We can keep this line as is.

@gatorsmile
Copy link
Member

Backlisting case-when statements looks good to me.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Sep 17, 2019

Closing due to inactivity over a year.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants