-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-15680][SQL] Disable comments in generated code in order to avoid perf. issues #13421
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-15680][SQL] Disable comments in generated code in order to avoid perf. issues #13421
Conversation
|
Test build #59676 has finished for PR 13421 at commit
|
| // be extremely expensive in certain cases, such as deeply-nested expressions which operate over | ||
| // inputs with wide schemas. For more details on the performance issues that motivated this | ||
| // flat, see SPARK-15680. | ||
| if (SparkEnv.get != null && SparkEnv.get.conf.getBoolean("spark.sql.codegen.comments", false)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm this is not a runtime config -- can we use a runtime sqlconf?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
basically the change would be larger, but i think immutable configs like this make this feature pretty much dead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with using a runtime SQLConf (actually a CatalystConf (to avoid a circular dependency)) is that we'd need to thread that configuration into the implementations of the CodeGenerator.generate method and that method has 60+ call sites, many of which do not have a readily-accessible configuration instance.
If we had some thread-local mechanism for implicitly obtaining these configurations then this would be easy, but for now I don't see a simple way to thread this configuration without changing 20+ files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have a thread-local SQLContext (SQLContext.getActive()), it could be used here.
In BroadcastExchangeExec and prepare of subquery (in SparkPlan), we did not set the current SQLContext as active one, we should also fix that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we access SQLContext in the catalyst package, though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's sad ...
|
Test build #59683 has finished for PR 13421 at commit
|
|
Jenkins, retest this please. |
|
LGTM pending tests. |
|
Test build #59689 has finished for PR 13421 at commit
|
|
OK this isn't great, but I'm going to merge it first. |
|
Merging in master/2.0. |
…id perf. issues ## What changes were proposed in this pull request? In benchmarks involving tables with very wide and complex schemas (thousands of columns, deep nesting), I noticed that significant amounts of time (order of tens of seconds per task) were being spent generating comments during the code generation phase. The root cause of the performance problem stems from the fact that calling toString() on a complex expression can involve thousands of string concatenations, resulting in huge amounts (tens of gigabytes) of character array allocation and copying. In the long term, we can avoid this problem by passing StringBuilders down the tree and using them to accumulate output. As a short-term workaround, this patch guards comment generation behind a flag and disables comments by default (for wide tables / complex queries, these comments were being truncated prior to display and thus were not very useful). ## How was this patch tested? This was tested manually by running a Spark SQL query over an empty table with a very wide schema obtained from a real workload. Disabling comments brought the per-task time down from about 16 seconds to 600 milliseconds. Author: Josh Rosen <joshrosen@databricks.com> Closes #13421 from JoshRosen/disable-line-comments-in-codegen. (cherry picked from commit 8ca01a6) Signed-off-by: Reynold Xin <rxin@databricks.com>
What changes were proposed in this pull request?
In benchmarks involving tables with very wide and complex schemas (thousands of columns, deep nesting), I noticed that significant amounts of time (order of tens of seconds per task) were being spent generating comments during the code generation phase.
The root cause of the performance problem stems from the fact that calling toString() on a complex expression can involve thousands of string concatenations, resulting in huge amounts (tens of gigabytes) of character array allocation and copying.
In the long term, we can avoid this problem by passing StringBuilders down the tree and using them to accumulate output. As a short-term workaround, this patch guards comment generation behind a flag and disables comments by default (for wide tables / complex queries, these comments were being truncated prior to display and thus were not very useful).
How was this patch tested?
This was tested manually by running a Spark SQL query over an empty table with a very wide schema obtained from a real workload. Disabling comments brought the per-task time down from about 16 seconds to 600 milliseconds.