[SPARK-15680][SQL] Disable comments in generated code in order to avoid perf. issues #13421

JoshRosen · 2016-05-31T20:24:26Z

What changes were proposed in this pull request?

In benchmarks involving tables with very wide and complex schemas (thousands of columns, deep nesting), I noticed that significant amounts of time (on the order of tens of seconds per task) were being spent generating comments during the code generation phase.

The root cause is that calling toString() on a complex expression can involve thousands of string concatenations, resulting in huge amounts (tens of gigabytes) of character array allocation and copying.
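To make the cost concrete, here is a minimal sketch on a toy expression tree. The types `Expr`, `Leaf`, and `Add` and the helper `RenderDemo` are hypothetical and are not Catalyst classes. `render` mimics a concatenating toString(), where every level copies the text of its children again, while `renderTo` appends each node's text once into a shared buffer.

```scala
// Toy expression tree; hypothetical types, not Catalyst's Expression classes.
sealed trait Expr {
  // Naive rendering: each level builds a new String by copying its children's text.
  def render: String
  // Builder-based rendering: every node appends into one shared buffer.
  def renderTo(sb: StringBuilder): Unit
}

final case class Leaf(name: String) extends Expr {
  def render: String = name
  def renderTo(sb: StringBuilder): Unit = {
    sb.append(name)
    ()
  }
}

final case class Add(left: Expr, right: Expr) extends Expr {
  def render: String = "(" + left.render + " + " + right.render + ")"
  def renderTo(sb: StringBuilder): Unit = {
    sb.append('(')
    left.renderTo(sb)
    sb.append(" + ")
    right.renderTo(sb)
    sb.append(')')
    ()
  }
}

object RenderDemo {
  // Left-deep chain (((c0 + c1) + c2) + ...): depth grows with the number of columns.
  def chain(n: Int): Expr =
    (1 until n).foldLeft(Leaf("c0"): Expr)((acc, i) => Add(acc, Leaf(s"c$i")))

  // Renders the whole tree through a single StringBuilder.
  def renderWithBuilder(e: Expr): String = {
    val sb = new StringBuilder
    e.renderTo(sb)
    sb.toString
  }
}
```

For a left-deep chain the naive version re-copies the accumulated prefix at every level, so the total number of characters copied grows roughly quadratically with depth, whereas the builder version appends each character once. Both produce the same string.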

In the long term, we can avoid this problem by passing StringBuilders down the tree and using them to accumulate output. As a short-term workaround, this patch guards comment generation behind a flag and disables comments by default (for wide tables / complex queries, these comments were being truncated prior to display and thus were not very useful).
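A guard like this only saves time if the comment text is not computed before the flag is checked. One way to get that, sketched below, is a by-name parameter. `CommentRegistry` and `GuardDemo` are hypothetical names used for illustration and are not the classes touched by this patch.

```scala
// Hypothetical stand-in for a code generator's comment registration; not Spark's class.
final class CommentRegistry(commentsEnabled: Boolean) {
  private val comments = scala.collection.mutable.ArrayBuffer.empty[String]

  // `text` is by-name, so the (possibly expensive) string is built only when the flag is on.
  def registerComment(text: => String): String =
    if (commentsEnabled) {
      comments += text
      s"/* ${comments.last} */"
    } else {
      ""
    }

  def registered: Seq[String] = comments.toSeq
}

object GuardDemo {
  // Counts how many times the expensive description was actually computed.
  var evaluations = 0

  def expensiveDescription(): String = {
    evaluations += 1
    "project [a, b, c]"
  }
}
```

With the flag off, `registerComment(GuardDemo.expensiveDescription())` returns an empty string and never calls `expensiveDescription()`, which is the property that matters when the description is a toString() over a very large expression.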

How was this patch tested?

This was tested manually by running a Spark SQL query over an empty table with a very wide schema obtained from a real workload. Disabling comments brought the per-task time down from about 16 seconds to 600 milliseconds.

SparkQA · 2016-05-31T21:34:18Z

Test build #59676 has finished for PR 13421 at commit 0b6a190.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-05-31T21:45:37Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+    // be extremely expensive in certain cases, such as deeply-nested expressions which operate over
+    // inputs with wide schemas. For more details on the performance issues that motivated this
+    // flat, see SPARK-15680.
+    if (SparkEnv.get != null && SparkEnv.get.conf.getBoolean("spark.sql.codegen.comments", false)) {


Hm, this is not a runtime config -- can we use a runtime SQLConf?

Basically, the change would be larger, but I think immutable configs like this make this feature pretty much dead.

JoshRosen · 2016-05-31

The problem with using a runtime SQLConf (or rather a CatalystConf, to avoid a circular dependency) is that we'd need to thread that configuration into the implementations of the CodeGenerator.generate method, and that method has 60+ call sites, many of which do not have a readily accessible configuration instance.

If we had some thread-local mechanism for implicitly obtaining these configurations then this would be easy, but for now I don't see a simple way to thread this configuration without changing 20+ files.
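As a rough illustration of what such a thread-local mechanism could look like, here is a sketch built on the standard library's `scala.util.DynamicVariable`. The names `CodegenConf`, `ActiveConf`, and `Generator` are hypothetical and nothing like them is part of this patch.

```scala
import scala.util.DynamicVariable

// Hypothetical configuration type for illustration; not Spark's CatalystConf.
final case class CodegenConf(commentsEnabled: Boolean)

object ActiveConf {
  // Default used when no caller has installed a configuration.
  private val current =
    new DynamicVariable[CodegenConf](CodegenConf(commentsEnabled = false))

  def get: CodegenConf = current.value

  // Runs `body` with `conf` visible to everything called on this thread,
  // then restores the previous value.
  def withConf[T](conf: CodegenConf)(body: => T): T =
    current.withValue(conf)(body)
}

object Generator {
  // A deeply nested call site reads the setting without taking a conf parameter.
  def comment(text: => String): String =
    if (ActiveConf.get.commentsEnabled) s"// $text" else ""
}
```

A caller would wrap the work in `ActiveConf.withConf(CodegenConf(commentsEnabled = true)) { ... }`, and call sites inside it would pick up the setting implicitly. The usual caveat applies: work handed off to other threads, such as pre-existing pool threads, would not see the value unless it is installed there as well.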

davies · 2016-05-31

We already have a thread-local SQLContext (SQLContext.getActive()); it could be used here.

In BroadcastExchangeExec and in subquery preparation (in SparkPlan), we do not set the current SQLContext as the active one; we should fix that as well.

JoshRosen · 2016-05-31

Can we access SQLContext in the catalyst package, though?

davies · 2016-06-01

That's sad ...

SparkQA · 2016-05-31T22:32:52Z

Test build #59683 has finished for PR 13421 at commit db46241.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2016-05-31T22:36:56Z

Jenkins, retest this please.

rxin · 2016-05-31T23:09:35Z

LGTM pending tests.

SparkQA · 2016-06-01T00:26:13Z

Test build #59689 has finished for PR 13421 at commit db46241.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-06-01T00:28:42Z

OK this isn't great, but I'm going to merge it first.

rxin · 2016-06-01T00:28:48Z

Merging in master/2.0.

…id perf. issues

## What changes were proposed in this pull request?

In benchmarks involving tables with very wide and complex schemas (thousands of columns, deep nesting), I noticed that significant amounts of time (order of tens of seconds per task) were being spent generating comments during the code generation phase.

The root cause of the performance problem stems from the fact that calling toString() on a complex expression can involve thousands of string concatenations, resulting in huge amounts (tens of gigabytes) of character array allocation and copying.

In the long term, we can avoid this problem by passing StringBuilders down the tree and using them to accumulate output. As a short-term workaround, this patch guards comment generation behind a flag and disables comments by default (for wide tables / complex queries, these comments were being truncated prior to display and thus were not very useful).

## How was this patch tested?

This was tested manually by running a Spark SQL query over an empty table with a very wide schema obtained from a real workload. Disabling comments brought the per-task time down from about 16 seconds to 600 milliseconds.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13421 from JoshRosen/disable-line-comments-in-codegen.

(cherry picked from commit 8ca01a6)
Signed-off-by: Reynold Xin <rxin@databricks.com>

Use flag to disable comments in generated code.

0b6a190

Fix NPE in tests.

db46241

rxin reviewed May 31, 2016

asfgit closed this in 8ca01a6 Jun 1, 2016

JoshRosen deleted the disable-line-comments-in-codegen branch June 1, 2016 01:11


