
Conversation

@JoshRosen (Contributor)

What changes were proposed in this pull request?

In benchmarks involving tables with very wide and complex schemas (thousands of columns, deep nesting), I noticed that significant amounts of time (order of tens of seconds per task) were being spent generating comments during the code generation phase.

The root cause is that calling toString() on a complex expression can involve thousands of string concatenations, resulting in huge amounts (tens of gigabytes) of character-array allocation and copying.
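To illustrate the shape of the problem (a hypothetical sketch, not code from this patch):

    // Hypothetical sketch of the problem: a naive recursive toString over a
    // tree concatenates its children's strings at every level, so characters
    // from the leaves are re-copied once per ancestor. For deep, wide trees
    // the total copying grows quadratically with the output size.
    case class Node(name: String, children: Seq[Node]) {
      override def toString: String =
        name + children.map(_.toString).mkString("(", ", ", ")")
    }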

In the long term, we can avoid this problem by passing StringBuilders down the tree and using them to accumulate output. As a short-term workaround, this patch guards comment generation behind a flag and disables comments by default (for wide tables / complex queries, these comments were being truncated prior to display and thus were not very useful).
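A minimal sketch of that long-term approach, using the same hypothetical Node type as above (not the actual Catalyst API):

    // Sketch of the long-term fix: thread a single StringBuilder through the
    // tree so each character is appended exactly once, making the traversal
    // linear in the output size instead of quadratic.
    case class Node(name: String, children: Seq[Node]) {
      def buildString(sb: StringBuilder): StringBuilder = {
        sb.append(name).append('(')
        children.zipWithIndex.foreach { case (child, i) =>
          if (i > 0) sb.append(", ")
          child.buildString(sb)
        }
        sb.append(')')
      }
      override def toString: String = buildString(new StringBuilder).toString
    }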

How was this patch tested?

This was tested manually by running a Spark SQL query over an empty table with a very wide schema obtained from a real workload. Disabling comments brought the per-task time down from about 16 seconds to 600 milliseconds.

@SparkQA

SparkQA commented May 31, 2016

Test build #59676 has finished for PR 13421 at commit 0b6a190.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// be extremely expensive in certain cases, such as deeply-nested expressions which operate over
// inputs with wide schemas. For more details on the performance issues that motivated this
// flag, see SPARK-15680.
if (SparkEnv.get != null && SparkEnv.get.conf.getBoolean("spark.sql.codegen.comments", false)) {
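A usage sketch: because the flag is read from SparkEnv's SparkConf rather than a runtime SQLConf, it has to be set before the SparkContext is created, for example:

    // Re-enabling codegen comments: the flag is read from SparkEnv's
    // SparkConf, so it must be set at startup and cannot be toggled at
    // runtime. The app name here is a hypothetical placeholder.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("codegen-comments-example")
      .set("spark.sql.codegen.comments", "true")

The same can be done with spark-submit via --conf spark.sql.codegen.comments=true.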
Contributor: hm, this is not a runtime config -- can we use a runtime SQLConf?

Contributor: Basically the change would be larger, but I think immutable configs like this make this feature pretty much dead.

Contributor (Author): The problem with using a runtime SQLConf (actually a CatalystConf, to avoid a circular dependency) is that we'd need to thread that configuration into the implementations of the CodeGenerator.generate method, and that method has 60+ call sites, many of which do not have a readily accessible configuration instance.

If we had some thread-local mechanism for implicitly obtaining these configurations then this would be easy, but for now I don't see a simple way to thread this configuration without changing 20+ files.
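For illustration only, a thread-local holder along the lines being discussed (all names invented; this is not part of the patch) might look like:

    // Hypothetical thread-local holder that would let CodeGenerator.generate
    // read the comments flag implicitly, without threading a CatalystConf
    // through its 60+ call sites. All names are invented for illustration.
    object CodegenCommentConf {
      private val enabled = new ThreadLocal[Boolean] {
        override def initialValue(): Boolean = false
      }
      // Runs `body` with the flag set, restoring the previous value after.
      def withComments[T](on: Boolean)(body: => T): T = {
        val saved = enabled.get()
        enabled.set(on)
        try body finally enabled.set(saved)
      }
      def commentsEnabled: Boolean = enabled.get()
    }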

Contributor: We already have a thread-local SQLContext (SQLContext.getActive()); it could be used here.

In BroadcastExchangeExec and in subquery preparation (in SparkPlan), we did not set the current SQLContext as the active one; we should fix that as well.
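A hedged sketch of what consulting the active SQLContext could look like (this assumes the call site can see org.apache.spark.sql.SQLContext, which, as the next comment notes, is not the case inside catalyst):

    // Sketch: reading the flag from the thread-local active SQLContext.
    // Only viable where org.apache.spark.sql.SQLContext is visible, which
    // rules out the catalyst package (see the follow-up below).
    import org.apache.spark.sql.SQLContext

    val commentsEnabled: Boolean = SQLContext.getActive()
      .map(_.getConf("spark.sql.codegen.comments", "false").toBoolean)
      .getOrElse(false)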

Contributor (Author): Can we access SQLContext in the catalyst package, though?

Contributor: That's sad ...

@SparkQA

SparkQA commented May 31, 2016

Test build #59683 has finished for PR 13421 at commit db46241.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen (Contributor, Author)

Jenkins, retest this please.

@rxin (Contributor)

rxin commented May 31, 2016

LGTM pending tests.

@SparkQA

SparkQA commented Jun 1, 2016

Test build #59689 has finished for PR 13421 at commit db46241.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor)

rxin commented Jun 1, 2016

OK this isn't great, but I'm going to merge it first.

@rxin (Contributor)

rxin commented Jun 1, 2016

Merging in master/2.0.

asfgit pushed a commit that referenced this pull request Jun 1, 2016
…id perf. issues


Author: Josh Rosen <joshrosen@databricks.com>

Closes #13421 from JoshRosen/disable-line-comments-in-codegen.

(cherry picked from commit 8ca01a6)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@asfgit asfgit closed this in 8ca01a6 Jun 1, 2016
@JoshRosen JoshRosen deleted the disable-line-comments-in-codegen branch June 1, 2016 01:11