[SPARK-17972][SQL] Build Datasets upon withCachedData instead of analyzed to avoid slow query planning
#15517
Conversation
The original toString method may OOM for super large query plans. This is especially true for plan trees that are built in an iterative manner and grow exponentially.
Test build #67081 has finished for PR 15517 at commit
LGTM.
Before this PR, we also cached the analyzed plan, right?
I think the major change is that we now cache the cached plan instead of the analyzed plan.
Force-pushed from 292ef36 to e1283a8: [SPARK-17972][SQL] Build Datasets upon withCachedData instead of analyzed to avoid slow query planning
The previous test failure was because we replaced the analyzed plan with withCachedData. Force-pushed a new and much simpler approach by building new Datasets upon withCachedData.
Test build #67120 has finished for PR 15517 at commit
cc @mengxr and @jkbradley
Test build #67208 has finished for PR 15517 at commit
The most recent version still breaks some test cases related to caching. Investigating it.
lazy val withCachedData: LogicalPlan = {
  assertAnalyzed()
  assertSupported()
why?
This line was actually moved to optimizedPlan. It's for fixing the streaming test failures.
Although streaming queries don't use QueryExecution for actual execution, something triggers this line after the changes made in the previous commit and throws an exception. I'm not quite familiar with Structured Streaming though, so I may have missed something here.
I'm closing this since caching is not the ultimate solution for this problem anyway. Caching consumes too much memory when, say, computing connected components iteratively over a graph with 50 billion nodes. Going to add a checkpoint API for Dataset so that we can truncate both the plan tree and the RDD lineage without caching.
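The checkpoint idea above can be sketched with a toy lineage model. This is plain Scala with hypothetical names (ToyDataset, transform, checkpoint are stand-ins, not the actual Dataset API): repeated transformations keep growing the lineage, while checkpointing pretends to materialize the data and truncates the chain.

```scala
// Toy model: lineageDepth tracks how many ancestor steps a dataset still
// references. Checkpointing (simulated) writes data to stable storage and
// resets the lineage, so both the plan tree and the RDD lineage stay short.
final case class ToyDataset(lineageDepth: Int)

// Each transformation adds one step to the lineage.
def transform(d: ToyDataset): ToyDataset = ToyDataset(d.lineageDepth + 1)

// Hypothetical checkpoint: pretend the data was persisted, drop the lineage.
def checkpoint(d: ToyDataset): ToyDataset = ToyDataset(0)

// 1,000 iterations with no checkpointing: lineage keeps growing.
val noCheckpoint = (1 to 1000).foldLeft(ToyDataset(0))((d, _) => transform(d))

// Checkpoint every 100 iterations: lineage depth never exceeds 100.
val withCheckpoint = (1 to 1000).foldLeft(ToyDataset(0)) { (d, i) =>
  val next = transform(d)
  if (i % 100 == 0) checkpoint(next) else next
}
```

The real API that grew out of this discussion additionally writes the data out, which caching alone cannot do without holding everything in memory.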
dataset.checkpoint is what I need.
What changes were proposed in this pull request?
(This PR is based on a PoC branch authored by @clockfly.)
Iterative ML code may easily create query plans that grow exponentially. We found that query planning time also increases exponentially even when all the sub-plan trees are cached.
The following snippet illustrates the problem:
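The original snippet is not preserved in this page capture. As a hedged stand-in (plain Scala, not Spark code; Plan, Leaf, Union, and grow are illustrative names), the toy model below shows how a plan tree that is rebuilt on itself every iteration grows exponentially, so any whole-tree work such as analysis or planning blows up with the iteration count:

```scala
// Toy model (not Spark internals): a binary "plan tree" that doubles in size
// each iteration, as happens when each step combines a Dataset with itself.
sealed trait Plan { def size: Long }
case object Leaf extends Plan { def size = 1L }
case class Union(left: Plan, right: Plan) extends Plan {
  def size = 1L + left.size + right.size
}

// After n iterations of `plan = Union(plan, plan)` the tree has 2^(n+1) - 1
// nodes, so traversal cost grows exponentially with the iteration count.
def grow(n: Int): Plan =
  (1 to n).foldLeft(Leaf: Plan)((p, _) => Union(p, p))
```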
This is because when building a new Dataset, the new plan is always built upon QueryExecution.analyzed, which doesn't leverage existing cached plans. This PR tries to fix this issue by building new Datasets upon QueryExecution.withCachedData to leverage cached plans and avoid super slow query planning.

Here is the result of running 1,000 iterations using the same snippet posted above after applying this PR:
Query planning time now grows much more slowly. The remaining growth is mostly because cache manager lookup slows down as the number of entries stored in the cache manager grows.
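The effect of the change can be sketched with a toy comparison. This is plain Scala, not Spark internals; CachedLeaf is an illustrative stand-in for a cached in-memory relation, and the function names are hypothetical. Building each iteration on the full analyzed tree keeps the whole history in the plan, while building on the cached data collapses the previously cached subtree into a single leaf:

```scala
// Toy plan nodes: Source is a base table, CachedLeaf stands in for a cached
// in-memory relation, Join combines two subtrees.
sealed trait Node { def size: Long }
case object Source extends Node { def size = 1L }
case object CachedLeaf extends Node { def size = 1L }
case class Join(l: Node, r: Node) extends Node { def size = 1L + l.size + r.size }

// Building on the analyzed plan: each iteration embeds the whole previous tree,
// so the plan the planner must walk doubles every step.
def planBuiltOnAnalyzed(n: Int): Node =
  (1 to n).foldLeft(Source: Node)((p, _) => Join(p, p))

// Building on the cached data: the previous (cached) tree is replaced by a
// single leaf, so the tree handed to the planner stays constant-size.
def planBuiltOnCachedData(n: Int): Node =
  (1 to n).foldLeft(Source: Node)((_, _) => Join(CachedLeaf, CachedLeaf))
```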
Many thanks to @clockfly, who investigated this issue and made an initial PoC.
How was this patch tested?
Existing tests.