[SPARK-17972][SQL] Add Dataset.checkpoint() to truncate large query plans #15651
Conversation
The reason why we would like to pick the first leaf Partitioning here is that PartitioningCollection, which is also an Expression and participates in query planning, may grow exponentially in the benchmark snippet, which essentially builds a full binary tree of Joins.
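To illustrate the idea (a rough sketch, not the PR's actual code; `firstLeafPartitioning` is a hypothetical helper):

``` scala
// Collapse a (possibly nested) PartitioningCollection down to its first leaf
// Partitioning, so the set of alternative partitionings cannot grow
// exponentially across iterated self-joins.
import org.apache.spark.sql.catalyst.plans.physical.{Partitioning, PartitioningCollection}

def firstLeafPartitioning(partitioning: Partitioning): Partitioning = partitioning match {
  case PartitioningCollection(partitionings) => firstLeafPartitioning(partitionings.head)
  case leaf => leaf
}
```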
Are the partitionings other than the first one useful? Can we just filter out the partitionings guaranteed by other partitionings, instead of picking only the first?
There can be cases where the optimizer fails to eliminate an unnecessary shuffle if we strip all the other partitionings. But that's still better than an exponentially growing PartitioningCollection, which basically runs into the same slow query planning issue this PR tries to solve.
I talked to @yhuai offline about exactly the same issue you brought up before sending out this PR, and we decided to have a working version first and optimize it later since we still need feedback from ML people to see whether the basic mechanism works for their workloads.
Test build #67608 has finished for PR 15651 at commit
Should we set it back to the original checkpoint dir at the end of this test?
Good point. Thanks.
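A minimal sketch of that fix, assuming a test where `spark` is the active session and `dir` is a temporary directory (placeholder names, not the exact test code):

``` scala
// Save the current checkpoint directory, point it at a temp dir for the test,
// and restore the previous value afterwards so other tests are unaffected.
val originalCheckpointDir = spark.sparkContext.getCheckpointDir
try {
  spark.sparkContext.setCheckpointDir(dir.getCanonicalPath)
  // ... exercise Dataset.checkpoint() here ...
} finally {
  // setCheckpointDir has no "unset"; restore only if a directory was set before.
  originalCheckpointDir.foreach(spark.sparkContext.setCheckpointDir)
}
```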
@liancheng We have ... What do you think, can we also add it to Dataset?
Do we need to convert the old attributes in outputPartitioning and outputOrdering to new attributes? Otherwise the partitioning and ordering still refer to the old attributes.
Thanks! Good point. I think we probably don't need to call _.newInstance() here anyway.
Actually we do need the _.newInstance() call. You're right.
Test build #67619 has finished for PR 15651 at commit
retest this please
Test build #67660 has finished for PR 15651 at commit
Test build #67663 has finished for PR 15651 at commit
@viirya
Test build #67679 has finished for PR 15651 at commit
Updated the branch (0ee8f41 to 609fba7):
- Restore checkpoint directory at the end of the test case
- Add eager argument to Dataset.checkpoint()
- Address PR comments
Test build #67691 has finished for PR 15651 at commit
Also cc @JoshRosen
``` scala
      case e: Attribute => rewrite.getOrElse(e, e)
    }.asInstanceOf[Partitioning]

  case p => p
```
Can you explain this?
Not all Partitioning classes are Expressions; we only need to rewrite attributes within those Partitionings that are also Expressions.
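In other words, the rewrite only has to touch expression-based partitionings such as HashPartitioning. A rough sketch, assuming `rewrite: Map[Attribute, Attribute]` maps the old output attributes to the fresh ones created via `_.newInstance()` (illustration only, not the exact PR code):

``` scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
import org.apache.spark.sql.catalyst.plans.physical.Partitioning

def rewritePartitioning(
    partitioning: Partitioning,
    rewrite: Map[Attribute, Attribute]): Partitioning = partitioning match {
  // Expression-based partitionings carry attribute references; rewrite them.
  case e: Expression =>
    e.transform { case a: Attribute => rewrite.getOrElse(a, a) }.asInstanceOf[Partitioning]
  // Non-expression partitionings (e.g. UnknownPartitioning) have nothing to rewrite.
  case p => p
}
```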
``` scala
/**
 * Returns a checkpointed version of this Dataset.
 *
```
nit: this API does not have this param.
Oh, nice catch!
LGTM except for one minor comment.
Test build #67802 has finished for PR 15651 at commit
lgtm pending jenkins
Test build #67813 has finished for PR 15651 at commit
Test build #67816 has finished for PR 15651 at commit
Test build #67817 has finished for PR 15651 at commit
[SPARK-17972][SQL] Add Dataset.checkpoint() to truncate large query plans
## What changes were proposed in this pull request?
### Problem
Iterative ML code may easily create query plans that grow exponentially. We found that query planning time also increases exponentially even when all the sub-plan trees are cached.
The following snippet illustrates the problem:
``` scala
(0 until 6).foldLeft(Seq(1, 2, 3).toDS) { (plan, iteration) =>
println(s"== Iteration $iteration ==")
val time0 = System.currentTimeMillis()
val joined = plan.join(plan, "value").join(plan, "value").join(plan, "value").join(plan, "value")
joined.cache()
println(s"Query planning takes ${System.currentTimeMillis() - time0} ms")
joined.as[Int]
}
// == Iteration 0 ==
// Query planning takes 9 ms
// == Iteration 1 ==
// Query planning takes 26 ms
// == Iteration 2 ==
// Query planning takes 53 ms
// == Iteration 3 ==
// Query planning takes 163 ms
// == Iteration 4 ==
// Query planning takes 700 ms
// == Iteration 5 ==
// Query planning takes 3418 ms
```
This is because when building a new Dataset, the new plan is always built upon `QueryExecution.analyzed`, which doesn't leverage existing cached plans.
On the other hand, caching every few iterations is usually not the right direction for this problem, since caching consumes too much memory (imagine computing connected components over a graph with 50 billion nodes). What we really need here is to truncate both the query plan (to minimize query planning time) and the lineage of the underlying RDD (to avoid stack overflow).
### Changes introduced in this PR
This PR tries to fix this issue by introducing a `checkpoint()` method into `Dataset[T]`, which does exactly the things described above. The following snippet, which is essentially the same as the one above but invokes `checkpoint()` instead of `cache()`, shows the micro benchmark result of this PR:
One key point is that the checkpointed Dataset should preserve the partitioning and ordering information of the original Dataset, so that we can avoid unnecessary shuffling (similar to reading from a pre-bucketed table). This is done by adding `outputPartitioning` and `outputOrdering` to `LogicalRDD` and `RDDScanExec`.
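As a hedged illustration of the intended effect (not part of the PR's tests; assumes a spark-shell session with implicits in scope): after checkpointing a Dataset that is hash-partitioned on a column, a self-join on that column should be planned without extra shuffles, because `LogicalRDD` now reports the preserved `outputPartitioning`:

``` scala
spark.sparkContext.setCheckpointDir("/tmp/cp")

val df = spark.range(100).repartition($"id")  // hash-partitioned by "id"
val cp = df.checkpoint()                      // plan truncated to a LogicalRDD scan

// The physical plan of the self-join should reuse the preserved partitioning
// rather than insert additional Exchange (shuffle) operators below the join.
cp.join(cp, "id").explain()
```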
### Micro benchmark
``` scala
spark.sparkContext.setCheckpointDir("/tmp/cp")
(0 until 100).foldLeft(Seq(1, 2, 3).toDS) { (plan, iteration) =>
println(s"== Iteration $iteration ==")
val time0 = System.currentTimeMillis()
val cp = plan.checkpoint()
cp.count()
System.out.println(s"Checkpointing takes ${System.currentTimeMillis() - time0} ms")
val time1 = System.currentTimeMillis()
val joined = cp.join(cp, "value").join(cp, "value").join(cp, "value").join(cp, "value")
val result = joined.as[Int]
println(s"Query planning takes ${System.currentTimeMillis() - time1} ms")
result
}
// == Iteration 0 ==
// Checkpointing takes 591 ms
// Query planning takes 13 ms
// == Iteration 1 ==
// Checkpointing takes 1605 ms
// Query planning takes 16 ms
// == Iteration 2 ==
// Checkpointing takes 782 ms
// Query planning takes 8 ms
// == Iteration 3 ==
// Checkpointing takes 729 ms
// Query planning takes 10 ms
// == Iteration 4 ==
// Checkpointing takes 734 ms
// Query planning takes 9 ms
// == Iteration 5 ==
// ...
// == Iteration 50 ==
// Checkpointing takes 571 ms
// Query planning takes 7 ms
// == Iteration 51 ==
// Checkpointing takes 548 ms
// Query planning takes 7 ms
// == Iteration 52 ==
// Checkpointing takes 596 ms
// Query planning takes 8 ms
// == Iteration 53 ==
// Checkpointing takes 568 ms
// Query planning takes 7 ms
// ...
```
You may see that although checkpointing is a heavier-weight operation, the combined cost of checkpointing and query planning stays roughly constant across iterations.
### Open question
@mengxr mentioned that it would be more convenient if we could make `Dataset.checkpoint()` eager, i.e., always perform an `RDD.count()` after calling `RDD.checkpoint()`. It's not clear whether this is a universal requirement. Maybe we can add an `eager: Boolean` argument to `Dataset.checkpoint()` to support that.
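If such a flag were added, usage would look roughly like this (a sketch of the proposed API, consistent with the "Add eager argument to Dataset.checkpoint()" commit above; assumes a spark-shell session):

``` scala
spark.sparkContext.setCheckpointDir("/tmp/cp")

val ds = Seq(1, 2, 3).toDS()

// Eager: materialize the checkpoint immediately (e.g. via an internal count),
// so the returned Dataset is already backed by checkpoint files.
val eagerCp = ds.checkpoint(eager = true)

// Lazy: only materialize the checkpoint when the returned Dataset is first computed.
val lazyCp = ds.checkpoint(eager = false)
lazyCp.count()  // triggers the actual checkpointing
```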
## How was this patch tested?
Unit test added in `DatasetSuite`.
Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes apache#15651 from liancheng/ds-checkpoint.