
Conversation

@viirya
Member

@viirya viirya commented Feb 2, 2017

What changes were proposed in this pull request?

For a Pipeline with a long list of stages, iteratively calling fit and transform on each stage grows the query plan and RDD lineage of the intermediate dataset enormously, so fit and transform take much longer to finish.

This patch introduces PeriodicDatasetCheckpointer to periodically checkpoint the dataset used in fit and transform.

It adds a new param checkpointInterval to Pipeline and PipelineModel. Once it is set, the dataset is periodically checkpointed via PeriodicDatasetCheckpointer.

Since the existing trait HasCheckpointInterval already defines the checkpointInterval param, this patch lets Pipeline and PipelineModel extend HasCheckpointInterval.
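
A rough sketch of the idea (not the actual code in this patch): when checkpointInterval is set, the intermediate dataset is eagerly checkpointed every few stages so its query plan and lineage stay short. The helper name transformWithCheckpoints below is made up for illustration, and it assumes a checkpoint directory has already been set via spark.sparkContext.setCheckpointDir.

import org.apache.spark.ml.Transformer
import org.apache.spark.sql.DataFrame

// Illustrative only: apply transformers in sequence and checkpoint the
// intermediate DataFrame every `checkpointInterval` stages. Dataset.checkpoint()
// (eager) materializes the data and truncates the query plan / RDD lineage.
def transformWithCheckpoints(
    input: DataFrame,
    stages: Seq[Transformer],
    checkpointInterval: Int): DataFrame = {
  stages.zipWithIndex.foldLeft(input) { case (df, (stage, i)) =>
    val out = stage.transform(df)
    if (checkpointInterval > 0 && (i + 1) % checkpointInterval == 0) out.checkpoint()
    else out
  }
}

The actual PeriodicDatasetCheckpointer in this patch presumably also handles details such as cleaning up old checkpoint files, which this sketch omits.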

Benchmark

Run the following code locally (e.g. in spark-shell).

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import spark.implicits._  // already in scope in spark-shell; needed elsewhere for toDF and $"..."

spark.sparkContext.setCheckpointDir("/tmp/checkpoint")

val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))

val indexers = df.columns.tail.map(c => new StringIndexer()
  .setInputCol(c)
  .setOutputCol(s"${c}_indexed")
  .setHandleInvalid("skip"))

val encoders = indexers.map(indexer => new OneHotEncoder()
  .setInputCol(indexer.getOutputCol)
  .setOutputCol(s"${indexer.getOutputCol}_encoded")
  .setDropLast(true))

val stages: Array[PipelineStage] = indexers ++ encoders
val pipeline = new Pipeline().setStages(stages)
pipeline.setCheckpointInterval(5)  // only run this line after applying this patch

val startTime = System.nanoTime
pipeline.fit(df).transform(df).show
val runningTime = (System.nanoTime - startTime) / 1000000  // elapsed time in ms

Before this patch: 1786001 ms
After this patch: 69013 ms

This issue was originally reported at http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tc20803.html

How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@viirya
Member Author

viirya commented Feb 2, 2017

cc @mengxr @jkbradley @liancheng

@viirya
Member Author

viirya commented Feb 2, 2017

also cc @MLnick

@viirya viirya changed the title [WIP][ML] Periodic checkout datasets for long ml pipeline [SPARK-19433][ML] Periodic checkout datasets for long ml pipeline Feb 2, 2017
@SparkQA

SparkQA commented Feb 2, 2017

Test build #72274 has finished for PR 16775 at commit 5ed5c2a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 2, 2017

Test build #72275 has started for PR 16775 at commit 7a1b300.

@SparkQA

SparkQA commented Feb 2, 2017

Test build #72276 has started for PR 16775 at commit 32c90dd.

@viirya
Member Author

viirya commented Feb 2, 2017

retest this please.

@SparkQA

SparkQA commented Feb 2, 2017

Test build #72277 has finished for PR 16775 at commit 32c90dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@DavidArenburg

DavidArenburg commented Feb 2, 2017

Wouldn't it be better to vectorize StringIndexer and OneHotEncoder? For instance, .na.fill and .na.replace operate over the whole dataset at once instead of running in a loop (or at least I think they work this way), similar to how model.matrix works in R. I feel like even with this patch this isn't scalable to, let's say, 1MM covariates (unless I'm missing something), and yes, fitting a model on 1MM covariates while using LASSO for feature selection is quite common. If I'm not missing something, as it stands, looping over StringIndexer and OneHotEncoder isn't executed on all nodes/cores at the same time, while it should be.
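
As a side note on the vectorization idea: later Spark releases (3.0+) added multi-column support to StringIndexer and OneHotEncoder via setInputCols/setOutputCols, which avoids creating one estimator per column. A rough sketch against that newer API (not available at the time of this PR):

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// One indexer / encoder over all columns at once, instead of one per column.
val inputCols = df.columns.tail
val indexer = new StringIndexer()
  .setInputCols(inputCols)
  .setOutputCols(inputCols.map(c => s"${c}_indexed"))
  .setHandleInvalid("skip")
val encoder = new OneHotEncoder()
  .setInputCols(indexer.getOutputCols)
  .setOutputCols(indexer.getOutputCols.map(c => s"${c}_encoded"))
  .setDropLast(true)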

@viirya
Member Author

viirya commented Feb 2, 2017

StringIndexer and OneHotEncoder are just used as examples here. The point is to have a pipeline with enough stages to produce a very long lineage.

@viirya
Member Author

viirya commented Feb 3, 2017

For the issue reported on the mailing list, I found the root cause of the significant difference between 1.6 and the current branch. The fix is at #16785.

However, I think this patch is still useful, so I will keep it open for a while for reviewers.

@viirya
Member Author

viirya commented Feb 9, 2017

ping @mengxr @jkbradley @liancheng @MLnick Could you take a look at this? Thanks.

@mallman
Contributor

mallman commented Feb 9, 2017

@viirya I believe this PR meshes with the refactoring and its application to Pregel GraphX algorithms in #15125. Basically, that PR moves the periodic checkpointing code from mllib into core and uses it in GraphX to checkpoint long lineages. This is essential for scaling GraphX to huge graphs, as described in my comment on that PR, and it solves a very real problem for us. Can you take a look at that PR?

@viirya
Member Author

viirya commented Feb 23, 2017

I think we can solve this issue by tackling the code on the SQL side, so I am closing it for now.

@viirya viirya closed this Feb 23, 2017
@viirya viirya deleted the periodic-checkout-for-long-ml-pipeline branch December 27, 2023 18:34
