[SPARK-19665][SQL] Improve constraint propagation #16998
Conversation
Force-pushed from 24fb723 to 917de74 (compare).
Test build #73159 has finished for PR 16998 at commit
Force-pushed from 917de74 to d691c66 (compare).
@hvanhovell Yes. #16785 only makes a limited improvement. Both #16785 and this PR are non-parallel approaches.
Test build #73158 has finished for PR 16998 at commit
Test build #73163 has finished for PR 16998 at commit
Force-pushed from d691c66 to 6cb896f (compare).
Test build #73174 has finished for PR 16998 at commit
I do not get your point. What does this mean? Constraint propagation is a bottom-up mechanism for inferring constraints. Can you elaborate your idea more formally? I have not read the code; I am just wondering whether we could miss plan-optimization opportunities after this PR. What is the negative impact, if any?
We currently fully expand the constraints with aliased attributes. For example, if there is a constraint "a > b" and the current query plan aliases "a" to "c" and "d", the final constraints of this plan are "a > b", "c > b", and "d > b". These constraints all evaluate to the same value, either all true or all false. So when inferring filters from the constraints, we only need "a > b"; the aliased constraints "c > b" and "d > b" are not necessary.
The only optimization I think would be affected is …. However, this is not a big impact, and it can be easily solved: we can use a simple method to check whether a given condition like "c > b" is contained in the fully expanded constraints of a query plan, without actually fully expanding the constraints.
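A minimal sketch of that containment check, in plain Scala with a toy constraint representation (not Catalyst's actual API, where constraints are `Expression` trees): keep only the base constraints plus an alias map, and rewrite a queried condition back to the original attributes instead of materializing every aliased combination.

```scala
// Toy constraint representation for illustration only.
case class Constraint(left: String, op: String, right: String)

val baseConstraints = Set(Constraint("a", ">", "b"))
// "c" and "d" are aliases of "a" introduced by a Project.
val aliasToOriginal = Map("c" -> "a", "d" -> "a")

// Rewrite aliases back to the original attributes.
def canonical(c: Constraint): Constraint =
  Constraint(
    aliasToOriginal.getOrElse(c.left, c.left),
    c.op,
    aliasToOriginal.getOrElse(c.right, c.right))

// Answer "is this condition implied?" without generating {a > b, c > b, d > b}.
def contains(c: Constraint): Boolean = baseConstraints.contains(canonical(c))

assert(contains(Constraint("c", ">", "b")))   // implied via alias c -> a
assert(contains(Constraint("d", ">", "b")))   // implied via alias d -> a
assert(!contains(Constraint("b", ">", "a")))  // not implied
```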
@viirya Please correct me if I'm wrong, but scanning through this patch, it appears that the underlying problem is that duplicating and tracking aliased constraints using a …
By the way, as an aside, we should probably allow constraint inference/propagation to be turned off via a conf flag, to provide a quick workaround against these kinds of problems.
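For reference, a minimal sketch of how such a switch would be used, assuming the `spark.sql.constraintPropagation.enabled` / `SQLConf.CONSTRAINT_PROPAGATION_ENABLED` config that was eventually added in #17186 (quoted later in this thread):

```scala
import org.apache.spark.sql.internal.SQLConf

// Disable constraint inference/propagation as a stop-gap workaround.
// Assumes the config added in #17186 and an existing SparkSession named `spark`.
spark.conf.set(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key, false)
// equivalently:
spark.conf.set("spark.sql.constraintPropagation.enabled", "false")
```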
@sameeragarwal That's correct.
Since we use constraints in optimization, if we turn off constraint inference/propagation, wouldn't we miss optimization opportunities for query plans?
Force-pushed from 6cb896f to 5be21b3 (compare).
Test build #73316 has finished for PR 16998 at commit
@hvanhovell Do you have time to review this?
I just ran into the same issue with … Here's a minimal test case:

```scala
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._ // for toDF; assumes spark-shell or an existing SparkSession

val max = 12 // try increasing this
val df = Seq.empty[Int].toDF
val filter = (for (i <- 0 to max)
  yield col("value") <=> i) reduce (_ || _)
val projections = for (i <- 0 to max)
  yield (col("value") <=> i).as(s"value_$i")
val dummy = lit(true) // this can be anything
val result = df.filter(dummy).select(projections: _*).filter(filter).filter(dummy)
result.explain
```

The …
@hvanhovell Do you have any thoughts on this already? Please let me know. Thanks!
Not really. Constraint propagation will still be enabled by default in Spark. The flag would just be a hammer to quickly get around issues like this and SPARK-17733.
@viirya I'll take a closer look at this patch, but given that this PR primarily introduces a data structure that keeps track of aliased constraints, is there a fundamental reason for changing the underlying behavior (and restricting the optimization space)? Could there be a simpler alternative where we still keep the old semantics?
Yeah, of course. I meant that when you disable the flag, you wouldn't get the optimizations that rely on constraint propagation. I will create another PR for this option.
I haven't found an alternative fix that keeps the old semantics, leaves the propagation structure unchanged, and still largely improves performance. #16785 keeps the old semantics and does not change the propagation structure, but it only cuts the benchmark's running time in half. Adding the flag is a simpler option.
## What changes were proposed in this pull request?

Constraint propagation can be computationally expensive and block the driver execution for a long time. For example, the benchmark below needs 30 minutes. Compared with the previous PRs #16998 and #16785, this is a much simpler option: add a flag to disable constraint propagation.

### Benchmark

Run the following code locally.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.sql.internal.SQLConf

spark.conf.set(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key, false)

val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0"))(
  (df, i) => df.withColumn(s"x$i", $"x0"))

val indexers = df.columns.tail.map(c => new StringIndexer()
  .setInputCol(c)
  .setOutputCol(s"${c}_indexed")
  .setHandleInvalid("skip"))

val encoders = indexers.map(indexer => new OneHotEncoder()
  .setInputCol(indexer.getOutputCol)
  .setOutputCol(s"${indexer.getOutputCol}_encoded")
  .setDropLast(true))

val stages: Array[PipelineStage] = indexers ++ encoders
val pipeline = new Pipeline().setStages(stages)

val startTime = System.nanoTime
pipeline.fit(df).transform(df).show
val runningTime = System.nanoTime - startTime
```

Before this patch: 1786001 ms ~= 30 mins
After this patch: 26392 ms = less than half a minute

Related PRs: #16998, #16785.

## How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #17186 from viirya/add-flag-disable-constraint-propagation.
Now that we've added the flag, this issue is not urgent for now. I'll close this first.
…ional (#179) [SNAP-3195] Exposing `spark.sql.constraintPropagation.enabled` config to disable optimization rules related to constraint propagation. Cherry-picked from e011004 and resolved merge conflicts. Original commit message: [SPARK-19846][SQL] Add a flag to disable constraint propagation.
## What changes were proposed in this pull request?
If there are aliased expressions in the projection, we propagate constraints by completely expanding the original constraints with the aliases.
This expansion becomes computationally expensive as the number of aliases increases.
Fully expanding all constraints all the time makes iterative ML algorithms, where an ML pipeline has many stages, run very slowly. See #16785.
Another issue is that we actually don't need the additional constraints most of the time. For example, suppose there is a constraint "a > b" and "a" is aliased to "c" and "d". When we use this constraint in filtering, we don't need all of "a > b", "c > b", and "d > b"; we only need "a > b", because if it is false, all the other constraints are guaranteed to be false too.
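For illustration, a small hypothetical DataFrame query that produces exactly this situation: the filter contributes the constraint "a > b", and the projection aliases "a" to "c" and "d", so full expansion would also carry "c > b" and "d > b".

```scala
import org.apache.spark.sql.functions.col
import spark.implicits._ // assumes spark-shell or an existing SparkSession named `spark`

val df = Seq((1, 0), (2, 5)).toDF("a", "b")

val plan = df
  .filter(col("a") > col("b"))  // contributes the constraint: a > b
  .select(col("a"), col("b"), col("a").as("c"), col("a").as("d"))  // aliases a -> c, d

// With full expansion, the Project's constraint set also contains c > b and d > b,
// although all three conditions always evaluate to the same value.
plan.explain(true)
```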
### Benchmark
Run the following code locally.
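Presumably this is the same ML-pipeline workload quoted in the #17186 commit message earlier in this thread, minus the line that disables the flag; reproduced here under that assumption.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import spark.implicits._ // assumes spark-shell or an existing SparkSession named `spark`

val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0"))(
  (df, i) => df.withColumn(s"x$i", $"x0"))

val indexers = df.columns.tail.map(c => new StringIndexer()
  .setInputCol(c)
  .setOutputCol(s"${c}_indexed")
  .setHandleInvalid("skip"))

val encoders = indexers.map(indexer => new OneHotEncoder()
  .setInputCol(indexer.getOutputCol)
  .setOutputCol(s"${indexer.getOutputCol}_encoded")
  .setDropLast(true))

val stages: Array[PipelineStage] = indexers ++ encoders
val pipeline = new Pipeline().setStages(stages)

val startTime = System.nanoTime
pipeline.fit(df).transform(df).show
val runningTime = System.nanoTime - startTime
```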
Before this patch: 1786001 ms ~= 30 mins
After this patch: 49972 ms = less than 1 min
## How was this patch tested?
Jenkins tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.