Conversation

@viirya (Member) commented Mar 7, 2017

What changes were proposed in this pull request?

Constraint propagation can be computationally expensive and block driver execution for a long time. For example, the benchmark below takes about 30 minutes.

Compared with previous PRs #16998, #16785, this is a much simpler option: add a flag to disable constraint propagation.

Benchmark

Run the following code locally.

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.sql.internal.SQLConf

spark.conf.set(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key, false)

val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))

val indexers = df.columns.tail.map(c => new StringIndexer()
  .setInputCol(c)
  .setOutputCol(s"${c}_indexed")
  .setHandleInvalid("skip"))

val encoders = indexers.map(indexer => new OneHotEncoder()
  .setInputCol(indexer.getOutputCol)
  .setOutputCol(s"${indexer.getOutputCol}_encoded")
  .setDropLast(true))

val stages: Array[PipelineStage] = indexers ++ encoders
val pipeline = new Pipeline().setStages(stages)

val startTime = System.nanoTime
pipeline.fit(df).transform(df).show
val runningTime = System.nanoTime - startTime

Before this patch: 1786001 ms (~30 minutes)
After this patch: 26392 ms (less than half a minute)

Related PRs: #16998, #16785.

How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@viirya (Member, Author) commented Mar 7, 2017

cc @sameeragarwal @hvanhovell

This is a much simpler option: add a flag to disable constraint propagation, if we are OK with skipping the optimizations that rely on constraints (InferFiltersFromConstraints, part of PruneFilters, EliminateOuterJoin) when the flag is disabled, for the uncommon cases.

@SparkQA commented Mar 7, 2017

Test build #74071 has finished for PR 17186 at commit 44e494b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class InferFiltersFromConstraints(conf: CatalystConf)
  • case class PruneFilters(conf: CatalystConf) extends Rule[LogicalPlan] with PredicateHelper
  • case class EliminateOuterJoin(conf: CatalystConf) extends Rule[LogicalPlan] with PredicateHelper

- object InferFiltersFromConstraints extends Rule[LogicalPlan] with PredicateHelper {
+ case class InferFiltersFromConstraints(conf: CatalystConf)
+   extends Rule[LogicalPlan] with PredicateHelper {
    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
Contributor

Just check the flag before you start transforming the tree. That is a lot simpler & faster.
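
A minimal sketch of this suggestion (based on the diff above; the private inferFilters helper matches the shape that appears later in this review):

    case class InferFiltersFromConstraints(conf: CatalystConf)
      extends Rule[LogicalPlan] with PredicateHelper {

      def apply(plan: LogicalPlan): LogicalPlan = {
        // When the flag is off this rule is a pure no-op, so skip the tree
        // traversal entirely instead of guarding each case inside it.
        if (conf.constraintPropagationEnabled) inferFilters(plan) else plan
      }

      private def inferFilters(plan: LogicalPlan): LogicalPlan = plan transform {
        case filter @ Filter(condition, child) =>
          // ... original constraint-based filter inference, unchanged ...
          filter
      }
    }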

Member

+1; this rule is just a no-op if constraints aren't inferred.

Member Author

Done.

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
-   case f @ Filter(condition, j @ Join(_, _, RightOuter | LeftOuter | FullOuter, _)) =>
+   case f @ Filter(condition, j @ Join(_, _, RightOuter | LeftOuter | FullOuter, _))
+     if conf.constraintPropagationEnabled =>
Contributor

This is far too restrictive. We can still eliminate outer joins without constraint propagation.
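
A sketch of how this can be resolved (consistent with the getConstraints approach the PR adopts later in this thread; the join-type decision is elided): the rule keeps matching unconditionally, and only the inferred constraints are gated by the flag.

    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
      // No flag in the pattern guard: outer-join elimination still runs.
      case f @ Filter(condition, j @ Join(_, _, RightOuter | LeftOuter | FullOuter, _)) =>
        // User-written predicates are always considered; inferred constraints
        // are added only when propagation is enabled.
        val conditions = splitConjunctivePredicates(condition) ++
          f.getConstraints(conf.constraintPropagationEnabled)
        // ... derive the new join type from `conditions` as before ...
        f
    }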

@hvanhovell (Contributor)

@viirya could you add a test?

@viirya (Member, Author) commented Mar 7, 2017

@hvanhovell It is late in my local time, so I have addressed the comments first. I will add tests tomorrow.

@SparkQA commented Mar 7, 2017

Test build #74108 has finished for PR 17186 at commit ae9f037.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sameeragarwal (Member) left a comment

Instead of disabling every optimization, can't we just make lazy val constraints: ExpressionSet return ExpressionSet(Set.empty) in QueryPlan to achieve the same goal?
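
A sketch of the idea (illustrative only; as the replies below note, QueryPlan has no direct access to the conf, which is the main obstacle):

    // Inside QueryPlan: short-circuit constraint computation at the source,
    // so every rule that reads `constraints` sees an empty set.
    lazy val constraints: ExpressionSet =
      if (constraintPropagationEnabled) {  // assumes the flag were reachable here
        ExpressionSet(getRelevantConstraints(validConstraints))
      } else {
        ExpressionSet(Set.empty)
      }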

@viirya (Member, Author) commented Mar 7, 2017

@sameeragarwal To do that, we may need to change constraints into a method taking a CatalystConf. Since constraints is public, is that a good idea?

@viirya (Member, Author) commented Mar 8, 2017

@sameeragarwal Btw, another point: if we do that, we still need to transform the plan even when the flag is disabled.

@viirya (Member, Author) commented Mar 8, 2017

@sameeragarwal Rethinking it, I think having QueryPlan return constraints depending on the flag is easier to test. I will give it a try.

@viirya (Member, Author) commented Mar 8, 2017

@hvanhovell @sameeragarwal Following @sameeragarwal's comment, instead of disabling every optimization, a new method getConstraints in QueryPlan now returns empty constraints if the flag is disabled, and the original propagated constraints otherwise.

Please take a look. Thanks.
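
A sketch of the described change (the exact signature is assumed from this discussion; see the PR diff for the final form):

    // In QueryPlan: callers ask for constraints through this method instead of
    // reading the lazy val directly. Since `constraints` is lazy, disabling the
    // flag skips the potentially expensive constraint computation entirely.
    def getConstraints(constraintPropagationEnabled: Boolean): ExpressionSet =
      if (constraintPropagationEnabled) constraints
      else ExpressionSet(Set.empty)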

@viirya (Member, Author) commented Mar 8, 2017

Btw, test cases are added.

@viirya viirya force-pushed the add-flag-disable-constraint-propagation branch from c863f67 to 3eda726 Compare March 8, 2017 04:03
@SparkQA commented Mar 8, 2017

Test build #74171 has finished for PR 17186 at commit 8318152.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 8, 2017

Test build #74173 has finished for PR 17186 at commit c863f67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 8, 2017

Test build #74174 has finished for PR 17186 at commit 3eda726.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 9, 2017

Test build #74248 has started for PR 17186 at commit eb200d6.

@viirya (Member, Author) commented Mar 9, 2017

retest this please.

@SparkQA commented Mar 9, 2017

Test build #74253 has finished for PR 17186 at commit eb200d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public static class LongWrapper
  • public static class IntWrapper
  • case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] with PredicateHelper
  • case class JoinPlan(itemIds: Set[Int], plan: LogicalPlan, joinConds: Set[Expression], cost: Cost)
  • case class Cost(rows: BigInt, size: BigInt)
  • abstract class RepartitionOperation extends UnaryNode
  • case class FlatMapGroupsWithState(
  • class CSVOptions(
  • class UnivocityParser(
  • trait WatermarkSupport extends UnaryExecNode
  • case class FlatMapGroupsWithStateExec(

@viirya (Member, Author) commented Mar 10, 2017

@hvanhovell @sameeragarwal Please let me know if you have more thoughts on the new change. Thanks.

@sameeragarwal (Member)

Thanks @viirya, this approach makes sense to me. Can you please modify InferFiltersFromConstraints? I'll then take a closer look.

@sameeragarwal (Member)

Here's another instantiation of the underlying bug: https://issues.apache.org/jira/browse/SPARK-19875

.createWithDefault(false)

val CONSTRAINT_PROPAGATION_ENABLED = buildConf("spark.sql.constraintPropagation.enabled")
.internal()
Member

This should not be an internal flag, right? cc @sameeragarwal @hvanhovell

Member Author

Not sure about it, because constraint propagation is an internal detail.

Member

To determine whether a flag is internal or not, we should consider the impact on external users. If users could easily hit this, we might need to expose it as an external flag and document it in the public documentation.

Member Author (@viirya, Mar 14, 2017)

Given that a few users have reported hitting this issue, we may need to expose it as an external flag. But it looks like a less common issue compared with other external flags.

However, I would think that a large portion of external users may not know about constraint propagation. It might not be intuitive for them to link the problem they hit to constraint propagation and to find this config, even if it is external.

Member

Constraint propagation is like predicate pushdown, which is already exposed in external configurations. We can rename it to make it easier for external users to understand, e.g., constraint inference.

@SparkQA commented Mar 14, 2017

Test build #74473 has finished for PR 17186 at commit 0e204bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Mar 14, 2017

@sameeragarwal Thanks for the comment. I've updated InferFiltersFromConstraints.

@SparkQA commented Mar 15, 2017

Test build #74581 has started for PR 17186 at commit d3b0a72.

@SparkQA commented Mar 15, 2017

Test build #74588 has finished for PR 17186 at commit d4c9a5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class JsonDataSource extends Serializable

val optimized = OptimizeDisableConstraintPropagation.execute(queryWithUselessFilter.analyze)
// When constraint propagation is disabled, the useless filter won't be pruned.
// It gets pushed down. Because the rule `CombineFilters` runs only once, there are redundant
// and duplicate filters.
Contributor

This behaviour does not make sense to me. If I write a query like

select * from (select * from t1 where t1.a1 > 1) tx where tx.a1 > 1

I expect Spark to evaluate the predicate only once. The wording "constraint propagation" is misleading: in this example, there is no propagation at all. Perhaps we want to distinguish the "constraints" written originally from the ones inferred from relationships with other predicates. When "propagation" (or perhaps a more meaningful term, "predicate inference") is set to OFF, we want to exclude those inferred predicates from def constraints.

Member Author

To clarify: this behaviour is limited to this test case, which is why I added the comment. In normal optimization, CombineFilters runs multiple times and the predicates get combined.
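
An illustration of the scenario (hypothetical relation and attribute names, in the style of Catalyst's optimizer test DSL):

    import org.apache.spark.sql.catalyst.dsl.expressions._
    import org.apache.spark.sql.catalyst.dsl.plans._
    import org.apache.spark.sql.catalyst.plans.logical.LocalRelation

    val testRelation = LocalRelation('a.int, 'b.int)

    // The same predicate written twice at different levels of the plan.
    val query = testRelation
      .where('a > 1)
      .select('a)
      .where('a > 1)

    // Fixed-point optimizer: predicate push-down moves the outer filter below
    // the Project and CombineFilters merges it with the inner one, so the
    // duplicate disappears. A test batch that runs Once stops after the
    // push-down, leaving the redundant filter in place.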

Contributor (@nsyca, Mar 15, 2017)

I am aware of that. My point is that when users turn on this setting in the hope of alleviating the long compilation time, they will get the "unintentional" side effect of possibly lengthening execution time by evaluating the same predicate twice.

Overall, I agree with your approach, but the point I raised could be follow-up work.

Member Author

This is a short-term workaround. I actually proposed another approach that introduces a new data structure for constraint propagation in #16998, but it is more complex and may need more time to consider and review.

@viirya (Member, Author) commented Mar 20, 2017

ping @sameeragarwal This is updated according to your previous comment. Can you help review this? Thanks.

@viirya (Member, Author) commented Mar 22, 2017

ping @sameeragarwal Is it possible for this to go in before the 2.2 code freeze? Please let me know, thanks.

@SparkQA commented Mar 22, 2017

Test build #75043 has finished for PR 17186 at commit da09d9f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sameeragarwal (Member)

Sorry @viirya, I'll review this first thing tomorrow morning.

@viirya (Member, Author) commented Mar 23, 2017

@sameeragarwal Thanks a lot.

@sameeragarwal (Member) left a comment

This looks great overall, I just left a few minor comments. Thanks!


private def inferFilters(plan: LogicalPlan): LogicalPlan = plan transform {
case filter @ Filter(condition, child) =>
val constraintEnabled = conf.constraintPropagationEnabled
Member

this is unused

Member Author

Oh, I missed that.


val CONSTRAINT_PROPAGATION_ENABLED = buildConf("spark.sql.constraintPropagation.enabled")
.internal()
.doc("When true, the query optimizer will use constraint propagation in query plans to " +
Member

nit: 'get around the issue' might sound pretty vague to a non-expert user. How about something along these lines?

    .doc("When true, the query optimizer will infer and propagate data constraints in the query " +
      "plan to optimize them. Constraint propagation can sometimes be computationally expensive" +
      "for certain kinds of query plans (such as those with a large number of predicates and " +
      "aliases) which might negatively impact overall runtime.")

Member Author

Looks good.
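
Assembled from the fragments above, the complete flag definition then reads roughly as follows (a sketch; the default is assumed to be true so that propagation stays on unless explicitly disabled):

    val CONSTRAINT_PROPAGATION_ENABLED = buildConf("spark.sql.constraintPropagation.enabled")
      .internal()
      .doc("When true, the query optimizer will infer and propagate data constraints in the " +
        "query plan to optimize them. Constraint propagation can sometimes be computationally " +
        "expensive for certain kinds of query plans (such as those with a large number of " +
        "predicates and aliases) which might negatively impact overall runtime.")
      .booleanConf
      .createWithDefault(true)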

CombineFilters) :: Nil
}

object OptimizeDisableConstraintPropagation extends RuleExecutor[LogicalPlan] {
Member

nit: perhaps OptimizeWithConstraintPropagationDisabled?

PushPredicateThroughJoin) :: Nil
}

object OptimizeDisableConstraintPropagation extends RuleExecutor[LogicalPlan] {
Member

nit: same as above

    EliminateSubqueryAliases) ::
  Batch("Outer Join Elimination", Once,
-   EliminateOuterJoin,
+   EliminateOuterJoin(SimpleCatalystConf(caseSensitiveAnalysis = true)),
Member

Can we add a test for outer join elimination as well?

Member Author

Added a test.

@viirya (Member, Author) commented Mar 24, 2017

@sameeragarwal Thanks for the review! I've addressed all the comments now. Please take a look when you have a chance.

@SparkQA commented Mar 24, 2017

Test build #75139 has finished for PR 17186 at commit a02c8cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sameeragarwal (Member)

LGTM, thanks! cc @hvanhovell

@rxin (Contributor) commented Mar 24, 2017

Merging in master.

@asfgit asfgit closed this in e011004 Mar 24, 2017
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Aug 20, 2018

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes apache#17186 from viirya/add-flag-disable-constraint-propagation.
@xyxiaoyou
May I ask how to solve this issue in Spark 2.3?
I see that this flag has been removed from PruneFilters and EliminateOuterJoin. @viirya @gatorsmile

@gatorsmile (Member)

@xyxiaoyou SQLConf.get ?
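
What this likely refers to (a sketch; in Spark 2.3+ optimizer rules read the active session's SQLConf statically instead of taking a CatalystConf constructor parameter):

    import org.apache.spark.sql.internal.SQLConf

    // Inside a rule: read the flag from the thread-local active conf.
    val enabled = SQLConf.get.constraintPropagationEnabled

    // Users can still disable it per session, exactly as in the benchmark above:
    spark.conf.set(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key, false)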

@xyxiaoyou

> @xyxiaoyou SQLConf.get ?

What I want to ask is whether this issue has been resolved in Spark 2.3/2.4. Is there a more appropriate solution? I ask because I found that more complex SQL can cause the executor to get stuck.

@xyxiaoyou
Hi @gatorsmile, when using 'create or replace view test_view as ...', Spark will first run a 'select ...' query job. This makes creating the view particularly slow. Is there a switch so that Spark skips the query, or creates the view faster?

(screenshot omitted)

@xyxiaoyou

(screenshots omitted)

@xyxiaoyou
@gatorsmile This problem has been bothering our team for a long time. I hope you can give us some suggestions or help us solve it. Thanks a lot.

@viirya viirya deleted the add-flag-disable-constraint-propagation branch December 27, 2023 18:34
@ahshahid
@xyxiaoyou : for your reference: take a look at https://issues.apache.org/jira/browse/SPARK-33152
and the corresponding PR (though it has been closed, the solution is there).
