[SPARK-19443][SQL] The function to generate constraints takes too long when the query plan grows continuously #16785
Conversation
Test build #72303 has started for PR 16785 at commit

retest this please.

Test build #72305 has finished for PR 16785 at commit
The rewritten logic is not correct. I am working to improve this with another approach.
I haven't found a way to improve getAliasedConstraints. We may consider #16775, which is another solution that fixes this issue by checkpointing datasets for pipelines with long stages, or both of them.
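For context, the checkpointing approach truncates the Dataset's lineage so the plan stops growing. A minimal sketch of the idea at the Dataset level (not the actual change in #16775; the checkpoint directory is illustrative):

    // Checkpointing materializes the Dataset and replaces its logical plan
    // with a fresh scan of the checkpointed data, so constraints no longer
    // accumulate over the whole pipeline lineage.
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    val truncated = df.checkpoint()  // eager by default; returns a Dataset with a truncated plan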
since this change is related to SQL, cc @cloud-fan @hvanhovell |
Test build #72625 has finished for PR 16785 at commit
    val parAllConstraints = child.constraints.asInstanceOf[Set[Expression]].filter { constraint =>
      constraint.references.intersect(relativeReferences).nonEmpty
    }.par
    parAllConstraints.tasksupport = UnaryNode.taskSupport
Why are we using a custom task support instead of the default (which uses the global fork-join executor)?
Do they have the same parallelism level? BTW, I saw that the parallel collections used in other places in Spark all take a custom task support.
Whether they do or not depends on the implementation of the default task support. But even if they use the same level of parallelism, they're distinct executors, which means they won't share a common thread pool or task queue. I don't know why Spark uses custom task support in other places; it may be to avoid engaging all of the CPU cores on the host machine. But then it seems more efficient for Spark to have its own global task support.
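For reference, attaching a custom task support to a Scala parallel collection looks roughly like this (a minimal sketch assuming Scala 2.12-style parallel collections; the pool size of 8 is illustrative, not Spark's setting):

    import java.util.concurrent.ForkJoinPool
    import scala.collection.parallel.ForkJoinTaskSupport

    // A dedicated pool with an explicit parallelism level. Without this,
    // parallel collections share the global fork-join pool.
    val pool = new ForkJoinPool(8)

    val parData = (1 to 1000).par
    // Route this collection's tasks through the custom pool.
    parData.tasksupport = new ForkJoinTaskSupport(pool)
    val evens = parData.filter(_ % 2 == 0)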
@viirya this looks like a very big hammer to solve this problem. Can't we try a different approach? I think we should try to avoid optimizing already-optimized code snippets; you might be able to do this using some kind of fence. It would be even better if we had a recursive node.
@hvanhovell Yeah, I think so. As in my previous comment, I haven't found a way to improve getAliasedConstraints itself. We may consider #16775, which is another solution that fixes this issue by checkpointing datasets for pipelines with long stages.
Can we consider this from a higher-level view instead of focusing on the method getAliasedConstraints?
@cloud-fan Yeah, I agree with you and @hvanhovell. For the too-slow constraint propagation, apart from attacking getAliasedConstraints itself, we can look for a higher-level fix. If we can't find one, then for such long lineages I think we should use checkpointing to fix it, like #16775.
@cloud-fan @hvanhovell OK. I've figured out how to add a filter that reduces the candidates for aliased constraints. It achieves the same speed-up (cutting the benchmark's running time in half) without the parallel-collection hammer. Do you have time to look at it? Thanks.
Force-pushed from 4ba93fe to 278c31c.
Test build #73034 has finished for PR 16785 at commit

Test build #73035 has finished for PR 16785 at commit
    // For example, for a constraint 'a > b', if 'a' is aliased to 'c', we need to get the
    // aliased constraint 'c > b' only if 'b' is in the output.
    var allConstraints = child.constraints.filter { constraint =>
      constraint.references.subsetOf(relativeReferences)
    }
Yes. You can see the benchmark in the PR description; with these attributes pruned, the running time is cut in half.

If I understand your comment correctly, pruning them later in QueryPlan means we prune constraints that don't refer to attributes in outputSet. But the pruning here happens before the pruning you pointed out: we need to reduce the set of constraints considered when transforming aliased attributes, to lower the computation cost.
The following commit message, from #17186 (SPARK-19846), references this PR:

## What changes were proposed in this pull request?

Constraint propagation can be computationally expensive and block the driver execution for a long time. For example, the benchmark below needs 30 minutes. Compared with the previous PRs #16998 and #16785, this is a much simpler option: add a flag to disable constraint propagation.

### Benchmark

Run the following code locally.

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
    import org.apache.spark.sql.internal.SQLConf

    spark.conf.set(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key, false)

    val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0"))(
      (df, i) => df.withColumn(s"x$i", $"x0"))

    val indexers = df.columns.tail.map(c => new StringIndexer()
      .setInputCol(c)
      .setOutputCol(s"${c}_indexed")
      .setHandleInvalid("skip"))

    val encoders = indexers.map(indexer => new OneHotEncoder()
      .setInputCol(indexer.getOutputCol)
      .setOutputCol(s"${indexer.getOutputCol}_encoded")
      .setDropLast(true))

    val stages: Array[PipelineStage] = indexers ++ encoders
    val pipeline = new Pipeline().setStages(stages)

    val startTime = System.nanoTime
    pipeline.fit(df).transform(df).show
    val runningTime = System.nanoTime - startTime

Before this patch: 1786001 ms ~= 30 mins
After this patch: 26392 ms = less than half of a minute

Related PRs: #16998, #16785.

## How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #17186 from viirya/add-flag-disable-constraint-propagation.
What changes were proposed in this pull request?
This issue was originally reported and discussed at http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tc20803.html
When running an ML Pipeline with many stages, during the iterative updating of the Dataset, it is observed that the fit and transform take longer and longer to finish as the query plan grows continuously. The example code is shown in the benchmark below.

Specifically, the time spent preparing the optimized plan on the current branch is much higher than on 1.6. The time is mostly spent generating the query plan's constraints during a few optimization rules.
getAliasedConstraints is found to be the function that costs most of the running time, as the constraints for aliasing grow very fast. This patch tries to improve the performance of getAliasedConstraints.
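To see why the aliased constraints can grow so fast, consider a toy expansion in plain Scala (illustrative only, not Catalyst's actual getAliasedConstraints; the aliases map and expand helper are hypothetical):

    // A constraint mentioning k aliased attributes expands into up to 2^k
    // variants: each mention can stay as-is or be replaced by its alias.
    val aliases = Map("a" -> "a1", "b" -> "b1")

    def expand(constraint: Set[String]): Set[Set[String]] =
      constraint.foldLeft(Set(Set.empty[String])) { (acc, attr) =>
        val options = Set(attr) ++ aliases.get(attr)  // original and/or alias
        acc.flatMap(partial => options.map(partial + _))
      }

    // expand(Set("a", "b")) yields four variants:
    // {a, b}, {a1, b}, {a, b1}, {a1, b1}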
Benchmark

Run the following code locally.
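The benchmark is presumably the same ML-pipeline snippet quoted in the #17186 commit message above, minus the spark.conf.set line (that flag did not exist yet at the time of this PR); it is reproduced here so the description is self-contained:

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

    // 41 string columns sharing the same values.
    val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0"))(
      (df, i) => df.withColumn(s"x$i", $"x0"))

    // One StringIndexer plus one OneHotEncoder per column: 80 pipeline stages.
    val indexers = df.columns.tail.map(c => new StringIndexer()
      .setInputCol(c)
      .setOutputCol(s"${c}_indexed")
      .setHandleInvalid("skip"))

    val encoders = indexers.map(indexer => new OneHotEncoder()
      .setInputCol(indexer.getOutputCol)
      .setOutputCol(s"${indexer.getOutputCol}_encoded")
      .setDropLast(true))

    val stages: Array[PipelineStage] = indexers ++ encoders
    val pipeline = new Pipeline().setStages(stages)

    val startTime = System.nanoTime
    pipeline.fit(df).transform(df).show
    val runningTime = System.nanoTime - startTime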
Before this patch: 1786001 ms
After this patch: 843688 ms
More than half of the original running time is saved.
How was this patch tested?
Jenkins tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.