
Conversation

@viirya
Member

viirya commented May 18, 2019

What changes were proposed in this pull request?

A performance issue with explode was found: when a complex field contains a huge array, that field gets duplicated once per exploded array element. For example:

// M is the number of elements in the generated array (the benchmark uses a large value)
val df = spark.sparkContext.parallelize(Seq(("1",
  Array.fill(M)({
    val i = math.random
    (i.toString, (i + 1).toString, (i + 2).toString, (i + 3).toString)
  })))).toDF("col", "arr")
  .selectExpr("col", "struct(col, arr) as st")
  .selectExpr("col", "st.col as col1", "explode(st.arr) as arr_col")

The explode causes st to be duplicated as many times as there are exploded elements.

Benchmark before the change:

[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
[info] Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
[info] generate big nested struct array:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] generate big nested struct array wholestage off          52668          53162         699          0.0      877803.4       1.0X
[info] generate big nested struct array wholestage on          47261          49093        1125          0.0      787690.2       1.1X
[info]

The query plan:

== Physical Plan ==
 Project [col#508, st#512.col AS col1#515, arr_col#519]
 +- Generate explode(st#512.arr), [col#508, st#512], false, [arr_col#519]
    +- Project [_1#503 AS col#508, named_struct(col, _1#503, arr, _2#504) AS st#512]
       +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#503, mapobjects(MapObjects_loopValue84, MapObjects_loopIsNull84,      ObjectType(class scala.Tuple4), if (isnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true)))     null else named_struct(_1, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._1, true, false), _2, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._2, true, false), _3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String,     StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._3, true,  false), _4, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84,   MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._4, true, false)), knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, None) AS _2#504]
          +- Scan[obj#534]

This patch takes a nested-column-pruning approach to prune unnecessary nested fields. It adds a projection of the needed nested fields as aliases on the child of Generate, and substitutes them with alias attributes in the projection on top of Generate.
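The rewrite can be illustrated with a small self-contained sketch. Note that `Expr`, `Attr`, `GetField`, `Alias`, and `pruneNestedFields` below are toy names invented for illustration, not Spark's actual Catalyst classes: nested field accesses in the project list above Generate are each given a fresh alias, the alias expression is pushed into the child Project, and the parent keeps only a reference to the alias, so only the small field (not the whole struct) flows through Generate.

```scala
// Toy expression tree modeling the idea behind the rule (not Catalyst itself).
sealed trait Expr
case class Attr(name: String) extends Expr                   // a column reference
case class GetField(child: Expr, field: String) extends Expr // e.g. st.col
case class Alias(child: Expr, name: String) extends Expr     // expr AS name

// Replace each nested field access with a reference to a fresh alias,
// and collect the aliases that must be computed below Generate.
def pruneNestedFields(projectList: Seq[Expr]): (Seq[Expr], Seq[Alias]) = {
  var aliases = Vector.empty[Alias]
  def rewrite(e: Expr): Expr = e match {
    case g: GetField =>
      val a = Alias(g, s"_gen_alias_${aliases.size}")
      aliases :+= a
      Attr(a.name)              // the parent now references the alias only
    case Alias(c, n) => Alias(rewrite(c), n)
    case other => other
  }
  (projectList.map(rewrite), aliases)
}

// `st.col AS col1` above Generate becomes `_gen_alias_0 AS col1`,
// while `st.col AS _gen_alias_0` is added to the Project below Generate.
val (top, pushedDown) =
  pruneNestedFields(Seq(Alias(GetField(Attr("st"), "col"), "col1")))
```

This mirrors the `_gen_alias_608#608` attribute visible in the optimized plan below.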

Benchmark after the change:

 [info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
 [info] Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
 [info] generate big nested struct array:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 [info] ------------------------------------------------------------------------------------------------------------------------
 [info] generate big nested struct array wholestage off            311            331          28          0.2        5188.6       1.0X
 [info] generate big nested struct array wholestage on            297            312          15          0.2        4947.3       1.0X
 [info]

The query plan:

== Physical Plan ==
 Project [col#592, _gen_alias_608#608 AS col1#599, arr_col#603]
 +- Generate explode(st#596.arr), [col#592, _gen_alias_608#608], false, [arr_col#603]
    +- Project [_1#587 AS col#592, named_struct(col, _1#587, arr, _2#588) AS st#596, _1#587 AS _gen_alias_608#608]
       +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(in
 put[0, scala.Tuple2, true]))._1, true, false) AS _1#587, mapobjects(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4),
 if (isnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))) null else named_struct(_1,        staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue102,              MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._1, true, false), _2, staticinvoke(class org.apache.spark.unsafe.types.UTF8String,    StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._2,      true, false), _3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString,                                                 knownnotnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._3, true, false), _4,            staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue102,              MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._4, true, false)), knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2,      None) AS _2#588]
          +- Scan[obj#586]

This behavior is controlled by a SQL config spark.sql.optimizer.expression.nestedPruning.enabled.
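The optimization could then be toggled at runtime, for example like this (a minimal sketch; it assumes an active SparkSession named `spark`):

```scala
// Enable the nested-field pruning for Generate introduced by this patch.
spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", "true")

// Or disable it to compare query plans with and without the pruning.
spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", "false")
```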

How was this patch tested?

Added benchmark.

@SparkQA

SparkQA commented May 18, 2019

Test build #105516 has finished for PR 24637 at commit 6cab5ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented May 19, 2019

viirya changed the title from "[SPARK-27707][SQL] Prune unnecessary nested fields from Generate" to "[SPARK-27707][SQL] Prune unnecessary nested fields from Generate to address performance issue in explode" on May 19, 2019
@uzadude
Contributor

uzadude commented May 19, 2019

@viirya - looks great! Exactly what I had in mind, but wasn't sure how to implement it.

@SparkQA

SparkQA commented May 19, 2019

Test build #105522 has finished for PR 24637 at commit f036649.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented May 19, 2019

retest this please.

@SparkQA

SparkQA commented May 19, 2019

Test build #105526 has finished for PR 24637 at commit f036649.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

cc @dbtsai

case Project(projectList, child)
    if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
  getAliasSubMap(projectList)
case Project(projectList, child) => getAliasSubMap(projectList)
Member

dongjoon-hyun commented May 20, 2019

@viirya. Sorry, but this is a regression on all the existing code. We should avoid the getAliasSubMap invocation. https://github.com/apache/spark/pull/24637/files#diff-a636a87d8843eeccca90140be91d4fafR635 doesn't prevent the getAliasSubMap invocation inside unapply, does it?

Member Author

I see. If so, I need to make a little change to prevent it. Will change it later.

Member

Thanks!

@SparkQA

SparkQA commented May 20, 2019

Test build #105564 has finished for PR 24637 at commit beb8993.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 20, 2019

Test build #105568 has finished for PR 24637 at commit caea246.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

p.copy(child = g.copy(child = newChild, unrequiredChildIndex = unrequiredIndices))

// prune unrequired nested fields
case p @ Project(projectList, g: Generate) =>
Contributor

Why do we need to special-case Generate? I see there is a general case for nested column pruning below.

Member Author

I think this is the more general case you are looking for:

case p @ NestedColumnAliasing(nestedFieldToAlias, attrToAliases) =>
NestedColumnAliasing.replaceToAliases(p, nestedFieldToAlias, attrToAliases)

The general case does pruning only when the flag (nestedSchemaPruningEnabled) is enabled, and it only considers operators that a nested projection can be pushed through. Generate isn't one of them, so the general case doesn't cover this one.

Member Author

Generate is special for another reason: we can't prune an output from its child even if only a nested field of that output is used in the top project list, because the generator itself could still use it.

Member

Shall we add if SQLConf.get.nestedSchemaPruningEnabled?

Member Author

nestedSchemaPruningEnabled is for pruning nested fields from a logical relation, but this fix doesn't address the same cause. Even for data sources whose nested fields can't be pruned, it is still useful to apply this fix.

Member

Out of curiosity, if spark.sql.optimizer.nestedSchemaPruning.enabled isn't part of this optimization, why don't we need another configuration, or fix the doc of spark.sql.optimizer.nestedSchemaPruning.enabled?

Member Author

This looks like a general fix regardless of the nestedSchemaPruning setting. Do we need a config to disable it?

Member

My impression was that we need a configuration, but I think you or @dongjoon-hyun have more context than me about nested pruning. @cloud-fan, @dongjoon-hyun, @gatorsmile, can you make a call here on whether we need a config or not?

Member Author

nestedSchemaPruning is vague, as it can refer to nested pruning in scans or to general nested pruning in other operators. Currently the config affects scan nested pruning. I feel it is not tightly related to the fix here, because this fix isn't for scans.

If we are considering a config here, I think we should either add a different config or fix the doc of nestedSchemaPruning to explicitly indicate that it also covers general nested pruning.

Member Author

Let me fix the doc of nestedSchemaPruning and apply this pruning when nestedSchemaPruning is enabled.

@SparkQA

SparkQA commented May 22, 2019

Test build #105650 has finished for PR 24637 at commit 7a790ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 14, 2019

Test build #106494 has finished for PR 24637 at commit ef97ffc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Jun 16, 2019

Test build #106546 has finished for PR 24637 at commit ef97ffc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Jun 19, 2019

Test build #106651 has finished for PR 24637 at commit ef97ffc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

range/limit/sum:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
range/limit/sum wholestage off                      191            205          19       2738.4           0.4       1.0X
range/limit/sum wholestage on                       112            124          13       4699.4           0.2       1.7X
Member

Ur, this is irrelevant, but the ratio looks weird.

Member

This seems to be improved for some reason. It is consistently better in both @viirya's and my tests.

[info] OpenJDK 64-Bit Server VM 1.8.0_212-b04 on Linux 3.10.0-862.3.2.el7.x86_64
[info] Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
[info] range/limit/sum:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] range/limit/sum wholestage off                      222            226           6       2359.5           0.4       1.0X
[info] range/limit/sum wholestage on                       114            121           8       4608.4           0.2       2.0X

@viirya
Member Author

viirya commented Jul 12, 2019

Thanks @dongjoon-hyun for the advice! I will add a few more test cases targeting other Generators.

@dongjoon-hyun
Member

Thank you so much, @viirya !

@dongjoon-hyun
Member

Thank you for adding a new test. Are you going to add more tests since Stack is one of them? In fact, we need more to be exhaustive.

Another approach is simply reducing the scope to the original goal. We could match only Explode in this PR, like the following:

case p @ Project(projectList, g: Generate) if ...
case p @ Project(projectList, g @ Generate(_: Explode, _, _, _, _, _)) if ...

Later, to cover more patterns, I think we need unapply and a white-list approach like the following.

  private def canProjectPushThrough(plan: LogicalPlan) = plan match {
    case _: GlobalLimit => true
    ...
    case _ => false
  }

Since this PR has been here for a long time, how about finishing with Explode first, @viirya?

@SparkQA

SparkQA commented Jul 17, 2019

Test build #107788 has finished for PR 24637 at commit 9c225f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jul 18, 2019

@dongjoon-hyun I added more generators. I think all existing generators should now be covered by the test.

@SparkQA

SparkQA commented Jul 18, 2019

Test build #107827 has finished for PR 24637 at commit 5511445.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jul 18, 2019

retest this please

@SparkQA

SparkQA commented Jul 18, 2019

Test build #107834 has finished for PR 24637 at commit 5511445.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

dongjoon-hyun commented Jul 18, 2019

Thank you for adding more. Then, please use a white-list approach for those four expressions.

case p @ Project(projectList, g @ Generate(e: Explode, _, _, _, _, _)) if canPruneXXX(e) && 

cc @cloud-fan and @gatorsmile

@viirya
Member Author

viirya commented Jul 19, 2019

I'm fine with adding a white-list, though I think this approach is not generator-specific. It is more conservative and safer, anyway.

@SparkQA

SparkQA commented Jul 19, 2019

Test build #107870 has finished for PR 24637 at commit 0821444.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

dongjoon-hyun left a comment

+1, LGTM.
Thank you so much for working on this.
Merged to master.

@cloud-fan
Contributor

We hit an exception caused by this rule. The plan becomes invalid after optimization:

+- *(2) !Project [_gen_alias_68718#68718L AS cardinality#68575L, _gen_alias_68719#68719 AS durationSec#68576, _gen_alias_68720#68720 AS group#68578, _gen_alias_68721#68721 AS jobUuid#68579, _gen_alias_68722#68722 AS suite#68584, _gen_alias_68723#68723 AS testcase#68585, sha1(cast(_gen_alias_68721#68721 as binary)) AS jobSha#68600, sha1(cast(concat(_gen_alias_68722#68722, -, _gen_alias_68720#68720, -, cast(_gen_alias_68718#68718L as string), -, _gen_alias_68723#68723) as binary)) AS caseSha#68615]
      +- *(2) Generate explode(results#64594), false, [flattenRuns#68572]
         +- *(2) Project [results#64594]
            +- *(2) Sort [startTime#68717 DESC NULLS LAST], true, 0

We generate _gen_alias attributes in the parent Project but they are not available in the child Generate.

@viirya Can you help to take a look? thanks!

@viirya
Member Author

viirya commented Jan 8, 2020

@cloud-fan Yes, I will look at it tomorrow. Do you have a test case? If not, I will try to reproduce it.

@cloud-fan
Contributor

It's a very long query and I'm trying to minimize it. Let's see if there is some clue in the query plan.

@dongjoon-hyun
Member

Thank you for reporting, @cloud-fan !

@dongjoon-hyun
Member

Could you file a JIRA for that, @cloud-fan ?

@viirya
Member Author

viirya commented Jan 8, 2020

Is flattenRuns (generatorOutput) also not in the parent Project?

@viirya
Member Author

viirya commented Jan 8, 2020

I re-checked the current rule and still cannot find a clue from it or the above query plan. At first glance, I suspected there might be a nested column access in the top Project but not in Generate, but I tested that and re-checked the rule, and it looks fine. @cloud-fan, do you have more clues?
