Add metrics and cost tests for partition pruning effectiveness #5
Conversation
.createOrReplaceTempView("jt_array")

setConf(HiveUtils.CONVERT_METASTORE_PARQUET, true)
assert(spark.sqlContext.getConf(HiveUtils.CONVERT_METASTORE_PARQUET.key) == "true")
Can you explain why you made this change?
This should no longer be needed since the flag value is true by default. I changed it to an assert to validate this.
This lets us get rid of the setConf(..., false) in the afterAll(), which was causing the conf value to be leaked to other suites.
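To illustrate the leak this thread describes, here is a minimal sketch of the assert-the-default pattern, assuming a ScalaTest-based Hive test suite with access to the test `spark` session; the class name is illustrative, not the actual suite touched by this PR:

```scala
import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.hive.HiveUtils
import org.apache.spark.sql.hive.test.TestHiveSingleton

// Illustrative suite name; the real change is in an existing Parquet conversion suite.
class ParquetConversionConfSuite extends SparkFunSuite with TestHiveSingleton {
  override def beforeAll(): Unit = {
    super.beforeAll()
    // CONVERT_METASTORE_PARQUET already defaults to true, so validate it instead of
    // setting it. With nothing mutated here, afterAll() no longer needs a
    // setConf(..., false) that could leak "false" into whichever suite runs next
    // in the same JVM.
    assert(spark.sqlContext.getConf(HiveUtils.CONVERT_METASTORE_PARQUET.key) == "true")
  }
}
```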
Please rebase your changes off of 765f93c.

^ just did

Sorry, I'm not seeing that. I still see 21 commits in this PR.

Ah, I did a merge. You can "squash and merge" into the pr branch, right?
val METRIC_PARTITIONS_FETCHED = metricRegistry.counter(MetricRegistry.name("partitionsFetched"))

/**
 * Tracks the total number of files discovered off of S3 by ListingFileCatalog.
I don't see how this is specific to S3.
Fixed
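Since this thread is about these counters, here is a small, self-contained sketch of the Dropwizard-metrics pattern the quoted code uses. This is not the actual `HiveCatalogMetrics` object and the helper names are made up; it only shows the idea: named counters on a `MetricRegistry`, incremented where partitions are fetched or files are listed, and reset between test cases so each assertion sees a single query's cost.

```scala
import com.codahale.metrics.MetricRegistry

object CatalogMetricsSketch {
  private val metricRegistry = new MetricRegistry

  // Counts partition metadata fetches from the metastore.
  val METRIC_PARTITIONS_FETCHED = metricRegistry.counter(MetricRegistry.name("partitionsFetched"))
  // Counts files discovered by file listing (not specific to S3).
  val METRIC_FILES_DISCOVERED = metricRegistry.counter(MetricRegistry.name("filesDiscovered"))

  def incrementFetchedPartitions(n: Int): Unit = METRIC_PARTITIONS_FETCHED.inc(n)
  def incrementFilesDiscovered(n: Int): Unit = METRIC_FILES_DISCOVERED.inc(n)

  // Tests call this before running a query so getCount() reflects only that query.
  def reset(): Unit = {
    METRIC_PARTITIONS_FETCHED.dec(METRIC_PARTITIONS_FETCHED.getCount)
    METRIC_FILES_DISCOVERED.dec(METRIC_FILES_DISCOVERED.getCount)
  }
}
```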
  }
}

test("late partition pruning reads only necessary partition data") {
I don't know what you mean by "late" here. Did you mean "lazy"?
Done
// of doing plan cache validation based on the entire partition set.
HiveCatalogMetrics.reset()
spark.sql("select * from test where partCol1 = 999").count()
assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount() == 10)
I would expect this to be 5 because this table has 5 partitions. Why does the test expect 10?
The first 5 are from resolving the table, and the latter 5 are from ListingFileCatalog. It is possible to optimize this down to 5, but it didn't seem worth the cost since this is (1) legacy mode and (2) not a regression.
Hm, maybe I can break it up into analysis and execution to make it more clear.
Not easy, so just added a comment here.
Thanks for the clarification. I think that adding the comment is good enough.
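For readers following along, here is the quoted assertion again with the breakdown from this thread spelled out as comments. The table name and counts come from the quoted test and the discussion above, not from running it here:

```scala
// The table `test` has 5 partitions. In legacy mode the metadata cost splits as:
//   5 partition fetches while resolving the table during analysis, plus
//   5 more when ListingFileCatalog lists the partitions for execution.
HiveCatalogMetrics.reset()
spark.sql("select * from test where partCol1 = 999").count()
assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount() == 10)
```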
* [SPARK-16980][SQL] Load only catalog table partition metadata required to answer a query
* Add a new catalyst optimizer rule to SQL core for pruning unneeded partitions' files from a table file catalog
* Include the type of file catalog in the FileSourceScanExec metadata
* TODO: Consider renaming FileCatalog to better differentiate it from BasicFileCatalog (or vice-versa)
* try out parquet case insensitive fallback
* Refactor the FileSourceScanExec.metadata val to make it prettier
* fix and add test for input files
* rename test
* Refactor `TableFileCatalog.listFiles` to call `listDataLeafFiles` once instead of once per partition
* fix it
* more test cases
* also fix a bug with zero partitions selected
* feature flag
* add comments
* extend and fix flakiness in test
* Enhance `ParquetMetastoreSuite` with mixed-case partition columns
* Tidy up a little by removing some unused imports, an unused method and moving a protected method down and making it private
* Put partition count in `FileSourceScanExec.metadata` for partitioned tables
* Fix some errors in my revision of `ParquetSourceSuite`
* Thu Oct 13 17:18:14 PDT 2016
* more generic
* Thu Oct 13 18:09:42 PDT 2016
* Thu Oct 13 18:09:55 PDT 2016
* Thu Oct 13 18:22:31 PDT 2016
## What changes were proposed in this pull request?
This PR optimizes grouping expressions by removing repeated (semantically equal) expressions. A new rule, `RemoveRepetitionFromGroupExpressions`, is added.
**Before**
```scala
scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain()
== Physical Plan ==
WholeStageCodegen
: +- TungstenAggregate(key=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9], functions=[], output=[(a + 1)#5])
: +- INPUT
+- Exchange hashpartitioning((a#0 + 1)#6, (1 + a#0)#7, (A#0 + 1)#8, (1 + A#0)#9, 200), None
+- WholeStageCodegen
: +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6,(1 + a#0) AS (1 + a#0)#7,(A#0 + 1) AS (A#0 + 1)#8,(1 + A#0) AS (1 + A#0)#9], functions=[], output=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9])
: +- INPUT
+- LocalTableScan [a#0], [[1],[2]]
```
**After**
```scala
scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain()
== Physical Plan ==
WholeStageCodegen
: +- TungstenAggregate(key=[(a#0 + 1)#6], functions=[], output=[(a + 1)#5])
: +- INPUT
+- Exchange hashpartitioning((a#0 + 1)#6, 200), None
+- WholeStageCodegen
: +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6], functions=[], output=[(a#0 + 1)#6])
: +- INPUT
+- LocalTableScan [a#0], [[1],[2]]
```
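For context, a minimal sketch of how such an optimizer rule can be written in Catalyst, assuming `ExpressionSet` is used to deduplicate semantically equal grouping expressions; this sketches the idea rather than reproducing the exact rule as merged:

```scala
import org.apache.spark.sql.catalyst.expressions.ExpressionSet
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Drop grouping expressions that are semantically equal to an earlier one,
// e.g. `a + 1`, `1 + a`, and `A + 1` (after case-insensitive resolution) collapse to one key.
object RemoveRepetitionFromGroupExpressionsSketch extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case agg @ Aggregate(grouping, _, _) =>
      agg.copy(groupingExpressions = ExpressionSet(grouping).toSeq)
  }
}
```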
## How was this patch tested?
Passes the Jenkins tests (with a new test case).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes apache#12590 from dongjoon-hyun/SPARK-14830.
(cherry picked from commit 6e63201)
Signed-off-by: Michael Armbrust <michael@databricks.com>
## What changes were proposed in this pull request?
Implements the Every, Some, and Any aggregates in SQL. These new aggregate expressions are analyzed in the normal way and rewritten to equivalent existing aggregate expressions in the optimizer.
Every(x) => Min(x) where x is boolean.
Some(x) => Max(x) where x is boolean.
Any is a synonym for Some.
SQL
```
explain extended select every(v) from test_agg group by k;
```
Plan:
```
== Parsed Logical Plan ==
'Aggregate ['k], [unresolvedalias('every('v), None)]
+- 'UnresolvedRelation `test_agg`
== Analyzed Logical Plan ==
every(v): boolean
Aggregate [k#0], [every(v#1) AS every(v)#5]
+- SubqueryAlias `test_agg`
+- Project [k#0, v#1]
+- SubqueryAlias `test_agg`
+- LocalRelation [k#0, v#1]
== Optimized Logical Plan ==
Aggregate [k#0], [min(v#1) AS every(v)#5]
+- LocalRelation [k#0, v#1]
== Physical Plan ==
*(2) HashAggregate(keys=[k#0], functions=[min(v#1)], output=[every(v)#5])
+- Exchange hashpartitioning(k#0, 200)
+- *(1) HashAggregate(keys=[k#0], functions=[partial_min(v#1)], output=[k#0, min#7])
+- LocalTableScan [k#0, v#1]
Time taken: 0.512 seconds, Fetched 1 row(s)
```
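As a quick sanity check of the rewrite semantics, here is a hedged spark-shell style sketch (the data and column names are made up) showing that `every`/`some`/`any` over a boolean column agree with `min`/`max`, which is what the optimizer rewrites them to, assuming a Spark version that ships these aggregates:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, min}

val spark = SparkSession.builder().master("local[*]").appName("every-some-any").getOrCreate()
import spark.implicits._

// Hypothetical data: booleans grouped by k. Group 1 mixes true/false, group 2 is all true.
val testAgg = Seq((1, true), (1, false), (2, true), (2, true)).toDF("k", "v")
testAgg.createOrReplaceTempView("test_agg")

// every(v) is true only when all values in the group are true; some(v)/any(v) when at least one is.
spark.sql("SELECT k, every(v), some(v), any(v) FROM test_agg GROUP BY k").show()

// After the rewrite these are just min/max over the boolean column (false < true).
testAgg.groupBy($"k").agg(min($"v").as("every_v"), max($"v").as("some_v")).show()
```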
## How was this patch tested?
Added tests in SQLQueryTestSuite and DataFrameAggregateSuite.
Closes apache#22809 from dilipbiswal/SPARK-19851-specific-rewrite.
Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This adds tests that verify the expected metadata I/O cost of executing queries with metastore partition pruning enabled.
It also fixes a configuration leak in a suite that was causing other suites to fail by flipping `CONVERT_METASTORE_PARQUET`'s value.