
Conversation

@sunchao
Member

@sunchao sunchao commented Jul 15, 2021

What changes were proposed in this pull request?

Move both `PruneFileSourcePartitionsSuite` and `PrunePartitionSuiteBase` to the package `org.apache.spark.sql.execution.datasources`. Did some refactoring to enable this.

Why are the changes needed?

Currently both `PruneFileSourcePartitionsSuite` and `PrunePartitionSuiteBase` are in the package `org.apache.spark.sql.hive.execution`, which doesn't look correct as these tests are not specific to Hive. Therefore, it's better to move them into `org.apache.spark.sql.execution.datasources`, the same package where the rule `PruneFileSourcePartitions` lives.

Does this PR introduce any user-facing change?

No, it's just test refactoring.

How was this patch tested?

Using existing tests:

```
build/sbt "sql/testOnly *PruneFileSourcePartitionsSuite"
```

and

```
build/sbt "hive/testOnly *PruneHiveTablePartitionsSuite"
```

@github-actions github-actions bot added the SQL label Jul 15, 2021
@SparkQA

SparkQA commented Jul 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45554/

@SparkQA

SparkQA commented Jul 15, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45554/

@SparkQA

SparkQA commented Jul 15, 2021

Test build #141039 has finished for PR 33350 at commit f782ce7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class PruneFileSourcePartitionsSuite extends PrunePartitionSuiteBase with SharedSparkSession
  • abstract class PrunePartitionSuiteBase extends StatisticsCollectionTestBase
  • class PruneHiveTablePartitionsSuite extends PrunePartitionSuiteBase with TestHiveSingleton

Member

Since this claims to be a simple class move, shall we preserve `i` instead of introducing a new column name, `id`?

Member Author

Thanks. I'm not sure whether it's worth doing, because we changed how the test table is created by using the DataFrame API `spark.range(10).selectExpr("id", "id % 3 as p").write.partitionBy("p").saveAsTable("test")`, which creates an `id` column by default. The `id` here is also consistent with the rest of the tests in this file, as well as other tests that use the same API to create tables.
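For illustration, here is what that table creation yields in a spark-shell session (output abbreviated; `test` is the table name used in the suite):

```scala
scala> spark.range(10).selectExpr("id", "id % 3 as p").write.partitionBy("p").saveAsTable("test")

scala> spark.table("test").printSchema()
root
 |-- id: long (nullable = true)
 |-- p: long (nullable = true)
```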

Member

I meant we should revert it from `id` back to `i` together~

Member

Anyway, if the title and scope become broader, I'm okay with `id`, too.

Member Author

Thanks! I'll keep it as is then :-)

Member

@dongjoon-hyun dongjoon-hyun left a comment

It would be great if we changed the title. "Moving" is misleading for this PR's content; this PR is more of a generalization or refactoring.

@dongjoon-hyun
Member

BTW, I agree with the idea.

@sunchao
Member Author

sunchao commented Jul 15, 2021

Thanks @dongjoon-hyun for reviewing. On the title, do you have any suggestions? This PR does move these files from one package to another, so I'm thinking it at least expresses that part OK. How about something like "Refactor PruneFileSourcePartitionsSuite and move it to the correct package"?

@dongjoon-hyun
Member

Ya, "Refactoring ..." sounds more accurate. However, "the correct package" should not be there; the previous one can also be considered correct.

@sunchao sunchao changed the title [SPARK-36136][SQL][TEST] Move PruneFileSourcePartitionsSuite to org.apache.spark.sql.execution.datasources [SPARK-36136][SQL][TEST] Refactor PruneFileSourcePartitionsSuite etc to a different package Jul 15, 2021
@sunchao
Member Author

sunchao commented Jul 16, 2021

Thanks @dongjoon-hyun . I've changed the title to "Refactor PruneFileSourcePartitionsSuite etc to a different package" - let me know if this works for you :)

@HyukjinKwon HyukjinKwon changed the title [SPARK-36136][SQL][TEST] Refactor PruneFileSourcePartitionsSuite etc to a different package [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package Jul 16, 2021
Member

@dongjoon-hyun dongjoon-hyun Jul 16, 2021

As you know, `saveAsTable` is different from `STORED AS parquet`. The original test coverage seems to be coupled with `convertMetastoreParquet`, but this one looks different. Are we losing the existing test coverage?

```scala
scala> spark.range(10).selectExpr("id", "id % 3 as p").write.partitionBy("p").saveAsTable("t1")

scala> sql("DESCRIBE TABLE EXTENDED t1").show()
...
|            Provider|             parquet|       |
...
scala> sql("CREATE TABLE t2(a int) STORED AS parquet").show()
scala> sql("DESCRIBE TABLE EXTENDED t2").show()
...
|            Provider|                hive|       |
...
```

Member

This specific test coverage should remain in the hive module.

Member Author

Hmm, what is `convertMetastoreParquet`? I couldn't find it anywhere.

Regarding the test, I think it is still covered (I've debugged the test and made sure it still goes through the related code paths in `PruneFileSourcePartitions`). Much has changed since 2016, though: the test (added in #15569) was originally designed to make sure that `LogicalRelation.expectedOutputAttributes` was correctly populated. `expectedOutputAttributes`, however, was later replaced by directly passing `output` to `LogicalRelation` (in #17552), which I think further prevented the issue from happening.

Member

I mean `spark.sql.hive.convertMetastoreParquet`. Here is the document.

Member

FYI, `CREATE TABLE ... USING PARQUET` (Spark syntax) and `CREATE TABLE ... STORED AS PARQUET` (Hive syntax) generate different tables in Apache Spark.

Member

@dongjoon-hyun dongjoon-hyun Jul 16, 2021

For Hive tables generated by the `STORED AS` syntax, Spark converts them to data source tables on the fly because `spark.sql.hive.convertMetastoreParquet` is true by default. It's the same for ORC, where we have `spark.sql.hive.convertMetastoreOrc`.
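For illustration, a minimal spark-shell sketch of that on-the-fly conversion (plan output elided; `t2` follows the earlier example):

```scala
// With the default spark.sql.hive.convertMetastoreParquet=true, a scan over a
// STORED AS parquet Hive table is planned through the data source path.
sql("CREATE TABLE t2(a int) STORED AS parquet")
sql("SELECT * FROM t2").explain()   // plan shows "FileScan parquet"

// Disabling the conversion keeps the scan on the Hive SerDe path.
sql("SET spark.sql.hive.convertMetastoreParquet=false")
sql("SELECT * FROM t2").explain()   // plan shows "Scan hive"
```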

Member Author

Thanks @dongjoon-hyun . I found it now. However, I'm not sure whether this matters for the test: what it does is just 1) register table metadata in the catalog, 2) create a `LogicalRelation` wrapping a `HadoopFsRelation` which has the data and partition schema from step 1, and 3) feed it into the rule `PruneFileSourcePartitions` and see if the `LogicalRelation`'s `expectedOutputAttributes` is properly set. That seems irrelevant to which SerDe the table uses?
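For reference, a rough sketch of those three steps, adapted from the shape of the suite (schemas and names are illustrative, not the exact test code):

```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.execution.datasources.{CatalogFileIndex, HadoopFsRelation, LogicalRelation, PruneFileSourcePartitions}
import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
import org.apache.spark.sql.types.StructType

// 1) Look up the table metadata previously registered in the catalog.
val tableMeta = spark.sharedState.externalCatalog.getTable("default", "test")

// 2) Build a LogicalRelation wrapping a HadoopFsRelation that carries the
//    data and partition schemas from step 1.
val fileIndex = new CatalogFileIndex(spark, tableMeta, 0L)
val dataSchema = StructType(tableMeta.schema.filterNot(f => tableMeta.partitionColumnNames.contains(f.name)))
val relation = HadoopFsRelation(fileIndex, tableMeta.partitionSchema, dataSchema, None, new ParquetFileFormat(), Map.empty)(spark)
val logicalRelation = LogicalRelation(relation, tableMeta)

// 3) Feed a filtered query into the rule and inspect the pruned plan.
val query = logicalRelation.where("p".attr === 1).select("id".attr, "p".attr).analyze
val pruned = PruneFileSourcePartitions(query)
```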

@dongjoon-hyun
Member

Could you resolve the conflicts, @sunchao ?

Also, cc @cloud-fan , @maropu , @viirya .

Comment on lines -47 to -54
Member

Do we know why it used an external table before? Is it related to the test coverage here?

Member Author

I explained this in the other thread, and I don't think this is related to the test coverage here. Let me know if you think otherwise @viirya @cloud-fan .

Member

Yea, I saw the comment. It makes sense, and that's also what I read from the test. Just wondering why it used an external table originally.

Member Author

@sunchao sunchao Jul 19, 2021

I'm not sure either; IMO the EXTERNAL keyword doesn't matter here. I've run the test with and without it, and the outcome is the same.

@cloud-fan
Contributor

Do we still have the test coverage for partition pruning with hive tables?

@sunchao
Member Author

sunchao commented Jul 19, 2021

> Do we still have the test coverage for partition pruning with hive tables?

@cloud-fan you mean `PruneHiveTablePartitionsSuite`? Yes, it is untouched by this PR.

@sunchao sunchao force-pushed the SPARK-36136-partitions-suite branch from f782ce7 to 33c38a4 on July 19, 2021 18:48
@SparkQA

SparkQA commented Jul 19, 2021

Test build #141273 has finished for PR 33350 at commit 33c38a4.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class PruneFileSourcePartitionsSuite extends PrunePartitionSuiteBase with SharedSparkSession
  • abstract class PrunePartitionSuiteBase extends StatisticsCollectionTestBase
  • class PruneHiveTablePartitionsSuite extends PrunePartitionSuiteBase with TestHiveSingleton

@SparkQA

SparkQA commented Jul 19, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45787/

@SparkQA

SparkQA commented Jul 19, 2021

Test build #141276 has finished for PR 33350 at commit 1316644.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45790/

@SparkQA

SparkQA commented Jul 19, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45790/

@SparkQA

SparkQA commented Jul 19, 2021

Test build #141282 has finished for PR 33350 at commit c98975c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45796/

@SparkQA

SparkQA commented Jul 19, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45796/

@sunchao sunchao force-pushed the SPARK-36136-partitions-suite branch from c98975c to 7473aea on July 20, 2021 03:32
@SparkQA

SparkQA commented Jul 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45807/

@SparkQA

SparkQA commented Jul 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45807/

@SparkQA

SparkQA commented Jul 20, 2021

Test build #141293 has finished for PR 33350 at commit 7473aea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

So it's not only moving the package, but also changing some tests to use data source tables instead of Hive tables?

Member Author

Yes, since we're moving `PruneFileSourcePartitionsSuite` out of the hive package, we need to remove the Hive dependency here too.

As commented in the other thread, to me it's OK to switch to a data source table here. I also dug into the history of the change: it seems that at the time this test was added (in #15569), data source tables didn't use HMS to store table metadata by default (that was added later in #15515), but instead used ListingFileCatalog (?). Maybe it was for testing purposes that we created a Hive table here but then constructed a `LogicalRelation` to feed into the `PruneFileSourcePartitions` rule?

Let me know if you see a concern here @cloud-fan , since you are the main author of this test and the related code :)

Contributor

@cloud-fan cloud-fan left a comment

LGTM. Can you do a rebase, since the last test run was quite a long time ago?

@sunchao sunchao force-pushed the SPARK-36136-partitions-suite branch from 7473aea to d6fa7a5 on July 26, 2021 17:09
@sunchao
Member Author

sunchao commented Jul 26, 2021

Thanks @cloud-fan ! Just rebased.

@SparkQA

SparkQA commented Jul 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46167/

@SparkQA

SparkQA commented Jul 26, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46167/

@SparkQA

SparkQA commented Jul 26, 2021

Test build #141651 has finished for PR 33350 at commit d6fa7a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Jul 26, 2021

Thanks. Merging to master/3.2. Although this is not a bug fix, it is test-only, and I think it is better to keep master/3.2 consistent so it is easier to backport changes.

Feel free to revert it from 3.2 if you prefer to have it only in master. Thanks.

viirya pushed a commit that referenced this pull request Jul 26, 2021
… to a different package

Closes #33350 from sunchao/SPARK-36136-partitions-suite.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 634f96d)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
@viirya viirya closed this in 634f96d Jul 26, 2021
@sunchao
Member Author

sunchao commented Jul 26, 2021

Thanks @viirya ! Yes I agree we should backport and keep master & 3.2 consistent.

@sunchao sunchao deleted the SPARK-36136-partitions-suite branch July 26, 2021 20:30
@venkata91
Contributor

@viirya @sunchao It seems like the refactor caused a couple of tests in `PruneFileSourcePartitionsSuite` to fail; the switch from Hive tables to data source tables seems to be causing issues.
I think the test `SPARK-35985 push filters for empty read schema` now returns all the files under the partition with DSV2 and is therefore failing.

`SPARK-36128: spark.sql.hive.metastorePartitionPruning should work for file data sources` - this test checks `HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount`; since the tests moved, this is now 0. Please take a look.
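For context, the metric check in that test looks roughly like this (illustrative, not the exact test body):

```scala
import org.apache.spark.metrics.source.HiveCatalogMetrics

// Count how many partitions are fetched from the external catalog while
// executing a partition-filtered scan; after the move to sql/core the
// suite runs against InMemoryCatalog, which never increments this counter.
HiveCatalogMetrics.reset()
spark.table("test").where("p = 0").collect()
assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount == 1)
```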

@viirya
Member

viirya commented Jul 27, 2021

Hm? This passed Jenkins, and GA also passed. Where do you see the failed tests?

@cloud-fan
Contributor

Seems they are failing in the 3.2 branch.

@cloud-fan
Contributor

Hmm, they are also failing in master: #33498

@sunchao
Member Author

sunchao commented Jul 27, 2021

Sorry. Let me check the failed tests.

@viirya
Member

viirya commented Jul 27, 2021

We can revert it if it will take some time to investigate.

@viirya
Member

viirya commented Jul 27, 2021

Is it flaky, or did other merged PRs change the result?

@viirya
Member

viirya commented Jul 27, 2021

Created #33533 to revert it first. @cloud-fan @sunchao @dongjoon-hyun

@sunchao
Member Author

sunchao commented Jul 27, 2021

We can revert it first. The test failures are related. Not sure why they weren't detected by the CI previously though.

@LuciferYang
Contributor

@sunchao I also hit this problem.

@LuciferYang
Contributor

For `SPARK-35985 push filters for empty read schema`:

The old case writes one result file for each partition, but the new case writes two result files per partition. I guess some test configurations have changed.

@LuciferYang
Contributor

LuciferYang commented Jul 27, 2021

@sunchao
I found that the old case used spark.master `local[1]`, but `SharedSparkSession` creates a `TestSparkSession` with `local[2]` by default, so we should override the `createSparkSession` method in the new `PruneFileSourcePartitionsSuite` file to create a `TestSparkSession` with `local[1]` in order to pass `SPARK-35985 push filters for empty read schema`. See the sketch below.
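A minimal sketch of that override, assuming the moved suite mixes in `SharedSparkSession` (adapted, not the exact patch):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.test.{SharedSparkSession, TestSparkSession}

class PruneFileSourcePartitionsSuite extends PrunePartitionSuiteBase with SharedSparkSession {

  // Pin the suite to a single core so each table partition is written as a
  // single file, matching the old Hive-based suite that ran on local[1].
  override protected def createSparkSession: TestSparkSession = {
    new TestSparkSession(new SparkContext("local[1]", this.getClass.getSimpleName, sparkConf))
  }
}
```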

@LuciferYang
Contributor

It seems that `SPARK-36128: spark.sql.hive.metastorePartitionPruning should work for file data sources` should not be placed in the sql/core module.

@sunchao
Member Author

sunchao commented Jul 27, 2021

Thanks @LuciferYang ! Yes, I found that we could use either `local[1]` or `coalesce(1)` to fix the first test case (see the sketch below). For the second, it relies on `HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED`, which is no longer available since the suite switched to `InMemoryCatalog`. I need to find a new way to write the test.
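For reference, the `coalesce(1)` variant would look roughly like this (table name illustrative):

```scala
// Collapse the data to a single partition before writing, so each table
// partition produces exactly one file regardless of the number of local cores.
spark.range(10).selectExpr("id", "id % 3 as p")
  .coalesce(1)
  .write.partitionBy("p").saveAsTable("test")
```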
