
Conversation

@sunchao
Member

@sunchao sunchao commented Jul 15, 2021

What changes were proposed in this pull request?

Move both `PruneFileSourcePartitionsSuite` and `PrunePartitionSuiteBase` to the package `org.apache.spark.sql.execution.datasources`. Did some refactoring to enable this.

Why are the changes needed?

Currently both `PruneFileSourcePartitionsSuite` and `PrunePartitionSuiteBase` are in the package `org.apache.spark.sql.hive.execution`, which doesn't look correct as these tests are not specific to Hive. Therefore, it's better to move them into `org.apache.spark.sql.execution.datasources`, the same package where the rule `PruneFileSourcePartitions` lives.

Does this PR introduce any user-facing change?

No, it's just test refactoring.

How was this patch tested?

Using existing tests:

```
build/sbt "sql/testOnly *PruneFileSourcePartitionsSuite"
```

and

```
build/sbt "hive/testOnly *PruneHiveTablePartitionsSuite"
```

@github-actions github-actions bot added the SQL label Jul 15, 2021
@SparkQA

SparkQA commented Jul 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45554/

@SparkQA

SparkQA commented Jul 15, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45554/

@SparkQA

SparkQA commented Jul 15, 2021

Test build #141039 has finished for PR 33350 at commit f782ce7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class PruneFileSourcePartitionsSuite extends PrunePartitionSuiteBase with SharedSparkSession
  • abstract class PrunePartitionSuiteBase extends StatisticsCollectionTestBase
  • class PruneHiveTablePartitionsSuite extends PrunePartitionSuiteBase with TestHiveSingleton

Member

Since this claims to be a simple class move, shall we preserve `i` instead of introducing a new column name, `id`?

Member Author

Thanks. I'm not sure whether it's worth doing, because we changed how the test table is created by using the DataFrame API `spark.range(10).selectExpr("id", "id % 3 as p").write.partitionBy("p").saveAsTable("test")`, which creates an `id` column by default. The `id` here is also consistent with the rest of the tests in this file, as well as other tests that use the same API to create tables.
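For illustration, here is what that table creation yields in a spark-shell session (output abbreviated; `test` is the table name used in the suite):

```scala
scala> spark.range(10).selectExpr("id", "id % 3 as p").write.partitionBy("p").saveAsTable("test")

scala> spark.table("test").printSchema()
root
 |-- id: long (nullable = true)
 |-- p: long (nullable = true)
```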

Member

I meant we should revert it from `id` back to `i` together~

Member

Anyway, if the title and scope become broader, I'm okay with `id`, too.

Member Author

Thanks! I'll keep it as is then :-)

Member

@dongjoon-hyun dongjoon-hyun left a comment

It would be great if we changed the title. "Moving" is misleading for this PR's content; this PR is more of a generalization or refactoring.

@dongjoon-hyun
Member

BTW, I agree with the idea.

@sunchao
Member Author

sunchao commented Jul 15, 2021

Thanks @dongjoon-hyun for reviewing. On the title, do you have any suggestions? This PR does move these files from one package to another, so I'm thinking it at least expresses that part OK. How about something like "Refactor PruneFileSourcePartitionsSuite and move it to the correct package"?

@dongjoon-hyun
Member

Ya, "Refactoring ..." sounds more accurate. However, "the correct package" should not be there; the previous one can also be considered correct.

@sunchao sunchao changed the title [SPARK-36136][SQL][TEST] Move PruneFileSourcePartitionsSuite to org.apache.spark.sql.execution.datasources [SPARK-36136][SQL][TEST] Refactor PruneFileSourcePartitionsSuite etc to a different package Jul 15, 2021
@sunchao
Member Author

sunchao commented Jul 16, 2021

Thanks @dongjoon-hyun . I've changed the title to "Refactor PruneFileSourcePartitionsSuite etc to a different package" - let me know if this works for you :)

@HyukjinKwon HyukjinKwon changed the title [SPARK-36136][SQL][TEST] Refactor PruneFileSourcePartitionsSuite etc to a different package [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package Jul 16, 2021
Member

@dongjoon-hyun dongjoon-hyun Jul 16, 2021

As you know, `saveAsTable` is different from `STORED AS parquet`. The original test coverage seems to be coupled with `convertMetastoreParquet`, but this one looks different. Are we losing the existing test coverage?

```scala
scala> spark.range(10).selectExpr("id", "id % 3 as p").write.partitionBy("p").saveAsTable("t1")

scala> sql("DESCRIBE TABLE EXTENDED t1").show()
...
|            Provider|             parquet|       |
...
scala> sql("CREATE TABLE t2(a int) STORED AS parquet").show()
scala> sql("DESCRIBE TABLE EXTENDED t2").show()
...
|            Provider|                hive|       |
...
```

Member

This specific test coverage should remain in the hive module.

Member Author

Hmm, what is `convertMetastoreParquet`? I couldn't find it anywhere.

Regarding the test, I think it is still covered (I've debugged the test and made sure it still goes through the related code paths in `PruneFileSourcePartitions`). Much has changed since 2016, though: the test (added in #15569) was originally designed to make sure that `LogicalRelation.expectedOutputAttributes` was correctly populated. `expectedOutputAttributes`, however, was later replaced by directly passing `output` to `LogicalRelation` (in #17552), which I think further prevented the issue from happening.

Member

I mean `spark.sql.hive.convertMetastoreParquet`. Here is the document.

Member

FYI, `CREATE TABLE ... USING PARQUET` (Spark syntax) and `CREATE TABLE ... STORED AS PARQUET` (Hive syntax) generate different tables in Apache Spark.

Member

@dongjoon-hyun dongjoon-hyun Jul 16, 2021

For Hive tables generated by the `STORED AS` syntax, Spark converts them to data source tables on the fly because `spark.sql.hive.convertMetastoreParquet` is true by default. It's the same for ORC, where we have `spark.sql.hive.convertMetastoreOrc`.
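For illustration, a minimal spark-shell sketch of that on-the-fly conversion (plan output elided; `t2` follows the earlier example):

```scala
// With the default spark.sql.hive.convertMetastoreParquet=true, a scan over a
// STORED AS parquet Hive table is planned through the data source path.
sql("CREATE TABLE t2(a int) STORED AS parquet")
sql("SELECT * FROM t2").explain()   // plan shows "FileScan parquet"

// Disabling the conversion keeps the scan on the Hive SerDe path.
sql("SET spark.sql.hive.convertMetastoreParquet=false")
sql("SELECT * FROM t2").explain()   // plan shows "Scan hive"
```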

Member Author

Thanks @dongjoon-hyun . I found it now. However, I'm not sure whether this matters for the test: what it does is just 1) register table metadata in the catalog, 2) create a `LogicalRelation` wrapping a `HadoopFsRelation` which has the data and partition schema from step 1, and 3) feed it into the rule `PruneFileSourcePartitions` and see if the `LogicalRelation`'s `expectedOutputAttributes` is properly set. That seems irrelevant to which SerDe the table uses?
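For reference, a rough sketch of those three steps, adapted from the shape of the suite (schemas and names are illustrative, not the exact test code):

```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.execution.datasources.{CatalogFileIndex, HadoopFsRelation, LogicalRelation, PruneFileSourcePartitions}
import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
import org.apache.spark.sql.types.StructType

// 1) Look up the table metadata previously registered in the catalog.
val tableMeta = spark.sharedState.externalCatalog.getTable("default", "test")

// 2) Build a LogicalRelation wrapping a HadoopFsRelation that carries the
//    data and partition schemas from step 1.
val fileIndex = new CatalogFileIndex(spark, tableMeta, 0L)
val dataSchema = StructType(tableMeta.schema.filterNot(f => tableMeta.partitionColumnNames.contains(f.name)))
val relation = HadoopFsRelation(fileIndex, tableMeta.partitionSchema, dataSchema, None, new ParquetFileFormat(), Map.empty)(spark)
val logicalRelation = LogicalRelation(relation, tableMeta)

// 3) Feed a filtered query into the rule and inspect the pruned plan.
val query = logicalRelation.where("p".attr === 1).select("id".attr, "p".attr).analyze
val pruned = PruneFileSourcePartitions(query)
```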

@dongjoon-hyun
Member

Could you resolve the conflicts, @sunchao ?

Also, cc @cloud-fan , @maropu , @viirya .

Comment on lines -47 to -54
Member

Do we know why it used an external table before? Is it related to the test coverage here?

Member Author

I explained this in the other thread, and I don't think this is related to the test coverage here. Let me know if you think otherwise @viirya @cloud-fan .

Member

Yea, I saw the comment. It makes sense, and that's also what I read from the test. Just wondering why it used an external table originally.

Member Author

@sunchao sunchao Jul 19, 2021

I'm not sure either; IMO the EXTERNAL keyword doesn't matter here. I've run the test with and without it, and the outcome is the same.

@cloud-fan
Contributor

Do we still have the test coverage for partition pruning with hive tables?

@sunchao
Member Author

sunchao commented Jul 19, 2021

> Do we still have the test coverage for partition pruning with hive tables?

@cloud-fan you mean `PruneHiveTablePartitionsSuite`? Yes, it is untouched by this PR.

@sunchao sunchao force-pushed the SPARK-36136-partitions-suite branch from f782ce7 to 33c38a4 on July 19, 2021 18:48
@SparkQA

SparkQA commented Jul 19, 2021

Test build #141273 has finished for PR 33350 at commit 33c38a4.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class PruneFileSourcePartitionsSuite extends PrunePartitionSuiteBase with SharedSparkSession
  • abstract class PrunePartitionSuiteBase extends StatisticsCollectionTestBase
  • class PruneHiveTablePartitionsSuite extends PrunePartitionSuiteBase with TestHiveSingleton

@SparkQA

SparkQA commented Jul 19, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45787/

@SparkQA

SparkQA commented Jul 19, 2021

Test build #141276 has finished for PR 33350 at commit 1316644.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45790/

@SparkQA

SparkQA commented Jul 19, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45790/

@SparkQA

SparkQA commented Jul 19, 2021

Test build #141282 has finished for PR 33350 at commit c98975c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45796/

@SparkQA

SparkQA commented Jul 19, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45796/

@sunchao sunchao force-pushed the SPARK-36136-partitions-suite branch from c98975c to 7473aea on July 20, 2021 03:32
@SparkQA

SparkQA commented Jul 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45807/

@SparkQA

SparkQA commented Jul 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45807/

@SparkQA

SparkQA commented Jul 20, 2021

Test build #141293 has finished for PR 33350 at commit 7473aea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

So it's not only moving the package, but also changing some tests to use data source tables instead of Hive tables?

Member Author

Yes, since we're moving `PruneFileSourcePartitionsSuite` out of the hive package, we need to remove the Hive dependency here too.

As commented in the other thread, to me it's OK to switch to a data source table here. I also dug into the history of the change: it seems that at the time this test was added (in #15569), data source tables didn't use HMS to store table metadata by default (that was added later in #15515), but instead used ListingFileCatalog (?). Maybe it was for testing purposes that we created a Hive table here but then constructed a `LogicalRelation` to feed into the `PruneFileSourcePartitions` rule?

Let me know if you see a concern here @cloud-fan , since you are the main author of this test and the related code :)

Contributor

@cloud-fan cloud-fan left a comment

LGTM. Can you do a rebase, since the last test run was quite a long time ago?

@sunchao sunchao force-pushed the SPARK-36136-partitions-suite branch from 7473aea to d6fa7a5 on July 26, 2021 17:09
@sunchao
Member Author

sunchao commented Jul 26, 2021

Thanks @cloud-fan ! Just rebased.

@SparkQA

SparkQA commented Jul 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46167/

@SparkQA

SparkQA commented Jul 26, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46167/

@SparkQA

SparkQA commented Jul 26, 2021

Test build #141651 has finished for PR 33350 at commit d6fa7a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Jul 26, 2021

Thanks. Merging to master/3.2. Although this is not a bug fix, it is test-only, and I think it is better to keep master/3.2 consistent so it is easier to backport changes.

Feel free to revert it from 3.2 if you prefer to have it only in master. Thanks.

viirya pushed a commit that referenced this pull request Jul 26, 2021
… to a different package

Closes #33350 from sunchao/SPARK-36136-partitions-suite.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 634f96d)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
@viirya viirya closed this in 634f96d Jul 26, 2021
@sunchao
Member Author

sunchao commented Jul 26, 2021

Thanks @viirya ! Yes I agree we should backport and keep master & 3.2 consistent.

@sunchao sunchao deleted the SPARK-36136-partitions-suite branch July 26, 2021 20:30
@venkata91
Contributor

@viirya @sunchao It seems like the refactor caused a couple of tests in `PruneFileSourcePartitionsSuite` to fail; the switch from Hive tables to data source tables seems to be causing issues.
I think the test `SPARK-35985 push filters for empty read schema` now returns all the files under the partition with DSV2 and is therefore failing.

`SPARK-36128: spark.sql.hive.metastorePartitionPruning should work for file data sources` - this test checks `HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount`; since the tests moved, this is now 0. Please take a look.
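For context, the metric check in that test looks roughly like this (illustrative, not the exact test body):

```scala
import org.apache.spark.metrics.source.HiveCatalogMetrics

// Count how many partitions are fetched from the external catalog while
// executing a partition-filtered scan; after the move to sql/core the
// suite runs against InMemoryCatalog, which never increments this counter.
HiveCatalogMetrics.reset()
spark.table("test").where("p = 0").collect()
assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount == 1)
```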

@viirya
Member

viirya commented Jul 27, 2021

Hm? This passed Jenkins, and GA also passed. Where do you see the failed tests?

@cloud-fan
Contributor

Seems they are failing in the 3.2 branch.

@cloud-fan
Contributor

Hmm, they are also failing in master: #33498

@sunchao
Member Author

sunchao commented Jul 27, 2021

Sorry. Let me check the failed tests.

@viirya
Member

viirya commented Jul 27, 2021

We can revert it if it will take some time to investigate.

@viirya
Member

viirya commented Jul 27, 2021

Is it flaky, or did other merged PRs change the result?

@viirya
Member

viirya commented Jul 27, 2021

Created #33533 to revert it first. @cloud-fan @sunchao @dongjoon-hyun

@sunchao
Member Author

sunchao commented Jul 27, 2021

We can revert it first. The test failures are related. Not sure why they weren't detected by the CI previously though.

@LuciferYang
Contributor

@sunchao I also hit this problem.

@LuciferYang
Contributor

For `SPARK-35985 push filters for empty read schema`:

The old case writes one result file for each partition, but the new case writes two result files per partition. I guess some test configurations have changed.

@LuciferYang
Contributor

LuciferYang commented Jul 27, 2021

@sunchao
I found that the old case used spark.master `local[1]`, but `SharedSparkSession` creates a `TestSparkSession` with `local[2]` by default, so we should override the `createSparkSession` method in the new `PruneFileSourcePartitionsSuite` file to create a `TestSparkSession` with `local[1]` in order to pass `SPARK-35985 push filters for empty read schema`. See the sketch below.
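A minimal sketch of that override, assuming the moved suite mixes in `SharedSparkSession` (adapted, not the exact patch):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.test.{SharedSparkSession, TestSparkSession}

class PruneFileSourcePartitionsSuite extends PrunePartitionSuiteBase with SharedSparkSession {

  // Pin the suite to a single core so each table partition is written as a
  // single file, matching the old Hive-based suite that ran on local[1].
  override protected def createSparkSession: TestSparkSession = {
    new TestSparkSession(new SparkContext("local[1]", this.getClass.getSimpleName, sparkConf))
  }
}
```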

@LuciferYang
Contributor

It seems that `SPARK-36128: spark.sql.hive.metastorePartitionPruning should work for file data sources` should not be placed in the sql/core module.

@sunchao
Member Author

sunchao commented Jul 27, 2021

Thanks @LuciferYang ! Yes, I found that we could use either `local[1]` or `coalesce(1)` to fix the first test case (see the sketch below). For the second, it relies on `HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED`, which is no longer available since the suite switched to `InMemoryCatalog`. I need to find a new way to write the test.
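For reference, the `coalesce(1)` variant would look roughly like this (table name illustrative):

```scala
// Collapse the data to a single partition before writing, so each table
// partition produces exactly one file regardless of the number of local cores.
spark.range(10).selectExpr("id", "id % 3 as p")
  .coalesce(1)
  .write.partitionBy("p").saveAsTable("test")
```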
