[SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package #33350
Conversation
Kubernetes integration test starting

Kubernetes integration test status success

Test build #141039 has finished for PR 33350 at commit
Since this claims a simple move of classes, shall we preserve i instead of introducing a new column name, id?
Thanks. I'm not sure whether it's worth doing so because we changed how the test table is created by using the DataFrame API spark.range(10).selectExpr("id", "id % 3 as p").write.partitionBy("p").saveAsTable("test"), which creates an id column by default. The id here is also consistent with the rest of the tests in this file, as well as other tests which use the same API to create tables.
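For reference, a minimal sketch of the two naming choices with the DataFrame API above (table names are illustrative); spark.range produces a column named id by default, so keeping i would need an explicit alias:

```scala
// spark.range(10) yields a column named "id" by default, so the table keeps that name.
spark.range(10).selectExpr("id", "id % 3 as p").write.partitionBy("p").saveAsTable("test")

// Preserving the original column name "i" would require an explicit alias
// (hypothetical table name "test_i").
spark.range(10).selectExpr("id as i", "id % 3 as p").write.partitionBy("p").saveAsTable("test_i")
```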
I meant we should recover it from id to i together~
Anyway, if the title and scope become broader, I'm okay with id, too.
Thanks! I'll keep it as it is then :-)
dongjoon-hyun left a comment
It would be great if we changed the title. "Moving" is misleading for this PR's content; this PR is more of a generalization or refactoring.
BTW, I agree with the idea.

Thanks @dongjoon-hyun for reviewing. On the title, do you have any suggestions? This PR does move these files from one package to another, so I'm thinking it at least expresses that part. How about something like "Refactor PruneFileSourcePartitionsSuite and move it to the correct package"?

Ya,

Thanks @dongjoon-hyun . I've changed the title to "Refactor PruneFileSourcePartitionsSuite etc to a different package" - let me know if this works for you :)
As you know, saveAsTable is different from STORED AS parquet. The original test coverage seems to be coupled with convertMetastoreParquet, but this one looks different. Are we losing the existing test coverage?
scala> spark.range(10).selectExpr("id", "id % 3 as p").write.partitionBy("p").saveAsTable("t1")
scala> sql("DESCRIBE TABLE EXTENDED t1").show()
...
| Provider| parquet| |
...
scala> sql("CREATE TABLE t2(a int) STORED AS parquet").show()
scala> sql("DESCRIBE TABLE EXTENDED t2").show()
...
| Provider| hive| |
...
This specific test coverage should remain in the hive module.
Hmm what is convertMetastoreParquet? I couldn't find it anywhere.
Regarding the test, I think it is still covered (I've debugged the test and made sure it is still going through the related code paths in PruneFileSourcePartitions). Much has changed since 2016 though: the test (added in #15569) was originally designed to make sure that LogicalRelation.expectedOutputAttributes was correctly populated in the class. The expectedOutputAttributes, however, was later replaced by directly passing output in LogicalRelation (in #17552), which I think further prevented the issue from happening.
I mean spark.sql.hive.convertMetastoreParquet. Here is the document.
FYI, CREATE TABLE ... USING PARQUET (spark syntax) and CREATE TABLE ... STORED AS PARQUET (hive syntax) generate different tables in Apache Spark.
For Hive tables generated by STORED AS syntax, Spark converts them to data source tables on the fly because spark.sql.hive.convertMetastoreParquet is true by default. It's the same for ORC. For ORC, we have spark.sql.hive.convertMetastoreOrc.
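A small sketch of the distinction described above (table names are illustrative, and it assumes a Hive-enabled SparkSession): the native USING syntax creates a data source table, while the Hive STORED AS syntax creates a Hive serde table that is converted on read when spark.sql.hive.convertMetastoreParquet is true.

```scala
// Spark-native syntax: data source table (provider = parquet).
spark.sql("CREATE TABLE t_native(a INT) USING PARQUET")

// Hive syntax: Hive serde table (provider = hive). With
// spark.sql.hive.convertMetastoreParquet = true (the default), reads are
// converted on the fly to the data source path; ORC has the analogous
// spark.sql.hive.convertMetastoreOrc flag.
spark.sql("CREATE TABLE t_hive(a INT) STORED AS PARQUET")

// The Provider row shows the difference.
spark.sql("DESCRIBE TABLE EXTENDED t_native").show()
spark.sql("DESCRIBE TABLE EXTENDED t_hive").show()
```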
Thanks @dongjoon-hyun . I found it now. I'm not sure whether this matters for the test, though: what it does is just 1) register table metadata in the catalog, 2) create a LogicalRelation wrapping a HadoopFsRelation which has the data and partition schema from step 1), and 3) feed it into the rule PruneFileSourcePartitions and see if the LogicalRelation's expectedOutputAttributes is properly set. This seems irrelevant to which SerDe it is using?
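For what it's worth, a rough end-to-end way to observe the pruning effect (not the exact unit-test mechanics described above; table name and predicate are illustrative):

```scala
// Create a partitioned data source table, as in the refactored suite.
spark.range(10).selectExpr("id", "id % 3 as p").write.partitionBy("p").saveAsTable("test")

// After optimization, the file source scan should only carry the partitions matching
// the predicate, which is what PruneFileSourcePartitions is responsible for.
val pruned = spark.table("test").where("p = 0")
pruned.explain(true)
```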
Could you resolve the conflicts, @sunchao ? Also, cc @cloud-fan , @maropu , @viirya .
Do we know why it used an external table before? Is it related to the test coverage here?
I explained this in the other thread and I don't think this is related to the test coverage here. Let me know if you think otherwise @viirya @cloud-fan .
Yea, I saw the comment. It makes sense and that's also what I read from the test. Just wondering why it used an external table originally.
I'm not sure either; IMO the EXTERNAL keyword doesn't matter here. I've run the test with and without it and the outcome is the same.
Do we still have the test coverage for partition pruning with hive tables?

@cloud-fan you mean
f782ce7 to 33c38a4
Test build #141273 has finished for PR 33350 at commit

Kubernetes integration test unable to build dist. exiting with code: 1

Test build #141276 has finished for PR 33350 at commit
Kubernetes integration test starting

Kubernetes integration test status success

Test build #141282 has finished for PR 33350 at commit

Kubernetes integration test starting

Kubernetes integration test status success
c98975c to 7473aea
Kubernetes integration test starting

Kubernetes integration test status success

Test build #141293 has finished for PR 33350 at commit
So it's not only moving the package, but also changing some tests to use data source tables instead of Hive tables?
Yes, since we're moving PruneFileSourcePartitionsSuite out of the hive package, we need to remove the Hive dependency here too.
As commented in the other thread, to me it's OK to switch to a data source table here. I also dug into the history of the change, and it seems that at the time this test was added (in #15569), data source tables didn't use HMS to store table metadata by default (that was added later in #15515), but instead used ListingFileCatalog (?). Maybe it was for testing purposes that we created a Hive table here but then constructed a LogicalRelation to feed into the PruneFileSourcePartitions rule?
Let me know if you see concern here @cloud-fan , since you are the main author of this test and the related code :)
cloud-fan left a comment
LGTM. Can you do a rebase since the last test run was quite a long time ago?
7473aea to d6fa7a5
Thanks @cloud-fan ! Just rebased.
Kubernetes integration test starting

Kubernetes integration test status success

Test build #141651 has finished for PR 33350 at commit
Thanks. Merging to master/3.2. Although this is not a bug fix, it is test-only and I think it is better to keep master/3.2 consistent so it is easier to backport changes. Feel free to revert it from 3.2 if you prefer to have it only in master. Thanks.
… to a different package

### What changes were proposed in this pull request?

Move both `PruneFileSourcePartitionsSuite` and `PrunePartitionSuiteBase` to the package `org.apache.spark.sql.execution.datasources`. Did a few refactoring to enable this.

### Why are the changes needed?

Currently both `PruneFileSourcePartitionsSuite` and `PrunePartitionSuiteBase` are in package `org.apache.spark.sql.hive.execution` which doesn't look correct as these tests are not specific to Hive. Therefore, it's better to move them into `org.apache.spark.sql.execution.datasources`, the same place where the rule `PruneFileSourcePartitions` is at.

### Does this PR introduce _any_ user-facing change?

No, it's just test refactoring.

### How was this patch tested?

Using existing tests:

```
build/sbt "sql/testOnly *PruneFileSourcePartitionsSuite"
```

and

```
build/sbt "hive/testOnly *PruneHiveTablePartitionsSuite"
```

Closes #33350 from sunchao/SPARK-36136-partitions-suite.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 634f96d)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
Thanks @viirya ! Yes I agree we should backport and keep master & 3.2 consistent.
@viirya @sunchao It seems like the refactor caused a couple of tests in
Hm? This passed Jenkins and GA also passed. Where do you see the failed tests?

Seems they are failing in the 3.2 branch

hmm, they are also failing in master: #33498

Sorry. Let me check the failed tests.

We can revert it if it takes some time to investigate.

Is it flaky, or did other merged PRs change the result?

Created #33533 to revert it first. @cloud-fan @sunchao @dongjoon-hyun

We can revert it first. The test failures are related. Not sure why they weren't detected by the CI previously though.

@sunchao I also hit this problem

Regarding "The old case writes one result file for each partition, but the new case writes two result files for each partition", I guess some test configurations have changed
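A minimal sketch of how the per-partition file count can change without any data change (table names are illustrative): the number of files written per partition tracks the number of tasks that hold rows for that partition at write time.

```scala
val df = spark.range(10).selectExpr("id", "id % 3 as p")

// One upstream task -> one file per partition value.
df.repartition(1).write.mode("overwrite").partitionBy("p").saveAsTable("t_one_file")

// Two upstream tasks, each holding rows for every value of p -> up to two files
// per partition value.
df.repartition(2).write.mode("overwrite").partitionBy("p").saveAsTable("t_two_files")
```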
@sunchao

It seems that

Thanks @LuciferYang ! Yes, I found that we could either use
What changes were proposed in this pull request?
Move both `PruneFileSourcePartitionsSuite` and `PrunePartitionSuiteBase` to the package `org.apache.spark.sql.execution.datasources`. Did a few refactoring to enable this.

Why are the changes needed?
Currently both `PruneFileSourcePartitionsSuite` and `PrunePartitionSuiteBase` are in package `org.apache.spark.sql.hive.execution` which doesn't look correct as these tests are not specific to Hive. Therefore, it's better to move them into `org.apache.spark.sql.execution.datasources`, the same place where the rule `PruneFileSourcePartitions` is at.

Does this PR introduce any user-facing change?
No, it's just test refactoring.
How was this patch tested?
Using existing tests:

`build/sbt "sql/testOnly *PruneFileSourcePartitionsSuite"`

and

`build/sbt "hive/testOnly *PruneHiveTablePartitionsSuite"`