Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flink: refactor sink tests to reduce the number of combinations with parameterized tests #10777

Merged
merged 3 commits into from
Aug 2, 2024

Conversation

stevenzwu
Copy link
Contributor

@stevenzwu stevenzwu commented Jul 25, 2024

E.g. it is not really necessary to test different file format for distribution mode (hash, range). I found this when trying to add new tests for the new range distribution feature. We often run 20+ combinations for every test/method in some classes.

@stevenzwu stevenzwu requested review from nastra and pvary July 25, 2024 00:42
@github-actions github-actions bot added the flink label Jul 25, 2024
@stevenzwu stevenzwu force-pushed the refactor-sink-tests branch 2 times, most recently from 998bee4 to 85168ca Compare July 25, 2024 01:04
@@ -118,26 +118,4 @@ protected String getFullQualifiedTableName(String tableName) {
static String getURI(HiveConf conf) {
return conf.get(HiveConf.ConfVars.METASTOREURIS.varname);
}

static String toWithClause(Map<String, String> props) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is moved to SqlBase

@@ -41,7 +40,7 @@
import org.junit.jupiter.api.extension.RegisterExtension;
import org.junit.jupiter.api.io.TempDir;

public abstract class TestBase extends TestBaseUtils {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestBaseUtils seems like a pure util class with static methods only

@@ -169,39 +154,6 @@ public void testOverwriteTable() throws Exception {
icebergTable, Lists.newArrayList(SimpleDataUtil.createRecord(2, "b")));
}

@TestTemplate
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved these tests to the new TestFlinkTableSinkExtended which only tests on the dimensions of isStreamingJob boolean value. Currently every test run with 20+ combinations of parameters. Don't want to add range distribution test here to further slow down the test run time.

}

@TestTemplate
public void testJobNoneDistributeMode() throws Exception {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved all tests related to distribution mode to TestFlinkIcebergSinkDistributionMode. we don't need to test different file formats (Avro, Orc, Parquet) for distribution mode.

@pvary
Copy link
Contributor

pvary commented Jul 25, 2024

@rodmeneses: Could you please take a look. I think it will greatly affect your testing for Sink V2

import org.junit.jupiter.api.io.TempDir;

/**
* This tests the more extended features of Flink sink. Extract them separately since it is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: These instead of This

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually correct. but I see why it is confusing. let me change it to This class tests ...

*/
@ExtendWith(ParameterizedTestExtension.class)
public class TestFlinkTableSinkExtended extends SqlBase {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. will remove. but one empty line at the beginning is also ok with Iceberg coding style


@TempDir protected File warehouseRoot;

protected final String flinkDatabase = CATALOG + "." + DATABASE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these ones could be also static. Also make them uppercase when changing to static

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

protected TableEnvironment getTableEnv() {
if (tEnv == null) {
synchronized (this) {
EnvironmentSettings.Builder settingsBuilder = EnvironmentSettings.newInstance();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typically the implementation of the singleton pattern ask for another if (tEnv == null) after synchronized

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably copied it somewhere. we don't actually need double checked locking. I will just make the method synchronized to simplify the code.

Copy link
Contributor

@rodmeneses rodmeneses left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some minor comments, otherwise lgtm

…parameterized tests.

E.g. it is not really necessary to test different file format for distribution mode (hash, range). I found this necessary when trying to add new tests for the new range distribution feature. We often run 20+ combinations for every test/method in some classes.
@stevenzwu stevenzwu force-pushed the refactor-sink-tests branch from bf7aaa9 to b7f39a2 Compare July 30, 2024 21:10
…ergSinkExtended without combinational parameters
* @param catalogName The catalog to drop
* @param ifExists If we should use the 'IF EXISTS' when dropping the catalog
*/
protected void dropCatalog(String catalogName, boolean ifExists) {
Copy link
Contributor

@pvary pvary Jul 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do a throughout cleanup instead of dropping random catalogs?

Like:

  • Dropping every catalog other than the dafault
  • Dropping every database other than the default
  • Dropping every table other than the default?

Copy link
Contributor Author

@stevenzwu stevenzwu Jul 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right now, inherited classes clean up created tables and databases explicitly, e.g. TableSourceTestBase. We can implement what you suggested. but it will be a couple more SQL executions (1) list databases (2) for each database (usually one), list tables.

  @AfterEach
  public void clean() {
    sql("DROP TABLE IF EXISTS %s.%s", DATABASE_NAME, TABLE_NAME);
    dropDatabase(DATABASE_NAME, true);
    dropCatalog(CATALOG_NAME, true);
  }

I am leaning a little more toward the existing approach. Or if we want to change it, we can do it in a separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is good to reduce the complexity of writing tests.
If we create a good cleanup method we can use everywhere then we just call that in an AfterEach method and forget about it altogether. Otherwise we need to remember which database/table was created during the tests. Also, if you different test methods creating different tables, then you need to add logic to the cleanup to differentiate those.

I feel more inclined to the generic cleanup method, even if it adds a few more calls to the test runs.

What do you think @rodmeneses?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even if it adds a few more calls to the test runs.

note that this is for each test method. with parameterized tests, some classes have dozens of tests. this can add up quickly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have numbers?
Higher test cases means higher probability of a mistake to forget to delete a db or a table.

Copy link
Contributor Author

@stevenzwu stevenzwu Aug 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we want to redo the catalog clean up pattern across the board, I think it is better to do it in a separate PR if we can get consensus on the new approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense

@stevenzwu stevenzwu merged commit dc7ad71 into apache:main Aug 2, 2024
24 checks passed
@stevenzwu
Copy link
Contributor Author

thanks @rodmeneses and @pvary for the review

@stevenzwu stevenzwu deleted the refactor-sink-tests branch August 19, 2024 22:27
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Aug 19, 2024
…toring.

1.20 porting is not needed as the 1.20 code was based on the hash that already included PR apache#10777.
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants