Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DBZ-7050 Add automatic retry for snapshots #163

Merged
merged 8 commits into from
Oct 24, 2023
Merged

Conversation

twthorn
Copy link
Contributor

@twthorn twthorn commented Oct 17, 2023

Add automatic retry for snapshots.

Automatic Retry

As detailed in the table copy RFC the last PK value in the VGTID is used for resuming partially complete snapshots. We utilize this for our automatic retries and parse this additional information now being sent by Vitess in the latest version (i.e., including fixes like this).

One implementation decision I debated between was switching all of our VGTID parsing logic to instead use the provided protobuf json converters. This would simplify our codebase but also mean the Debezium logic is more tightly coupled with Vitesss/protobuf, so for the latter reason I decided against this.

GTID -> VGTID config

Previous configs had the option of vitess.gtid which was originally added when vitess.shard only allowed for a single shard. With the changes to support multiple shards back in #135, we made GTID into a CSV as well. In order to easily test the automatic retry snapshot behavior, it would be useful to also specify the last PK values in the config. Additionally, GTID csv adds unnecessary complexity to support another format of specifying GTID(s). So to resolve both of these issues, I changed this to simply be a vitess.vgtid config which is a JSON parseable string that we can use to initialize our VGTID for a VStream. Since VGTID is an array of keyspace/shard/gtid/lastpks it has far more versatility for the user to be able to arbitrarily specify the GTID(s) for any shard(s) subset and even PKs to resume snapshots on. This is what is used to test the snapshot resume behavior in the integration tests.

@twthorn
Copy link
Contributor Author

twthorn commented Oct 18, 2023

@jpechane Can you review this when you get the chance? Thank you!

@@ -464,28 +467,34 @@ public List<String> getShard() {
return getConfig().getStrings(SHARD, CSV_DELIMITER);
}

public List<String> getGtid() {
public String getGtid() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the method name called getVGtid?


public static final Field GTID = Field.create(VITESS_CONFIG_GROUP_PREFIX + "gtid")
.withDisplayName("gtid")
public static final Field VGTID = Field.create(VITESS_CONFIG_GROUP_PREFIX + "vgtid")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it cause any regression with the config name changes from gtid to vgtid?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will. It is necessary to deprecate the existing setting. So both should be available and when the deprecated one is used then WARN should be written to the log.

Copy link
Contributor

@jpechane jpechane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@twthorn Thanks for the PR. I left few minor comments. One question for you. I supposed that incomplete snapshots would be resumed automaically so I'd expect additional data being writtent to or read from offsets. Is is the case?
Also for testing there should be test when connector is started. Then it is stopped in the middle of snapshot and started again. There should be no lost records or duplicates.

pom.xml Outdated
@@ -245,6 +245,11 @@
<type>test-jar</type>
<scope>test</scope>
</dependency>
<dependency>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please place this dependency version into dependencyManagement and configure it via property?
Also I'd recommend to align the version with the one pulled by grpc client.


import binlogdata.Binlogdata;

public class TablePrimaryKeys {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add JavaDoc to this class?

@@ -37,8 +40,7 @@
*/
public class VitessConnectorConfig extends RelationalDatabaseConnectorConfig {

public static final List<String> EMPTY_GTID_LIST = List.of(Vgtid.EMPTY_GTID);
public static final List<String> DEFAULT_GTID_LIST = List.of(Vgtid.CURRENT_GTID);
public static final String DEFAULT_GTID = Vgtid.CURRENT_GTID;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be moved to Vgtid class?


public static final Field GTID = Field.create(VITESS_CONFIG_GROUP_PREFIX + "gtid")
.withDisplayName("gtid")
public static final Field VGTID = Field.create(VITESS_CONFIG_GROUP_PREFIX + "vgtid")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will. It is necessary to deprecate the existing setting. So both should be available and when the deprecated one is used then WARN should be written to the log.

@twthorn
Copy link
Contributor Author

twthorn commented Oct 19, 2023

I supposed that incomplete snapshots would be resumed automaically so I'd expect additional data being writtent to or read from offsets. Is is the case?

Correct, there will be additional data read/written from offsets. There is now an additional field table_p_ks that will always be read/written to offsets. When there is no table copy phase its value will be an empty list. When there is table copy phase, it includes the last sent primary keys of the table (example). So it varies between marginally more data, and additional nested json array.

Also worth noting, for rolling forward, this works, and I added a test showing this (we can read offsets that lack the field table_p_ks). However, for rolling back, assuming offsets were written with newer code, the user may need to manually overwrite the offset data to remove the table_p_ks field to make it parseable. Jackson errors out on unknown field by default. One option I considered was excluding the table_p_ks field when it's an empty list, but this means the vgtid format will be inconsistent/changing so I opted against it.

Copy link
Contributor

@HenryCaiHaiying HenryCaiHaiying left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm


@Test
public void testSnapshotLargeTable() throws Exception {
TestHelper.executeDDL("vitess_create_tables.ddl");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@twthorn Could you please store all the records captured and assert at the end that all are present and there are neither gaps nor duplicates?

@jpechane
Copy link
Contributor

@twthorn LGTM. I left one comment requesting a hardening of a a test. Otherwise we are good to go.

@twthorn twthorn requested a review from jpechane October 20, 2023 20:21
@twthorn
Copy link
Contributor Author

twthorn commented Oct 23, 2023

@jpechane Thanks for the changes! Please let me know if there's anything else I can do on the PR

@jpechane jpechane merged commit e164d6f into debezium:main Oct 24, 2023
3 checks passed
@jpechane
Copy link
Contributor

@twthorn Applied, thanks a lot

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants