owner: fix a bug which led to replication stopping with no error reported (#1814) #1827
This is an automated cherry-pick of #1814
Bug phenomenon
The resolved TS and checkpoint TS of a changefeed stop advancing, but no error is reported.
How to confirm the bug
We can check the owner log, and find logs like this:
Trigger conditions
A capture goes offline while the owner is moving a table.
Affected versions
[4.0.0, 4.0.13], 5.0.0-rc, [5.0.0, 5.0.1]
Bug mechanism
When the owner moves a table, it tries to remove the table from the source capture and add it to the target capture.
If the target capture goes offline at the same time, the owner should add the table to the orphan tables and dispatch it again in the next tick.
However, the owner forgets to remove the now-invalid move-table job, so it adds the table to the orphan tables again on every tick, which keeps the other logic from working properly.
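The mechanism can be illustrated with a small, self-contained Go sketch. This is not the actual owner code: the `owner`, `moveJob`, `tick`, and `simulate` names and the `dropInvalidJob` switch are hypothetical and exist only to show why a leftover move-table job re-orphans the table on every tick.

```go
package main

import "fmt"

// NOTE: every type and function here is a hypothetical model of the
// behavior described above, not the real TiCDC owner implementation.

// moveJob models a pending "move table" job.
type moveJob struct {
	tableID  int64
	from, to string
}

// owner models only the state relevant to this bug.
type owner struct {
	captures map[string]bool    // capture ID -> online?
	moveJobs map[int64]*moveJob // pending move-table jobs, keyed by table ID
	orphaned map[int64]int      // table ID -> how many times it was orphaned
}

// tick processes pending move jobs once. When dropInvalidJob is false it
// models the bug: a job whose target capture is offline is never removed.
func (o *owner) tick(dropInvalidJob bool) {
	for id, job := range o.moveJobs {
		if !o.captures[job.to] {
			// Target capture is offline: hand the table back for re-dispatch.
			o.orphaned[id]++
			if dropInvalidJob {
				// The fix: the job can never succeed, so drop it.
				delete(o.moveJobs, id)
			}
			continue
		}
		// Target capture is online: the move completes and the job is done.
		delete(o.moveJobs, id)
	}
}

// simulate runs the given number of ticks with table 1 being moved to an
// offline capture, and returns how often table 1 was orphaned.
func simulate(ticks int, dropInvalidJob bool) int {
	o := &owner{
		captures: map[string]bool{"capture-a": true, "capture-b": false},
		moveJobs: map[int64]*moveJob{1: {tableID: 1, from: "capture-a", to: "capture-b"}},
		orphaned: map[int64]int{},
	}
	for i := 0; i < ticks; i++ {
		o.tick(dropInvalidJob)
	}
	return o.orphaned[1]
}

func main() {
	fmt.Println("orphaned without fix:", simulate(3, false)) // prints 3
	fmt.Println("orphaned with fix:", simulate(3, true))     // prints 1
}
```

In this model the table is orphaned once per tick until the stale job is dropped; with the job removed, it is orphaned exactly once and can then be dispatched normally.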
Check List
Tests
Release note