Flakiness fix for Upgrade Downgrade Backup Manual test #9957
Description
The upgrade downgrade backup manual test has been flaky on CI for a while. Investigating it with the help of #9944 led to the following discovery:
When the tablets are restarted with a backup, each tablet transitions to the RESTORE type to restore from the local backup. Without waiting for the restore to finish, we call InitShardPrimary, which sees the tablet type as RESTORE and turns off semi-sync on that tablet. Later the restore completes and the tablet transitions to REPLICA, but its semi-sync settings aren't fixed, which leads to the primary stalling indefinitely on writes.
This PR fixes the issue by waiting for the restore phase to finish before calling InitShardPrimary. Earlier we had a 5 second hard wait to ensure that this phase had finished, but on CI it generally takes longer. Locally 5 seconds is sufficient, which made reproducibility a challenge.
I think it is better to have an explicit check on the tablet type by curling the tablet server API endpoint instead of any fixed hard time sleep.
Related Issue(s)
Checklist
Deployment Notes