Flakiness fix for Upgrade Downgrade Backup Manual test #9957

Merged
GuptaManan100 merged 2 commits into vitessio:main on Mar 25, 2022

Member

@GuptaManan100 GuptaManan100 commented Mar 23, 2022

Description

The upgrade downgrade backups manual test has been flaky on CI for a while. Investigating it with the help of #9944 led to the following discovery:

While restarting the tablets with a backup, each tablet transitions to the RESTORE type to restore from the local backup. Without waiting for the restore to finish, we call InitShardPrimary, which sees the tablet type as RESTORE and turns off semi-sync on that tablet.

Later, the restore completes and the tablet transitions to REPLICA, but its semi-sync settings are never corrected, which leads to the primary stalling indefinitely on writes.

This PR fixes the issue by waiting for the restore phase to finish before calling InitShardPrimary. Previously we had a hard 5-second wait to let this phase finish, but on CI it generally takes longer. Locally, 5 seconds is sufficient, which made the failure hard to reproduce.

I think it is better to explicitly check the tablet type by curling the tablet server API endpoint than to rely on a fixed sleep.
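For readers who want the idea in code, here is a minimal sketch of polling for the tablet type instead of sleeping for a fixed time. It is not the script changed in this PR. The names `wait_for_tablet_type` and `wait_for_all` are hypothetical, and `fetch_type` stands in for whatever call reads the tablet type from the tablet server API, since the endpoint is not given here. The 300 default borrows the figure from the second commit message, with seconds assumed as the unit, and is shared across all tablets rather than applied per tablet.

```python
import time
from typing import Callable, Iterable


def wait_for_tablet_type(
    fetch_type: Callable[[], str],  # hypothetical: returns the tablet's current type
    wanted: str,
    deadline: float,                # absolute time.monotonic() value
    interval: float = 1.0,
) -> None:
    """Poll until fetch_type() returns `wanted`; raise once `deadline` has passed."""
    while True:
        current = fetch_type()
        if current == wanted:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"tablet is still {current!r}, wanted {wanted!r}")
        time.sleep(interval)


def wait_for_all(
    fetchers: Iterable[Callable[[], str]],
    wanted: str = "REPLICA",
    timeout: float = 300.0,         # assumption: seconds, one budget for all tablets
    interval: float = 1.0,
) -> None:
    """Wait for every tablet to reach `wanted`, sharing a single deadline."""
    deadline = time.monotonic() + timeout
    for fetch_type in fetchers:
        wait_for_tablet_type(fetch_type, wanted, deadline, interval)
```

Only after `wait_for_all` returns would the caller go on to InitShardPrimary. Because the deadline is computed once, a slow first tablet reduces the time left for the others instead of each tablet getting its own full timeout.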

Related Issue(s)

Checklist

  • Should this PR be backported? No
  • Tests were added or are not required: Yes, the upgrade downgrade tests for backup
  • Documentation was added or is not required

Deployment Notes

…rdPrimary to ensure it sets semi-sync correctly

Signed-off-by: Manan Gupta <manan@planetscale.com>
Member

@deepthi deepthi left a comment

Nice find! lgtm

…blet

Signed-off-by: Manan Gupta <manan@planetscale.com>
Contributor

@ajm188 ajm188 left a comment

works for me!! very nice

@GuptaManan100 GuptaManan100 merged commit d0fd6b0 into vitessio:main Mar 25, 2022
@GuptaManan100 GuptaManan100 deleted the backup-fix-flakiness branch March 25, 2022 11:04
DAlperin pushed a commit to DAlperin/vitess that referenced this pull request Mar 25, 2022
* feat: wait for the replica to finish restoring before calling InitShardPrimary to ensure it sets semi-sync correctly

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: have a common timeout for all the tablets instead of 300 per tablet

Signed-off-by: Manan Gupta <manan@planetscale.com>