Flakiness fix for Upgrade Downgrade Backup Manual test #9957

GuptaManan100 · 2022-03-23T11:56:10Z

Description

The upgrade downgrade backups manual test has been flaky on CI for a while. Investigating it with the help of #9944 led to the following discovery:

While restarting the tablets with a backup, each tablet transitions to the RESTORE type to restore from the local backup. Without waiting for the restore to finish, we call InitShardPrimary, which sees the tablet type as RESTORE and turns off semi-sync on that tablet.

Later restoration completes and the tablet transitions to REPLICA, but its semi-sync settings aren't fixed, which leads to the primary stalling indefinitely on writes.

This PR fixes the issue by waiting for the restore phase to finish before calling InitShardPrimary. Earlier we had a hard 5-second wait to ensure that this phase had finished, but on CI it generally takes longer. Locally 5 seconds is sufficient, which made the failure hard to reproduce.

I think it is better to check the tablet type explicitly by curling the tablet server API endpoint than to rely on any fixed sleep.
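The explicit check described above can be sketched as a small polling loop. This is a minimal sketch, not the code from this PR: the function names `fetch_tablet_status` and `wait_for_tablet_type`, the status URL passed by the caller, and the idea of matching the tablet type with `grep` are all assumptions made for illustration.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the status page of a tablet given its URL.
# Kept separate so a different probe can be swapped in.
fetch_tablet_status() {
  curl -s "$1"
}

# Poll the status page until it mentions the wanted tablet type, or give up
# once timeout_secs have elapsed. Returns 0 on success and 1 on timeout.
# Arguments: url, wanted type, timeout in seconds, optional poll interval.
wait_for_tablet_type() {
  local url=$1 wanted=$2 timeout_secs=$3 interval=${4:-1}
  local deadline=$((SECONDS + timeout_secs))
  while true; do
    # grep -q succeeds as soon as the wanted type appears in the page.
    if fetch_tablet_status "$url" | grep -q "$wanted"; then
      return 0
    fi
    if (( SECONDS >= deadline )); then
      echo "timed out waiting for $url to report $wanted" >&2
      return 1
    fi
    sleep "$interval"
  done
}
```

A script could then run something like `wait_for_tablet_type "$status_url" REPLICA 300 || exit 1` before calling InitShardPrimary, where `$status_url` stands for whatever status endpoint the tablet exposes. Unlike a fixed sleep, the loop returns as soon as the restore is done and fails loudly if it never finishes.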

Related Issue(s)

Checklist

Should this PR be backported? No
Tests were added or are not required: Yes, the upgrade downgrade tests for backup
Documentation was added or is not required

Deployment Notes


examples/local/backups/restart_tablets.sh

deepthi

Nice find! lgtm


ajm188

works for me!! very nice

* feat: wait for the replica to finish restoring before calling InitShardPrimary to ensure it sets semi-sync correctly
  Signed-off-by: Manan Gupta <manan@planetscale.com>
* feat: have a common timeout for all the tablets instead of 300 per tablet
  Signed-off-by: Manan Gupta <manan@planetscale.com>
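The common timeout from the second commit can be illustrated with one deadline computed up front and shared by every tablet, rather than a fresh budget per tablet. This is only a sketch under assumptions: `wait_for_all_tablets`, the probe command passed in by the caller, and the `POLL_INTERVAL` variable are hypothetical names, not taken from the script in this PR.

```shell
#!/usr/bin/env bash

# Wait for every listed tablet to pass a readiness probe, sharing a single
# deadline across all of them.
# Arguments: timeout in seconds, probe command, then one or more tablet ids.
# The probe is called as: probe <tablet-id>, and must return 0 when ready.
wait_for_all_tablets() {
  local timeout_secs=$1 probe=$2
  shift 2
  # The deadline is computed once, so time spent on earlier tablets
  # reduces what is left for later ones.
  local deadline=$((SECONDS + timeout_secs))
  local tablet
  for tablet in "$@"; do
    until "$probe" "$tablet"; do
      if (( SECONDS >= deadline )); then
        echo "timed out waiting for tablet $tablet" >&2
        return 1
      fi
      sleep "${POLL_INTERVAL:-1}"
    done
  done
}
```

With a per-tablet timeout the worst case grows with the number of tablets. With a shared deadline the whole wait is bounded by the one value passed in, for example `wait_for_all_tablets 300 my_probe t1 t2 t3`, where `my_probe` and the tablet ids are placeholders.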

feat: wait for the replica to finish restoring before calling InitShardPrimary to ensure it sets semi-sync correctly (828a5e7)
Signed-off-by: Manan Gupta <manan@planetscale.com>

GuptaManan100 added the labels Component: Build/CI, Component: Cluster management, Type: CI/Build, and release notes none on Mar 23, 2022

GuptaManan100 marked this pull request as ready for review March 23, 2022 13:13

GuptaManan100 requested review from rohit-nayak-ps, frouioui and mattlord as code owners March 23, 2022 13:13

GuptaManan100 requested a review from deepthi March 23, 2022 13:17

ajm188 reviewed Mar 23, 2022


Review comment on examples/local/backups/restart_tablets.sh (outdated, resolved)

Review comment on examples/local/backups/restart_tablets.sh (resolved)

deepthi reviewed Mar 23, 2022


GuptaManan100 requested review from ajm188 and deepthi March 24, 2022 05:06

feat: have a common timeout for all the tablets instead of 300 per tablet (d99dfd8)
Signed-off-by: Manan Gupta <manan@planetscale.com>

ajm188 approved these changes Mar 25, 2022


GuptaManan100 merged commit d0fd6b0 into vitessio:main Mar 25, 2022

GuptaManan100 deleted the backup-fix-flakiness branch March 25, 2022 11:04
