Bug Report: PRS promotes replica that has not caught up to old primary as new primary #14738

Closed
deepthi opened this issue Dec 9, 2023 · 2 comments · Fixed by #14745

deepthi commented Dec 9, 2023

Overview of the Issue

We ran into an issue where promoting the replica to primary via PlannedReparent succeeded. However, the new primary had actually not caught up to the position of the old primary. There were several thousand missing transactions.

Reproduction Steps

This is non-trivial to reproduce: it needs a decent amount of load, or some other condition that makes the replica lag.
One way to make the replica lag is to use it to take a backup. As soon as the backup is complete, while the replica is still lagging, use PRS to promote it.
Necessary pre-condition: the lag should be high enough that the replica cannot catch up within the time allowed (wait-replicas-timeout). Making wait-replicas-timeout small (e.g. 1 second) will probably help to reproduce.

Binary Version

main for now, will check other versions and update.

Operating System and Environment details

any

Log Fragments

2023-12-08 22:45:20.064	
I1208 22:45:20.064349       1 rpc_replication.go:199] WaitForPosition: <redacted>
2023-12-08 22:45:49.097	
I1208 22:45:49.097332       1 rpc_replication.go:867] PromoteReplica

Note the time difference - almost 30 seconds, which is the amount of time allowed by default.


deepthi commented Dec 9, 2023

> main for now, will check other versions and update.

The bug is present on all release branches, but not in any released version. We will be fixing this on all branches.
However, in addition to fixing how we handle the return values from each flavor, we should also add a check in PRS after WaitForPosition to make sure that the replica did in fact reach the desired position.

@GuptaManan100

This part (#14738 (comment)) has not been implemented yet, so I am reopening the issue.
