EnforceSemiSyncReplicas & RecoverLockedSemiSyncMaster - actively enable/disable semi-sync replicas to match master's wait count #1373
Conversation
EnableSemiSync also manages the master flag. Do we really want that? Should we not have an EnableSemiSyncReplica?
Right. I suggest moving away from EnableSemiSync(). Recall that this function was originally authored by the Vitess authors to manage post-failover semi-sync behavior, specifically for Vitess. What I'm thinking is that that implementation is not in line with orchestrator's design; your proposed solution is. Let's just not use EnableSemiSync at all.
For your convenience, the functions SetSemiSyncMaster and SetSemiSyncReplica already exist and are used by the API/CLI.
Should there be two modes: EnforceSemiSyncReplicas: exact|enough (exact would handle MasterWithTooManySemiSyncReplicas and LockedSemiSyncMaster, and enough would only handle LockedSemiSyncMaster)?
Yes. Many production deployments will have 1 as rpl_semi_sync_master_wait_for_slave_count (BTW MariaDB does not support this variable and it is implicitly always 1) and multiple semi-sync replicas. I managed such a deployment and it worked well for us.
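The difference between the two proposed modes can be summarized with a small sketch (all names below are illustrative, not orchestrator's actual code): "exact" treats both a deficit and a surplus of semi-sync replicas relative to the master's wait count as a problem, while "enough" only reacts to a deficit, which is what would block commits on the master.

```go
package main

import "fmt"

// Hypothetical analysis labels, mirroring the names used in this thread.
type analysis string

const (
	noProblem                         analysis = "NoProblem"
	lockedSemiSyncMaster              analysis = "LockedSemiSyncMaster"
	masterWithTooManySemiSyncReplicas analysis = "MasterWithTooManySemiSyncReplicas"
)

// classify compares the master's semi-sync wait count against the number
// of currently connected semi-sync replicas. exactMode selects between
// the proposed "exact" and "enough" behaviors.
func classify(waitCount, semiSyncReplicas int, exactMode bool) analysis {
	if semiSyncReplicas < waitCount {
		// Too few semi-sync replicas: commits on the master would block.
		// Both modes handle this case.
		return lockedSemiSyncMaster
	}
	if exactMode && semiSyncReplicas > waitCount {
		// Only "exact" mode cares about a surplus of semi-sync replicas.
		return masterWithTooManySemiSyncReplicas
	}
	return noProblem
}

func main() {
	fmt.Println(classify(1, 0, false)) // LockedSemiSyncMaster
	fmt.Println(classify(1, 2, false)) // NoProblem ("enough" tolerates surplus)
	fmt.Println(classify(1, 2, true))  // MasterWithTooManySemiSyncReplicas
}
```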
LockedSemiSyncMasterHypothesis waits ReasonableReplicationLagSeconds. I'd like there to be another variable to control the wait time. This seems like it's overloaded.
That makes sense. Let's fall back to ReasonableReplicationLagSeconds if the new variable is not configured. It must be non-zero in reality, so the value 0 can indicate that the value is "not configured".
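A minimal sketch of that fallback, assuming a hypothetical config struct holding the two settings named in this discussion (the struct and method are illustrative, not orchestrator's actual types):

```go
package main

import "fmt"

// Config holds the two timeout settings discussed above. A zero value for
// ReasonableLockedSemiSyncMasterSeconds means "not configured".
type Config struct {
	ReasonableLockedSemiSyncMasterSeconds uint
	ReasonableReplicationLagSeconds       uint
}

// lockedSemiSyncMasterWaitSeconds returns the dedicated timeout if set,
// otherwise falls back to ReasonableReplicationLagSeconds.
func (c *Config) lockedSemiSyncMasterWaitSeconds() uint {
	if c.ReasonableLockedSemiSyncMasterSeconds > 0 {
		return c.ReasonableLockedSemiSyncMasterSeconds
	}
	return c.ReasonableReplicationLagSeconds
}

func main() {
	c := &Config{ReasonableReplicationLagSeconds: 10}
	fmt.Println(c.lockedSemiSyncMasterWaitSeconds()) // 10: fallback applies

	c.ReasonableLockedSemiSyncMasterSeconds = 30
	fmt.Println(c.lockedSemiSyncMasterWaitSeconds()) // 30: explicit value wins
}
```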
…-semi-sync-replica-count
promotion_rule; only handle "exact" case for now
You can choose to ignore the comment below if you want to. I think I've convinced myself that this is a good idea and I'll implement it. I can always revert if it doesn't pan out.
Aside from some battles I'm fighting in the code, there's one more open design question. In the current iteration of the PR, I made a config option that basically requires an opt-in to handle this. Alternatively we could make a config option which would behave like this: …
What do you think?
Sounds good, either choice. If you pick using …
…rdinatesSeconds, RecoverLockedSemiSyncMaster and EnforceExactSemiSyncReplicas
I think I am making progress: I have now implemented both the "exact" mode and … I've also implemented …
Questions:
Here are some logs of what it currently looks like: MasterWithTooManySemiSyncReplicas
LockedSemiSyncMaster
Both "exact" and "enough" modes look the same in my setup currently until I get another replica going (I think).
You're right, this will be hard to detect. Expanding the giant query (BTW it's literally commonly referred to as "orchestrator's giant query") to include the extra information is impractical at this time. Let's not deal with this right now. Detection/recovery will only take place if the number of semi-sync replicas is different/smaller than expected, but not when the number is identical, however the distribution may differ. Perhaps, in the future, we may hack around this by tricking …
I'll review further later today, and please be advised that I'll then be out till mid next week.
I will hold off on this for now.
No rush.
I named it … Today I worked on fixing the … I left more …
rename config option to ReasonableLockedSemiSyncMasterSeconds, split out MaybeDisableSemiSyncMaster
Friendly reminder. 😁
Yup! I hope to conclude tomorrow.
Looks good! One cosmetic requested change, see inline
Do you mean: immediately following a recovery there's another recovery running to fix the exact same problem, even though it's already fixed, and this repeats 4-5 times? If this is the case, then something is missing: we have probably not marked the recovery as successful. Alternatively, anti-flapping doesn't work for this scenario. Also, is there a detection that still claims a semi-sync issue? I can understand if it takes 5-10 seconds for …
Co-authored-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
It's detecting the same situation again a few times after it is resolved. The recovery registration is guarding against repeated execution, so it's fine. It's just ugly:
You can see the full output here: https://phil.nopaste.net/BfTtK0RggZ?a=ywx2nLkl8C -- Here's a longer excerpt:
Note how executeCheckAndRecoverFunction is executed multiple times and (luckily) fails with …
I will investigate this a little more today and tomorrow. If my understanding is correct, the analysis is completely independent of the recovery, so during the recovery we will trigger and attempt recovery multiple times anyway. So this may in fact be an unavoidable situation, since it takes a while to restart the replica threads and all that.
I've added two commits to remedy the situation. It is working nicely now. We're now re-reading the master after taking action until the desired state is reached (based on the count only). Looks like this in the logs:
Note the section … Please find the commits for this here:
I think I'm happy with this. Let me know if there is anything else you'd like changed.
It works fine without the two commits, so I'm fine with removing them. Let me know if you'd like to revert them. The reason why the detection re-triggers is that when we enable/disable semi-sync on the replicas, the Rpl_semi_sync_master_clients status variable on the master is only updated after the clients properly reconnect. So the whole thing typically takes 1-2 seconds before getting into a proper state. If your poll interval is 5 seconds or even 20 seconds, then you'll re-trigger the detection for that long. In my tests, I never had to wait longer than 1-2 seconds in the detection loop.
Gotcha, thanks. One last thing: this has to be documented?
Does that mean you'd like me to remove the two commits?
You mean in the Wiki? I can do that if you like.
Yes please.
I'm thinking under https://github.com/openark/orchestrator/blob/master/docs/configuration-discovery-classifying.md and under https://github.com/openark/orchestrator/blob/master/docs/configuration-recovery.md. WDYT?
Sounds good. I'll do it tomorrow.
I added docs. Let me know what you think. They are best viewed on Github.
Two side notes:
Awesome job on the documentation, much appreciated. Yes, I'll create one more release with these changes! Won't leave them stranded. As I mentioned in my post, I may be back after some break. I do love and enjoy this product, I'm just unable to keep up.
Thank you!
👍
This is a WIP PR that attempts to address #1360.
There are tons of open questions and things missing, but this is the idea.
Open questions (all answered in #1373 (review)):

- EnableSemiSync also manages the master flag. Do we really want that? Should we not have an EnableSemiSyncReplica?
- Should there be two modes: EnforceSemiSyncReplicas: exact|enough (exact would handle MasterWithTooManySemiSyncReplicas and LockedSemiSyncMaster, and enough would only handle LockedSemiSyncMaster)?
- LockedSemiSyncMasterHypothesis waits ReasonableReplicationLagSeconds. I'd like there to be another variable to control the wait time. This seems like it's overloaded.

TODO:

- MasterWithIncorrectSemiSyncReplicas, see PoC: WIP: MasterWithIncorrectSemiSyncReplicas binwiederhier/orchestrator#1
- MasterWithTooManySemiSyncReplicas does not behave correctly
- MaybeEnableSemiSyncReplica does not manage the master flag though it previously did (in the new logic only)
- excludeNotReplicatingReplicas should be a specific instance, not all non-replicating instances!
- RecoverLockedSemiSyncMaster without exact mode
- checkAndRecover* functions BEFORE enabling/disabling replicas
- ReasonableLockedSemiSyncSeconds with fallback to ReasonableReplicationLagSeconds
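Pulling the options discussed in this thread together, a configuration enabling the new behavior might look roughly like the fragment below (option names as used in this PR's commits and comments; the values are illustrative, not recommendations):

```json
{
  "EnforceExactSemiSyncReplicas": true,
  "RecoverLockedSemiSyncMaster": true,
  "ReasonableLockedSemiSyncMasterSeconds": 10,
  "ReasonableReplicationLagSeconds": 10
}
```

With ReasonableLockedSemiSyncMasterSeconds left at 0 (or omitted), the value of ReasonableReplicationLagSeconds would be used instead, per the fallback agreed on above.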