
EnforceSemiSyncReplicas & RecoverLockedSemiSyncMaster - actively enable/disable semi-sync replicas to match master's wait count #1373

Merged

Conversation

binwiederhier
Contributor

@binwiederhier binwiederhier commented Jun 19, 2021

This is a WIP PR that attempts to address #1360.

There are tons of open questions and things missing, but this is the idea.

Open questions (all answered in #1373 (review)):

  1. EnableSemiSync also manages the master flag. Do we really want that? Should we not have an EnableSemiSyncReplica?

  2. Should there be two modes: EnforceSemiSyncReplicas: exact|enough (exact would handle MasterWithTooManySemiSyncReplicas and LockedSemiSyncMaster, and enough would only handle LockedSemiSyncMaster)?

  3. LockedSemiSyncMasterHypothesis waits ReasonableReplicationLagSeconds. I'd like there to be another variable to control the wait time. This seems like it's overloaded.

TODO:

  • properly mark the failover as successful; currently it kinda retries even though it succeeded, not sure why
  • discuss downtime behavior with shlomi
  • possibly implement MasterWithIncorrectSemiSyncReplicas, see PoC: WIP: MasterWithIncorrectSemiSyncReplicas binwiederhier/orchestrator#1
  • when a replica is downtimed but replication is enabled, MasterWithTooManySemiSyncReplicas does not behave correctly
  • MaybeEnableSemiSyncReplica no longer manages the master flag in the new logic, though it previously did
  • excludeNotReplicatingReplicas should be a specific instance, not all non-replicating instances!
  • re-test old logic
  • handle master failover semi-sync enable/disable
  • semi-sync replica priority (or come up with better concept)
  • enable RecoverLockedSemiSyncMaster without exact mode
  • perform sanity checks in checkAndRecover* functions BEFORE enabling/disabling replicas
  • add ReasonableLockedSemiSyncSeconds with fallback to ReasonableReplicationLagSeconds

Collaborator

@shlomi-noach shlomi-noach left a comment

EnableSemiSync also manages the master flag. Do we really want that? Should we not have an ...

Right. I suggest moving away from EnableSemiSync(). Recall that this function was originally authored by the Vitess authors to manage post-failover semi-sync behavior, specifically for Vitess. My thinking is that that implementation is not in line with orchestrator's design; your proposed solution is. Let's just not use EnableSemiSync at all.

For your convenience, the functions SetSemiSyncMaster and SetSemiSyncReplica already exist and are used by the API/CLI.

Should there be two modes: EnforceSemiSyncReplicas: exact|enough (exact would handle MasterWithTooManySemiSyncReplicas and LockedSemiSyncMaster, and enough would only handle LockedSemiSyncMaster)?

Yes. Many production deployments will have 1 as rpl_semi_sync_master_wait_for_slave_count (BTW, MariaDB does not support this variable; it is implicitly always 1) together with multiple semi-sync replicas. I managed such a deployment and it worked well for us.

LockedSemiSyncMasterHypothesis waits ReasonableReplicationLagSeconds. I'd like there to be another variable to control the wait time. This seems like it's overloaded.

That makes sense. Let's fall back to ReasonableReplicationLagSeconds if the new variable is not configured. It must be non-zero in reality, so a value of 0 can indicate "not configured".
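
For illustration, that fallback could look like the following minimal Go sketch; the struct and method are invented for the example and use the config name the thread later settles on (ReasonableLockedSemiSyncMasterSeconds), not the PR's actual code:

```go
package config

// Config holds only the fields relevant to this sketch; the real orchestrator
// Config struct has many more. The field ReasonableLockedSemiSyncMasterSeconds
// follows the rename mentioned later in this thread and is illustrative.
type Config struct {
	ReasonableReplicationLagSeconds       uint
	ReasonableLockedSemiSyncMasterSeconds uint // 0 means "not configured"
}

// LockedSemiSyncMasterWaitSeconds returns the wait time used for the
// LockedSemiSyncMaster hypothesis: the dedicated setting if configured,
// otherwise the existing ReasonableReplicationLagSeconds.
func (c *Config) LockedSemiSyncMasterWaitSeconds() uint {
	if c.ReasonableLockedSemiSyncMasterSeconds > 0 {
		return c.ReasonableLockedSemiSyncMasterSeconds
	}
	return c.ReasonableReplicationLagSeconds
}
```

Using 0 as the "not configured" sentinel keeps the new option fully backwards compatible.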

Inline review comments on go/inst/analysis_dao.go and go/logic/topology_recovery.go
@binwiederhier
Contributor Author

You can choose to ignore the comment below if you want to. I think I've convinced myself that this is a good idea and I'll implement it. I can always revert if it doesn't pan out.


Aside from some battles I'm fighting in the code, there's one more open design question regarding LockedSemiSync:

In the current iteration of the PR, I made a config option EnforceSemiSyncReplicas with three possible values:

  • empty (current behavior; do not handle LockedSemiSyncMaster or MasterWithTooManySemiSyncReplicas)
  • exact to enable/disable semi sync on replicas to match priority order (for both LockedSemiSyncMaster and MasterWithTooManySemiSyncReplicas)
  • enough to handle LockedSemiSyncMaster by enabling semi-sync on replicas in priority order, but never disabling it on any replica.

This basically requires an opt-in to handle LockedSemiSyncMaster.

Alternatively we could make a RecoverLockedSemiSyncMaster bool flag (like you did in #1332 for RecoverNonWriteableMaster) and make EnforceSemiSyncReplicas a bool too. So we'd have:

RecoverLockedSemiSyncMaster bool
EnforceSemiSyncReplicas bool

Which would behave like this (see the sketch after this list):

  • For LockedSemiSyncMaster: if EnforceSemiSyncReplicas { recoverExactSemiSyncReplicas() } else { recoverEnoughSemiSyncReplicas() }
  • For MasterWithTooManySemiSyncReplicas (and later for MasterWithIncorrectSemiSyncReplicas): if EnforceSemiSyncReplicas { recoverExactSemiSyncReplicas() } else { nothing() }
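
A rough Go sketch of that dispatch; the flag variables and the two recover* helpers are stand-ins named after the pseudocode above, and the opt-in gate for LockedSemiSyncMaster is assumed from the behavior described later in this thread:

```go
package logic

// AnalysisCode mirrors orchestrator's analysis codes; only the two values
// relevant to this sketch are listed.
type AnalysisCode string

const (
	LockedSemiSyncMaster              AnalysisCode = "LockedSemiSyncMaster"
	MasterWithTooManySemiSyncReplicas AnalysisCode = "MasterWithTooManySemiSyncReplicas"
)

// Illustrative stand-ins for the proposed config flags and recovery helpers.
var (
	RecoverLockedSemiSyncMaster bool
	EnforceSemiSyncReplicas     bool
)

func recoverExactSemiSyncReplicas()  { /* enable/disable semi-sync on replicas to match priority order */ }
func recoverEnoughSemiSyncReplicas() { /* enable semi-sync on replicas in priority order, never disable */ }

// dispatchSemiSyncRecovery shows how the two booleans would select a strategy.
func dispatchSemiSyncRecovery(code AnalysisCode) {
	switch code {
	case LockedSemiSyncMaster:
		if !RecoverLockedSemiSyncMaster && !EnforceSemiSyncReplicas {
			return // handling LockedSemiSyncMaster stays opt-in
		}
		if EnforceSemiSyncReplicas {
			recoverExactSemiSyncReplicas()
		} else {
			recoverEnoughSemiSyncReplicas()
		}
	case MasterWithTooManySemiSyncReplicas:
		if EnforceSemiSyncReplicas {
			recoverExactSemiSyncReplicas()
		}
		// otherwise: do nothing
	}
}
```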

What do you think?

@shlomi-noach
Collaborator

Sounds good, and either choice works. If you pick EnforceSemiSyncReplicas, then I'd rename it to EnforceExactSemiSyncReplicas to be more precise.

@binwiederhier
Contributor Author

binwiederhier commented Jun 30, 2021

I think I am making progress:

I have now implemented both the "exact" mode (EnforceExactSemiSyncReplicas=true, see allowDisable = true in recoverSemiSyncReplicas) and the "enough" mode (RecoverLockedSemiSyncMaster=true, but EnforceExactSemiSyncReplicas=false, see allowDisable = false). Other than possibly making this more resilient and breaking things into smaller functions, I am quite happy with the logic. It works nicely.

I've also implemented ReasonableStaleBinlogCoordinatesSeconds with a fallback to ReasonableReplicationLagSeconds as discussed.

Questions

  1. A scenario this will not detect is if the counts match, but the priority order does not match. I wanted to implement MasterWithIncorrectSemiSyncReplicas, but for that we'd need to read more in GetReplicationAnalysis and I'm not sure if I can get all that I need in that giant query. What are your thoughts on this?
  2. Could you check the TODOs in the PR and give me some feedback as to whether I'm doing the right thing? Specifically, I pretty consistently see repeated failure detections happening after a successful recovery, until another round of detection happens. Not sure why that is.
  3. I have chosen to treat priority=0 (previously: semi_sync_enforced=0) as "this is an async replica". In the "exact" mode (allowDisable=true), that means we'll always disable semi-sync on these replicas (see actions array); in the "enough" mode (allowDisable=false), I don't touch them at all. Does that make sense to you? Obviously this only applies if EnforceExactSemiSyncReplicas or RecoverLockedSemiSyncMaster are set (see the sketch after this list).
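
To make question 3 concrete, here is a simplified Go sketch of the classification and action planning described above; the replica struct, its field names, and planSemiSyncActions are invented for illustration and do not mirror the PR's actual types:

```go
package logic

import "sort"

// replica carries only the fields this sketch needs.
type replica struct {
	Key             string
	Priority        uint // 0 means "always async": never made semi-sync
	SemiSyncEnabled bool
	LastCheckValid  bool
	Replicating     bool
}

// planSemiSyncActions returns the desired rpl_semi_sync_slave_enabled value per
// replica key. waitCount is the master's rpl_semi_sync_master_wait_for_slave_count;
// exact corresponds to the "exact" mode (allowDisable=true in the PR).
func planSemiSyncActions(replicas []replica, waitCount uint, exact bool) map[string]bool {
	var possible, alwaysAsync []replica
	for _, r := range replicas {
		switch {
		case !r.LastCheckValid || !r.Replicating:
			// excluded (defunct) replicas are reported but never acted upon
		case r.Priority == 0:
			alwaysAsync = append(alwaysAsync, r)
		default:
			possible = append(possible, r)
		}
	}
	// Highest priority first; a real implementation also needs a stable
	// tie-breaker (promotion rule, hostname, ...).
	sort.SliceStable(possible, func(i, j int) bool { return possible[i].Priority > possible[j].Priority })

	actions := map[string]bool{}
	if exact {
		// "exact": the top waitCount replicas get semi-sync; everything else,
		// including always-async replicas, gets it switched off.
		for i, r := range possible {
			want := uint(i) < waitCount
			if r.SemiSyncEnabled != want {
				actions[r.Key] = want
			}
		}
		for _, r := range alwaysAsync {
			if r.SemiSyncEnabled {
				actions[r.Key] = false
			}
		}
		return actions
	}
	// "enough": enable semi-sync in priority order until the wait count is
	// satisfied; never disable anything and never touch always-async replicas.
	enabled := uint(0)
	for _, r := range possible {
		if r.SemiSyncEnabled {
			enabled++
		}
	}
	for _, r := range possible {
		if enabled >= waitCount {
			break
		}
		if !r.SemiSyncEnabled {
			actions[r.Key] = true
			enabled++
		}
	}
	return actions
}
```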

Here are some logs of what it currently looks like:

MasterWithTooManySemiSyncReplicas

Jun 30 01:28:48 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:28:48 INFO executeCheckAndRecoverFunction: proceeding with MasterWithTooManySemiSyncReplicas recovery on pheckel-devm-db-g0-1:3306; isRecoverable?: true; skipProcesses: false
Jun 30 01:28:48 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:28:48 INFO topology_recovery: master semi-sync wait count is 1, currently we have 2 semi-sync replica(s)
Jun 30 01:28:48 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:28:48 INFO topology_recovery: possible semi-sync replicas (in priority order):
Jun 30 01:28:48 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:28:48 INFO topology_recovery: - pheckel-devm-db-g0-2:3306: semi-sync enabled = true, priority = 1, promotion rule = neutral, downtimed = false, last check = true, replicating = true
Jun 30 01:28:48 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:28:48 INFO topology_recovery: - pheckel-devm-db-g0-3:3306: semi-sync enabled = true, priority = 1, promotion rule = must_not, downtimed = false, last check = true, replicating = true
Jun 30 01:28:48 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:28:48 INFO topology_recovery: always-async replicas: (none)
Jun 30 01:28:48 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:28:48 INFO topology_recovery: excluded replicas (downtimed/defunct): (none)
Jun 30 01:28:48 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:28:48 INFO topology_recovery: taking actions:
Jun 30 01:28:48 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:28:48 INFO topology_recovery: - pheckel-devm-db-g0-3:3306: setting rpl_semi_sync_slave_enabled=false, restarting slave_io thread
Jun 30 01:28:48 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:28:48 INFO auditType:register-candidate instance:pheckel-devm-db-g0-3:3306 cluster:pheckel-devm-db-g0-1:3306 message:must_not
Jun 30 01:28:48 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:28:48 INFO auditType:register-candidate instance:pheckel-devm-db-g0-3:3306 cluster:pheckel-devm-db-g0-1:3306 message:must_not
Jun 30 01:28:48 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:28:48 INFO auditType:register-candidate instance:pheckel-devm-db-g0-1:3306 cluster:pheckel-devm-db-g0-1:3306 message:neutral
Jun 30 01:28:48 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:28:48 INFO topology_recovery: recovery complete; success = true

LockedSemiSyncMaster

Both "exact" and "enough" mode look the same in my setup currently until I get another replica going (I think).

Jun 30 01:26:12 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:26:12 INFO executeCheckAndRecoverFunction: proceeding with LockedSemiSyncMaster recovery on pheckel-devm-db-g0-1:3306; isRecoverable?: true; skipProcesses: false
Jun 30 01:26:12 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:26:12 INFO topology_recovery: master semi-sync wait count is 1, currently we have 0 semi-sync replica(s)
Jun 30 01:26:12 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:26:12 INFO topology_recovery: possible semi-sync replicas (in priority order):
Jun 30 01:26:12 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:26:12 INFO topology_recovery: - pheckel-devm-db-g0-2:3306: semi-sync enabled = false, priority = 1, promotion rule = neutral, downtimed = false, last check = true, replicating = true
Jun 30 01:26:12 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:26:12 INFO topology_recovery: - pheckel-devm-db-g0-3:3306: semi-sync enabled = false, priority = 1, promotion rule = must_not, downtimed = false, last check = true, replicating = true
Jun 30 01:26:12 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:26:12 INFO topology_recovery: always-async replicas: (none)
Jun 30 01:26:12 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:26:12 INFO topology_recovery: excluded replicas (downtimed/defunct): (none)
Jun 30 01:26:12 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:26:12 INFO topology_recovery: taking actions:
Jun 30 01:26:12 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:26:12 INFO topology_recovery: - pheckel-devm-db-g0-2:3306: setting rpl_semi_sync_slave_enabled=true, restarting slave_io thread
Jun 30 01:26:12 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:26:12 INFO auditType:register-candidate instance:pheckel-devm-db-g0-2:3306 cluster:pheckel-devm-db-g0-1:3306 message:neutral
Jun 30 01:26:12 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:26:12 INFO auditType:register-candidate instance:pheckel-devm-db-g0-2:3306 cluster:pheckel-devm-db-g0-1:3306 message:neutral
Jun 30 01:26:12 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:26:12 INFO auditType:register-candidate instance:pheckel-devm-db-g0-1:3306 cluster:pheckel-devm-db-g0-1:3306 message:neutral
Jun 30 01:26:12 pheckel-devm-orch-1 orchestrator[591830]: 2021-06-30 01:26:12 INFO topology_recovery: recovery complete; success = true

@shlomi-noach
Collaborator

A scenario this will not detect is if the counts match, but the priority order does not match. I wanted to implement MasterWithIncorrectSemiSyncReplicas, but for that we'd need to read more in GetReplicationAnalysis and I'm not sure if I can get all that I need in that giant query. What are your thoughts on this?

You're right, this will be hard to detect. Expanding the giant query (BTW, it is literally commonly referred to as "orchestrator's giant query") to include the extra information is impractical at this time. Let's not deal with this right now. Detection/recovery will only take place if the number of semi-sync replicas differs from (is smaller than) what is expected, but not when the number is identical, however the distribution may look.

Perhaps, in the future, we may hack around this by tricking orchestrator into a recovery.

I'll review further later today, and please be advised that I'll then be out till mid next week.

allowDisable is a confusing name. Can we make it more descriptive?

@binwiederhier
Contributor Author

binwiederhier commented Jul 1, 2021

MasterWithIncorrectSemiSyncReplicas

I will hold off on this for now.

I'll review further later today, and please be advised that I'll then be out till mid next week.

No rush.

allowDisable is a confusing name. Can we make it more descriptive?

I named it exactReplicaTopology bool now; it's still a little wonky, but okay.


Today I worked on fixing the EnableSemiSyncMaster stuff and now we're correctly enabling the semi-sync flags during a failover. I only tested in a 3-node topology, but I'm hoping that I don't have to change anything to make it work for more replicas.

I left more TODO markers for you to look at. I think I managed to keep everything 100% backwards compatible if none of the new flags are enabled, so I think that's good :-)

rename config option to ReasonableLockedSemiSyncMasterSeconds, split out MaybeDisableSemiSyncMaster
@binwiederhier
Contributor Author

Friendly reminder. 😁

@shlomi-noach
Collaborator

Friendly reminder. 😁

Yup! I hope to conclude tomorrow

Collaborator

@shlomi-noach shlomi-noach left a comment

Looks good! One cosmetic change requested, see inline.

Inline review comments on go/inst/instance_dao.go, go/inst/instance_topology_dao.go, and go/logic/topology_recovery.go
@shlomi-noach
Collaborator

We are still re-triggering recoveries right after they are resolved 4-5 times

Do you mean: immediately following a recovery there's another recovery running to fix the exact same problem, even though it's already fixed, and this repeats 4-5 times?

If this is the case, then something is missing: we have probably not marked the recovery as successful. Alternatively, anti-flapping doesn't work for this scenario. Also, is there a detection that still claims a semi-sync issue? I can understand if it takes 5-10 seconds for orchestrator to diagnose the resolved scenario. We should ensure that detections during that time are ignored.

Co-authored-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@binwiederhier
Contributor Author

Do you mean: immediately following a recovery there's another recovery running to fix the exact same problem, even though it's already fixed, and this repeats 4-5 times?

It's detecting the same situation again a few times after it is resolved. The recovery registration guards against the repeated execution, so it's fine. It's just ugly:

2021-07-14 18:11:25 ERROR AttemptRecoveryRegistration: instance pheckel-devm-db-g0-1:3306 has recently been promoted (by failover of pheckel-devm-db-g0-1:3306) and is in active period. It will not be failed over. You may acknowledge the failure on 

You can see the full output here: https://phil.nopaste.net/BfTtK0RggZ?a=ywx2nLkl8C -- Here's a longer excerpt:

2021-07-14 18:11:24 INFO executeCheckAndRecoverFunction: proceeding with LockedSemiSyncMaster recovery on pheckel-devm-db-g0-1:3306; isRecoverable?: true; skipProcesses: false
....
2021-07-14 18:11:24 INFO topology_recovery: semi-sync: recovery complete; success = true
...
2021-07-14 18:11:25 INFO topology_recovery: done running PostFailoverProcesses hooks
2021-07-14 18:11:25 INFO topology_recovery: Waiting for 0 postponed functions
2021-07-14 18:11:25 DEBUG PostponedFunctionsContainer: waiting on 0 postponed functions
2021-07-14 18:11:25 DEBUG PostponedFunctionsContainer: done waiting
2021-07-14 18:11:25 INFO topology_recovery: Executed 0 postponed functions

2021-07-14 18:11:25 INFO executeCheckAndRecoverFunction: proceeding with LockedSemiSyncMaster detection on pheckel-devm-db-g0-1:3306; isActionable?: true; skipProcesses: false
2021-07-14 18:11:25 INFO checkAndExecuteFailureDetectionProcesses: could not register LockedSemiSyncMaster detection on pheckel-devm-db-g0-1:3306
2021-07-14 18:11:25 INFO executeCheckAndRecoverFunction: proceeding with LockedSemiSyncMaster recovery on pheckel-devm-db-g0-1:3306; isRecoverable?: true; skipProcesses: false
2021-07-14 18:11:25 ERROR AttemptRecoveryRegistration: instance pheckel-devm-db-g0-1:3306 has recently been promoted (by failover of pheckel-devm-db-g0-1:3306) and is in active period. It will not be failed over. You may acknowledge the failure on pheckel-devm-db-g0-1:3306 (-c ack-instance-recoveries) to remove this blockage
2021-07-14 18:11:25 INFO topology_recovery: found an active or recent recovery on pheckel-devm-db-g0-1:3306. Will not issue another RecoverLockedSemiSyncMaster.
2021-07-14 18:11:25 ERROR 2021-07-14 18:11:25 ERROR AttemptRecoveryRegistration: instance pheckel-devm-db-g0-1:3306 has recently been promoted (by failover of pheckel-devm-db-g0-1:3306) and is in active period. It will not be failed over. You may acknowledge the failure on pheckel-devm-db-g0-1:3306 (-c ack-instance-recoveries) to remove this blockage

2021-07-14 18:11:26 INFO executeCheckAndRecoverFunction: proceeding with LockedSemiSyncMaster detection on pheckel-devm-db-g0-1:3306; isActionable?: true; skipProcesses: false
2021-07-14 18:11:26 INFO executeCheckAndRecoverFunction: proceeding with LockedSemiSyncMaster recovery on pheckel-devm-db-g0-1:3306; isRecoverable?: true; skipProcesses: false
2021-07-14 18:11:26 ERROR AttemptRecoveryRegistration: instance pheckel-devm-db-g0-1:3306 has recently been promoted (by failover of pheckel-devm-db-g0-1:3306) and is in active period. It will not be failed over. You may acknowledge the failure on pheckel-devm-db-g0-1:3306 (-c ack-instance-recoveries) to remove this blockage
2021-07-14 18:11:26 INFO topology_recovery: found an active or recent recovery on pheckel-devm-db-g0-1:3306. Will not issue another RecoverLockedSemiSyncMaster.
2021-07-14 18:11:26 ERROR 2021-07-14 18:11:26 ERROR AttemptRecoveryRegistration: instance pheckel-devm-db-g0-1:3306 has recently been promoted (by failover of pheckel-devm-db-g0-1:3306) and is in active period. It will not be failed over. You may acknowledge the failure on pheckel-devm-db-g0-1:3306 (-c ack-instance-recoveries) to remove this blockage

2021-07-14 18:11:27 INFO executeCheckAndRecoverFunction: proceeding with LockedSemiSyncMaster detection on pheckel-devm-db-g0-1:3306; isActionable?: true; skipProcesses: false
2021-07-14 18:11:27 INFO executeCheckAndRecoverFunction: proceeding with LockedSemiSyncMaster recovery on pheckel-devm-db-g0-1:3306; isRecoverable?: true; skipProcesses: false
2021-07-14 18:11:27 ERROR AttemptRecoveryRegistration: instance pheckel-devm-db-g0-1:3306 has recently been promoted (by failover of pheckel-devm-db-g0-1:3306) and is in active period. It will not be failed over. You may acknowledge the failure on pheckel-devm-db-g0-1:3306 (-c ack-instance-recoveries) to remove this blockage
2021-07-14 18:11:27 INFO topology_recovery: found an active or recent recovery on pheckel-devm-db-g0-1:3306. Will not issue another RecoverLockedSemiSyncMaster.
2021-07-14 18:11:27 ERROR 2021-07-14 18:11:27 ERROR AttemptRecoveryRegistration: instance pheckel-devm-db-g0-1:3306 has recently been promoted (by failover of pheckel-devm-db-g0-1:3306) and is in active period. It will not be failed over. You may acknowledge the failure on pheckel-devm-db-g0-1:3306 (-c ack-instance-recoveries) to remove this blockage
2021-07-14 18:11:33 DEBUG analysis: ClusterName: pheckel-devm-db-g0-1:3306, IsMaster: false, LastCheckValid: false, LastCheckPartialSuccess: false, CountReplicas: 0, CountValidReplicas: 0, CountValidReplicatingReplicas: 0, CountLaggingReplicas: 0, CountDelayedReplicas: 0, CountReplicasFailingToConnectToMaster: 0
2021-07-14 18:11:34 ERROR ReadTopologyInstance(pheckel-devm-db-g0-2:3306) show global status like 'Uptime': dial tcp 10.40.182.188:3306: connect: connection refused
2021-07-14 18:11:34 WARNING DiscoverInstance(pheckel-devm-db-g0-2:3306) instance is nil in 0.007s (Backend: 0.006s, Instance: 0.001s), error=dial tcp 10.40.182.188:3306: connect: connection refused
2021-07-14 18:11:39 INFO auditType:forget-unseen-differently-resolved instance::0 cluster: message:Forgotten instances: 0
2021-07-14 18:11:39 INFO auditType:forget-clustr-aliases instance::0 cluster: message:Forgotten aliases: 0

Note how executeCheckAndRecoverFunction is executed multiple times and (luckily) fails with ERROR AttemptRecoveryRegistration. It may just be that it takes this long for everything to fully recover though.

If this is the case, then something is missing: we have probably not marked the recovery as successful. Alternatively, anti-flapping doesn't work for this scenario. Also, is there a detection that still claims a semi-sync issue? I can understand if it takes 5-10 seconds for orchestrator to diagnose the resolved scenario. We should ensure that detections during that time are ignored.

I will investigate this a little more today and tomorrow. If my understanding is correct, the analysis is completely independent of the recovery, so during the recovery we will trigger an attempted recovery multiple times anyway. So this may in fact be an unavoidable situation, since it takes a while to restart the replica threads and all that.

@binwiederhier
Contributor Author

I've added two commits to remedy the situation. It is working nicely now. We're now re-reading the master after taking actions until it reaches the desired state (based on the count only).

Looks like this in the logs:

17:56:52 DEBUG semi-sync: master = pheckel-devm-db-g0-1:3306, master semi-sync wait count = 1, master semi-sync replica count = 2
17:56:52 DEBUG semi-sync: possible semi-sync replicas (in priority order):
17:56:52 DEBUG semi-sync: - pheckel-devm-db-g0-2:3306: semi-sync enabled = true, priority = 1, promotion rule = neutral, last check = true, replicating = true
17:56:52 DEBUG semi-sync: - pheckel-devm-db-g0-3:3306: semi-sync enabled = true, priority = 1, promotion rule = must_not, last check = true, replicating = true
17:56:52 DEBUG semi-sync: always-async replicas: (none)
17:56:52 DEBUG semi-sync: excluded replicas (defunct): (none)
17:56:52 DEBUG semi-sync: suggested actions:
17:56:52 DEBUG semi-sync: - pheckel-devm-db-g0-3:3306: should set semi-sync enabled = false
17:56:52 INFO topology_recovery: semi-sync: taking actions:
17:56:52 INFO topology_recovery: semi-sync: - pheckel-devm-db-g0-3:3306: setting rpl_semi_sync_slave_enabled=false, restarting slave_io thread
17:56:52 INFO topology_recovery: semi-sync: waiting for desired state:
17:56:52 INFO topology_recovery: semi-sync: - current semi-sync replica count = 2, desired semi-sync replica count = 1
17:56:53 INFO topology_recovery: semi-sync: - current semi-sync replica count = 1, desired semi-sync replica count = 1
17:56:53 INFO topology_recovery: semi-sync: recovery complete; success = true

Note the section waiting for desired state. This waits up to WaitForSemiSyncRecoverySeconds, which defaults to 3 * InstancePollSeconds.
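
For reference, the wait loop described here could look roughly like this in Go; countSemiSyncReplicas is a hypothetical callback that re-reads the master's semi-sync replica count, and none of these names are the PR's actual identifiers:

```go
package logic

import (
	"fmt"
	"time"
)

// waitForSemiSyncReplicaCount polls the master until the number of connected
// semi-sync replicas matches the desired count, or until the timeout expires.
// The timeout would come from WaitForSemiSyncRecoverySeconds (defaulting to
// 3 * InstancePollSeconds, as described above).
func waitForSemiSyncReplicaCount(countSemiSyncReplicas func() (uint, error), desired uint, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		current, err := countSemiSyncReplicas()
		if err != nil {
			return err
		}
		if current == desired {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for semi-sync replica count: current=%d, desired=%d", current, desired)
		}
		time.Sleep(time.Second)
	}
}
```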

Please find the commits for this here:

@binwiederhier
Contributor Author

I think I'm happy with this. Let me know if there is anything else you'd like changed.

@shlomi-noach
Collaborator

I'm gonna look at the superfluous detection cycle. I'd like to roll back 664d3c1 as well as 5382128. Blocking the recovery process while waiting for results does not align with how orchestrator runs things.

@binwiederhier
Contributor Author

It works fine without the two commits, so I'm fine with removing them. Let me know if you'd like to revert them.

The reason why the detection re-triggers is that when we enable/disable semi-sync on the replicas (which restarts the replication IO thread), the Rpl_semi_sync_master_clients variable on the master is only updated after the replicas properly reconnect. So the whole thing typically takes 1-2 seconds before getting into a proper state. If your poll interval is 5 seconds or even 20 seconds, then you'll re-trigger the detection for that long.
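
For illustration, here is a minimal database/sql sketch of reading the two server-side values involved; the variable and status names are MySQL's own (they require the semi-sync master plugin), while the function itself is invented and not orchestrator's actual code path:

```go
package inst

import (
	"database/sql"

	_ "github.com/go-sql-driver/mysql" // MySQL driver, imported for side effects
)

// readSemiSyncMasterStatus reads the master's configured semi-sync wait count
// and the number of currently connected semi-sync clients. The client count
// only catches up once replicas reconnect after rpl_semi_sync_slave_enabled is
// toggled, which is why a recovery can briefly look "not yet resolved".
func readSemiSyncMasterStatus(db *sql.DB) (waitCount, clients uint, err error) {
	if err = db.QueryRow("select @@global.rpl_semi_sync_master_wait_for_slave_count").Scan(&waitCount); err != nil {
		return 0, 0, err
	}
	var name string
	if err = db.QueryRow("show global status like 'Rpl_semi_sync_master_clients'").Scan(&name, &clients); err != nil {
		return 0, 0, err
	}
	return waitCount, clients, nil
}
```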

In my tests, I never had to wait longer than 1-2 seconds in the detection loop.

@shlomi-noach
Collaborator

Gotcha, thanks.

One last thing: this has to be documented?

@binwiederhier
Contributor Author

Gotcha, thanks.

Does that mean you'd like me to remove the two commits?

One last thing: this has to be documented?

You mean in the Wiki? I can do that if you like.

@shlomi-noach
Collaborator

Does that mean you'd like me to remove the two commits?

Yes please.

You mean in the Wiki?

I'm thinking under https://github.com/openark/orchestrator/blob/master/docs/configuration-discovery-classifying.md and under https://github.com/openark/orchestrator/blob/master/docs/configuration-recovery.md. WDYT?

@binwiederhier
Contributor Author

Sounds good. I'll do it tomorrow.

@binwiederhier
Contributor Author

I added docs. Let me know what you think. They are best viewed on Github:

Two side notes:

  • I read your blog post regarding your hiatus. Will you be cutting one last release that includes these changes? (Also: It's super sad that you're leaving Orchestrator behind, but totally understandable!)
  • I will be out of office for 3 weeks starting Wednesday; I'll probably still check in, but I'll mostly be away.

@shlomi-noach
Collaborator

Awesome job on the documentation, much appreciated.

Yes, I'll create one more release with these changes! Won't leave them stranded.

As I mentioned in my post I may be back after some break. I do love and enjoy this product, I'm just unable to keep up.

Collaborator

@shlomi-noach shlomi-noach left a comment

Thank you!

@shlomi-noach shlomi-noach merged commit 1a6c3cd into openark:master Jul 27, 2021
@cndoit18
Contributor

👍
