EnforceSemiSyncReplicas & RecoverLockedSemiSyncMaster - actively enable/disable semi-sync replicas to match master's wait count #1373
Conversation
EnableSemiSync also manages the master flag. Do we really want that? Should we not have an EnableSemiSyncReplica?
Right. I suggest moving away from EnableSemiSync(). Recall that this function was originally authored by the Vitess authors to manage post-failover semi-sync behavior, specifically for Vitess. What I'm thinking is that that implementation is not in line with orchestrator's design; your proposed solution is. Let's just not use EnableSemiSync at all.
For your convenience, the functions SetSemiSyncMaster and SetSemiSyncReplica already exist and are used by the API/CLI.
Should there be two modes: EnforceSemiSyncReplicas: exact|enough (exact would handle MasterWithTooManySemiSyncReplicas and LockedSemiSyncMaster, and enough would only handle LockedSemiSyncMaster)?
Yes. Many production deployments will have 1 as rpl_semi_sync_master_wait_for_slave_count (BTW MariaDB does not support this variable and it is implicitly always 1) and multiple semi-sync replicas. I managed such a deployment and it worked well for us.
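The difference between the two proposed modes can be summarized with a small sketch (all names below are illustrative, not orchestrator's actual code): "exact" treats both a deficit and a surplus of semi-sync replicas relative to the master's wait count as a problem, while "enough" only reacts to a deficit, which is what would block commits on the master.

```go
package main

import "fmt"

// Hypothetical analysis labels, mirroring the names used in this thread.
type analysis string

const (
	noProblem                         analysis = "NoProblem"
	lockedSemiSyncMaster              analysis = "LockedSemiSyncMaster"
	masterWithTooManySemiSyncReplicas analysis = "MasterWithTooManySemiSyncReplicas"
)

// classify compares the master's semi-sync wait count against the number
// of currently connected semi-sync replicas. exactMode selects between
// the proposed "exact" and "enough" behaviors.
func classify(waitCount, semiSyncReplicas int, exactMode bool) analysis {
	if semiSyncReplicas < waitCount {
		// Too few semi-sync replicas: commits on the master would block.
		// Both modes handle this case.
		return lockedSemiSyncMaster
	}
	if exactMode && semiSyncReplicas > waitCount {
		// Only "exact" mode cares about a surplus of semi-sync replicas.
		return masterWithTooManySemiSyncReplicas
	}
	return noProblem
}

func main() {
	fmt.Println(classify(1, 0, false)) // LockedSemiSyncMaster
	fmt.Println(classify(1, 2, false)) // NoProblem ("enough" tolerates surplus)
	fmt.Println(classify(1, 2, true))  // MasterWithTooManySemiSyncReplicas
}
```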
LockedSemiSyncMasterHypothesis waits ReasonableReplicationLagSeconds. I'd like there to be another variable to control the wait time. This seems like it's overloaded.
That makes sense. Let's fall back to ReasonableReplicationLagSeconds if the new variable is not configured. It must be non-zero in reality, so the value 0 can indicate that the value is "not configured".
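A minimal sketch of that fallback, assuming a hypothetical config struct holding the two settings named in this discussion (the struct and method are illustrative, not orchestrator's actual types):

```go
package main

import "fmt"

// Config holds the two timeout settings discussed above. A zero value for
// ReasonableLockedSemiSyncMasterSeconds means "not configured".
type Config struct {
	ReasonableLockedSemiSyncMasterSeconds uint
	ReasonableReplicationLagSeconds       uint
}

// lockedSemiSyncMasterWaitSeconds returns the dedicated timeout if set,
// otherwise falls back to ReasonableReplicationLagSeconds.
func (c *Config) lockedSemiSyncMasterWaitSeconds() uint {
	if c.ReasonableLockedSemiSyncMasterSeconds > 0 {
		return c.ReasonableLockedSemiSyncMasterSeconds
	}
	return c.ReasonableReplicationLagSeconds
}

func main() {
	c := &Config{ReasonableReplicationLagSeconds: 10}
	fmt.Println(c.lockedSemiSyncMasterWaitSeconds()) // 10: fallback applies

	c.ReasonableLockedSemiSyncMasterSeconds = 30
	fmt.Println(c.lockedSemiSyncMasterWaitSeconds()) // 30: explicit value wins
}
```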
…-semi-sync-replica-count
promotion_rule; only handle "exact" case for now
You can choose to ignore the comment below if you want to. I think I've convinced myself that this is a good idea and I'll implement it. I can always revert if it doesn't pan out.
Aside from some battles I'm fighting in the code, there's one more open design question. In the current iteration of the PR, I made a config option that basically requires an opt-in to handle this. Alternatively we could make a config option which would behave like this: …
What do you think?
Sounds good, either choice. If you pick using …
…rdinatesSeconds, RecoverLockedSemiSyncMaster and EnforceExactSemiSyncReplicas
I think I am making progress: I have now implemented both the "exact" mode and … I've also implemented …
Questions:
Here are some logs of what it currently looks like: MasterWithTooManySemiSyncReplicas
LockedSemiSyncMaster
Both "exact" and "enough" modes look the same in my setup currently until I get another replica going (I think).
You're right, this will be hard to detect. Expanding the giant query (BTW it's literally commonly referred to as "orchestrator's giant query") to include the extra information is impractical at this time. Let's not deal with this right now. Detection/recovery will only take place if the number of semi-sync replicas is different/smaller than expected, but not when the number is identical, however the distribution may differ. Perhaps, in the future, we may hack around this by tricking …
I'll review further later today, and please be advised that I'll then be out till mid next week.
I will hold off on this for now.
No rush.
I named it … Today I worked on fixing the … I left more …
rename config option to ReasonableLockedSemiSyncMasterSeconds, split out MaybeDisableSemiSyncMaster
Friendly reminder. 😁
Yup! I hope to conclude tomorrow.
Looks good! One cosmetic requested change, see inline
Do you mean: immediately following a recovery there's another recovery running to fix the exact same problem, even though it's already fixed, and this repeats 4-5 times? If this is the case, then something is missing: we have probably not marked the recovery as successful. Alternatively, anti-flapping doesn't work for this scenario. Also, is there a detection that still claims a semi-sync issue? I can understand if it takes 5-10 seconds for …
Co-authored-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
It's detecting the same situation again a few times after it is resolved. The recovery registration is guarding against repeated execution, so it's fine. It's just ugly:
You can see the full output here: https://phil.nopaste.net/BfTtK0RggZ?a=ywx2nLkl8C -- Here's a longer excerpt:
Note how executeCheckAndRecoverFunction is executed multiple times and (luckily) fails with …
I will investigate this a little more today and tomorrow. If my understanding is correct, the analysis is completely independent of the recovery, so during the recovery we will trigger and attempt recovery multiple times anyway. So this may in fact be an unavoidable situation, since it takes a while to restart the replica threads and all that.
I've added two commits to remedy the situation. It is working nicely now. We're now re-reading the master after taking action until the desired state is reached (based on the count only). Looks like this in the logs:
Note the section … Please find the commits for this here:
I think I'm happy with this. Let me know if there is anything else you'd like changed.
It works fine without the two commits, so I'm fine with removing them. Let me know if you'd like to revert them. The reason why the detection re-triggers is that when we enable/disable semi-sync on the replicas, the Rpl_semi_sync_master_clients status variable on the master is only updated after the clients properly reconnect. So the whole thing typically takes 1-2 seconds before getting into a proper state. If your poll interval is 5 seconds or even 20 seconds, then you'll re-trigger the detection for that long. In my tests, I never had to wait longer than 1-2 seconds in the detection loop.
Gotcha, thanks. One last thing: this has to be documented?
Does that mean you'd like me to remove the two commits?
You mean in the Wiki? I can do that if you like.
Yes please.
I'm thinking under https://github.com/openark/orchestrator/blob/master/docs/configuration-discovery-classifying.md and under https://github.com/openark/orchestrator/blob/master/docs/configuration-recovery.md. WDYT?
Sounds good. I'll do it tomorrow.
I added docs. Let me know what you think. They are best viewed on Github.
Two side notes:
Awesome job on the documentation, much appreciated. Yes, I'll create one more release with these changes! Won't leave them stranded. As I mentioned in my post, I may be back after some break. I do love and enjoy this product, I'm just unable to keep up.
Thank you!
👍
This is a WIP PR that attempts to address #1360.
There are tons of open questions and things missing, but this is the idea.
Open questions (all answered in #1373 (review)):

- EnableSemiSync also manages the master flag. Do we really want that? Should we not have an EnableSemiSyncReplica?
- Should there be two modes: EnforceSemiSyncReplicas: exact|enough (exact would handle MasterWithTooManySemiSyncReplicas and LockedSemiSyncMaster, and enough would only handle LockedSemiSyncMaster)?
- LockedSemiSyncMasterHypothesis waits ReasonableReplicationLagSeconds. I'd like there to be another variable to control the wait time. This seems like it's overloaded.

TODO:

- MasterWithIncorrectSemiSyncReplicas, see PoC: WIP: MasterWithIncorrectSemiSyncReplicas binwiederhier/orchestrator#1
- MasterWithTooManySemiSyncReplicas does not behave correctly
- MaybeEnableSemiSyncReplica does not manage the master flag though it previously did (in the new logic only)
- excludeNotReplicatingReplicas should be a specific instance, not all non-replicating instances!
- RecoverLockedSemiSyncMaster without exact mode
- checkAndRecover* functions BEFORE enabling/disabling replicas
- ReasonableLockedSemiSyncSeconds with fallback to ReasonableReplicationLagSeconds
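Pulling the options discussed in this thread together, a configuration enabling the new behavior might look roughly like the fragment below (option names as used in this PR's commits and comments; the values are illustrative, not recommendations):

```json
{
  "EnforceExactSemiSyncReplicas": true,
  "RecoverLockedSemiSyncMaster": true,
  "ReasonableLockedSemiSyncMasterSeconds": 10,
  "ReasonableReplicationLagSeconds": 10
}
```

With ReasonableLockedSemiSyncMasterSeconds left at 0 (or omitted), the value of ReasonableReplicationLagSeconds would be used instead, per the fallback agreed on above.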