Can orchestrator + semi-sync guarantee zero data loss? #1312
@shlomi-noach I'm taking the liberty of hoping you'll find the time to answer my question. I don't know golang, so I don't know much about orchestrator's failover logic; please forgive me if I'm wrong, and I look forward to your reply.
Whoops, sorry, missed this in the backlog. Right, I think I saw another similar question recently. What your tests show is:
The systems I've worked with are such that replication lag is very low (by actively pushing back on apps). Therefore, at time of failover, it only takes a fraction of a second for any replica to consume whatever relay log events are in its queue. Back to your question, could the following configuration help? So, we need a mechanism that chooses a replica based on potential data, not on current data. This is only applicable to GTID-based failovers, because you can only compare replicas in GTID topologies. Let me look into this.
Maybe use Master_Log_File and Read_Master_Log_Pos?
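(For context, the distinction between "potential data" a replica has received and "current data" it has executed is visible in SHOW SLAVE STATUS. A minimal sketch, assuming the mysql client is available on the replica:)

```sh
# Received ("potential") coordinates vs executed ("current") coordinates on a replica.
# Master_Log_File / Read_Master_Log_Pos:        how far the IO thread has fetched.
# Relay_Master_Log_File / Exec_Master_Log_Pos:  how far the SQL thread has applied.
# Retrieved_Gtid_Set / Executed_Gtid_Set:       the same distinction in GTID terms.
mysql -e "SHOW SLAVE STATUS\G" | grep -E \
  'Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Retrieved_Gtid_Set|Executed_Gtid_Set'
```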
Hi, have you made any progress?
You may be interested in this: https://datto.engineering/post/lossless-mysql-semi-sync-replication-and-automated-failover (disclaimer: I wrote it :-))
Thank you @binwiederhier, the article was very helpful. I was planning to replace MHA with Orchestrator this year, but I've found that Orchestrator's philosophy is different from MHA's: Orchestrator tends to prioritise availability and retain the maximum number of replicas in the cluster. Orchestrator uses ExecBinlogCoordinates to select the candidate, which does have the potential for data loss in the extreme scenario I described. So I learned a bit of Go and made some "modifications" over the May Day holiday, which are still being tested.

However, while studying the source code I found that there is something wrong with DelayMasterPromotionIfSQLThreadNotUpToDate: it doesn't "work". According to the code call path I traced:
StopReplicationNicely ultimately executes STOP SLAVE, and I can't find anywhere in the code where START SLAVE SQL_THREAD is executed afterwards. So DelayMasterPromotionIfSQLThreadNotUpToDate ends up waiting on a stopped slave... I'll have to look into it. Your orchestrator.json is very informative for me. Anyway, thanks~
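(A rough sketch, not orchestrator code, of what "wait until the SQL thread is up to date" would have to do if a prior STOP SLAVE already stopped the applier: restart only the SQL thread and poll until it has applied everything already fetched into the relay log. Assumes a single replication channel and classic file/position coordinates:)

```sh
# Restart only the applier; the IO thread stays stopped, so no new events arrive.
mysql -e "START SLAVE SQL_THREAD;"

# Poll until everything already in the relay log has been applied, i.e. the
# executed position catches up to the received position.
# (Strictly, Master_Log_File and Relay_Master_Log_File should also match.)
while true; do
  read -r read_pos exec_pos < <(mysql -e "SHOW SLAVE STATUS\G" | awk \
    '/ Read_Master_Log_Pos:/ {r=$2} / Exec_Master_Log_Pos:/ {e=$2} END {print r, e}')
  [ "$read_pos" = "$exec_pos" ] && break
  sleep 0.1
done
```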
@Fanduzi Can lossless semi-sync replication also lose data in the situation described above?
My test results show that it can. You can test it yourself too.
Yes, I think so. The replica didn't receive the complete relay log, so the old master ends up with extra transactions.
Let's say I have one master and two slaves, semi-sync is on, rpl_semi_sync_master_wait_for_slave_count = 1
M is the master
S1 and S2 are the slaves
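(For reference, a sketch of the semi-sync setup implied by this scenario, using the standard semisync plugin variables; wait_for_slave_count = 1 comes from the scenario, the other values are assumptions:)

```sh
# On M (the master): lossless semi-sync, one replica ACK is enough to commit.
mysql -e "SET GLOBAL rpl_semi_sync_master_enabled = ON;
          SET GLOBAL rpl_semi_sync_master_wait_for_slave_count = 1;
          SET GLOBAL rpl_semi_sync_master_wait_point = AFTER_SYNC;  -- 'lossless' semi-sync
          SET GLOBAL rpl_semi_sync_master_timeout = 1000000000;     -- assumed: avoid silent fallback to async"

# On S1 and S2 (the slaves):
mysql -e "SET GLOBAL rpl_semi_sync_slave_enabled = ON;"
```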
At some point:
Then the master crashes. Which slave will become the new master?
Here I provide a test method for this scenario:
After running these commands, S2 will not receive the master's binlog events, but S2's Slave_IO_Running will still show 'Yes'.
4. Shut down the master, run
tc qdisc del dev ens33 root
on S2, and release the lock on S1.
5. See who becomes the new master. (In our tests, orchestrator chose S2 as the new master, but I think S1 should have been chosen.)
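(The commands for the earlier steps did not survive above. Purely as an illustration, one way to produce the described state, with S1 receiving but not applying events while S2 stops receiving although Slave_IO_Running still shows 'Yes', might be the following; apart from the ens33 interface and the tc command already quoted, every detail here is an assumption rather than the author's exact test:)

```sh
# 1. On S1: hold up the applier so events are received (and ACKed) but not applied.
mysql -e "STOP SLAVE SQL_THREAD;"            # or block it behind a table lock

# 2. On S2: silently drop all traffic so the IO thread stops receiving new events
#    while Slave_IO_Running still reports 'Yes'.
tc qdisc add dev ens33 root netem loss 100%

# 3. On M: write a few transactions; with rpl_semi_sync_master_wait_for_slave_count = 1,
#    S1's relay-log ACK alone lets them commit. (test.t is a hypothetical table.)
mysql -e "INSERT INTO test.t VALUES (1), (2), (3);"

# 4. Kill mysqld on M, then restore S2's network and release S1:
#    (on S2) tc qdisc del dev ens33 root
#    (on S1) mysql -e "START SLAVE SQL_THREAD;"

# 5. Watch which replica orchestrator promotes.
```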