Orchestrator promotes a replica with lag on Mariadb #1363
Description
We have a 3-node MariaDB 10.5.10 setup on CentOS: 1 primary and 2 replicas, with semi-sync enabled.
Our current orchestrator version is 3.2.4.
We had a scenario where the replicas were lagging by a few hours and the master was not reachable, so one of the replicas was promoted to primary in spite of the huge lag. This resulted in data loss. Ideally, orchestrator should wait for the replica's relay logs to be applied on the replica and only then promote it to master. Based on my testing, this is the behavior on MySQL but not on MariaDB.
--Test case:
Tests against MySQL and MariaDB were done with these orchestrator parameters in /etc/orchestrator.conf.json
"DelayMasterPromotionIfSQLThreadNotUpToDate": true,
"debug": true
Restart orchestrator on all 3 nodes
I) Test on MariaDB:
Start a 3-node MariaDB cluster (semi-sync enabled)
1. Create and add data to a test table
create table test (colA int, colB int, colC datetime, colD int);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
2. Stop slave SQL_THREAD on replicas (Node 2, 3)
3. Wait for a few seconds and add some more data to Node 1 (master)
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
insert into test values (rand()*100,rand()*1000,now(),rand()*10000);
4. Stop mysqld on Master (Node 1)
5. You will see orchestrator promoting a replica without the data added in Step 3.
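For anyone reproducing this, the replica-side part of the steps above can be sketched as follows. This is a minimal sketch, assuming classic `SHOW SLAVE STATUS` syntax (available on both MariaDB 10.5 and MySQL 5.7); the comments describe what I would expect to see, not captured output.

```sql
-- Run on each replica (Node 2 and Node 3).
-- Stop only the applier (SQL) thread; the IO thread keeps running,
-- so events from the master still arrive in the relay log.
STOP SLAVE SQL_THREAD;

-- Check the thread state: Slave_IO_Running should be Yes
-- and Slave_SQL_Running should be No.
SHOW SLAVE STATUS;

-- After the second batch of inserts on the master, run this again.
-- Read_Master_Log_Pos should have advanced while Exec_Master_Log_Pos
-- has not, i.e. the relay log holds events that are not yet applied.
SHOW SLAVE STATUS;
```

The gap between `Read_Master_Log_Pos` and `Exec_Master_Log_Pos` is the unapplied relay log that the promotion should wait for.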
II) Test on MySQL 5.7.32:
Repeat the same test on a 3-node MySQL cluster.
You will notice orchestrator promoting one of the replicas without any data loss, i.e. all 14 rows are present.
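To compare the two runs, a simple check on the newly promoted primary is enough. This is a rough sketch; the expected counts in the comments follow from the test steps (7 rows before stopping the SQL thread, 7 after) and are not captured output.

```sql
-- Run on the replica that orchestrator promoted.
-- 14 rows means the relay log was fully applied before promotion.
-- 7 rows would mean none of the rows from the second batch were applied.
SELECT COUNT(*) AS row_count FROM test;

-- Timestamp of the newest row present, to see which batch it belongs to.
SELECT MAX(colC) AS newest_row FROM test;
```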
Thank You
Mohan