Async replication setup fails when there's write load to primary #399

Open
paulomach opened this issue Mar 27, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@paulomach

Steps to reproduce

  1. Deploy two clusters with more than one unit each
  2. Add write load to the cluster selected to become the primary
  3. Relate the clusters to form a cluster set

Expected behavior

Cluster set is formed

Actual behavior

Instances on the secondary cluster fail to join it

@paulomach added the bug label on Mar 27, 2024

paulomach added a commit that referenced this issue Apr 8, 2024
@paulomach

paulomach commented Apr 11, 2024

Upstream bug report: http://bugs.mysql.com/114624

Oracle's internal bug report reference: 36403327

paulomach added a commit that referenced this issue Apr 15, 2024
* poc k8s-k8s async repl

* isort had mismatched line_length config

* (WIP) improved support

* (WIP) support more cases

* scale up/down for repl clusters
* idle state for 2nds
* instance label when creating repl cluster
* more typing

* (wip) replica secondaries support

* allow replica secondaries to rejoin

* (WIP) sync cluster-set-name and better deal with repl 2ndaries

* (WIP) fix secondaries join and messaging

* fix: removed unneeded flag

* support for relation broken

* dissolved replica cluster stays blocked
* added the promote standby action
* support for cluster set name config
* minor refactors and typing

* async common methods, single module/file

* removed dup action and lint fixes

* check for user data on replica side

* moved async_replication to a library (owned by vm charm)

* sync with vm code

* lint fixes

* scale-in locks write to global primary

* address pr comments

* address second batch of PR comments

* fence/unfence actions

* fix for single unit replica cluster

* workaround: secrets in relation data

* partial unit test fixes

* normalized cluster name for dict reference

* remove test due to be refactored

* allows rejoin after unrelate

* fix lock instance reference

* allow unrelated cluster rejoin

* reset cluster-set name

* dealing with secret not found

* ensure secrets are shared with full uri

* fix recreation and unrelation after promotion

* ensure all units online

* unset read only after unfence

* refactor remove instance to accommodate changing

* fix race condition

* automatically deal with clusters with the same name

* bump

* test for mysql version and cluster-set-name on rejoin

* avoid handling on unit removal

* sync local root

* use method from lib

* lint fixes

* addressing pr feedback

* using node mode instead of role

* covering edge cases for recovery/failover

* add rejoin invalidated cluster action

* the integration test

* workaround for issue #399

* bump libpatch

* add group marks

* missing default

* set flag to avoid lock release

* chore: fixes old issue on test

* fix markers

* fix marker import

* fix retry process on removal

* fix user creation after refactor

`root@%` created only when required.

* lint/grammar fixes

* pr comment
@taurus-forever

@paulomach do we have an update from the MySQL team here? I see no activity in http://bugs.mysql.com/114624

@paulomach

paulomach commented Jun 19, 2024

@taurus-forever they've pinged me directly on Slack.

tl;dr: they cannot reproduce it in a sandboxed environment, so there's something particular about the way we set things up. I've allocated some time to compare their instructions with the way we do it.

For the record, the instructions:

dba.deploy_sandbox_instance(3310, {'password': "", 'sandboxDir': "/tmp/sandbox"});
dba.deploy_sandbox_instance(3320, {'password': "", 'sandboxDir': "/tmp/sandbox"});
dba.deploy_sandbox_instance(3330, {'password': "", 'sandboxDir': "/tmp/sandbox"});

# 1. Deploy Group Replication cluster (A)

shell.connect("root@127.0.0.1:3310", "")

cluster = dba.create_cluster("my_cluster")

cluster.add_instance("127.0.0.1:3320", {'recoveryMethod':'incremental'});
cluster.add_instance("127.0.0.1:3330", {'recoveryMethod':'incremental'});

# 2. Create ClusterSet on A

clusterset = cluster.create_cluster_set("foobar")

# 3. Deploy standalone instance B1

dba.deploy_sandbox_instance(3340, {'password': "", 'sandboxDir': "/tmp/sandbox"});

# 4. Create ReplicaCluster on B1 from A (cluster_set.create_replica_cluster)

clusterset.create_replica_cluster("127.0.0.1:3340", "replica", {'recoveryMethod':'incremental'});

# 5. Add some write load to cluster A

# \sql
my_session = mysql.get_classic_session('root@localhost:3310', '');
my_session.run_sql("CREATE DATABASE test");
my_session.run_sql("USE test;");
my_session.run_sql("CREATE TABLE t1 (c1 INT NOT NULL AUTO_INCREMENT PRIMARY KEY, c2 INT) ENGINE=InnoDB;");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");
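As a sanity check on the seeding script above: the eleven single-row inserts followed by five self-referencing `INSERT ... SELECT` statements, each of which doubles the table, leave `t1` with 11 × 2^5 = 352 rows before mysqlslap starts. A quick sketch of that arithmetic:

```python
# Row count produced by the seeding script: 11 single-row INSERTs,
# then 5 "INSERT INTO t1 (c2) SELECT c2 FROM t1" statements,
# each of which doubles the table.
rows = 11           # eleven INSERT INTO t1 (c2) VALUES (2) statements
for _ in range(5):  # five self-referencing INSERT ... SELECT doublings
    rows *= 2
print(rows)  # 352
```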

In another terminal, generate load with mysqlslap:

mysqlslap -S /tmp/sandbox/3310/sandboxdata/mysqld.sock --create-schema=test --delimiter=";" --iterations=300000 --query="INSERT INTO t1 (c2) SELECT c2 FROM t1 LIMIT 100" --concurrency=100 &

Then, in MySQL Shell, execute the final steps:

shell.connect("root@127.0.0.1:3310", "")

dba.deploy_sandbox_instance(3350, {'password': "", 'sandboxDir': "/tmp/sandbox"});

cluster = dba.get_cluster("my_cluster")

cluster.add_instance("127.0.0.1:3350", {'recoveryMethod':'incremental'})

@paulomach

Reproducible using:

dba.deploy_sandbox_instance(3310, {'password': "", 'sandboxDir': "/tmp/sandbox"});
dba.deploy_sandbox_instance(3320, {'password': "", 'sandboxDir': "/tmp/sandbox"});
dba.deploy_sandbox_instance(3330, {'password': "", 'sandboxDir': "/tmp/sandbox"});

# 1. Deploy Group Replication cluster (A)
shell.connect("root@127.0.0.1:3310", "")
cluster = dba.create_cluster("my_cluster")
cluster.add_instance("127.0.0.1:3320", {'recoveryMethod':'incremental'});
cluster.add_instance("127.0.0.1:3330", {'recoveryMethod':'incremental'});

# 2. Create ClusterSet on A
clusterset = cluster.create_cluster_set("foobar")

# 3. Deploy standalone instance B1
dba.deploy_sandbox_instance(3340, {'password': "", 'sandboxDir': "/tmp/sandbox"});

# 4. Create ReplicaCluster on B1 from A (cluster_set.create_replica_cluster)
clusterset.create_replica_cluster("127.0.0.1:3340", "replica", {'recoveryMethod':'incremental'});

# 5. Add some write load to cluster A

# \sql
my_session = mysql.get_classic_session('root@localhost:3310', '');
my_session.run_sql("CREATE DATABASE test");
my_session.run_sql("USE test;");
my_session.run_sql("CREATE TABLE t1 (c1 INT NOT NULL AUTO_INCREMENT PRIMARY KEY, c2 INT) ENGINE=InnoDB;");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");

In another terminal, generate load with mysqlslap:

mysqlslap -S /tmp/sandbox/3310/sandboxdata/mysqld.sock --create-schema=test --delimiter=";" --iterations=300000 --query="INSERT INTO t1 (c2) SELECT c2 FROM t1 LIMIT 100" --concurrency=100 &

Then, in MySQL Shell, execute the final steps:

shell.connect("root@127.0.0.1:3340", "")
dba.deploy_sandbox_instance(3350, {'password': "", 'sandboxDir': "/tmp/sandbox"});
# First difference - I'm adding instances to the replica cluster
cluster = dba.get_cluster("replica")
cluster.add_instance("127.0.0.1:3350", {'recoveryMethod':'incremental'})

# Second difference - sometimes it was necessary to add a third instance to the replica cluster
dba.deploy_sandbox_instance(3360, {'password': "", 'sandboxDir': "/tmp/sandbox"});
cluster.add_instance("127.0.0.1:3360", {'recoveryMethod':'incremental'})

session.run_sql("SELECT max(c1) FROM test.t1")
# PK stops incrementing while mysqlslap is running

# connect to `my_cluster` primary
shell.connect("root@127.0.0.1:3310", "")
# PK still increments while mysqlslap is running
session.run_sql("SELECT max(c1) FROM test.t1")
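The observation in the last two queries (`max(c1)` frozen on the replica while still advancing on the primary) can be captured as a simple check. A minimal sketch of that comparison logic, using hard-coded illustrative values in place of live `SELECT max(c1)` results:

```python
# Hypothetical samples of SELECT max(c1) FROM test.t1, taken a few seconds
# apart on each cluster while mysqlslap is running. Values are illustrative.
primary_samples = [120_400, 131_950, 143_210]   # my_cluster primary: still growing
replica_samples = [118_732, 118_732, 118_732]   # replica cluster: frozen

def is_stalled(samples):
    """An AUTO_INCREMENT PK that never advances across samples suggests stalled replication."""
    return len(set(samples)) == 1

assert not is_stalled(primary_samples)  # primary keeps inserting
assert is_stalled(replica_samples)      # replica stopped applying writes
```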
