
New master is not replicating after a failover #613

Closed · Fixed by #690

michaellzc opened this issue Oct 15, 2020 · 6 comments

Comments

@michaellzc

michaellzc commented Oct 15, 2020

Info

Version: the latest operator and cluster chart from this repository.

Description

Our test environment consists of 3 nodes.

  • $release-mysqlcluster-db-0 (master)
  • $release-mysqlcluster-db-1
  • $release-mysqlcluster-db-2

We are simulating a master node failure by killing the mysqld process in db-0.
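
For example, a minimal way to trigger this, assuming kubectl access to the pods; the container name and the exact kill command are assumptions, not taken from the issue:

# Sketch: simulate the master crash by killing mysqld inside the master pod.
# The container name ("mysql") and the availability of pkill in the image are assumptions.
kubectl exec $release-mysqlcluster-db-0 -c mysql -- pkill -9 mysqld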

On a DeadMaster event, Orchestrator automatically promotes db-1 to master, but the new master node is then stuck with a "not replicating" error.

What would be the correct recovery process?

Operator log

{"severity":"INFO","timestamp":"2020-10-15T21:21:41.371071513Z","logger":"orchestrator-reconciler","message":"cluster not ready for acknowledge","key":"$namespace/$release-mysqlcluster-db","threshold":600}
{"severity":"ERROR","timestamp":"2020-10-15T21:26:14.340158975Z","logger":"kubebuilder.controller","message":"Reconciler error","controller":"mysqlbackup-controller","request":"$namespace/$release-mysql-cluster-db-auto-2020-10-14t19-24-00","error":"MysqlCluster.mysql.presslabs.org \"$release-mysql-cluster-db\" not found","stacktrace":"github.com/presslabs/mysql-operator/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/presslabs/mysql-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/presslabs/mysql-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/presslabs/mysql-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\ngithub.com/presslabs/mysql-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/presslabs/mysql-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

Dead master pod (db-0) log after restart

2020-10-15T21:11:23.994309Z 0 [Note] mysqld: ready for connections.
Version: '5.7.26-29-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Percona Server (GPL), Release 29, Revision 11ad961
2020-10-15T21:11:23.995177Z 3 [Note] Got an error reading communication packets
2020-10-15T21:11:23.995742Z 4 [Note] Got an error reading communication packets
2020-10-15T21:11:24.173001Z 6 [Note] Start binlog_dump to master_thread_id(6) slave_server(101), pos(, 4)
2020-10-15T21:11:28.133393Z 18 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--relay-log=$release-mysqlcluster-db-mysql-0-relay-bin' to avoid this problem.
2020-10-15T21:11:28.150677Z 18 [Note] 'CHANGE MASTER TO FOR CHANNEL '' executed'. Previous state master_host='', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='$release-mysqlcluster-db-mysql-1.mysql.$release', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''.
2020-10-15T21:11:28.175401Z 20 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2020-10-15T21:11:28.176719Z 21 [Note] Slave SQL thread for channel '' initialized, starting replication in log 'FIRST' at position 0, relay log './$release-mysqlcluster-db-mysql-0-relay-bin.000001' position: 4
2020-10-15T21:11:28.184056Z 20 [Note] Slave I/O thread for channel '': connected to master 'sys_replication@$release-mysqlcluster-db-mysql-1.mysql.$release:3306',replication started in log 'FIRST' at position 4
2020-10-15T21:11:31.551274Z 6 [Note] Aborted connection 6 to db: 'unconnected' user: 'sys_replication' host: '172.30.254.227' (failed on flush_net())

New master pod (db-1) log

2020-10-15T21:40:51.873465Z 576 [ERROR] Slave I/O for channel '': error connecting to master 'sys_replication@//$release-mysqlcluster-db-mysql-0.mysql.$namespace:3306' - retry-time: 1  retries: 1755, Error_code: 2005
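
For context, a minimal way to confirm what this log shows, i.e. that the promoted node still has the old replication channel configured and retrying; the pod/container names and the credential variable below are assumptions, not taken from the chart:

# Sketch: inspect the stale replication channel on the promoted master (db-1).
# Pod/container names and the root credential variable are assumptions.
kubectl exec $release-mysqlcluster-db-1 -c mysql -- \
  mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "SHOW SLAVE STATUS\G" \
  | grep -E 'Master_Host|Slave_IO_Running|Last_IO_Error'
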
@baurmatt
Contributor

Seeing more or less the same in our tests. We're testing how we're going to handle node drains. While testing this, I've realized that no matter what kind of persistence we configure (hostPath, local path PVC, SDS PVC, emptyDir), the cluster is always broken in Orchestrator after the drain. The MySQL pods get recreated correctly on another node, but it seems like the cluster isn't correctly set back to a fully functional state.

@michaellzc
Author

> Seeing more or less the same in our tests. We're testing how we're going to handle node drains. While testing this, I've realized that no matter what kind of persistence we configure (hostPath, local path PVC, SDS PVC, emptyDir), the cluster is always broken in Orchestrator after the drain. The MySQL pods get recreated correctly on another node, but it seems like the cluster isn't correctly set back to a fully functional state.

We then tried setting ApplyMySQLPromotionAfterMasterFailover: true to work around the issue, but it made things worse: if there are concurrent master node failures, it results in split-brain. The only recovery we found is to keep killing the master node until db-0 becomes master again, then scale down from 3 to 1 and scale up again. This is the kind of thing we expect the operator to handle for us.
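
For reference, ApplyMySQLPromotionAfterMasterFailover is a plain Orchestrator setting; the fragment below only illustrates the flag being toggled in orchestrator.conf.json and is not copied from the chart's actual configuration:

{
  "ApplyMySQLPromotionAfterMasterFailover": true
}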

@ynnt

ynnt commented Oct 22, 2020

To me it looks like this is not really a problem, but rather a misleading message from the operator/Orchestrator. Replication is still working in async mode, and the master doesn't have to replicate anything at all.

After I reset the new master's replication via Orchestrator, everything becomes normal again.
It looks like this is worth automating inside the operator.
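
The manual equivalent of that reset, run against the promoted master only, would be roughly the following sketch; the pod/container names and the credential variable are assumptions, and this is the SQL behind the Orchestrator reset rather than anything the operator itself runs:

# Sketch: clear the stale replication configuration on the promoted master (db-1).
# Run this against the promoted node only; names and credentials are assumptions.
kubectl exec $release-mysqlcluster-db-1 -c mysql -- \
  mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "STOP SLAVE; RESET SLAVE ALL;"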

@munjalpatel

@ExiaSR @ynnt I am also running into the same issue. Did you ever resolve this?

@cndoit18
Collaborator

cndoit18 commented Jun 9, 2021

I think PR #690 can solve this problem.
cc @munjalpatel @ynnt

@cndoit18
Collaborator

I think I've fixed it; you can try version 0.5.1, so I'll close this for now. If this problem happens again, please reopen.
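
A minimal upgrade sketch, assuming the presslabs Helm chart repository and that the chart version matches the operator release mentioned above; adjust the release name to your setup:

# Sketch: upgrade the operator to the release mentioned above.
# The repo URL, release name, and chart/app version mapping are assumptions.
helm repo add presslabs https://presslabs.github.io/charts
helm repo update
helm upgrade mysql-operator presslabs/mysql-operator --version 0.5.1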
