
New master is not replicating after a failover #613

Closed · Fixed by #690

michaellzc opened this issue Oct 15, 2020 · 6 comments

Comments

@michaellzc

michaellzc commented Oct 15, 2020

Info

Version: the latest operator and cluster chart from this repository.

Description

Our test environment consists of 3 nodes.

  • $release-mysqlcluster-db-0 (master)
  • $release-mysqlcluster-db-1
  • $release-mysqlcluster-db-2

We are simulating a master node failure by killing the mysqld process in db-0.
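
For example, a minimal way to trigger this, assuming kubectl access to the pods; the container name and the exact kill command are assumptions, not taken from the issue:

# Sketch: simulate the master crash by killing mysqld inside the master pod.
# The container name ("mysql") and the availability of pkill in the image are assumptions.
kubectl exec $release-mysqlcluster-db-0 -c mysql -- pkill -9 mysqld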

On a DeadMaster event, Orchestrator automatically promotes db-1 to master, but the new master node is then stuck with a "not replicating" error.

What would be the correct recovery process?

Operator log

{"severity":"INFO","timestamp":"2020-10-15T21:21:41.371071513Z","logger":"orchestrator-reconciler","message":"cluster not ready for acknowledge","key":"$namespace/$release-mysqlcluster-db","threshold":600}
{"severity":"ERROR","timestamp":"2020-10-15T21:26:14.340158975Z","logger":"kubebuilder.controller","message":"Reconciler error","controller":"mysqlbackup-controller","request":"$namespace/$release-mysql-cluster-db-auto-2020-10-14t19-24-00","error":"MysqlCluster.mysql.presslabs.org \"$release-mysql-cluster-db\" not found","stacktrace":"github.com/presslabs/mysql-operator/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/presslabs/mysql-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/presslabs/mysql-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/presslabs/mysql-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\ngithub.com/presslabs/mysql-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/presslabs/mysql-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

Dead master pod (db-0) log after restart

2020-10-15T21:11:23.994309Z 0 [Note] mysqld: ready for connections.
Version: '5.7.26-29-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Percona Server (GPL), Release 29, Revision 11ad961
2020-10-15T21:11:23.995177Z 3 [Note] Got an error reading communication packets
2020-10-15T21:11:23.995742Z 4 [Note] Got an error reading communication packets
2020-10-15T21:11:24.173001Z 6 [Note] Start binlog_dump to master_thread_id(6) slave_server(101), pos(, 4)
2020-10-15T21:11:28.133393Z 18 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--relay-log=$release-mysqlcluster-db-mysql-0-relay-bin' to avoid this problem.
2020-10-15T21:11:28.150677Z 18 [Note] 'CHANGE MASTER TO FOR CHANNEL '' executed'. Previous state master_host='', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='$release-mysqlcluster-db-mysql-1.mysql.$release', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''.
2020-10-15T21:11:28.175401Z 20 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2020-10-15T21:11:28.176719Z 21 [Note] Slave SQL thread for channel '' initialized, starting replication in log 'FIRST' at position 0, relay log './$release-mysqlcluster-db-mysql-0-relay-bin.000001' position: 4
2020-10-15T21:11:28.184056Z 20 [Note] Slave I/O thread for channel '': connected to master 'sys_replication@$release-mysqlcluster-db-mysql-1.mysql.$release:3306',replication started in log 'FIRST' at position 4
2020-10-15T21:11:31.551274Z 6 [Note] Aborted connection 6 to db: 'unconnected' user: 'sys_replication' host: '172.30.254.227' (failed on flush_net())

New master pod (db-1) log

2020-10-15T21:40:51.873465Z 576 [ERROR] Slave I/O for channel '': error connecting to master 'sys_replication@//$release-mysqlcluster-db-mysql-0.mysql.$namespace:3306' - retry-time: 1  retries: 1755, Error_code: 2005
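
For context, a minimal way to confirm what this log shows, i.e. that the promoted node still has the old replication channel configured and retrying; the pod/container names and the credential variable below are assumptions, not taken from the chart:

# Sketch: inspect the stale replication channel on the promoted master (db-1).
# Pod/container names and the root credential variable are assumptions.
kubectl exec $release-mysqlcluster-db-1 -c mysql -- \
  mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "SHOW SLAVE STATUS\G" \
  | grep -E 'Master_Host|Slave_IO_Running|Last_IO_Error'
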
@baurmatt
Contributor

Seeing more or less the same in our tests. We're testing how we're going to handle node drains. While testing this, I've realized that no matter what kind of persistence we configure (hostPath, local path PVC, SDS PVC, emptyDir), the cluster is always broken in Orchestrator after the drain. The MySQL pods get recreated correctly on another node, but it seems like the cluster isn't correctly set back to a fully functional state.

@michaellzc
Author

> Seeing more or less the same in our tests. We're testing how we're going to handle node drains. While testing this, I've realized that no matter what kind of persistence we configure (hostPath, local path PVC, SDS PVC, emptyDir), the cluster is always broken in Orchestrator after the drain. The MySQL pods get recreated correctly on another node, but it seems like the cluster isn't correctly set back to a fully functional state.

We then tried setting ApplyMySQLPromotionAfterMasterFailover: true to work around the issue, but it made things worse: if there are concurrent master node failures, it results in split-brain. The only recovery we found is to keep killing the master node until db-0 becomes master again, then scale down from 3 to 1 and scale up again. This is the kind of thing we expect the operator to handle for us.
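
For reference, ApplyMySQLPromotionAfterMasterFailover is a plain Orchestrator setting; the fragment below only illustrates the flag being toggled in orchestrator.conf.json and is not copied from the chart's actual configuration:

{
  "ApplyMySQLPromotionAfterMasterFailover": true
}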

@ynnt

ynnt commented Oct 22, 2020

To me it looks like this is not really a problem, but rather a misleading message from the operator/Orchestrator. Replication is still working in async mode, and the master doesn't have to replicate anything at all.

After I reset the new master's replication via Orchestrator, everything becomes normal again.
It looks like this is worth automating inside the operator.
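
The manual equivalent of that reset, run against the promoted master only, would be roughly the following sketch; the pod/container names and the credential variable are assumptions, and this is the SQL behind the Orchestrator reset rather than anything the operator itself runs:

# Sketch: clear the stale replication configuration on the promoted master (db-1).
# Run this against the promoted node only; names and credentials are assumptions.
kubectl exec $release-mysqlcluster-db-1 -c mysql -- \
  mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "STOP SLAVE; RESET SLAVE ALL;"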

@munjalpatel

@ExiaSR @ynnt I am also running into the same issue. Did you ever resolve this?

@cndoit18
Collaborator

cndoit18 commented Jun 9, 2021

I think PR #690 can solve this problem.
cc @munjalpatel @ynnt

@cndoit18
Collaborator

I think I've fixed it; you can try version 0.5.1, so I'll close this for now. If this problem happens again, please reopen.
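
A minimal upgrade sketch, assuming the presslabs Helm chart repository and that the chart version matches the operator release mentioned above; adjust the release name to your setup:

# Sketch: upgrade the operator to the release mentioned above.
# The repo URL, release name, and chart/app version mapping are assumptions.
helm repo add presslabs https://presslabs.github.io/charts
helm repo update
helm upgrade mysql-operator presslabs/mysql-operator --version 0.5.1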
