This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

why is it ORC_AUTO_MASTER_RECOVERY=false ? #933

Closed
jianhaiqing opened this issue Jul 15, 2019 · 8 comments

Comments

@jianhaiqing

Hi Shlomi, I'm trying to test the automated master recovery procedure by killing the master. But according to the hook results and the environment variables below, recovery doesn't happen. How can I troubleshoot this?

  • hook output
current stage: OnFailureDetectionProcesses
ORC_FAILURE_TYPE=DeadMaster
ORC_FAILURE_DESCRIPTION=Master cannot be reached by orchestrator and none of its replicas is replicating
ORC_FAILED_HOST=10.111.211.243
ORC_FAILED_PORT=3307
ORC_FAILURE_CLUSTER=10.111.211.243:3307
ORC_FAILURE_CLUSTER_ALIAS=mysql-3307
ORC_FAILURE_CLUSTER_DOMAIN=
ORC_COUNT_REPLICAS=1
ORC_IS_DOWNTIMED=false
ORC_AUTO_MASTER_RECOVERY=false
ORC_AUTO_INTERMEDIATE_MASTER_RECOVERY=false
ORC_ORCHESTRATOR_HOST=mysql-10-111-21-216
ORC_IS_SUCCESSFUL=false
ORC_LOST_REPLICAS=
ORC_REPLICA_HOSTS=10.111.211.242:3307
ORC_COMMAND=

ORC_SUCCESSOR_HOST=
ORC_SUCCESSOR_PORT=
ORC_SUCCESSOR_ALIAS=
# I print the environment and the topology in the hook scripts.
10.111.211.243:3307      [unknown,invalid,5.7.25-28-log,rw,ROW,>>,GTID]
- 10.111.211.242:3307    [null,nonreplicating,5.7.25-28-log,ro,ROW,>>,GTID]
  + 10.111.211.244:3307  [0s,ok,5.7.25-28-log,ro,ROW,>>,GTID]
  + 10.111.211.244:13307 [0s,ok,5.7.25-28-log,ro,ROW,>>,GTID]
  • filters
curl  http://127.0.0.1:3000/api/automated-recovery-filters | jq
{
  "Code": "OK",
  "Message": "Automated recovery configuration details",
  "Details": {
    "RecoverIntermediateMasterClusterFilters": [
      "_intermediate_master_pattern_"
    ],
    "RecoverMasterClusterFilters": [
      "alias=mysql-3307,alias=mysql-3308"
    ],
    "RecoveryIgnoreHostnameFilters": []
  }
}
  • orchestrator
    3.0.14 f4c69ad05010518da784ce61865e65f0d9e0081c
  • mysql version
    percona server-5.7.25-28, gtid ON
  • global_recovery_disable
select * from global_recovery_disable;
-- nothing output
  • orchestrator.log
[mysql] 2019/07/15 07:57:14 packets.go:36: unexpected EOF
2019-07-15 07:57:14 ERROR invalid connection
2019-07-15 07:57:14 ERROR ReadTopologyInstance(10.111.211.243:3307) show variables like 'maxscale%': invalid connection
[mysql] 2019/07/15 07:57:14 packets.go:36: unexpected EOF
[mysql] 2019/07/15 07:57:14 packets.go:36: unexpected EOF
2019-07-15 07:57:14 WARNING  DiscoverInstance(10.111.211.243:3307) instance is nil in 0.002s (Backend: 0.001s, Instance: 0.000s), error=invalid connection
2019-07-15 07:57:15 WARNING executeCheckAndRecoverFunction: ignoring analysisEntry that has no action plan: FirstTierSlaveFailingToConnectToMaster; key: 10.111.211.242:3307
2019-07-15 07:57:15 INFO executeCheckAndRecoverFunction: proceeding with DeadMaster detection on 10.111.211.243:3307; isActionable?: true; skipProcesses: false
[mysql] 2019/07/15 07:57:15 connection.go:372: invalid connection
[mysql] 2019/07/15 07:57:15 connection.go:372: invalid connection
2019-07-15 07:57:15 INFO topology_recovery: detected DeadMaster failure on 10.111.211.243:3307
2019-07-15 07:57:15 INFO topology_recovery: Running 1 OnFailureDetectionProcesses hooks
2019-07-15 07:57:15 INFO topology_recovery: Running OnFailureDetectionProcesses hook 1 of 1: bash /usr/local/ops/mysql/monitor/orchestrator_failover.sh 'OnFailureDetectionProcesses' >> /usr/local/ops/mysql/monitor/failover.log
2019-07-15 07:57:15 INFO CommandRun(bash /usr/local/ops/mysql/monitor/orchestrator_failover.sh 'OnFailureDetectionProcesses' >> /usr/local/ops/mysql/monitor/failover.log,[])
2019-07-15 07:57:15 ERROR dial tcp 10.111.211.243:3307: connect: connection refused
[mysql] 2019/07/15 07:57:15 connection.go:372: invalid connection
2019-07-15 07:57:15 INFO auditType:emergently-read-topology-instance instance:10.111.211.243:3307 cluster:10.111.211.243:3307 message:FirstTierSlaveFailingToConnectToMaster
2019-07-15 07:57:15 INFO CommandRun/running: bash /tmp/orchestrator-process-cmd-314267411
2019-07-15 07:57:15 INFO CommandRun:

2019-07-15 07:57:15 INFO CommandRun successful. exit status 0
2019-07-15 07:57:15 INFO topology_recovery: Completed OnFailureDetectionProcesses hook 1 of 1 in 122.234593ms
2019-07-15 07:57:15 INFO topology_recovery: done running OnFailureDetectionProcesses hooks
2019-07-15 07:57:15 INFO executeCheckAndRecoverFunction: proceeding with DeadMaster recovery on 10.111.211.243:3307; isRecoverable?: true; skipProcesses: false
2019-07-15 07:57:16 INFO executeCheckAndRecoverFunction: proceeding with DeadMaster detection on 10.111.211.243:3307; isActionable?: true; skipProcesses: false
2019-07-15 07:57:16 INFO checkAndExecuteFailureDetectionProcesses: could not register DeadMaster detection on 10.111.211.243:3307
2019-07-15 07:57:16 INFO executeCheckAndRecoverFunction: proceeding with DeadMaster recovery on 10.111.211.243:3307; isRecoverable?: true; skipProcesses: false
2019-07-15 07:57:17 INFO executeCheckAndRecoverFunction: proceeding with DeadMaster detection on 10.111.211.243:3307; isActionable?: true; skipProcesses: false
2019-07-15 07:57:17 INFO executeCheckAndRecoverFunction: proceeding with DeadMaster recovery on 10.111.211.243:3307; isRecoverable?: true; skipProcesses: false
2019-07-15 07:57:17 ERROR dial tcp 10.111.211.243:3307: connect: connection refused
@shlomi-noach
Collaborator

Please replace

    "RecoverMasterClusterFilters": [
      "alias=mysql-3307,alias=mysql-3308"
    ],

with

    "RecoverMasterClusterFilters": [
      "alias=mysql-3307",
      "alias=mysql-3308"
    ],

Does that kick off the failover?
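For context on why the single comma-joined entry fails: each entry in `RecoverMasterClusterFilters` is evaluated on its own, either as an `alias=` comparison against the cluster's alias or as a pattern against the cluster name. The sketch below is a simplified, illustrative model of that matching logic (the function name and exact semantics are mine, not orchestrator's actual code):

```python
import re

def cluster_matches_filter(cluster_name, cluster_alias, filters):
    """Simplified model of RecoverMasterClusterFilters matching:
    an 'alias=' entry must equal the cluster alias exactly; any
    other entry is a regex tried against the cluster name.
    (Illustrative only, not orchestrator's actual code.)"""
    for entry in filters:
        if entry.startswith("alias="):
            if entry[len("alias="):] == cluster_alias:
                return True
        elif re.search(entry, cluster_name):
            return True
    return False

# A single comma-joined entry is one literal alias that never matches:
broken = ["alias=mysql-3307,alias=mysql-3308"]
fixed  = ["alias=mysql-3307", "alias=mysql-3308"]
print(cluster_matches_filter("10.111.211.243:3307", "mysql-3307", broken))  # False
print(cluster_matches_filter("10.111.211.243:3307", "mysql-3307", fixed))   # True
```

Under this model, "alias=mysql-3307,alias=mysql-3308" is compared as one alias string and matches nothing, so the cluster is never eligible for automated recovery.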

@jianhaiqing
Author

Wonderful, it works. Thank you.
One more thing: ORC_FAILURE_CLUSTER_DOMAIN is empty during the whole process, including OnFailureDetectionProcesses, PreFailoverProcesses, PostMasterFailoverProcesses, and PostFailoverProcesses. Is that by design, or a bug?

@shlomi-noach
Collaborator

The "cluster domain" value depends on the DetectClusterDomainQuery configuration. Do you have one set up?

@jianhaiqing
Author

Yes

  • config.conf
"DetectClusterDomainQuery": "select ifnull(max(cluster_domain), '') as cluster_domain from meta.cluster where anchor=1",
  • ddl
CREATE TABLE `cluster` (
  `anchor` tinyint(4) NOT NULL,
  `cluster_name` varchar(128) CHARACTER SET ascii NOT NULL DEFAULT '',
  `cluster_domain` varchar(128) CHARACTER SET ascii NOT NULL DEFAULT '',
  PRIMARY KEY (`anchor`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

select ifnull(max(cluster_domain), '') as cluster_domain from meta.cluster where anchor=1;
+----------------------------+
| cluster_domain             |
+----------------------------+
| mysql-test-3307.hz.cvte.cn |
+----------------------------+

@shlomi-noach
Collaborator

Can you please run select cluster_name, domain_name from database_instance on the orchestrator backend database?

@jianhaiqing
Author

  • There is no domain_name column in database_instance.
  • More detail: when graceful-master-takeover is run, ORC_FAILURE_CLUSTER_DOMAIN is correct.
jianhaiqing@10.111.21.216:33307 [orchestrator]> select * from cluster_domain_name;
+----------------------+----------------------------+---------------------+
| cluster_name         | domain_name                | last_registered     |
+----------------------+----------------------------+---------------------+
| 10.111.211.242:3308  | mysql-test-3308.hz.cvte.cn | 2019-07-16 07:49:19 |
| 10.111.211.243:33306 | mysql-ha-test.hz.cvte.cn   | 2019-07-16 07:49:23 |
| 10.111.211.244:3307  | mysql-test-3307.hz.cvte.cn | 2019-07-16 07:49:19 |
+----------------------+----------------------------+---------------------+
3 rows in set (0.00 sec)

jianhaiqing@10.111.21.216:33307 [orchestrator]> select cluster_name from database_instance;
+----------------------+
| cluster_name         |
+----------------------+
| 10.111.211.242:3308  |
| 10.111.211.242:3308  |
| 10.111.211.243:33306 |
| 10.111.211.243:33306 |
| 10.111.211.243:33306 |
| 10.111.211.244:3307  |
| 10.111.211.244:3307  |
| 10.111.211.244:3307  |
| 10.111.211.244:3307  |
+----------------------+
9 rows in set (0.00 sec)

jianhaiqing@10.111.21.216:33307 [orchestrator]> select TABLE_SCHEMA,TABLE_NAME,COLUMN_NAME from information_schema.columns where column_name like '%domain%';
+--------------+---------------------+-------------+
| TABLE_SCHEMA | TABLE_NAME          | COLUMN_NAME |
+--------------+---------------------+-------------+
| orchestrator | cluster_domain_name | domain_name |
+--------------+---------------------+-------------+

@mostafahussein

mostafahussein commented Sep 8, 2019

Hello,
I hope this is the correct place to add my question, as the same variable pointed me to this issue.
I am using this config file as-is, with a few modifications:

MySQLTopologyUser and MySQLTopologyPassword set to the root password (just for testing purposes, locally),

and the filter below points to three MariaDB slaves; each name is the alias of a container inside the Docker network.

"RecoverMasterClusterFilters": [
    "alias=mariadb-slave-01",
    "alias=mariadb-slave-02",
    "alias=mariadb-slave-03"
  ]

Notes:

  • I have GTID enabled (the globe icon appears on each slave, but not on the master)
  • I have Pseudo-GTID enabled instead of GTID, as it will be easier to automate in my case
  • I have no cluster table in my case (if I recall, it was an optional step, so I skipped it)
  • I can run a manual/forced recovery through the dashboard; when I press the button, it promotes a slave to master.

My questions are:
1. How do I automate this recovery process? It seems to be disabled in my case; according to the dashboard, "Automated master recovery for this cluster DISABLED". (I have checked the documentation, but I didn't get it.)
2. When the master comes back online, what should I do to make it the master again, or at least rejoin the same cluster as a slave? (I mean, so as not to lose the instance count.)

Can you guide me on the missing points?
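On question 1, one thing worth checking (an observation based on the fix earlier in this thread, not a confirmed diagnosis): RecoverMasterClusterFilters is matched against the cluster, i.e. its name or alias, not against individual replica hostnames, so entries naming the three slave containers may never match the cluster itself, leaving recovery disabled. As a quick test, and assuming you are comfortable enabling recovery for every cluster, a catch-all filter can rule the matching out:

```json
"RecoverMasterClusterFilters": [
    "*"
]
```

If recovery becomes enabled with the catch-all, the original alias entries simply didn't match your cluster's alias.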

Update 1
I have updated the entrypoint to auto-discover, and added the cluster pattern to enable automated master recovery.

Update 2
After finding an issue/question similar to my second question, I found these comments: 1, 2.

Since I moved to Pseudo-GTID instead of GTID, what should I do to demote the old master to a slave?
If we are meant to use CHANGE MASTER, what is the suitable procedure here to prevent data loss? Any advice?
Is this configurable through hooks (I mean, converting the old master to a slave once it comes back online)?

@jianhaiqing
Author

jianhaiqing commented Sep 19, 2019

The empty ORC_FAILURE_CLUSTER_DOMAIN seems to have been fixed in PR #970,
so I want to close this issue.
