This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

why is it ORC_AUTO_MASTER_RECOVERY=false ? #933

Closed
jianhaiqing opened this issue Jul 15, 2019 · 8 comments

Comments

@jianhaiqing

Hi Shlomi, I'm trying to test the automated master recovery procedure by killing the master. But according to the hook results and the environment variables below, recovery doesn't happen. How can I troubleshoot this?

  • hook output
current stage: OnFailureDetectionProcesses
ORC_FAILURE_TYPE=DeadMaster
ORC_FAILURE_DESCRIPTION=Master cannot be reached by orchestrator and none of its replicas is replicating
ORC_FAILED_HOST=10.111.211.243
ORC_FAILED_PORT=3307
ORC_FAILURE_CLUSTER=10.111.211.243:3307
ORC_FAILURE_CLUSTER_ALIAS=mysql-3307
ORC_FAILURE_CLUSTER_DOMAIN=
ORC_COUNT_REPLICAS=1
ORC_IS_DOWNTIMED=false
ORC_AUTO_MASTER_RECOVERY=false
ORC_AUTO_INTERMEDIATE_MASTER_RECOVERY=false
ORC_ORCHESTRATOR_HOST=mysql-10-111-21-216
ORC_IS_SUCCESSFUL=false
ORC_LOST_REPLICAS=
ORC_REPLICA_HOSTS=10.111.211.242:3307
ORC_COMMAND=

ORC_SUCCESSOR_HOST=
ORC_SUCCESSOR_PORT=
ORC_SUCCESSOR_ALIAS=
# I print the environment and the topology in the hook scripts.
10.111.211.243:3307      [unknown,invalid,5.7.25-28-log,rw,ROW,>>,GTID]
- 10.111.211.242:3307    [null,nonreplicating,5.7.25-28-log,ro,ROW,>>,GTID]
  + 10.111.211.244:3307  [0s,ok,5.7.25-28-log,ro,ROW,>>,GTID]
  + 10.111.211.244:13307 [0s,ok,5.7.25-28-log,ro,ROW,>>,GTID]
  • filters
curl  http://127.0.0.1:3000/api/automated-recovery-filters | jq
{
  "Code": "OK",
  "Message": "Automated recovery configuration details",
  "Details": {
    "RecoverIntermediateMasterClusterFilters": [
      "_intermediate_master_pattern_"
    ],
    "RecoverMasterClusterFilters": [
      "alias=mysql-3307,alias=mysql-3308"
    ],
    "RecoveryIgnoreHostnameFilters": []
  }
}
  • orchestrator
    3.0.14 f4c69ad05010518da784ce61865e65f0d9e0081c
  • mysql version
    percona server-5.7.25-28, gtid ON
  • global_recovery_disable
select * from global_recovery_disable;
-- nothing output
  • orchestrator.log
[mysql] 2019/07/15 07:57:14 packets.go:36: unexpected EOF
2019-07-15 07:57:14 ERROR invalid connection
2019-07-15 07:57:14 ERROR ReadTopologyInstance(10.111.211.243:3307) show variables like 'maxscale%': invalid connection
[mysql] 2019/07/15 07:57:14 packets.go:36: unexpected EOF
[mysql] 2019/07/15 07:57:14 packets.go:36: unexpected EOF
2019-07-15 07:57:14 WARNING  DiscoverInstance(10.111.211.243:3307) instance is nil in 0.002s (Backend: 0.001s, Instance: 0.000s), error=invalid connection
2019-07-15 07:57:15 WARNING executeCheckAndRecoverFunction: ignoring analysisEntry that has no action plan: FirstTierSlaveFailingToConnectToMaster; key: 10.111.211.242:3307
2019-07-15 07:57:15 INFO executeCheckAndRecoverFunction: proceeding with DeadMaster detection on 10.111.211.243:3307; isActionable?: true; skipProcesses: false
[mysql] 2019/07/15 07:57:15 connection.go:372: invalid connection
[mysql] 2019/07/15 07:57:15 connection.go:372: invalid connection
2019-07-15 07:57:15 INFO topology_recovery: detected DeadMaster failure on 10.111.211.243:3307
2019-07-15 07:57:15 INFO topology_recovery: Running 1 OnFailureDetectionProcesses hooks
2019-07-15 07:57:15 INFO topology_recovery: Running OnFailureDetectionProcesses hook 1 of 1: bash /usr/local/ops/mysql/monitor/orchestrator_failover.sh 'OnFailureDetectionProcesses' >> /usr/local/ops/mysql/monitor/failover.log
2019-07-15 07:57:15 INFO CommandRun(bash /usr/local/ops/mysql/monitor/orchestrator_failover.sh 'OnFailureDetectionProcesses' >> /usr/local/ops/mysql/monitor/failover.log,[])
2019-07-15 07:57:15 ERROR dial tcp 10.111.211.243:3307: connect: connection refused
[mysql] 2019/07/15 07:57:15 connection.go:372: invalid connection
2019-07-15 07:57:15 INFO auditType:emergently-read-topology-instance instance:10.111.211.243:3307 cluster:10.111.211.243:3307 message:FirstTierSlaveFailingToConnectToMaster
2019-07-15 07:57:15 INFO CommandRun/running: bash /tmp/orchestrator-process-cmd-314267411
2019-07-15 07:57:15 INFO CommandRun:

2019-07-15 07:57:15 INFO CommandRun successful. exit status 0
2019-07-15 07:57:15 INFO topology_recovery: Completed OnFailureDetectionProcesses hook 1 of 1 in 122.234593ms
2019-07-15 07:57:15 INFO topology_recovery: done running OnFailureDetectionProcesses hooks
2019-07-15 07:57:15 INFO executeCheckAndRecoverFunction: proceeding with DeadMaster recovery on 10.111.211.243:3307; isRecoverable?: true; skipProcesses: false
2019-07-15 07:57:16 INFO executeCheckAndRecoverFunction: proceeding with DeadMaster detection on 10.111.211.243:3307; isActionable?: true; skipProcesses: false
2019-07-15 07:57:16 INFO checkAndExecuteFailureDetectionProcesses: could not register DeadMaster detection on 10.111.211.243:3307
2019-07-15 07:57:16 INFO executeCheckAndRecoverFunction: proceeding with DeadMaster recovery on 10.111.211.243:3307; isRecoverable?: true; skipProcesses: false
2019-07-15 07:57:17 INFO executeCheckAndRecoverFunction: proceeding with DeadMaster detection on 10.111.211.243:3307; isActionable?: true; skipProcesses: false
2019-07-15 07:57:17 INFO executeCheckAndRecoverFunction: proceeding with DeadMaster recovery on 10.111.211.243:3307; isRecoverable?: true; skipProcesses: false
2019-07-15 07:57:17 ERROR dial tcp 10.111.211.243:3307: connect: connection refused
@shlomi-noach
Collaborator

Please replace

    "RecoverMasterClusterFilters": [
      "alias=mysql-3307,alias=mysql-3308"
    ],

with

    "RecoverMasterClusterFilters": [
      "alias=mysql-3307",
      "alias=mysql-3308"
    ],

Does that kick off the failover?
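For context on why the single comma-joined entry fails: each entry in `RecoverMasterClusterFilters` is evaluated on its own, either as an `alias=` comparison against the cluster's alias or as a pattern against the cluster name. The sketch below is a simplified, illustrative model of that matching logic (the function name and exact semantics are mine, not orchestrator's actual code):

```python
import re

def cluster_matches_filter(cluster_name, cluster_alias, filters):
    """Simplified model of RecoverMasterClusterFilters matching:
    an 'alias=' entry must equal the cluster alias exactly; any
    other entry is a regex tried against the cluster name.
    (Illustrative only, not orchestrator's actual code.)"""
    for entry in filters:
        if entry.startswith("alias="):
            if entry[len("alias="):] == cluster_alias:
                return True
        elif re.search(entry, cluster_name):
            return True
    return False

# A single comma-joined entry is one literal alias that never matches:
broken = ["alias=mysql-3307,alias=mysql-3308"]
fixed  = ["alias=mysql-3307", "alias=mysql-3308"]
print(cluster_matches_filter("10.111.211.243:3307", "mysql-3307", broken))  # False
print(cluster_matches_filter("10.111.211.243:3307", "mysql-3307", fixed))   # True
```

Under this model, "alias=mysql-3307,alias=mysql-3308" is compared as one alias string and matches nothing, so the cluster is never eligible for automated recovery.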

@jianhaiqing
Author

Wonderful, it works. Thank you.
One more thing: ORC_FAILURE_CLUSTER_DOMAIN is empty during the whole process, including OnFailureDetectionProcesses, PreFailoverProcesses, PostMasterFailoverProcesses, and PostFailoverProcesses. Is that by design, or a bug?

@shlomi-noach
Collaborator

The "cluster domain" value depends on the DetectClusterDomainQuery configuration. Do you have one set up?

@jianhaiqing
Author

Yes

  • config.conf
"DetectClusterDomainQuery": "select ifnull(max(cluster_domain), '') as cluster_domain from meta.cluster where anchor=1",
  • ddl
CREATE TABLE `cluster` (
  `anchor` tinyint(4) NOT NULL,
  `cluster_name` varchar(128) CHARACTER SET ascii NOT NULL DEFAULT '',
  `cluster_domain` varchar(128) CHARACTER SET ascii NOT NULL DEFAULT '',
  PRIMARY KEY (`anchor`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

select ifnull(max(cluster_domain), '') as cluster_domain from meta.cluster where anchor=1;
+----------------------------+
| cluster_domain             |
+----------------------------+
| mysql-test-3307.hz.cvte.cn |
+----------------------------+

@shlomi-noach
Collaborator

Can you please run select cluster_name, domain_name from database_instance on the orchestrator backend database?

@jianhaiqing
Author

  • There is no domain_name column in database_instance.
  • More detail: when graceful-master-takeover is run, ORC_FAILURE_CLUSTER_DOMAIN is correct.
jianhaiqing@10.111.21.216:33307 [orchestrator]> select * from cluster_domain_name;
+----------------------+----------------------------+---------------------+
| cluster_name         | domain_name                | last_registered     |
+----------------------+----------------------------+---------------------+
| 10.111.211.242:3308  | mysql-test-3308.hz.cvte.cn | 2019-07-16 07:49:19 |
| 10.111.211.243:33306 | mysql-ha-test.hz.cvte.cn   | 2019-07-16 07:49:23 |
| 10.111.211.244:3307  | mysql-test-3307.hz.cvte.cn | 2019-07-16 07:49:19 |
+----------------------+----------------------------+---------------------+
3 rows in set (0.00 sec)

jianhaiqing@10.111.21.216:33307 [orchestrator]> select cluster_name from database_instance;
+----------------------+
| cluster_name         |
+----------------------+
| 10.111.211.242:3308  |
| 10.111.211.242:3308  |
| 10.111.211.243:33306 |
| 10.111.211.243:33306 |
| 10.111.211.243:33306 |
| 10.111.211.244:3307  |
| 10.111.211.244:3307  |
| 10.111.211.244:3307  |
| 10.111.211.244:3307  |
+----------------------+
9 rows in set (0.00 sec)

jianhaiqing@10.111.21.216:33307 [orchestrator]> select TABLE_SCHEMA,TABLE_NAME,COLUMN_NAME from information_schema.columns where column_name like '%domain%';
+--------------+---------------------+-------------+
| TABLE_SCHEMA | TABLE_NAME          | COLUMN_NAME |
+--------------+---------------------+-------------+
| orchestrator | cluster_domain_name | domain_name |
+--------------+---------------------+-------------+

@mostafahussein

mostafahussein commented Sep 8, 2019

Hello,
I hope this is the correct place to add my question, as the same variable pointed me to this issue.
I am using this config file as-is, with a few modifications:

MySQLTopologyUser and MySQLTopologyPassword set to the root password (just for testing purposes, locally),

and the filter below points to three MariaDB slaves; each name is the alias of a container inside the Docker network.

"RecoverMasterClusterFilters": [
    "alias=mariadb-slave-01",
    "alias=mariadb-slave-02",
    "alias=mariadb-slave-03"
  ]

Notes:

  • I have GTID enabled (the globe icon appears on each slave, but not on the master)
  • I have Pseudo-GTID enabled instead of GTID, as it will be easier to automate in my case
  • I have no cluster table in my case (if I recall, it was an optional step, so I skipped it)
  • I can run a manual/forced recovery through the dashboard; when I press the button, it promotes a slave to master.

My questions are:
1. How do I automate this recovery process? It seems to be disabled in my case; according to the dashboard, "Automated master recovery for this cluster DISABLED". (I have checked the documentation, but I didn't get it.)
2. When the master comes back online, what should I do to make it the master again, or at least rejoin the same cluster as a slave? (I mean, so as not to lose the instance count.)

Can you guide me on the missing points?
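On question 1, one thing worth checking (an observation based on the fix earlier in this thread, not a confirmed diagnosis): RecoverMasterClusterFilters is matched against the cluster, i.e. its name or alias, not against individual replica hostnames, so entries naming the three slave containers may never match the cluster itself, leaving recovery disabled. As a quick test, and assuming you are comfortable enabling recovery for every cluster, a catch-all filter can rule the matching out:

```json
"RecoverMasterClusterFilters": [
    "*"
]
```

If recovery becomes enabled with the catch-all, the original alias entries simply didn't match your cluster's alias.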

Update 1
I have updated the entrypoint to auto-discover, and added the cluster pattern to enable automated master recovery.

Update 2
After finding an issue/question similar to my second question, I found these comments: 1, 2.

Since I moved to Pseudo-GTID instead of GTID, what should I do to demote the old master to a slave?
If we are meant to use CHANGE MASTER, what is the suitable procedure here to prevent data loss? Any advice?
Is this configurable through hooks (I mean, converting the old master to a slave once it comes back online)?

@jianhaiqing
Author

jianhaiqing commented Sep 19, 2019

The empty ORC_FAILURE_CLUSTER_DOMAIN seems to have been fixed in PR #970,
so I want to close this issue.
