Changes current handling for replication lag in favor of setting lagging servers to SHUNNED state #3533

JavierJF · 2021-07-22T11:41:57Z

This pull request introduces several changes to how lag in 'Group Replication'
is handled.

Old behavior

Servers whose lag was above the threshold set by
'mysql-groupreplication_max_transactions_behind_count' and that had read_only=1
were set to 'OFFLINE' until replication caught up.

New behavior

Servers whose lag is above the threshold set by
'mysql-groupreplication_max_transactions_behind_count' are 'SHUNNED', depending
on the value of the newly introduced variable:
'mysql-monitor_groupreplication_max_transaction_behind_for_read_only'.

This variable has three possible values:

'0': Only servers with read_only=0 are placed as 'SHUNNED'.
'1': Only servers with read_only=1 are placed as 'SHUNNED' (default).
'2': Both servers with read_only=1 and read_only=0 are placed as 'SHUNNED'.
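
The three values above can be read as a small decision rule. The following is a minimal Python sketch of that rule, not the ProxySQL implementation: the function name `should_shun` and its parameters are hypothetical, and the lag threshold is treated as a plain number, as in the description above.

```python
def should_shun(lag: int, threshold: int, read_only: bool, mode: int = 1) -> bool:
    """Model of the shunning rule described above (illustrative only).

    mode stands for the value of
    'mysql-monitor_groupreplication_max_transaction_behind_for_read_only':
      0 -> only servers with read_only=0 are shunned
      1 -> only servers with read_only=1 are shunned (default)
      2 -> servers with either read_only value are shunned
    """
    if mode not in (0, 1, 2):
        raise ValueError("mode must be 0, 1 or 2")
    if lag <= threshold:
        # The server is not above the threshold, so nothing happens.
        return False
    if mode == 0:
        return not read_only
    if mode == 1:
        return read_only
    return True
```

With the default value, a lagging writer (read_only=0) is left alone, while the same lag on a reader leads to 'SHUNNED'.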

In addition to the change in what happens when a server exceeds
'groupreplication_max_transactions_behind_count', it is now also possible to set servers
configured as writers to the 'OFFLINE_SOFT' state while preserving them in the 'writer_hostgroup'.

To achieve this, take a server that is configured as a 'writer', i.e. a server
whose 'hostgroup_id' is the 'writer_hostgroup', set its status to
'OFFLINE_SOFT', and then issue a 'LOAD MYSQL SERVERS TO RUNTIME'. The
server should be preserved in the writer hostgroup, but its status should change
to 'OFFLINE_SOFT'.

Situation description

We have 3 servers, 2 writers and 1 reader, for a MySQL Group Replication
cluster of 3 nodes. The servers are configured in ProxySQL in the following way:

mysql> select * from mysql_servers;
+--------------+-----------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname  | port | gtid_port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+-----------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 3272         | 127.2.1.1 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 3272         | 127.2.1.2 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 3273         | 127.2.1.3 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
+--------------+-----------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
3 rows in set (0.00 sec)

Resulting in the following cluster state in 'runtime_mysql_servers' table in
ProxySQL:

mysql: [Warning] Using a password on the command line interface can be insecure.
+--------------+-----------+------+-----------+--------+-----------------+---------+
| hostgroup_id | hostname  | port | gtid_port | status | max_connections | comment |
+--------------+-----------+------+-----------+--------+-----------------+---------+
| 3272         | 127.2.1.1 | 3306 | 0         | ONLINE | 1000            |         |
| 3273         | 127.2.1.2 | 3306 | 0         | ONLINE | 1000            |         |
| 3273         | 127.2.1.1 | 3306 | 0         | ONLINE | 1000            |         |
| 3273         | 127.2.1.3 | 3306 | 0         | ONLINE | 1000            |         |
| 3272         | 127.2.1.2 | 3306 | 0         | ONLINE | 1000            |         |
+--------------+-----------+------+-----------+--------+-----------------+---------+

Now we want to set the writer '127.2.1.2' to OFFLINE_SOFT, so we simply set it
via ProxySQL Admin:

UPDATE mysql_servers SET status='OFFLINE_SOFT' WHERE hostname='127.2.1.2';

And we load mysql_servers to runtime:

LOAD MYSQL SERVERS TO RUNTIME;

The runtime_mysql_servers table should transition to the following state:

mysql: [Warning] Using a password on the command line interface can be insecure.
+--------------+-----------+------+-----------+--------------+-----------------+---------+
| hostgroup_id | hostname  | port | gtid_port | status       | max_connections | comment |
+--------------+-----------+------+-----------+--------------+-----------------+---------+
| 3272         | 127.2.1.1 | 3306 | 0         | ONLINE       | 1000            |         |
| 3273         | 127.2.1.2 | 3306 | 0         | OFFLINE_SOFT | 1000            |         |
| 3273         | 127.2.1.1 | 3306 | 0         | ONLINE       | 1000            |         |
| 3273         | 127.2.1.3 | 3306 | 0         | ONLINE       | 1000            |         |
| 3272         | 127.2.1.2 | 3306 | 0         | OFFLINE_SOFT | 1000            |         |
+--------------+-----------+------+-----------+--------------+-----------------+---------+

This change is performed without affecting any current transactions
being executed on the server that has been placed as 'OFFLINE_SOFT'. To make the
server operational again, it only needs to be set back to the 'ONLINE' state.
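
The expected transition can also be written down as a check. The sketch below is only a Python model of the two 'runtime_mysql_servers' listings above, with the rows copied from them; the helpers `set_status` and `membership` are hypothetical and do not talk to ProxySQL. It expresses the invariant that the status changes while the hostgroup membership, including the writer hostgroup 3272, stays the same.

```python
# Rows copied from the first 'runtime_mysql_servers' listing (relevant columns only).
runtime_before = [
    {"hostgroup_id": 3272, "hostname": "127.2.1.1", "status": "ONLINE"},
    {"hostgroup_id": 3273, "hostname": "127.2.1.2", "status": "ONLINE"},
    {"hostgroup_id": 3273, "hostname": "127.2.1.1", "status": "ONLINE"},
    {"hostgroup_id": 3273, "hostname": "127.2.1.3", "status": "ONLINE"},
    {"hostgroup_id": 3272, "hostname": "127.2.1.2", "status": "ONLINE"},
]


def set_status(rows, hostname, status):
    """Return a copy of rows in which every entry for hostname has the new status."""
    return [
        dict(row, status=status) if row["hostname"] == hostname else dict(row)
        for row in rows
    ]


def membership(rows):
    """Return the sorted (hostgroup_id, hostname) pairs, ignoring status."""
    return sorted((row["hostgroup_id"], row["hostname"]) for row in rows)


runtime_after = set_status(runtime_before, "127.2.1.2", "OFFLINE_SOFT")

# The server keeps its place in both hostgroups; only its status differs.
assert membership(runtime_after) == membership(runtime_before)
```

A test against a real instance would compare the rows returned by the Admin interface instead of these literals.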


1. Introduced new global variable: 'monitor_groupreplication_max_transaction_behind_for_read_only', that modifies the behavior of 'group_replication_lag'.
2. Improved logic making use of 'MyHGC_find' instead of directly searching 'MyHostGroups' structure.
3. Improved 'group_replication_lag' documentation with new implementation updates.
4. Introduced changes to 'update_group_replication_set_writer' preserving writers placed in 'OFFLINE_SOFT' state.


JavierJF · 2021-07-22T18:34:53Z

Retest this please.


renecannao · 2021-08-12T07:51:23Z

We need to merge this.
@JavierJF : can you please document it?

bskllzh · 2021-08-19T16:53:51Z

@JavierJF @renecannao , I think there are not many scenarios for mgr with multiple master. For single master mode, there are more scenarios for using single master mode. And when the slave server is in a state where the delay exceeds the threshold, proxysql will immediately offline the slave server. I think this is inappropriate, because it will interrupt the business and cause the program to report an error. I submitted a fix PR, set it to OFFLINE_SOFT , and softly released the delay slave server. Please review PR: #3473.


renecannao · 2021-08-25T10:12:57Z

Hi @bskllzh . Thank you for your feedback.
I think I get your point, and I absolutely agree with the problem you are pointing at.
Although, I think OFFLINE_SOFT is not the right approach.
Let me explain.

OFFLINE_SOFT is a configuration state, and a server in this state is configured to not be used for new connections, but not only...
SHUNNED is instead a temporary status from which the server should automatically recover.
In other words, a server shouldn't automatically go from OFFLINE_SOFT to ONLINE , but should automatically go from SHUNNED to ONLINE.

In fact, PR #3473 would conflict with what was said previously: a server in OFFLINE_SOFT should never be returned to ONLINE automatically (and this is now implemented in PR #3533).

> And when the slave server is in a state where the delay exceeds the threshold, proxysql will immediately offline the slave server. I think this is inappropriate, because it will interrupt the business and cause the program to report an error.

This is by design.
All the details are here: #774
We could set status to MYSQL_SERVER_STATUS_SHUNNED instead of MYSQL_SERVER_STATUS_SHUNNED_REPLICATION_LAG .
This will solve the issue of connections being closed immediately, but hostgroup manager automatically tries to bring a server back online from shunned, no matter if replication lag is still present or not: this is why MYSQL_SERVER_STATUS_SHUNNED_REPLICATION_LAG exists and is different than state MYSQL_SERVER_STATUS_SHUNNED .

Thinking about a possible solution, we could implement a mechanism in which a node is first configured as MYSQL_SERVER_STATUS_SHUNNED and then MYSQL_SERVER_STATUS_SHUNNED_REPLICATION_LAG if replication lag doesn't recover quickly.
The state MYSQL_SERVER_STATUS_SHUNNED_REPLICATION_LAG should be set within a short period of time after MYSQL_SERVER_STATUS_SHUNNED because otherwise the node could go ONLINE while it shouldn't.
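
To make the proposal above concrete, here is a minimal Python sketch of the status transitions it describes. It is a simplified model under stated assumptions, not the code in ProxySQL: `next_status`, `checks_shunned` and `grace_checks` are hypothetical names, the grace period is counted in monitor checks rather than time, and recovery to ONLINE is modeled as happening only once lag is gone.

```python
from enum import Enum


class Status(Enum):
    ONLINE = "ONLINE"
    SHUNNED = "SHUNNED"
    SHUNNED_REPLICATION_LAG = "SHUNNED_REPLICATION_LAG"
    OFFLINE_SOFT = "OFFLINE_SOFT"


def next_status(status, lagging, checks_shunned, grace_checks=2):
    """Return (new_status, new_checks_shunned) after one monitor check.

    grace_checks is a hypothetical knob: how many checks a lagging server
    may stay in SHUNNED before it is escalated to SHUNNED_REPLICATION_LAG.
    """
    if status is Status.OFFLINE_SOFT:
        # Configuration state: never changed automatically.
        return status, 0
    if not lagging:
        # Temporary states recover on their own once lag is gone.
        return Status.ONLINE, 0
    if status is Status.ONLINE:
        # First stage: back off without closing existing connections.
        return Status.SHUNNED, 1
    if status is Status.SHUNNED:
        if checks_shunned >= grace_checks:
            # Lag did not recover quickly: escalate.
            return Status.SHUNNED_REPLICATION_LAG, checks_shunned
        return Status.SHUNNED, checks_shunned + 1
    # SHUNNED_REPLICATION_LAG is kept for as long as the server lags.
    return status, checks_shunned
```

A small value of `grace_checks` reflects the point that the escalation should follow shortly after the first shun, since a plain SHUNNED server may otherwise be brought back while it is still lagging.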

bskllzh · 2021-08-25T14:42:38Z

@renecannao, in PR #3473 I added a mgr_replication_lag_status parameter (MGR replication lag flag: true means lagging, false means not lagging) to distinguish whether the server was manually set to the configuration state, or whether the state changed to OFFLINE_SOFT due to the delay of the mysql slave.

When shunning a node due to replication lag in a group replication cluster, we first shun the node as MYSQL_SERVER_STATUS_SHUNNED , then we shun it as MYSQL_SERVER_STATUS_SHUNNED_REPLICATION_LAG . In this way we avoid (for a short time) killing connections on that backend. This backing off from that server can give the server enough time to sync up. See discussion in comments in #3533

renecannao · 2021-08-25T22:06:17Z

@bskllzh , thank you for pointing out the new flag.
I implemented what I suggested in my previous comment. See dd71fcd
What is your feedback on that?

About your comment:

> I think there are not many scenarios for mgr with multiple master. For single master mode, there are more scenarios for using single master mode

Please note that the enhancements in this PR are driven by the needs of a customer that requires multi-writers, the ability to disable a node no matter if it is a writer or a reader (this is why we added a new variable to control this behavior), the ability to prevent configured OFFLINE_SOFT, and to not interfere with the status of the same server in a hostgroup not part of the cluster.
This PR is a combination of enhancements, bug fixes, and new features.

bskllzh · 2021-08-26T15:08:19Z

@renecannao , regarding PR dd71fcd: I think the code is too complicated, and it complicates simple things. For the program, the connection should not be interrupted because of the delay of the mysql slave; it should be left alone until the slave catches up. And it may take a long time for the slave to catch up with the master, not just a short while, for example when a big transaction is involved.

JavierJF added 7 commits July 22, 2021 11:30

Changed setting readonly servers 'OFFLINE' due to replication lag behavior in favor of general server 'SHUNNING'

53bf18b

Prevent servers that has been placed as 'OFFLINE_SOFT' of becoming writers

33aa80c

Introduced a simple way of performing manual testing for 'GROUPREP' via 'SQLite3' server

8a0b872

Improved the documentation for 'group_replication_lag_action'

0850c4d

Fixed compilation with invalid call to renamed function

3873f0b

Added missing parameter 'lag_count' to 'proxy_warning' from 'lag_action_set_server_status'

edc631b

JavierJF added 3 commits July 27, 2021 14:41

Added nullity checks for params for 'lag_action_set_server_status'

459a3f1

Replaced 'TEST_GROUPREP' impl to better match approach followed for 'READONLY'

a6c2246

Fixed 'hostgroup_id' index selection in 'populate_grouprep_table'

1fac83d

JavierJF added 2 commits August 20, 2021 00:07

Fixed removal of servers not belonging to cluster hostgroups by 'group_replication' actions 'set_read_only/set_offline/set_writer'

9f2c883

Improved preservation of 'OFFLINE_SOFT' server state during 'group_replication' update actions

fce6cfb

JavierJF marked this pull request as ready for review September 13, 2021 11:22

JavierJF merged commit 4f94fd3 into v2.x Sep 14, 2021

renecannao deleted the v2.x-gr_replication_lag_action branch April 30, 2022 16:10
