-
Notifications
You must be signed in to change notification settings - Fork 981
AWS Aurora Monitor
AWS Aurora is a proprietary solution developed by AWS on top on MySQL.
It differs from MySQL on several aspects, but for the purpose of this document we will focus only on one aspect: replication.
In Aurora, a writer/master can have multiple replicas: replicas aren't updated using MySQL protocol replication, but using a proprietary mechanism developed by AWS, where changes are replicated at storage layer.
Without going into the technical details, what is important to highlight is that:
- replication cannot be monitored using queries normally used with MySQL protocol replication like
SHOW SLAVE STATUS
- replication is normally in the order of few milliseconds, and rarely above seconds
- Aurora exports metrics and status using new tables, where is possible to identify the current writer/master, the replicas, and their replication lag
- Aurora supports auto-provisioning of new replicas, for example in case of failure of existing replicas
- Aurora supports automatic-failover
ProxySQL introduces new monitor algorithm to monitor the status of AWS Aurora clusters and is able to detect:
- current topology : writer/master and replicas
- failovers
- replication lag of all the replicas
- failed replicas
- new replicas
In order to monitor AWS Aurora clusters, clusters need to be configured in a new table: mysql_aws_aurora_hostgroups
.
Table structure:
Admin> SHOW CREATE TABLE mysql_aws_aurora_hostgroups\G
*************************** 1. row ***************************
table: mysql_aws_aurora_hostgroups
Create Table: CREATE TABLE mysql_aws_aurora_hostgroups (
writer_hostgroup INT CHECK (writer_hostgroup>=0) NOT NULL PRIMARY KEY,
reader_hostgroup INT NOT NULL CHECK (reader_hostgroup<>writer_hostgroup AND reader_hostgroup>0),
active INT CHECK (active IN (0,1)) NOT NULL DEFAULT 1,
aurora_port INT NOT NUlL DEFAULT 3306,
domain_name VARCHAR NOT NULL CHECK (SUBSTR(domain_name,1,1) = '.'),
max_lag_ms INT NOT NULL CHECK (max_lag_ms>= 10 AND max_lag_ms <= 600000) DEFAULT 600000,
check_interval_ms INT NOT NULL CHECK (check_interval_ms >= 100 AND check_interval_ms <= 600000) DEFAULT 1000,
check_timeout_ms INT NOT NULL CHECK (check_timeout_ms >= 80 AND check_timeout_ms <= 3000) DEFAULT 800,
writer_is_also_reader INT CHECK (writer_is_also_reader IN (0,1)) NOT NULL DEFAULT 0,
new_reader_weight INT CHECK (new_reader_weight >= 0 AND new_reader_weight <=10000000) NOT NULL DEFAULT 1,
add_lag_ms INT NOT NULL CHECK (add_lag_ms >= 0 AND add_lag_ms <= 600000) DEFAULT 30,
min_lag_ms INT NOT NULL CHECK (min_lag_ms >= 0 AND min_lag_ms <= 600000) DEFAULT 30,
lag_num_checks INT NOT NULL CHECK (lag_num_checks >= 1 AND lag_num_checks <= 16) DEFAULT 1,
comment VARCHAR,
UNIQUE (reader_hostgroup))
Multiple fields define an AWS Aurora Cluster:
-
writer_hostgroup
is the hostgroup that will be assigned to the writer/master . This is also the primary key of the table, that means that everywriter_hostgroup
represents a single cluster. -
reader_hostgroup
is the hostgroup that will be assigned to all the replicas in a given cluster . There is anUNIQUE
constraint on this column to ensure that multiple clusters aren't configured with the samereader_hostgroup
-
active
allows to toggle on and off the monitoring of a cluster. For example, it is possible to configure a cluster but not monitor it (yet) -
aurora_port
is the port that Aurora uses to accept connection. All nodes in a cluster use the same port. By default this is port 3306, that is the same default port of MySQL -
domain_name
: in AWS Aurora internal table only hostnames are listed. Although ProxySQL needs the fully qualified domain name inmysql_servers
. When a new server is added inmysql_servers
, ProxySQL will adddomain_name
to the server hostname to obtain the FQDN . For example, if hostname isserverA
anddomain_name
is.abcde.us-east-1.rds.amazonaws.com
, inmysql_servers
table the hostnameserverA.abcde.us-east-1.rds.amazonaws.com
will be added. Note thatdomain_name
must start with a dot, and tablemysql_aws_aurora_hostgroups
enforce this constraint. -
max_lag_ms
: replicas that have a replication greater or equal thanmax_lag_ms
milliseconds are automatically disabled from the cluster until their lag return within the configured threshold. Also note that from empiric results (not documented by AWS) a failed node will have a replication lag of 600000ms , that is also the maximum value ofmax_lag_ms
: failed node will be automatically disabled -
check_interval_ms
defines how frequently ProxySQL's monitor will check the status of the cluster -
check_timeout_ms
defines the timeout for a check -
writer_is_also_reader
: this setting defines if writer/master will also be configured as part of the reader hostgroup -
new_reader_weight
: this is the weight that will be assigned to a new auto-discovered replicas -
add_lag_ms
,min_lag_ms
andlag_num_checks
: because replication lag is pulled at regular interval as defined incheck_interval_ms
and not in real time, the last measured replication lag could not be accurate. These 3 variables perform some tweaks on the last value of replication lag:- if replication lag is less than
min_lag_ms
, assume a read ofmin_lag_ms
- to any value read, add
add_lag_ms
- if
lag_num_checks
is greater than 1, the current replication lag is computed as the highest of the lastlag_num_checks
reads
- if replication lag is less than
-
comment
is any text that an administrator can assign to an Aurora cluster, for example to note what the cluster is used for
Following the same convention of other configuration tables, the configuration currently loaded at runtime is available in table runtime_mysql_aws_aurora_hostgroups
, that is the runtime configuration of mysql_aws_aurora_hostgroups
.
Monitor module introduces 3 new tables to
Admin> SHOW TABLES FROM monitor;
+--------------------------------------+
| tables |
+--------------------------------------+
| mysql_server_aws_aurora_check_status |
| mysql_server_aws_aurora_failovers |
| mysql_server_aws_aurora_log |
...
CREATE TABLE mysql_server_aws_aurora_log (
hostname VARCHAR NOT NULL,
port INT NOT NULL DEFAULT 3306,
time_start_us INT NOT NULL DEFAULT 0,
success_time_us INT DEFAULT 0,
error VARCHAR,
SERVER_ID VARCHAR NOT NULL DEFAULT '',
SESSION_ID VARCHAR,
LAST_UPDATE_TIMESTAMP VARCHAR,
replica_lag_in_milliseconds INT NOT NULL DEFAULT 0,
estimated_lag_ms INT NOT NULL DEFAULT 0,
CPU INT NOT NULL DEFAULT 0,
PRIMARY KEY (hostname, port, time_start_us, SERVER_ID))
When an Aurora cluster is enabled, Monitor connects to all the hosts in the cluster (in rotation), querying one node every check_interval_ms
milliseconds. Monitor will retrieve cluster information from Aurora table information_schema.replica_host_status
.
ProxySQL's table monitor
.mysql_server_aws_aurora_log
stores the information retried from information_schema.replica_host_status
(SERVER_ID
, SESSION_ID
, LAST_UPDATE_TIMESTAMP
, replica_lag_in_milliseconds
) , together with additional information:
-
hostname
/port
: the node where the check was performed -
time_stamp_us
: when the check was performed -
success_time_us
: if the check was successful, how long it took -
error
: if the check was not successful, what was the error message -
estimated_lag_ms
: estimated lag in based on readreplica_lag_in_milliseconds
and applying configuredadd_lag_ms
,min_lag_ms
andlag_num_checks
Note that ProxySQL determines which one is the writer/master based on SESSION_ID
: if the value is MASTER_SESSION_ID
, the server specific in SERVER_ID
is the master. All the other servers with SESSION_ID
not equal to MASTER_SESSION_ID
are replicas.
CREATE TABLE mysql_server_aws_aurora_check_status (
writer_hostgroup INT NOT NULL,
hostname VARCHAR NOT NULL,
port INT NOT NULL DEFAULT 3306,
last_checked_at VARCHAR,
checks_tot INT NOT NULL DEFAULT 0,
checks_ok INT NOT NULL DEFAULT 0,
last_error VARCHAR,
PRIMARY KEY (writer_hostgroup, hostname, port))
While table mysql_server_aws_aurora_log
is a log table, table mysql_server_aws_aurora_check_status
provides some aggregate information.
-
writer_hostgroup
+hostname
+port
defines a specific server -
last_checked_at
shows the last time this server was checked -
checks_tot
counts the number of checks executed against this server -
checks_ok
counts the number of successful checks -
last_error
shows the last error returned by this server (if any)
CREATE TABLE mysql_server_aws_aurora_failovers (
writer_hostgroup INT NOT NULL,
hostname VARCHAR NOT NULL,
inserted_at VARCHAR NOT NULL)
This table records all the detected failovers.
Every time a failover happens, a new row is inserted into table mysql_server_aws_aurora_failovers
:
-
writer_hostgroup
represents the Aurora cluster -
hostname
shows the newly promoted master -
inserted_at
is the timestamp of the failover
Since ProxySQL is able to monitor the replication lag of replicas in milliseconds, it is able to determine to which replicas reads should be sent to, that is the replicas that are up to date within the configured thresholds.
Most of the configuration is performed in table mysql_aws_aurora_hostgroups
, specifically on columns max_lag_ms
, add_lag_ms
, min_lag_ms
, and lag_num_checks
.
If a query id configured to be sent to a replica, the suitable replicas are determined based on these cluster settings.
Furthermore, the client is able to specific a more restrictive threshold for max_lag_ms
sending a comment with setting max_lag_ms
. For example, the Aurora cluster could be configured with mysql_aws_aurora_hostgroupsmax_lag_ms=1000
, but a client could send a query like:
SELECT /* max_lag_ms=20 */ ...
In this case, only replicas with replication lag less or equal than 20ms will be considered to process the query.
If no replicas are suitable to execute the queries (for example, if none has a lass less than 20ms), proxysql is able to hold the requests for long enough to then be safe to execute the query.
For example, if max_lag_ms
specified by the client is 20, but the replica has a lag of 50ms, proxysql can wait 30ms before executing the query on the replica.
If mysql_aws_aurora_hostgroups.writer_is_also_reader
is enabled, then the writer is also configured in the reader hostgroup.
This creates some interesting consequences if clients specify a very small value for max_lag_ms
: because the writer has always a lag of 0 , ProxySQL may sends traffic to it instead of waiting for replicas to catch up.
Similarly, it is possible that a user do not want to use the writer no matter the lag of the replicas if replicas exist.
ProxySQL has a new global variable that control this behavior: mysql-aurora_max_lag_ms_only_read_from_replicas
The variable defines the minimum number of replicas that need to be present in order to ignore the writer in the reader hostgroup, if the client specify a value of max_lag_ms
in the query it sends.