Estimate replica lag when seconds behind from mysqld is unknown #9308

mattlord · 2021-12-01T01:43:11Z

Description

When mysqld tells us that seconds_behind_master is unknown (NULL) — or we get back any other value that cannot be converted to a uint — let's estimate the replication lag seconds value (seconds_behind_master itself is fuzzy) in vttablet's healthcheck stats using the last seen value and the time elapsed since. This gives us the most accurate estimate possible of the replica lag when the replica cannot talk to its source and is constantly retrying and in a state where Slave_IO_Running == Connecting, which we consider to be equivalent to running as reconnects happen regularly anyway e.g. due to slave_net_timeout, and allows us to correctly support the defined -unhealthy_threshold behavior for the replica tablet — preventing us from serving queries from a replica that may have very stale data.

With these changes, the results of the test here seem to be expected and correct:

vitess@fb5036c15bc9:/vt/local$ for i in {1..5}; do
>   echo -n "mysqld says:"
>   command mysql -u root --socket=/vt/vtdataroot/vt_0000000101/mysql.sock -e "show slave status\G" | grep Seconds | cut -d: -f2
>   echo -n "vttablet says: "
>   curl -s localhost:15101/debug/status_details | jq -r '.[1].Value'
>   echo
>   sleep 5
> done
mysqld says: 0
vttablet says: 0s

mysqld says: NULL
vttablet says: 5s

mysqld says: NULL
vttablet says: 10s

mysqld says: NULL
vttablet says: 15s

mysqld says: NULL
vttablet says: 20s


vitess@fb5036c15bc9:/vt/local$ mysql commerce/0@replica -e "select * from customer"
ERROR 1105 (HY000) at line 1: target: commerce.0.replica: no healthy tablet available for 'keyspace:"commerce" shard:"0" tablet_type:REPLICA'

vitess@fb5036c15bc9:/vt/local$ curl -s localhost:15101/debug/status_details
[
 {
  "Key": "Current State",
  "Class": "healthy",
  "Value": "REPLICA: Serving"
 },
 {
  "Key": "Replication Lag",
  "Class": "unhealthy",
  "Value": "25s"
 }
]

vitess@fb5036c15bc9:/vt/local$ mysql -e "show vitess_tablets"
+-------+----------+-------+------------+-------------+------------------+--------------+----------------------+
| Cell  | Keyspace | Shard | TabletType | State       | Alias            | Hostname     | PrimaryTermStartTime |
+-------+----------+-------+------------+-------------+------------------+--------------+----------------------+
| zone1 | commerce | 0     | PRIMARY    | NOT_SERVING | zone1-0000000100 | fb5036c15bc9 | 2021-12-03T01:10:04Z |
| zone1 | commerce | 0     | REPLICA    | NOT_SERVING | zone1-0000000101 | fb5036c15bc9 |                      |
| zone1 | commerce | 0     | RDONLY     | SERVING     | zone1-0000000102 | fb5036c15bc9 |                      |
+-------+----------+-------+------------+-------------+------------------+--------------+----------------------+

Related Issue(s)

Fixes: vtgate serves queries from replicas when its source is unavailable and replication is unhealthy #9307

Checklist

Should this PR be backported? NO
Tests were added
Documentation is not required (I don't think?)

go/mysql/flavor.go

This prevents us from serving queries from the replica that may be beyond the defined -unhealthy_threshold for the tablet. Signed-off-by: Matt Lord <mattalord@gmail.com>

Signed-off-by: Matt Lord <mattalord@gmail.com>

deepthi

Looks fine to me.
I thought @aquarapid wanted us to add more error checking though..

This gives us the most accurate estimate possible -- and seconds_behind_master is already fuzzy anyway -- of the replica lag when the replica cannot talk to its source and is constantly retrying and in a state where Slave_IO_Running == Connecting which we consider to be equivalent to running (see parseReplicationStatus() in flavor.go) as reconnects happen regularly anyway e.g. due to slave_net_timeout. Signed-off-by: Matt Lord <mattalord@gmail.com>

Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord · 2021-12-03T03:10:42Z

@deepthi, while I don't think that this PR as a stand-alone change warrants doc updates... after having gone through this I feel like we could probably use a new docs page on "Using Vitess for Read Scale Out" that documents how to do @replica shard targeting and how the vtgate, vttablet, and mysqld coordinate to prevent serving potentially stale reads beyond a configured window of inconsistency (walking through how that's configured). What does everyone else think? If everyone thinks that's a good idea then I can create a docs PR for that.

shlomi-noach

We do have a heartbeat mechanism running in tablets. It's still not always-on, though I've advocated for it to be, and perhaps this is a good opportunity to revive that discussion.
We use the heartbeat value in the throttler service and in vreplication. I think we should explore reusing it here, too.

mattlord · 2021-12-05T07:11:10Z

We do have a heartbeat mechanism running in tablets. It's still not always-on, though I've advocated for it to be, and perhaps this is a good opportunity to revive that discussion.
We use the heartbeat value in the throttler service and in vreplication. I think we should explore reusing it here, too.

@shlomi-noach We did not add another mechanism here. Are you talking about something other than what I’m using in the PR? It’s the existing tablet healthcheck. If there’s another one then I’m not aware of it. Can you share details about what you’re referring to?

shlomi-noach · 2021-12-05T07:58:13Z

@mattlord I misunderstood and thought there was a new functionality here.

I think we might still want to first consult with our heartbeat mechanism. I don't know whether this PR should change right now, but I think ideally all our lag calculations should use the heartbeat,

Examples:

tablet throttler uses this query to determine lag:

vitess/go/vt/vttablet/tabletserver/throttle/throttler.go

Line 82 in ea40f2d

replicationLagQuery = `select unix_timestamp(now(6))-max(ts/1000000000) as replication_lag from _vt.heartbeat`
tablet throttler is still not enabled by default. It takes a -enable-lag-throttler flag to enable it, in which case it enabled the heartbeat mechanism.
This discussion about whether/not to always enable heartbeat: Suggestion: always enable heartbeat on vttablet #6662 (comment)

Anyway, I think this digresses from the PR. LGTM and perhaps keep the discussion going 🙏

mattlord · 2021-12-05T18:39:15Z

Hi @shlomi-noach !

(Please poke holes in anything that may be wrong.)

Today we have the vttablet healthcheck stream that vtgates subscribe to which helps the gateway determine if the tablet should be serving queries or not. That is exposed to users via:

curl -s localhost:15101/debug/status_details
[
 {
  "Key": "Current State",
  "Class": "healthy",
  "Value": "REPLICA: Serving"
 },
 {
  "Key": "Replication Lag",
  "Class": "healthy",
  "Value": "0s"
 }

(vtgate exposes resulting info in SHOW VITESS_TABLETS and /debug/status.)

The replication lag portion of that — at least the polling which we rely on for estimates when we don't get a value from mysqld — is enabled by default via -enable_replication_reporter and what we do with the data can be configured via -health_check_interval and -unhealthy_threshold in the tablet, -discovery_low_replication_lag and -discovery_high_replication_lag_minimum_serving in the gateway. We're using that here in this PR but correctly handling seconds_behind_master when it's NULL or unknown.

We also have the heartbeat mechanism that IS NOT enabled by default but can be controlled via -heartbeat_enable and -heartbeat_interval in the tablet. This then causes the tablet to store a heartbeat timestamp info to its mysqld in the _vt.heartbeat table.

We then have a throttler interface that is NOT enabled by default but can be enabled in vttablet via -enable-lag-throttler and -enable-tx-throttler with what and how we generate the data and what we do with it can be controlled via -throttle_metrics_query, -throttle_threshold, -throttle_metrics_threshold, -throttle_check_as_check_self. Those results are exposed to the user via /throttler/check and /throttler/check-self in the tablet's web UI. Example underlying protobuf info:

target_replication_lag_sec: 2,
max_replication_lag_sec: 10,
initial_rate: 100,
max_increase: 1,
emergency_decrease: 0.5,
min_duration_between_increases_sec: 40,
max_duration_between_increases_sec: 62,
min_duration_between_decreases_sec: 20,
spread_backlog_across_sec: 20,
age_bad_rate_after_sec: 180,
bad_rate_increase: 0.1,
max_rate_approach_threshold: 0.9,

As a matter of principle, I would agree that we should consolidate these things to have a single shared mechanism for determining the key properties of a tablet: liveness, responsiveness & load (ability to handle more work in the immediate future), and consistency. Perhaps this is something we can discuss over dinner. I do, however, think it's orthogonal to this PR as this PR is a bug fix that is simply correcting behavior related to the healthcheck.

Thanks!

shlomi-noach · 2021-12-06T07:37:55Z

@mattlord agreed that this PR should proceed and no need to address hearbeat/throttler right now.

We also have the heartbeat mechanism that IS NOT enabled by default but can be controlled via -heartbeat_enable and -heartbeat_interval in the tablet. This then causes the tablet to store a heartbeat timestamp info to its mysqld in the _vt.heartbeat table.
We then have a throttler interface that is NOT enabled by default but can be enabled in vttablet via -enable-lag-throttler and -enable-tx-throttler with what and how we generate the data and what we do with it can be controlled via -throttle_metrics_query, -throttle_threshold, -throttle_metrics_threshold, -throttle_check_as_check_self. Those results are exposed to the user via /throttler/check and /throttler/check-self in the tablet's web UI. Example underlying protobuf info:

Agreed that this looks like quite a lot of configuration - though -enable-lag-throttler is the only thing the user would have to worry about. By enabling lag throttler, Vitess implicitly enables the heartbeat mechanism. The metrics query is by default the correct query to compare timestamps on _vt.heartbeat and with a 1s throttle threshold. So the idea is that the user just needs to enable the throttler and not worry about anything else; the defaults should be good for all but the most technical users who want greater control.

mattlord · 2021-12-07T00:40:09Z

go/test/endtoend/tabletmanager/tablet_health_test.go

+		// replicationLagSeconds varies till 7200 so setting safe limit
+		assert.True(t, replicationLagSeconds < 10000, "replica should not be behind primary")
+	} else {
+		assert.True(t, (!serving || replicationLagSeconds >= uint32(tabletUnhealthyThreshold.Seconds())), "Tablet should not be in serving and healthy state")


To help explain this assertion, as far as the healthcheck is concerned the tablet can be SERVING while unhealthy with regards to replica lag. For example:

vitess@fb5036c15bc9:/vt/local$ curl -s localhost:15101/debug/status_details [ { "Key": "Current State", "Class": "healthy", "Value": "REPLICA: Serving" }, { "Key": "Replication Lag", "Class": "unhealthy", "Value": "25s" } ]

deepthi · 2021-12-08T19:56:27Z

We do have a heartbeat mechanism running in tablets. It's still not always-on, though I've advocated for it to be, and perhaps this is a good opportunity to revive that discussion.
We use the heartbeat value in the throttler service and in vreplication. I think we should explore reusing it here, too.

We have discussed this in the past, and I still think it is desirable to at least make the heartbeat the default.
We have some work to do before we can make that change though as discussed here:
#6665 (comment)

mattlord added Type: Bug Component: Query Serving release notes labels Dec 1, 2021

deepthi reviewed Dec 1, 2021

View reviewed changes

go/mysql/flavor.go Outdated Show resolved Hide resolved

mattlord requested a review from rohit-nayak-ps December 1, 2021 01:59

mattlord force-pushed the brokenreplicaserving branch 2 times, most recently from c9663b1 to c099f84 Compare December 1, 2021 02:01

Mark a replica as unhealthy when the repl lag is unknown

11a8809

This prevents us from serving queries from the replica that may be beyond the defined -unhealthy_threshold for the tablet. Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord force-pushed the brokenreplicaserving branch 4 times, most recently from 2a5ff82 to 36f3055 Compare December 1, 2021 19:00

Add e2e test for replica health when its source is gone

ef39eb8

Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord force-pushed the brokenreplicaserving branch from 36f3055 to ef39eb8 Compare December 1, 2021 19:08

mattlord marked this pull request as ready for review December 1, 2021 20:02

mattlord requested review from harshit-gangal and systay as code owners December 1, 2021 20:02

mattlord requested review from deepthi and sougou and removed request for deepthi, systay and harshit-gangal December 1, 2021 20:02

deepthi approved these changes Dec 1, 2021

View reviewed changes

mattlord requested a review from aquarapid December 2, 2021 00:53

mattlord marked this pull request as draft December 2, 2021 19:25

mattlord force-pushed the brokenreplicaserving branch from 39e29b0 to 8790643 Compare December 2, 2021 20:34

mattlord force-pushed the brokenreplicaserving branch from 8790643 to c5143d8 Compare December 2, 2021 20:53

mattlord removed the request for review from aquarapid December 2, 2021 20:55

mattlord requested review from deepthi and removed request for rohit-nayak-ps December 2, 2021 20:55

mattlord force-pushed the brokenreplicaserving branch 3 times, most recently from b374c25 to bcbe0b4 Compare December 2, 2021 22:42

Update tablet healthcheck test based on new replica lag logic

31560fd

Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord force-pushed the brokenreplicaserving branch from bcbe0b4 to 31560fd Compare December 2, 2021 23:52

mattlord changed the title ~~Mark a replica as unhealthy when the repl lag is unknown~~ Estimate replica lag when seconds behind from mysqld is unknown Dec 3, 2021

mattlord marked this pull request as ready for review December 3, 2021 00:50

mattlord requested a review from shlomi-noach as a code owner December 3, 2021 00:50

shlomi-noach reviewed Dec 5, 2021

View reviewed changes

deepthi approved these changes Dec 7, 2021

View reviewed changes

deepthi merged commit e689307 into vitessio:main Dec 7, 2021

deepthi deleted the brokenreplicaserving branch December 7, 2021 00:30

mattlord commented Dec 7, 2021

View reviewed changes

This was referenced Feb 25, 2022

Bug Report: vttablet replica is considered healthy (serving) even when not connect to primary #9788

Closed

Consider repl NOT healthy with IO_Thread connecting and last attempt failed #9789

Closed

mattlord mentioned this pull request Mar 9, 2022

Manage MySQL Replication Status States Properly #9853

Merged

3 tasks

mattlord mentioned this pull request Mar 31, 2022

Handle SQL thread crash in vt/vttablet/tabletserver/repltracker #7157

Closed

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Estimate replica lag when seconds behind from mysqld is unknown #9308

Estimate replica lag when seconds behind from mysqld is unknown #9308

mattlord commented Dec 1, 2021 •

edited

Loading

deepthi left a comment

mattlord commented Dec 3, 2021 •

edited

Loading

shlomi-noach left a comment

mattlord commented Dec 5, 2021 •

edited

Loading

shlomi-noach commented Dec 5, 2021

mattlord commented Dec 5, 2021 •

edited

Loading

shlomi-noach commented Dec 6, 2021

mattlord Dec 7, 2021

deepthi commented Dec 8, 2021

Estimate replica lag when seconds behind from mysqld is unknown #9308

Estimate replica lag when seconds behind from mysqld is unknown #9308

Conversation

mattlord commented Dec 1, 2021 • edited Loading

Description

Related Issue(s)

Checklist

deepthi left a comment

Choose a reason for hiding this comment

mattlord commented Dec 3, 2021 • edited Loading

shlomi-noach left a comment

Choose a reason for hiding this comment

mattlord commented Dec 5, 2021 • edited Loading

shlomi-noach commented Dec 5, 2021

mattlord commented Dec 5, 2021 • edited Loading

shlomi-noach commented Dec 6, 2021

mattlord Dec 7, 2021

Choose a reason for hiding this comment

deepthi commented Dec 8, 2021

mattlord commented Dec 1, 2021 •

edited

Loading

mattlord commented Dec 3, 2021 •

edited

Loading

mattlord commented Dec 5, 2021 •

edited

Loading

mattlord commented Dec 5, 2021 •

edited

Loading