Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Estimate replica lag when seconds behind from mysqld is unknown #9308

Merged
merged 4 commits into from
Dec 7, 2021

Conversation

mattlord
Copy link
Contributor

@mattlord mattlord commented Dec 1, 2021

Description

When mysqld tells us that seconds_behind_master is unknown (NULL) — or we get back any other value that cannot be converted to a uint — let's estimate the replication lag seconds value (seconds_behind_master itself is fuzzy) in vttablet's healthcheck stats using the last seen value and the time elapsed since. This gives us the most accurate estimate possible of the replica lag when the replica cannot talk to its source and is constantly retrying and in a state where Slave_IO_Running == Connecting, which we consider to be equivalent to running as reconnects happen regularly anyway e.g. due to slave_net_timeout, and allows us to correctly support the defined -unhealthy_threshold behavior for the replica tablet — preventing us from serving queries from a replica that may have very stale data.

With these changes, the results of the test here seem to be expected and correct:

vitess@fb5036c15bc9:/vt/local$ for i in {1..5}; do
>   echo -n "mysqld says:"
>   command mysql -u root --socket=/vt/vtdataroot/vt_0000000101/mysql.sock -e "show slave status\G" | grep Seconds | cut -d: -f2
>   echo -n "vttablet says: "
>   curl -s localhost:15101/debug/status_details | jq -r '.[1].Value'
>   echo
>   sleep 5
> done
mysqld says: 0
vttablet says: 0s

mysqld says: NULL
vttablet says: 5s

mysqld says: NULL
vttablet says: 10s

mysqld says: NULL
vttablet says: 15s

mysqld says: NULL
vttablet says: 20s


vitess@fb5036c15bc9:/vt/local$ mysql commerce/0@replica -e "select * from customer"
ERROR 1105 (HY000) at line 1: target: commerce.0.replica: no healthy tablet available for 'keyspace:"commerce" shard:"0" tablet_type:REPLICA'

vitess@fb5036c15bc9:/vt/local$ curl -s localhost:15101/debug/status_details
[
 {
  "Key": "Current State",
  "Class": "healthy",
  "Value": "REPLICA: Serving"
 },
 {
  "Key": "Replication Lag",
  "Class": "unhealthy",
  "Value": "25s"
 }
]

vitess@fb5036c15bc9:/vt/local$ mysql -e "show vitess_tablets"
+-------+----------+-------+------------+-------------+------------------+--------------+----------------------+
| Cell  | Keyspace | Shard | TabletType | State       | Alias            | Hostname     | PrimaryTermStartTime |
+-------+----------+-------+------------+-------------+------------------+--------------+----------------------+
| zone1 | commerce | 0     | PRIMARY    | NOT_SERVING | zone1-0000000100 | fb5036c15bc9 | 2021-12-03T01:10:04Z |
| zone1 | commerce | 0     | REPLICA    | NOT_SERVING | zone1-0000000101 | fb5036c15bc9 |                      |
| zone1 | commerce | 0     | RDONLY     | SERVING     | zone1-0000000102 | fb5036c15bc9 |                      |
+-------+----------+-------+------------+-------------+------------------+--------------+----------------------+

Related Issue(s)

Checklist

  • Should this PR be backported? NO
  • Tests were added
  • Documentation is not required (I don't think?)

go/mysql/flavor.go Outdated Show resolved Hide resolved
@mattlord mattlord force-pushed the brokenreplicaserving branch 2 times, most recently from c9663b1 to c099f84 Compare December 1, 2021 02:01
This prevents us from serving queries from the replica that may
be beyond the defined -unhealthy_threshold for the tablet.

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord force-pushed the brokenreplicaserving branch 4 times, most recently from 2a5ff82 to 36f3055 Compare December 1, 2021 19:00
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord marked this pull request as ready for review December 1, 2021 20:02
@mattlord mattlord requested review from deepthi and sougou and removed request for deepthi, systay and harshit-gangal December 1, 2021 20:02
Copy link
Member

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me.
I thought @aquarapid wanted us to add more error checking though..

@mattlord mattlord marked this pull request as draft December 2, 2021 19:25
This gives us the most accurate estimate possible -- and seconds_behind_master
is already fuzzy anyway -- of the replica lag when the replica cannot
talk to its source and is constantly retrying and in a state where
Slave_IO_Running == Connecting which we consider to be equivalent to
running (see parseReplicationStatus() in flavor.go) as reconnects happen
regularly anyway e.g. due to slave_net_timeout.

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord removed the request for review from aquarapid December 2, 2021 20:55
@mattlord mattlord requested review from deepthi and removed request for rohit-nayak-ps December 2, 2021 20:55
@mattlord mattlord force-pushed the brokenreplicaserving branch 3 times, most recently from b374c25 to bcbe0b4 Compare December 2, 2021 22:42
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord changed the title Mark a replica as unhealthy when the repl lag is unknown Estimate replica lag when seconds behind from mysqld is unknown Dec 3, 2021
@mattlord mattlord marked this pull request as ready for review December 3, 2021 00:50
@mattlord
Copy link
Contributor Author

mattlord commented Dec 3, 2021

@deepthi, while I don't think that this PR as a stand-alone change warrants doc updates... after having gone through this I feel like we could probably use a new docs page on "Using Vitess for Read Scale Out" that documents how to do @replica shard targeting and how the vtgate, vttablet, and mysqld coordinate to prevent serving potentially stale reads beyond a configured window of inconsistency (walking through how that's configured). What does everyone else think? If everyone thinks that's a good idea then I can create a docs PR for that.

Copy link
Contributor

@shlomi-noach shlomi-noach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have a heartbeat mechanism running in tablets. It's still not always-on, though I've advocated for it to be, and perhaps this is a good opportunity to revive that discussion.
We use the heartbeat value in the throttler service and in vreplication. I think we should explore reusing it here, too.

@mattlord
Copy link
Contributor Author

mattlord commented Dec 5, 2021

We do have a heartbeat mechanism running in tablets. It's still not always-on, though I've advocated for it to be, and perhaps this is a good opportunity to revive that discussion.
We use the heartbeat value in the throttler service and in vreplication. I think we should explore reusing it here, too.

@shlomi-noach We did not add another mechanism here. Are you talking about something other than what I’m using in the PR? It’s the existing tablet healthcheck. If there’s another one then I’m not aware of it. Can you share details about what you’re referring to?

@shlomi-noach
Copy link
Contributor

@mattlord I misunderstood and thought there was a new functionality here.

I think we might still want to first consult with our heartbeat mechanism. I don't know whether this PR should change right now, but I think ideally all our lag calculations should use the heartbeat,

Examples:

Anyway, I think this digresses from the PR. LGTM and perhaps keep the discussion going 🙏

@mattlord
Copy link
Contributor Author

mattlord commented Dec 5, 2021

Hi @shlomi-noach !

(Please poke holes in anything that may be wrong.)

Today we have the vttablet healthcheck stream that vtgates subscribe to which helps the gateway determine if the tablet should be serving queries or not. That is exposed to users via:

curl -s localhost:15101/debug/status_details
[
 {
  "Key": "Current State",
  "Class": "healthy",
  "Value": "REPLICA: Serving"
 },
 {
  "Key": "Replication Lag",
  "Class": "healthy",
  "Value": "0s"
 }

(vtgate exposes resulting info in SHOW VITESS_TABLETS and /debug/status.)

The replication lag portion of that — at least the polling which we rely on for estimates when we don't get a value from mysqld — is enabled by default via -enable_replication_reporter and what we do with the data can be configured via -health_check_interval and -unhealthy_threshold in the tablet, -discovery_low_replication_lag and -discovery_high_replication_lag_minimum_serving in the gateway. We're using that here in this PR but correctly handling seconds_behind_master when it's NULL or unknown.

We also have the heartbeat mechanism that IS NOT enabled by default but can be controlled via -heartbeat_enable and -heartbeat_interval in the tablet. This then causes the tablet to store a heartbeat timestamp info to its mysqld in the _vt.heartbeat table.

We then have a throttler interface that is NOT enabled by default but can be enabled in vttablet via -enable-lag-throttler and -enable-tx-throttler with what and how we generate the data and what we do with it can be controlled via -throttle_metrics_query, -throttle_threshold, -throttle_metrics_threshold, -throttle_check_as_check_self. Those results are exposed to the user via /throttler/check and /throttler/check-self in the tablet's web UI. Example underlying protobuf info:

target_replication_lag_sec: 2,
max_replication_lag_sec: 10,
initial_rate: 100,
max_increase: 1,
emergency_decrease: 0.5,
min_duration_between_increases_sec: 40,
max_duration_between_increases_sec: 62,
min_duration_between_decreases_sec: 20,
spread_backlog_across_sec: 20,
age_bad_rate_after_sec: 180,
bad_rate_increase: 0.1,
max_rate_approach_threshold: 0.9,

As a matter of principle, I would agree that we should consolidate these things to have a single shared mechanism for determining the key properties of a tablet: liveness, responsiveness & load (ability to handle more work in the immediate future), and consistency. Perhaps this is something we can discuss over dinner. I do, however, think it's orthogonal to this PR as this PR is a bug fix that is simply correcting behavior related to the healthcheck.

Thanks!

@shlomi-noach
Copy link
Contributor

@mattlord agreed that this PR should proceed and no need to address hearbeat/throttler right now.

We also have the heartbeat mechanism that IS NOT enabled by default but can be controlled via -heartbeat_enable and -heartbeat_interval in the tablet. This then causes the tablet to store a heartbeat timestamp info to its mysqld in the _vt.heartbeat table.
We then have a throttler interface that is NOT enabled by default but can be enabled in vttablet via -enable-lag-throttler and -enable-tx-throttler with what and how we generate the data and what we do with it can be controlled via -throttle_metrics_query, -throttle_threshold, -throttle_metrics_threshold, -throttle_check_as_check_self. Those results are exposed to the user via /throttler/check and /throttler/check-self in the tablet's web UI. Example underlying protobuf info:

Agreed that this looks like quite a lot of configuration - though -enable-lag-throttler is the only thing the user would have to worry about. By enabling lag throttler, Vitess implicitly enables the heartbeat mechanism. The metrics query is by default the correct query to compare timestamps on _vt.heartbeat and with a 1s throttle threshold. So the idea is that the user just needs to enable the throttler and not worry about anything else; the defaults should be good for all but the most technical users who want greater control.

@deepthi deepthi merged commit e689307 into vitessio:main Dec 7, 2021
@deepthi deepthi deleted the brokenreplicaserving branch December 7, 2021 00:30
// replicationLagSeconds varies till 7200 so setting safe limit
assert.True(t, replicationLagSeconds < 10000, "replica should not be behind primary")
} else {
assert.True(t, (!serving || replicationLagSeconds >= uint32(tabletUnhealthyThreshold.Seconds())), "Tablet should not be in serving and healthy state")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To help explain this assertion, as far as the healthcheck is concerned the tablet can be SERVING while unhealthy with regards to replica lag. For example:

vitess@fb5036c15bc9:/vt/local$ curl -s localhost:15101/debug/status_details
[
 {
  "Key": "Current State",
  "Class": "healthy",
  "Value": "REPLICA: Serving"
 },
 {
  "Key": "Replication Lag",
  "Class": "unhealthy",
  "Value": "25s"
 }
]

@deepthi
Copy link
Member

deepthi commented Dec 8, 2021

We do have a heartbeat mechanism running in tablets. It's still not always-on, though I've advocated for it to be, and perhaps this is a good opportunity to revive that discussion.
We use the heartbeat value in the throttler service and in vreplication. I think we should explore reusing it here, too.

We have discussed this in the past, and I still think it is desirable to at least make the heartbeat the default.
We have some work to do before we can make that change though as discussed here:
#6665 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

vtgate serves queries from replicas when its source is unavailable and replication is unhealthy
3 participants