
VReplication heartbeat updates can cause significant resource utilization on the target #6945

Closed
rohit-nayak-ps opened this issue Oct 26, 2020 · 5 comments

Comments

@rohit-nayak-ps
Contributor

rohit-nayak-ps commented Oct 26, 2020

A recent change was made to update the time_updated field of _vt.vreplication to record the time of the last heartbeat sent by vstreamer. vstreamers send a heartbeat every second if there has been no binlog event during that period.
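The behavior described above can be modeled with a short sketch. This is illustrative only, not the actual Vitess implementation: the table and column names (`_vt.vreplication`, `time_updated`) come from the issue text, while the function names and the exact SQL are assumptions.

```python
HEARTBEAT_INTERVAL = 1.0  # vstreamer sends a heartbeat after 1s without binlog events


def heartbeat_sql(stream_id: int, now: float) -> str:
    """Build the UPDATE the target runs for each heartbeat.

    Illustrative SQL; the statement Vitess actually generates may differ.
    """
    return (f"UPDATE _vt.vreplication SET time_updated = {int(now)} "
            f"WHERE id = {stream_id}")


def maybe_heartbeat(last_event_ts: float, now: float, stream_id: int):
    """Return a heartbeat UPDATE if no binlog event was seen for a full interval."""
    if now - last_event_ts >= HEARTBEAT_INTERVAL:
        return heartbeat_sql(stream_id, now)
    return None
```

The key point for the rest of this issue: each idle stream triggers one such UPDATE per second on the target.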

The advantage of this feature is that simply monitoring this column can alert users to issues on either the vstreamer or vreplication side of the stream that cause it to hang.

However, this change can have side effects when streams are replicating (after the copy phase is done) and the write qps on the source is low, resulting in a lot of heartbeat activity:

  1. A user with ~1000 streams suddenly sees an extra load of 1000 write qps, because each stream's heartbeat gets updated every second. This can also have a cascading effect, since the increased binlog activity flows through to downstream MySQL replication.

  2. A user with high read qps had a relatively small disk space allocation, since their table sizes were not large. The heartbeat updates caused a disproportionate increase in binlog size, causing them to run out of disk space.
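The load numbers above follow directly from the heartbeat frequency. A quick back-of-the-envelope check, comparing the current per-second updates against the once-per-minute cadence suggested later in this thread:

```python
streams = 1000

# One time_updated write per stream per second (current behavior):
per_second_qps = streams * 1        # 1000 extra write qps on the target

# Staggered to once per minute per stream (the suggested mitigation):
per_minute_qps = streams / 60       # ~16.7 extra write qps

print(per_second_qps, round(per_minute_qps, 1))
```

So lowering the update frequency to one minute cuts the extra write load (and the corresponding binlog growth) by a factor of 60.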

#6805 also reports this issue in more detail.

@matt-ullmer

Can we increase the priority of this?

For the purposes of solving my issue, simply staggering all of the rules to heartbeat once per minute at a random offset would be sufficient and hopefully this would be a simple (temporary) fix?

Going from 1000 write qps to 1000 writes per minute would be a huge improvement for our current in-development use case, which is sensitive to events per second in the binary log.

@rohit-nayak-ps
Contributor Author

Do you need the periodic heartbeat update at all for your use cases? We have had some discussions about long-term solutions, mainly involving a new schema for the vreplication tables: moving the dynamically updated values (GTIDs, timestamps) to a separate table so that the binlog footprint of updates is small, and possibly moving the heartbeat to an exported metric that can be used to monitor health.

But the major refactor will not happen in the near term. I was wondering if we could make heartbeat recording optional (default: off) through an additional binlog parameter, as an immediate hotfix, or implement just that metric now. I need to discuss internally, since there are users who depend on the per-second heartbeat.

@matt-ullmer

For low-volume workflows, how would they be monitored and determined to be operating successfully if no events are written natively?

In these scenarios, is transaction_timestamp updated by the heartbeat, or on a regular cadence separate from the heartbeat?

If we don't need the heartbeat to get updated transaction_timestamps, then disabling the heartbeat is fine.

@rohit-nayak-ps
Contributor Author

> For low volume workflows how would they be monitored and determined to be operating successfully if no events are written natively?

Right now there is a common frequency at which all workflows are updated, independent of the volume. As an initial fix, #7659 lets you set the frequency of the update. So if you use that column as part of your monitoring, you can set it to a minute or more. This will affect all workflows of a vttablet, but it was an easy fix. We will look at adaptive heartbeat updates later.

The VReplicationHeartbeat metric will continue to be updated every second, so if you have Prometheus set up, say, you will continue to get the same granularity of alerts as now.
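The split described above can be sketched as a recorder that updates the exported metric on every heartbeat but throttles the `_vt.vreplication` write to a configurable interval. The class name and interface here are hypothetical, purely to illustrate the idea; this is not the Vitess implementation.

```python
class HeartbeatRecorder:
    """Decouples the exported metric (updated on every heartbeat) from the
    _vt.vreplication write (throttled to a configurable interval).

    Illustrative sketch only; not the actual Vitess code.
    """

    def __init__(self, db_interval: float = 60.0):
        self.db_interval = db_interval          # seconds between DB writes
        self.last_db_write = float("-inf")
        self.metric_ts = 0.0                    # stands in for VReplicationHeartbeat

    def on_heartbeat(self, now: float) -> bool:
        """Record a heartbeat; return True if a DB write was issued."""
        self.metric_ts = now  # metric keeps full per-second granularity
        if now - self.last_db_write >= self.db_interval:
            self.last_db_write = now
            # here: UPDATE _vt.vreplication SET time_updated = <now> ...
            return True
        return False
```

Over two minutes of per-second heartbeats (121 ticks), only 3 UPDATEs reach the database, while the metric is still refreshed on every tick.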

@rohit-nayak-ps
Contributor Author

Closed via #7659
