Implement heartbeat metrics #183

Merged
merged 1 commit into from
Feb 21, 2017

roidelapluie
Member

pt-heartbeat measures actual replication lag on a MySQL server. It does
not rely on data stored in SHOW SLAVE STATUS, as that can be unreliable.

pt-heartbeat website:
https://www.percona.com/doc/percona-toolkit/2.2/pt-heartbeat.html

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>

@roidelapluie roidelapluie changed the title Implement pt-heartebat metrics Implement pt-heartbeat metrics Feb 20, 2017
@brian-brazil
Contributor

How many different heartbeat implementations are there out there?

@roidelapluie
Member Author

pt-heartbeat has been there for a very long time; it is widely used.

@grypyrg

grypyrg commented Feb 20, 2017

@brian-brazil : pt-heartbeat is the most widely used (if not the only) alternative tool to measure replication lag properly.
(Of course, there are many companies that do heartbeats using a custom-built solution.)

Given these 2 new metrics, can we easily get the time difference between them?
Maybe mysqld_exporter should also calculate the difference itself and write it to a separate metric, to make the replication lag easier to report.

@SuperQ
Member

SuperQ commented Feb 20, 2017

Could be interesting.

In what way is SHOW SLAVE STATUS unreliable?

@SuperQ SuperQ closed this Feb 20, 2017
@SuperQ SuperQ reopened this Feb 20, 2017
@grypyrg

grypyrg commented Feb 20, 2017

SHOW SLAVE STATUS is unreliable when:

@roidelapluie roidelapluie force-pushed the pth branch 3 times, most recently from 4bd2443 to 956bd82 Compare February 20, 2017 21:30
@roidelapluie
Member Author

Fixed:

  • added _seconds suffix as recommended in Prometheus documentation.
  • added the actual lag time
  • rounded to integers. Makes no sense to go below the second as pt-heartbeat only updates the table once per second.

@SuperQ
Member

SuperQ commented Feb 21, 2017

@grypyrg Awesome, thanks for documenting those issues here.

@SuperQ
Member

SuperQ commented Feb 21, 2017

Even with second-only precision heartbeats, it's still worth keeping the float data, as it is valid. Especially if pt-heartbeat attempts to write on-the-second and not just sleep(1).

Maybe for compatibility, we should implement a pt-heartbeat-compatible writer and include it in the mysqld_exporter repo. We could also implement it to write out millisecond-precision timestamps.

It's really too bad that pt-heartbeat uses a varchar for the timestamp instead of a datetime(6).

ptHeartbeat = "pt_heartbeat"
// pt-heartbeat query
// the second column allows gets the server timestamp at the exact same time the query is run
ptHeartbeatQuery = "SELECT UNIX_TIMESTAMP(ts), UNIX_TIMESTAMP(CONCAT(DATE(NOW()), ' ', CURTIME(6))), server_id from %s"
Member

This can be simplified with UNIX_TIMESTAMP(NOW(6)). I tested on 5.5 and 5.6.

Member Author

fixed

@roidelapluie
Member Author

@SuperQ pt-heartbeat's very first version dates from 2007-09-05: https://sourceforge.net/p/maatkit/code/857/ . The latest version is the result of almost 10 years of maturing. Older releases used timestamps, and that was changed to VARCHAR later. I could not trace the reason.

I do agree that a Go version of that tool might be handy, but it is out of scope for this pull request.

"Timestamp of the current server.",
[]string{"server_id"}, nil,
)
PtHeartbeatDelayDesc = prometheus.NewDesc(
Member

This metric isn't necessary (and against Prometheus design) since you can do this in PromQL.

Member Author

@roidelapluie roidelapluie Feb 21, 2017

@grypyrg That delay can be computed at the Prometheus level using recording rules: https://prometheus.io/docs/querying/rules/ . Once this patch gets in, maybe we can document an example recording rule for this in a blog post.

Member Author

removed

README.md Outdated
@@ -66,6 +66,8 @@ collect.perf_schema.indexiowaits | 5.6 | Collect
collect.perf_schema.tableiowaits | 5.6 | Collect metrics from performance_schema.table_io_waits_summary_by_table.
collect.perf_schema.tablelocks | 5.6 | Collect metrics from performance_schema.table_lock_waits_summary_by_table.
collect.slave_status | 5.1 | Collect from SHOW SLAVE STATUS (Enabled by default)
collect.pt-heartbeat | 5.1 | Collect from pt-heartbeat
Member

I would prefer to call this heartbeat, as there could be other implementations.

Member Author

renamed everywhere

@SuperQ
Member

SuperQ commented Feb 21, 2017

Overall, I like this idea, some comments in-line.

@roidelapluie
Member Author

Should I rename ptHeartbeat to heartbeat everywhere or just in the CLI flags?

@roidelapluie
Member Author

roidelapluie commented Feb 21, 2017

Updated after latest comments.

@SuperQ do you want more documentation for this in README?

@SuperQ
Member

SuperQ commented Feb 21, 2017

@roidelapluie Totally agree, re-implementing is out of scope for this feature. I was mostly just thinking out loud.

Yes, please remove pt from everything. But add a note and a link to pt-heartbeat in the README that "this is a supported implementation of heartbeat checking".

@roidelapluie roidelapluie force-pushed the pth branch 2 times, most recently from c5685e4 to dab4a26 Compare February 21, 2017 08:58
// timestamps. %s will be replaced by the table name.
// The second column allows gets the server timestamp at the exact same
// time the query is run.
heartbeatQuery = "SELECT UNIX_TIMESTAMP(ts), UNIX_TIMESTAMP(NOW(6)), server_id from `%s`"
Member Author

@SuperQ how can we prevent this from turning into SQL injection?

Member Author

Apparently there is not much we can do: golang/go#18478

Member Author

split in two: database and table, for better escaping
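
Placeholders in prepared statements cannot stand in for table or database names, so identifier quoting has to be done by hand. The following is a minimal sketch of how separate database and table parameters could be escaped before being formatted into the query. The helper names `quoteIdent` and `buildHeartbeatQuery` are hypothetical and are not taken from the merged patch.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent wraps a MySQL identifier in backticks and doubles any backtick
// inside it, so the value cannot terminate the quoted identifier early.
// Hypothetical helper, not part of mysqld_exporter.
func quoteIdent(name string) string {
	return "`" + strings.ReplaceAll(name, "`", "``") + "`"
}

// buildHeartbeatQuery formats the heartbeat query for a database and table
// that are quoted separately. Hypothetical helper for illustration only.
func buildHeartbeatQuery(database, table string) string {
	return fmt.Sprintf(
		"SELECT UNIX_TIMESTAMP(ts), UNIX_TIMESTAMP(NOW(6)), server_id FROM %s.%s",
		quoteIdent(database), quoteIdent(table),
	)
}

func main() {
	// A well-behaved pair of names.
	fmt.Println(buildHeartbeatQuery("heartbeat", "heartbeat"))
	// A hostile table name stays inside the quoted identifier.
	fmt.Println(buildHeartbeatQuery("heartbeat", "hb`; DROP TABLE x; --"))
}
```

Quoting the two parts separately matters because a single `db.table` string wrapped in one pair of backticks would be read by MySQL as one identifier containing a dot.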

@grypyrg

grypyrg commented Feb 21, 2017

@roidelapluie, @SuperQ : pt-heartbeat does support subsecond checking. See https://www.percona.com/doc/percona-toolkit/2.1/pt-heartbeat.html#cmdoption-pt-heartbeat--interval

The minimum (fastest) interval is 0.01, and the maximum precision is two decimal places, 
so 0.015 will be rounded to 0.02.

@grypyrg

grypyrg commented Feb 21, 2017

And maybe fetch another column to precalculate the seconds_lag

SELECT UNIX_TIMESTAMP(ts) AS executed_time,
       UNIX_TIMESTAMP(NOW(6)) AS server_time,
       UNIX_TIMESTAMP(NOW(6)) - UNIX_TIMESTAMP(ts) AS seconds_lag,
       server_id
FROM `%s`

Example output (to also show it's subsecond):

+-------------------+-------------------+-------------+-----------+
| executed_time     | server_time       | seconds_lag | server_id |
+-------------------+-------------------+-------------+-----------+
| 1487668183.001080 | 1487668278.442121 |   95.441041 |         1 |
+-------------------+-------------------+-------------+-----------+

@roidelapluie
Member Author

roidelapluie commented Feb 21, 2017

Updated with all the comments taken into consideration; also split database and table parameters so we can better escape them.

@roidelapluie roidelapluie changed the title Implement pt-heartbeat metrics Implement heartbeat metrics Feb 21, 2017
@SuperQ
Member

SuperQ commented Feb 21, 2017

Maybe make a parameter for the column?

@roidelapluie
Member Author

Two parameters then. Is that complexity needed for an initial release?

@SuperQ
Member

SuperQ commented Feb 21, 2017

No, not necessary, it was just an idea. I think it's fine to keep the columns hard-coded.

}
defer heartbeatRows.Close()

var now sql.RawBytes
Member

This should be:

var (
  now, ts  sql.RawBytes
  serverId int
)

Member

@SuperQ SuperQ left a comment

LGTM besides one Go nit.

@roidelapluie
Member Author

Go nit fixed.

@roidelapluie
Member Author

Wording fixed in README.md and mysqld_exporter.go, so for me it is good to merge.

@SuperQ
Member

SuperQ commented Feb 21, 2017

Yup, looks fine, thanks.

@SuperQ SuperQ merged commit 5bf756d into prometheus:master Feb 21, 2017
@grypyrg

grypyrg commented Feb 21, 2017

Thanks @roidelapluie !

@roidelapluie
Member Author

roidelapluie commented Feb 21, 2017

For reference, PromQL query:

mysql_heartbeat_now_timestamp_seconds - on (server_id) mysql_heartbeat_stored_timestamp_seconds

@SuperQ
Member

SuperQ commented Feb 21, 2017

Normally Prometheus will be able to match up the labels and you don't need to specify the on (server_id).

@roidelapluie
Member Author

Oh thanks. Nice trick.

SuperQ added a commit that referenced this pull request Feb 21, 2017
Heartbeat[0] metrics allow for similar alerting semantics to "seconds
behind master" metrics.
* Add a recording rule to provide a similar gauge of seconds behind.
* Add an alert with the same style.

[0]: #183
SuperQ added a commit that referenced this pull request Mar 22, 2017
* [FEATURE] Add read/write query response time #166
* [FEATURE] Add Galera gcache size metric #169
* [FEATURE] Add MariaDB multi source replication support #178
* [FEATURE] Implement heartbeat metrics #183
* [FEATURE] Add basic file_summary_by_instance metrics #189
* [BUGFIX] Workaround MySQL bug 79533 #173
@SuperQ SuperQ mentioned this pull request Mar 22, 2017
SuperQ added a commit that referenced this pull request Apr 25, 2017
* [FEATURE] Add read/write query response time #166
* [FEATURE] Add Galera gcache size metric #169
* [FEATURE] Add MariaDB multi source replication support #178
* [FEATURE] Implement heartbeat metrics #183
* [FEATURE] Add basic file_summary_by_instance metrics #189
* [BUGFIX] Workaround MySQL bug 79533 #173
@koorgoo koorgoo mentioned this pull request Jun 20, 2020
4 participants