Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[postgres] replication_delay metric doesn't always report accurate details #1312

Closed
miketheman opened this issue Jan 24, 2015 · 2 comments
Closed
Labels
Milestone

Comments

@miketheman
Copy link
Contributor

In a recent situation, we were chugging along happily with pg streaming replication.

At some point, under heavy write load, the streaming replication cut off, and our secondary replication method (wal-e) kicked in to catch up, until it was finally in-sync and streaming replication could resume.

During this time, the postgres.replication_delay kept increasing, despite replication logs being applied, leading to a misleading graph:

screen shot 2015-01-24 at 11 32 49 am

I suspect that the value being captured/reported in pg_last_xact_replay_timestamp() may only apply to timestamps being applied via streaming replication, thereby being slightly misleading.

Here's what pg_last_xact_replay_timestamp does:

Get time stamp of last transaction replayed during recovery. This is the time at which the commit or abort WAL record for that transaction was generated on the primary. If no transactions have been replayed during recovery, this function returns NULL. Otherwise, if recovery is still in progress this will increase monotonically. If recovery has completed then this value will remain static at the value of the last transaction applied during that recovery. When the server has been started normally without recovery the function returns NULL.

Another way to examine replication delay is using a few of other functions:

  • pg_last_xlog_receive_location:

    Get last transaction log location received and synced to disk by streaming replication. While streaming replication is in progress this will increase monotonically. If recovery has completed this will remain static at the value of the last WAL record received and synced to disk during recovery. If streaming replication is disabled, or if it has not yet started, the function returns NULL.

  • pg_last_xlog_replay_location:

    Get last transaction log location replayed during recovery. If recovery is still in progress this will increase monotonically. If recovery has completed then this value will remain static at the value of the last WAL record applied during that recovery. When the server has been started normally without recovery the function returns NULL.

This should accurately capture replication lag in bytes applied:

SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(), pg_last_xlog_replay_location())
    AS replication_delay_bytes;

Note: pg_xlog_location_diff was introduced in PG 9.2 onwards. Some have backported a function to perform the same, not sure I'd want to add a function to someone else's DB.

I'm still not certain of how to best translate the bytes diffed to time, however this might be a good addition to have bytes lag in any case.

Further, an evolved version of the current replication_delay metric might be this:

SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() 
            THEN 0 
            ELSE GREATEST (0, EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())) 
            END
    AS replication_delay;

As now we will only report a timestamp of the receive doesn't match the replay, meaning there must be a byte difference.

I'm still not certain this will improve the visible lag graph as shown above, as the reported lag will still be reflecting the timestamp of the last transaction applied, but it would likely smooth out a replication delay graph during normal operations, removing the 0.01 second values being reported, since of there are no bytes lagging, we'll get 0.

Reference:
http://www.postgresql.org/docs/9.2/static/functions-admin.html

@LeoCavaille LeoCavaille changed the title postgres.replication_delay doesn't always report accurate details [postgres] replication_delay metric doesn't always report accurate details Jan 28, 2015
@remh remh added this to the Future milestone Feb 25, 2015
@remh
Copy link
Contributor

remh commented Feb 25, 2015

Thanks a lot for this very detailed feedback Mike!

@remh remh modified the milestones: 5.4.0, Future Feb 25, 2015
@DorianZaccaria
Copy link
Contributor

Updated here: #1373

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants