[postgres] replication_delay metric doesn't always report accurate details #1312

miketheman · 2015-01-24T17:34:14Z

In a recent situation, we were chugging along happily with pg streaming replication.

At some point, under heavy write load, the streaming replication cut off, and our secondary replication method (wal-e) kicked in to catch up, until it was finally in-sync and streaming replication could resume.

During this time, the postgres.replication_delay kept increasing, despite replication logs being applied, leading to a misleading graph:

I suspect that the value being captured/reported in pg_last_xact_replay_timestamp() may only apply to timestamps being applied via streaming replication, thereby being slightly misleading.

Here's what pg_last_xact_replay_timestamp does:

Get time stamp of last transaction replayed during recovery. This is the time at which the commit or abort WAL record for that transaction was generated on the primary. If no transactions have been replayed during recovery, this function returns NULL. Otherwise, if recovery is still in progress this will increase monotonically. If recovery has completed then this value will remain static at the value of the last transaction applied during that recovery. When the server has been started normally without recovery the function returns NULL.

Another way to examine replication delay is using a few of other functions:

pg_last_xlog_receive_location:

Get last transaction log location received and synced to disk by streaming replication. While streaming replication is in progress this will increase monotonically. If recovery has completed this will remain static at the value of the last WAL record received and synced to disk during recovery. If streaming replication is disabled, or if it has not yet started, the function returns NULL.
pg_last_xlog_replay_location:

Get last transaction log location replayed during recovery. If recovery is still in progress this will increase monotonically. If recovery has completed then this value will remain static at the value of the last WAL record applied during that recovery. When the server has been started normally without recovery the function returns NULL.

This should accurately capture replication lag in bytes applied:

SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(), pg_last_xlog_replay_location())
    AS replication_delay_bytes;

Note: pg_xlog_location_diff was introduced in PG 9.2 onwards. Some have backported a function to perform the same, not sure I'd want to add a function to someone else's DB.

I'm still not certain of how to best translate the bytes diffed to time, however this might be a good addition to have bytes lag in any case.

Further, an evolved version of the current replication_delay metric might be this:

SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() 
            THEN 0 
            ELSE GREATEST (0, EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())) 
            END
    AS replication_delay;

As now we will only report a timestamp of the receive doesn't match the replay, meaning there must be a byte difference.

I'm still not certain this will improve the visible lag graph as shown above, as the reported lag will still be reflecting the timestamp of the last transaction applied, but it would likely smooth out a replication delay graph during normal operations, removing the 0.01 second values being reported, since of there are no bytes lagging, we'll get 0.

Reference:
http://www.postgresql.org/docs/9.2/static/functions-admin.html

The text was updated successfully, but these errors were encountered:

remh · 2015-02-25T23:34:07Z

Thanks a lot for this very detailed feedback Mike!

DorianZaccaria · 2015-03-18T21:39:53Z

Updated here: #1373

miketheman added the checks label Jan 24, 2015

LeoCavaille changed the title ~~postgres.replication_delay doesn't always report accurate details~~ [postgres] replication_delay metric doesn't always report accurate details Jan 28, 2015

remh added this to the Future milestone Feb 25, 2015

remh modified the milestones: 5.4.0, Future Feb 25, 2015

DorianZaccaria closed this as completed Mar 18, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[postgres] replication_delay metric doesn't always report accurate details #1312

[postgres] replication_delay metric doesn't always report accurate details #1312

miketheman commented Jan 24, 2015

remh commented Feb 25, 2015

DorianZaccaria commented Mar 18, 2015

[postgres] replication_delay metric doesn't always report accurate details #1312

[postgres] replication_delay metric doesn't always report accurate details #1312

Comments

miketheman commented Jan 24, 2015

remh commented Feb 25, 2015

DorianZaccaria commented Mar 18, 2015