You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In a recent situation, we were chugging along happily with pg streaming replication.
At some point, under heavy write load, the streaming replication cut off, and our secondary replication method (wal-e) kicked in to catch up, until it was finally in-sync and streaming replication could resume.
During this time, the postgres.replication_delay kept increasing, despite replication logs being applied, leading to a misleading graph:
I suspect that the value being captured/reported in pg_last_xact_replay_timestamp() may only apply to timestamps being applied via streaming replication, thereby being slightly misleading.
Here's what pg_last_xact_replay_timestamp does:
Get time stamp of last transaction replayed during recovery. This is the time at which the commit or abort WAL record for that transaction was generated on the primary. If no transactions have been replayed during recovery, this function returns NULL. Otherwise, if recovery is still in progress this will increase monotonically. If recovery has completed then this value will remain static at the value of the last transaction applied during that recovery. When the server has been started normally without recovery the function returns NULL.
Another way to examine replication delay is using a few of other functions:
pg_last_xlog_receive_location:
Get last transaction log location received and synced to disk by streaming replication. While streaming replication is in progress this will increase monotonically. If recovery has completed this will remain static at the value of the last WAL record received and synced to disk during recovery. If streaming replication is disabled, or if it has not yet started, the function returns NULL.
pg_last_xlog_replay_location:
Get last transaction log location replayed during recovery. If recovery is still in progress this will increase monotonically. If recovery has completed then this value will remain static at the value of the last WAL record applied during that recovery. When the server has been started normally without recovery the function returns NULL.
This should accurately capture replication lag in bytes applied:
SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(), pg_last_xlog_replay_location())
AS replication_delay_bytes;
Note: pg_xlog_location_diff was introduced in PG 9.2 onwards. Some have backported a function to perform the same, not sure I'd want to add a function to someone else's DB.
I'm still not certain of how to best translate the bytes diffed to time, however this might be a good addition to have bytes lag in any case.
Further, an evolved version of the current replication_delay metric might be this:
SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location()
THEN 0
ELSE GREATEST (0, EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp()))
END
AS replication_delay;
As now we will only report a timestamp of the receive doesn't match the replay, meaning there must be a byte difference.
I'm still not certain this will improve the visible lag graph as shown above, as the reported lag will still be reflecting the timestamp of the last transaction applied, but it would likely smooth out a replication delay graph during normal operations, removing the 0.01 second values being reported, since of there are no bytes lagging, we'll get 0.
LeoCavaille
changed the title
postgres.replication_delay doesn't always report accurate details
[postgres] replication_delay metric doesn't always report accurate details
Jan 28, 2015
In a recent situation, we were chugging along happily with pg streaming replication.
At some point, under heavy write load, the streaming replication cut off, and our secondary replication method (wal-e) kicked in to catch up, until it was finally in-sync and streaming replication could resume.
During this time, the postgres.replication_delay kept increasing, despite replication logs being applied, leading to a misleading graph:
I suspect that the value being captured/reported in
pg_last_xact_replay_timestamp()
may only apply to timestamps being applied via streaming replication, thereby being slightly misleading.Here's what
pg_last_xact_replay_timestamp
does:Another way to examine replication delay is using a few of other functions:
pg_last_xlog_receive_location:
pg_last_xlog_replay_location:
This should accurately capture replication lag in bytes applied:
Note:
pg_xlog_location_diff
was introduced in PG 9.2 onwards. Some have backported a function to perform the same, not sure I'd want to add a function to someone else's DB.I'm still not certain of how to best translate the bytes diffed to time, however this might be a good addition to have bytes lag in any case.
Further, an evolved version of the current
replication_delay
metric might be this:As now we will only report a timestamp of the receive doesn't match the replay, meaning there must be a byte difference.
I'm still not certain this will improve the visible lag graph as shown above, as the reported lag will still be reflecting the timestamp of the last transaction applied, but it would likely smooth out a replication delay graph during normal operations, removing the 0.01 second values being reported, since of there are no bytes lagging, we'll get 0.
Reference:
http://www.postgresql.org/docs/9.2/static/functions-admin.html
The text was updated successfully, but these errors were encountered: