[BUG] Segment replication lag metric seems to be incorrect #18437

@varunbharadwaj

Describe the bug

The segment replication lag metric appears to be incorrect. In tests with both push-based and pull-based indexing, the replication lag (segments.segment_replication.max_replication_lag from the node stats API) catches up soon after indexing stops. However, the reported lag value is very high and does not match bytes behind or the other metrics.

When converted from milliseconds, the replication lag is on the order of days, as shown in the following graph.
[Image: replication lag graph]
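
For reference, the conversion can be sketched as below. This assumes the metric is reported in milliseconds, as the conversion described above implies; the function name and the sample value are hypothetical and not taken from the cluster in this report.

```python
# Milliseconds in one day: 24 h * 60 min * 60 s * 1000 ms.
MS_PER_DAY = 24 * 60 * 60 * 1000


def lag_ms_to_days(lag_ms: float) -> float:
    """Convert a replication lag given in milliseconds to days."""
    return lag_ms / MS_PER_DAY


# Hypothetical sample value, for illustration only.
print(lag_ms_to_days(172_800_000))  # 2.0
```

A lag of several days on a cluster where replicas catch up shortly after indexing stops is what makes the value look suspicious.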

Please verify whether this is a bug.

Related component

Indexing:Replication

To Reproduce

  1. Set up a cluster running segment replication with a remote store (GCS). OpenSearch 3.x is used (latest main branch, 3.1.0 unreleased).
  2. Record the segment replication metrics by calling the node stats API (segments.segment_replication.max_replication_lag metric).
  3. Check whether the metric is correct and matches the other segrep metrics (such as segments.segment_replication.max_bytes_behind).
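
The comparison in steps 2 and 3 can be sketched roughly as follows. This is a minimal sketch, not the project's own tooling: it assumes the node stats response has already been parsed into a dict and that the metrics sit under nodes.<node_id>.indices.segments.segment_replication. The function names, the 60-second threshold, and the sample response are all hypothetical.

```python
from typing import Any, Dict


def segrep_metrics_by_node(node_stats: Dict[str, Any]) -> Dict[str, Dict[str, int]]:
    """Pull the two segrep metrics out of a parsed node stats response.

    Assumed layout: nodes.<node_id>.indices.segments.segment_replication.
    Missing values default to 0.
    """
    result: Dict[str, Dict[str, int]] = {}
    for node_id, node in node_stats.get("nodes", {}).items():
        segrep = (
            node.get("indices", {})
            .get("segments", {})
            .get("segment_replication", {})
        )
        result[node_id] = {
            "max_replication_lag": segrep.get("max_replication_lag", 0),
            "max_bytes_behind": segrep.get("max_bytes_behind", 0),
        }
    return result


def looks_inconsistent(metrics: Dict[str, int], lag_threshold_ms: int = 60_000) -> bool:
    """Heuristic (an assumption, not an official definition): flag a node
    that reports zero bytes behind but a lag above the threshold."""
    return (
        metrics["max_bytes_behind"] == 0
        and metrics["max_replication_lag"] > lag_threshold_ms
    )


# Hypothetical response fragment, for illustration only.
sample = {
    "nodes": {
        "node-1": {
            "indices": {
                "segments": {
                    "segment_replication": {
                        "max_bytes_behind": 0,
                        "max_replication_lag": 172_800_000,
                    }
                }
            }
        }
    }
}

for node_id, metrics in segrep_metrics_by_node(sample).items():
    print(node_id, metrics, looks_inconsistent(metrics))
```

A node flagged by this check is one where the lag and bytes-behind values disagree in the way this issue describes.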

Expected behavior

The replication lag metric is consistent with bytes behind and the other segrep metrics.

Additional Details

Labels

Indexing:Replication, bug