Metrics recorded by Ruler are different from the original results (with no delay) #8892

17billion · 2023-03-24T02:20:48Z

Describe the bug

Metrics recorded by Ruler are similar to the original results but not identical, with some unexplained points.

Note that there is almost no delay in the pipeline.

In the figure, the results from Prometheus-infra-(Notify for alerts failed) (yellow arrow in the lower Grafana panel) should be recorded every five minutes, but when using Ruler (green arrow in the upper Grafana panel), the results are irregular.
It seems that the moment at which Ruler executes the query leads to slightly different results. How can I make the results align to the minute boundary? For example, when Ruler executes the query at 13:04:23, I would like it to take the results from the last full minute, 13:03:00 ~ 13:03:59. Currently, when Ruler executes the query at 13:04:23, it seems to take the results from 13:03:23 ~ 13:04:22 and record them as metrics.
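To make the two windows in the question concrete, here is a minimal Python sketch. It is only an illustration of the arithmetic, not anything Ruler provides: the function names `rolling_window` and `aligned_window` are hypothetical, and windows are treated as half-open intervals, so exact boundary inclusivity may differ from what Loki does.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def rolling_window(eval_time, range_):
    """Window covered by a range of length range_ ending at eval_time.

    This models the behaviour described in the issue: the window is
    anchored to whenever the rule happens to be evaluated.
    """
    return eval_time - range_, eval_time


def aligned_window(eval_time, range_):
    """Most recent complete window of length range_ that ends on a
    clock boundary (a multiple of range_ since the epoch)."""
    # Whole number of range_-sized steps elapsed since the epoch.
    steps = (eval_time - EPOCH) // range_
    end = EPOCH + steps * range_
    return end - range_, end


if __name__ == "__main__":
    evaluated_at = datetime(2023, 3, 24, 13, 4, 23)
    one_minute = timedelta(minutes=1)
    print("rolling:", rolling_window(evaluated_at, one_minute))
    print("aligned:", aligned_window(evaluated_at, one_minute))
```

For an evaluation at 13:04:23 with a one-minute range, the rolling window starts at 13:03:23, while the aligned window starts at 13:03:00, which is the difference the question is asking about.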

Lastly, is there any further feedback on the related issues #4909 and #8765? I have found that in some cases, different delays need to be set.

Environment:

Infrastructure: Kubernetes
Deployment tool: helm
Loki version : 2.6.1
Grafana version : 9.2.5
Prometheus version : 2.40.1

Screenshots, Promtail config, or terminal output

...
limits_config:
  ruler_evaluation_delay_duration: 1m
...
ruler:
  enable_api: true
  enable_sharding: true
  storage:
    type: s3
    s3:
      s3: s3:/xxx
      bucketnames: xxx
  rule_path: /tmp/loki/scratch
  wal:
    dir: /var/loki/ruler-wal
  ring:
    kvstore:
      store: memberlist
  remote_write:
    enabled: true
    client:
      url: http://xxx/api/v1/write

[Screenshot: recording rule (in Grafana)]

[Screenshot: normal query]


lionelmarksgrafana · 2023-03-30T17:25:53Z

What is the interval step for the panel in grafana? There may be a different interval step between the recording rule vs the panel.

slim-bean · 2023-03-30T17:57:07Z

The ruler does not guarantee the execution time of a query, only the execution interval. A rule runs according to its rule group's evaluation interval, but it is not keyed to a specific clock time, only to the time between executions.

For a distributed, multi-tenant system like Loki to process thousands of rules concurrently, we can't execute all of them at exactly the same instant, so we introduce jitter to spread executions more evenly over time while staying within each rule group's interval.

This can result in a difference between the generated metric (which is timestamped with when the rule was evaluated) and a query over the actual log data (which keeps exact timestamps), and could lead to slightly different results.

You could increase the evaluation frequency of the rule group to try to minimize this discrepancy, at the additional cost of more metric datapoints and ruler CPU consumption, but maybe that tradeoff is desired if the timing is critical for you?
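As a rough illustration of "interval, not clock time", the Python sketch below spreads rule groups across an interval by giving each group a stable offset. This is an assumption made for illustration only and is not Loki's actual scheduling code: the function name `evaluation_times` and the hash-based offset are hypothetical.

```python
import hashlib


def evaluation_times(group_name, interval_s, start_s, count):
    """Return `count` evaluation timestamps (in seconds) for a rule group.

    Hypothetical scheme: a stable offset in [0, interval_s) is derived
    from the group name, so different groups fire at different moments
    while each group keeps a fixed spacing of interval_s.
    """
    digest = hashlib.sha256(group_name.encode("utf-8")).digest()
    offset = int.from_bytes(digest[:8], "big") % interval_s

    # First slot at or after start_s that carries this group's offset.
    first = (start_s // interval_s) * interval_s + offset
    if first < start_s:
        first += interval_s

    return [first + i * interval_s for i in range(count)]


if __name__ == "__main__":
    for name in ("group-a", "group-b"):
        times = evaluation_times(name, interval_s=60, start_s=0, count=3)
        # Seconds past the minute at which this group is evaluated.
        print(name, [t % 60 for t in times])
```

Under this scheme each group is evaluated exactly once per interval, but generally not at second 0 of the minute, so a range query evaluated then covers a window shifted by the group's offset.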

dragoangel · 2024-02-05T22:19:04Z

Hi @slim-bean, based on the Loki v2.9 release notes (https://grafana.com/docs/loki/latest/release-notes/v2-9/) and this PR (https://github.com/grafana/loki/pull/8848/files), jitter should be configurable, but in practice I got an error when I added evaluation_jitter to loki.rulerConfig in the Loki helm chart. The error states: field evaluation_jitter not found in type ruler.Config, meaning that there is no such setting in the ruler config. Is it possible that it was not merged into the 2.9 release, or is there a bug in the implementation?

Update: I found my error. I had checked the PR and not the docs. The PR does not contain the final version of the jitter settings, while the docs do: https://grafana.com/docs/loki/latest/configure/#ruler:~:text=max%2Djitter%0A%20%20%5B-,max_jitter,-%3A%20%3Cduration%3E and after using the proper setting, everything is working.

JStickler added the type/bug (Something is not working as expected) label Mar 29, 2023

17billion mentioned this issue Mar 30, 2023

Data mismatch for last one minute data for loki alerts and recording rules #8765

Open
