Metrics recorded by Ruler are different from the original results (with no delay) #8892

Open
17billion opened this issue Mar 24, 2023 · 3 comments
Labels
type/bug Something is not working as expected


17billion commented Mar 24, 2023

Describe the bug

Metrics recorded by Ruler are similar to the original results but not identical, with some unexplained points.
[Screenshot 2023-03-24 10:25:51 AM]
Note that there is almost no delay in the pipeline.

  1. In the figure, the results from Prometheus-infra (Notify for alerts failed), marked by the yellow arrow in the lower Grafana panel, should be recorded every five minutes. With the Ruler (green arrow in the upper Grafana panel), the recorded results are irregular, as shown in the figure.

  2. It seems that the moment at which the Ruler runs the query changes the results slightly. How can I make the results align to the minute? For example, when the Ruler executes the query at 13:04:23, I would like it to take the results for the previous full minute, 13:03:00 ~ 13:03:59. Currently, when the Ruler executes the query at 13:04:23, it seems to take the results from 13:03:23 ~ 13:04:22 and record them as metrics.
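
The difference between the two windows in the 13:04:23 example can be sketched with plain timestamp arithmetic. This is only an illustration of the two behaviours being compared, not Loki code: the function names `trailing_window` and `aligned_window` are hypothetical, and whether the Ruler can be configured to use an aligned window is not established in this thread.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def trailing_window(eval_time, span):
    """Window that ends at the evaluation time (what the issue observes)."""
    return eval_time - span, eval_time


def aligned_window(eval_time, span):
    """Last complete window whose end falls on a multiple of `span` (what the issue asks for)."""
    span_s = int(span.total_seconds())
    elapsed_s = int((eval_time - EPOCH).total_seconds())
    # Round the evaluation time down to the nearest multiple of the span.
    end = EPOCH + timedelta(seconds=(elapsed_s // span_s) * span_s)
    return end - span, end


def fmt(window):
    start, end = window
    return f"{start:%H:%M:%S} -> {end:%H:%M:%S}"


eval_time = datetime(2023, 3, 24, 13, 4, 23)
one_minute = timedelta(minutes=1)

print(fmt(trailing_window(eval_time, one_minute)))  # 13:03:23 -> 13:04:23
print(fmt(aligned_window(eval_time, one_minute)))   # 13:03:00 -> 13:04:00
```

The two windows overlap for only 37 seconds, so any log lines between 13:03:00 and 13:03:23, or between 13:04:00 and 13:04:23, are counted by one window and not the other.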

Lastly, is there any further feedback on the related issues #4909 and #8765? I have found that in some cases, different delays need to be set.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: helm
  • Loki version : 2.6.1
  • Grafana version : 9.2.5
  • Prometheus version : 2.40.1

Screenshots, Promtail config, or terminal output
If applicable, add any output to help explain your problem.

...
limits_config:
  ruler_evaluation_delay_duration: 1m
...
ruler:
  enable_api: true
  enable_sharding: true
  storage:
    type: s3
    s3:
      s3: s3:/xxx
      bucketnames: xxx
  rule_path: /tmp/loki/scratch
  wal:
    dir: /var/loki/ruler-wal
  ring:
    kvstore:
      store: memberlist
  remote_write:
    enabled: true
    client:
      url: http://xxx/api/v1/write

recording rule (in Grafana)
[Screenshot: recording rule]
[Screenshot: recording rule 2]

normal query
[Screenshot: normal query]

@lionelmarksgrafana

What is the interval step for the panel in Grafana? The recording rule and the panel may be using different interval steps.

@slim-bean
Collaborator

The ruler does not guarantee the execution time of a query, only the execution interval. A rule runs according to its rule group's evaluation interval, but it is not keyed to a specific wall-clock time, only to the time between executions.

For a distributed, multi-tenant system like Loki to process thousands of rules concurrently, we can't execute all of them at exactly the same instant, so we introduce jitter to spread their execution more evenly over time while staying within each rule group's interval.
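
As a rough illustration of the idea (not Loki's actual implementation), jitter of this kind can be modelled as a fixed per-group offset inside the interval. The hashing scheme and the names `jitter_offset` and `evaluation_times` below are assumptions made for this sketch.

```python
import hashlib


def jitter_offset(rule_group: str, interval_s: int) -> int:
    """Deterministic offset in [0, interval_s) derived from the group name (hypothetical scheme)."""
    digest = hashlib.sha256(rule_group.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % interval_s


def evaluation_times(rule_group: str, interval_s: int, start_s: int, count: int) -> list:
    """Evaluation timestamps in seconds: a fixed interval apart, shifted by the group's offset."""
    first = start_s + jitter_offset(rule_group, interval_s)
    return [first + k * interval_s for k in range(count)]


# Two groups with the same 5-minute interval start at different offsets,
# but each one keeps exactly 300 seconds between its own evaluations.
for group in ("group-a", "group-b"):
    print(group, evaluation_times(group, 300, 0, 3))
```

In this model the spacing between evaluations is constant, but the evaluation timestamps do not fall on round five-minute boundaries, which matches the behaviour described in this comment.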

This can result in a difference between the generated metric (which is timestamped at the moment the rule was evaluated) and a query against the actual log data (which keeps exact timestamps), and could lead to slightly different results.

You could increase the evaluation frequency of the rule group to try to minimize this discrepancy, at the additional cost of more metric datapoints and ruler CPU consumption, but maybe that tradeoff is worthwhile if the timing is critical for you?


dragoangel commented Feb 5, 2024

Hi @slim-bean, based on the Loki v2.9 release notes https://grafana.com/docs/loki/latest/release-notes/v2-9/ and the PR https://github.com/grafana/loki/pull/8848/files, this should have been merged, but in practice I got an error when I added evaluation_jitter to loki.rulerConfig in the Loki Helm chart. The error states: field evaluation_jitter not found in type ruler.Config, meaning that there is no such setting in the ruler config. Is it possible that it was missed when merging into the 2.9 release, or is there a bug in the implementation?

Update: I found my error: I checked the PR and not the docs. The PR does not contain the final version of the jitter settings, while the docs do: https://grafana.com/docs/loki/latest/configure/#ruler:~:text=max%2Djitter%0A%20%20%5B-,max_jitter,-%3A%20%3Cduration%3E and after using the proper settings everything works.
