Metrics recorded by Ruler are different from the original results (with no delay) #8892
Comments
What is the interval step for the panel in Grafana? The recording rule and the panel may be using different interval steps.
The Ruler does not guarantee the time at which a query executes, only the interval between executions. A rule runs according to its rule group's evaluation interval, but that interval is not keyed to a specific wall-clock time; it only sets the time between executions. For a distributed, multi-tenant system like Loki to process thousands of rules concurrently, we can't execute all of them at exactly the same instant, so we introduce jitter to spread executions more evenly over time while keeping them within their rule group interval. As a result, the generated metric (which is timestamped with the time the rule was evaluated) can differ slightly from a query against the underlying log data (which keeps exact timestamps). You could increase the execution rate of the rule group to minimize this discrepancy, at the cost of more metric datapoints and more Ruler CPU consumption, but that tradeoff may be worthwhile if timing is critical for you.
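A minimal Python sketch of the timing described above, under stated assumptions: a 1-minute rule group interval, a fixed 23-second jitter offset applied to every run, a trailing `[1m]` range in the rule's query, and an arbitrary date. `evaluation_times` and `window_for` are hypothetical helpers, not Loki APIs, and Loki's real jitter may be computed differently. The point is only that each sample is stamped with the evaluation instant and covers a window ending at that instant rather than at a minute boundary.

```python
from datetime import datetime, timedelta

def evaluation_times(start, interval, jitter, count):
    """Evaluation instants for a rule group: every `interval`, shifted by `jitter`.

    Hypothetical model: the jitter is a fixed offset applied to every run,
    so runs stay exactly `interval` apart but are not aligned to the clock.
    """
    return [start + jitter + i * interval for i in range(count)]

def window_for(eval_time, range_):
    """Log window covered by a trailing range such as count_over_time(...[1m]).

    The window ends at the evaluation instant, and the resulting sample
    is stored with eval_time as its timestamp.
    """
    return eval_time - range_, eval_time

# Arbitrary date; only the time of day matters for this illustration.
start = datetime(2023, 3, 20, 13, 3, 0)
times = evaluation_times(start, timedelta(minutes=1), timedelta(seconds=23), 3)
for t in times:
    lo, hi = window_for(t, timedelta(minutes=1))
    print(f"evaluated {t:%H:%M:%S} -> logs {lo:%H:%M:%S} to {hi:%H:%M:%S}")
```

Under this model consecutive windows are contiguous, so no log lines are skipped; what differs from a clock-aligned panel query is only where the window edges fall (for example, the 13:04:23 run covers 13:03:23 to 13:04:23).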
Hi @slim-bean, based on the Loki v2.9 release notes (https://grafana.com/docs/loki/latest/release-notes/v2-9/), this PR https://github.com/grafana/loki/pull/8848/files should have merged the jitter support, but in practice I got an error when I added it. Update: I found my mistake: I had followed the PR instead of the docs. The PR contains an unfinished version of the jitter settings, whereas the docs describe the final `max_jitter` setting: https://grafana.com/docs/loki/latest/configure/#ruler:~:text=max%2Djitter%0A%20%20%5B-,max_jitter,-%3A%20%3Cduration%3E After switching to the correct settings, everything works.
Describe the bug
Metrics recorded by the Ruler are similar to the original query results but not identical, with some unexplained points.
Note that there is almost no delay in the pipeline.
In the figure, the results from Prometheus-infra-(Notify for alerts failed) (yellow arrow, lower Grafana panel) should be recorded every five minutes, but with the Ruler (green arrow, upper Grafana panel) the results are irregular, as shown.
It seems that the time at which the Ruler runs the query causes the slightly different results. How can I make the results appear every minute on the dot? For example, when the Ruler executes the query at 13:04:23, I want it to take the results from the last full minute, 13:03:00 ~ 13:03:59. Currently, when the Ruler executes the query at 13:04:23, it seems to take the results from 13:03:23 ~ 13:04:22 and record them as metrics.
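The behaviour asked for here (a run at 13:04:23 reporting 13:03:00 to 13:03:59) amounts to truncating the evaluation time down to a step boundary and using the previous full step. A minimal Python sketch of that arithmetic, assuming steps are counted from midnight; `aligned_window` is a hypothetical helper for illustration, not a Ruler option, and it shows only the window calculation, not how to make Loki use it.

```python
from datetime import datetime, timedelta

def aligned_window(eval_time, step):
    """Return the last complete, clock-aligned [start, end) window before eval_time.

    Hypothetical helper: eval_time is truncated down to a multiple of `step`
    (counted from midnight), and the window is the full step ending there.
    """
    midnight = eval_time.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = eval_time - midnight
    # timedelta // timedelta gives an int count of whole steps elapsed.
    end = midnight + (elapsed // step) * step
    return end - step, end

start, end = aligned_window(datetime(2023, 3, 20, 13, 4, 23), timedelta(minutes=1))
print(start.time(), end.time())  # end is exclusive: 13:03:00 up to, not including, 13:04:00
```

Because the result depends only on which step the run falls in, any run between 13:04:00 and 13:04:59 would yield the same 13:03:00 to 13:04:00 window, regardless of the jitter within that minute.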
Lastly, is there any further feedback on the related issues #4909 and #8765? I have found that in some cases, different delays need to be set.
Environment:
Screenshots, Promtail config, or terminal output
[Screenshot: recording rule (in Grafana)]
[Screenshot: normal query]