Ruler behaviour when older logs received #4909
Comments
I think what you're looking for is this:
# Duration to delay the evaluation of rules to ensure.
# CLI flag: -ruler.evaluation-delay-duration
[ruler_evaluation_delay_duration: <duration> | default = 0s]
Can you give that a try please? I'm noticing now the docs for this option are strangely incomplete, wonder what happened here. This option is used when calculating the instant at which rules are evaluated.
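For illustration, a minimal sketch of how this might be set in the Loki config file. The placement under `limits_config` is an assumption (the docs excerpt quoted above does not say which section it belongs to), and `3m` is only an example value chosen to match the roughly 3 minute lag described in this issue.

```yaml
# Sketch only: the limits_config placement is an assumption, not confirmed by the excerpt above.
limits_config:
  # Delay rule evaluation so that late-arriving logs are included in the query window.
  # 3m is an illustrative value, not a recommendation.
  ruler_evaluation_delay_duration: 3m
```

The same value can reportedly be passed with the `-ruler.evaluation-delay-duration` CLI flag named in the excerpt.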
It appears that option is global. I'd only really like to set it to a comparatively large value (e.g. 3 minutes) for this one rule. This could be more notable if people were taking advantage of the full 2 hour out-of-order accept window.
Hi! This issue has been automatically marked as stale. We use a stalebot among other tools to help manage the state of issues in this project. Stalebots are also emotionless and cruel and can close issues which are still very relevant. If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.
We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task.
I still think this is a major problem that needs addressing.
I am definitely seeing this and it compounds when getting logs from multiple cloudflare domains/zones.
@cyriltovena Any suggestions on a path forward here? We're getting a lot of the following after having this running for a period of time
We've tried messing with the retention periods and rejection ages and it doesn't clear this. This causes it to stop pulling, as it goes into the wait backoff, so new logs stop coming in. Given that cloudflare will not keep anything for more than 168h, it seems we should treat this error differently than a standard 400 from them. Willing to help, but would love to hear your thoughts on direction.
I can understand the concern here but I would suggest the solution here isn't changing Loki, it's removing the delay in your ingestion pipeline. Loki is built to be low latency and expects logs to be received with minimal delay from when they are created. With alerts, typically the purpose is to react quickly so having no delay in your ingestion becomes the primary concern to make sure the alert is able to fire as soon as possible. With recording rules, if timeliness of the metric creation is not important I would suggest using
@slim-bean The issue was not with the ruler_evaluation or any other offset, the problem is that the plugin just failed once this error was hit instead of throwing out that data and moving on.
Is your feature request related to a problem? Please describe.
I believe that metrics generated by the ruler do an instant query at "now".
Logs might be ingested behind the current point in time, e.g. due to a high-latency pipeline that results in metrics taking over a minute to arrive, or out-of-order logs arriving an hour after the fact.
These two things together mean that rules often record incorrect values or, much of the time, 0.
I've found this particularly pronounced with the new cloudflare logpull ingestion, where cloudflare logs seem to take ~3 minutes to all arrive in our systems.
The line in this image should continue to be constant, it's just that logs haven't all arrived yet.
Describe the solution you'd like
Option 1. As old logs arrive, we need some mechanism to send an updated value for a metric via remote_write.
Option 2. A per-rule configuration field specifying a delay to wait before evaluating and sending a metric.
Describe alternatives you've considered
Using offset can chop off the delayed period if you can find a good offset value (this is not possible in all circumstances), but the nature of offset makes the resultant metric series not line up in the time dimension when compared to other metrics. This makes it hard to see e.g. the effects of a deployment.
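As a rough illustration of the offset alternative, a recording rule might look like the sketch below. The group name, recorded metric name, stream selector `{job="cloudflare"}`, and the 3m offset are all hypothetical and chosen only to mirror the ~3 minute lag mentioned earlier.

```yaml
groups:
  - name: cloudflare-logpull                  # hypothetical group name
    interval: 1m
    rules:
      - record: cloudflare:requests:rate1m    # hypothetical metric name
        # Evaluated at time t, this reads logs from roughly t-4m to t-3m,
        # but the resulting sample is still timestamped at t.
        expr: sum(rate({job="cloudflare"}[1m] offset 3m))
```

With a rule like this, the value written at 12:05 describes traffic from about 12:01 to 12:02, which is the time misalignment against other metrics described above.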