Ruler behaviour when older logs received #4909

james-callahan · 2021-12-10T14:56:43Z

Is your feature request related to a problem? Please describe.

I believe the ruler generates metrics by running an instant query at the current time ("now").

Logs might be ingested behind the current point in time, e.g. because a high-latency pipeline takes over a minute to deliver them, or because out-of-order logs arrive an hour after the fact.

These two things together mean that rules often record incorrect values or, much of the time, 0.
I've found this particularly pronounced with the new Cloudflare Logpull ingestion, where Cloudflare logs seem to take ~3 minutes to all arrive in our systems.

The line in this image should continue to be constant; it's just that the logs haven't all arrived yet.

Describe the solution you'd like
Option 1. As old logs arrive, we need some mechanism to send an updated value for a metric via remote_write.
Option 2. A per-rule configuration field specifying a delay to wait before evaluating and sending a metric.

Describe alternatives you've considered
Using offset can skip the delayed period if you can find a good offset value (this is not possible in all circumstances), but offset shifts the resulting metric series in time, so it no longer lines up with other metrics. This makes it hard to see e.g. the effects of a deployment.
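To make the alignment problem concrete, here is a small Python sketch. It is a toy model, not Loki code: the `count_in_window` and `evaluate_with_offset` helpers, the 60s window, and the 180s offset are all illustrative assumptions. It assumes the recorded sample is stamped with the evaluation time while the data is read from `offset` earlier, and it ignores arrival delay to show only the time shift.

```python
# Toy model, not Loki code: all names and numbers are illustrative assumptions.

def count_in_window(event_times, end, width):
    """Count events whose timestamps fall in the window (end - width, end]."""
    return sum(1 for t in event_times if end - width < t <= end)

def evaluate_with_offset(event_times, eval_time, width, offset):
    """Mimic a rule using `offset`: the data is read from the past,
    but the recorded sample is stamped with the evaluation time."""
    value = count_in_window(event_times, eval_time - offset, width)
    return eval_time, value

# A burst of 5 log lines, all carrying the timestamp t=600s.
events = [600] * 5

# Evaluate every 60s with a 60s window and a 180s offset.
samples = [evaluate_with_offset(events, t, 60, 180) for t in range(600, 901, 60)]

# Find the sample time at which the burst becomes visible in the recorded series.
burst_sample_time = next(ts for ts, value in samples if value > 0)
print(burst_sample_time)  # 780: the burst at t=600 is recorded 180s late
```

In this model the burst is counted correctly, but it appears in the recorded series at t=780 rather than t=600, which is why it would not line up with, say, a deployment marker at t=600.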

dannykopping · 2021-12-17T07:28:34Z

I think what you're looking for is this:

# Duration to delay the evaluation of rules to ensure.
# CLI flag: -ruler.evaluation-delay-duration
[ruler_evaluation_delay_duration: <duration> | default = 0s]

Can you give that a try please? I'm noticing now the docs for this option are strangely incomplete, wonder what happened here.

This option is used to calculate the instant at which rules are evaluated.
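As a rough illustration of why shifting the evaluation instant helps, here is a toy Python model. It is not Loki's implementation: it assumes the delay is simply subtracted from the wall-clock time to get the evaluation instant, and the `visible_count` helper, the 10s log spacing, and the 180s arrival lag are made up for the example.

```python
# Toy model, not Loki code: names and numbers are illustrative assumptions.

def visible_count(logs, wall_clock, instant, width):
    """Count log lines in the window (instant - width, instant] that have
    already arrived by `wall_clock`. Each log is (timestamp, arrival_time)."""
    return sum(
        1
        for ts, arrived in logs
        if arrived <= wall_clock and instant - width < ts <= instant
    )

# One log line every 10s, each arriving 180s after its own timestamp.
logs = [(ts, ts + 180) for ts in range(10, 601, 10)]

wall_clock = 600  # the moment the rule fires
delay = 180       # assumed evaluation delay, in seconds

# Evaluating at "now": the last minute of logs has not arrived yet.
without_delay = visible_count(logs, wall_clock, wall_clock, 60)

# Evaluating at "now - delay": that minute of logs is complete.
with_delay = visible_count(logs, wall_clock, wall_clock - delay, 60)

print(without_delay, with_delay)  # 0 6
```

Under these assumptions the undelayed evaluation records 0, while the delayed one sees all 6 lines in its window.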

james-callahan · 2021-12-17T07:31:29Z

It appears that option is global. I'd only really like to set it to a comparatively large value (e.g. 3 minutes) for this one rule.

This could be more notable if people were taking advantage of the full 2 hour out-of-order accept window.

stale · 2022-03-02T23:49:36Z

Hi! This issue has been automatically marked as stale because it has not had any
activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project.
A stalebot can be very useful in closing issues in a number of cases; the most common
is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

Mark issues as revivable if we think it's a valid issue but isn't something we are likely
to prioritize in the future (the issue will still remain closed).
Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task,
our sincere apologies if you find yourself at the mercy of the stalebot.

james-callahan · 2022-03-03T02:53:40Z

I still think this is a major problem that needs addressing.

paullryan · 2022-03-07T17:53:31Z

I am definitely seeing this and it compounds when getting logs from multiple cloudflare domains/zones.

paullryan · 2022-03-11T17:39:07Z

@cyriltovena Any suggestions on a path forward here? We're getting a lot of the following after having this running for a period of time:

5 errors: HTTP status 400: bad query: error parsing time: invalid time range: too early: logs older than 168h0m0s are not available ; HTTP status 400: bad query: error parsing time: invalid time range: too early: logs older than 168h0m0s are not available ; could not read response body: http2: response body closed; could not read response body: http2: response body closed; could not read response body: http2: response body closed

We've tried adjusting the retention periods and rejection ages, and it doesn't clear this. The error causes it to stop pulling, because it goes into the wait backoff, so new logs stop coming in. Given that Cloudflare will not keep logs for more than 168h, it seems we should treat this error differently from a standard 400 from them.
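One way to picture the suggested handling is the Python sketch below. It is only an illustration, not the actual Cloudflare target code: the `classify_pull_error` function and the "skip"/"retry" labels are hypothetical, and the pattern is taken from the error message quoted above.

```python
import re

# Matches the message quoted in this thread, e.g.
# "... too early: logs older than 168h0m0s are not available"
TOO_OLD = re.compile(r"too early: logs older than \S+ are not available")

def classify_pull_error(status, body):
    """Decide what a pull loop could do with a failed request.

    Returns "skip" when the requested range is older than what is retained
    (retrying can never succeed), otherwise "retry" (back off and try again).
    Both labels are hypothetical.
    """
    if status == 400 and TOO_OLD.search(body):
        return "skip"
    return "retry"

message = (
    "bad query: error parsing time: invalid time range: too early: "
    "logs older than 168h0m0s are not available"
)
print(classify_pull_error(400, message))             # skip
print(classify_pull_error(400, "bad query: other"))  # retry
```

The idea is that a "skip" result would let the puller drop the unavailable range and move on to newer logs instead of sitting in backoff.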

Willing to help, but would love to hear your thoughts on direction.

stale · 2022-04-16T04:15:00Z

Hi! This issue has been automatically marked as stale because it has not had any
activity in the past 30 days.

james-callahan · 2022-04-16T04:49:50Z

If this issue is important to you, please add a comment to keep it open.

slim-bean · 2023-03-30T18:02:26Z

I can understand the concern here but I would suggest the solution here isn't changing Loki, it's removing the delay in your ingestion pipeline.

Loki is built to be low latency and expects logs to be received with minimal delay from when they are created.

With alerts, typically the purpose is to react quickly so having no delay in your ingestion becomes the primary concern to make sure the alert is able to fire as soon as possible.

With recording rules, if timeliness of the metric creation is not important I would suggest using ruler_evaluation_delay_duration as mentioned above, or using offset in your ruler query such that it executes at a point you would expect to have received all your data.

paullryan · 2023-04-01T00:27:05Z

@slim-bean The issue was not with ruler_evaluation_delay_duration or any other offset. The problem is that the plugin simply failed once this error was hit, instead of throwing out that data and moving on.

stale bot added the stale label Mar 2, 2022

stale bot removed the stale label Mar 3, 2022

stale bot added the stale label Apr 16, 2022

stale bot removed the stale label Apr 16, 2022

17billion mentioned this issue Mar 24, 2023

Metrics recorded by Ruler are different from the original results (with no delay) #8892

Open
