Ruler behaviour when older logs received #4909

Open
james-callahan opened this issue Dec 10, 2021 · 10 comments

Comments

@james-callahan
Contributor

james-callahan commented Dec 10, 2021

Is your feature request related to a problem? Please describe.

I believe the ruler generates metrics by running an instant query at the current time ("now").

Logs might be ingested behind the current point in time, e.g. because of a high-latency pipeline where logs take over a minute to arrive, or because out-of-order logs arrive an hour after the fact.

Together, these mean that rules often record incorrect values, and much of the time record 0.
I've found this particularly pronounced with the new Cloudflare logpull ingestion, where Cloudflare logs seem to take ~3 minutes to all arrive in our systems.
image
The line in this image should continue to be constant, it's just that logs haven't all arrived yet.
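
To make the effect concrete, here is a small simulation. It is only a sketch: the `LogLine` type, the `count_over_time` helper, and the fixed 180 s latency are assumptions for illustration, not Loki code. It counts lines in a 60 s window, considering only lines that have already arrived when the query runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogLine:
    event_ts: int    # seconds; timestamp carried by the log line
    arrival_ts: int  # seconds; when the line became queryable


def count_over_time(lines, eval_ts, window, now):
    """Count lines with event_ts in (eval_ts - window, eval_ts] that arrived by `now`."""
    return sum(
        1
        for line in lines
        if eval_ts - window < line.event_ts <= eval_ts and line.arrival_ts <= now
    )


# Assumption: one line per second, each arriving 180 s after its own timestamp.
LATENCY = 180
lines = [LogLine(t, t + LATENCY) for t in range(0, 1000)]

now = 600
# Evaluating at "now": nothing from the last minute has arrived yet.
at_now = count_over_time(lines, eval_ts=now, window=60, now=now)
# Evaluating 180 s in the past: the window is fully populated.
delayed = count_over_time(lines, eval_ts=now - LATENCY, window=60, now=now)
# Re-running the original query 180 s later gives the value that should have been recorded.
later = count_over_time(lines, eval_ts=600, window=60, now=now + LATENCY)

print(at_now, delayed, later)
```

Under these assumptions the query at "now" records 0 even though the true value for that window, once all logs are in, is 60.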

Describe the solution you'd like
Option 1. As old logs arrive, we need some mechanism to send an updated value for a metric via remote_write.
Option 2. A per-rule configuration field specifying a delay to wait before evaluating and sending a metric.

Describe alternatives you've considered
Using offset can cut off the delayed period if you can find a good offset value (this is not possible in all circumstances), but offset shifts the resulting metric series in time, so it no longer lines up with other metrics. This makes it hard to see, e.g., the effects of a deployment.
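
The alignment problem with offset can be sketched as follows. This assumes Prometheus-style rule evaluation, where a recorded sample is stamped with the evaluation time; the two helper functions are hypothetical and only model the timestamps, not the query itself.

```python
def eval_with_offset(now, offset):
    """Query runs at `now` but reads data as of `now - offset`; the sample is stamped `now`."""
    return {"sample_ts": now, "data_as_of": now - offset}


def eval_with_delay(now, delay):
    """Query runs at `now - delay`; the sample is stamped with that same instant."""
    ts = now - delay
    return {"sample_ts": ts, "data_as_of": ts}


def skew(result):
    """How far the recorded timestamp is from the data it actually describes."""
    return result["sample_ts"] - result["data_as_of"]


with_offset = eval_with_offset(now=600, offset=180)
with_delay = eval_with_delay(now=600, delay=180)
print(skew(with_offset), skew(with_delay))
```

With offset, the recorded series is shifted by the offset relative to the data it describes; with a delayed evaluation, the sample arrives late but carries the timestamp of the data.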

@dannykopping
Contributor

I think what you're looking for is this:

# Duration to delay the evaluation of rules to ensure.
# CLI flag: -ruler.evaluation-delay-duration
[ruler_evaluation_delay_duration: <duration> | default = 0s]

Can you give that a try please? I notice now that the docs for this option are strangely incomplete; I wonder what happened there.

This option is used to calculate the instant at which rules are evaluated.

@james-callahan
Contributor Author

It appears that option is global. I'd only really like to set it to a comparatively large value (e.g. 3 minutes) for this one rule.

This could be more notable if people were taking advantage of the full 2 hour out-of-order accept window.
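
A per-rule override along the lines of Option 2 could look roughly like the sketch below. The `evaluation_delay` field on `Rule`, the `evaluation_time` helper, and the example rule names and expressions are all hypothetical; nothing here is an existing Loki setting or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Rule:
    record: str
    expr: str
    evaluation_delay: Optional[timedelta] = None  # hypothetical per-rule field


def evaluation_time(rule, now, global_delay=timedelta(0)):
    """Pick the instant to evaluate at; a per-rule delay, when set, wins over the global one."""
    delay = rule.evaluation_delay if rule.evaluation_delay is not None else global_delay
    return now - delay


now = datetime(2021, 12, 10, 12, 0, tzinfo=timezone.utc)
slow = Rule("cloudflare:requests:rate1m", 'sum(rate({job="cloudflare"}[1m]))', timedelta(minutes=3))
fast = Rule("app:errors:rate1m", 'sum(rate({job="app"} |= "error" [1m]))')

print(evaluation_time(slow, now))  # evaluated 3 minutes in the past
print(evaluation_time(fast, now))  # evaluated at "now"
```

Only the slow source pays the 3 minute delay; every other rule keeps evaluating at the global setting.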

@stale

stale bot commented Mar 2, 2022

Hi! This issue has been automatically marked as stale because it has not had any
activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project.
A stalebot can be very useful in closing issues in a number of cases; the most common
is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely
    to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task,
our sincere apologies if you find yourself at the mercy of the stalebot.

@stale stale bot added the stale label Mar 2, 2022
@james-callahan
Contributor Author

I still think this is a major problem that needs addressing.

@stale stale bot removed the stale label Mar 3, 2022
@paullryan
Contributor

I am definitely seeing this, and it compounds when getting logs from multiple Cloudflare domains/zones.

@paullryan
Contributor

@cyriltovena Any suggestions on a path forward here? We're getting a lot of the following after having this running for a period of time:

5 errors: HTTP status 400: bad query: error parsing time: invalid time range: too early: logs older than 168h0m0s are not available ; HTTP status 400: bad query: error parsing time: invalid time range: too early: logs older than 168h0m0s are not available ; could not read response body: http2: response body closed; could not read response body: http2: response body closed; could not read response body: http2: response body closed

We've tried adjusting the retention periods and rejection ages, and it doesn't clear this. The error puts the puller into its wait backoff, so it stops pulling and new logs stop coming in. Given that Cloudflare will not keep logs for more than 168h, it seems we should treat this error differently from a standard 400 from them.

Willing to help, but would love to hear your thoughts on direction.
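
One possible direction, sketched under assumptions: detect the "too early" 400 from the error text and move the requested window forward into the retention period, or drop it, rather than backing off and retrying the same range. The `is_too_early` and `next_window` helpers and the one-minute `margin` are hypothetical and do not reflect the existing puller code.

```python
from datetime import datetime, timedelta, timezone

# From the error text: "logs older than 168h0m0s are not available".
RETENTION = timedelta(hours=168)


def is_too_early(status, body):
    """True for the 400 that means the requested range is older than upstream retention."""
    return status == 400 and "too early" in body


def next_window(start, end, status, body, now, margin=timedelta(minutes=1)):
    """Return the (start, end) range to request next, or None if the range should be dropped."""
    if not is_too_early(status, body):
        return (start, end)  # leave other errors to the normal retry/backoff path
    earliest = now - RETENTION + margin
    if end <= earliest:
        return None  # the whole range has aged out; skip it and move on
    return (max(start, earliest), end)  # keep only the part still available


now = datetime(2022, 4, 1, tzinfo=timezone.utc)
body = "bad query: error parsing time: invalid time range: too early: logs older than 168h0m0s are not available"

aged_out = next_window(now - timedelta(hours=200), now - timedelta(hours=199), 400, body, now)
clamped = next_window(now - timedelta(hours=169), now - timedelta(hours=167), 400, body, now)
print(aged_out, clamped)
```

The margin is there so the clamped start does not age out again between computing it and sending the request; its size is a guess.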

@stale

stale bot commented Apr 16, 2022

@stale stale bot added the stale label Apr 16, 2022
@james-callahan
Contributor Author

If this issue is important to you, please add a comment to keep it open.

@stale stale bot removed the stale label Apr 16, 2022
@slim-bean
Collaborator

I can understand the concern, but I would suggest the solution isn't changing Loki; it's removing the delay in your ingestion pipeline.

Loki is built to be low latency and expects logs to be received with minimal delay from when they are created.

With alerts, typically the purpose is to react quickly so having no delay in your ingestion becomes the primary concern to make sure the alert is able to fire as soon as possible.

With recording rules, if timeliness of the metric creation is not important I would suggest using ruler_evaluation_delay_duration as mentioned above, or using offset in your ruler query such that it executes at a point you would expect to have received all your data.

@paullryan
Contributor

@slim-bean The issue was not with ruler_evaluation_delay_duration or any other offset. The problem is that the plugin simply failed once this error was hit, instead of throwing out that data and moving on.
