Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try to align runs with frequency #765

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mem
Copy link
Contributor

@mem mem commented Jul 3, 2024

When the agent restarts, we might end up in a situation where we are running a check too often. For example, if a check is configured to run once every 10 minutes, and the check ran 1 minute ago, if we run immediately, we will have two executions within that 10 minute window.

Since we have to publish samples once every two minutes, we cannot wait for 9 minutes, because we end up with a gap. Instead, do run immediately to avoid that.

But if the check ran 9 minutes ago, we can align with the expectation by waiting for 1 minute instead of a random value. Presumably the check ran 9 minutes ago, and the sample was replicated 7, 5, 3 and 1 minutes ago. If we wait for 1 minute, we would end up running the check when it was expected to run.

In order to actually fix this issue the agents would have to persist data across runs. It might be possible to do this by offloading publishing to another service.

Fixes: #739

When the agent restarts, we might end up in a situation where we are
running a check too often. For example, if a check is configured to run
once every 10 minutes, and the check ran 1 minute ago, if we run
immediately, we will have two executions within that 10 minute window.

Since we have to publish samples once every two minutes, we cannot wait
for 9 minutes, because we end up with a gap. Instead, do run immediately
to avoid that.

But if the check ran 9 minutes ago, we can align with the expectation by
waiting for 1 minute instead of a random value. Presumably the check ran
9 minutes ago, and the sample was replicated 7, 5, 3 and 1 minutes ago.
If we wait for 1 minute, we would end up running the check when it was
expected to run.

In order to actually fix this issue the agents would have to persist
data across runs. It might be possible to do this by offloading
publishing to another service.

Fixes: #739
Signed-off-by: Marcelo E. Magallon <marcelo.magallon@grafana.com>
@mem mem requested a review from a team as a code owner July 3, 2024 21:41
Copy link
Member

@roobre roobre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, mostly a couple naming nits.

My head hurts from doing time arithmetic.

Comment on lines +331 to +335
func timeFromNs(ns float64) time.Time {
sec := int64(math.Floor(ns / 1e9))
nsec := int64(math.Mod(ns, 1e9))
return time.Unix(sec, nsec)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this redundant? I thought you could just time.Unix(0, ns):
https://go.dev/play/p/zIkG3JFBBvL

The docs state you can:

// It is valid to pass nsec outside the range [0, 999999999].

return time.Unix(sec, nsec)
}

func computeOffset(offset, frequency time.Duration, t0, now time.Time) time.Duration {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit, but I'd suggest naming t0 created instead. t0 suggest some beginning of times, but doesn't seem immediately obvious which one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same about offset, I'm not being able to figure out what that is 🤔

Comment on lines +348 to +357
// The check was created more than the frequency ago, so we need to
// compute the time until the next time the check should run.
//
// Compute the number of runs since t0, add one for the next run and
// multiply by the frequency in order to obtain its timestamp. Finally,
// compute the remaining time until that timestamp.

runs := (now.UnixMilli() - t0.UnixMilli()) / frequency.Milliseconds()

timeUntilNextRun := t0.Add(time.Duration(runs+1) * frequency).Sub(now)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I barely grasp what we're doing here, but not enough to try and find gaps in the logic. Looks right but I'm not positive about that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Start running checks based on the time of creation, not when the agent receives them
2 participants