-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Try to align runs with frequency #765
base: main
Are you sure you want to change the base?
Conversation
When the agent restarts, we might end up in a situation where we are running a check too often. For example, if a check is configured to run once every 10 minutes, and the check ran 1 minute ago, if we run immediately, we will have two executions within that 10 minute window. Since we have to publish samples once every two minutes, we cannot wait for 9 minutes, because we end up with a gap. Instead, do run immediately to avoid that. But if the check ran 9 minutes ago, we can align with the expectation by waiting for 1 minute instead of a random value. Presumably the check ran 9 minutes ago, and the sample was replicated 7, 5, 3 and 1 minutes ago. If we wait for 1 minute, we would end up running the check when it was expected to run. In order to actually fix this issue the agents would have to persist data across runs. It might be possible to do this by offloading publishing to another service. Fixes: #739 Signed-off-by: Marcelo E. Magallon <marcelo.magallon@grafana.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, mostly a couple naming nits.
My head hurts from doing time arithmetic.
func timeFromNs(ns float64) time.Time { | ||
sec := int64(math.Floor(ns / 1e9)) | ||
nsec := int64(math.Mod(ns, 1e9)) | ||
return time.Unix(sec, nsec) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this redundant? I thought you could just time.Unix(0, ns)
:
https://go.dev/play/p/zIkG3JFBBvL
The docs state you can:
// It is valid to pass nsec outside the range [0, 999999999].
return time.Unix(sec, nsec) | ||
} | ||
|
||
func computeOffset(offset, frequency time.Duration, t0, now time.Time) time.Duration { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small nit, but I'd suggest naming t0
created
instead. t0
suggest some beginning of times, but doesn't seem immediately obvious which one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same about offset
, I'm not being able to figure out what that is 🤔
// The check was created more than the frequency ago, so we need to | ||
// compute the time until the next time the check should run. | ||
// | ||
// Compute the number of runs since t0, add one for the next run and | ||
// multiply by the frequency in order to obtain its timestamp. Finally, | ||
// compute the remaining time until that timestamp. | ||
|
||
runs := (now.UnixMilli() - t0.UnixMilli()) / frequency.Milliseconds() | ||
|
||
timeUntilNextRun := t0.Add(time.Duration(runs+1) * frequency).Sub(now) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I barely grasp what we're doing here, but not enough to try and find gaps in the logic. Looks right but I'm not positive about that.
When the agent restarts, we might end up in a situation where we are running a check too often. For example, if a check is configured to run once every 10 minutes, and the check ran 1 minute ago, if we run immediately, we will have two executions within that 10 minute window.
Since we have to publish samples once every two minutes, we cannot wait for 9 minutes, because we end up with a gap. Instead, do run immediately to avoid that.
But if the check ran 9 minutes ago, we can align with the expectation by waiting for 1 minute instead of a random value. Presumably the check ran 9 minutes ago, and the sample was replicated 7, 5, 3 and 1 minutes ago. If we wait for 1 minute, we would end up running the check when it was expected to run.
In order to actually fix this issue the agents would have to persist data across runs. It might be possible to do this by offloading publishing to another service.
Fixes: #739