-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENT-11765: Changed cf-execd's sleep behavior so it attempts to wake up at the beginning of every minute #5498
Conversation
@cf-bottom jenkins, please |
Backport: #5499 |
Alright, I triggered a build: Jenkins: https://ci.cfengine.com/job/pr-pipeline/10716/ Packages: http://buildcache.cfengine.com/packages/testing-pr/jenkins-pr-pipeline-10716/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems reasonable. I do think that two logs might be helpful
- verbose log about how long environment took to load
- warning log about environment load taking more than one minute
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me otherwise.
@@ -434,6 +434,17 @@ void ThisAgentInit(void) | |||
umask(077); | |||
} | |||
|
|||
// Return 2 to 62 seconds depending on how much time is left until the next |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 2 to 62 and not the actual remaining number of seconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wondered why 2 to 62 as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it seemed safer.
- If there's 0 seconds left, wait a couple of seconds before triggering another agent run.
- If we are a little bit off on our sleeping, targeting second
:02
should mean it's less likely we accidentally wake up at:59
in the previous minute and end up doing 2 runs in the same minute. I guess this could happen if NTP is slightly adjusting the time while we're sleeping, for example? Anyways, it's just based on what I felt was safer, and getting a test build ready for the customer quickly, no real data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added more comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be best to use the monotonic clock on platforms that have it, NTP can easily jump multiple seconds forward or backwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true, it would give better guarantees about how long we actually sleep. But still it would be a bit awkward, since the time you wake up would maybe not match what is expected. (Same as now).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there are two separate things -- how often this code wakes up (monotonic clock) and when it happens (system time). We can only control the former, but the latter actually determines the defined classes and thus the agent run triggering. So... 🤷♂️
cf-execd/cf-execd.c
Outdated
// accidentally skipping any runs if things are slow. | ||
static inline time_t GetPulseTime() | ||
{ | ||
const time_t current_second = (time(NULL) % 60); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, adjusted to use the constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change actually doesn't make sense to me. If we want to determine which second in a minute the clock currently is, % 60
is the right calculation. If CFPULSETIME
is set to 120 or 180, its use in the calculation won't make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought a bit about that, but it does make some sense. If you define a 120 second pulse, it will tell you how many seconds is left of the current pulse, even though that feels less intuitive to think about than the current setup.
…ginning of every minute This could fix issues if things are slow and agent runs end up getting skipped if cf-execd wakes up too late. Changelog: Title Ticket: ENT-11765 Signed-off-by: Ole Herman Schumacher Elgesem <ole@northern.tech>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely an improvement.
FWIW, word on the street is that this change has a positive impact. |
This could fix issues if things are slow and agent runs end up getting skipped if cf-execd wakes up too late.