Add grace period to scale-down #78
Conversation
@maigl Could you see if this solves some of the churn you were seeing previously? This should have the same effect and allow for zero idle runners on pools.
I'll check this out..
I'm skeptical that this will help us; the problem was that we saw scaling up on ...
I think I misunderstood the initial problem. I think you're dealing with situations where non-unique tag sets are used in workflows, and garm selects the wrong pool when it reacts to the `queued` event.

The problem with using non-unique labels when targeting runners in workflows is that the wrong type of runner may pick up the job automatically (garm is not involved in this). For example, if you create a pool with GPUs enabled and one with large amounts of storage for the same repo, runners in both pools will react to jobs that only request the labels the two pools share.

I will change this PR. I will leave in the checks for enough idle runners, but will add a check for pools configured with zero minimum idle runners.
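To illustrate the overlap, here is a minimal Go sketch (not garm's actual matching code; `Pool`, `matchesAll` and the pool names are made up for the example). A job that only asks for the labels two pools have in common matches both of them, so either pool's runners may pick it up:

```go
package main

import "fmt"

// Pool is a simplified stand-in for a pool definition; the real garm types differ.
type Pool struct {
	Name   string
	Labels []string
}

// matchesAll reports whether every label requested by a job is present in the pool.
func matchesAll(poolLabels, jobLabels []string) bool {
	set := make(map[string]struct{}, len(poolLabels))
	for _, l := range poolLabels {
		set[l] = struct{}{}
	}
	for _, l := range jobLabels {
		if _, ok := set[l]; !ok {
			return false
		}
	}
	return true
}

func main() {
	pools := []Pool{
		{Name: "gpu-pool", Labels: []string{"self-hosted", "linux", "gpu"}},
		{Name: "storage-pool", Labels: []string{"self-hosted", "linux", "big-disk"}},
	}

	// A workflow that only asks for the shared labels matches both pools.
	job := []string{"self-hosted", "linux"}

	for _, p := range pools {
		if matchesAll(p.Labels, job) {
			fmt.Printf("job with labels %v matches pool %q\n", job, p.Name)
		}
	}
}
```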
Add a grace period of 5 minutes for idle runners. A new idle runner will not be taken into consideration for scale-down unless it's older than 5 minutes. This should prevent situations where the scaleDown() routine that runs every minute evaluates candidates for reaping and erroneously counts the new runner as well. The in_progress hook that transitions an idle runner to "active" may arrive a long while after the "queued" hook has spun up a runner. Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
Force-pushed from 26b80e2 to 43d2fd8
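Roughly, the grace-period check described in the commit message could look like the following Go sketch (the `Runner` type, its field names and `scaleDownCandidates` are illustrative, not garm's real internals):

```go
package main

import (
	"fmt"
	"time"
)

// Runner is a simplified stand-in for a runner record.
type Runner struct {
	Name      string
	Status    string // "idle" or "active"
	CreatedAt time.Time
}

const idleGracePeriod = 5 * time.Minute

// scaleDownCandidates returns the idle runners that are old enough to be
// considered for reaping. Runners younger than the grace period are skipped,
// so a freshly spawned runner is not torn down before its "in_progress"
// hook has a chance to arrive.
func scaleDownCandidates(runners []Runner, now time.Time) []Runner {
	var candidates []Runner
	for _, r := range runners {
		if r.Status != "idle" {
			continue
		}
		if now.Sub(r.CreatedAt) < idleGracePeriod {
			continue // still within the grace period
		}
		candidates = append(candidates, r)
	}
	return candidates
}

func main() {
	now := time.Now()
	runners := []Runner{
		{Name: "runner-old", Status: "idle", CreatedAt: now.Add(-10 * time.Minute)},
		{Name: "runner-new", Status: "idle", CreatedAt: now.Add(-1 * time.Minute)},
	}
	for _, r := range scaleDownCandidates(runners, now) {
		fmt.Println("candidate for reaping:", r.Name)
	}
}
```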
I left in the grace period for scale-down and also added an extra check when deciding to skip adding an idle runner. We check if there is at least one idle runner available in the pool. This should work even for pools configured with zero minimum idle runners. The idea is that we want the pool to still react to `queued` events even when it keeps no idle runners around. What do you think?
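A sketch of that decision, assuming a simplified `PoolState` (the type and field names are hypothetical, not garm's actual API):

```go
package main

import "fmt"

// PoolState is a simplified view of a pool at the moment a "queued"
// webhook is handled.
type PoolState struct {
	IdleRunners int // runners currently idle in this pool
	Runners     int // total runners currently in the pool
	MaxRunners  int
}

// handleQueued decides whether a "queued" event should create a new runner.
// The idea from this PR iteration: skip creating one only when the pool
// already has an idle runner that can pick the job up, so pools with zero
// minimum idle runners still react to queued jobs.
func handleQueued(p PoolState) (addRunner bool, reason string) {
	if p.IdleRunners > 0 {
		return false, "an idle runner is already available"
	}
	if p.Runners >= p.MaxRunners {
		return false, "pool is at its maximum runner count"
	}
	return true, "no idle runner available; spinning one up"
}

func main() {
	add, why := handleQueued(PoolState{IdleRunners: 1, Runners: 2, MaxRunners: 5})
	fmt.Println(add, "-", why)

	add, why = handleQueued(PoolState{IdleRunners: 0, Runners: 2, MaxRunners: 5})
	fmt.Println(add, "-", why)
}
```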
Once we re-work the way garm reacts to `queued` events ...
.. this looks fine to me .. but I think we have another problem here:
This is interesting.
The thought process here was that each pool should have a unique set of tags, mostly because there is no way to influence which runners react to a workflow from GitHub if more than one pool has the same tags. So it becomes impossible to scale automatically in an efficient way. Would you mind writing up a user story of how you use this? Do you have multiple AZs/regions? How do you envision distributing the runners? Things of this nature. The more detailed the better, as it gives us more data points to take into account when we eventually expand how the pools are managed.
yes .. we do have multiple AZs and we must run garm in an HA setup to reach some 9s. Currently we have only one garm, but it has pools in multiple AZs. I think it would be sufficient if queued events were distributed to a pool randomly or round-robin; I don't see the need for an elaborate strategy. In a future setup we will probably have multiple garm instances. Currently we think they would not need to know about each other and also would not share any data.
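As a rough illustration of the round-robin suggestion (this is not an existing garm feature; `roundRobin` and the pool names here are made up), each incoming queued event is simply handed to the next pool in the list:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// roundRobin cycles over a fixed list of pool names (e.g. one pool per AZ),
// handing each incoming "queued" event to the next pool in turn.
type roundRobin struct {
	pools []string
	next  atomic.Uint64
}

func (r *roundRobin) pick() string {
	n := r.next.Add(1) - 1
	return r.pools[n%uint64(len(r.pools))]
}

func main() {
	rr := &roundRobin{pools: []string{"az1-pool", "az2-pool", "az3-pool"}}
	for i := 0; i < 5; i++ {
		fmt.Printf("queued event %d -> %s\n", i, rr.pick())
	}
}
```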
Thanks for the input!
There is an issue with skipping the creation of a new runner when we already have idle runners. In some situations, we may have many queued events sent in rapid succession. This can happen if we start a workflow with multiple jobs, perhaps in a matrix. There may be a significant delay between that rapid succession of "queued" events and the time a runner picks up one of the jobs and an "in_progress" event is received. So if we get 10 "queued" events and we only have 1 idle runner, we essentially ignore 9 of them. I will have to remove that check for now, as scaling down idle runners is less costly than waiting for that one idle runner to finish in order to pick up one of the other 9 jobs. I will allocate time next week to implement job tracking so we can fix multiple problems at once.
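A minimal sketch of what job tracking could look like, assuming pending jobs are keyed by workflow job ID (the `jobTracker` type and its methods are hypothetical, not the eventual garm implementation):

```go
package main

import (
	"fmt"
	"sync"
)

// jobTracker keeps a count of jobs that were "queued" but not yet picked
// up, so scaling decisions can compare pending jobs against idle runners
// instead of reacting to each webhook in isolation.
type jobTracker struct {
	mu      sync.Mutex
	pending map[int64]struct{} // keyed by workflow job ID
}

func newJobTracker() *jobTracker {
	return &jobTracker{pending: make(map[int64]struct{})}
}

// Queued records a job from a "queued" webhook.
func (t *jobTracker) Queued(jobID int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.pending[jobID] = struct{}{}
}

// InProgress removes a job once its "in_progress" webhook arrives.
func (t *jobTracker) InProgress(jobID int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.pending, jobID)
}

// RunnersNeeded is the number of extra runners required to cover the
// backlog, given how many runners are currently idle.
func (t *jobTracker) RunnersNeeded(idleRunners int) int {
	t.mu.Lock()
	defer t.mu.Unlock()
	if n := len(t.pending) - idleRunners; n > 0 {
		return n
	}
	return 0
}

func main() {
	t := newJobTracker()
	// Ten jobs queued in rapid succession (e.g. a matrix build), one idle runner.
	for id := int64(1); id <= 10; id++ {
		t.Queued(id)
	}
	fmt.Println("extra runners needed:", t.RunnersNeeded(1)) // 9
}
```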
Currently, we don't seem to have much trouble with that. But job tracking sounds promising.
Some good news. An option to skip the default labels in self-hosted GitHub runners has been added via an upstream PR. As soon as a stable release is cut with this option, I will add it to garm. We will finally be able to define just custom labels, without the usual defaults.