Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add grace period to scale-down #78

Merged

Conversation

gabriel-samfira
Copy link
Member

Add a grace period for idle runners of 5 minutes. A new idle runner will not be taken into consideration for scale-down unless it's older than 5 minutes. This should prevent situations where the scaleDown() routine that runs every minute will evaluate candidates for reaping and erroneously count the new one as well. The in_progress hooks that transitiones an idle runner to "active" may arive a long while after the "queued" hook has spun up a runner.

Signed-off-by: Gabriel Adrian Samfira gsamfira@cloudbasesolutions.com

@gabriel-samfira
Copy link
Member Author

@maigl Could you see if this solves some of the churn you were seeing previously? This should have the same effect and allow for zero idle runners on pools.

@maigl
Copy link
Contributor

maigl commented Feb 7, 2023

I'll check this out..

@maigl
Copy link
Contributor

maigl commented Feb 7, 2023

I'm skeptical that this will help us, the problem was, that we saw scaling up on queue events and at the same time scaling down due to too many idling runners.
So the fix to not spin any more runners when we already have enough idling seems to work pretty well in our setup. If we now remove this we will again spin way to many runners ..

@gabriel-samfira
Copy link
Member Author

I think I misunderstood the initial problem. I think you're dealing with situations where non-unique tag sets are used in workflows, and garm selects the wrong pool when it reacts to the queued event. A runner that matches the required labels from Pool A picks up the job, but when garm looks up pools by labels it selects Pool B to spin up the runner in.

The problem with using non-unique labels in workflows, when targeting runners is that the wrong type of runner may pick up the job automatically (garm is not involved in this).

For example, if you create a pool with GPUs enabled and one with high amounts of storage for the same repo, runners in both pools will react to self-hosted. If your workflows require GPUs, it may end up on a server with high amounts of storage. The best way to avoid this scenario is to make sure all your workflows avoid using the default tags (self-hosted, x64/arm64, linux/windows/macos).

I will change this PR. Will leave in the checks for enough idle runners, but will add a check for pools with MinIdleRunners set to 0. What do you think?

Add a grace period for idle runners of 5 minutes. A new idle runner will
not be taken into consideration for scale-down unless it's older than 5
minutes. This should prevent situations where the scaleDown() routine
that runs every minute will evaluate candidates for reaping and
erroneously count the new one as well. The in_progress hooks that
transitiones an idle runner to "active" may arive a long while after the
"queued" hook has spun up a runner.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
@gabriel-samfira gabriel-samfira force-pushed the add-scale-down-grace-period branch from 26b80e2 to 43d2fd8 Compare February 7, 2023 11:36
@gabriel-samfira
Copy link
Member Author

gabriel-samfira commented Feb 7, 2023

I left in the grace period for scale-down and also added an extra check when deciding to skip adding an idle runner. We check if there is at least one idle runner available in the pool. This should work even for pools that have MinIdleRunners set to 0.

The idea is that we want the pool to still react to the queued event, even if it has MinIdleRunners set to 0.

What do you think?

@gabriel-samfira
Copy link
Member Author

The problem with using non-unique labels in workflows, when targeting runners is that the wrong type of runner may pick up the job automatically (garm is not involved in this).

Once we re-work the way garm reacts to queued events, I think we can prevent spinning up runners in the wrong pool, and simply create a new runner in the pool that lost one when it picked up a job. But using non-unique label sets won't fix the potential problem of getting the wrong type of runner (garm is not involved in that), because the job ended up on a runner from the wrong pool.

@maigl
Copy link
Contributor

maigl commented Feb 7, 2023

.. this looks fine for me ..

.. but I think we have another problem here:
for redundancy reasons we have multiple pools with the identical label set. Currently garm only looks for the First() matching pool it finds in the db. I.e. every scale event goes to the very same pool. I think we need to fix this.

@gabriel-samfira
Copy link
Member Author

for redundancy reasons we have multiple pools with the identical label set

This is interesting.

Currently garm only looks for the First() matching pool it finds in the db

The thought process here was that each pool should have unique sets of tags. Mostly because there is no way to influence which runners react to a workflow from GitHub if more than one pool have the same tags. So it becomes impossible to efficiently scale automatically.

Would you mind writing up a user story of how you use this? Do you have multiple AZ/regions? How do you envision distributing the runners? Things of this nature. The more detailed the better, as it gives us more data-points to take into account when we'll eventually expand how the pools are managed.

@maigl
Copy link
Contributor

maigl commented Feb 8, 2023

yes .. we do have multiple AZs and we must run garm in a HA setup to reach some 9s. Currently we have only one garm but it has pools on multiple AZs. I think if queue events would be distributed to a pool randomly or with round robin it would be sufficient. I don't see the need for some elaborate strategy.
But currently all queue events go to only one pool.

I a future setup we will probably have multiple garm instances. Currently we think that they would not need to know each other and also don't share any data.

@gabriel-samfira
Copy link
Member Author

Thanks for the input!

@gabriel-samfira gabriel-samfira merged commit 24f61ce into cloudbase:main Feb 8, 2023
@gabriel-samfira
Copy link
Member Author

There is an issue with skipping the creation of a new runner when we already have idling runners.

In some situations, we may have many queued events sent in rapid succession. This can happen if we start a workflow with multiple jobs, maybe in a matrix. There may be a significant delay between that rapid succession of "queued" events and the time a runner picks up one of the jobs and an "in_progress" event is received. So if we get 10 "queued" events and we only have 1 runner, we essentially ignore 9 events.

I will have to remove that check for now, as scaling down idling runners is less time consuming than waiting for that 1 idle runner to finish in order to pick up one of the other 9 jobs.

I will allocate time next week to implement job tracking so we fix multiple problems.

@maigl
Copy link
Contributor

maigl commented Apr 8, 2023

Currently, we don't seem to have much trouble with that. But job tracking sounds promising.

@gabriel-samfira gabriel-samfira mentioned this pull request Apr 17, 2023
@gabriel-samfira
Copy link
Member Author

Some good news. An option to skip the default labels in the self hosted github runners has been added via this PR:

actions/runner#2443

As soon as a stable release is cut with this option, I will add it to garm. We will finally be able to define just custom labels without the usual self-hosted, x64, linux labels, making it easier to properly target runners.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants