issue with small pool sizes #47
Comments
Indeed. That is a bit of a pickle. Garm creates new runners based on 2 things:
And will obey the max runner limit, ignoring "queued" events if that limit is reached. This means that one decision factor is removed in that circumstance, and min_idle_runners=0 eliminates the other. An immediate solution would be to set min_idle_runners to at least 1. A good solution would be to queue the new runner in the database instead of ignoring the event, and not spin it up until we can accommodate it (the number of runners falls under max). Unfortunately, this one will take some time to implement due to the backlog, but it will be fixed. For the time being, I would suggest defining pools at the org/enterprise level (which should reduce the number of pools you will need) and keeping at least 1 runner per pool.
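To make the "queue instead of ignore" idea concrete, here is a minimal sketch. It is not garm's implementation: `QueueingPool` and its methods are hypothetical names, and an in-memory `deque` stands in for the database table.

```python
from collections import deque


class QueueingPool:
    """Toy pool that parks "queued" events instead of dropping them."""

    def __init__(self, max_runners: int) -> None:
        self.max_runners = max_runners
        self.active = set()      # job ids that currently have a runner
        self.pending = deque()   # job ids waiting for capacity (stand-in for a DB table)

    def on_queued(self, job_id: str) -> None:
        if len(self.active) < self.max_runners:
            self.active.add(job_id)      # capacity available: start a runner now
        else:
            self.pending.append(job_id)  # at the limit: remember the job for later

    def on_completed(self, job_id: str) -> None:
        self.active.discard(job_id)
        self._drain()

    def _drain(self) -> None:
        # Start runners for parked jobs while the pool is under its limit.
        while self.pending and len(self.active) < self.max_runners:
            self.active.add(self.pending.popleft())


pool = QueueingPool(max_runners=2)
for job in ("job-1", "job-2", "job-3"):
    pool.on_queued(job)
pool.on_completed("job-1")  # frees a slot, so job-3 gets its runner
```

With max=2, the third job waits in `pending` and is picked up as soon as one of the first two completes, rather than being lost.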
Thanks! Definitely not a critical issue, as we have a workaround. Queued runners in the database could be a solution to at least guarantee that there will be a sufficient number of runners. But what would happen if the job gets cancelled in the meantime, or is executed on another runner, or ...? Maybe it would be easier to have an algorithm that just reacts to the current load, e.g. "always start some extra runners that get automatically shut down again if not used within a timeframe"... but this is only a vague idea, not sure if it will work.
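The "shut down if not used within a timeframe" part of that idea could look roughly like the sketch below. Everything here is an assumption for illustration: `Runner`, `reap_idle`, and the `idle_timeout` parameter are hypothetical and do not correspond to existing garm settings.

```python
from dataclasses import dataclass


@dataclass
class Runner:
    name: str
    idle_since: float  # timestamp (seconds) at which the runner last became idle
    busy: bool = False


def reap_idle(runners, now, idle_timeout):
    """Split runners into (kept, removed); idle runners older than idle_timeout are removed."""
    kept, removed = [], []
    for runner in runners:
        if not runner.busy and now - runner.idle_since >= idle_timeout:
            removed.append(runner)
        else:
            kept.append(runner)
    return kept, removed


runners = [
    Runner("extra-1", idle_since=0.0),             # idle for 600s: removed
    Runner("extra-2", idle_since=500.0),           # idle for 100s: kept
    Runner("extra-3", idle_since=0.0, busy=True),  # busy: never removed
]
kept, removed = reap_idle(runners, now=600.0, idle_timeout=300.0)
```

This only covers the cleanup half; deciding how many extra runners to start in the first place would still need some measure of the current load.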
The main issue I see now is that we ignore events sent via web hooks if [...]. I think a good approach would be to record web hook events as they come in, and use that as a source of truth to make decisions. It also gives us greater transparency into which event triggered the creation of a runner. We currently receive web hooks for 3 events:
What we could do is create a new entity in our store that tracks web hook events and their state. Then we can use this to scale our runners. As long as we receive web hooks, we can keep this updated. The challenge is to make this robust enough to still work during periods of GitHub outages and prolonged downtime periods for [...]. This is a non-trivial change that will require some thought and time, but will ultimately improve the way [...]
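A rough sketch of such an entity follows. It assumes the tracked events are the `queued`, `in_progress`, and `completed` actions of GitHub's `workflow_job` webhook; `JobState`, `JobStore`, and `runners_to_create` are hypothetical names, and a dict stands in for the store.

```python
from enum import Enum


class JobState(Enum):
    # Assumed to mirror the workflow_job webhook actions.
    QUEUED = "queued"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"


_ORDER = {JobState.QUEUED: 0, JobState.IN_PROGRESS: 1, JobState.COMPLETED: 2}


class JobStore:
    """Records the latest known state of each job seen via web hooks."""

    def __init__(self) -> None:
        self.jobs = {}

    def record(self, job_id: int, action: str) -> None:
        new = JobState(action)
        old = self.jobs.get(job_id)
        # Only move forward, so a late or re-sent "queued" cannot undo "completed".
        if old is None or _ORDER[new] > _ORDER[old]:
            self.jobs[job_id] = new

    def unserved(self) -> list:
        return [j for j, s in self.jobs.items() if s is JobState.QUEUED]


def runners_to_create(store, idle_runners, total_runners, max_runners):
    """How many runners to start now, given queued jobs and remaining capacity."""
    needed = max(0, len(store.unserved()) - idle_runners)
    capacity = max(0, max_runners - total_runners)
    return min(needed, capacity)


store = JobStore()
for job_id in (1, 2, 3):
    store.record(job_id, "queued")
store.record(1, "in_progress")
store.record(2, "in_progress")
# Pool is full (2 of 2), so nothing is created, but job 3 stays recorded as queued.
print(runners_to_create(store, idle_runners=0, total_runners=2, max_runners=2))
```

Because the decision is recomputed from recorded state, a job that leaves the queued state for any reason stops counting toward `needed` once its next event is recorded. Events missed during an outage are not covered by this sketch and would need a separate reconciliation step.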
A minor issue we found during testing with small pool sizes (in our case: min=0, max=2):
If the max pool size is reached, garm will ignore additional jobs queued by GH.
As soon as the currently running jobs are completed, the pool is scaled down to min=0 again, and the jobs waiting in GH will never be scheduled until we manually resend the webhook event or manually trigger the creation of a new runner.
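The sequence above can be reproduced with a toy model. This is a simplified sketch of the described behaviour, not garm's code; `Pool`, `on_queued`, and `on_completed` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    min_idle: int
    max_runners: int
    runners: int = 0
    dropped: list = field(default_factory=list)  # jobs whose "queued" event was ignored

    def on_queued(self, job_id: str) -> None:
        if self.runners >= self.max_runners:
            self.dropped.append(job_id)  # limit reached: event is ignored for good
        else:
            self.runners += 1            # one new runner per queued job

    def on_completed(self) -> None:
        self.runners -= 1
        # Top up only to min_idle; with min_idle == 0 nothing is ever re-created.
        self.runners = max(self.runners, self.min_idle)


pool = Pool(min_idle=0, max_runners=2)
for job in ("job-1", "job-2", "job-3"):
    pool.on_queued(job)
pool.on_completed()
pool.on_completed()
print(pool.runners, pool.dropped)  # pool is empty while job-3 still waits in GH
```

In this model, setting `min_idle=1` leaves one runner alive after the jobs complete, which is why keeping at least one idle runner works as a workaround.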
Moritz Keppler moritz.keppler@mercedes-benz.com, Mercedes-Benz Tech Innovation GmbH, legal info/Impressum