-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: webhook does not return error on 'instance not found'. #28
Conversation
Hi @gabriel-samfira, here is another nit. I don't know if you're interested in those little things .. we might add bigger features in future :) |
That's right. We get all webhooks, including those that target other runner types (Github hosted for example). Garm currently returns an error in all cases (queued, in_progress and completed) if the action targets tags we don't care about, or if the webhook targets a worker that we don't manage (like a github hosted one). In all cases, the store will return a
I am a bit torn between knowing it's pointless to log/react to these events and correctly reporting back the fact that the webhook we got, was not meant for us 😄. I think we can just not report an error if we get an |
apiserver/controllers/controllers.go
Outdated
@@ -103,6 +104,11 @@ func (a *APIController) handleWorkflowJobEvent(w http.ResponseWriter, r *http.Re | |||
|
|||
if err := a.r.DispatchWorkflowJob(hookType, signature, body); err != nil { | |||
log.Printf("failed to dispatch work: %s", err) | |||
// in this case we probably got a webhook that was not meant for us | |||
// no need to return an error | |||
if strings.Contains(err.Error(), "fetching instance by name: not found") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We try to return meaningful errors we can test for, as much as we can. This case can be handled with:
if errors.Is(err, gErrors.ErrNotFound) {
log.Printf("got not found error from DispatchWorkflowJob. webhook not meant for us?: %q", err)
return nil
}
It would also be useful to move the log.Printf("failed to dispatch work: %s", err)
statement under the if
statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure .. I'd love to use errors.Is but I was under the impression that there are multiple cases where we return ErrNotFound but only one of this cases means that its not our runner ("fetching instance by name: not found"). Other cases still might point to a real error. Is that wrong ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TL;DR: It should be safe to ignore any ErrNotFound
in this particular code path (webhook handling).
TS; NM: I will elaborate ( bare with me, I sometimes tend to be verbose 😄 ).
The web hook handler is how we consume GitHub events and react to them in 3 particular ways:
- When a new job is queued (a
queued
event is sent) - When a job is started (an
in_progress
event is sent) - When a job is completed (a
completed
event is sent)
In the case of a queued
event, garm
will look at the job info and using the tags that were requested by the workflow, will attempt to find a pool where it will try to spawn a worker. Garm looks for a pool that matches all requested tags. If a pool is not found, then an ErrNotFound
is returned.
This is not necessarily an error. It's just that we don't have a pool to handle that particular workflow, so for us, that webhook is a noop.
In the case of an in_progress
event, garm will attempt to determine the worker name that is handling the job we were just notified about. When a worker name is found, garm will look for that particular worker and attempt to change it's status from idle
to active
. This is important because once that happens, if we have min_idle_runners
set on the pool, garm will attempt to automatically create a new worker, if one was not already created during the queued
event (which can happen if max_runners
is reached).
If garm can't find the runner in the database, it will return ErrNotFound
. But this means that we never managed the runner we got a hook for. So this may be handled by another controller or by github itself. So it is safe to ignore.
In the case of a completed
event, we do pretty much the same thing as in the in_progress
event, but we mark the runner as pending_delete
. At this point the runner was already removed from github, so all that's left for us to do is remove it from our provider. If we can't find the runner, then it's either a runner we don't manage, or it was already deleted. Desired state is current state, so this too will be a noop.
There are a few things to consider here. This change only affects the webhook API endpoint that GitHub sends hook events to. The only consumer of this endpoint is GitHub itself. We don't interact with this at all. Ignoring the ErrNotFound
error here, will only have the side effect of not seeing an error in the GitHub console. But there is no further handling of the error.
We should still log that error condition. We may want to know that GitHub requested something and we generated an ErrNotFound
, even though we're ignoring it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, thanks for taking the time to create this PR! I was on the fence about this issue, but considering it bugs other people too, we might as well change it. 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thx for the elaborate explanation :) .. I changed the code as you have suggested.
This case is probably caused by a webhook event that was meant for another runner controller / manager. No need to report this as error, we can simply ignore this and avoid noise in the logs.
Thanks for the PR! |
This case is probably caused by a webhook event that was meant for another runner controller / manager. No need to report this as error, we can simply ignore this and avoid noise in the logs.
Michael Kuhnt michael.kuhnt@mercedes-benz.com Mercedes-Benz Tech Innovation GmbH (ProviderInformation)