
Cache trigger secrets for the duration of request #585

Closed

Conversation

lawrencejones
Contributor

Changes

This commit adds a request-local cache for interceptors to leverage
during the processing of triggers. It allows interceptors to avoid doing
expensive work more than once for each request, such as fetching a
Kubernetes secret for validating webhooks.

The implementation uses the request context to provide the cache. This was the least disruptive way of providing a cache for interceptors to use, and it is appropriate given that the cache should live only for the duration of each request.
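
A minimal sketch of that idea, assuming a hypothetical helper package and function names (illustrative only, not necessarily what this PR implements):

```go
package cache

import (
	"context"
	"sync"
)

// ctxKey is an unexported type so no other package can collide with this key.
type ctxKey struct{}

// WithCache returns a child context carrying a fresh request-scoped cache.
// The EventListener handler would call this once per incoming request,
// before fanning out to the triggers.
func WithCache(ctx context.Context) context.Context {
	return context.WithValue(ctx, ctxKey{}, &sync.Map{})
}

// GetOrCompute returns the value cached under key, computing and storing it
// on the first call. An interceptor can wrap its secret fetch in this so the
// expensive work happens at most once per request. Under concurrent triggers
// the compute function may still run more than once; that is acceptable for
// a best-effort, request-local cache.
func GetOrCompute(ctx context.Context, key string, compute func() (interface{}, error)) (interface{}, error) {
	m, ok := ctx.Value(ctxKey{}).(*sync.Map)
	if !ok {
		// No cache on this context (e.g. in tests): just do the work.
		return compute()
	}
	if v, found := m.Load(key); found {
		return v, nil
	}
	v, err := compute()
	if err != nil {
		return nil, err
	}
	m.Store(key, v)
	return v, nil
}
```

An interceptor would then key the cache on something like the secret's namespace/name, so forty triggers sharing one GitHub secret cost a single API call per request.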

Alternative implementations might have used the client-go informers to
extend the Kubernetes client to watch for secrets in the cluster. This
would cause the work required to fetch secrets to scale with the number
of secrets in the cluster, as opposed to making a fresh request per
webhook we process. That said, building caching clients seems like more
work than is necessary for fixing this simple problem, which is why I
went with a simple cache object.

The background for this change was finding GitHub webhooks timing out once we exceeded ~40 triggers on our EventListener. While the CEL filtering was very fast, the validation of GitHub webhook signatures was being computed for every trigger, even though each trigger used the same GitHub secret. Pulling the secret from Kubernetes was taking about 250ms, which meant 40 triggers exceeded the 10s GitHub timeout.

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you review them:

See the contribution guide for more details.

Release Notes

Cache Kubernetes secret refs for each EventListener webhook, using the cached value to process each trigger

@tekton-robot tekton-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 26, 2020
@linux-foundation-easycla

linux-foundation-easycla bot commented May 26, 2020

CLA Check
The committers are authorized under a signed CLA.

@tekton-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign iancoffey
You can assign the PR to them by writing /assign @iancoffey in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 26, 2020
@tekton-robot

Hi @lawrencejones. Thanks for your PR.

I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lawrencejones
Contributor Author

Hey team! I'm totally new to the tekton codebase, so I've opened this PR as a draft and left tests out for now.

Ideally you'd let me know if this approach looks viable to you, and if so, I'll add the tests and polish it.

This addresses a comment I made the other day here: #406 (comment)

Thanks!

@dibyom
Member

dibyom commented May 26, 2020

Hi @lawrencejones

Thank you for the PR! Sorry I missed your comment on the issue (we were on holiday for the last few days). I think I have a slight preference for using a caching client -- we should be able to use the knative Store to make things easier (like tektoncd/pipeline#2637). That being said, this PR is definitely a step in the right direction, and if the other approach ends up being too complex, we can merge this and then iterate!

@lawrencejones
Contributor Author

That sounds fine to me. Didn't want to introduce the cached client unless it was used elsewhere. I'll have a look at your example and give that a shot :)

@tragiclifestories
Contributor

Hi @dibyom - I might pick this up from Lawrence as my team's made time for investigating perf issues like this. It doesn't look like the knative library provides an equivalent to the configmap package for watching secrets (as per tektoncd/pipeline#2637), so my instinctive approach would be to fall back on the underlying k8s client informer stuff to build a just-good-enough version. Does that make sense from your POV?
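
For context, a rough sketch of what that "just-good-enough" client-go informer version might look like (the package name, function name, resync period, and error handling are illustrative assumptions, not the eventual implementation):

```go
package secretcache

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	corev1listers "k8s.io/client-go/listers/core/v1"
)

// NewSecretLister starts a shared informer on Secrets and returns a lister
// backed by its local cache, so interceptors read secrets from memory and
// the watch keeps them up to date, instead of hitting the API server for
// every webhook.
func NewSecretLister(client kubernetes.Interface, stopCh <-chan struct{}) (corev1listers.SecretLister, error) {
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	secrets := factory.Core().V1().Secrets()
	lister := secrets.Lister()

	factory.Start(stopCh)
	// Block until the initial list of Secrets has been loaded into the cache.
	for _, synced := range factory.WaitForCacheSync(stopCh) {
		if !synced {
			return nil, fmt.Errorf("failed to sync Secret informer cache")
		}
	}
	return lister, nil
}
```

A lookup then becomes lister.Secrets(namespace).Get(name), and the per-webhook cost drops to an in-memory read, with the watch scaling with the number of secrets in the cluster rather than the number of triggers.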

@dibyom
Member

dibyom commented Jun 1, 2020

Hey @tragiclifestories that sounds good to me! Thanks for working on it!

@tragiclifestories
Contributor

Hi there @dibyom, thanks for the reply. Having dug into the k8s client-go docs a little more, I've come to realise that implementing it that way is going to be a much bigger time sink than I had hoped.

I have a branch on my fork that essentially finishes this PR off in its current implementation, with tests and so on. Though I plan to take another stab at the informer version this afternoon, it would be great if the request cache version were acceptable as a stop-gap, since there's a good chance it makes a major difference to the performance of event listeners with large numbers of Git(hub|lab) triggers as things stand. Sorry for blowing hot and cold on this, still getting my head around all the various moving parts in k8s client land!

@dibyom
Member

dibyom commented Jun 2, 2020

Though I plan to take another stab at the informer version this afternoon, it would be great if the request cache version were acceptable as a stop-gap, since there's a good chance it makes a major difference to the performance of event listeners with large numbers of Git(hub|lab) triggers as things stand.

Sure, this is definitely an improvement over what we have at the moment, and since it's not an external-facing API change, we should be able to switch to another implementation later. Happy to review a PR with the request cache changes!

@tragiclifestories
Contributor

OK!

Request cache version: #595

k8s cache prototype: #594

@dibyom
Member

dibyom commented Jun 16, 2020

I'm going to close this in favor of the other two PRs mentioned!
/close

@tekton-robot

@dibyom: Closed this PR.

In response to this:

I'm going to close this in favor of the other two PRs mentioned!
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
