
Cache trigger secrets for the duration of request #585

Closed

Conversation

lawrencejones
Contributor

Changes

This commit adds a request-local cache for interceptors to leverage
during the processing of triggers. It allows interceptors to avoid doing
expensive work more than once for each request, such as fetching a
Kubernetes secret for validating webhooks.

The implementation uses the request context to provide the cache. This was the least disruptive way of providing a cache for interceptors to use, and it is appropriate given that the cache should live only for the duration of each request.
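
A minimal sketch of that idea, assuming a hypothetical helper package and function names (illustrative only, not necessarily what this PR implements):

```go
package cache

import (
	"context"
	"sync"
)

// ctxKey is an unexported type so no other package can collide with this key.
type ctxKey struct{}

// WithCache returns a child context carrying a fresh request-scoped cache.
// The EventListener handler would call this once per incoming request,
// before fanning out to the triggers.
func WithCache(ctx context.Context) context.Context {
	return context.WithValue(ctx, ctxKey{}, &sync.Map{})
}

// GetOrCompute returns the value cached under key, computing and storing it
// on the first call. An interceptor can wrap its secret fetch in this so the
// expensive work happens at most once per request. Under concurrent triggers
// the compute function may still run more than once; that is acceptable for
// a best-effort, request-local cache.
func GetOrCompute(ctx context.Context, key string, compute func() (interface{}, error)) (interface{}, error) {
	m, ok := ctx.Value(ctxKey{}).(*sync.Map)
	if !ok {
		// No cache on this context (e.g. in tests): just do the work.
		return compute()
	}
	if v, found := m.Load(key); found {
		return v, nil
	}
	v, err := compute()
	if err != nil {
		return nil, err
	}
	m.Store(key, v)
	return v, nil
}
```

An interceptor would then key the cache on something like the secret's namespace/name, so forty triggers sharing one GitHub secret cost a single API call per request.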

Alternative implementations might have used the client-go informers to
extend the Kubernetes client to watch for secrets in the cluster. This
would cause the work required to fetch secrets to scale with the number
of secrets in the cluster, as opposed to making a fresh request per
webhook we process. That said, building caching clients seems like more
work than is necessary for fixing this simple problem, which is why I
went with a simple cache object.

The background for this change was finding GitHub webhooks timing out once we exceeded ~40 triggers on our EventListener. While the CEL filtering was very fast, the validation of GitHub webhook signatures was being computed for every trigger, even though each trigger used the same GitHub secret. Pulling the secret from Kubernetes was taking about 250ms, which meant 40 triggers exceeded the 10s GitHub timeout.

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you review them:

See the contribution guide for more details.

Release Notes

Cache Kubernetes secret refs for each EventListener webhook, using the cached value to process each trigger

@tekton-robot tekton-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 26, 2020
@linux-foundation-easycla

linux-foundation-easycla bot commented May 26, 2020

CLA Check
The committers are authorized under a signed CLA.

@tekton-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign iancoffey
You can assign the PR to them by writing /assign @iancoffey in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 26, 2020
@tekton-robot

Hi @lawrencejones. Thanks for your PR.

I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lawrencejones
Contributor Author

Hey team! I'm totally new to the tekton codebase, so I've opened this PR as a draft and left tests out for now.

Ideally you'd let me know if this approach looks viable to you, and if so, I'll add the tests and polish it.

This addresses a comment I made the other day here: #406 (comment)

Thanks!

@dibyom
Member

dibyom commented May 26, 2020

Hi @lawrencejones

Thank you for the PR! Sorry I missed your comment on the issue (we were on holiday for the last few days). I think I have a slight preference for using a caching client -- we should be able to use the knative Store to make things easier (like tektoncd/pipeline#2637). That being said, this PR is definitely a step in the right direction, and if the other approach ends up being too complex, we can merge this and then iterate!

@lawrencejones
Contributor Author

That sounds fine to me. Didn't want to introduce the cached client unless it was used elsewhere. I'll have a look at your example and give that a shot :)

@tragiclifestories
Contributor

Hi @dibyom - I might pick this up from Lawrence as my team's made time for investigating perf issues like this. It doesn't look like the knative library provides an equivalent to the configmap package for watching secrets (as per tektoncd/pipeline#2637), so my instinctive approach would be to fall back on the underlying k8s client informer stuff to build a just-good-enough version. Does that make sense from your POV?
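
For context, a rough sketch of what that "just-good-enough" client-go informer version might look like (the package name, function name, resync period, and error handling are illustrative assumptions, not the eventual implementation):

```go
package secretcache

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	corev1listers "k8s.io/client-go/listers/core/v1"
)

// NewSecretLister starts a shared informer on Secrets and returns a lister
// backed by its local cache, so interceptors read secrets from memory and
// the watch keeps them up to date, instead of hitting the API server for
// every webhook.
func NewSecretLister(client kubernetes.Interface, stopCh <-chan struct{}) (corev1listers.SecretLister, error) {
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	secrets := factory.Core().V1().Secrets()
	lister := secrets.Lister()

	factory.Start(stopCh)
	// Block until the initial list of Secrets has been loaded into the cache.
	for _, synced := range factory.WaitForCacheSync(stopCh) {
		if !synced {
			return nil, fmt.Errorf("failed to sync Secret informer cache")
		}
	}
	return lister, nil
}
```

A lookup then becomes lister.Secrets(namespace).Get(name), and the per-webhook cost drops to an in-memory read, with the watch scaling with the number of secrets in the cluster rather than the number of triggers.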

@dibyom
Member

dibyom commented Jun 1, 2020

Hey @tragiclifestories that sounds good to me! Thanks for working on it!

@tragiclifestories
Contributor

Hi there @dibyom, thanks for the reply. Having dug into the k8s client-go docs a little more, I've come to realise that implementing it that way is going to be a much bigger time sink than I had hoped.

I have a branch on my fork that essentially finishes this PR off in its current implementation, with tests and so on. Though I plan to take another stab at the informer version this afternoon, it would be great if the request cache version were acceptable as a stop-gap, since there's a good chance it makes a major difference to the performance of event listeners with large numbers of Git(hub|lab) triggers as things stand. Sorry for blowing hot and cold on this, still getting my head around all the various moving parts in k8s client land!

@dibyom
Member

dibyom commented Jun 2, 2020

Though I plan to take another stab at the informer version this afternoon, it would be great if the request cache version were acceptable as a stop-gap, since there's a good chance it makes a major difference to the performance of event listeners with large numbers of Git(hub|lab) triggers as things stand.

Sure, this is definitely an improvement over what we have at the moment, and since it's not an external-facing API change, we should be able to switch to another implementation later. Happy to review a PR with the request cache changes!

@tragiclifestories
Contributor

OK!

Request cache version: #595

k8s cache prototype: #594

@dibyom
Member

dibyom commented Jun 16, 2020

I'm going to close this in favor of the other two PRs mentioned!
/close

@tekton-robot

@dibyom: Closed this PR.

In response to this:

I'm going to close this in favor of the other two PRs mentioned!
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
