Multi-tenant interceptor and scaler #206

Merged: 43 commits, merged into kedacore:main on Sep 3, 2021

Conversation

@arschles (Collaborator) commented Jul 1, 2021

This is a large PR that makes the interceptor and external scaler multi-tenant.

See below for testing instructions.

Instead of the operator automatically starting a dedicated interceptor/scaler per application, a fleet of interceptors/scalers runs and can operate on any application in the same namespace. Interceptors dynamically proxy requests based on the incoming request, scalers dynamically report metrics for all applications, and the operator provides routing information to the interceptor fleet. See https://hackmd.io/@arschles/mutitenant-keda-http-addon for more detailed design information.

In this pull request:

  • The operator will not install any Pods into the cluster when any HTTPScaledObject is created
  • Any given interceptor pod can route a request to any installed application (in the same namespace)
  • Interceptors are (still) horizontally scalable
  • The operator maintains a "routing table" - a lookup table from hostname to the Service, Port, and Deployment of a backing application (see the sketch below this list)
  • Interceptor pods periodically request the updated routing table from the operator, and update their internal copy
  • When the routing table is updated -- for example, when a new HTTPScaledObject is created -- the operator pings all interceptor pods to refresh their copy
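To make the routing table concrete, here is a minimal sketch of what a table entry and a host lookup might look like. The type and field names are illustrative assumptions, not the actual types in pkg/routing.

```go
package routing

import (
	"fmt"
	"sync"
)

// Target describes the backend a hostname routes to. The field names are
// illustrative assumptions, not the add-on's real types.
type Target struct {
	Service    string // Kubernetes Service fronting the application
	Port       int    // port on that Service
	Deployment string // Deployment whose replicas the scaler watches
}

// Table is a concurrency-safe lookup from hostname to Target.
type Table struct {
	mu     sync.RWMutex
	routes map[string]Target
}

func NewTable() *Table {
	return &Table{routes: map[string]Target{}}
}

// AddTarget registers (or replaces) the route for a host.
func (t *Table) AddTarget(host string, target Target) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.routes[host] = target
}

// Lookup returns the Target for a host, or an error if none is registered.
func (t *Table) Lookup(host string) (Target, error) {
	t.mu.RLock()
	defer t.mu.RUnlock()
	tgt, ok := t.routes[host]
	if !ok {
		return Target{}, fmt.Errorf("no route for host %q", host)
	}
	return tgt, nil
}
```

An interceptor would call Lookup with the incoming request's Host header to decide where to proxy, and the operator rebuilds the table whenever an HTTPScaledObject is created or removed.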

There are several follow-ups to this PR, including but not limited to:

Checklist

  • Update design.md and walkthrough.md in the docs/ directory
  • Commits are signed with Developer Certificate of Origin (DCO)
  • Implement the routing table in the operator
  • Implement the routing table in the interceptor
  • Implement interceptor <--> operator RPC for the routing table
  • Remove code from operator that creates new resources
  • Make the routing table fetch path configurable in the operator
  • Expose metrics on the scaler so that KEDA can scale the interceptors as well
  • Clean up status values on the HTTPScaledObject
  • Make the external scaler respond based on the incoming HTTPScaledObject
    • requires that the queue knows about all hosts
  • Fix & add tests:
    • Add tests for the routing table in pkg/routing
    • Add tests for the routing table pinger in the interceptor
    • Add tests for the operator's routing table construction logic
    • Round-trip tests for the routing table
    • Round-trip tests for the HTTP request queue
    • Fix broken tests
  • Make applicable updates to the helm chart in kedacore/charts (feat: multi-tenant scaler and interceptor in the HTTP add-on charts#169)
    • Add or remove applicable fields from the CRD and update kedacore/charts. Some status fields will need removing, and a host field will need to be added so that the operator can build the routing table
    • The helm chart needs to install the interceptor and scaler along with the operator
    • Add a ScaledObject for the interceptor. See above for details on the work to make interceptor metrics available
  • Any necessary documentation is added
  • e2e tests
  • Any other necessary unit/integration tests
  • Ensure that the scaler's IsActive method always returns true for interceptors (so that they don't scale down to 0)
  • Change routing table communication strategy (a sketch follows this checklist):
    • Operator records routing table to ConfigMap
    • Each interceptor fetches it on startup, records it to memory
    • Interceptors have an open watch stream on the map
    • Interceptors periodically fetch the ConfigMap (to ensure they converge to the correct table, even if they miss events)
  • Ensure that unnecessary configurations are removed from the associated helm chart
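For the ConfigMap-based communication strategy in the checklist above, the interceptor's refresh loop could look roughly like the following sketch. The ConfigMap name, data key, and JSON layout are illustrative assumptions, and the watch stream is omitted for brevity; only the fetch-on-startup and periodic re-fetch parts are shown.

```go
package interceptor

import (
	"context"
	"encoding/json"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// routingTableCM and routingTableKey are hypothetical names for illustration,
// not the add-on's real ConfigMap name or key.
const (
	routingTableCM  = "keda-http-routing-table"
	routingTableKey = "routing-table"
)

// fetchTable reads the routing table the operator wrote into a ConfigMap and
// decodes it as a host -> "service:port" map (layout assumed for brevity).
func fetchTable(ctx context.Context, cl kubernetes.Interface, ns string) (map[string]string, error) {
	cm, err := cl.CoreV1().ConfigMaps(ns).Get(ctx, routingTableCM, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	table := map[string]string{}
	if err := json.Unmarshal([]byte(cm.Data[routingTableKey]), &table); err != nil {
		return nil, err
	}
	return table, nil
}

// keepTableFresh fetches once on startup, then re-fetches on an interval so
// interceptors converge to the correct table even if they miss watch events.
func keepTableFresh(ctx context.Context, cl kubernetes.Interface, ns string, apply func(map[string]string)) {
	refresh := func() {
		if table, err := fetchTable(ctx, cl, ns); err != nil {
			log.Printf("routing table refresh failed: %v", err)
		} else {
			apply(table)
		}
	}
	refresh()
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			refresh()
		}
	}
}
```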

Follow-Ups


Fixes #183
Fixes #214
Fixes #101

NOTE: since this is a large pull request, we've split up the work. To do so, we've created branches off of this branch and submitted PRs against it. Following are the PRs that need to be merged into this branch before this one should be reviewed and merged:

cc/ @yaron2 @tomkerkhove

@yaron2 commented Jul 8, 2021

@arschles when do you think this will be ready for review?

@arschles (Collaborator, Author)

@yaron2 this week sometime. I have to finish the routing table storage in the operator, then I can test it out, and from there I'll hit the button to make it not a draft PR. I'll @ you at that point as well.

for context, I just moved to a new house 600 miles away over the weekend and am getting back on my feet 😆

@arschles (Collaborator, Author) commented Jul 14, 2021

@yaron2 an update here - everything is built at this point. I have an M1 Mac and am having problems (possibly qemu-related) building images for the amd64 architecture, so I am figuring that out and then I'll be ready to test in a cluster. Feel free to start reviewing this in the meantime if you like. The code is still rough, but it would be great to have more eyes on it sooner rather than later.

@arschles marked this pull request as ready for review on July 14, 2021 00:48
@yaron2 commented Jul 14, 2021

@yaron2 an update here - everything is built at this point. I have an M1 Mac and am having problems (possibly qemu-related) building images for the amd64 architecture, so I am figuring that out and then I'll be ready to test in a cluster. Feel free to start reviewing this in the meantime if you like. The code is still rough, but it would be great to have more eyes on it sooner rather than later.

Roger Roger.

@khaosdoctor mentioned this pull request on Jul 28, 2021
@arschles mentioned this pull request on Jul 29, 2021
@khaosdoctor (Contributor) commented Aug 2, 2021

@arschles can you please add another to-do list item:

@khaosdoctor (Contributor) commented Aug 2, 2021

I will be responsible for the following:

  • Change routing table communication strategy:
    • Operator records routing table to ConfigMap
    • Each interceptor fetches it on startup, records it to memory
    • Interceptors have an open watch stream on the map
    • Interceptors periodically fetch the ConfigMap (to ensure they converge to the correct table, even if they miss events)

If all goes right, I should start working on it next Wednesday.

@arschles (Collaborator, Author) commented Aug 2, 2021

@arschles can you please add another to-do list item:

@khaosdoctor that TODO list item is in there but already checked off. would you like me to uncheck it?

@arschles (Collaborator, Author) commented Aug 2, 2021

I will be responsible for the following:

@khaosdoctor FYI I'm going to be working on adding e2e tests in this branch as well

@khaosdoctor (Contributor)

@arschles can you please add another to-do list item:

@khaosdoctor that TODO list item is in there but already checked off. would you like me to uncheck it?

Oh, I hadn't seen it! No worries, it's fine then :D

@khaosdoctor (Contributor)

I will be responsible for the following:

@khaosdoctor FYI I'm going to be working on adding e2e tests in this branch as well

Yep! I booked some time weekly starting next week to finish it ASAP

@khaosdoctor (Contributor) commented Aug 11, 2021

@arschles I will start the new routing table strategy and merge it into your global-components branch so we only create a single PR here with all the changes (and also so I have all the changes you've made)

@arschles (Collaborator, Author)

How to test this

  1. Ensure that KEDA is already installed
  2. Build the images in this PR using the following command: mage dockerbuild dockerpush (or if you want to use ACR Tasks: mage dockerbuildacr)
  3. Check out the branch in this PR: feat: multi-tenant scaler and interceptor in the HTTP add-on charts#169
  4. Install the chart from inside the kedacore/charts repo: helm install http-add-on ./http-add-on -n $NAMESPACE --set images.tag=${TAG} --set images.operator=${OPERATOR_IMG} --set images.scaler=${SCALER_IMG} --set images.interceptor=${INTERCEPTOR_IMG}
  5. From inside this repository in this branch: helm install xkcd ./examples/xkcd -n $NAMESPACE
  6. Now that the app is installed, you can issue requests to it. Use the keda-add-ons-http-interceptor-proxy Service on port 8080 for that. From inside the cluster in the same $NAMESPACE, do this: curl -H "Host: myhost.com" keda-add-ons-http-interceptor-proxy:8080
    • Note that you need the Host header so that the interceptor routes the request to the right backend (see the sketch below this list). To change that host, add this flag to the end of the helm install xkcd command: --set host=<your host>
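As a rough illustration of why the Host header matters in step 6, here is a sketch of how an interceptor could resolve the backend from that header and proxy the request. The map contents, names, and the xkcd backend address are hypothetical; the real interceptor uses the routing table supplied by the operator.

```go
package interceptor

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// hostTargets maps Host headers to in-cluster backend addresses. The entries
// here are illustrative only; in the add-on this comes from the routing table.
var hostTargets = map[string]string{
	"myhost.com": "http://xkcd:8080",
}

// proxyHandler resolves the backend from the request's Host header (which is
// why the curl example above sets -H "Host: myhost.com") and proxies to it.
func proxyHandler(w http.ResponseWriter, r *http.Request) {
	target, ok := hostTargets[r.Host]
	if !ok {
		http.Error(w, fmt.Sprintf("no route for host %q", r.Host), http.StatusNotFound)
		return
	}
	backend, err := url.Parse(target)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	httputil.NewSingleHostReverseProxy(backend).ServeHTTP(w, r)
}
```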

@yaron2 left a review:

lgtm

@yaron2 commented Sep 1, 2021

Reviewed offline with @arschles, looks good.

@arschles merged commit c211da9 into kedacore:main on Sep 3, 2021
@tomkerkhove (Member)

Adding @zroubalik in case he wants to review as well

@tomkerkhove (Member)

Oh OK, nvm - Auto-merge was on.

@benjaminhuo

Huge improvement!

@zroubalik (Member)

Oh OK, nvm - Auto-merge was on.

🤣

@zroubalik (Member) left a review:

Looking good, though the PR is so huge that I might have missed something.
But I trust @arschles :) Great work!

@arschles (Collaborator, Author) commented Sep 3, 2021

thank you @tomkerkhove and @benjaminhuo

@zroubalik sorry this got merged before you got a chance to look at it. if you'd like to review, please feel free to! I can make changes/fixes in follow-up PRs

@zroubalik (Member)

@arschles no worries, I went through the code after the merge and it was looking good :)

@pkit commented Sep 16, 2021

Eh, unfortunately 2 pods per namespace still totally defeats the purpose of "multi-tenant scale to zero" as tenants usually live in namespaces and usually that's what is needed: scale each namespace to zero pods.

@arschles (Collaborator, Author)

@pkit there needs to be at least one pod running, though, to handle incoming requests to applications that are scaled to zero. or am I missing something?

@benjaminhuo

@pkit there needs to be at least one pod running, though, to handle incoming requests to applications that are scaled to zero. or am I missing something?

In Knative, the activator pod exists for exactly this purpose and is always on.

@pkit commented Sep 18, 2021

@arschles yup, 1 global deployment of request handlers is ok. But 1 per namespace is not that useful.
@benjaminhuo Knative solution looks good.

@arschles (Collaborator, Author)

@benjaminhuo @pkit we've scoped the interceptor to an individual namespace on purpose, because KEDA is also scoped to a single namespace. We could expand the interceptor to be cluster-global, but doing so to gain better economies of scale would only make sense if the external scaler were made cluster-global as well.

@yaron2 - you raised the issue that prompted this PR in the first place. WDYT?

Also, @tomkerkhove and @zroubalik WDYT as well?

@pkit commented Sep 20, 2021

@arschles my idea was to use something other than a full-fledged Knative installation just to get that "scale-to-zero" feature.
But it seems there is literally no other solution.
Thanks!

@arschles (Collaborator, Author)

@pkit not sure what you mean?

@pkit commented Sep 20, 2021

@arschles there is no solution (other than knative) that provides "namespace scale to zero" functionality.

@tomkerkhove (Member)

@benjaminhuo @pkit we've scoped the interceptor to an individual namespace on purpose, because KEDA is also scoped to a single namespace. We could expand the interceptor to be cluster-global, but doing so to gain better economies of scale would only make sense if the external scaler were made cluster-global as well.

@yaron2 - you raised the issue that prompted this PR in the first place. WDYT?

Also, @tomkerkhove and @zroubalik WDYT as well?

Weren't we going to support both of the scenarios? I thought that was the case where you could have it cluster-wide if you want to centralize or have it namespaced if you want to isolate.

We do the same with KEDA where you can deploy it cluster-wide or scoped if you want to.

I think we should align, what are your thoughts @zroubalik @yaron2 ?

@arschles (Collaborator, Author)

@arschles there is no solution (other than knative) that provides "namespace scale to zero" functionality.

got it, thanks for clarifying. stay tuned, we might be adding that functionality soon (see #206 (comment))

Weren't we going to support both of the scenarios? I thought that was the case where you could have it cluster-wide if you want to centralize or have it namespaced if you want to isolate.

@tomkerkhove somebody asked about that, but we didn't go all the way to making it cluster-global. we did, however, decide to allow interceptors/scalers/operators to run in any arbitrary namespace. making them cluster-global would be different work.

I wasn't aware that you could install KEDA at the cluster-wide level. can you point me to any docs on how to do that? I think if KEDA can be global, that makes it easier for the addon to do so as well.

@tomkerkhove (Member)

It's part of the helm chart configuration - https://github.com/kedacore/charts/blob/master/keda/README.md#configuration

It's called watchNamespace; it's cluster-wide by default, but can be scoped to a namespace if you need it to be.

@arschles (Collaborator, Author) commented Sep 21, 2021

👍 . Not sure how I didn't know this. I don't have the need to use KEDA across multiple namespaces much, I guess 😆

I think, then, that #240 can go on as planned (because it doesn't make much sense for KEDA to be cluster-global but for this project not to be). @pkit your wish from #206 (comment) is going to be granted 😄

Successfully merging this pull request may close these issues: Not scaling to 0; Economy of scale; Provide support for scaling from 0 -> n and vice versa.