From c111ca0fc57101e0efb0bb7763fb0f328a854793 Mon Sep 17 00:00:00 2001 From: Joseph Burnett Date: Tue, 15 Jan 2019 21:31:56 +0000 Subject: [PATCH 01/18] Scaling 2019 roadmap stub. --- docs/roadmap/scaling-2019.md | 42 ++++++++++++++++++++++++++++++++++++ 1 file changed, 42 insertions(+) create mode 100644 docs/roadmap/scaling-2019.md diff --git a/docs/roadmap/scaling-2019.md b/docs/roadmap/scaling-2019.md new file mode 100644 index 000000000000..c7af40e108e1 --- /dev/null +++ b/docs/roadmap/scaling-2019.md @@ -0,0 +1,42 @@ +# 2019 Autoscaling Roadmap + +## 2018 Recap + +Before we get into the 2019 roadmap, here is a quick recap of what we did in 2018. + +### Correctness + +1. **Write autoscaler end-to-end tests**: we put a few key autoscaling end-to-end tests in place. [AutoscaleUpDownUp](https://github.com/knative/serving/blob/51b74ba2b78b96fa4b7db3181b4a1c84c2758168/test/e2e/autoscale_test.go#L275) broadly covers scaling from N to 0 and back again. [AutoscaleUpCountPods](https://github.com/knative/serving/blob/51b74ba2b78b96fa4b7db3181b4a1c84c2758168/test/e2e/autoscale_test.go#L327) asserts autoscaler stability and reactivity. +2. **Test error rates at high scale** TODO: STATE OF THE WORLD +3. **Test error rates around idle states**: the AutoscaleUpDownUp end-to-end test has done a good job of flushing out a variety of edge cases and errors during idle and transition states. TODO: EXAMPLES + +### Performance + +1. **Establish canonical load test scenarios**: TODO: STATE OF THE WORLD +2. **Reproducible load tests**: +3. **Vertical pod autoscaling**: + +### Scale to Zero + +1. **Implement scale to zero**: +2. **Reduce Reserve Revision start time**: + +### Development + +1. **Decouple autoscaling from revision controller**: + +### Integration + +1. **Autoscaler multitenancy**: +2. **Consume custom metrics API**: +3. **Autoscale queue-based workloads**: + +## 2019 Goals + +### Sub-Second Cold Start + +### Streaming Autoscaling + +### Overload Handling + +### Vertical Pod Autoscaling Beta From 15a93c2960b7c9c5af563c2a6c80c71af90f2a4b Mon Sep 17 00:00:00 2001 From: Joseph Burnett Date: Tue, 29 Jan 2019 23:38:41 +0000 Subject: [PATCH 02/18] Descriptions for all 2019 goals. --- docs/roadmap/scaling-2018-recap.md | 30 ++++++++++++++++ docs/roadmap/scaling-2019.md | 55 ++++++++++++++++++------------ 2 files changed, 63 insertions(+), 22 deletions(-) create mode 100644 docs/roadmap/scaling-2018-recap.md diff --git a/docs/roadmap/scaling-2018-recap.md b/docs/roadmap/scaling-2018-recap.md new file mode 100644 index 000000000000..acfd0862f95a --- /dev/null +++ b/docs/roadmap/scaling-2018-recap.md @@ -0,0 +1,30 @@ +# 2018 Recap Autoscaling Roadmap + +Before we get into the 2019 roadmap, here is a quick recap of what we did in 2018. + +### Correctness + +1. **Write autoscaler end-to-end tests**: we put a few key autoscaling end-to-end tests in place. [AutoscaleUpDownUp](https://github.com/knative/serving/blob/51b74ba2b78b96fa4b7db3181b4a1c84c2758168/test/e2e/autoscale_test.go#L275) broadly covers scaling from N to 0 and back again. [AutoscaleUpCountPods](https://github.com/knative/serving/blob/51b74ba2b78b96fa4b7db3181b4a1c84c2758168/test/e2e/autoscale_test.go#L327) asserts autoscaler stability and reactivity. +2. **Test error rates at high scale** TODO: STATE OF THE WORLD +3. **Test error rates around idle states**: the AutoscaleUpDownUp end-to-end test has done a good job of flushing out a variety of edge cases and errors during idle and transition states. TODO: EXAMPLES + +### Performance + +1. 
**Establish canonical load test scenarios**: TODO: STATE OF THE WORLD +2. **Reproducible load tests**: +3. **Vertical pod autoscaling**: + +### Scale to Zero + +1. **Implement scale to zero**: +2. **Reduce Reserve Revision start time**: + +### Development + +1. **Decouple autoscaling from revision controller**: + +### Integration + +1. **Autoscaler multitenancy**: +2. **Consume custom metrics API**: +3. **Autoscale queue-based workloads**: diff --git a/docs/roadmap/scaling-2019.md b/docs/roadmap/scaling-2019.md index c7af40e108e1..1c3a1b8486cb 100644 --- a/docs/roadmap/scaling-2019.md +++ b/docs/roadmap/scaling-2019.md @@ -1,42 +1,53 @@ # 2019 Autoscaling Roadmap -## 2018 Recap +This is what we hope to accomplish in 2019. -Before we get into the 2019 roadmap, here is a quick recap of what we did in 2018. +## 2019 Goals -### Correctness +### Sub-Second Cold Start -1. **Write autoscaler end-to-end tests**: we put a few key autoscaling end-to-end tests in place. [AutoscaleUpDownUp](https://github.com/knative/serving/blob/51b74ba2b78b96fa4b7db3181b4a1c84c2758168/test/e2e/autoscale_test.go#L275) broadly covers scaling from N to 0 and back again. [AutoscaleUpCountPods](https://github.com/knative/serving/blob/51b74ba2b78b96fa4b7db3181b4a1c84c2758168/test/e2e/autoscale_test.go#L327) asserts autoscaler stability and reactivity. -2. **Test error rates at high scale** TODO: STATE OF THE WORLD -3. **Test error rates around idle states**: the AutoscaleUpDownUp end-to-end test has done a good job of flushing out a variety of edge cases and errors during idle and transition states. TODO: EXAMPLES +Serverless is only as good as the illusion we create. Throwing some code into a "serverless" framework, the expectation is that it will just be running when it needs to, with as many resources as necessary. Including zero. But to realize that magical threshold and maintain the illusion of serverless, the code must come back as if it was never gone. The exact latency requirement of "as if it was never gone" will vary from use-case to use-case. But generally less than one second is a good start. -### Performance +Right now cold-starts are between 10 and 15 seconds which is an order of magnitude too slow. The time is spent starting the pod, waiting for Envoy to start and telling all nodes how to reach the pod through the Kubernetes Service. Without the Istio mesh (just routing request to individual pods as they come up) still takes about 4 seconds. -1. **Establish canonical load test scenarios**: TODO: STATE OF THE WORLD -2. **Reproducible load tests**: -3. **Vertical pod autoscaling**: +This area requires some dedicated effort to: -### Scale to Zero +1. identify and programatically capture sources of cold-start latency at all levels of the stack ([#2495](https://github.com/knative/serving/issues/2495)) +2. chase down the low hanging fruit (e.g. [#2659](https://github.com/knative/serving/issues/2659)) +3. architect solutions to larger chunks of cold-start latency -1. **Implement scale to zero**: -2. **Reduce Reserve Revision start time**: +The goal is to achieve sub-second average cold-starts by the end of the year. -### Development +### Overload Handling -1. **Decouple autoscaling from revision controller**: +Knative Serving provides concurrency controls to limit the number of requests a container must handle simultaneously. Additionally, each pod has a queue for holding requests when the container concurrency limit has been reached. 
When the pod-level queue has been filled to capacity, subsequent request are rejected with 503 "overload". -### Integration +This is desireable to protect the Pod from being overloaded. But in the aggregate the behavior is not ideal for situations when autoscaling needs some time to react to sudden increases in request load (e.g. scale-from-zero). -1. **Autoscaler multitenancy**: -2. **Consume custom metrics API**: -3. **Autoscale queue-based workloads**: +The goal of Overload Handling is to enqueue requests at a revision-level. Scale-from-zero should not overload if autoscaling can react in a reasonable amount of time to provide additional pods. When new pods come online, they should be able to take load for the existing pods. Even when scaled above zero, brief spikes of overload should be handled by enqueuing requests at a revision-level. The depth of the revision-level queue should also be configurable because even the Revision as a whole needs to guard against overload. -## 2019 Goals +This overall problem is closely related to both Networking and Autoscaling, two different working groups. Much of the overload handling will be implemented in the Activator, which is a component most closely related to ingress, part of the Networking WG's charter. So this project is shared jointly between the two working groups. -### Sub-Second Cold Start +### Autoscaler Availability + +Because Knative scales to zero, autoscaling is in the critical-path for serving requests. If the autoscaler isn't available when an idle Revision receives a request, that request will not be served. Other components such as the Activator are in this situation too. But they are more stateless and so can be scaled horizontally relatively easily. For example, any Activator can proxy a request for any Revision. All it has to do is send a messge to the Autoscaler and then wait for a Pod to send the request to. Then it proxies the request and it take back out of the serving path. + +However the Autoscaler process is more stateful. It maintains request statistics over a window of time. And it must process data from the Revision Pods continously to maintain that window of data. It is part of the system all the time, not just when scaled to zero. As the number of Revisions and the number of Pods in each Revision increases, the CPU and memory requirements will exceed that available to a single process. Some sharding is necessary. ### Streaming Autoscaling -### Overload Handling +In addition to being always available, Web applications are expected to be responsive and connected. Continuously connected and streaming protocols like Websockets and HTTP2 are essential to a modern application. + +Knative Serving accepts HTTP2 connections. And will serve requests multiplexed within the connection. But the autoscaling subsystem doesn't quite know what to do with those connections. It sees each connection as continuous load on the system and so will autoscale accordingly. + +But the actual load is in the stream within the connection. So the metrics reported to the Autoscaler should be based on the number of concurrent **streams**. This requires some work in the Queue proxy to crack open the connection and emit stream metrics. + +Additionally, concurrency limits should be applied to streams, not connections. So containers which can handle only one request at at time should still be able to serve HTTP2. The Queue proxy will just allow one stream through at a time. 
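To make the stream-versus-connection distinction concrete: Go's HTTP/2 server already delivers each stream to the handler as a separate request, so a queue-proxy-style wrapper can meter and limit streams without parsing frames itself. The sketch below is illustrative only (it is not the actual queue-proxy code); the `breaker` type and its fields are invented for this example.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// breaker limits in-flight streams and counts them for the autoscaler.
// Behind TLS or an h2c wrapper, Go's net/http hands every HTTP/2 stream
// to the handler as its own request, so this limits streams, not connections.
type breaker struct {
	sem      chan struct{} // capacity = containerConcurrency
	inFlight int64         // concurrent streams, the metric to report
}

func newBreaker(containerConcurrency int) *breaker {
	return &breaker{sem: make(chan struct{}, containerConcurrency)}
}

func (b *breaker) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		b.sem <- struct{}{} // block until a stream slot frees up
		atomic.AddInt64(&b.inFlight, 1)
		defer func() {
			atomic.AddInt64(&b.inFlight, -1)
			<-b.sem
		}()
		next.ServeHTTP(w, r)
	})
}

func main() {
	b := newBreaker(1) // containerConcurrency: 1
	http.ListenAndServe(":8080", b.wrap(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) { w.Write([]byte("ok\n")) })))
}
```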
### Vertical Pod Autoscaling Beta + +Another aspect of the "serverless" illusion is figure out what code needs to run and running it efficiently. Knative has default resources request. And it supports resource requests and limit from the user. But if the user doesn't want to spend their time "tuning" resources, which is a very "serverful" way to spend your time, Vertical Pod Autoscaling (VPA) is needed. + +Knative previously integrated with VPA Alpha. Now it needs to reintegrate with VPA Beta. In addition to creating VPA resources for each Revision, we need to do a little bookkeeping for the unique requirements of serverless workloads. For example, the window for VPA recommendations is 2 weeks. But a serverless function might be invoked once per year (e.g. when the fire alarm gets pulled). The Pods should come back with the correct resource requests and limits. The way VPA is architected, it "injects" the correct recommendations and so it would fail to use the right resources after 2 weeks of inactivity. Knative needs to remember what that recommendation was and make sure new Pods start at the right levels. + +Additionally, one Revision should learn from the previous. But it must not taint the previous Revision's state. For example, when a Service is in runLatest mode, the next Revision should start from the resource requests of the previous. Then VPA will apply learning on top of that to adjust for changes in the application behavior. However if the next Revision goes crazy because of bad recommendations, a quick rollback to the previous should pick up the good ones. Again, this requires a little bit of bookkeeping in Knative. From 03bbda8fb40147daad14c448f372085b02d5d097 Mon Sep 17 00:00:00 2001 From: Joseph Burnett Date: Wed, 30 Jan 2019 17:18:38 +0000 Subject: [PATCH 03/18] Goals, POCs and Github projects for each. --- docs/roadmap/scaling-2019.md | 85 ++++++++++++++++++++++++++++-------- 1 file changed, 67 insertions(+), 18 deletions(-) diff --git a/docs/roadmap/scaling-2019.md b/docs/roadmap/scaling-2019.md index 1c3a1b8486cb..017a8878bb83 100644 --- a/docs/roadmap/scaling-2019.md +++ b/docs/roadmap/scaling-2019.md @@ -6,48 +6,97 @@ This is what we hope to accomplish in 2019. ### Sub-Second Cold Start -Serverless is only as good as the illusion we create. Throwing some code into a "serverless" framework, the expectation is that it will just be running when it needs to, with as many resources as necessary. Including zero. But to realize that magical threshold and maintain the illusion of serverless, the code must come back as if it was never gone. The exact latency requirement of "as if it was never gone" will vary from use-case to use-case. But generally less than one second is a good start. +Serverless is only as good as the illusion it sustains. Throwing some code into a "serverless" framework, the expectation is that it will just be running when it needs to, with as many resources as necessary. Including zero. But to realize that magical threshold and maintain the illusion of serverless, the code must come back as if it was never gone. The exact latency requirement of "as if it was never gone" will vary from use-case to use-case. But generally less than one second is a good start. Right now cold-starts are between 10 and 15 seconds which is an order of magnitude too slow. The time is spent starting the pod, waiting for Envoy to start and telling all nodes how to reach the pod through the Kubernetes Service. 
Without the Istio mesh (just routing request to individual pods as they come up) still takes about 4 seconds. -This area requires some dedicated effort to: +We've poked at this problem in 2019 ([#1297](https://github.com/knative/serving/issues/1297)) but haven't been able to make significant progress. This area requires some dedicated effort to: 1. identify and programatically capture sources of cold-start latency at all levels of the stack ([#2495](https://github.com/knative/serving/issues/2495)) 2. chase down the low hanging fruit (e.g. [#2659](https://github.com/knative/serving/issues/2659)) 3. architect solutions to larger chunks of cold-start latency -The goal is to achieve sub-second average cold-starts by the end of the year. +**Our goal is to achieve sub-second average cold-starts by the end of the year.** + +* POC: Greg Haynes (IBM) +* Github: [Project 8](https://github.com/knative/serving/projects/8) ### Overload Handling -Knative Serving provides concurrency controls to limit the number of requests a container must handle simultaneously. Additionally, each pod has a queue for holding requests when the container concurrency limit has been reached. When the pod-level queue has been filled to capacity, subsequent request are rejected with 503 "overload". +Knative Serving provides concurrency controls to limit the number of requests a container must handle simultaneously. Additionally, each pod has a queue for holding requests when the container concurrency limit has been reached. When the pod-level queue overflows, subsequent request are rejected with 503 "overload". + +This is desireable to protect the Pod from being overloaded. But in the aggregate the behavior is not ideal for situations when autoscaling needs some time to react to sudden increases in request load (e.g. scale-from-zero). + + The goal of Overload Handling is to enqueue requests at a revision-level. Scale-from-zero should not overload if autoscaling can react in a reasonable amount of time to provide additional pods. When new pods come online, they should be able to take load for the existing pods. Even when scaled above zero, brief spikes of overload should be handled by enqueuing requests at a revision-level. The depth of the revision-level queue should also be configurable because even the Revision as a whole needs to guard against overload. + +The overall problem touches on both Networking and Autoscaling, two different working groups. Much of the overload handling will be implemented in the Activator, which is a part of ingress. So this project is shared jointly between the two working groups. + +**Our goal is for a Revision with 1 Pod and a container concurrency of 1 to handle 1000 requests arriving simultaneously within 30 seconds with no errors, where each request sleeps for 100ms.** + +E.g. +```yaml +apiVersion: serving.knative.dev/v1alpha1 +kind: Service +metadata: + name: overload-test +spec: + runLatest: + configuration: + revisionTemplate: + metadata: + annotations: + autoscaling.knative.dev/minScale: "1" + autoscaling.knative.dev/maxScale: "10" + spec: + containerConcurrency: 1 + container: + image: gcr.io/joe-does-knative/sleep:latest + env: + - name: SLEEP_TIME + value: "100ms" +``` +```bash +for i in `seq 1 1000`; do curl http://$MY_IP & done +``` + +This verifies that 1) we can handle all requests in an overload without error and 2) all the requests don't land on a single pod, which would take 100 sec. 
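For reference, the `sleep` image used above is not included in this repository; a minimal sketch of such a test server is shown below, assuming `SLEEP_TIME` holds a Go-style duration such as `100ms` and that the container listens on `$PORT` (defaulting to 8080) as Knative expects.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"time"
)

func main() {
	// SLEEP_TIME is the artificial per-request latency, e.g. "100ms".
	d, err := time.ParseDuration(os.Getenv("SLEEP_TIME"))
	if err != nil {
		log.Fatalf("invalid SLEEP_TIME: %v", err)
	}
	port := os.Getenv("PORT")
	if port == "" {
		port = "8080"
	}
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(d) // hold the single concurrency slot for the full latency
		fmt.Fprintf(w, "slept %v\n", d)
	})
	log.Fatal(http.ListenAndServe(":"+port, nil))
}
```

With `containerConcurrency: 1` and 100ms per request, one pod can serve roughly 10 requests per second, so the success criterion depends on the burst being spread across newly created pods rather than queued on a single one.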
+ +* POC: Vadim Raskin (IBM) +* Github: [Project 7](https://github.com/knative/serving/projects/7) -This is desireable to protect the Pod from being overloaded. But in the aggregate the behavior is not ideal for situations when autoscaling needs some time to react to sudden increases in request load (e.g. scale-from-zero). +### Autoscaler Availability -The goal of Overload Handling is to enqueue requests at a revision-level. Scale-from-zero should not overload if autoscaling can react in a reasonable amount of time to provide additional pods. When new pods come online, they should be able to take load for the existing pods. Even when scaled above zero, brief spikes of overload should be handled by enqueuing requests at a revision-level. The depth of the revision-level queue should also be configurable because even the Revision as a whole needs to guard against overload. +Because Knative scales to zero, autoscaling is in the critical-path for serving requests. If the autoscaler isn't available when an idle Revision receives a request, that request will not be served. Other components such as the Activator are in this situation too. But they are more stateless and so can be scaled horizontally relatively easily. For example, any Activator can proxy any request for any Revision. All it has to do is send a messge to the Autoscaler and then wait for a Pod to show up. Then it proxies the request and is taken back out of the serving path. -This overall problem is closely related to both Networking and Autoscaling, two different working groups. Much of the overload handling will be implemented in the Activator, which is a component most closely related to ingress, part of the Networking WG's charter. So this project is shared jointly between the two working groups. +However the Autoscaler process is more stateful. It maintains request statistics over a window of time. And it must process data from the Revision Pods continously to maintain that window of data. It is part of the running system all the time, not just when scaled to zero. As the number of Revisions increase and the number of Pods in each Revision increases, the CPU and memory requirements will exceed that available to a single process. So some sharding is necessary. -### Autoscaler Availability +**Our goal is to shard Revisions across Autoscaler replicas (e.g. 2x the Replica count means 1/2 the load on each Autoscaler). And for autoscaling to be unaffected by an individual Autoscaler termination.** -Because Knative scales to zero, autoscaling is in the critical-path for serving requests. If the autoscaler isn't available when an idle Revision receives a request, that request will not be served. Other components such as the Activator are in this situation too. But they are more stateless and so can be scaled horizontally relatively easily. For example, any Activator can proxy a request for any Revision. All it has to do is send a messge to the Autoscaler and then wait for a Pod to send the request to. Then it proxies the request and it take back out of the serving path. - -However the Autoscaler process is more stateful. It maintains request statistics over a window of time. And it must process data from the Revision Pods continously to maintain that window of data. It is part of the system all the time, not just when scaled to zero. As the number of Revisions and the number of Pods in each Revision increases, the CPU and memory requirements will exceed that available to a single process. Some sharding is necessary. 
+* POC: Kenny Leung (Google) +* Github: [Project 19](https://github.com/knative/serving/projects/19) ### Streaming Autoscaling -In addition to being always available, Web applications are expected to be responsive and connected. Continuously connected and streaming protocols like Websockets and HTTP2 are essential to a modern application. - -Knative Serving accepts HTTP2 connections. And will serve requests multiplexed within the connection. But the autoscaling subsystem doesn't quite know what to do with those connections. It sees each connection as continuous load on the system and so will autoscale accordingly. +In addition to being always available, Web applications are expected to be responsive. Long-lived connections and streaming protocols like Websockets and HTTP2 are essential. Knative Serving accepts HTTP2 connections. And will serve requests multiplexed within the connection. But the autoscaling subsystem doesn't quite know what to do with those connections. It sees each connection as continuous load on the system and so will autoscale accordingly. -But the actual load is in the stream within the connection. So the metrics reported to the Autoscaler should be based on the number of concurrent **streams**. This requires some work in the Queue proxy to crack open the connection and emit stream metrics. +But the actual load is in the stream within the connection. So the metrics reported to the Autoscaler should be based on the number of concurrent *streams*. This requires some work in the Queue proxy to crack open the connection and emit stream metrics. Additionally, concurrency limits should be applied to streams, not connections. So containers which can handle only one request at at time should still be able to serve HTTP2. The Queue proxy will just allow one stream through at a time. +**Our goal is 1) to support HTTP2 end-to-end while scaling on concurrent streams and in this mode 2) enforce concurrency limits on streams (not connections).** + +* POC: Markus Thömmes (Red Hat) +* Github: [Project 16](https://github.com/knative/serving/projects/16) + ### Vertical Pod Autoscaling Beta -Another aspect of the "serverless" illusion is figure out what code needs to run and running it efficiently. Knative has default resources request. And it supports resource requests and limit from the user. But if the user doesn't want to spend their time "tuning" resources, which is a very "serverful" way to spend your time, Vertical Pod Autoscaling (VPA) is needed. +Another dimension of the serverless illusion is running code efficiently. Knative has default resources request. And it supports resource requests and limits from the user. But if the user doesn't want to spend their time "tuning" resources, which is a very "serverful" way to spend one's time, Knative should be able to just "figure it out". That is Vertical Pod Autoscaling (VPA). + +Knative previously integrated with VPA Alpha. Now it needs to reintegrate with VPA Beta. In addition to creating VPA resources for each Revision, we need to do a little bookkeeping for the unique requirements of serverless workloads. For example, the window for VPA recommendations is 2 weeks. But a serverless function might be invoked once per year (e.g. when the fire alarm gets pulled). The Pods should come back with the correct resource requests and limits. The way VPA is architected, it "injects" the correct recommendations via mutating webhook. 
It will decline to update resources requests after 2 weeks of inactivity and the Revision would fall back to defaults. Knative needs to remember what that recommendation was and make sure new Pods start at the right levels. + +Additionally, the next Revision should learn from the previous. But it must not taint the previous Revision's state. For example, when a Service is in runLatest mode, the next Revision should start from the resource recommendations of the previous. Then VPA will apply learning on top of that to adjust for changes in the application behavior. However if the next Revision goes crazy because of bad recommendations, a quick rollback to the previous should pick up the good ones. Again, this requires a little bit of bookkeeping in Knative. -Knative previously integrated with VPA Alpha. Now it needs to reintegrate with VPA Beta. In addition to creating VPA resources for each Revision, we need to do a little bookkeeping for the unique requirements of serverless workloads. For example, the window for VPA recommendations is 2 weeks. But a serverless function might be invoked once per year (e.g. when the fire alarm gets pulled). The Pods should come back with the correct resource requests and limits. The way VPA is architected, it "injects" the correct recommendations and so it would fail to use the right resources after 2 weeks of inactivity. Knative needs to remember what that recommendation was and make sure new Pods start at the right levels. +**Our goal is support VPA enabled per-revision with 1) revision-to-revision inheritance to recommendations (when appropriate) and 2) safe rollback to previous recommendations when rolling back to previous Revisions.** -Additionally, one Revision should learn from the previous. But it must not taint the previous Revision's state. For example, when a Service is in runLatest mode, the next Revision should start from the resource requests of the previous. Then VPA will apply learning on top of that to adjust for changes in the application behavior. However if the next Revision goes crazy because of bad recommendations, a quick rollback to the previous should pick up the good ones. Again, this requires a little bit of bookkeeping in Knative. +* POC: Joseph Burnett (Google) +* Github: [Project 18](https://github.com/knative/serving/projects/18) From b853f33b0c5b78b5418e6b828390cac954a17421 Mon Sep 17 00:00:00 2001 From: Joseph Burnett Date: Wed, 30 Jan 2019 17:19:40 +0000 Subject: [PATCH 04/18] Remove recap (will do later). --- docs/roadmap/scaling-2018-recap.md | 30 ------------------------------ 1 file changed, 30 deletions(-) delete mode 100644 docs/roadmap/scaling-2018-recap.md diff --git a/docs/roadmap/scaling-2018-recap.md b/docs/roadmap/scaling-2018-recap.md deleted file mode 100644 index acfd0862f95a..000000000000 --- a/docs/roadmap/scaling-2018-recap.md +++ /dev/null @@ -1,30 +0,0 @@ -# 2018 Recap Autoscaling Roadmap - -Before we get into the 2019 roadmap, here is a quick recap of what we did in 2018. - -### Correctness - -1. **Write autoscaler end-to-end tests**: we put a few key autoscaling end-to-end tests in place. [AutoscaleUpDownUp](https://github.com/knative/serving/blob/51b74ba2b78b96fa4b7db3181b4a1c84c2758168/test/e2e/autoscale_test.go#L275) broadly covers scaling from N to 0 and back again. [AutoscaleUpCountPods](https://github.com/knative/serving/blob/51b74ba2b78b96fa4b7db3181b4a1c84c2758168/test/e2e/autoscale_test.go#L327) asserts autoscaler stability and reactivity. -2. 
**Test error rates at high scale** TODO: STATE OF THE WORLD -3. **Test error rates around idle states**: the AutoscaleUpDownUp end-to-end test has done a good job of flushing out a variety of edge cases and errors during idle and transition states. TODO: EXAMPLES - -### Performance - -1. **Establish canonical load test scenarios**: TODO: STATE OF THE WORLD -2. **Reproducible load tests**: -3. **Vertical pod autoscaling**: - -### Scale to Zero - -1. **Implement scale to zero**: -2. **Reduce Reserve Revision start time**: - -### Development - -1. **Decouple autoscaling from revision controller**: - -### Integration - -1. **Autoscaler multitenancy**: -2. **Consume custom metrics API**: -3. **Autoscale queue-based workloads**: From ca07b8a2d7800fcea9d41f625c393c25a7378b30 Mon Sep 17 00:00:00 2001 From: Joseph Burnett Date: Wed, 30 Jan 2019 17:22:10 +0000 Subject: [PATCH 05/18] Remove indent. --- docs/roadmap/scaling-2019.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/roadmap/scaling-2019.md b/docs/roadmap/scaling-2019.md index 017a8878bb83..6f507d7abcf4 100644 --- a/docs/roadmap/scaling-2019.md +++ b/docs/roadmap/scaling-2019.md @@ -27,7 +27,7 @@ Knative Serving provides concurrency controls to limit the number of requests a This is desireable to protect the Pod from being overloaded. But in the aggregate the behavior is not ideal for situations when autoscaling needs some time to react to sudden increases in request load (e.g. scale-from-zero). - The goal of Overload Handling is to enqueue requests at a revision-level. Scale-from-zero should not overload if autoscaling can react in a reasonable amount of time to provide additional pods. When new pods come online, they should be able to take load for the existing pods. Even when scaled above zero, brief spikes of overload should be handled by enqueuing requests at a revision-level. The depth of the revision-level queue should also be configurable because even the Revision as a whole needs to guard against overload. +The goal of Overload Handling is to enqueue requests at a revision-level. Scale-from-zero should not overload if autoscaling can react in a reasonable amount of time to provide additional pods. When new pods come online, they should be able to take load for the existing pods. Even when scaled above zero, brief spikes of overload should be handled by enqueuing requests at a revision-level. The depth of the revision-level queue should also be configurable because even the Revision as a whole needs to guard against overload. The overall problem touches on both Networking and Autoscaling, two different working groups. Much of the overload handling will be implemented in the Activator, which is a part of ingress. So this project is shared jointly between the two working groups. From e9c4c1ad2decfe0da729bf5a9c9c8307a04bf293 Mon Sep 17 00:00:00 2001 From: Joseph Burnett Date: Fri, 1 Feb 2019 20:24:13 +0000 Subject: [PATCH 06/18] Add Pluggability and HPA line item. --- docs/roadmap/scaling-2019.md | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/docs/roadmap/scaling-2019.md b/docs/roadmap/scaling-2019.md index 6f507d7abcf4..69e4dc8c74f9 100644 --- a/docs/roadmap/scaling-2019.md +++ b/docs/roadmap/scaling-2019.md @@ -10,7 +10,7 @@ Serverless is only as good as the illusion it sustains. Throwing some code into Right now cold-starts are between 10 and 15 seconds which is an order of magnitude too slow. 
The time is spent starting the pod, waiting for Envoy to start and telling all nodes how to reach the pod through the Kubernetes Service. Without the Istio mesh (just routing request to individual pods as they come up) still takes about 4 seconds. -We've poked at this problem in 2019 ([#1297](https://github.com/knative/serving/issues/1297)) but haven't been able to make significant progress. This area requires some dedicated effort to: +We've poked at this problem in 2018 ([#1297](https://github.com/knative/serving/issues/1297)) but haven't been able to make significant progress. This area requires some dedicated effort to: 1. identify and programatically capture sources of cold-start latency at all levels of the stack ([#2495](https://github.com/knative/serving/issues/2495)) 2. chase down the low hanging fruit (e.g. [#2659](https://github.com/knative/serving/issues/2659)) @@ -88,6 +88,19 @@ Additionally, concurrency limits should be applied to streams, not connections. * POC: Markus Thömmes (Red Hat) * Github: [Project 16](https://github.com/knative/serving/projects/16) +### Pluggability and HPA + +This is work remaining from 2018 to add CPU-based autoscaling to Knative and provide an extension point for further customizing the autoscaling sub-system. Remaining work includes: + +1. metrics pipeline relayering to scrape metrics from Pods ([#1927](https://github.com/knative/serving/issues/1927)) +2. adding a `window` annotation to allow for further customization of the KPA autoscaler ([#2909](https://github.com/knative/serving/issues/2909)) +3. implementing scale-to-zero for CPU-scaled workloads ([#3064](https://github.com/knative/serving/issues/3064)) + +**Our goal is to have a cleanly-layered, extensible autoscaling sub-system which fully supports concurrency and CPU metrics (including scale-to-zero).** + +* POC: Joseph Burnett (Google) +* Github: [Project 11](https://github.com/knative/serving/projects/11) + ### Vertical Pod Autoscaling Beta Another dimension of the serverless illusion is running code efficiently. Knative has default resources request. And it supports resource requests and limits from the user. But if the user doesn't want to spend their time "tuning" resources, which is a very "serverful" way to spend one's time, Knative should be able to just "figure it out". That is Vertical Pod Autoscaling (VPA). From 4a3ca3fa09a567caa4f3ab630857c271681dd547 Mon Sep 17 00:00:00 2001 From: Joseph Burnett Date: Fri, 1 Feb 2019 20:25:43 +0000 Subject: [PATCH 07/18] Yanwei as POC for layering. 
--- docs/roadmap/scaling-2019.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/roadmap/scaling-2019.md b/docs/roadmap/scaling-2019.md index 69e4dc8c74f9..12d7aec5ac58 100644 --- a/docs/roadmap/scaling-2019.md +++ b/docs/roadmap/scaling-2019.md @@ -98,7 +98,7 @@ This is work remaining from 2018 to add CPU-based autoscaling to Knative and pro **Our goal is to have a cleanly-layered, extensible autoscaling sub-system which fully supports concurrency and CPU metrics (including scale-to-zero).** -* POC: Joseph Burnett (Google) +* POC: Yanwei Guo (Google) * Github: [Project 11](https://github.com/knative/serving/projects/11) ### Vertical Pod Autoscaling Beta From a1281255c2607a5147f91ab757176f7cee1a252a Mon Sep 17 00:00:00 2001 From: mattmoor-sockpuppet Date: Fri, 8 Feb 2019 08:38:28 -0800 Subject: [PATCH 08/18] Update docs/roadmap/scaling-2019.md Co-Authored-By: josephburnett --- docs/roadmap/scaling-2019.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/roadmap/scaling-2019.md b/docs/roadmap/scaling-2019.md index 12d7aec5ac58..f80137f34ea8 100644 --- a/docs/roadmap/scaling-2019.md +++ b/docs/roadmap/scaling-2019.md @@ -25,7 +25,7 @@ We've poked at this problem in 2018 ([#1297](https://github.com/knative/serving/ Knative Serving provides concurrency controls to limit the number of requests a container must handle simultaneously. Additionally, each pod has a queue for holding requests when the container concurrency limit has been reached. When the pod-level queue overflows, subsequent request are rejected with 503 "overload". -This is desireable to protect the Pod from being overloaded. But in the aggregate the behavior is not ideal for situations when autoscaling needs some time to react to sudden increases in request load (e.g. scale-from-zero). +This is desirable to protect the Pod from being overloaded. But in the aggregate the behavior is not ideal for situations when autoscaling needs some time to react to sudden increases in request load (e.g. scale-from-zero). The goal of Overload Handling is to enqueue requests at a revision-level. Scale-from-zero should not overload if autoscaling can react in a reasonable amount of time to provide additional pods. When new pods come online, they should be able to take load for the existing pods. Even when scaled above zero, brief spikes of overload should be handled by enqueuing requests at a revision-level. The depth of the revision-level queue should also be configurable because even the Revision as a whole needs to guard against overload. From 69370815b7e0ae7cd1878e968ff30e5faf718080 Mon Sep 17 00:00:00 2001 From: mattmoor-sockpuppet Date: Fri, 8 Feb 2019 08:38:43 -0800 Subject: [PATCH 09/18] Update docs/roadmap/scaling-2019.md Co-Authored-By: josephburnett --- docs/roadmap/scaling-2019.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/roadmap/scaling-2019.md b/docs/roadmap/scaling-2019.md index f80137f34ea8..278f36cb5a4e 100644 --- a/docs/roadmap/scaling-2019.md +++ b/docs/roadmap/scaling-2019.md @@ -68,7 +68,7 @@ This verifies that 1) we can handle all requests in an overload without error an Because Knative scales to zero, autoscaling is in the critical-path for serving requests. If the autoscaler isn't available when an idle Revision receives a request, that request will not be served. Other components such as the Activator are in this situation too. But they are more stateless and so can be scaled horizontally relatively easily. 
For example, any Activator can proxy any request for any Revision. All it has to do is send a messge to the Autoscaler and then wait for a Pod to show up. Then it proxies the request and is taken back out of the serving path. -However the Autoscaler process is more stateful. It maintains request statistics over a window of time. And it must process data from the Revision Pods continously to maintain that window of data. It is part of the running system all the time, not just when scaled to zero. As the number of Revisions increase and the number of Pods in each Revision increases, the CPU and memory requirements will exceed that available to a single process. So some sharding is necessary. +However the Autoscaler process is more stateful. It maintains request statistics over a window of time. And it must process data from the Revision Pods continuously to maintain that window of data. It is part of the running system all the time, not just when scaled to zero. As the number of Revisions increase and the number of Pods in each Revision increases, the CPU and memory requirements will exceed that available to a single process. So some sharding is necessary. **Our goal is to shard Revisions across Autoscaler replicas (e.g. 2x the Replica count means 1/2 the load on each Autoscaler). And for autoscaling to be unaffected by an individual Autoscaler termination.** From f9ef30a890ed7245ca8f585c4f06660e870c1766 Mon Sep 17 00:00:00 2001 From: Joseph Burnett Date: Fri, 8 Feb 2019 08:43:01 -0800 Subject: [PATCH 10/18] Clarify overload handling for 0 and non-0 cases. --- docs/roadmap/scaling-2019.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/roadmap/scaling-2019.md b/docs/roadmap/scaling-2019.md index 278f36cb5a4e..1cfcb1e710e0 100644 --- a/docs/roadmap/scaling-2019.md +++ b/docs/roadmap/scaling-2019.md @@ -25,7 +25,7 @@ We've poked at this problem in 2018 ([#1297](https://github.com/knative/serving/ Knative Serving provides concurrency controls to limit the number of requests a container must handle simultaneously. Additionally, each pod has a queue for holding requests when the container concurrency limit has been reached. When the pod-level queue overflows, subsequent request are rejected with 503 "overload". -This is desirable to protect the Pod from being overloaded. But in the aggregate the behavior is not ideal for situations when autoscaling needs some time to react to sudden increases in request load (e.g. scale-from-zero). +This is desirable to protect the Pod from being overloaded. But in the aggregate the behavior is not ideal for situations when autoscaling needs some time to react to sudden increases in request load. This could be when the revision is scaled to zero. Or when the revision is already running some pods, but not nearly enough. The goal of Overload Handling is to enqueue requests at a revision-level. Scale-from-zero should not overload if autoscaling can react in a reasonable amount of time to provide additional pods. When new pods come online, they should be able to take load for the existing pods. Even when scaled above zero, brief spikes of overload should be handled by enqueuing requests at a revision-level. The depth of the revision-level queue should also be configurable because even the Revision as a whole needs to guard against overload. From 118b702e87e3c6c362e8ef36837939fcba1b04e8 Mon Sep 17 00:00:00 2001 From: Joseph Burnett Date: Mon, 25 Mar 2019 13:35:15 +0100 Subject: [PATCH 11/18] Refactor cold-start goal. 
--- docs/roadmap/scaling-2019.md | 28 ++++++++++++++++++---------- 1 file changed, 18 insertions(+), 10 deletions(-) diff --git a/docs/roadmap/scaling-2019.md b/docs/roadmap/scaling-2019.md index 1cfcb1e710e0..1215edb6b22d 100644 --- a/docs/roadmap/scaling-2019.md +++ b/docs/roadmap/scaling-2019.md @@ -2,24 +2,26 @@ This is what we hope to accomplish in 2019. -## 2019 Goals +## Performance ### Sub-Second Cold Start -Serverless is only as good as the illusion it sustains. Throwing some code into a "serverless" framework, the expectation is that it will just be running when it needs to, with as many resources as necessary. Including zero. But to realize that magical threshold and maintain the illusion of serverless, the code must come back as if it was never gone. The exact latency requirement of "as if it was never gone" will vary from use-case to use-case. But generally less than one second is a good start. +As a serverless framework, Knative should only run code when it needs to. Including scaling to zero when the Revision is not being used. However the Revison must also come back quickly, otherwise the illusion of "serverless" is broken--it must seem as if it was always there. Generally less than one second is a good start. -Right now cold-starts are between 10 and 15 seconds which is an order of magnitude too slow. The time is spent starting the pod, waiting for Envoy to start and telling all nodes how to reach the pod through the Kubernetes Service. Without the Istio mesh (just routing request to individual pods as they come up) still takes about 4 seconds. +Today cold-starts are between 10 and 15 seconds which is an order of magnitude too slow. The time is spent starting the pod, waiting for Envoy to initialize, and setting up routing. Without the Istio mesh (just routing request to individual pods as they come up) still takes about 4 seconds. We've poked at this problem in 2018 ([#1297](https://github.com/knative/serving/issues/1297)) but haven't made significant progress. This area requires some dedicated effort. -We've poked at this problem in 2018 ([#1297](https://github.com/knative/serving/issues/1297)) but haven't been able to make significant progress. This area requires some dedicated effort to: +One area of investment is to vet a local scheduling approach in which the Activator is given authority to schedule a Pod locally on the Node. This takes several layers out of the critical path for cold starts. -1. identify and programatically capture sources of cold-start latency at all levels of the stack ([#2495](https://github.com/knative/serving/issues/2495)) -2. chase down the low hanging fruit (e.g. [#2659](https://github.com/knative/serving/issues/2659)) -3. architect solutions to larger chunks of cold-start latency +**Goal**: achieve sub-second average cold-starts of disk-warm Revisions running in mesh-mode by the end of the year. -**Our goal is to achieve sub-second average cold-starts by the end of the year.** +**Key Steps**: +1. Capture cold start traces. +2. Track cold start latency over time by span. +3. Local scheduling design doc. 
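As an illustration of capturing cold-start traces (step 1 in the list above), the phases of a cold start could be recorded as nested spans and tracked over time by span name. The sketch below uses OpenCensus as one possible client; the span and phase names are invented for the example rather than taken from the serving code.

```go
package main

import (
	"context"
	"log"
	"time"

	"go.opencensus.io/trace"
)

// traceColdStart records one span per cold-start phase so the time spent
// in each (scheduling, sidecar startup, route programming) can be compared
// across releases. Phase names here are illustrative only.
func traceColdStart(ctx context.Context) {
	ctx, span := trace.StartSpan(ctx, "revision/cold-start")
	defer span.End()

	for _, phase := range []string{"schedule-pod", "start-proxy", "program-routes"} {
		_, phaseSpan := trace.StartSpan(ctx, phase)
		time.Sleep(10 * time.Millisecond) // stand-in for the real work
		phaseSpan.End()
	}
}

func main() {
	// A real exporter (Zipkin, Stackdriver, ...) would be registered here.
	trace.ApplyConfig(trace.Config{DefaultSampler: trace.AlwaysSample()})
	traceColdStart(context.Background())
	log.Println("cold-start spans recorded")
}
```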
-* POC: Greg Haynes (IBM) -* Github: [Project 8](https://github.com/knative/serving/projects/8) +POC: Greg Haynes (IBM) + +Github: [Project 8](https://github.com/knative/serving/projects/8) ### Overload Handling @@ -64,6 +66,8 @@ This verifies that 1) we can handle all requests in an overload without error an * POC: Vadim Raskin (IBM) * Github: [Project 7](https://github.com/knative/serving/projects/7) +## Reliability + ### Autoscaler Availability Because Knative scales to zero, autoscaling is in the critical-path for serving requests. If the autoscaler isn't available when an idle Revision receives a request, that request will not be served. Other components such as the Activator are in this situation too. But they are more stateless and so can be scaled horizontally relatively easily. For example, any Activator can proxy any request for any Revision. All it has to do is send a messge to the Autoscaler and then wait for a Pod to show up. Then it proxies the request and is taken back out of the serving path. @@ -88,6 +92,8 @@ Additionally, concurrency limits should be applied to streams, not connections. * POC: Markus Thömmes (Red Hat) * Github: [Project 16](https://github.com/knative/serving/projects/16) +## Extendability + ### Pluggability and HPA This is work remaining from 2018 to add CPU-based autoscaling to Knative and provide an extension point for further customizing the autoscaling sub-system. Remaining work includes: @@ -101,6 +107,8 @@ This is work remaining from 2018 to add CPU-based autoscaling to Knative and pro * POC: Yanwei Guo (Google) * Github: [Project 11](https://github.com/knative/serving/projects/11) +## What we're not doing yet + ### Vertical Pod Autoscaling Beta Another dimension of the serverless illusion is running code efficiently. Knative has default resources request. And it supports resource requests and limits from the user. But if the user doesn't want to spend their time "tuning" resources, which is a very "serverful" way to spend one's time, Knative should be able to just "figure it out". That is Vertical Pod Autoscaling (VPA). From e7b51f7f841cea04ea78b2a44637fdcaa0baab7f Mon Sep 17 00:00:00 2001 From: Joseph Burnett Date: Mon, 25 Mar 2019 13:57:02 +0100 Subject: [PATCH 12/18] Remove POC. --- docs/roadmap/scaling-2019.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/docs/roadmap/scaling-2019.md b/docs/roadmap/scaling-2019.md index 1215edb6b22d..dd3fab4a5a4b 100644 --- a/docs/roadmap/scaling-2019.md +++ b/docs/roadmap/scaling-2019.md @@ -19,9 +19,7 @@ One area of investment is to vet a local scheduling approach in which the Activa 2. Track cold start latency over time by span. 3. Local scheduling design doc. -POC: Greg Haynes (IBM) - -Github: [Project 8](https://github.com/knative/serving/projects/8) +**Project**: [Project 8](https://github.com/knative/serving/projects/8) ### Overload Handling From 10ec4156037d3bf707bee3967ef32655087296b4 Mon Sep 17 00:00:00 2001 From: Joseph Burnett Date: Mon, 25 Mar 2019 16:50:14 +0100 Subject: [PATCH 13/18] Autoscaler scalability. 
--- docs/roadmap/scaling-2019.md | 65 ++++++++++++++++-------------------- 1 file changed, 29 insertions(+), 36 deletions(-) diff --git a/docs/roadmap/scaling-2019.md b/docs/roadmap/scaling-2019.md index dd3fab4a5a4b..d2e284e00a03 100644 --- a/docs/roadmap/scaling-2019.md +++ b/docs/roadmap/scaling-2019.md @@ -25,52 +25,45 @@ One area of investment is to vet a local scheduling approach in which the Activa Knative Serving provides concurrency controls to limit the number of requests a container must handle simultaneously. Additionally, each pod has a queue for holding requests when the container concurrency limit has been reached. When the pod-level queue overflows, subsequent request are rejected with 503 "overload". -This is desirable to protect the Pod from being overloaded. But in the aggregate the behavior is not ideal for situations when autoscaling needs some time to react to sudden increases in request load. This could be when the revision is scaled to zero. Or when the revision is already running some pods, but not nearly enough. +This is desirable to protect the Pod from being overloaded. But the aggregate behavior is not ideal for situations when autoscaling needs some time to react to sudden increases in request load. This could be when the revision is scaled to zero. Or when the revision is already running some pods, but not nearly enough. -The goal of Overload Handling is to enqueue requests at a revision-level. Scale-from-zero should not overload if autoscaling can react in a reasonable amount of time to provide additional pods. When new pods come online, they should be able to take load for the existing pods. Even when scaled above zero, brief spikes of overload should be handled by enqueuing requests at a revision-level. The depth of the revision-level queue should also be configurable because even the Revision as a whole needs to guard against overload. +The goal of Overload Handling is to enqueue requests at a revision-level. Scale-from-zero should not overload if autoscaling can react in a reasonable amount of time to provide additional pods. When new pods come online, they should be able to take load from the existing pods. Even when scaled above zero, brief spikes of overload should be handled by enqueuing requests at a revision-level. The depth of the revision-level queue should also be configurable because even the Revision as a whole needs to guard against overload. The overall problem touches on both Networking and Autoscaling, two different working groups. Much of the overload handling will be implemented in the Activator, which is a part of ingress. So this project is shared jointly between the two working groups. -**Our goal is for a Revision with 1 Pod and a container concurrency of 1 to handle 1000 requests arriving simultaneously within 30 seconds with no errors, where each request sleeps for 100ms.** - -E.g. -```yaml -apiVersion: serving.knative.dev/v1alpha1 -kind: Service -metadata: - name: overload-test -spec: - runLatest: - configuration: - revisionTemplate: - metadata: - annotations: - autoscaling.knative.dev/minScale: "1" - autoscaling.knative.dev/maxScale: "10" - spec: - containerConcurrency: 1 - container: - image: gcr.io/joe-does-knative/sleep:latest - env: - - name: SLEEP_TIME - value: "100ms" -``` -```bash -for i in `seq 1 1000`; do curl http://$MY_IP & done -``` - -This verifies that 1) we can handle all requests in an overload without error and 2) all the requests don't land on a single pod, which would take 100 sec. 
- -* POC: Vadim Raskin (IBM) -* Github: [Project 7](https://github.com/knative/serving/projects/7) +**Goal**: requests can be enqueued at the revision-level in response to high load. + +**Key Steps**: +1. Handle overload gracefully in the Activator. Proxy requests at a rate the underlying Deployment can handle. Reject requests beyond a revision-level limit. +2. Wire the Activator into the serving path on overload. +3. E2E tests for overload scenarios. + +**Project**: [Project 7](https://github.com/knative/serving/projects/7) ## Reliability +### Autoscaling Availabilty + +Because Knative scales to zero, the autoscaling system is in the critical-path for serving requests. If the Autoscaler or Activator isn't available when an idle Revision receives a request, that request will not be served. The Activator is stateless and can be easily scaled horizontally. Any Activator can proxy any request for any Revision. But the Autoscaler process is stateful. It maintains request statistics over a window of time. + +We need a way for autoscaling to have higher availability than that of a single Pod. When an Autoscaler Pod fails, another one should take over, quickly. And the new Pod should make autoscaling decisions equivalent to what the failed Pod would have given the short history of data stored in memory. + +**Goal**: the autoscaling system should be highly available. + +**Key Steps**: +1. Autoscaler replication should be configurable. This will likely require leader election. +2. Activator replication should be configurable. +3. Any Activator Pod can provide metrics to all Autoscaler Pods. +4. E2E tests which validate autoscaling availability in the midst of a Autoscaler Pod failure. + +### Autoscaling Scalability + +And it must process data from the Revision Pods continuously to maintain that window of data. It is part of the running system all the time, not just when scaled to zero. As the number of Revisions increase and the number of Pods in each Revision increases, the CPU and memory requirements will exceed that available to a single process. So some sharding is necessary. + + ### Autoscaler Availability -Because Knative scales to zero, autoscaling is in the critical-path for serving requests. If the autoscaler isn't available when an idle Revision receives a request, that request will not be served. Other components such as the Activator are in this situation too. But they are more stateless and so can be scaled horizontally relatively easily. For example, any Activator can proxy any request for any Revision. All it has to do is send a messge to the Autoscaler and then wait for a Pod to show up. Then it proxies the request and is taken back out of the serving path. -However the Autoscaler process is more stateful. It maintains request statistics over a window of time. And it must process data from the Revision Pods continuously to maintain that window of data. It is part of the running system all the time, not just when scaled to zero. As the number of Revisions increase and the number of Pods in each Revision increases, the CPU and memory requirements will exceed that available to a single process. So some sharding is necessary. **Our goal is to shard Revisions across Autoscaler replicas (e.g. 2x the Replica count means 1/2 the load on each Autoscaler). 
And for autoscaling to be unaffected by an individual Autoscaler termination.** From ac143e37bc6b8b9b16ab5acb82fd6d73d78e9642 Mon Sep 17 00:00:00 2001 From: Joseph Burnett Date: Tue, 26 Mar 2019 11:35:56 +0100 Subject: [PATCH 14/18] More edits. --- docs/roadmap/scaling-2019.md | 33 ++++++++++++++------------------- 1 file changed, 14 insertions(+), 19 deletions(-) diff --git a/docs/roadmap/scaling-2019.md b/docs/roadmap/scaling-2019.md index d2e284e00a03..2114e0bb3c4b 100644 --- a/docs/roadmap/scaling-2019.md +++ b/docs/roadmap/scaling-2019.md @@ -56,36 +56,31 @@ We need a way for autoscaling to have higher availability than that of a single 3. Any Activator Pod can provide metrics to all Autoscaler Pods. 4. E2E tests which validate autoscaling availability in the midst of a Autoscaler Pod failure. -### Autoscaling Scalability - -And it must process data from the Revision Pods continuously to maintain that window of data. It is part of the running system all the time, not just when scaled to zero. As the number of Revisions increase and the number of Pods in each Revision increases, the CPU and memory requirements will exceed that available to a single process. So some sharding is necessary. - - -### Autoscaler Availability - +**Project**: TBD +### Autoscaling Scalability -**Our goal is to shard Revisions across Autoscaler replicas (e.g. 2x the Replica count means 1/2 the load on each Autoscaler). And for autoscaling to be unaffected by an individual Autoscaler termination.** +The Autoscaler process maintains Pod metric data points over a window of time and calculates average concurrency every 2 seconds. As the number and size of Revisions deployed to a cluster increases, so does the load on the Autoscaler. -* POC: Kenny Leung (Google) -* Github: [Project 19](https://github.com/knative/serving/projects/19) +We need some way to have sub-linear load on the Autoscaler as the Revision count increases. This could be a sharding scheme or simply deploying separate Autoscalers per namespace. -### Streaming Autoscaling +**Goal**: the Autoscaling system can scale sub-linearly with the number of Revisions. -In addition to being always available, Web applications are expected to be responsive. Long-lived connections and streaming protocols like Websockets and HTTP2 are essential. Knative Serving accepts HTTP2 connections. And will serve requests multiplexed within the connection. But the autoscaling subsystem doesn't quite know what to do with those connections. It sees each connection as continuous load on the system and so will autoscale accordingly. +**Key Steps**: +1. Automated load test to determine the current scalability limit. And to guard against regression. +2. Deploying an Autoscaler per namespace. -But the actual load is in the stream within the connection. So the metrics reported to the Autoscaler should be based on the number of concurrent *streams*. This requires some work in the Queue proxy to crack open the connection and emit stream metrics. +**Project**: TBD -Additionally, concurrency limits should be applied to streams, not connections. So containers which can handle only one request at at time should still be able to serve HTTP2. The Queue proxy will just allow one stream through at a time. 
+## Extendability -**Our goal is 1) to support HTTP2 end-to-end while scaling on concurrent streams and in this mode 2) enforce concurrency limits on streams (not connections).** +### Pluggability -* POC: Markus Thömmes (Red Hat) -* Github: [Project 16](https://github.com/knative/serving/projects/16) +It is possible to replace the autoscaling system by implementing an alternative PodAutoscaler reconciler (see the [Yolo controller](https://github.com/josephburnett/kubecon18)). However that requires the implementer to collect their own metrics, implement an autoscaling process, and actuate the recommendations. -## Extendability +It should be able to swap out smaller pieces of the autoscaling system. -### Pluggability and HPA +### HPA Integration This is work remaining from 2018 to add CPU-based autoscaling to Knative and provide an extension point for further customizing the autoscaling sub-system. Remaining work includes: From c0a717d012a3aaba91ca7e3c4bc3c58c19411903 Mon Sep 17 00:00:00 2001 From: Joseph Burnett Date: Wed, 27 Mar 2019 10:12:07 +0100 Subject: [PATCH 15/18] HPA Interation. --- docs/roadmap/scaling-2019.md | 48 ++++++++++++++++++++---------------- 1 file changed, 27 insertions(+), 21 deletions(-) diff --git a/docs/roadmap/scaling-2019.md b/docs/roadmap/scaling-2019.md index 2114e0bb3c4b..d5dbc4d3aae7 100644 --- a/docs/roadmap/scaling-2019.md +++ b/docs/roadmap/scaling-2019.md @@ -10,14 +10,15 @@ As a serverless framework, Knative should only run code when it needs to. Includ Today cold-starts are between 10 and 15 seconds which is an order of magnitude too slow. The time is spent starting the pod, waiting for Envoy to initialize, and setting up routing. Without the Istio mesh (just routing request to individual pods as they come up) still takes about 4 seconds. We've poked at this problem in 2018 ([#1297](https://github.com/knative/serving/issues/1297)) but haven't made significant progress. This area requires some dedicated effort. -One area of investment is to vet a local scheduling approach in which the Activator is given authority to schedule a Pod locally on the Node. This takes several layers out of the critical path for cold starts. +One area of investment is a local scheduling approach in which the Activator is given authority to schedule a Pod locally on the Node. This takes several layers out of the critical path for cold starts. **Goal**: achieve sub-second average cold-starts of disk-warm Revisions running in mesh-mode by the end of the year. **Key Steps**: 1. Capture cold start traces. 2. Track cold start latency over time by span. -3. Local scheduling design doc. +3. Performance test for cold starts. +4. Local scheduling. **Project**: [Project 8](https://github.com/knative/serving/projects/8) @@ -36,7 +37,7 @@ The overall problem touches on both Networking and Autoscaling, two different wo **Key Steps**: 1. Handle overload gracefully in the Activator. Proxy requests at a rate the underlying Deployment can handle. Reject requests beyond a revision-level limit. 2. Wire the Activator into the serving path on overload. -3. E2E tests for overload scenarios. +3. Performance tests for overload scenarios. **Project**: [Project 7](https://github.com/knative/serving/projects/7) @@ -46,9 +47,9 @@ The overall problem touches on both Networking and Autoscaling, two different wo Because Knative scales to zero, the autoscaling system is in the critical-path for serving requests. 
If the Autoscaler or Activator isn't available when an idle Revision receives a request, that request will not be served. The Activator is stateless and can be easily scaled horizontally. Any Activator can proxy any request for any Revision. But the Autoscaler process is stateful. It maintains request statistics over a window of time. -We need a way for autoscaling to have higher availability than that of a single Pod. When an Autoscaler Pod fails, another one should take over, quickly. And the new Pod should make autoscaling decisions equivalent to what the failed Pod would have given the short history of data stored in memory. +We need a way for autoscaling to have higher availability than that of a single Pod. When an Autoscaler Pod fails, another one should take over, quickly. And the new Autoscaler Pod should make equivalent scaling decisions. -**Goal**: the autoscaling system should be highly available. +**Goal**: the autoscaling system should be more highly available than a single Pod. **Key Steps**: 1. Autoscaler replication should be configurable. This will likely require leader election. @@ -62,13 +63,13 @@ We need a way for autoscaling to have higher availability than that of a single The Autoscaler process maintains Pod metric data points over a window of time and calculates average concurrency every 2 seconds. As the number and size of Revisions deployed to a cluster increases, so does the load on the Autoscaler. -We need some way to have sub-linear load on the Autoscaler as the Revision count increases. This could be a sharding scheme or simply deploying separate Autoscalers per namespace. +We need some way to have sub-linear load in a given Autoscaler Pod as the Revision count increases. This could be a sharding scheme or simply deploying separate Autoscalers per namespace. -**Goal**: the Autoscaling system can scale sub-linearly with the number of Revisions. +**Goal**: the Autoscaling system can scale sub-linearly with the number of Revisions and number of Revision Pods. **Key Steps**: 1. Automated load test to determine the current scalability limit. And to guard against regression. -2. Deploying an Autoscaler per namespace. +2. Configuration for sharding by namespace or other scheme. **Project**: TBD @@ -76,22 +77,30 @@ We need some way to have sub-linear load on the Autoscaler as the Revision count ### Pluggability -It is possible to replace the autoscaling system by implementing an alternative PodAutoscaler reconciler (see the [Yolo controller](https://github.com/josephburnett/kubecon18)). However that requires the implementer to collect their own metrics, implement an autoscaling process, and actuate the recommendations. +It is possible to replace the entire autoscaling system by implementing an alternative PodAutoscaler reconciler (see the [Yolo controller](https://github.com/josephburnett/kubecon18)). However that requires collecting metrics, running an autoscaling process, and actuating the recommendations. -It should be able to swap out smaller pieces of the autoscaling system. +We should be able to swap out smaller pieces of the autoscaling system. For example, the HPA should be able to make use of the metrics Knative collects. + +**Goal**: the autoscaling decider and metrics collection components can be replaced independently. + +**Key Steps**: +1. Build a reference implementation to test swapping the decider. And the metrics collection. 
(See [knative/build](https://github.com/knative/serving/blob/fa1aff18a9b549e79e41cf0b34f66b79c3da06b6/test/controller/main.go#L69)). +2. Provide Knative metrics via the Custom Metrics interface. + +**Project**: TBD ### HPA Integration -This is work remaining from 2018 to add CPU-based autoscaling to Knative and provide an extension point for further customizing the autoscaling sub-system. Remaining work includes: +The current Knative integration with k8s HPA only supports CPU autoscaling. However it should be able to scale on concurrency as well. Ultimately, the HPA may be able to replace the KPA entirely (see ["make everything better"](https://github.com/knative/serving/blob/master/docs/roadmap/scaling-2018.md#references)). Additionally, HPA should be able to scale on user-provided custom metrics as well. -1. metrics pipeline relayering to scrape metrics from Pods ([#1927](https://github.com/knative/serving/issues/1927)) -2. adding a `window` annotation to allow for further customization of the KPA autoscaler ([#2909](https://github.com/knative/serving/issues/2909)) -3. implementing scale-to-zero for CPU-scaled workloads ([#3064](https://github.com/knative/serving/issues/3064)) +**Goal**: Knative hpa-class PodAutoscalers support concurrency-based autoscaling -**Our goal is to have a cleanly-layered, extensible autoscaling sub-system which fully supports concurrency and CPU metrics (including scale-to-zero).** +**Key Steps**: +1. Provide Knative metrics via the Custom Metrics interface (see also Pluggability above). +2. Configure the HPA to scale on the Knative concurrency metric. +3. Configure the HPA to scale on the user provided metric (requires a user configured Custom Metrics adapter). -* POC: Yanwei Guo (Google) -* Github: [Project 11](https://github.com/knative/serving/projects/11) +**Project**: TBD ## What we're not doing yet @@ -103,7 +112,4 @@ Knative previously integrated with VPA Alpha. Now it needs to reintegrate with V Additionally, the next Revision should learn from the previous. But it must not taint the previous Revision's state. For example, when a Service is in runLatest mode, the next Revision should start from the resource recommendations of the previous. Then VPA will apply learning on top of that to adjust for changes in the application behavior. However if the next Revision goes crazy because of bad recommendations, a quick rollback to the previous should pick up the good ones. Again, this requires a little bit of bookkeeping in Knative. -**Our goal is support VPA enabled per-revision with 1) revision-to-revision inheritance to recommendations (when appropriate) and 2) safe rollback to previous recommendations when rolling back to previous Revisions.** - -* POC: Joseph Burnett (Google) -* Github: [Project 18](https://github.com/knative/serving/projects/18) +**Project**: [Project 18](https://github.com/knative/serving/projects/18) From 50ca86fa8d51d92ddd16104375aedb0f97fcc746 Mon Sep 17 00:00:00 2001 From: Joseph Burnett Date: Wed, 27 Mar 2019 10:29:48 +0100 Subject: [PATCH 16/18] Minor edits. 
--- docs/roadmap/scaling-2019.md | 28 ++++++++++++++++++---------- 1 file changed, 18 insertions(+), 10 deletions(-) diff --git a/docs/roadmap/scaling-2019.md b/docs/roadmap/scaling-2019.md index d5dbc4d3aae7..488987e5d13f 100644 --- a/docs/roadmap/scaling-2019.md +++ b/docs/roadmap/scaling-2019.md @@ -45,7 +45,7 @@ The overall problem touches on both Networking and Autoscaling, two different wo ### Autoscaling Availabilty -Because Knative scales to zero, the autoscaling system is in the critical-path for serving requests. If the Autoscaler or Activator isn't available when an idle Revision receives a request, that request will not be served. The Activator is stateless and can be easily scaled horizontally. Any Activator can proxy any request for any Revision. But the Autoscaler process is stateful. It maintains request statistics over a window of time. +Because Knative scales to zero, the autoscaling system is in the critical-path for serving requests. If the Autoscaler or Activator isn't available when an idle Revision receives a request, that request will not be served. The Activator is stateless and can be easily scaled horizontally. Any Activator Pod can proxy any request for any Revision. But the Autoscaler Pod is stateful. It maintains request statistics over a window of time. We need a way for autoscaling to have higher availability than that of a single Pod. When an Autoscaler Pod fails, another one should take over, quickly. And the new Autoscaler Pod should make equivalent scaling decisions. @@ -63,7 +63,7 @@ We need a way for autoscaling to have higher availability than that of a single The Autoscaler process maintains Pod metric data points over a window of time and calculates average concurrency every 2 seconds. As the number and size of Revisions deployed to a cluster increases, so does the load on the Autoscaler. -We need some way to have sub-linear load in a given Autoscaler Pod as the Revision count increases. This could be a sharding scheme or simply deploying separate Autoscalers per namespace. +We need some way to have sub-linear load on a given Autoscaler Pod as the Revision count increases. This could be a sharding scheme or simply deploying separate Autoscalers per namespace. **Goal**: the Autoscaling system can scale sub-linearly with the number of Revisions and number of Revision Pods. @@ -79,7 +79,7 @@ We need some way to have sub-linear load in a given Autoscaler Pod as the Revisi It is possible to replace the entire autoscaling system by implementing an alternative PodAutoscaler reconciler (see the [Yolo controller](https://github.com/josephburnett/kubecon18)). However that requires collecting metrics, running an autoscaling process, and actuating the recommendations. -We should be able to swap out smaller pieces of the autoscaling system. For example, the HPA should be able to make use of the metrics Knative collects. +We should be able to swap out smaller pieces of the autoscaling system. For example, the HPA should be able to make use of the metrics that Knative collects. **Goal**: the autoscaling decider and metrics collection components can be replaced independently. @@ -91,24 +91,32 @@ We should be able to swap out smaller pieces of the autoscaling system. For exam ### HPA Integration -The current Knative integration with k8s HPA only supports CPU autoscaling. However it should be able to scale on concurrency as well. 
Ultimately, the HPA may be able to replace the KPA entirely (see ["make everything better"](https://github.com/knative/serving/blob/master/docs/roadmap/scaling-2018.md#references)). Additionally, HPA should be able to scale on user-provided custom metrics as well. +The current Knative integration with K8s HPA only supports CPU autoscaling. However it should be able to scale on concurrency as well. Ultimately, the HPA may be able to replace the Knative Autoscaler (KPA) entirely (see ["make everything better"](https://github.com/knative/serving/blob/master/docs/roadmap/scaling-2018.md#references)). Additionally, HPA should be able to scale on user-provided custom metrics as well. -**Goal**: Knative hpa-class PodAutoscalers support concurrency-based autoscaling +**Goal**: Knative HPA-class PodAutoscalers support concurrency autoscaling **Key Steps**: -1. Provide Knative metrics via the Custom Metrics interface (see also Pluggability above). +1. Provide Knative metrics via the Custom Metrics interface (see also [Pluggability](#pluggability) above). 2. Configure the HPA to scale on the Knative concurrency metric. -3. Configure the HPA to scale on the user provided metric (requires a user configured Custom Metrics adapter). +3. Configure the HPA to scale on the user provided metric (requires a user configured Custom Metrics adapter to collect their metric). **Project**: TBD -## What we're not doing yet +## What We Are Not Doing Yet + +### Removing the Queue Proxy Sidecar + +There are two sidecars injected into Knative Pods, Envoy and the Queue Proxy. The queue-proxy sidecar is where we put everything we wish Envoy/Istio could do, but doesn't yet. For example, enforcing single-threaded request. Or reporting concurrency metrics in the way we want. Ultimately we should push these features upstream and get rid of the queue-proxy sidecar. + +However we're not doing that yet because the requirement haven't stablized enough yet. And it's still useful to have a component to innovate within. + +See [2018 What We Are Not Doing Yet](https://github.com/knative/serving/blob/master/docs/roadmap/scaling-2018.md#what-we-are-not-doing-yet) ### Vertical Pod Autoscaling Beta -Another dimension of the serverless illusion is running code efficiently. Knative has default resources request. And it supports resource requests and limits from the user. But if the user doesn't want to spend their time "tuning" resources, which is a very "serverful" way to spend one's time, Knative should be able to just "figure it out". That is Vertical Pod Autoscaling (VPA). +A serverless system should be able to run code efficiently. Knative has default resources request and it supports resource requests and limits from the user. But if the user doesn't want to spend their time "tuning" resources (which is very "serverful") then Knative should be able to just "figure it out". That is Vertical Pod Autoscaling (VPA). -Knative previously integrated with VPA Alpha. Now it needs to reintegrate with VPA Beta. In addition to creating VPA resources for each Revision, we need to do a little bookkeeping for the unique requirements of serverless workloads. For example, the window for VPA recommendations is 2 weeks. But a serverless function might be invoked once per year (e.g. when the fire alarm gets pulled). The Pods should come back with the correct resource requests and limits. The way VPA is architected, it "injects" the correct recommendations via mutating webhook. 
It will decline to update resources requests after 2 weeks of inactivity and the Revision would fall back to defaults. Knative needs to remember what that recommendation was and make sure new Pods start at the right levels. +Knative [previously integrated with VPA Alpha](https://github.com/knative/serving/issues/839#issuecomment-389387311). Now it needs to reintegrate with VPA Beta. In addition to creating VPA resources for each Revision, we need to do a little bookkeeping for the unique requirements of serverless workloads. For example, the window for VPA recommendations is 2 weeks. But a serverless function might be invoked once per year (e.g. when the fire alarm gets pulled). The Pods should come back with the correct resource requests and limits. The way VPA is architected, it "injects" the correct recommendations via mutating webhook. It will decline to update resources requests after 2 weeks of inactivity and the Revision would fall back to defaults. Knative needs to remember what that recommendation was and make sure new Pods start at the right levels. Additionally, the next Revision should learn from the previous. But it must not taint the previous Revision's state. For example, when a Service is in runLatest mode, the next Revision should start from the resource recommendations of the previous. Then VPA will apply learning on top of that to adjust for changes in the application behavior. However if the next Revision goes crazy because of bad recommendations, a quick rollback to the previous should pick up the good ones. Again, this requires a little bit of bookkeeping in Knative. From e34c0566553cf70c6347cc45a7a7e08dbfe4d1c0 Mon Sep 17 00:00:00 2001 From: Ben Browning Date: Thu, 28 Mar 2019 11:58:06 -0400 Subject: [PATCH 17/18] Propose section on migration K8s Deployments --- docs/roadmap/scaling-2019.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/docs/roadmap/scaling-2019.md b/docs/roadmap/scaling-2019.md index 488987e5d13f..9e2547bf90fc 100644 --- a/docs/roadmap/scaling-2019.md +++ b/docs/roadmap/scaling-2019.md @@ -102,6 +102,19 @@ The current Knative integration with K8s HPA only supports CPU autoscaling. Howe **Project**: TBD +## User Experience + +### Migrating Kubernetes Deployments to Knative + +We need documentation and examples to help Kubernetes users with existing Kubernetes Deployments migrate some of those to Knative to take advantage of request-based autoscaling and scale-to-zero. + +**Goal**: increase Knative adoption by making migration from Kubernetes Deployments simple + +**Key Steps**: +1. Document why a user would want Knative's autoscaling instead of using the Kubernetes Horizontal Pod Autoscaler (HPA) without Knative. Especially if K8s HPA and Knative Autoscaler converge in implementation, describe the benefit to the user of moving to Knative autoscaling. +2. Document what would make a Deployment ineligible to move to Knative without changes to the application - multiple containers in a pod, writable volumes, etc. +3. Maintain an example of a Kubernetes Deployment that was converted to a Knative resource to take advantage of Knative autoscaling. + ## What We Are Not Doing Yet ### Removing the Queue Proxy Sidecar From df43dc375331f17f73f36aaad42b200cfbd6bc5c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Markus=20Th=C3=B6mmes?= Date: Wed, 29 May 2019 14:09:24 +0200 Subject: [PATCH 18/18] Reworked parts of the Scaling roadmap. - Unified some wording (capitalization mostly). - Removed prescriptive key steps. 
These should be captured by the respective projects, which will be more dynamically changeable than this document.
---
 docs/roadmap/scaling-2019.md | 81 ++++++++++++------------------------
 1 file changed, 26 insertions(+), 55 deletions(-)

diff --git a/docs/roadmap/scaling-2019.md b/docs/roadmap/scaling-2019.md
index 9e2547bf90fc..9cbd482919ec 100644
--- a/docs/roadmap/scaling-2019.md
+++ b/docs/roadmap/scaling-2019.md
@@ -4,40 +4,35 @@ This is what we hope to accomplish in 2019.
 
 ## Performance
 
+### Tests and Reliable Reporting
+
+As an overarching goal, we want all aspects of our performance continuously measured and reliably reported. All of the following aspects will include in-depth testing and reporting to make sure that advancements are reproducible on the CI systems and to avoid unwanted regressions.
+
+**Goal**: All relevant performance numbers are tracked and reported.
+
+**Project**: No separate project for now.
+
 ### Sub-Second Cold Start
 
 As a serverless framework, Knative should only run code when it needs to. Including scaling to zero when the Revision is not being used. However the Revision must also come back quickly, otherwise the illusion of "serverless" is broken--it must seem as if it was always there. Generally less than one second is a good start.
 
 Today cold-starts are between 10 and 15 seconds which is an order of magnitude too slow. The time is spent starting the pod, waiting for Envoy to initialize, and setting up routing. Without the Istio mesh (just routing requests to individual pods as they come up), cold start still takes about 4 seconds. We've poked at this problem in 2018 ([#1297](https://github.com/knative/serving/issues/1297)) but haven't made significant progress. This area requires some dedicated effort.
 
-One area of investment is a local scheduling approach in which the Activator is given authority to schedule a Pod locally on the Node. This takes several layers out of the critical path for cold starts.
-
-**Goal**: achieve sub-second average cold-starts of disk-warm Revisions running in mesh-mode by the end of the year.
-
-**Key Steps**:
-1. Capture cold start traces.
-2. Track cold start latency over time by span.
-3. Performance test for cold starts.
-4. Local scheduling.
+**Goal**: Achieve sub-second average cold-starts of disk-warm Revisions.
 
 **Project**: [Project 8](https://github.com/knative/serving/projects/8)
 
 ### Overload Handling
 
-Knative Serving provides concurrency controls to limit the number of requests a container must handle simultaneously. Additionally, each pod has a queue for holding requests when the container concurrency limit has been reached. When the pod-level queue overflows, subsequent request are rejected with 503 "overload".
+Knative Serving provides concurrency controls to limit the number of requests a container can handle simultaneously. Additionally, each pod has a queue for holding requests when the container concurrency limit has been reached. When the pod-level queue overflows, subsequent requests are rejected with 503 "overload".
 
-This is desirable to protect the Pod from being overloaded. But the aggregate behavior is not ideal for situations when autoscaling needs some time to react to sudden increases in request load. This could be when the revision is scaled to zero. Or when the revision is already running some pods, but not nearly enough.
+This is desirable to protect the pod from being overloaded. 
But the aggregate behavior is not ideal for situations when autoscaling needs some time to react to sudden increases in request load. This could happen when the Revision is scaled to zero or when the Revision is already running some pods, but not nearly enough. -The goal of Overload Handling is to enqueue requests at a revision-level. Scale-from-zero should not overload if autoscaling can react in a reasonable amount of time to provide additional pods. When new pods come online, they should be able to take load from the existing pods. Even when scaled above zero, brief spikes of overload should be handled by enqueuing requests at a revision-level. The depth of the revision-level queue should also be configurable because even the Revision as a whole needs to guard against overload. +The goal of Overload Handling is to enqueue requests at a Revision-level. Scale-from-zero should not overload if autoscaling can react in a reasonable amount of time to provide additional pods. When new pods come online, they should be able to take load from the existing pods. Even when scaled above zero, brief spikes of overload should be handled by enqueuing requests at a Revision-level. The depth of the Revision-level queue should also be configurable because even the Revision as a whole needs to guard against overload. -The overall problem touches on both Networking and Autoscaling, two different working groups. Much of the overload handling will be implemented in the Activator, which is a part of ingress. So this project is shared jointly between the two working groups. +The overall problem touches on both networking and autoscaling, two different working groups. Much of the overload handling will be implemented in the Activator, which is a part of ingress. So this project is shared jointly between the two working groups. -**Goal**: requests can be enqueued at the revision-level in response to high load. - -**Key Steps**: -1. Handle overload gracefully in the Activator. Proxy requests at a rate the underlying Deployment can handle. Reject requests beyond a revision-level limit. -2. Wire the Activator into the serving path on overload. -3. Performance tests for overload scenarios. +**Goal**: Requests can be enqueued at the Revision-level in response to high load. **Project**: [Project 7](https://github.com/knative/serving/projects/7) @@ -45,31 +40,21 @@ The overall problem touches on both Networking and Autoscaling, two different wo ### Autoscaling Availabilty -Because Knative scales to zero, the autoscaling system is in the critical-path for serving requests. If the Autoscaler or Activator isn't available when an idle Revision receives a request, that request will not be served. The Activator is stateless and can be easily scaled horizontally. Any Activator Pod can proxy any request for any Revision. But the Autoscaler Pod is stateful. It maintains request statistics over a window of time. - -We need a way for autoscaling to have higher availability than that of a single Pod. When an Autoscaler Pod fails, another one should take over, quickly. And the new Autoscaler Pod should make equivalent scaling decisions. +Because Knative scales to zero, the autoscaling system is in the critical-path for serving requests. If the Autoscaler or Activator isn't available when an idle Revision receives a request, that request will not be served. The Activator is stateless and can be easily scaled horizontally. Any Activator pod can proxy any request for any Revision. But the Autoscaler pod is stateful. 
It maintains request statistics over a window of time. Moreover, the relationship between Activator and Autoscaler is N:1 currently because of how the Activator pushes metrics into the Autoscaler via a Websocket connection. -**Goal**: the autoscaling system should be more highly available than a single Pod. +We need a way for autoscaling to have higher availability than that of a single pod. When an Autoscaler pod fails, another one should take over, quickly. And the new Autoscaler pod should make equivalent scaling decisions. -**Key Steps**: -1. Autoscaler replication should be configurable. This will likely require leader election. -2. Activator replication should be configurable. -3. Any Activator Pod can provide metrics to all Autoscaler Pods. -4. E2E tests which validate autoscaling availability in the midst of a Autoscaler Pod failure. +**Goal**: The autoscaling is more highly available than a single pod. **Project**: TBD ### Autoscaling Scalability -The Autoscaler process maintains Pod metric data points over a window of time and calculates average concurrency every 2 seconds. As the number and size of Revisions deployed to a cluster increases, so does the load on the Autoscaler. - -We need some way to have sub-linear load on a given Autoscaler Pod as the Revision count increases. This could be a sharding scheme or simply deploying separate Autoscalers per namespace. +The Autoscaler process maintains pod metric data points over a window of time and calculates average concurrency every 2 seconds. As the number and size of Revisions deployed to a cluster increases, so does the load on the Autoscaler. -**Goal**: the Autoscaling system can scale sub-linearly with the number of Revisions and number of Revision Pods. +We need some way to have sub-linear load on a given Autoscaler pod as the Revision count increases. This could be a sharding scheme or simply deploying separate Autoscalers per namespace. -**Key Steps**: -1. Automated load test to determine the current scalability limit. And to guard against regression. -2. Configuration for sharding by namespace or other scheme. +**Goal**: The autoscaling system can scale sub-linearly with the number of Revisions and number of Revision pods. **Project**: TBD @@ -81,11 +66,7 @@ It is possible to replace the entire autoscaling system by implementing an alter We should be able to swap out smaller pieces of the autoscaling system. For example, the HPA should be able to make use of the metrics that Knative collects. -**Goal**: the autoscaling decider and metrics collection components can be replaced independently. - -**Key Steps**: -1. Build a reference implementation to test swapping the decider. And the metrics collection. (See [knative/build](https://github.com/knative/serving/blob/fa1aff18a9b549e79e41cf0b34f66b79c3da06b6/test/controller/main.go#L69)). -2. Provide Knative metrics via the Custom Metrics interface. +**Goal**: The autoscaling decider and metrics collection components can be replaced independently. **Project**: TBD @@ -93,12 +74,7 @@ We should be able to swap out smaller pieces of the autoscaling system. For exam The current Knative integration with K8s HPA only supports CPU autoscaling. However it should be able to scale on concurrency as well. Ultimately, the HPA may be able to replace the Knative Autoscaler (KPA) entirely (see ["make everything better"](https://github.com/knative/serving/blob/master/docs/roadmap/scaling-2018.md#references)). Additionally, HPA should be able to scale on user-provided custom metrics as well. 
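+
+To make this concrete, here is a sketch of the kind of HPA object such an integration could enable, written against the `autoscaling/v2beta2` API. The per-pod metric name `concurrency`, the target value, and the Deployment name are assumptions for illustration; they presuppose that Knative's concurrency metric is exposed through the Custom Metrics API (see the Pluggability section above).
+
+```go
+package main
+
+import (
+	autoscalingv2beta2 "k8s.io/api/autoscaling/v2beta2"
+	"k8s.io/apimachinery/pkg/api/resource"
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+)
+
+// concurrencyHPA sketches an HPA that scales a Revision's Deployment on an
+// average per-pod "concurrency" metric instead of CPU.
+func concurrencyHPA() *autoscalingv2beta2.HorizontalPodAutoscaler {
+	minReplicas := int32(1)
+	target := resource.MustParse("10") // aim for ~10 in-flight requests per pod
+	return &autoscalingv2beta2.HorizontalPodAutoscaler{
+		ObjectMeta: metav1.ObjectMeta{Name: "helloworld-00001", Namespace: "default"},
+		Spec: autoscalingv2beta2.HorizontalPodAutoscalerSpec{
+			ScaleTargetRef: autoscalingv2beta2.CrossVersionObjectReference{
+				APIVersion: "apps/v1",
+				Kind:       "Deployment",
+				Name:       "helloworld-00001-deployment",
+			},
+			MinReplicas: &minReplicas,
+			MaxReplicas: 100,
+			Metrics: []autoscalingv2beta2.MetricSpec{{
+				Type: autoscalingv2beta2.PodsMetricSourceType,
+				Pods: &autoscalingv2beta2.PodsMetricSource{
+					Metric: autoscalingv2beta2.MetricIdentifier{Name: "concurrency"},
+					Target: autoscalingv2beta2.MetricTarget{
+						Type:         autoscalingv2beta2.AverageValueMetricType,
+						AverageValue: &target,
+					},
+				},
+			}},
+		},
+	}
+}
+
+func main() { _ = concurrencyHPA() }
+```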
-**Goal**: Knative HPA-class PodAutoscalers support concurrency autoscaling - -**Key Steps**: -1. Provide Knative metrics via the Custom Metrics interface (see also [Pluggability](#pluggability) above). -2. Configure the HPA to scale on the Knative concurrency metric. -3. Configure the HPA to scale on the user provided metric (requires a user configured Custom Metrics adapter to collect their metric). +**Goal**: Knative HPA-class PodAutoscalers support concurrency autoscaling. **Project**: TBD @@ -108,20 +84,15 @@ The current Knative integration with K8s HPA only supports CPU autoscaling. Howe We need documentation and examples to help Kubernetes users with existing Kubernetes Deployments migrate some of those to Knative to take advantage of request-based autoscaling and scale-to-zero. -**Goal**: increase Knative adoption by making migration from Kubernetes Deployments simple - -**Key Steps**: -1. Document why a user would want Knative's autoscaling instead of using the Kubernetes Horizontal Pod Autoscaler (HPA) without Knative. Especially if K8s HPA and Knative Autoscaler converge in implementation, describe the benefit to the user of moving to Knative autoscaling. -2. Document what would make a Deployment ineligible to move to Knative without changes to the application - multiple containers in a pod, writable volumes, etc. -3. Maintain an example of a Kubernetes Deployment that was converted to a Knative resource to take advantage of Knative autoscaling. +**Goal**: Increase Knative adoption by making migration from Kubernetes Deployments simple. ## What We Are Not Doing Yet ### Removing the Queue Proxy Sidecar -There are two sidecars injected into Knative Pods, Envoy and the Queue Proxy. The queue-proxy sidecar is where we put everything we wish Envoy/Istio could do, but doesn't yet. For example, enforcing single-threaded request. Or reporting concurrency metrics in the way we want. Ultimately we should push these features upstream and get rid of the queue-proxy sidecar. +There are two sidecars injected into Knative pods, Envoy and the Queue Proxy. The queue-proxy sidecar is where we put everything we wish Envoy/Istio could do, but doesn't yet. For example, enforcing single-threaded request or reporting concurrency metrics in the way we want. Ultimately we should push these features upstream and get rid of the queue-proxy sidecar. -However we're not doing that yet because the requirement haven't stablized enough yet. And it's still useful to have a component to innovate within. +However we're not doing that yet because the requirements haven't stablized enough yet. And it's still useful to have a component to innovate within. See [2018 What We Are Not Doing Yet](https://github.com/knative/serving/blob/master/docs/roadmap/scaling-2018.md#what-we-are-not-doing-yet) @@ -129,7 +100,7 @@ See [2018 What We Are Not Doing Yet](https://github.com/knative/serving/blob/mas A serverless system should be able to run code efficiently. Knative has default resources request and it supports resource requests and limits from the user. But if the user doesn't want to spend their time "tuning" resources (which is very "serverful") then Knative should be able to just "figure it out". That is Vertical Pod Autoscaling (VPA). -Knative [previously integrated with VPA Alpha](https://github.com/knative/serving/issues/839#issuecomment-389387311). Now it needs to reintegrate with VPA Beta. 
In addition to creating VPA resources for each Revision, we need to do a little bookkeeping for the unique requirements of serverless workloads. For example, the window for VPA recommendations is 2 weeks. But a serverless function might be invoked once per year (e.g. when the fire alarm gets pulled). The pods should come back with the correct resource requests and limits. The way VPA is architected, it "injects" the correct recommendations via a mutating webhook. It will decline to update resource requests after 2 weeks of inactivity and the Revision would fall back to defaults. Knative needs to remember what that recommendation was and make sure new pods start at the right levels.
 
 Additionally, the next Revision should learn from the previous. But it must not taint the previous Revision's state. For example, when a Service is in runLatest mode, the next Revision should start from the resource recommendations of the previous. Then VPA will apply learning on top of that to adjust for changes in the application behavior. However, if the next Revision goes crazy because of bad recommendations, a quick rollback to the previous should pick up the good ones. Again, this requires a little bit of bookkeeping in Knative.
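+
+To make the bookkeeping concrete, here is a minimal sketch of carrying a remembered recommendation forward into a new Revision's container. The `seedRequests` helper and the idea of storing the last recommendation alongside the Revision are assumptions for illustration, not the planned API.
+
+```go
+package main
+
+import (
+	"fmt"
+
+	corev1 "k8s.io/api/core/v1"
+	"k8s.io/apimachinery/pkg/api/resource"
+)
+
+// seedRequests copies a remembered VPA recommendation into a container's
+// resource requests, but only for resources the user did not set explicitly.
+// This lets pods created after a long idle period (or after a rollback) start
+// at the learned levels instead of falling back to defaults.
+func seedRequests(recommendation corev1.ResourceList, container *corev1.Container) {
+	if container.Resources.Requests == nil {
+		container.Resources.Requests = corev1.ResourceList{}
+	}
+	for name, qty := range recommendation {
+		if _, userSet := container.Resources.Requests[name]; !userSet {
+			container.Resources.Requests[name] = qty
+		}
+	}
+}
+
+func main() {
+	// Recommendation remembered from the previous Revision (hypothetical values).
+	remembered := corev1.ResourceList{
+		corev1.ResourceCPU:    resource.MustParse("250m"),
+		corev1.ResourceMemory: resource.MustParse("192Mi"),
+	}
+	c := corev1.Container{Name: "user-container"}
+	seedRequests(remembered, &c)
+	fmt.Println(c.Resources.Requests.Cpu(), c.Resources.Requests.Memory())
+}
+```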