| title | authors | reviewers | approvers | creation-date |
|-------|---------|-----------|-----------|---------------|
| FederatedHPA | | | | 2023-04-11 |
Horizontal Pod Autoscaling (HPA) is a popular method for stabilizing applications by adjusting the number of pods based on different metrics. However, in a single cluster, recovery from scaling failures cannot be handled. Moreover, in the era of multi-cluster, single-cluster HPA may not meet growing requirements such as scaling across clusters, unified HPA configuration management, high availability of the scaling function, and limiting resource/cost thresholds across multiple clusters.

To address these issues, this proposal introduces FederatedHPA to Karmada. This solution can scale applications across multiple clusters and improve the overall scaling threshold, stability, and utilization of users' Kubernetes clusters.
In the current era of single clusters, Horizontal Pod Autoscaling (HPA) is commonly used to dynamically scale workload replicas to handle incoming requests and improve resource utilization. In the era of multi-cluster, HPA should therefore still work and be capable of scaling workloads across different member clusters, to overcome the resource limitations of a single cluster and improve recovery from scaling failures.

By introducing FederatedHPA to Karmada, users can enable various scenarios with rich policies in PropagationPolicy. For instance, the scaled-up workloads may be scheduled to the clusters with more resources. FederatedHPA provides greater flexibility for workload scaling across multiple clusters.
- Manage the HPA resource in a unified manner across clusters.
- Scale workloads across multiple clusters with FederatedHPA and rich policies in PropagationPolicy.
- Support both Kubernetes HPA (CPU/Memory) and customized HPA (custom metrics).
- Propose the implementation ideas for the involved components, including `karmada-controller-manager` and `karmada-metrics-adapter`.
- FederatedHPA is solely focused on scaling replicas; scheduling those replicas to specific clusters is determined by the scheduling component. In other words, PropagationPolicy will be used to schedule the workloads to the clusters, such as scheduling the workloads to the highest-priority cluster first.
As a user, I have an HPA in each cluster for one application. Since it is quite complex and redundant to manage them separately, I expect to be able to configure and manage the HPA resource in a unified manner.

As a user, I deploy my application in multiple clusters. Since the application's load may change with the incoming requests, I expect to be able to scale the workloads up or down across clusters with rich policies to achieve different goals, such as:

- If I have two clusters, A and B, with A having more resources and B having fewer, I would like to configure more instances to be scaled up in cluster A to improve overall resource utilization across the clusters.

As a user, I deploy my service with an HPA in a single cluster. Since scaling may fail because of single-cluster limitations, such as resource shortages, which would make the service unavailable under heavy traffic, I expect to improve the scaling stability of the service by scaling across clusters.

As a user, I deploy my application with HPA in multiple clusters. Since unlimited scale-up of the application may lead to unexpected cloud costs and resource contention that could cause service disruptions for other applications, I expect to be able to set a maximum number of instances for multi-cluster applications.
- The workloads/pods in different member clusters selected by the same HPA CR share the application's load equally. For example, if 10 pods of the application are spread across two member clusters with the distribution `cluster1: 3 pods, cluster2: 7 pods`, the 3 pods in cluster1 take 3/10 of the total requests and the 7 pods in cluster2 take 7/10. Scenarios that do not meet this restriction are not considered in this proposal.
- If the system does not meet the first requirement, some clusters' services may become overloaded while others remain underloaded.
- An HPA used in a single cluster can be seamlessly migrated to a multi-cluster FederatedHPA.
For different scenarios, the architecture may be different:
HPA with resource metrics:

The steps are described as follows:

- The HPA controller periodically queries metrics via the metrics API `metrics.k8s.io` with a label selector (a minimal client-side sketch follows this list).
- `karmada-apiserver` receives the metrics API query and routes it to `karmada-metrics-adapter` via API service registration.
- `karmada-metrics-adapter` continuously lists and watches the member clusters' pods. When a request arrives, it queries the metrics from the target clusters (where the matching pods exist), aggregates the collected metrics, and returns the result.
- The HPA controller calculates the desired replicas based on the metrics and scales the workload directly.
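To make the flow concrete, here is a minimal, hedged sketch of the first step: a client (standing in for the HPA controller) lists pod metrics through `karmada-apiserver`, which routes the call to `karmada-metrics-adapter`. The kubeconfig path, namespace, and label selector are assumptions of this sketch, not values mandated by the proposal.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

func main() {
	// Assumption: a kubeconfig pointing at karmada-apiserver; the path is illustrative.
	config, err := clientcmd.BuildConfigFromFlags("", "/etc/karmada/karmada-apiserver.config")
	if err != nil {
		panic(err)
	}
	client, err := metricsclient.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Equivalent to: GET /apis/metrics.k8s.io/v1beta1/namespaces/default/pods?labelSelector=app%3Dnginx
	// karmada-apiserver routes this request to karmada-metrics-adapter via the registered APIService.
	podMetrics, err := client.MetricsV1beta1().PodMetricses("default").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "app=nginx"})
	if err != nil {
		panic(err)
	}
	for _, pm := range podMetrics.Items {
		for _, c := range pm.Containers {
			fmt.Printf("pod=%s container=%s cpu=%s memory=%s\n",
				pm.Name, c.Name, c.Usage.Cpu().String(), c.Usage.Memory().String())
		}
	}
}
```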
HPA with custom metrics:
The steps are described as follows:
- The HPA controller periodically queries metrics via the metrics API `custom.metrics.k8s.io` with a label selector.
- `karmada-apiserver` receives the metrics API query and routes it to `karmada-metrics-adapter` via API service registration.
- `karmada-metrics-adapter` continuously lists and watches the member clusters' pods. When a request arrives, it queries the metrics from the target clusters (where the matching pods exist), aggregates the collected metrics, and returns the result.
- The HPA controller calculates the desired replicas based on the metrics and scales the workload directly.
HPA with multi-cluster metrics store:
- The HPA controller periodically queries metrics via the metrics API `metrics.k8s.io` or `custom.metrics.k8s.io` with a label selector.
- `karmada-apiserver` receives the metrics API query and routes it to the `metrics-adapter` via API service registration.
- The `metrics-adapter` queries the metrics from the managed metrics store and returns them.
- The HPA controller calculates the desired replicas based on the metrics and scales the workload directly.
Note:
- How are the target clusters found? `karmada-metrics-adapter` uses a separate informer to build a lister cache for each member cluster. When a request arrives, it finds the target clusters by querying the resource from every cluster's lister cache; each cluster containing a matching resource is one of the target clusters. A simplified sketch of this lookup follows.
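Below is a simplified Go sketch of that lookup, assuming the adapter keeps one informer-backed `PodLister` per member cluster; the map and function name are illustrative, not the actual implementation.

```go
package adapter

import (
	"k8s.io/apimachinery/pkg/labels"
	corev1listers "k8s.io/client-go/listers/core/v1"
)

// findTargetClusters returns the member clusters whose informer-backed pod
// caches contain at least one pod matching the HPA's label selector.
func findTargetClusters(podListers map[string]corev1listers.PodLister, selector labels.Selector) []string {
	var targets []string
	for clusterName, lister := range podListers {
		pods, err := lister.List(selector)
		if err != nil {
			continue // skip clusters whose cache cannot be read
		}
		if len(pods) > 0 {
			targets = append(targets, clusterName)
		}
	}
	return targets
}
```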
From the above architecture, we can derive the following information:

- The FederatedHPA controller is located in the `karmada-controller-manager` component; it queries metrics from `karmada-apiserver` via the metrics API and scales the deployments.
- `karmada-metrics-adapter` is responsible for querying metrics from the member clusters and aggregating them.
- FederatedHPA is only responsible for scaling the number of instances; it does not care about which clusters the instances are scheduled to.
For a better user experience and to make the API easy to extend in the future, we define the API as follows:
```go
// FederatedHPA is a centralized HPA aggregator that can aggregate the metrics in multiple clusters.
// When the system load increases, it will query the metrics from multiple clusters and scale up the replicas.
// When the system load decreases, it will query the metrics from multiple clusters and scale down the replicas.
// After the replicas are scaled up/down, karmada-scheduler will schedule the replicas based on the policy.
type FederatedHPA struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	// Spec is the specification of the FederatedHPA.
	// +required
	Spec FederatedHPASpec `json:"spec"`

	// Status is the current status of the FederatedHPA.
	// +optional
	Status autoscalingv2.HorizontalPodAutoscalerStatus `json:"status"`
}

// FederatedHPASpec describes the desired functionality of the FederatedHPA.
type FederatedHPASpec struct {
	// ScaleTargetRef points to the target resource to scale, and is used to
	// the pods for which metrics should be collected, as well as to actually
	// change the replica count.
	// +required
	ScaleTargetRef autoscalingv2.CrossVersionObjectReference `json:"scaleTargetRef"`

	// minReplicas is the lower limit for the number of replicas to which the autoscaler
	// can scale down. It defaults to 1 pod. minReplicas is allowed to be 0 if the
	// alpha feature gate HPAScaleToZero is enabled and at least one Object or External
	// metric is configured. Scaling is active as long as at least one metric value is
	// available.
	// +optional
	MinReplicas *int32 `json:"minReplicas,omitempty" protobuf:"varint,2,opt,name=minReplicas"`

	// maxReplicas is the upper limit for the number of replicas to which the autoscaler can scale up.
	// It cannot be less than minReplicas.
	MaxReplicas int32 `json:"maxReplicas" protobuf:"varint,3,opt,name=maxReplicas"`

	// Metrics contains the specifications for which to use to calculate the
	// desired replica count (the maximum replica count across all metrics will
	// be used). The desired replica count is calculated multiplying the
	// ratio between the target value and the current value by the current
	// number of pods. Ergo, metrics used must decrease as the pod count is
	// increased, and vice-versa. See the individual metric source types for
	// more information about how each type of metric must respond.
	// If not set, the default metric will be set to 80% average CPU utilization.
	// +optional
	Metrics []autoscalingv2.MetricSpec `json:"metrics,omitempty"`

	// Behavior configures the scaling behavior of the target
	// in both Up and Down directions (scaleUp and scaleDown fields respectively).
	// If not set, the default HPAScalingRules for scale up and scale down are used.
	// +optional
	Behavior *autoscalingv2.HorizontalPodAutoscalerBehavior `json:"behavior,omitempty"`
}
```
To make HPA scaling work, the HPA controller should be implemented in `karmada-controller-manager`, and its logic will follow the original implementation in `kube-controller-manager`. However, the formula used to calculate the desired replicas is different:
- In the original implementation, the desired replicas are calculated as `desiredReplicas = ceil[currentReadyReplicas * ( currentMetricValue / desiredMetricValue )]`, where `currentReadyReplicas` is the number of all ready pods in a single cluster.
- In Karmada, the desired replicas of the template in the Karmada control plane are calculated with the same formula, `desiredReplicas = ceil[currentReadyReplicas * ( currentMetricValue / desiredMetricValue )]`, but `currentReadyReplicas` is defined as `currentReadyReplicas = readyPodsNumberInAllMemberClusters * (templateReplicas / sumPodsNumber)` (a sketch of this calculation follows the list).
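As a concrete illustration, the adjusted formula can be written as a small Go helper. The function and parameter names are assumptions for this sketch, not the controller's actual code.

```go
package hpa

import "math"

// desiredReplicas sketches the Karmada-side formula above. readyPods is the
// number of ready pods across all connected member clusters, sumPods is the
// total number of pods propagated from the template, and templateReplicas is
// spec.replicas of the template in the control plane.
func desiredReplicas(readyPods, sumPods, templateReplicas int32, currentMetric, desiredMetric float64) int32 {
	// currentReadyReplicas = readyPodsNumberInAllMemberClusters * (templateReplicas / sumPodsNumber)
	currentReadyReplicas := float64(readyPods) * float64(templateReplicas) / float64(sumPods)
	// desiredReplicas = ceil[currentReadyReplicas * (currentMetricValue / desiredMetricValue)]
	return int32(math.Ceil(currentReadyReplicas * currentMetric / desiredMetric))
}
```

For the Duplicated example below, `desiredReplicas(4, 4, 2, 20, 10)` returns 4, and for the Divided example `desiredReplicas(2, 2, 2, 20, 10)` also returns 4, matching the walk-throughs.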
If we use the following configuration (Duplicated):

```yaml
# Deployment yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
---
# PropagationPolicy yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
spec:
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    name: nginx
  placement:
    clusterAffinity:
      clusterNames:
      - member1
      - member2
    replicaScheduling:
      replicaSchedulingType: Duplicated
---
# FederatedHPA yaml
apiVersion: autoscaling.karmada.io/v1alpha1
kind: FederatedHPA
metadata:
  name: nginx
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 10
```
With this configuration, clusters member1 and member2 will each have 2 replicas. If the current utilization is 20%, the desired replicas will be: (4 * (2/4)) * (20% / 10%) = 4. So the nginx Deployment replicas in the Karmada control plane will be scaled to 4, and there will be 4 replicas in each of member1 and member2.
If we use the following configuration (Divided):

```yaml
# Deployment yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
---
# PropagationPolicy yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
spec:
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    name: nginx
  placement:
    clusterAffinity:
      clusterNames:
      - member1
      - member2
    replicaScheduling:
      replicaSchedulingType: Divided
      weightPreference:
        staticWeightList:
        - targetCluster:
            clusterNames:
            - member1
          weight: 1
        - targetCluster:
            clusterNames:
            - member2
          weight: 1
---
# FederatedHPA yaml
apiVersion: autoscaling.karmada.io/v1alpha1
kind: FederatedHPA
metadata:
  name: nginx
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 10
```
With this configuration, clusters member1 and member2 will each have 1 replica. If the current utilization is 20%, the desired replicas will be: (2 * (2/2)) * (20% / 10%) = 4. So the nginx Deployment replicas in the Karmada control plane will be scaled to 4, and 2 replicas will be scheduled to member1 and another 2 replicas to member2.
Also, the following edge case should be considered:
- If a cluster loses its connection with the Karmada control plane, the HPA controller will ignore the pods in that cluster (see the sketch below). For example, suppose there are 2 member clusters, A (connection lost) and B, with 10 pods in cluster A and 30 pods in cluster B. The target replicas will then be `(average utilization of the 30 pods in B / target utilization) * 30`.
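A hedged sketch of this filtering, assuming the control plane reports member-cluster readiness via a standard `Ready` condition on the Cluster object; the helper name is illustrative.

```go
package hpa

import (
	"k8s.io/apimachinery/pkg/api/meta"

	clusterv1alpha1 "github.com/karmada-io/karmada/pkg/apis/cluster/v1alpha1"
)

// connectedClusters filters out member clusters whose connection to the
// control plane is lost, so that their pods are excluded from the
// desired-replicas calculation. Assumes readiness is exposed as a "Ready"
// condition in the Cluster status.
func connectedClusters(clusters []clusterv1alpha1.Cluster) []clusterv1alpha1.Cluster {
	var connected []clusterv1alpha1.Cluster
	for _, c := range clusters {
		if meta.IsStatusConditionTrue(c.Status.Conditions, "Ready") {
			connected = append(connected, c)
		}
	}
	return connected
}
```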
To query metrics from member clusters, Karmada introduces a new component called `karmada-metrics-adapter`, which is responsible for the following tasks:

- Registering the metrics API services, including `metrics.k8s.io`, `custom.metrics.k8s.io`, and `external.metrics.k8s.io`.
- When a query is received from the apiserver, `karmada-metrics-adapter` queries the member clusters and performs the aggregation, as sketched below.

Note: For pull-mode clusters, users need to provide a way to connect the network between the Karmada control plane and the member cluster (for example, via ANP), so that `karmada-metrics-adapter` can access the member cluster.
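Conceptually, the aggregation step can be sketched as follows, assuming the adapter holds a `rest.Config` per member cluster; the map and function name are assumptions of this sketch, not the component's actual implementation.

```go
package adapter

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
	metricsv1beta1 "k8s.io/metrics/pkg/apis/metrics/v1beta1"
	metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

// aggregatePodMetrics fans out to every target member cluster, lists the pod
// metrics matching the selector, and merges the items into one list that is
// returned to the HPA controller.
func aggregatePodMetrics(ctx context.Context, clusterConfigs map[string]*rest.Config,
	namespace, labelSelector string) (*metricsv1beta1.PodMetricsList, error) {
	aggregated := &metricsv1beta1.PodMetricsList{}
	for _, cfg := range clusterConfigs {
		client, err := metricsclient.NewForConfig(cfg)
		if err != nil {
			return nil, err
		}
		list, err := client.MetricsV1beta1().PodMetricses(namespace).List(ctx,
			metav1.ListOptions{LabelSelector: labelSelector})
		if err != nil {
			return nil, err
		}
		aggregated.Items = append(aggregated.Items, list.Items...)
	}
	return aggregated, nil
}
```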
I have two clusters, A and B; A has 100 available pods and B has 50 available pods. What I expect is:
- Set the min/max replicas for the sum of the pods in cluster A and B. - (meets story 1/4)
- When cluster B goes down, the scaled replicas in B will be migrated to A and the new scaled replicas will be scheduled to A. - (meets story 3)
- When the replicas are scaled up, 2/3 of the replicas will be scheduled to A, 1/3 of the replicas will be scheduled to B. - (meets story 2)
So the following example configuration could meet the requirements:
```yaml
apiVersion: autoscaling.karmada.io/v1alpha1
kind: FederatedHPA
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
  minReplicas: 1
  maxReplicas: 60
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
---
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
spec:
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    name: nginx
  placement:
    clusterAffinity:
      clusterNames:
      - A
      - B
    replicaScheduling:
      replicaDivisionPreference: Weighted
      replicaSchedulingType: Divided
      weightPreference:
        dynamicWeight: AvailableReplicas
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  ...
```
- With this configuration, the number of nginx Deployment pods across the clusters will be kept between 1 and 60.
- When the Karmada HPA controller scales up the nginx Deployment, say to 15 pods, `karmada-scheduler` will schedule 10 of those pods to cluster A and 5 of them to cluster B (an illustrative division sketch follows this list).
- When the Karmada HPA controller scales down the nginx Deployment, for instance by removing 15 pods, `karmada-scheduler` will scale down 10 pods in cluster A and 5 pods in cluster B.
- When cluster A goes down, its replicas will be migrated to cluster B by `karmada-scheduler`.
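For illustration only, the 10/5 split above can be reproduced with a simple largest-remainder division proportional to available replicas; this is a sketch under that assumption, not karmada-scheduler's actual algorithm.

```go
package scheduler

import "sort"

// divideByAvailableReplicas splits the desired replica count across clusters
// in proportion to their available replicas, handing leftover replicas to the
// clusters with the largest fractional remainders. Illustrative only.
func divideByAvailableReplicas(desired int32, available map[string]int32) map[string]int32 {
	result := make(map[string]int32, len(available))
	var sum int32
	for _, a := range available {
		sum += a
	}
	if sum == 0 {
		return result
	}
	type share struct {
		name string
		frac float64
	}
	shares := make([]share, 0, len(available))
	var assigned int32
	for name, a := range available {
		exact := float64(desired) * float64(a) / float64(sum)
		base := int32(exact)
		result[name] = base
		assigned += base
		shares = append(shares, share{name: name, frac: exact - float64(base)})
	}
	// Distribute any remaining replicas to the largest remainders first.
	sort.Slice(shares, func(i, j int) bool { return shares[i].frac > shares[j].frac })
	for i := 0; assigned < desired; i++ {
		result[shares[i%len(shares)].name]++
		assigned++
	}
	return result
}
```

For example, `divideByAvailableReplicas(15, map[string]int32{"A": 100, "B": 50})` returns `{A: 10, B: 5}`, matching the walk-through above.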
To maintain high availability for FederatedHPA, the Karmada control plane should be deployed in a highly available manner, such as across multiple zones. If the Karmada control plane fully goes down, workloads can no longer be scaled.
This feature should be implemented in three stages:

- Implement the HPA controller in `karmada-controller-manager` to scale the workloads based on CPU/Memory metrics.
- Implement `karmada-metrics-adapter` to query metrics from member clusters and perform aggregation.
- Implement scaling the workloads based on custom metrics and external metrics.
- Complete documentation for the 3 deployment modes.
- All existing tests should pass, and no breaking changes will be introduced by this feature.
- Add new E2E test cases to cover the new feature:
  - Scaling across clusters with different PropagationPolicy configurations, including scaling up and down.
  - Scaling across clusters with different metrics.
  - Scaling across clusters with different architectures.
As per the above content, it is evident that this implementation follows a centralized HPA approach. The alternative is a decentralized HPA, which will be described in another proposal. The functional differences are:
| | Decentralized HPA | Centralized HPA |
|---|---|---|
| Karmada control plane load | Low [1] | High [2] |
| Bandwidth requirement | Low [3] | High [4] |
| Scaling accuracy | Low [5] | High [6] |
| Scaling strategy | Prioritized [7] | Follows policy [8] |
- [1] The replica calculation is done by the member clusters, so the load on the Karmada control plane is relatively low.
- [2] The replica calculation is done by the Karmada control plane, so the load on the Karmada control plane is relatively high.
- [3] The Karmada control plane does not need to query metrics from member clusters, so the bandwidth requirement is relatively low.
- [4] The Karmada control plane needs to query metrics from member clusters, so the bandwidth requirement is high.
- [5] The scaling operation is done by the member clusters, and a member cluster cannot see the resources of other clusters, so scaling is relatively rough.
- [6] The scheduling of scaled replicas is based on the real-time state of the member clusters, so scaling is precise and accurate.
- [7] The API definition of decentralized HPA only provides a priority-based scaling policy.
- [8] The scaling strategy follows the PropagationPolicy: Duplicated, StaticWeighted, DynamicWeighted, and Aggregated.
The dependency difference is:
| | Decentralized HPA | Centralized HPA |
|---|---|---|
| HPA controller is enabled in member clusters | Yes [1] | No [2] |
| `karmada-metrics-adapter` is installed in the Karmada control plane | No [3] | Yes [4] |
- [1] Decentralized HPA propagates the HPA resource to member clusters and the scaling operation is executed by the member clusters, so the HPA controller must be enabled in the member clusters.
- [2] Centralized HPA scales the deployment in the control plane and the replicas are then scheduled to member clusters, so the HPA controller does not need to be enabled in the member clusters.
- [3] The HPA controller of each member cluster queries metrics within that cluster and no metrics aggregation is needed, so `karmada-metrics-adapter` does not need to be installed.
- [4] The Karmada control plane queries metrics from all member clusters and aggregation is needed, so `karmada-metrics-adapter` needs to be installed.
The recommendation:
The first consideration is resource limitations:

- If the bandwidth is limited and the application is large-scale, you have to choose decentralized HPA.
- If the bandwidth is sufficient, you can use either centralized or decentralized HPA.

The second consideration is the scaling strategy:

- If you want to scale the workload in a prioritized order, you have to choose decentralized HPA.
- If you want to scale the workload with the policy defined in PropagationPolicy, you have to choose centralized HPA.