From 359d690b3a3f6611da081961eb88a1528317ff82 Mon Sep 17 00:00:00 2001
From: qmhu
Date: Wed, 30 Nov 2022 23:14:52 +0800
Subject: [PATCH] english docs

---
 .../Recommendation/idlenode-recommendation.md |   8 +-
 .../Recommendation/replicas-recommendation.md | 130 +++++++++++--
 .../Recommendation/resource-recommendation.md | 178 +++++++++++++++++-
 .../Recommendation/replicas-recommendation.md |   4 +-
 .../Recommendation/resource-recommendation.md |   8 +-
 5 files changed, 290 insertions(+), 38 deletions(-)

diff --git a/site/content/en/docs/Tutorials/Recommendation/idlenode-recommendation.md b/site/content/en/docs/Tutorials/Recommendation/idlenode-recommendation.md
index 33e930fa7..5b8d6d684 100644
--- a/site/content/en/docs/Tutorials/Recommendation/idlenode-recommendation.md
+++ b/site/content/en/docs/Tutorials/Recommendation/idlenode-recommendation.md
@@ -10,7 +10,7 @@ By scanning the status and utilization of nodes, the idle node recommendation he

 In Kubernetes cluster, some nodes often idle due to such factors as node taint, label selector, low packing rate and low utilization rate, which wastes a lot of costs. IdleNode recommendation tries to help users find these nodes to reduce cost.

-## Example
+## Sample

 ```yaml
 kind: Recommendation
@@ -48,15 +48,17 @@ status:
   lastUpdateTime: '2022-11-30T07:46:57Z'
 ```

-In this example:
+In this sample:

 - Recommendation's TargetRef Point to Node:worker-node-1
 - Recommendation type is IdleNode
 - action is Delete,but offline a node is a complicated operation, we only give recommended advise.

+To create an IdleNode recommendation, refer to: [**Recommendation Framework**](/zh-cn/docs/tutorials/recommendation/recommendation-framework)
+
 ## Implement

 Perform the following steps to complete a recommendation process for idle nodes:

 1. Scan all nodes and pods in the cluster
-2. 
If all Pods on a node are DaemonSet pods, the node is considered to be idle
diff --git a/site/content/en/docs/Tutorials/Recommendation/replicas-recommendation.md b/site/content/en/docs/Tutorials/Recommendation/replicas-recommendation.md
index d071859f6..c4361dff0 100644
--- a/site/content/en/docs/Tutorials/Recommendation/replicas-recommendation.md
+++ b/site/content/en/docs/Tutorials/Recommendation/replicas-recommendation.md
@@ -6,44 +6,136 @@ weight: 13

 Kubernetes' users often set the replicas based on empirical values when creating application resources. Based on the replicas recommendation, you can analyze the actual application usage and recommend a more suitable replicas configuration. You can use it to improve the resource utilization of the cluster.

-## Implement
+## Motivation
+
+The replicas field of a Kubernetes workload lets you control the number of Pods and scale quickly. However, choosing a reasonable replica count has always been a problem for application administrators: too many replicas waste resources, while too few may cause stability problems.
+
+The community HPA provides dynamic autoscaling based on real-time metrics, and Crane's EffectiveHPA adds prediction-driven autoscaling on top of HPA. In practice, however, only some workloads can scale horizontally all the time; many workloads require a fixed number of Pods.
+
+The figure below shows a workload with low utilization: 30% of its resources are wasted between the Pod's peak historical usage and its Request.
+
+![Resource Waste](/images/resource-waste.jpg)
+
+Replicas recommendation analyzes historical usage to make it easier to determine a suitable replica count for a workload.
+
+## Sample
+
+A sample Replicas recommendation YAML looks like this:
+
+```yaml
+kind: Recommendation
+apiVersion: analysis.crane.io/v1alpha1
+metadata:
+  name: workloads-rule-replicas-p84jv
+  namespace: kube-system
+  labels:
+    addonmanager.kubernetes.io/mode: Reconcile
+    analysis.crane.io/recommendation-rule-name: workloads-rule
+    analysis.crane.io/recommendation-rule-recommender: Replicas
+    analysis.crane.io/recommendation-rule-uid: 18588495-f325-4873-b45a-7acfe9f1ba94
+    k8s-app: kube-dns
+    kubernetes.io/cluster-service: 'true'
+    kubernetes.io/name: CoreDNS
+  ownerReferences:
+    - apiVersion: analysis.crane.io/v1alpha1
+      kind: RecommendationRule
+      name: workloads-rule
+      uid: 18588495-f325-4873-b45a-7acfe9f1ba94
+      controller: false
+      blockOwnerDeletion: false
+spec:
+  targetRef:
+    kind: Deployment
+    namespace: kube-system
+    name: coredns
+    apiVersion: apps/v1
+  type: Replicas
+  completionStrategy:
+    completionStrategyType: Once
+  adoptionType: StatusAndAnnotation
+status:
+  recommendedValue:
+    replicasRecommendation:
+      replicas: 1
+  targetRef: { }
+  recommendedInfo: '{"spec":{"replicas":1}}'
+  currentInfo: '{"spec":{"replicas":2}}'
+  action: Patch
+  conditions:
+    - type: Ready
+      status: 'True'
+      lastTransitionTime: '2022-11-28T08:07:36Z'
+      reason: RecommendationReady
+      message: Recommendation is ready
+  lastUpdateTime: '2022-11-29T11:07:45Z'
+```

-Based on the historical Workload CPU loads, find the workload's lowest CPU usage per hour in the past seven days, and calculate the replicas with 50% (configurable) cpu usage that should be configured
+In this sample:

-### Filter Phase
+- The Recommendation's TargetRef points to a Deployment in the kube-system namespace: coredns
+- The Recommendation type is Replicas
+- adoptionType is StatusAndAnnotation, which means the recommendation result is written to recommendation.status and to the Deployment's Annotation
+- recommendedInfo shows the recommended replicas (recommendedValue is deprecated); currentInfo shows the current replicas. The format is JSON
that can be applied to the TargetRef with `kubectl patch`
+ TargetRef

-1. workload with low replicas: If the replicas is too low, it may not have high recommendation demand. Associated configuration: 'workload-min-replicas'
-2. There is a certain percentage of the not running pods for workload: if the Pod of workload mostly can't run normally, may not be suitable for recommendation, associated configuration: `pod-min-ready-seconds` | `pod-available-ratio`
+To create a Replicas recommendation, refer to: [**Recommendation Framework**](/docs/tutorials/recommendation/recommendation-framework)

-### Prepare Phase
+## Implement

-Query the workload cpu usage in the past week.
+The process for one Replicas recommendation:

-### Recommend Phase
+1. Query the historical CPU and memory usage of the workload for the past week from the monitoring system.
+2. Use the DSP algorithm to predict future CPU usage.
+3. Calculate the replicas for both CPU and memory, then choose the larger one.

-1. Calculate the lowest value of the median workload usage per hour in the past seven days (to prevent the impact of the minimum value): workload_cpu_usage_medium_min
-2. The number of replicas corresponding to the target utilization:
+### Algorithm
+
+Take CPU usage as an example. Assume that the P99 of the workload's historical CPU usage is 10 cores, the Pod CPU Request is 5 cores, and the target peak utilization is 50%. Then 4 pods (10 / 50% / 5 = 4) meet the target peak utilization.

 ```go
-  replicas := int32(math.Ceil(workload_cpu_usage_medium_min / (rr.TargetUtilization * float64(requestTotal) / 1000.)))
+  replicas := int32(math.Ceil(workloadUsage / (TargetUtilization * float64(requestTotal))))
 ```

-3. In order to prevent too low replicas,replicas should be larger than or equal to default-min-replicas
+### Abnormal workloads
+
+Recommendations are not generated for the following types of abnormal workloads:
+
+1. 
Workloads with low replicas: if the replica count is too low, there is little demand for a recommendation. Associated configuration: `workload-min-replicas`
+2. Workloads with too many non-running Pods: if most of a workload's Pods cannot run normally, the workload may not be suitable for recommendation. Associated configuration: `pod-min-ready-seconds` | `pod-available-ratio`

-### Observe Phase
+### Prometheus Metrics

 Record recommended replicas to Metric: crane_analytics_replicas_recommendation

+## How to verify the accuracy of recommendation results
+
+You can get the workload resource usage with the following Prometheus queries, then plug the usage into the algorithm above.
+
+The queries use the Deployment craned in crane-system as an example; replace the namespace and pod pattern with your own.
+
+```shell
+sum(irate(container_cpu_usage_seconds_total{namespace="crane-system",pod=~"^craned-.*$",container!=""}[3m])) # cpu usage
+```
+
+```shell
+sum(container_memory_working_set_bytes{namespace="crane-system",pod=~"^craned-.*$",container!=""}) # memory usage
+```
+
 ## Accepted resources

 Support StatefulSet and Deployment by default,but all workloads that support `Scale SubResource` are supported.
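The replica calculation described in the Algorithm section of this patch can be sketched in Go as follows. This is a minimal illustration rather than Crane's actual implementation: the function names `replicasFor` and `recommendReplicas` are hypothetical, the request is taken per Pod as in the worked example (10 / 50% / 5 = 4), and clamping to a minimum replica count is an assumption about how `default-min-replicas` might be applied.

```go
package main

import (
	"fmt"
	"math"
)

// replicasFor returns the number of Pods needed so that usage divided by the
// total request stays at or below targetUtilization.
// usage and requestPerPod must use the same unit (cores or bytes).
func replicasFor(usage, requestPerPod, targetUtilization float64) int32 {
	return int32(math.Ceil(usage / (targetUtilization * requestPerPod)))
}

// recommendReplicas (hypothetical helper) computes replicas for CPU and
// memory separately, keeps the larger one, and never goes below minReplicas.
func recommendReplicas(cpuUsage, cpuRequest, memUsage, memRequest,
	cpuTarget, memTarget float64, minReplicas int32) int32 {
	replicas := replicasFor(cpuUsage, cpuRequest, cpuTarget)
	if byMem := replicasFor(memUsage, memRequest, memTarget); byMem > replicas {
		replicas = byMem
	}
	if replicas < minReplicas {
		replicas = minReplicas
	}
	return replicas
}

func main() {
	// Worked example from the docs: P99 usage 10 cores, request 5 cores, target 50%.
	fmt.Println(replicasFor(10, 5, 0.5))
	// CPU alone needs 4 Pods, memory needs more, so the memory result is used.
	fmt.Println(recommendReplicas(10, 5, 8e9, 1e9, 0.5, 0.5, 1))
}
```

Keeping the per-resource calculation in its own function makes it easy to check a recommendation by hand: feed in the usage returned by the Prometheus queries and compare the result with the recommended replicas.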
 ## Configuration

-| Configuration items    | Default | Description                                                         |
-|------------------------|---------|---------------------------------------------------------------------|
-| workload-min-replicas  | 1       | Workload replicas than less than this value are not recommended     |
-| pod-min-ready-seconds  | 30      | Defines the min seconds to identify Pod is ready                    |
+| Configuration items    | Default | Description                                                            |
+|------------------------|---------|------------------------------------------------------------------------|
+| workload-min-replicas  | 1       | Workloads with fewer replicas than this value are not recommended      |
+| pod-min-ready-seconds  | 30      | Minimum seconds for a Pod to be considered ready                       |
 | pod-available-ratio    | 0.5     | Workload ready Pod ratio that less than this value are not recommended |
-| default-min-replicas   | 1       | default minReplicas                                                 |
-| cpu-target-utilization | 0.5     | Calculate the minimum replicas based on this cpu utilization        |
+| default-min-replicas   | 1       | default minReplicas                                                    |
+| cpu-percentile         | 0.95    | Percentile for historical cpu usage                                    |
+| mem-percentile         | 0.95    | Percentile for historical memory usage                                 |
+| cpu-target-utilization | 0.5     | Target utilization for peak historical CPU usage                       |
+| mem-target-utilization | 0.5     | Target utilization for peak historical memory usage                    |
+
+To update the recommendation configuration, refer to: [**Recommendation Framework**](/docs/tutorials/recommendation/recommendation-framework)
diff --git a/site/content/en/docs/Tutorials/Recommendation/resource-recommendation.md b/site/content/en/docs/Tutorials/Recommendation/resource-recommendation.md
index 5775ea732..ade9256be 100644
--- a/site/content/en/docs/Tutorials/Recommendation/resource-recommendation.md
+++ b/site/content/en/docs/Tutorials/Recommendation/resource-recommendation.md
@@ -6,25 +6,142 @@ weight: 14

 Kubernetes' users often config request and limit based on empirical values when creating application resources.
Based on the resource recommendation algorithm, you can analyze the actual application usage and recommend more appropriate resource configurations. You can use the resource recommendation algorithm to improve the resource utilization of the cluster.

+## Motivation
+
+In Kubernetes, Request defines the minimum amount of resources a Pod requires, Limit defines the maximum amount of resources a Pod can use, and a workload's Utilization = Resource Usage / Request. There are two types of unreasonable resource utilization:
+
+- Utilization too low: The Request is often set too high because users do not know what resource specification meets the application's requirements and err on the high side. This wastes a lot of resources.
+- Utilization too high: Caused by service pressure from peak traffic or by incorrect resource configuration. If CPU usage is too high, latency may increase. If memory usage exceeds the Limit, the container is killed, which affects service stability.
+
+The figure below shows a workload with low utilization: 30% of its resources are wasted between the Pod's peak historical usage and its Request.
+
+![Resource Waste](/images/resource-waste.jpg)
+
+Resource recommendation analyzes historical usage to make it easier to determine a suitable Request for a workload.
+
+## Sample
+
+A Resource recommendation sample yaml looks like below:
+
+```yaml
+kind: Recommendation
+apiVersion: analysis.crane.io/v1alpha1
+metadata:
+  name: workloads-rule-resource-flzbv
+  namespace: crane-system
+  labels:
+    analysis.crane.io/recommendation-rule-name: workloads-rule
+    analysis.crane.io/recommendation-rule-recommender: Resource
+    analysis.crane.io/recommendation-rule-uid: 18588495-f325-4873-b45a-7acfe9f1ba94
+    analysis.crane.io/recommendation-target-kind: Deployment
+    analysis.crane.io/recommendation-target-name: load-test
+    analysis.crane.io/recommendation-target-version: v1
+    app: craned
+    app.kubernetes.io/instance: crane
+    app.kubernetes.io/managed-by: Helm
+    app.kubernetes.io/name: crane
+    app.kubernetes.io/version: v0.7.0
+    helm.sh/chart: crane-0.7.0
+  ownerReferences:
+    - apiVersion: analysis.crane.io/v1alpha1
+      kind: RecommendationRule
+      name: workloads-rule
+      uid: 18588495-f325-4873-b45a-7acfe9f1ba94
+      controller: false
+      blockOwnerDeletion: false
+spec:
+  targetRef:
+    kind: Deployment
+    namespace: crane-system
+    name: craned
+    apiVersion: apps/v1
+  type: Resource
+  completionStrategy:
+    completionStrategyType: Once
+  adoptionType: StatusAndAnnotation
+status:
+  recommendedValue:
+    resourceRequest:
+      containers:
+        - containerName: craned
+          target:
+            cpu: 150m
+            memory: 256Mi
+        - containerName: dashboard
+          target:
+            cpu: 150m
+            memory: 256Mi
+  targetRef: {}
+  recommendedInfo: >-
+    {"spec":{"template":{"spec":{"containers":[{"name":"craned","resources":{"requests":{"cpu":"150m","memory":"256Mi"}}},{"name":"dashboard","resources":{"requests":{"cpu":"150m","memory":"256Mi"}}}]}}}}
+  currentInfo: >-
+    {"spec":{"template":{"spec":{"containers":[{"name":"craned","resources":{"requests":{"cpu":"500m","memory":"512Mi"}}},{"name":"dashboard","resources":{"requests":{"cpu":"200m","memory":"256Mi"}}}]}}}}
+  action: Patch
+  conditions:
+    - type: Ready
+      status: 'True'
+      lastTransitionTime: '2022-11-29T04:07:44Z'
+      reason: RecommendationReady
+      
message: Recommendation is ready
+  lastUpdateTime: '2022-11-30T03:07:49Z'
+```
+
+In this sample:
+
+- The Recommendation's TargetRef points to a Deployment in the crane-system namespace: craned
+- The Recommendation type is Resource
+- adoptionType is StatusAndAnnotation, which means the recommendation result is written to recommendation.status and to the Deployment's Annotation
+- recommendedInfo shows the recommended requests for containers (recommendedValue is deprecated); currentInfo shows the current requests for containers. The format is JSON that can be applied to the TargetRef with `kubectl patch`
+
+To create a Resource recommendation, refer to: [**Recommendation Framework**](/docs/tutorials/recommendation/recommendation-framework)
+
 ## Implement

-The algorithm model adopts VPA's Moving Window algorithm for recommendation
+The process for one Resource recommendation:
+
+1. Query the historical CPU and memory usage of the workload for the past week from the monitoring system.
+2. Take the P99 usage from the VPA Histogram, then multiply it by the amplification factor.
+3. OOM protection: if the container has a history of OOM events, increase the recommended memory based on the memory used when the OOM happened.
+4. Resource specifications: round the recommended result up to the specified Pod specifications.

-1. By monitoring system, you can obtain the Workload (configurable) CPU and Memory usage history in the past week.
-2. The algorithm considers the timeliness of data. The newer data sampling points will have higher weight.
-3. The recommended CPU value is calculated based on the target percentile value that set by the user, and the recommended Memory value is calculated based on the maximum historical value
+To sum up, the Request is set slightly higher than the historical maximum based on historical resource usage, taking OOM events and Pod specifications into account.
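Steps 2 to 4 of the process above can be sketched in Go as follows. This is a hedged illustration, not Crane's implementation: the `Spec` type and the `recommend` function are hypothetical names, the amplification factor is passed in as a single multiplier, the OOM rule (use the memory at OOM time multiplied by a bump ratio as a lower bound) is an assumption based on the `oom-bump-ratio` setting, and the specification list contains only the first few rows of the default specification table.

```go
package main

import "fmt"

// Spec is one predefined Pod specification: CPU in cores, memory in GiB.
type Spec struct {
	CPU    float64
	MemGiB float64
}

// recommend (hypothetical) amplifies the P99 usage, applies OOM protection to
// memory, and rounds the result up to the first specification that fits.
// specs must be sorted ascending. If nothing fits, the unrounded values are
// returned together with false.
func recommend(cpuP99, memP99, factor, oomMem, bumpRatio float64, specs []Spec) (Spec, bool) {
	cpu := cpuP99 * factor
	mem := memP99 * factor
	// OOM protection (assumed rule): never recommend less than the bumped OOM memory.
	if bumped := oomMem * bumpRatio; bumped > mem {
		mem = bumped
	}
	for _, s := range specs {
		if s.CPU >= cpu && s.MemGiB >= mem {
			return s, true
		}
	}
	return Spec{CPU: cpu, MemGiB: mem}, false
}

func main() {
	specs := []Spec{{0.25, 0.25}, {0.25, 0.5}, {0.25, 1}, {0.5, 0.5}, {0.5, 1}, {1, 1}}
	// No OOM history: 0.1 core and 0.3 GiB P99, amplified by 1.25.
	fmt.Println(recommend(0.1, 0.3, 1.25, 0, 1.2, specs))
	// Same usage, but an OOM was recorded at 0.5 GiB.
	fmt.Println(recommend(0.1, 0.3, 1.25, 0.5, 1.2, specs))
}
```

In the first call the amplified CPU of 0.125 core rounds up to the 0.25 core specification, matching the rounding example in the Resource Specification section; in the second call the recorded OOM pushes the memory to the next larger specification.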
-### Filter Phase
+### VPA Algorithm

-Workloads that have no Pods: If the workload does not have Pods, analysis cannot be performed
+The core of resource recommendation is to recommend a reasonable resource request based on historical resource consumption. We adopt the community VPA Histogram algorithm to implement it. The VPA algorithm puts the historical resource consumption into a histogram, finds the P99 of resource consumption, and multiplies it by the amplification factor to get the recommended value.

-### Recommend Phase
+The output of the VPA algorithm is the P99 consumption of CPU and memory. To reserve a buffer for the application, the result is multiplied by an amplification factor. You can configure the amplification factor in either of the following ways:

-Adopt VPA Moving Window algorithm to calculate CPU and Memory for every container and give recommendation config.
+1. Margin fraction: recommended result = P99 usage * (1 + margin fraction). Corresponding configuration: cpu-request-margin-fraction and mem-request-margin-fraction
+2. Target utilization: recommended result = P99 usage / target utilization. Corresponding configuration: cpu-target-utilization and mem-target-utilization

-### Observe Phase
+When you have a target peak utilization for the application, it is recommended to use the **target utilization** way to amplify the recommendation results.

-Record recommended resource to Metric:crane_analytics_replicas_recommendation
+### OOM Protection
+
+Craned runs a component, OOMRecorder, which records container OOM events in the cluster. The resource recommendation reads the events from OOMRecorder to obtain the memory usage at OOM time, and makes sure the recommended memory is larger than the value when the OOM happened.
+
+### Resource Specification
+
+In Serverless Kubernetes, the CPU and memory specifications of the Pod are predefined.
The resource recommendation can be rounded up according to the predefined resource specifications. For example, a recommended CPU value of 0.125 core based on historical usage is rounded up to 0.25 core. You can also modify the specifications to meet the requirements of your environment.
+
+### Prometheus Metric
+
+Record recommended resource to Metric:crane_analysis_resource_recommendation
+
+## How to verify the accuracy of recommendation results
+
+You can use the following Prometheus queries to obtain the resource usage of a workload's container. The recommended value is slightly higher than the historical maximum, taking OOM events and Pod specifications into account.
+
+The queries use the Deployment craned in crane-system as an example; replace the container and namespace with your own.
+
+```shell
+irate(container_cpu_usage_seconds_total{container!="POD",namespace="crane-system",pod=~"^craned.*$",container="craned"}[3m]) # cpu usage
+```
+
+```shell
+container_memory_working_set_bytes{container!="POD",namespace="crane-system",pod=~"^craned.*$",container="craned"} # memory usage
+```

 ## Accepted Resources

@@ -44,3 +161,44 @@ Support StatefulSet and Deployment by default,but all workloads that support `
 | mem-request-margin-fraction | 0.15 | Memory recommend value margin factor,0.15 means recommended value = recommended value * 1.15 |
 | mem-target-utilization | 1 | Memory target utilization,0.8 means recommended value = recommended value / 0.8 |
 | mem-model-history-length | 168h | Historical length for Memory monitoring data |
+| specification | false | Enable resource specification |
+| specification-config | "" | Resource specifications configuration |
+| oom-protection | true | Enable OOM protection |
+| oom-history-length | 168h | OOM event history length; older events are ignored |
+| oom-bump-ratio | 1.2 | OOM memory bump up ratio |
+
+To update the recommendation configuration, refer to: [**Recommendation
Framework**](/docs/tutorials/recommendation/recommendation-framework) + +## Default resource specifications configuration + +| CPU(Cores) | Memory(GBi) | +|------------|-------------| +| 0.25 | 0.25 | +| 0.25 | 0.5 | +| 0.25 | 1 | +| 0.5 | 0.5 | +| 0.5 | 1 | +| 1 | 1 | +| 1 | 2 | +| 1 | 4 | +| 1 | 8 | +| 2 | 2 | +| 2 | 4 | +| 2 | 8 | +| 2 | 16 | +| 4 | 4 | +| 4 | 8 | +| 4 | 16 | +| 4 | 32 | +| 8 | 8 | +| 8 | 16 | +| 8 | 32 | +| 8 | 64 | +| 16 | 32 | +| 16 | 64 | +| 16 | 128 | +| 32 | 64 | +| 32 | 128 | +| 32 | 256 | +| 64 | 128 | +| 64 | 256 | diff --git a/site/content/zh/docs/Tutorials/Recommendation/replicas-recommendation.md b/site/content/zh/docs/Tutorials/Recommendation/replicas-recommendation.md index 4ca4ac79d..b3de5af8a 100644 --- a/site/content/zh/docs/Tutorials/Recommendation/replicas-recommendation.md +++ b/site/content/zh/docs/Tutorials/Recommendation/replicas-recommendation.md @@ -90,10 +90,10 @@ status: ### 计算副本算法 -以 CPU 举例,假设工作负载 CPU 历史用量的 P99 是10核,Pod CPU Request 是5核,目标峰值利用率是50%,可知副本数是4个可以满足峰值利用率是50%。 +以 CPU 举例,假设工作负载 CPU 历史用量的 P99 是10核,Pod CPU Request 是5核,目标峰值利用率是50%,可知副本数是4个可以满足峰值利用率不小于50%。 ```go - replicas := int32(math.Ceil(workloadUsage / (TargetUtilization * float64(requestTotal) / 1000.))) + replicas := int32(math.Ceil(workloadUsage / (TargetUtilization * float64(requestTotal) ))) ``` ### 排除异常的工作负载 diff --git a/site/content/zh/docs/Tutorials/Recommendation/resource-recommendation.md b/site/content/zh/docs/Tutorials/Recommendation/resource-recommendation.md index 44bc09e4f..8bc2f349c 100644 --- a/site/content/zh/docs/Tutorials/Recommendation/resource-recommendation.md +++ b/site/content/zh/docs/Tutorials/Recommendation/resource-recommendation.md @@ -8,7 +8,7 @@ Kubernetes 用户在创建应用资源时常常是基于经验值来设置 reque ## 动机 -Kubernetes 中 Request 定义了 Pod 运行需要的最小资源量,Limit 定义了 Pod 运行可使用的最大资源量,应用的资源利用率 Utilization = Request / 资源用量 Usage。不合理的资源利用率有以下两种情况: +Kubernetes 中 Request 定义了 Pod 运行需要的最小资源量,Limit 定义了 Pod 运行可使用的最大资源量,应用的资源利用率 Utilization = 资源用量 Usage / Request 
。不合理的资源利用率有以下两种情况: - 利用率过低:因为不清楚配置多少资源规格可以满足应用需求,或者是为了应对高峰流量时的资源消耗诉求,常常将 Request 设置得较大,这样就导致了过低的利用率,造成了浪费。 - 利用率过高:由于高峰流量的业务压力,或者错误的资源配置,导致利用率过高,CPU 利用率过高时会引发更高的业务延时,内存利用率过高超过 Limit 会导致 Container 被 OOM Kill,影响业务的稳定。 @@ -112,8 +112,8 @@ status: VPA 算法的 output 是 cpu、内存指标的 P99 用量。为了给应用预留 buffer,推荐结果还会乘以放大系数。资源推荐支持两种方式配置放大系数: -- 扩大比例:推荐结果=P99用量 * (1 + 放大系数),对应配置:cpu-request-margin-fraction 和 mem-request-margin-fraction -- 目标峰值利用率:以 P99用量为目标峰值用量计算,推荐结果=P99用量/目标峰值利用率,对应配置:cpu-target-utilization 和 mem-target-utilization +1. 扩大比例:推荐结果=P99用量 * (1 + 放大系数),对应配置:cpu-request-margin-fraction 和 mem-request-margin-fraction +2. 目标峰值利用率:推荐结果=P99用量/目标峰值利用率,对应配置:cpu-target-utilization 和 mem-target-utilization 在您有应用的目标峰值利用率目标时,推荐使用**目标峰值利用率**方式放大推荐结果。 @@ -162,7 +162,7 @@ container_memory_working_set_bytes{container!="POD",namespace="crane-system",pod | mem-target-utilization | 1 | Memory 目标利用率,0.8 指推荐值除以 0.8 | | specification | false | 是否开启资源规格规整 | | specification-config | "" | 资源规格,注意格式,详细的默认配置请见下方表格 | -| oom-protection | true | 是否开启资源规格规整 | +| oom-protection | true | 是否开启 OOM 保护 | | oom-history-length | 168h | OOM 历史事件的事件,过期事件会被忽略 | | oom-bump-ratio | 1.2 | OOM 内存放大系数 |