Add qos doc to site,readme and introduction
kaiyuechen committed Oct 14, 2022
1 parent dc22027 commit 710c296
Showing 24 changed files with 1,012 additions and 791 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -46,8 +46,9 @@ EffectiveHorizontalPodAutoscaler supports prediction-driven autoscaling. With th

Provide a simple but efficient scheduler that schedules pods based on actual node utilization data and filters out nodes with high load to balance the cluster. [learn more](docs/tutorials/scheduling-pods-based-on-actual-node-load.md).

**Colocation with Enhanced QoS**
**Colocation with Enhanced QOS**

QOS-related capabilities ensure the stability of pods running on Kubernetes. Crane detects interference from multi-dimensional metrics and actively avoids it, supporting precise operations and custom metric integration; it oversells elastic resources with the help of prediction algorithms, reusing and limiting the idle resources in the cluster; and it provides enhanced bypass cpuset management, improving resource utilization while binding cores. [learn more](docs/tutorials/using-qos-ensurance.md).

## Architecture

3 changes: 2 additions & 1 deletion README_zh.md
@@ -46,8 +46,9 @@ EffectiveHorizontalPodAutoscaler supports prediction-driven autoscaling. It is based on…

The dynamic scheduler builds a simple but efficient model from actual node utilization and filters out heavily loaded nodes to balance the cluster. [learn more](docs/tutorials/scheduling-pods-based-on-actual-node-load.zh.md)

**Colocation based on QoS**
**Colocation based on QOS**

QOS-related capabilities ensure the stability of pods running on Kubernetes: interference detection and active avoidance based on multi-dimensional metrics, with support for precise operations and custom metric integration; elastic resource overselling enhanced by prediction algorithms, reusing and limiting the idle resources in the cluster; and enhanced bypass cpuset management, improving resource utilization while binding cores. [learn more](docs/tutorials/using-qos-ensurance.zh.md)

## Architecture

@@ -83,9 +83,6 @@ metadata:
spec:
  allowedActions:
  - disablescheduling
  resourceQOS:
    cpuQOS:
      cpuPriority: 7
  labelSelector:
    matchLabels:
      preemptible_job: "true"
3 changes: 2 additions & 1 deletion site/content/en/docs/Getting started/introduction.md
@@ -39,8 +39,9 @@ EffectiveHorizontalPodAutoscaler supports prediction-driven autoscaling. With th

Provide a simple but efficient scheduler that schedules pods based on actual node utilization data and filters out nodes with high load to balance the cluster. [learn more](/docs/tutorials/scheduling-pods-based-on-actual-node-load).

**Colocation with Enhanced QoS**
**Colocation with Enhanced QOS**

QOS-related capabilities ensure the stability of pods running on Kubernetes. Crane detects interference from multi-dimensional metrics and actively avoids it, supporting precise operations and custom metric integration; it oversells elastic resources with the help of prediction algorithms, reusing and limiting the idle resources in the cluster; and it provides enhanced bypass cpuset management, improving resource utilization while binding cores. [learn more](/docs/tutorials/using-qos-ensurance.md).

## Architecture

7 changes: 7 additions & 0 deletions site/content/en/docs/Tutorials/QOS/_index.md
@@ -0,0 +1,7 @@

---
title: "QOS"
weight: 9
description: >
  Introduction to QOS related capabilities.
---
@@ -0,0 +1,57 @@
---
title: "QoS: Accurately Perform Avoidance Actions"
description: "Accurately Perform Avoidance Actions"
weight: 21
---

## Accurately Perform Avoidance Actions
The following two mechanisms avoid over-operating on low-priority pods and close the gap between the metrics and the specified watermark faster, ensuring that high-priority services are not affected.
1. Sort pods

Crane implements some general sorting methods (more will be added later):

classAndPriority: compares the QOSClass and class value of two pods; QOSClass is compared first, then the class value. Pods with higher priority are ranked later and are operated on later.

runningTime: compares the running time of two pods; the one that has run longer is ranked later and has higher priority.

If you only need these two sorting strategies, you can use the default sorting method: it first compares the priority of the pods, then their consumption of the metric in question, and then their running time; as soon as one dimension yields a difference, the ordering of the pods is decided.

Taking the CPU usage metric as an example, metric-specific strategies are also extended on top of the general ones: the CPU usage sorter compares the priority of two pods first; if they are equal, it compares their CPU usage; if that is also equal, it compares their extended CPU resource usage, and finally their running time. As soon as one dimension differs, the result is returned: `orderedBy(classAndPriority, cpuUsage, extCpuUsage, runningTime).Sort(pods)`
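To picture the cascade, here is a minimal, self-contained sketch of such a multi-level sorter; the type and comparator names below are simplified stand-ins for Crane's implementation, not its actual code:

```go
package sorter

import "sort"

// pod is a simplified stand-in for Crane's podinfo.PodContext, used only in this sketch.
type pod struct {
	Priority    int32   // higher value = more important, so it is throttled/evicted later
	CPUUsage    float64 // current CPU consumption
	RunningTime int64   // seconds the pod has been running
}

// cmpFunc compares two pods: a negative result ranks p1 earlier (operated on first),
// a positive result ranks it later, and zero leaves the decision to the next comparator.
type cmpFunc func(p1, p2 *pod) int32

// multiSorter chains comparators; the first one returning a non-zero result decides the order.
type multiSorter struct {
	pods []pod
	cmps []cmpFunc
}

func orderedBy(cmps ...cmpFunc) *multiSorter { return &multiSorter{cmps: cmps} }

func (ms *multiSorter) Sort(pods []pod) { ms.pods = pods; sort.Sort(ms) }

func (ms *multiSorter) Len() int      { return len(ms.pods) }
func (ms *multiSorter) Swap(i, j int) { ms.pods[i], ms.pods[j] = ms.pods[j], ms.pods[i] }
func (ms *multiSorter) Less(i, j int) bool {
	p1, p2 := &ms.pods[i], &ms.pods[j]
	for _, cmp := range ms.cmps {
		if r := cmp(p1, p2); r != 0 {
			return r < 0
		}
	}
	return false
}

// Example comparators: lower priority, higher CPU usage (an assumption for illustration)
// and shorter running time rank a pod earlier.
func classAndPriority(p1, p2 *pod) int32 { return p1.Priority - p2.Priority }

func cpuUsage(p1, p2 *pod) int32 {
	switch {
	case p1.CPUUsage > p2.CPUUsage:
		return -1
	case p1.CPUUsage < p2.CPUUsage:
		return 1
	default:
		return 0
	}
}

func runningTime(p1, p2 *pod) int32 {
	switch {
	case p1.RunningTime < p2.RunningTime:
		return -1
	case p1.RunningTime > p2.RunningTime:
		return 1
	default:
		return 0
	}
}
```

With these pieces, the default ordering described above corresponds to a call such as `orderedBy(classAndPriority, cpuUsage, runningTime).Sort(pods)`.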

2. Perform avoidance actions based on the watermark and pod usage
```go
// Divide all metrics that trigger the watermark threshold into two parts according to whether they can be quantified
metricsQuantified, metricsNotQuantified := ThrottleDownWaterLine.DivideMetricsByQuantified()
// If any metric cannot be quantified, get the throttleable metric with the highest ActionPriority and operate on all selected pods
if len(metricsNotQuantified) != 0 {
	highestPriorityMetric := ThrottleDownWaterLine.GetHighestPriorityThrottleAbleMetric()
	errPodKeys = throttlePods(ctx, &totalReleased, highestPriorityMetric)
} else {
	// Get the latest usage and the gap to the watermark
	ThrottleDownGapToWaterLines = buildGapToWaterLine(ctx.getStateFunc())
	// If the real-time usage of a metric that triggered the watermark threshold cannot be obtained,
	// choose the throttleable metric with the highest ActionPriority to suppress all selected pods
	if ThrottleDownGapToWaterLines.HasUsageMissedMetric() {
		highestPriorityMetric := ThrottleDownWaterLine.GetHighestPriorityThrottleAbleMetric()
		errPodKeys = throttlePods(ctx, &totalReleased, highestPriorityMetric)
	} else {
		// Traverse the quantifiable metrics that triggered the watermark: if the metric has its own sorting
		// method, use its SortFunc to sort the pods, otherwise sort them with GeneralSorter; then call its
		// ThrottleFunc on the pods and deduct the released amount of the metric until the gap to the watermark
		// no longer exists
		for _, m := range metricsQuantified {
			if m.SortAble {
				m.SortFunc(ThrottleDownPods)
			} else {
				GeneralSorter(ThrottleDownPods)
			}

			for !ThrottleDownGapToWaterLines.TargetGapsRemoved(m) {
				for index := range ThrottleDownPods {
					errKeys, released := m.ThrottleFunc(ctx, index, ThrottleDownPods, &totalReleased)
					errPodKeys = append(errPodKeys, errKeys...)
					ThrottleDownGapToWaterLines[m] -= released[m]
				}
			}
		}
	}
}
```
Extending user-defined metrics and sorting is introduced in "User-defined metrics interference detection, avoidance and user-defined sorting".
@@ -0,0 +1,55 @@
---
title: "QoS: Define your watermark"
description: "How to customized your watermark"
weight: 22
---

## User-defined metrics interference detection, avoidance and user-defined sorting
The use of user-defined metrics for interference detection and avoidance, and of user-defined sorting, follows the same process described in "Accurately Perform Avoidance Actions". This page describes how to customize your own metrics so they participate in the interference detection and avoidance process.

To better sort and precisely control metrics configured through NodeQoSEnsurancePolicy, the concept of attributes is introduced for metrics.

The attributes of a metric include the following, and custom metrics can implement these fields:

1. Name: the name of the metric; it should be consistent with the metric name collected in the collector module
2. ActionPriority: the priority of the metric; 0 is the lowest and 10 the highest
3. SortAble: whether pods can be sorted by this metric; if true, the corresponding SortFunc must be implemented
4. SortFunc: the corresponding sorting method; it can be composed from the general comparison methods combined with the metric's own comparison, as introduced in detail below
5. ThrottleAble: whether pods can be throttled for this metric. For the CPU usage metric there are corresponding throttling methods, but for memory usage a pod can only be evicted; effective throttling is not possible
6. ThrottleQuantified: whether the amount of the metric's resource released by throttling (or restoring) a pod can be accurately calculated; a metric that can be accurately quantified is called quantifiable, otherwise it is not.
For example, CPU usage can be throttled by limiting the cgroup quota, and the released CPU can be computed from the value before and after throttling; memory usage is not throttle-quantifiable, because memory has no throttle implementation, so the amount of memory released by throttling a pod cannot be measured accurately
7. ThrottleFunc: the concrete method that executes the throttle action; if throttling is not possible, the returned released amount is empty
8. RestoreFunc: the concrete method that restores a throttled pod; if restoring is not allowed, the returned released amount is empty
9. EvictAble, EvictQuantified and EvictFunc: the definitions for the evict action, analogous to those of the throttle action

```go
type metric struct {
	Name WaterLineMetric

	ActionPriority int

	SortAble bool
	SortFunc func(pods []podinfo.PodContext)

	ThrottleAble       bool
	ThrottleQuantified bool
	ThrottleFunc       func(ctx *ExecuteContext, index int, ThrottleDownPods ThrottlePods, totalReleasedResource *ReleaseResource) (errPodKeys []string, released ReleaseResource)
	RestoreFunc        func(ctx *ExecuteContext, index int, ThrottleUpPods ThrottlePods, totalReleasedResource *ReleaseResource) (errPodKeys []string, released ReleaseResource)

	EvictAble       bool
	EvictQuantified bool
	EvictFunc       func(wg *sync.WaitGroup, ctx *ExecuteContext, index int, totalReleasedResource *ReleaseResource, EvictPods EvictPods) (errPodKeys []string, released ReleaseResource)
}
```

After the metric is constructed, register it through registerMetricMap().

For metrics that need to be customized, flexible customized pod sorting can be achieved by combining the general sorting methods with the metric's own comparison, following the template below, where `<metric>` represents the customized metric and `<metric-sort-func>` represents its customized sorting strategy:

```go
func <metric>Sorter(pods []podinfo.PodContext) {
	orderedBy(classAndPriority, <metric-sort-func>, runningTime).Sort(pods)
}
```
Among them, the sorting method `<metric-sort-func>` with the following signature needs to be implemented:
`func (p1, p2 podinfo.PodContext) int32`
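As an illustration, a hypothetical `memBandwidth` metric could plug into this template as follows; the `MemBandwidth` field on `podinfo.PodContext` is assumed for the example and is not part of Crane, and the sketch assumes it lives in the same package as `orderedBy`, `classAndPriority` and `runningTime`:

```go
// memBandwidthCompare is the <metric-sort-func> for the hypothetical memBandwidth metric:
// the pod consuming more memory bandwidth ranks earlier and is operated on first.
func memBandwidthCompare(p1, p2 podinfo.PodContext) int32 {
	switch {
	case p1.MemBandwidth > p2.MemBandwidth: // MemBandwidth is an assumed field for this sketch
		return -1
	case p1.MemBandwidth < p2.MemBandwidth:
		return 1
	default:
		return 0
	}
}

// memBandwidthSorter follows the <metric>Sorter template above, combining the general
// comparators with the metric-specific one.
func memBandwidthSorter(pods []podinfo.PodContext) {
	orderedBy(classAndPriority, memBandwidthCompare, runningTime).Sort(pods)
}
```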
@@ -0,0 +1,81 @@
---
title: "QoS: Dynamic resource oversold and limit"
description: "How offline jobs use Crane"
weight: 20
---


## Dynamic resource overselling enhanced by prediction algorithms
To improve stability, users usually set the request value higher than the actual usage when deploying applications, which wastes resources. To raise node utilization, users colocate some besteffort applications that use the idle resources, realizing overselling.
However, because these applications lack resource limits, requests and related constraints, the scheduler may still place such pods on nodes that are already heavily loaded, which defeats the purpose; it is therefore better to schedule them based on the free resources of nodes.

Crane collects the idle resources of a node in the following two ways and combines them into the node's idle resource amount, improving the accuracy of the evaluation:

CPU is used as the example below; Crane also supports reclaiming idle memory resources.

1. CPU usage information collected locally

`nodeCpuCannotBeReclaimed := nodeCpuUsageTotal + exclusiveCPUIdle - extResContainerCpuUsageTotal`

exclusiveCPUIdle refers to the idle CPU on cores occupied by pods whose CPU manager policy is exclusive. Although this portion is idle, it cannot be reused because it is monopolized, so it is counted as used.

extResContainerCpuUsageTotal refers to the CPU consumed as the dynamic (extended) resource, which must be subtracted to avoid double counting.

2. A TSP (TimeSeriesPrediction) of node CPU usage, which is created automatically by default and predicts node CPU usage based on history:
```yaml
apiVersion: v1
data:
  spec: |
    predictionMetrics:
    - algorithm:
        algorithmType: dsp
        dsp:
          estimators:
            fft:
            - highFrequencyThreshold: "0.05"
              lowAmplitudeThreshold: "1.0"
              marginFraction: "0.2"
              maxNumOfSpectrumItems: 20
              minNumOfSpectrumItems: 10
          historyLength: 3d
          sampleInterval: 60s
      resourceIdentifier: cpu
      type: ExpressionQuery
      expressionQuery:
        expression: 'sum(count(node_cpu_seconds_total{mode="idle",instance=~"({{.metadata.name}})(:\\d+)?"}) by (mode, cpu)) - sum(irate(node_cpu_seconds_total{mode="idle",instance=~"({{.metadata.name}})(:\\d+)?"}[5m]))'
      predictionWindowSeconds: 3600
kind: ConfigMap
metadata:
  name: noderesource-tsp-template
  namespace: default
```
Crane combines the prediction with the current actual consumption to calculate the remaining available resources of the node and exposes them on the node as an extended resource. A pod can declare this extended resource so that, as an offline job, it uses the idle resources, improving the node's resource utilization.
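To make the combination concrete, here is a minimal, hypothetical sketch of how the elastic CPU amount could be derived from the two signals above; the function name, parameters and the exact combining formula are illustrative only, not Crane's verbatim implementation:

```go
package main

import "math"

// elasticCPU sketches how many CPU cores could be exposed as the extended resource
// gocrane.io/cpu. All inputs are in cores; predictedUsage comes from the node CPU TSP.
func elasticCPU(nodeCapacity, nodeCpuUsageTotal, exclusiveCPUIdle, extResContainerCpuUsageTotal, predictedUsage float64) float64 {
	// CPU that cannot be reclaimed: observed usage, plus idle CPU monopolized by exclusive pods,
	// minus usage already running on the elastic resource (to avoid double counting).
	nodeCpuCannotBeReclaimed := nodeCpuUsageTotal + exclusiveCPUIdle - extResContainerCpuUsageTotal

	// Be conservative: reserve whichever of the prediction and the current observation is larger.
	reserved := math.Max(predictedUsage, nodeCpuCannotBeReclaimed)

	if free := nodeCapacity - reserved; free > 0 {
		return free
	}
	return 0
}
```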
How to use:
When deploying a pod, set the limit and request using `gocrane.io/<$resourcename>: <$value>`, as follows:
```yaml
spec:
  containers:
  - image: nginx
    imagePullPolicy: Always
    name: extended-resource-demo-ctr
    resources:
      limits:
        gocrane.io/cpu: "2"
        gocrane.io/memory: "2000Mi"
      requests:
        gocrane.io/cpu: "2"
        gocrane.io/memory: "2000Mi"
```

## Elastic resource restriction function
Native besteffort applications lack any fair guarantee of resource usage. Crane limits the CPU usage of besteffort pods that use dynamic resources to the range they are allowed to use: the agent ensures that the actual consumption of a pod using extended resources does not exceed its declared limit and that, under CPU contention, it competes fairly according to the declared amount. Pods using elastic resources are also managed by the watermark function.
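As a rough illustration of how a declared elastic CPU limit can be enforced through the CPU cgroup, the sketch below converts a `gocrane.io/cpu` limit expressed in millicores into a CFS quota; this is a generic cgroup calculation assumed for illustration, not code copied from the crane-agent:

```go
// cfsQuotaMicroseconds converts an elastic CPU limit expressed in millicores into a
// cpu.cfs_quota_us value, assuming the default 100ms cpu.cfs_period_us.
func cfsQuotaMicroseconds(cpuLimitMilli int64) int64 {
	const cfsPeriodUs = 100000 // 100ms default CFS period
	if cpuLimitMilli <= 0 {
		return -1 // -1 means "no limit" in the cpu cgroup
	}
	// Example: gocrane.io/cpu: "2" -> 2000 millicores -> a quota of 200000us per 100000us period.
	return cpuLimitMilli * cfsPeriodUs / 1000
}
```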

How to use:
When deploying a pod, set the limit and request using `gocrane.io/<$resourcename>: <$value>`.

## Suitable scenarios
To increase node load, some offline or less important jobs can be scheduled into the cluster using dynamic resources; such jobs use the idle elastic resources.
Combined with the QOS watermark guarantee, these jobs are evicted or throttled first when the node load is high, improving node utilization while keeping high-priority services stable.
See the section "Used with dynamic resources" in qos-interference-detection-and-active-avoidance.md.
@@ -0,0 +1,38 @@
---
title: "QoS: Enhanced bypass cpuset management capability"
description: "Enhanced bypass cpuset management capability"
weight: 23
---

## Enhanced bypass cpuset management capability
Kubelet supports the static CPU manager policy: when a Guaranteed pod runs on a node, kubelet allocates dedicated CPUs for it that cannot be occupied by other processes. This guarantees exclusive CPUs for the Guaranteed pod, but it also lowers CPU and node utilization, causing some waste.
The crane-agent provides a new cpuset management policy that allows a pod to share its CPUs with other pods: the pod that binds cores still benefits from fewer context switches and higher cache affinity, while other workloads can also be deployed on the same CPUs, improving resource utilization.

1. Three cpuset types are provided for pods:

- exclusive: after core binding, other containers can no longer use those CPUs; the pod monopolizes them
- share: other containers can still use those CPUs after core binding
- none: the pod uses CPUs not occupied by containers of exclusive pods, and it can use the bound cores of share-type pods

The share binding policy keeps the advantages of fewer context switches and higher cache affinity, while still allowing other workloads to be deployed on the same CPUs, improving resource utilization.

2. Relaxed restrictions on core binding compared with kubelet

Originally, the CPU limit of every container had to equal its CPU request. Now a container only needs a CPU limit that is greater than or equal to 1 and equal to its CPU request to have cores bound for it (see the sketch after this list).


3. The cpuset policy of a pod can be modified while the pod is running, taking effect immediately

The pod's CPU manager policy can be converted from none to share and from exclusive to share without a restart.
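Below is a minimal sketch of the relaxed eligibility rule from item 2, assuming CPU quantities expressed in millicores; the helper name is hypothetical and only illustrates the stated rule, not the crane-agent source:

```go
// canBindCore reports whether a container qualifies for cpuset binding under the relaxed rule:
// its CPU limit must be at least 1 core and equal to its CPU request.
func canBindCore(cpuRequestMilli, cpuLimitMilli int64) bool {
	return cpuLimitMilli >= 1000 && cpuLimitMilli == cpuRequestMilli
}
```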

How to use:
1. Set the CPU manager policy of kubelet to "none"
2. Set the pod's CPU manager policy through a pod annotation:
`qos.gocrane.io/cpu-manager: none/exclusive/share`
```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    qos.gocrane.io/cpu-manager: none/exclusive/share
```