Commit

Node scope and cluster scope score added, fix docu, filter out completed pods, add explanation of the scores

Signed-off-by: z1ens <xxtale02591@gmail.com>
z1ens committed Aug 15, 2024
1 parent d3d16a7 commit 90ddb29
Showing 4 changed files with 245 additions and 131 deletions.
61 changes: 46 additions & 15 deletions resource-usage-collect-addon/README.md
@@ -1,25 +1,39 @@

# Resource usage collect addon

## Background
With the rapid advancement of artificial intelligence, an increasing number of developers need to schedule and plan AI/ML workloads based on available resources to achieve optimal performance and resource efficiency.


Open-Cluster-Management (OCM) has already implemented `Placement` and supports [extensible placement scheduling](https://github.com/open-cluster-management-io/enhancements/blob/main/enhancements/sig-architecture/32-extensiblescheduling/32-extensiblescheduling.md), which allows for advanced, customizable workload scheduling across clusters. The key components are:

- `Placement`: This feature enables the dynamic selection of a set of `ManagedClusters` within one or more `ManagedClusterSets` to facilitate Multi-Cluster scheduling.
- `AddOnPlacementScore`: An API introduced by `Placement` to support scheduling based on customized scores.

The `resource-usage-addon` is developed with `AddonTemplate` and operates within this framework:
- Once installed on the hub cluster, the addon deploys an agent on each managed cluster.
- Agent pods on the managed clusters collect resource usage data and calculate a corresponding score.
- These scores are then used by `Placement` to inform cluster selection, ensuring workloads are deployed on clusters with the most appropriate available resources.

This repository, developed as part of [Google Summer of Code 2024](https://github.com/open-cluster-management-io/ocm/issues/369), introduces enhancements to the `resource-usage-addon`, including new support for scheduling based on GPU and TPU resource availability.
This update is particularly valuable for developers seeking to optimize AI/ML workloads across multiple clusters.

References:
- [GSoC 2024: Scheduling AI workload among multiple clusters #369](https://github.com/open-cluster-management-io/ocm/issues/369)
- [Extend the multicluster scheduling capabilities with placement](https://open-cluster-management.io/scenarios/extend-multicluster-scheduling-capabilities/)
- [What-is-an-addon](https://open-cluster-management.io/concepts/addon/#what-is-an-add-on)
- [What-is-a-placement](https://open-cluster-management.io/concepts/placement/#select-clusters-in-managedclusterset)
- [Enhancement:addontemplate](https://github.com/open-cluster-management-io/enhancements/tree/main/enhancements/sig-architecture/82-addon-template)

# Quickstart
## Prerequisite
1. Follow the instructions on the [OCM official website](https://open-cluster-management.io/getting-started/quick-start/) to install the `clusteradm` command-line tool and set up a hub (manager) cluster with two managed clusters.
If you prefer a different Kubernetes distribution, follow the instructions in [Set-hub-and-managed-cluster](https://open-cluster-management.io/getting-started/quick-start/#setup-hub-and-managed-cluster).

2. The `kubectl` command-line tool installed.

3. [Docker](https://www.docker.com/) installed.

## Deploy
@@ -54,7 +68,7 @@ make deploy

If deployed successfully:

On the hub cluster, you can see the `AddonTemplate`, and check the `ManagedClusterAddOn` status.
```bash
$ kubectl get addontemplate
NAME ADDON NAME
@@ -66,14 +80,31 @@ cluster1 resource-usage-collect True False
cluster2 resource-usage-collect True False
```

After a short while, on the hub cluster, an `AddOnPlacementScore` for each managed cluster will be generated.
```bash
$ kubectl config use-context kind-hub
$ kubectl get addonplacementscore -A
NAMESPACE NAME AGE
cluster1 resource-usage-score 3m23s
cluster2 resource-usage-score 3m24s
```
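The generated score object can be inspected with `kubectl -n cluster1 get addonplacementscore resource-usage-score -o yaml`. A minimal sketch of what the resource may look like, assuming the score names described below; the score values are illustrative, not real output:

```yaml
apiVersion: cluster.open-cluster-management.io/v1alpha1
kind: AddOnPlacementScore
metadata:
  name: resource-usage-score
  namespace: cluster1          # one score object per managed cluster namespace
status:
  scores:
  - name: cpuNodeAvailable     # most free CPU on any single node
    value: 66
  - name: cpuClusterAvailable  # free CPU aggregated across all nodes
    value: 55
  - name: gpuNodeAvailable
    value: 80
  - name: gpuClusterAvailable
    value: 100
```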
### Resource Scoring Strategies

#### Node Scope Score
- Node Scope Score: indicates the available resources on the node with the most capacity in the cluster, aiding in selecting the best cluster for resource-intensive workloads that must fit on a single node.
- Code Representation: represented as `cpuNodeAvailable`, `gpuNodeAvailable`, etc., indicating available CPU and GPU resources on a single node.

#### Example Use Scenarios:
- Scenario: Suppose you have a cluster with three nodes: Node A with 2 available GPUs, Node B with 4 available GPUs, and Node C with 6 available GPUs. You need to deploy a job that requires 1 GPU.
- Scheduling Strategies: Using the Node Scope Score, specifically `gpuNodeAvailable`, a bin-packing strategy would prefer the node with the lowest score that still fits the job, Node A, keeping Nodes B and C free for future jobs that may require more resources. This minimizes fragmentation and ensures that larger jobs can be accommodated later.
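A `Placement` can consume this score through an `AddOn` score coordinate. The sketch below assumes the `resource-usage-score` resource name and uses a negative weight to illustrate the bin-packing preference (favoring clusters whose best node has fewer free GPUs); the placement name and namespace are hypothetical:

```yaml
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: placement-gpu-binpack
  namespace: default
spec:
  numberOfClusters: 1
  prioritizerPolicy:
    mode: Exact
    configurations:
    - scoreCoordinate:
        type: AddOn
        addOn:
          resourceName: resource-usage-score
          scoreName: gpuNodeAvailable
      weight: -1        # negative weight prefers lower scores (bin-packing)
```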

#### Cluster Scope Score
- Cluster Scope Score: reflects the total available resources across the entire cluster, helping to determine whether the cluster can support additional workloads.
- Code Representation: represented as `cpuClusterAvailable`, `gpuClusterAvailable`, etc., aggregating available resources across all nodes in the cluster.

#### Example Use Scenarios:
- Scenario: Consider a multi-cluster environment where Cluster X has 10 available GPUs across all nodes, Cluster Y has 6, and Cluster Z has 8. You need to deploy two jobs: the first requires 3 GPUs and the second requires 4 GPUs.
- Scheduling Strategies: Using the Cluster Scope Score, specifically `gpuClusterAvailable`, the scheduler would place the first job on Cluster X because it has the most available GPU resources. With Cluster X's score now lower, the scheduler deploys the second job on Cluster Z. This spreads workloads out, maximizing resource utilization across clusters and avoiding overloading a single cluster.
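The spreading strategy above maps onto a `Placement` with a positive weight on `gpuClusterAvailable`. This is a sketch under the same assumptions as before (resource name `resource-usage-score`; placement name, namespace, and weight are illustrative):

```yaml
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: placement-gpu-spread
  namespace: default
spec:
  numberOfClusters: 1
  prioritizerPolicy:
    mode: Exact
    configurations:
    - scoreCoordinate:
        type: AddOn
        addOn:
          resourceName: resource-usage-score
          scoreName: gpuClusterAvailable
      weight: 3         # positive weight prefers clusters with more free GPUs
```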

### Use Placement to select clusters
Consider this example use case: As a developer, I want to select a cluster with the most available GPU resources and deploy a job on it.
@@ -118,4 +149,4 @@ make undeploy

### Troubleshoot
1. If `make deploy` does not work, an auto-generated `kustomization_tmp.yaml.tmp` file may be present; delete it and rerun the command.
Also make sure you are in the hub cluster context, check the `kustomization.yaml` file, and delete the section under `configMapGenerator` (if one exists).
35 changes: 26 additions & 9 deletions resource-usage-collect-addon/pkg/addon/agent/agent.go
@@ -158,26 +158,43 @@ func newAgentController(

func (c *agentController) sync(ctx context.Context, syncCtx factory.SyncContext) error {
	score := NewScore(c.nodeInformer, c.podInformer)
	cpuNodeScore, memNodeScore, gpuNodeScore, tpuNodeScore, err := score.calculateNodeScore()
	if err != nil {
		return err
	}
	cpuClusterScore, memClusterScore, gpuClusterScore, tpuClusterScore, err := score.calculateClusterScopeScore()
	if err != nil {
		return err
	}
	items := []apiv1alpha2.AddOnPlacementScoreItem{
		{
			Name:  "cpuNodeAvailable",
			Value: cpuNodeScore,
		},
		{
			Name:  "cpuClusterAvailable",
			Value: cpuClusterScore,
		},
		{
			Name:  "memNodeAvailable",
			Value: memNodeScore,
		},
		{
			Name:  "memClusterAvailable",
			Value: memClusterScore,
		},
		{
			Name:  "gpuNodeAvailable",
			Value: gpuNodeScore,
		},
		{
			Name:  "gpuClusterAvailable",
			Value: gpuClusterScore,
		},
		{
			Name:  "tpuNodeAvailable",
			Value: tpuNodeScore,
		},
		{
			Name:  "tpuClusterAvailable",
			Value: tpuClusterScore,
		},
	}

