Commit

Node scope and cluster scope score added, fix docu, filter out completed pods, add explanation of the scores

Signed-off-by: z1ens <xxtale02591@gmail.com>
z1ens committed Aug 15, 2024
1 parent d3d16a7 commit 90ddb29
Showing 4 changed files with 245 additions and 131 deletions.
61 changes: 46 additions & 15 deletions resource-usage-collect-addon/README.md
@@ -1,25 +1,39 @@

# Resource usage collect addon

## Background
With the rapid advancement of artificial intelligence, an increasing number of developers need to schedule and plan AI/ML workloads based on available resources to achieve optimal performance and resource efficiency.


Open-Cluster-Management (OCM) has already implemented `Placement` and supports [extensible placement scheduling](https://github.com/open-cluster-management-io/enhancements/blob/main/enhancements/sig-architecture/32-extensiblescheduling/32-extensiblescheduling.md), which allows for advanced, customizable workload scheduling across clusters. The key components are:

- `Placement`: This feature enables the dynamic selection of a set of `ManagedClusters` within one or more `ManagedClusterSets` to facilitate Multi-Cluster scheduling.
- `AddOnPlacementScore`: An API introduced by `Placement` to support scheduling based on customized scores.

The `resource-usage-addon` is developed with `AddonTemplate` and operates within this framework:
- Once installed on the hub cluster, the addon deploys an agent on each managed cluster.
- Agent pods on the managed clusters collect resource usage data and calculate a corresponding score.
- These scores are then used by `Placement` to inform cluster selection, ensuring workloads are deployed on clusters with the most appropriate available resources.

This repository, developed as part of [Google Summer of Code 2024](https://github.com/open-cluster-management-io/ocm/issues/369), introduces enhancements to the `resource-usage-addon`, including new support for scheduling based on GPU and TPU resource availability.
This update is particularly valuable for developers seeking to optimize AI/ML workloads across multiple clusters.

References:
- [GSoC 2024: Scheduling AI workload among multiple clusters #369](https://github.com/open-cluster-management-io/ocm/issues/369)
- [Extend the multicluster scheduling capabilities with placement](https://open-cluster-management.io/scenarios/extend-multicluster-scheduling-capabilities/)
- [What-is-an-addon](https://open-cluster-management.io/concepts/addon/#what-is-an-add-on)
- [What-is-a-placement](https://open-cluster-management.io/concepts/placement/#select-clusters-in-managedclusterset)
- [Enhancement:addontemplate](https://github.com/open-cluster-management-io/enhancements/tree/main/enhancements/sig-architecture/82-addon-template)

# Quickstart
## Prerequisite
1. Follow the instructions on the [OCM official website](https://open-cluster-management.io/getting-started/quick-start/) to install the `clusteradm` command-line tool and set up a hub (manager) cluster with two managed clusters.
If you prefer a different Kubernetes distribution, follow the instructions in [Set-hub-and-managed-cluster](https://open-cluster-management.io/getting-started/quick-start/#setup-hub-and-managed-cluster).

2. The `kubectl` command-line tool installed.

3. [Docker](https://www.docker.com/) installed.

## Deploy
@@ -54,7 +68,7 @@ make deploy

If deployed successfully:

On the hub cluster, you can see the `AddonTemplate`, and check the `ManagedClusterAddOn` status.
```bash
$ kubectl get addontemplate
NAME ADDON NAME
@@ -66,14 +80,31 @@ cluster1 resource-usage-collect True False
cluster2 resource-usage-collect True False
```

After a short while, on the hub cluster, an `AddOnPlacementScore` for each managed cluster will be generated.
```bash
$ kubectl config use-context kind-hub
$ kubectl get addonplacementscore -A
NAMESPACE NAME AGE
cluster1 resource-usage-score 3m23s
cluster2 resource-usage-score 3m24s
```
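The generated score object can be inspected with `kubectl -n cluster1 get addonplacementscore resource-usage-score -o yaml`. A minimal sketch of what the resource may look like, assuming the score names described below; the score values are illustrative, not real output:

```yaml
apiVersion: cluster.open-cluster-management.io/v1alpha1
kind: AddOnPlacementScore
metadata:
  name: resource-usage-score
  namespace: cluster1          # one score object per managed cluster namespace
status:
  scores:
  - name: cpuNodeAvailable     # most free CPU on any single node
    value: 66
  - name: cpuClusterAvailable  # free CPU aggregated across all nodes
    value: 55
  - name: gpuNodeAvailable
    value: 80
  - name: gpuClusterAvailable
    value: 100
```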
### Resource Scoring Strategies

#### Node Scope Score
- Node Scope Score: indicates the available resources on the node with the most capacity in the cluster, aiding in selecting the best cluster for resource-intensive workloads that must fit on a single node.
- Code Representation: represented as `cpuNodeAvailable`, `gpuNodeAvailable`, etc., indicating available CPU and GPU resources on a single node.

#### Example Use Scenarios:
- Scenario: Suppose you have a cluster with three nodes: Node A with 2 available GPUs, Node B with 4 available GPUs, and Node C with 6 available GPUs. You need to deploy a job that requires 1 GPU.
- Scheduling Strategies: Using the Node Scope Score, specifically `gpuNodeAvailable`, a bin-packing strategy would prefer the node with the lowest score that still fits the job, Node A, keeping Nodes B and C free for future jobs that may require more resources. This minimizes fragmentation and ensures that larger jobs can be accommodated later.
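A `Placement` can consume this score through an `AddOn` score coordinate. The sketch below assumes the `resource-usage-score` resource name and uses a negative weight to illustrate the bin-packing preference (favoring clusters whose best node has fewer free GPUs); the placement name and namespace are hypothetical:

```yaml
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: placement-gpu-binpack
  namespace: default
spec:
  numberOfClusters: 1
  prioritizerPolicy:
    mode: Exact
    configurations:
    - scoreCoordinate:
        type: AddOn
        addOn:
          resourceName: resource-usage-score
          scoreName: gpuNodeAvailable
      weight: -1        # negative weight prefers lower scores (bin-packing)
```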

#### Cluster Scope Score
- Cluster Scope Score: reflects the total available resources across the entire cluster, helping to determine whether the cluster can support additional workloads.
- Code Representation: represented as `cpuClusterAvailable`, `gpuClusterAvailable`, etc., aggregating available resources across all nodes in the cluster.

#### Example Use Scenarios:
- Scenario: Consider a multi-cluster environment where Cluster X has 10 available GPUs across all nodes, Cluster Y has 6, and Cluster Z has 8. You need to deploy two jobs: the first requires 3 GPUs and the second requires 4 GPUs.
- Scheduling Strategies: Using the Cluster Scope Score, specifically `gpuClusterAvailable`, the scheduler would place the first job on Cluster X because it has the most available GPU resources. With Cluster X's score now lower, the scheduler deploys the second job on Cluster Z. This spreads workloads out, maximizing resource utilization across clusters and avoiding overloading a single cluster.
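The spreading strategy above maps onto a `Placement` with a positive weight on `gpuClusterAvailable`. This is a sketch under the same assumptions as before (resource name `resource-usage-score`; placement name, namespace, and weight are illustrative):

```yaml
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: placement-gpu-spread
  namespace: default
spec:
  numberOfClusters: 1
  prioritizerPolicy:
    mode: Exact
    configurations:
    - scoreCoordinate:
        type: AddOn
        addOn:
          resourceName: resource-usage-score
          scoreName: gpuClusterAvailable
      weight: 3         # positive weight prefers clusters with more free GPUs
```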

### Use Placement to select clusters
Consider this example use case: As a developer, I want to select a cluster with the most available GPU resources and deploy a job on it.
@@ -118,4 +149,4 @@ make undeploy

### Troubleshoot
1. If `make deploy` does not work, an auto-generated `kustomization_tmp.yaml.tmp` file may be present; delete it and rerun the command.
Also make sure you are in the hub cluster context, check the `kustomization.yaml` file, and delete the section under `configMapGenerator` (if one exists).
35 changes: 26 additions & 9 deletions resource-usage-collect-addon/pkg/addon/agent/agent.go
@@ -158,26 +158,43 @@ func newAgentController(

func (c *agentController) sync(ctx context.Context, syncCtx factory.SyncContext) error {
	score := NewScore(c.nodeInformer, c.podInformer)
	cpuNodeScore, memNodeScore, gpuNodeScore, tpuNodeScore, err := score.calculateNodeScore()
	if err != nil {
		return err
	}
	cpuClusterScore, memClusterScore, gpuClusterScore, tpuClusterScore, err := score.calculateClusterScopeScore()
	if err != nil {
		return err
	}
	items := []apiv1alpha2.AddOnPlacementScoreItem{
		{
			Name:  "cpuNodeAvailable",
			Value: cpuNodeScore,
		},
		{
			Name:  "cpuClusterAvailable",
			Value: cpuClusterScore,
		},
		{
			Name:  "memNodeAvailable",
			Value: memNodeScore,
		},
		{
			Name:  "memClusterAvailable",
			Value: memClusterScore,
		},
		{
			Name:  "gpuNodeAvailable",
			Value: gpuNodeScore,
		},
		{
			Name:  "gpuClusterAvailable",
			Value: gpuClusterScore,
		},
		{
			Name:  "tpuNodeAvailable",
			Value: tpuNodeScore,
		},
		{
			Name:  "tpuClusterAvailable",
			Value: tpuClusterScore,
		},
	}

