Add metric for admitted active workloads #291

alculquicondor · 2022-07-07T18:24:59Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add a metric that keeps track of admitted workloads that are still active (running).

This requires keeping track of admitted workloads per queue in the cache, which can also be used to report queue status.
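
As a rough illustration of the bookkeeping described above, the sketch below counts admitted workloads per queue and treats a workload as active while it is neither suspended nor finished. All names here (`queueCache`, `workload`, `AddOrUpdate`, `Delete`, `AdmittedActive`) are hypothetical and simplified. They are not Kueue's actual cache API.

```go
package main

import "fmt"

// workload is a simplified, hypothetical stand-in for an admitted workload.
type workload struct {
	Name      string
	Queue     string
	Suspended bool
	Finished  bool
}

// active reports whether the workload is unsuspended and not finished.
func (w workload) active() bool { return !w.Suspended && !w.Finished }

// queueCache tracks admitted workloads per queue (hypothetical type).
// The outer key is the queue name, the inner key is the workload name.
type queueCache struct {
	admitted map[string]map[string]workload
}

func newQueueCache() *queueCache {
	return &queueCache{admitted: make(map[string]map[string]workload)}
}

// AddOrUpdate records an admitted workload under its queue,
// replacing any earlier version of the same workload.
func (c *queueCache) AddOrUpdate(w workload) {
	if c.admitted[w.Queue] == nil {
		c.admitted[w.Queue] = make(map[string]workload)
	}
	c.admitted[w.Queue][w.Name] = w
}

// Delete forgets a workload and drops the queue entry once it is empty.
func (c *queueCache) Delete(w workload) {
	ws := c.admitted[w.Queue]
	delete(ws, w.Name) // deleting from a nil map is a no-op
	if len(ws) == 0 {
		delete(c.admitted, w.Queue)
	}
}

// AdmittedActive returns how many admitted workloads in the queue are active.
func (c *queueCache) AdmittedActive(queue string) int {
	n := 0
	for _, w := range c.admitted[queue] {
		if w.active() {
			n++
		}
	}
	return n
}

func main() {
	c := newQueueCache()
	c.AddOrUpdate(workload{Name: "a", Queue: "team-x"})
	c.AddOrUpdate(workload{Name: "b", Queue: "team-x", Finished: true})
	fmt.Println(c.AdmittedActive("team-x")) // only "a" is still active
}
```

The same per-queue map could also back a queue status report, since it already holds every admitted workload for the queue.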

Which issue(s) this PR fixes:

Part of #199

Special notes for your reviewer:

The metric will under-report active workloads if the queue is deleted or if the queue switches to a different cluster queue. Both cases should be disallowed with finalizers and webhooks (#171).

k8s-ci-robot · 2022-07-07T18:25:11Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [alculquicondor]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

alculquicondor · 2022-07-07T20:41:57Z

/hold
There is some very weird flakiness where a cluster queue name disappears. I'm investigating.

alculquicondor · 2022-07-07T21:04:45Z

/hold cancel
/test pull-kueue-test-integration-main

ahg-g · 2022-07-07T21:31:20Z

pkg/metrics/metrics.go

+			Subsystem: subsystemName,
+			Name:      "admitted_active_workloads",
+			Help:      "Number of admitted workloads that are active (unsuspended and not finished)",
+		}, []string{"cluster_queue", "queue"},


Since users can create as many queues as they wish, aren't we concerned about this dimension getting too big? Do you remember what the limits were?

There is no limit. The more queues there are, the more memory we use. Let me try to get more information on how bad this could be for a gauge.

Note that when a queue or clusterQueue is removed, I'm deleting the label values.

We chatted offline and concluded that it's best to track active workloads only per clusterQueue, similar to #293.
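
To make the cardinality argument concrete, here is a minimal sketch of a gauge keyed only by cluster queue, with the series dropped when the cluster queue is removed. `gaugeVec` is a hypothetical, stdlib-only stand-in for a labeled gauge, not the Prometheus client type, and `reportAdmittedActive` and `clearClusterQueue` are illustrative names rather than functions from this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// gaugeVec is a minimal stand-in for a gauge with a single label.
// Each distinct label value costs one series in memory.
type gaugeVec struct {
	mu     sync.Mutex
	series map[string]float64
}

func newGaugeVec() *gaugeVec {
	return &gaugeVec{series: make(map[string]float64)}
}

// Set creates or overwrites the series for the given label value.
func (g *gaugeVec) Set(label string, v float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.series[label] = v
}

// Delete drops the series for the given label value, if present.
func (g *gaugeVec) Delete(label string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.series, label)
}

// Len returns the number of live series, i.e. the metric's cardinality.
func (g *gaugeVec) Len() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return len(g.series)
}

// reportAdmittedActive publishes the count for one cluster queue.
// Queues that point at the same cluster queue share a single series.
func reportAdmittedActive(g *gaugeVec, clusterQueue string, n int) {
	g.Set(clusterQueue, float64(n))
}

// clearClusterQueue removes the series when the cluster queue is deleted,
// so stale label values do not accumulate.
func clearClusterQueue(g *gaugeVec, clusterQueue string) {
	g.Delete(clusterQueue)
}

func main() {
	g := newGaugeVec()
	reportAdmittedActive(g, "cq-a", 3)
	reportAdmittedActive(g, "cq-b", 1)
	fmt.Println("series:", g.Len())

	clearClusterQueue(g, "cq-a")
	fmt.Println("series after removal:", g.Len())
}
```

With the Prometheus Go client, the cleanup step would correspond to calling `DeleteLabelValues` on the `GaugeVec`. Dropping the `queue` label bounds the number of series by the number of cluster queues rather than by the number of user-created queues.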

ahg-g · 2022-07-12T16:54:14Z

pkg/workload/workload.go

@@ -34,7 +34,8 @@ type Info struct {
 	Obj *kueue.Workload
 	// list of total resources requested by the podsets.
 	TotalRequests []PodSetResources
-	// Populated from queue.
+	// Populated from the queue during admission or from the admission field if


ahg-g · 2022-07-12T16:54:35Z

pkg/workload/workload.go

 		Obj:           w,
 		TotalRequests: totalRequests(&w.Spec),
 	}
+	if w.Spec.Admission != nil {
+		info.ClusterQueue = string(w.Spec.Admission.ClusterQueue)


was this a bug?

No, we were only using this field during scheduling, where the information comes from the queues, rather than the Workload.

Now I'm populating the field when adding the workload to the cache, for consistency and convenience (and to avoid potential bugs in the future).
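
The behavior described in this reply can be sketched as a constructor that copies the cluster queue from the admission field when it is set. The types below are simplified, hypothetical stand-ins for the Kueue API types in the diff (in particular, `ClusterQueueReference` is assumed here to be a string type, which is what the `string(...)` conversion in the diff suggests), and the resource accounting is omitted.

```go
package main

import "fmt"

// ClusterQueueReference is assumed to be a named string type.
type ClusterQueueReference string

// Admission, WorkloadSpec and Workload are trimmed-down stand-ins
// that keep only the fields this example needs.
type Admission struct {
	ClusterQueue ClusterQueueReference
}

type WorkloadSpec struct {
	Admission *Admission // nil until the workload is admitted
}

type Workload struct {
	Spec WorkloadSpec
}

// Info mirrors the struct in the diff, without TotalRequests.
type Info struct {
	Obj *Workload
	// Set by the scheduler from the queue during admission, or by NewInfo
	// from the admission field if the workload is already admitted.
	ClusterQueue string
}

// NewInfo builds an Info and fills ClusterQueue for admitted workloads,
// so cache code can rely on the field without consulting the queues.
func NewInfo(w *Workload) *Info {
	info := &Info{Obj: w}
	if w.Spec.Admission != nil {
		info.ClusterQueue = string(w.Spec.Admission.ClusterQueue)
	}
	return info
}

func main() {
	admitted := &Workload{Spec: WorkloadSpec{Admission: &Admission{ClusterQueue: "cq-a"}}}
	pending := &Workload{}

	fmt.Printf("admitted: %q\n", NewInfo(admitted).ClusterQueue)
	fmt.Printf("pending:  %q\n", NewInfo(pending).ClusterQueue)
}
```

For a workload that has not been admitted, the field stays empty and is expected to be filled in during scheduling, as described above.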

ahg-g · 2022-07-12T17:44:22Z

/lgtm

k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 7, 2022

k8s-ci-robot requested review from ArangoGutierrez and denkensk July 7, 2022 18:25

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 7, 2022

alculquicondor force-pushed the admitted_metrics branch 3 times, most recently from 50227bb to eb7103d Compare July 7, 2022 19:22

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 7, 2022

alculquicondor force-pushed the admitted_metrics branch from eb7103d to d75b5d7 Compare July 7, 2022 20:56

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 7, 2022

ahg-g reviewed Jul 7, 2022

This was referenced Jul 8, 2022

Reduce cardinality of pending_workloads metric #293

Merged

Track number of active running workloads per queue #295

Merged

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 11, 2022

alculquicondor force-pushed the admitted_metrics branch from d75b5d7 to a258ba7 Compare July 11, 2022 18:31

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 11, 2022

alculquicondor force-pushed the admitted_metrics branch 2 times, most recently from 006151a to 9ef94e9 Compare July 11, 2022 20:05

ahg-g reviewed Jul 12, 2022

Metric for number of admitted active workloads per cluster_queue

0abbc39

Change-Id: I6f1607f6c3560d83e85e19976a986e638ea5a8b0

alculquicondor force-pushed the admitted_metrics branch from 9ef94e9 to 0abbc39 Compare July 12, 2022 17:41

k8s-ci-robot assigned ahg-g Jul 12, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 12, 2022

k8s-ci-robot merged commit ce886ae into kubernetes-sigs:main Jul 12, 2022
