Add metric for admitted active workloads #291

Merged
k8s-ci-robot merged 1 commit into kubernetes-sigs:main from admitted_metrics on Jul 12, 2022

Conversation

alculquicondor
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add a metric that keeps track of admitted workloads that are still active (running).

This requires keeping track of admitted workloads per queue in the cache, which can also be used to report queue status.
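A minimal sketch of the bookkeeping described above, assuming a simplified in-memory cache. The `workload` and `cache` types and their fields are hypothetical stand-ins for illustration, not the actual Kueue types; the gauge value is derived from the admitted workloads stored per queue.

```go
package main

import "fmt"

// workload is a simplified, hypothetical stand-in for a Kueue Workload.
type workload struct {
	name      string
	queue     string
	admitted  bool // the workload has an admission set
	suspended bool
	finished  bool
}

// active reports whether an admitted workload still counts as running.
func (w workload) active() bool {
	return w.admitted && !w.suspended && !w.finished
}

// cache keeps admitted workloads grouped by queue name.
type cache struct {
	admitted map[string]map[string]workload // queue -> workload name -> workload
}

func newCache() *cache {
	return &cache{admitted: make(map[string]map[string]workload)}
}

// addOrUpdate stores admitted workloads and drops ones that lost admission.
func (c *cache) addOrUpdate(w workload) {
	if !w.admitted {
		c.remove(w)
		return
	}
	if c.admitted[w.queue] == nil {
		c.admitted[w.queue] = make(map[string]workload)
	}
	c.admitted[w.queue][w.name] = w
}

// remove forgets a workload and cleans up empty per-queue maps.
func (c *cache) remove(w workload) {
	delete(c.admitted[w.queue], w.name)
	if len(c.admitted[w.queue]) == 0 {
		delete(c.admitted, w.queue)
	}
}

// activeCount is the value a per-queue gauge would report.
func (c *cache) activeCount(queue string) int {
	n := 0
	for _, w := range c.admitted[queue] {
		if w.active() {
			n++
		}
	}
	return n
}

func main() {
	c := newCache()
	c.addOrUpdate(workload{name: "a", queue: "team-a", admitted: true})
	c.addOrUpdate(workload{name: "b", queue: "team-a", admitted: true, finished: true})
	fmt.Println("active in team-a:", c.activeCount("team-a"))
}
```

The same per-queue map could also back a queue status report, since it already holds the set of admitted workloads for each queue.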

Which issue(s) this PR fixes:

Part of #199

Special notes for your reviewer:

The metric will under-report active workloads if the queue is deleted or if the queue switches to a different cluster queue. Both cases should be disallowed with finalizers and webhooks (#171).

@k8s-ci-robot added labels kind/feature (categorizes issue or PR as related to a new feature) and cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) on Jul 7, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor

@k8s-ci-robot added labels approved (indicates a PR has been approved by an approver from all required OWNERS files) and size/L (denotes a PR that changes 100-499 lines, ignoring generated files) on Jul 7, 2022
@alculquicondor force-pushed the admitted_metrics branch 3 times, most recently from 50227bb to eb7103d, on July 7, 2022 19:22
@alculquicondor
Contributor Author

/hold
There is some very weird flakiness where a cluster queue name disappears. I'm investigating.

@k8s-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Jul 7, 2022
@alculquicondor
Contributor Author

/hold cancel
/test pull-kueue-test-integration-main

@k8s-ci-robot removed the do-not-merge/hold label on Jul 7, 2022
Subsystem: subsystemName,
Name: "admitted_active_workloads",
Help: "Number of admitted workloads that are active (unsuspended and not finished)",
}, []string{"cluster_queue", "queue"},
Contributor

Since users can create as many queues as they wish, aren't we concerned about this dimension getting too big? Do you remember what the limits were?

Contributor Author

There is no limit. The more queues there are, the more memory we use. Let me try to get more information on how bad this could be for a gauge.

Note that when a queue or clusterQueue is removed, I'm deleting the label values.
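To make the cardinality concern concrete, here is a rough sketch using a toy map-backed gauge. It is a stand-in written for illustration, not the Prometheus client type: `gaugeVec`, `set`, `deleteQueue`, and `deleteClusterQueue` are hypothetical names. In the Prometheus Go client, the analogous calls would be `GaugeVec.WithLabelValues(...).Set(v)` and `GaugeVec.DeleteLabelValues(...)`. Each distinct label pair is its own series, so memory grows with the number of queues unless the series are deleted when a queue goes away.

```go
package main

import "fmt"

// labels identifies one time series; each distinct pair costs memory.
type labels struct {
	clusterQueue string
	queue        string
}

// gaugeVec is a toy stand-in for a labeled gauge, not the Prometheus type.
type gaugeVec struct {
	series map[labels]float64
}

func newGaugeVec() *gaugeVec {
	return &gaugeVec{series: make(map[labels]float64)}
}

// set creates the series on first use and records the value.
func (g *gaugeVec) set(cq, q string, v float64) {
	g.series[labels{cq, q}] = v
}

// deleteQueue drops the series for one queue and reports whether it existed.
func (g *gaugeVec) deleteQueue(cq, q string) bool {
	l := labels{cq, q}
	_, ok := g.series[l]
	delete(g.series, l)
	return ok
}

// deleteClusterQueue drops every series that belongs to a cluster queue
// and returns how many were removed.
func (g *gaugeVec) deleteClusterQueue(cq string) int {
	n := 0
	for l := range g.series {
		if l.clusterQueue == cq {
			delete(g.series, l)
			n++
		}
	}
	return n
}

func main() {
	g := newGaugeVec()
	for i := 0; i < 1000; i++ {
		g.set("cq-main", fmt.Sprintf("queue-%d", i), 0)
	}
	fmt.Println("series:", len(g.series))
	g.deleteQueue("cq-main", "queue-0")
	fmt.Println("after deleting one queue:", len(g.series))
	fmt.Println("removed with cluster queue:", g.deleteClusterQueue("cq-main"))
}
```

Dropping the `queue` label entirely bounds the series count by the number of cluster queues rather than by the number of user-created queues.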

Contributor Author

We chatted offline and concluded that it's best to track active workloads only per ClusterQueue, similar to #293.

@k8s-ci-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Jul 11, 2022
@k8s-ci-robot removed the needs-rebase label on Jul 11, 2022
@alculquicondor force-pushed the admitted_metrics branch 2 times, most recently from 006151a to 9ef94e9, on July 11, 2022 20:05
@@ -34,7 +34,8 @@ type Info struct {
 	Obj *kueue.Workload
 	// list of total resources requested by the podsets.
 	TotalRequests []PodSetResources
-	// Populated from queue.
+	// Populated from the queue during admission of from the admission field if
Contributor

typo

Contributor Author

Done

Obj: w,
TotalRequests: totalRequests(&w.Spec),
}
if w.Spec.Admission != nil {
info.ClusterQueue = string(w.Spec.Admission.ClusterQueue)
Contributor

was this a bug?

Contributor Author

No, we were only using this field during scheduling, where the information comes from the queues, rather than the Workload.

Now I'm populating the field when adding the workload to the cache, for consistency and convenience (and to avoid potential bugs in the future).
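For illustration, the pattern in the snippet under review can be sketched with simplified types. `Admission`, `WorkloadSpec`, `Workload`, and `Info` below are trimmed, hypothetical stand-ins for the Kueue types, and `NewInfo` is an assumed constructor name; the point is only that the cluster queue is copied from the admission field when one is present, so an `Info` built for the cache carries it without consulting the queues.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the Kueue API types.
type Admission struct {
	ClusterQueue string
}

type WorkloadSpec struct {
	Admission *Admission // nil until the workload is admitted
}

type Workload struct {
	Name string
	Spec WorkloadSpec
}

// Info mirrors the idea of the workload Info struct shown in the diff.
type Info struct {
	Obj          *Workload
	ClusterQueue string
}

// NewInfo fills ClusterQueue from the admission field when it is set.
// During scheduling the caller would set it from the queue instead.
func NewInfo(w *Workload) *Info {
	info := &Info{Obj: w}
	if w.Spec.Admission != nil {
		info.ClusterQueue = w.Spec.Admission.ClusterQueue
	}
	return info
}

func main() {
	pending := NewInfo(&Workload{Name: "pending"})
	admitted := NewInfo(&Workload{
		Name: "admitted",
		Spec: WorkloadSpec{Admission: &Admission{ClusterQueue: "cq-main"}},
	})
	fmt.Printf("%q %q\n", pending.ClusterQueue, admitted.ClusterQueue)
}
```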

Change-Id: I6f1607f6c3560d83e85e19976a986e638ea5a8b0
@ahg-g
Contributor

ahg-g commented Jul 12, 2022

/lgtm

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) on Jul 12, 2022
@k8s-ci-robot merged commit ce886ae into kubernetes-sigs:main on Jul 12, 2022