Add metric for admitted active workloads #291
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: alculquicondor
50227bb to eb7103d
/hold
eb7103d to d75b5d7
/hold cancel
pkg/metrics/metrics.go
Outdated
Subsystem: subsystemName,
Name: "admitted_active_workloads",
Help: "Number of admitted workloads that are active (unsuspended and not finished)",
}, []string{"cluster_queue", "queue"},
Since users can create as many queues as they wish, aren't we concerned about this dimension getting too big? Do you remember what the limits were?
There is no limit. The more queues there are, the more memory we use. Let me try to get more information on how bad this could be for a gauge.
Note that when a queue or clusterQueue is removed, I'm deleting the label values.
We chatted offline and concluded that it's best to track active workloads only per ClusterQueue, similarly to #293.
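To illustrate the memory concern raised in this thread, here is a minimal, stdlib-only sketch of a labeled gauge that holds one series per `cluster_queue` value and drops the series when the ClusterQueue is removed. The `gaugeVec` type and its methods are hypothetical stand-ins, not the code in this PR. With the Prometheus Go client, the equivalent operations would be `GaugeVec.WithLabelValues(...).Set(...)` and `GaugeVec.DeleteLabelValues(...)`.

```go
package main

import "fmt"

// gaugeVec is a hypothetical, stdlib-only stand-in for a labeled gauge.
// It keeps one series per cluster_queue label value, so memory use grows
// with the number of distinct label values.
type gaugeVec struct {
	series map[string]float64
}

func newGaugeVec() *gaugeVec {
	return &gaugeVec{series: make(map[string]float64)}
}

// Set records the gauge value for one ClusterQueue.
func (g *gaugeVec) Set(clusterQueue string, v float64) {
	g.series[clusterQueue] = v
}

// Delete drops the series for a removed ClusterQueue so that it no longer
// occupies memory or shows up in scrapes.
func (g *gaugeVec) Delete(clusterQueue string) {
	delete(g.series, clusterQueue)
}

// Len reports how many series are currently held.
func (g *gaugeVec) Len() int { return len(g.series) }

func main() {
	g := newGaugeVec()
	g.Set("cq-a", 3)
	g.Set("cq-b", 1)
	g.Delete("cq-b") // ClusterQueue removed: drop its label value.
	fmt.Println(g.Len())
}
```

Keying only on `cluster_queue` bounds the number of series by the number of ClusterQueues, which administrators control, rather than by the number of user-created queues.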
d75b5d7 to a258ba7
006151a to 9ef94e9
pkg/workload/workload.go
Outdated
@@ -34,7 +34,8 @@ type Info struct {
Obj *kueue.Workload
// list of total resources requested by the podsets.
TotalRequests []PodSetResources
// Populated from queue.
// Populated from the queue during admission of from the admission field if
typo
Done
Obj: w,
TotalRequests: totalRequests(&w.Spec),
}
if w.Spec.Admission != nil {
info.ClusterQueue = string(w.Spec.Admission.ClusterQueue)
was this a bug?
No, we were only using this field during scheduling, where the information comes from the queues, rather than the Workload.
Now I'm populating the field when adding the workload to the cache, for consistency and convenience (and to avoid potential bugs in the future).
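A rough sketch of the behavior described above, using simplified, hypothetical stand-ins for the Kueue types (`Workload`, `WorkloadSpec`, `Admission`) and leaving out resource accounting. It only shows the idea that the constructor fills `ClusterQueue` from the admission field when the workload is already admitted.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the Kueue API types.
type Admission struct {
	ClusterQueue string
}

type WorkloadSpec struct {
	Admission *Admission // nil until the workload is admitted
}

type Workload struct {
	Name string
	Spec WorkloadSpec
}

// Info mirrors the shape discussed in the review, minus TotalRequests.
type Info struct {
	Obj          *Workload
	ClusterQueue string
}

// NewInfo fills ClusterQueue from the admission field when it is set.
// For pending workloads the field stays empty; in this sketch the
// scheduler would populate it later from the queue.
func NewInfo(w *Workload) *Info {
	info := &Info{Obj: w}
	if w.Spec.Admission != nil {
		info.ClusterQueue = w.Spec.Admission.ClusterQueue
	}
	return info
}

func main() {
	admitted := &Workload{
		Name: "job-a",
		Spec: WorkloadSpec{Admission: &Admission{ClusterQueue: "cq-a"}},
	}
	pending := &Workload{Name: "job-b"}
	fmt.Printf("%q %q\n", NewInfo(admitted).ClusterQueue, NewInfo(pending).ClusterQueue)
}
```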
Change-Id: I6f1607f6c3560d83e85e19976a986e638ea5a8b0
9ef94e9 to 0abbc39
/lgtm
What type of PR is this?
/kind feature
What this PR does / why we need it:
Add a metric that keeps track of admitted workloads that are still active (running).
This requires keeping track of admitted workloads per queue in the cache, which can also be used to report queue status.
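As a sketch of the bookkeeping this implies, assuming "active" means unsuspended and not finished (as in the metric's help text), the count per ClusterQueue could be derived as follows. The `workload` struct and its `suspended` and `finished` fields are hypothetical simplifications, not the types used in the cache.

```go
package main

import "fmt"

// workload is a hypothetical, trimmed-down view of an admitted workload.
type workload struct {
	clusterQueue string
	suspended    bool
	finished     bool
}

// isActive follows the metric's help text: unsuspended and not finished.
func isActive(w workload) bool {
	return !w.suspended && !w.finished
}

// activeByClusterQueue counts active admitted workloads per ClusterQueue.
// ClusterQueues with no active workloads do not appear in the result.
func activeByClusterQueue(admitted []workload) map[string]int {
	counts := make(map[string]int)
	for _, w := range admitted {
		if isActive(w) {
			counts[w.clusterQueue]++
		}
	}
	return counts
}

func main() {
	counts := activeByClusterQueue([]workload{
		{clusterQueue: "cq-a"},
		{clusterQueue: "cq-a", suspended: true},
		{clusterQueue: "cq-b", finished: true},
		{clusterQueue: "cq-a"},
	})
	fmt.Println(counts["cq-a"], counts["cq-b"])
}
```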
Which issue(s) this PR fixes:
Part of #199
Special notes for your reviewer:
The metric will under-report active workloads if the queue is deleted or if the queue switches to a different cluster queue. Both cases should be disallowed with finalizers and webhooks (#171).