Avoid queueing workloads that don't match CQ namespaceSelector #322
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ahg-g. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from 071a456 to 057724b.
/assign @alculquicondor
I'm wondering whether it's worth simply not adding a Workload to the queue system at all if the namespace doesn't match.
Then, when there is a namespace update, we could use the go client to list all the workloads in the informer's cache.
These workloads then wouldn't show up in the "pending" metric. Although that might be undesired?
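A minimal sketch of that first part, under stated assumptions: the `namespaceMatches` helper below is hypothetical and not kueue code; it fetches the Namespace through the cached client and evaluates a ClusterQueue namespaceSelector against its labels, so a workload from a non-matching namespace could be rejected before it ever reaches the heap.

```go
package queue

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// namespaceMatches is a hypothetical helper (not kueue code): it reads the
// Namespace object from the cached client and checks whether its labels
// satisfy the ClusterQueue's namespaceSelector.
func namespaceMatches(ctx context.Context, c client.Client, nsSelector *metav1.LabelSelector, nsName string) (bool, error) {
	sel, err := metav1.LabelSelectorAsSelector(nsSelector)
	if err != nil {
		return false, err
	}
	var ns corev1.Namespace
	if err := c.Get(ctx, client.ObjectKey{Name: nsName}, &ns); err != nil {
		return false, err
	}
	// A workload from a non-matching namespace would simply not be pushed,
	// and therefore would never be counted in the "pending" metric.
	return sel.Matches(labels.Set(ns.Labels)), nil
}
```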
pkg/queue/cluster_queue_impl.go (outdated diff)
return c.pushIfNotPresent(wInfo)
// QueueInadmissibleWorkloads moves all workloads from inadmissibleWorkloads to heap.
// If at least one workload is moved, returns true. Otherwise returns false.
func (c *ClusterQueueImpl) QueueInadmissibleWorkloads(client client.Client) bool {
Pass a context.
Although (this could be in a follow-up), we already know which namespace was updated in cqNamespaceHandler, so we could pass its name to this function.
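For illustration only, the shape this suggestion points at might look like the fragment below. It assumes the surrounding file's imports plus the `inadmissibleWorkloads` map and `pushIfNotPresent` helper visible in the diff, and it is not the code that was actually merged.

```go
// Illustrative sketch: take a context and the namespace whose labels changed,
// and only requeue parked workloads from that namespace.
func (c *ClusterQueueImpl) QueueInadmissibleWorkloads(ctx context.Context, cl client.Client, namespace string) bool {
	moved := false
	for key, wInfo := range c.inadmissibleWorkloads {
		// metav1.NamespaceAll (the empty string) means "don't filter".
		if namespace != metav1.NamespaceAll && wInfo.Obj.Namespace != namespace {
			continue
		}
		delete(c.inadmissibleWorkloads, key)
		c.pushIfNotPresent(wInfo)
		// Simplified: treat every attempted push as movement.
		moved = true
	}
	return moved
}
```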
not the only place where this is called though.
You can pass metav1.NamespaceAll when the namespace is not important (it's the empty string).
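A small, self-contained sketch of the convention being referenced; the `listInNamespace` helper is hypothetical. `metav1.NamespaceAll` is just the empty string, and passing it to `client.InNamespace` yields a cluster-wide list.

```go
package queue

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// listInNamespace is a hypothetical helper: callers that don't care about the
// namespace pass metav1.NamespaceAll (""), which makes the List call cluster-wide.
func listInNamespace(ctx context.Context, c client.Client, list client.ObjectList, namespace string) error {
	return c.List(ctx, list, client.InNamespace(namespace))
}
```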
ok, I will make the change in this PR
Started to do it, but halfway through I felt the change isn't really worth it, at least not now.
Unless we use an options pattern with a default of metav1.NamespaceAll, I am afraid we may make a mistake somewhere and call this function incorrectly while restricting it to a namespace. Also, the only place where it currently makes sense to use it is when a namespace changes its labels, which is rather infrequent.
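A sketch of the options pattern mentioned here, with all names hypothetical: the default is `metav1.NamespaceAll`, so a caller can only narrow the scope by opting in explicitly.

```go
package queue

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// requeueOptions and its helpers are hypothetical, sketching the pattern only.
type requeueOptions struct {
	namespace string
}

type RequeueOption func(*requeueOptions)

// WithNamespace restricts requeueing to a single namespace.
func WithNamespace(ns string) RequeueOption {
	return func(o *requeueOptions) { o.namespace = ns }
}

// resolveRequeueOptions applies the options over the NamespaceAll default.
func resolveRequeueOptions(opts ...RequeueOption) requeueOptions {
	o := requeueOptions{namespace: metav1.NamespaceAll}
	for _, opt := range opts {
		opt(&o)
	}
	return o
}
```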
I just don't like that we call into the client so much. Although it's cached. But sure, it can be a follow-up.
Yes, but this optimization wouldn't save us much because in the vast majority of the cases we are calling the function with NamespaceAll.
/hold
I thought about this approach, but my conclusion was that it might be better to unify how we deal with inadmissible workloads. While the PR is large, most of it is a refactor that is agnostic to this specific issue, and I think it can help us long term when deciding to be selective about re-queueing. We can still do the optimization of not adding the workload on add, but I think we still need to track those workloads and report them via a metric somewhere. Maybe we can change inadmissibleWorkloads into a map to break them down by requeue reason, and use that in the pending metric as well.
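A rough sketch of that last idea, with hypothetical names and assuming the `workload.Info` type already used in this package: keep the parked workloads keyed by the reason they were requeued, so the pending metric could be broken down per reason as well as reported in total.

```go
package queue

import (
	"sigs.k8s.io/kueue/pkg/workload"
)

// RequeueReason and the reason constants below are hypothetical placeholders.
type RequeueReason string

const (
	ReasonNamespaceMismatch RequeueReason = "NamespaceMismatch"
	ReasonCouldNotFit       RequeueReason = "CouldNotFit"
)

// inadmissibleByReason: reason -> workload key -> workload info.
type inadmissibleByReason map[RequeueReason]map[string]*workload.Info

// pendingByReason returns per-reason counts that a pending metric could report
// alongside the total.
func pendingByReason(iw inadmissibleByReason) map[RequeueReason]int {
	counts := make(map[RequeueReason]int, len(iw))
	for reason, ws := range iw {
		counts[reason] = len(ws)
	}
	return counts
}
```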
All comments should be addressed now.
/lgtm
with a nit
thanks, fixed and squashed.
/lgtm
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR updates the scheduler to avoid re-queueing workloads that don't match the CQ namespaceSelector.
The initial idea was to not add those workloads when first observed, but this is not enough to address the issue, since a namespace label or the CQ namespaceSelector could change after the workload was initially accepted into the queue. Those changes could make the workload inadmissible because it no longer matches the namespaceSelector, so we still need a code path that handles this case during re-queueing.
The consequence is that such workloads will get evaluated, but at most once. To optimize away this wasted cycle, we will need to avoid adding the workload from the beginning, but this can be done as a follow-up because this PR is already long (I already made the changes to the workload controller to update the workload status for this case on another branch).
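As a sketch of the re-queue path described above (the helper is hypothetical, not the PR's actual code): list the namespaces that currently match the ClusterQueue's selector, so parked workloads from namespaces that no longer match can be skipped when requeueing.

```go
package queue

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// matchingNamespaces is a hypothetical helper: it returns the set of namespaces
// whose current labels satisfy the ClusterQueue's namespaceSelector, so the
// re-queue path can skip workloads whose namespace no longer matches.
func matchingNamespaces(ctx context.Context, c client.Client, nsSelector *metav1.LabelSelector) (map[string]bool, error) {
	sel, err := metav1.LabelSelectorAsSelector(nsSelector)
	if err != nil {
		return nil, err
	}
	var nsList corev1.NamespaceList
	if err := c.List(ctx, &nsList, client.MatchingLabelsSelector{Selector: sel}); err != nil {
		return nil, err
	}
	matching := make(map[string]bool, len(nsList.Items))
	for i := range nsList.Items {
		matching[nsList.Items[i].Name] = true
	}
	return matching, nil
}
```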
Which issue(s) this PR fixes:
Fixes #301
Special notes for your reviewer: