Count inflight workload as pending #1936
Conversation
/hold
/hold cancel
/assign @alculquicondor
@@ -153,6 +157,7 @@ func (c *clusterQueueBase) Delete(w *kueue.Workload) {
	key := workload.Key(w)
🤔 did we forget to lock the mutex? Or is it locked by all callers?
It is locked by the only caller currently, which is DeleteFromLocalQueue.
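For anyone skimming the thread, here is a minimal Go sketch of the locking convention being discussed; the type name, fields, and string keys are hypothetical simplifications, not the actual clusterQueueBase:

```go
package queue

import "sync"

// clusterQueueSketch is a hypothetical stand-in for clusterQueueBase.
type clusterQueueSketch struct {
	rwm   sync.RWMutex
	items map[string]struct{} // keyed by the workload key
}

// DeleteFromLocalQueue takes the cq-scoped lock and then delegates to
// Delete, which is why Delete itself does not lock.
func (c *clusterQueueSketch) DeleteFromLocalQueue(key string) {
	c.rwm.Lock()
	defer c.rwm.Unlock()
	c.Delete(key)
}

// Delete assumes the caller already holds c.rwm.
func (c *clusterQueueSketch) Delete(key string) {
	delete(c.items, key)
}
```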
Can we rename it to `delete` instead?
Actually, it's also called from the queue Manager, via the ClusterQueue interface.
But we already access the manager through a mutex. Now I'm wondering why we added a dedicated mutex to the ClusterQueue.
Oh, it was for the status snapshotter: #1069
That's right, sorry for the confusion; somehow I missed it in VS Code when using the "all references" option.
IIRC we added the mutex to ClusterQueue so that the `Snapshot` operation (needed for visibility) only requires a cq-scoped mutex, and this operation could be expensive.
I suppose there might still be problematic scenarios, for example, the event handlers triggering a Workload deletion while the snapshotter is running.
So we should always be locking the mutex. However, maybe we can get rid of the snapshotter in the near future in favor of the visibility API.
> I suppose there might still be problematic scenarios, for example, the event handlers triggering a Workload deletion while the snapshotter is running.

The main purpose of the cq-scoped mutex is to make sure the heap structure can be iterated and copied so that we don't miss any element due to concurrent modifications.

> So we should always be locking the mutex.

Yes.

> However, maybe we can get rid of the snapshotter in the near future in favor of the visibility API.

I think we can get rid of the snapshotter that persists data in etcd, but the `Snapshot()` function itself is used by the visibility API to get a consistent view of the workloads, so I don't see a way of dropping the cq-scoped mutex, unless:
- we use the manager-scoped mutex, but this would impact performance
- we don't use a mutex for `Snapshot()` at all, but this would impact the consistency of results during updates
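A rough sketch of the copy-under-lock idea described above; the type and the slice-based storage are hypothetical simplifications (the real `Snapshot()` copies a heap of workload entries):

```go
package queue

import "sync"

// snapshotSketch is a hypothetical, simplified cluster queue; the real
// implementation copies a heap of workload entries, not a plain slice.
type snapshotSketch struct {
	rwm     sync.RWMutex
	pending []string // workload keys
}

// Snapshot copies the pending workloads while holding only this queue's
// read lock, so callers can iterate the copy without racing concurrent
// Push/Delete calls and without taking the broader manager lock.
func (q *snapshotSketch) Snapshot() []string {
	q.rwm.RLock()
	defer q.rwm.RUnlock()
	out := make([]string, len(q.pending))
	copy(out, q.pending)
	return out
}
```

Holding only the queue's own lock is what keeps a potentially expensive `Snapshot()` from blocking the whole manager, which is the trade-off behind the two alternatives listed above.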
@@ -235,7 +247,11 @@ func (c *clusterQueueBase) Pending() int {
}

func (c *clusterQueueBase) PendingActive() int {
Is this ever called without locking?
No, it is called from two places:
- `cq.Pending()` - this uses a read lock scoped to the cq
- `manager.reportPendingWorkloads()` - all invocations of this function take the manager lock, which is broader in scope than the cq
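To make the two call paths concrete, here is a hedged Go sketch; all names and fields below are hypothetical simplifications of clusterQueueBase and the queue Manager, not the actual Kueue code:

```go
package queue

import "sync"

// cqSketch is a hypothetical, simplified cluster queue.
type cqSketch struct {
	rwm          sync.RWMutex
	heap         []string // admissible pending workloads
	inadmissible []string // workloads waiting to be requeued
	inflight     *string  // popped for scheduling but not yet admitted
}

// PendingActive never locks on its own; both callers below already hold
// a lock that covers this queue.
func (c *cqSketch) PendingActive() int {
	n := len(c.heap)
	if c.inflight != nil {
		n++ // as in the PR title: count the in-flight workload as pending
	}
	return n
}

// Pending is the first caller: it takes the cq-scoped read lock.
func (c *cqSketch) Pending() int {
	c.rwm.RLock()
	defer c.rwm.RUnlock()
	return c.PendingActive() + len(c.inadmissible)
}

// managerSketch is a hypothetical stand-in for the queue Manager.
type managerSketch struct {
	sync.RWMutex
	clusterQueues map[string]*cqSketch
}

// reportPendingWorkloads is the second caller: every invocation holds
// the manager lock, whose scope is broader than any single cq.
func (m *managerSketch) reportPendingWorkloads(name string) int {
	m.RLock()
	defer m.RUnlock()
	if cq, ok := m.clusterQueues[name]; ok {
		return cq.PendingActive()
	}
	return 0
}
```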
/lgtm
/approve
LGTM label has been added. Git tree hash: 787aa61cb564ef774a812027fb97db41299f7826
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mimowo

The full list of commands accepted by this bot can be found here. The pull request process is described here.
@alculquicondor @mimowo Isn't this a bug?
Yes, it is, let me update the labels: /kind-remove cleanup
Also, should we do a cherry-pick to v0.6?
I would vote in favor of cherry-picking; the risk seems small. On the other hand, it wasn't reported by users, but by me while I was working on the feature (#1873), so it may not be critical. I checked that there is a small conflict with #1815; if we include #1815, then it cherry-picks without conflict.
This doesn't seem urgent enough for a cherry-pick IMO.
Considering the conflicts, I'm OK without a cherry-pick.
* Test StrictFiFO misreporting metrics
* Cleanup and expand the integration test
What type of PR is this?
/kind bug
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #1937
Special notes for your reviewer:
Does this PR introduce a user-facing change?