-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
System sched ignore ineligible updates #6996
Conversation
26b0862
to
3627179
Compare
@@ -276,11 +276,6 @@ func (s *SystemScheduler) computePlacements(place []allocTuple) error { | |||
node, ok := nodeByID[missing.Alloc.NodeID] | |||
if !ok { | |||
s.logger.Debug("could not find node %q", missing.Alloc.NodeID) | |||
if s.failedTGAllocs == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just to call out, this was introduced by #6968 and is not the desired functionality or any functionality that exists in a released version of nomad
f78a18b
to
9d9da18
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work digging in and figuring this out. I think this would be a great opportunity for an e2e test, but that can be done in a followup PR post-merge.
I think the following assertions in an e2e test would not only exercise this code, but would also be useful for future system job related work:
- submit system job
- assert it's running on node A
- mark node A ineligible
- submit system job v2
- assert v1 is still running on node A
- mark node A as eligible
- assert v2 does not get placed due to System jobs are not rescheduled when resources become available #4072
Step 5 is probably easiest if Step 4 is a destructive update (change a meta
var or something). Then it should be trivial to assert that node A is still running the original alloc (ID is unchanged and meta
var is the original value).
Step 7 is interesting because we want Step 6 to actually cause the update to occur. However, I don't expect that to happen for system jobs due to #4072
The good news being that we'll be able to tweak and reuse this e2e test when #4072 is fixed!
test flakey, add temp sleeps for debugging fix computed class
diffSystemAllocs -> diffSystemAllocsForNode, this function is only used for diffing system allocations, but lacked awareness of eligible nodes and the node ID that the allocation was going to be placed. This change now ignores a change if its existing allocation is on an ineligible node. For a new allocation, it also checks tainted and ineligible nodes in the same function instead of nil-ing out the diff after computation in diffSystemAllocs
884e6e1
to
e86988c
Compare
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
If
diffAllocs
returns update or place diffs for a system job on an ineligible node, ignore it instead of computing placements