Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System sched ignore ineligible updates #6996

Merged
merged 5 commits into from
Feb 3, 2020

Conversation

drewbailey
Copy link
Contributor

If diffAllocs returns update or place diffs for a system job on an ineligible node, ignore it instead of computing placements

@drewbailey drewbailey force-pushed the system-sched-ineligible-updates branch 2 times, most recently from 26b0862 to 3627179 Compare January 28, 2020 19:37
@drewbailey drewbailey marked this pull request as ready for review January 28, 2020 20:07
@@ -276,11 +276,6 @@ func (s *SystemScheduler) computePlacements(place []allocTuple) error {
node, ok := nodeByID[missing.Alloc.NodeID]
if !ok {
s.logger.Debug("could not find node %q", missing.Alloc.NodeID)
if s.failedTGAllocs == nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to call out, this was introduced by #6968 and is not the desired functionality or any functionality that exists in a released version of nomad

scheduler/system_sched_test.go Show resolved Hide resolved
scheduler/system_sched_test.go Outdated Show resolved Hide resolved
scheduler/util.go Outdated Show resolved Hide resolved
@drewbailey drewbailey force-pushed the system-sched-ineligible-updates branch 2 times, most recently from f78a18b to 9d9da18 Compare January 30, 2020 16:43
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work digging in and figuring this out. I think this would be a great opportunity for an e2e test, but that can be done in a followup PR post-merge.

I think the following assertions in an e2e test would not only exercise this code, but would also be useful for future system job related work:

  1. submit system job
  2. assert it's running on node A
  3. mark node A ineligible
  4. submit system job v2
  5. assert v1 is still running on node A
  6. mark node A as eligible
  7. assert v2 does not get placed due to System jobs are not rescheduled when resources become available #4072

Step 5 is probably easiest if Step 4 is a destructive update (change a meta var or something). Then it should be trivial to assert that node A is still running the original alloc (ID is unchanged and meta var is the original value).

Step 7 is interesting because we want Step 6 to actually cause the update to occur. However, I don't expect that to happen for system jobs due to #4072

The good news being that we'll be able to tweak and reuse this e2e test when #4072 is fixed!

test flakey, add temp sleeps for debugging

fix computed class
diffSystemAllocs -> diffSystemAllocsForNode, this function is only used
for diffing system allocations, but lacked awareness of eligible
nodes and the node ID that the allocation was going to be placed.

This change now ignores a change if its existing allocation is on an
ineligible node. For a new allocation, it also checks tainted and
ineligible nodes in the same function instead of nil-ing out the diff
after computation in diffSystemAllocs
@drewbailey drewbailey force-pushed the system-sched-ineligible-updates branch from 884e6e1 to e86988c Compare February 3, 2020 14:04
@drewbailey drewbailey added this to the 0.10.4 milestone Feb 3, 2020
@drewbailey drewbailey merged commit 0eb3586 into master Feb 3, 2020
@drewbailey drewbailey deleted the system-sched-ineligible-updates branch February 3, 2020 18:22
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 18, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants