[YUNIKORN-2978] Fix handling of reserved allocations where node differs #996

Closed
wants to merge 4 commits into from

@craigcondit craigcondit commented Nov 15, 2024

What is this PR for?

YUNIKORN-2700 introduced a bug where allocations of previously-reserved tasks were not handled correctly in the case where we schedule on a different node than the reservation. Ensure that we unreserve and allocate using the proper node in both cases.

Also introduce additional logging of allocations on nodes to make finding issues like this easier in the future.

What type of PR is it?

  • - Bug Fix
  • - Improvement
  • - Feature
  • - Documentation
  • - Hot Fix
  • - Refactoring

Todos

  • - Task

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-2978

How should this be tested?

Verified that a 1000-pod job is processed successfully on an autoscaled cluster, where it previously failed.

Screenshots (if appropriate)

Questions:

  • - The license files need an update.
  • - There are breaking changes for older versions.
  • - It needs documentation.

@craigcondit craigcondit self-assigned this Nov 15, 2024

codecov bot commented Nov 15, 2024

Codecov Report

Attention: Patch coverage is 82.00000% with 9 lines in your changes missing coverage. Please review.

Project coverage is 81.34%. Comparing base (ac32595) to head (36bd2da).
Report is 1 commit behind head on master.

Files with missing lines | Patch % | Lines
pkg/scheduler/partition.go | 67.85% | 7 Missing and 2 partials ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master     #996   +/-   ##
=======================================
  Coverage   81.34%   81.34%           
=======================================
  Files          97       97           
  Lines       15590    15620   +30     
=======================================
+ Hits        12681    12706   +25     
- Misses       2630     2634    +4     
- Partials      279      280    +1     

@pbacsko pbacsko left a comment

Is it possible to have at least a single unit test that fails with the old code and passes with this PR?

@wilfred-s

> Is it possible to have at least a single unit test that fails with the old code and passes with this PR?

I think the missing bit is just a single line:

    alloc.SetNodeID(targetNodeID)

We need a unit test, and it should be doable to create one:

  • fill up a node with allocations
  • create a request that does not fit on the used node
  • manually create a reservation for that request on that filled up node
  • run the normal allocation and get it to allocate on the "other" node.
  • the new allocation should show the correct node.

Before the fix the allocation will show the reserved node ID or none at all.
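
The scenario in the steps above can be sketched with a small toy model. This is only an illustration under stated assumptions: `Node`, `Allocation`, `NewNode`, and `allocateReserved` are hypothetical names invented for this sketch and are not YuniKorn's types or functions. The actual fix lives in pkg/scheduler/partition.go. The sketch only shows the intended behaviour: the reservation is released from the node that holds it, and the allocation records the node that was actually used.

```go
package main

// Toy model only: these types are hypothetical and are NOT YuniKorn's API.

// Node tracks free capacity and which allocation keys hold a reservation on it.
type Node struct {
	ID       string
	Free     int
	Reserved map[string]bool
}

// Allocation is a request that records the node it finally lands on.
type Allocation struct {
	Key    string
	Size   int
	NodeID string
}

// NewNode returns a node with the given free capacity and no reservations.
func NewNode(id string, free int) *Node {
	return &Node{ID: id, Free: free, Reserved: map[string]bool{}}
}

// allocateReserved tries the reserved node first and then the other nodes.
// Whichever node is chosen, the reservation is removed from the node that
// holds it, and the allocation records the ID of the node actually used.
func allocateReserved(alloc *Allocation, reserved *Node, nodes []*Node) bool {
	candidates := append([]*Node{reserved}, nodes...)
	for _, target := range candidates {
		if target.Free < alloc.Size {
			continue // request does not fit on this node
		}
		delete(reserved.Reserved, alloc.Key) // unreserve on the reserving node
		target.Free -= alloc.Size
		alloc.NodeID = target.ID // the step the single-line fix addresses
		return true
	}
	return false
}
```

A test following the listed steps would create a full node and a second node with spare capacity, add a reservation for the request on the full node, call `allocateReserved`, and then check that `alloc.NodeID` names the second node and that the full node no longer holds the reservation.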

@craigcondit

Addressed review comments. Reservation test updated to verify node assignment -- verified this test fails prior to this PR but passes now.

@pbacsko pbacsko left a comment

LGTM

@craigcondit craigcondit deleted the YUNIKORN-2978 branch November 19, 2024 22:20
craigcondit added a commit that referenced this pull request Nov 19, 2024
…rs (#996)

Closes: #996