Multi-Level Scheduling Proposal #1128

Draft · wants to merge 2 commits into main
Conversation

lukasfrank
Member

@lukasfrank lukasfrank commented Sep 30, 2024

Proposed Changes

  • This proposal discusses possible solutions to improve IronCore scheduling.

@github-actions github-actions bot added documentation Improvements or additions to documentation size/L labels Sep 30, 2024
@lukasfrank lukasfrank changed the title Proposal to improve scheduling Multi-Level Scheduling Proposal Sep 30, 2024
@balpert89
Contributor

+1 for the proposal. This addresses a lot of current pain points in the stack:

  • better utilization of dedicated machine pools
  • a transparent decision-making process for how a Machine resource ends up on a given MachinePool
  • the ability to add custom resources, such as EPC memory, to scheduling decisions

Going into more detail: the currently proposed approach is decentralized, whereas the centralized one (in the alternatives section) mirrors the network stack's solution. A clustered solution has some disadvantages, such as a lack of guaranteed availability. For networking this is acceptable, because if a critical infrastructure component is down, networking is affected anyway. For compute, scheduling should not have such a big impact: the decentralized approach isolates failures at the pool level.
Another aspect of a centralized solution is that you have to deal with a lot of "boilerplate" challenges such as eventual consistency, which leaves room for race conditions. The Reservations solution solves this elegantly: you can introduce a time period during which the scheduler waits for decisions, after which it only considers the pools it finds in the status slice (see the sketch below). My vote therefore goes to the decentralized approach.
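To make that flow concrete, here is a minimal sketch of what such a Reservation resource could look like. All type and field names (Reservation, PoolResult, Timeout, ...) are illustrative assumptions on my side, not part of the proposal:

```go
// Hypothetical sketch only; none of these names are fixed by the proposal.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	corev1alpha1 "github.com/ironcore-dev/ironcore/api/core/v1alpha1"
)

// Reservation asks every pool provider whether it can host the requested
// resources. Providers answer by appending to Status.Pools; the scheduler
// waits up to Spec.Timeout and then only considers the pools it finds in
// the status slice.
type Reservation struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ReservationSpec   `json:"spec,omitempty"`
	Status ReservationStatus `json:"status,omitempty"`
}

type ReservationSpec struct {
	// Resources requested for the Machine, including custom ones
	// such as EPC memory.
	Resources corev1alpha1.ResourceList `json:"resources,omitempty"`
	// Timeout bounds how long the scheduler waits for answers.
	Timeout metav1.Duration `json:"timeout,omitempty"`
}

type ReservationStatus struct {
	// Pools lists the providers that can fulfill the reservation.
	Pools []PoolResult `json:"pools,omitempty"`
}

type PoolResult struct {
	Name string `json:"name"`
	// Rating is a provider-computed hint on how well the reservation
	// fits onto this pool (higher is better).
	Rating int32 `json:"rating"`
}
```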

On the topic of the "scheduling decision": the reservation system is meant to decide who can provide the requested resources. Another controller can then use this to decide which of those pools to actually use for the Machine resource (see the sketch below). This enables behavior similar to Node <-> Pod scheduling in vanilla Kubernetes. Another point to consider: "system" reservations could be introduced to accommodate resources that are exclusively reserved for system applications.
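A sketch of that second step, reusing the illustrative types from above, similar in spirit to the score phase of kube-scheduler; the helper name is my assumption:

```go
// pickPool chooses the best-rated pool from the status slice once the
// wait period has elapsed. Returns false if no pool answered.
func pickPool(res *Reservation) (string, bool) {
	best, bestRating := "", int32(-1)
	for _, p := range res.Status.Pools {
		if p.Rating > bestRating {
			best, bestRating = p.Name, p.Rating
		}
	}
	return best, best != ""
}
```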

Some aspects that are still unclear to me:

  • Who decides the rating on a given status entry? Is this similar to a priority? How does this influence the scheduling decision?
  • How will arbitrary resources be announced, such as the already mentioned EPC memory or, e.g., dedicated graphics cards?

@lukasfrank
Member Author

lukasfrank commented Sep 30, 2024

  • Who decides the rating on a given status entry? Is this similar to a priority? How does this influence the scheduling decision?

Only the pool provider can calculate the rating (since it is the component that checks whether the reservation can be fulfilled); it is a metric for how well the Reservation fits onto the related pool. It should be understood as a hint for the scheduler when taking the decision, as sketched below.
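One way a provider could derive such a rating is from the headroom left after placing the reservation. The formula and names are my assumptions; the proposal only requires some "fit" metric:

```go
package provider

import (
	corev1alpha1 "github.com/ironcore-dev/ironcore/api/core/v1alpha1"
)

// rate returns a 0..100 rating (higher = better fit) and whether the
// reservation can be fulfilled on this pool at all.
func rate(free, requested corev1alpha1.ResourceList) (int32, bool) {
	worst := int32(100)
	for name, req := range requested {
		avail, ok := free[name]
		if !ok || avail.Cmp(req) < 0 {
			return 0, false // unknown or exhausted resource: decline
		}
		if avail.IsZero() {
			continue // req must be zero here as well
		}
		// Percentage of this resource still free after placement.
		left := int32(100 - req.Value()*100/avail.Value())
		if left < worst {
			worst = left
		}
	}
	return worst, true
}
```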

  • How will arbitrary resources be announced, such as the already mentioned EPC memory or, e.g., dedicated graphics cards?

In the distributed approach there is no longer a need to announce resources. The resource "owner" (the pool provider) is in charge of accepting or rejecting the reservation and needs to keep track of all of its resources. If arbitrary resources aren't available on a specific host, the reservation is declined (see the sketch below).
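Provider-side, that could look like the following, reusing the illustrative Reservation types and rate() helper from the sketches above; the PoolProvider struct is an assumption:

```go
// PoolProvider is the single source of truth for its own bookkeeping,
// including arbitrary resources such as EPC memory or GPUs.
type PoolProvider struct {
	name string
	free corev1alpha1.ResourceList
}

// answerReservation either appends this pool to the status slice or
// stays silent, which the scheduler treats as a decline.
func (p *PoolProvider) answerReservation(res *Reservation) {
	rating, ok := rate(p.free, res.Spec.Resources)
	if !ok {
		return // unknown or exhausted resource on this pool
	}
	res.Status.Pools = append(res.Status.Pools, PoolResult{
		Name:   p.name,
		Rating: rating,
	})
}
```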

@balpert89 Does that make sense to you?

@balpert89
Contributor

The rating part is clear to me now, thanks for addressing it.

In the distributed approach there is no longer a need to announce resources. The resource "owner" (the pool provider) is in charge of accepting or rejecting the reservation and needs to keep track of all of its resources. If arbitrary resources aren't available on a specific host, the reservation is declined.

Does that mean we will deprecate the allocatable / available fields (https://github.com/ironcore-dev/ironcore/blob/main/api/compute/v1alpha1/machinepool_types.go#L30-L33), as they are no longer required?

@lukasfrank
Member Author

The rating part is clear to me now, thanks for addressing it.

In the distributed approach there is no longer a need to announce resources. The resource "owner" (the pool provider) is in charge of accepting or rejecting the reservation and needs to keep track of all of its resources. If arbitrary resources aren't available on a specific host, the reservation is declined.

Does that mean we will deprecate the allocatable / available fields (https://github.com/ironcore-dev/ironcore/blob/main/api/compute/v1alpha1/machinepool_types.go#L30-L33), as they are no longer required?

Correct, there would be no need for these fields anymore. If they are currently used to aggregate the resources of the entire infrastructure, we can offer metrics instead and aggregate them the Kubernetes way, e.g. as sketched below.
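As one possible interpretation, assuming Prometheus via client_golang as the metrics backend: each pool provider exports its resources as metrics, and infrastructure-wide totals become a simple aggregation query. Metric and label names are assumptions:

```go
package provider

import (
	"github.com/prometheus/client_golang/prometheus"

	corev1alpha1 "github.com/ironcore-dev/ironcore/api/core/v1alpha1"
)

var poolAllocatable = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "machinepool_allocatable",
		Help: "Allocatable amount per resource on a machine pool.",
	},
	[]string{"pool", "resource"},
)

func init() {
	prometheus.MustRegister(poolAllocatable)
}

// exportAllocatable publishes the pool's current view; a PromQL query
// like `sum by (resource) (machinepool_allocatable)` then aggregates
// across the whole infrastructure.
func exportAllocatable(pool string, resources corev1alpha1.ResourceList) {
	for name, qty := range resources {
		poolAllocatable.WithLabelValues(pool, string(name)).Set(float64(qty.Value()))
	}
}
```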
