This repository has been archived by the owner on Sep 2, 2024. It is now read-only.

Next generation MT Scheduler and Descheduler with pluggable policies #768

Merged
merged 11 commits into main from issue#593 on Sep 8, 2021

Conversation

aavarghese
Contributor

@aavarghese aavarghese commented Jul 15, 2021

Fixes #593
🎁 Implements a next-generation MT scheduler and descheduler for Knative Eventing sources, with plugins that filter pods and score them to pick the best placement pod.

Proposed Changes

  • Controller reads and validates the policy profile (containing predicates and priorities) for both scheduling and descheduling from ConfigMaps.
  • Scheduler imports all the available plugins, which invokes their init() functions and registers all the core and source-specific (Kafka) plugins.
  • Scheduler runs all the predicates and priorities specified in the policy (ConfigMap): predicates eliminate the pods that do not meet them, and priorities then score the feasible pods to find the best placement given the current state of the cluster (see the sketch after this list).
  • The descheduler works the same way, with its own separate predicates and priorities.
  • Controls are also shared with the autoscaler and the state accessor.
  • Autoscaler runs autoscaling synchronously, holding a lock so that autoscaling does not run concurrently with the scheduler, and scales up the number of pods using a scale factor based on the number of resources in the HA domain.
  • State accessor computes the spread of existing and reserved vreplicas across pods, nodes, and zones, and stores a map from vpod key to a map of pod/node/zone name to vreplica count. It also stores the number of pods, nodes, and zones, and a list of "schedulable" pods; unschedulable pods are those marked for eviction during compaction, or pods running on unschedulable nodes (cordoned or unreachable).
  • Pending vreplicas that cannot be scheduled trigger the autoscaler to add new pods. This leads to a rebalancing of all the already placed vreplicas together with the pending ones.
  • When it is time to autoscale down (after the refresh period), the scheduler attempts to compact vreplicas onto lower-ordinal pods if space is available; higher-ordinal pods are evicted and removed from the "schedulable" pods list. This triggers another rebalancing of all the vreplicas.
  • Recovery is handled by ensuring that pods on failed nodes/zones are restarted on new nodes (via StatefulSet adapter settings) and that the state collector is aware of nodes/zones being down.
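
For a concrete picture of the filter-then-score flow described above, here is a minimal sketch in Go. The interface names (FilterPlugin, ScorePlugin), the Policy schema, and the State and schedule helpers are assumptions made for illustration only; they are not the actual eventing-kafka scheduler API introduced by this PR.

```go
// Sketch only: hypothetical plugin interfaces and scheduling loop, loosely
// modeled on the filter-then-score flow described above. Names and the
// policy schema are illustrative assumptions, not the PR's actual API.
package sketch

import "encoding/json"

// FilterPlugin (a "predicate") eliminates pods that cannot host a vreplica.
type FilterPlugin interface {
	Filter(state *State, podName string) bool
}

// ScorePlugin (a "priority") ranks the remaining feasible pods.
type ScorePlugin interface {
	Score(state *State, podName string) int64
}

// State stands in for what the state accessor collects (vreplica spread per
// pod/node/zone, schedulable pods, counts of pods/nodes/zones, ...).
type State struct {
	SchedulablePods []string
}

// Policy mirrors the profile read from a ConfigMap: which predicates and
// priorities to run, and a weight for each priority.
type Policy struct {
	Predicates []struct {
		Name string `json:"name"`
	} `json:"predicates"`
	Priorities []struct {
		Name   string `json:"name"`
		Weight int64  `json:"weight"`
	} `json:"priorities"`
}

// loadPolicy parses a JSON policy stored under a ConfigMap data key.
func loadPolicy(raw string) (*Policy, error) {
	p := &Policy{}
	if err := json.Unmarshal([]byte(raw), p); err != nil {
		return nil, err
	}
	return p, nil
}

// schedule runs every predicate to drop infeasible pods, then picks the pod
// with the highest weighted priority score. Returning ok=false is the case
// where pending vreplicas need the autoscaler to add pods.
func schedule(st *State, predicates []FilterPlugin, priorities map[ScorePlugin]int64) (best string, ok bool) {
	var bestScore int64 = -1
	for _, pod := range st.SchedulablePods {
		feasible := true
		for _, p := range predicates {
			if !p.Filter(st, pod) {
				feasible = false
				break
			}
		}
		if !feasible {
			continue
		}
		var score int64
		for sp, weight := range priorities {
			score += weight * sp.Score(st, pod)
		}
		if score > bestScore {
			best, bestScore, ok = pod, score, true
		}
	}
	return best, ok
}
```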

Release Note

Next generation Multi-Tenant Scheduler and Descheduler: uses a plugin interface to specify a Scheduler profile with predicates and priorities that run filtering and scoring of pods, respectively, to compute the best vreplica placements. When the autoscaler adds new pods, the scheduler rebalances the already placed vreplicas along with the new ones. A Descheduler profile must be installed when vreplicas need to be scaled down and placements removed.
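
As an illustration of the descheduling direction mentioned above, a removal priority in the spirit of the remove_with_highest_ordinal_priority plugin could score higher-ordinal StatefulSet pods higher for removal, so vreplicas drain from the top and the StatefulSet can be compacted and scaled down. This is a hedged sketch; the helper and its name-parsing are assumptions, not the plugin's actual implementation.

```go
// Sketch only: a descheduling priority in the spirit of the
// removewithhighestordinalpriority plugin touched by this PR. The helper
// ordinalFromPodName and the scoring formula are assumptions for illustration.
package sketch

import (
	"strconv"
	"strings"
)

// ordinalFromPodName extracts N from a StatefulSet pod name like "source-adapter-N".
func ordinalFromPodName(podName string) int64 {
	idx := strings.LastIndex(podName, "-")
	if idx < 0 {
		return 0
	}
	n, err := strconv.ParseInt(podName[idx+1:], 10, 64)
	if err != nil {
		return 0
	}
	return n
}

// removalScore gives higher scores to higher-ordinal pods, so the descheduler
// prefers removing vreplicas from them first and lower ordinals stay packed.
func removalScore(podName string) int64 {
	return ordinalFromPodName(podName)
}
```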

@google-cla google-cla bot added the cla: yes Indicates the PR's author has signed the CLA. label Jul 15, 2021
@knative-prow-robot knative-prow-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jul 15, 2021
@aavarghese aavarghese changed the title from "Next generation MT Scheduler with pluggable policies" to "[WIP] Next generation MT Scheduler with pluggable policies" Jul 15, 2021
@knative-prow-robot knative-prow-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 15, 2021
@aavarghese aavarghese force-pushed the issue#593 branch 2 times, most recently from 7bcb222 to 143d240 Compare July 15, 2021 16:24
@lionelvillard
Contributor

/test pull-knative-sandbox-eventing-kafka-unit-tests

@aavarghese aavarghese force-pushed the issue#593 branch 4 times, most recently from 1a409ed to ad47180 Compare July 20, 2021 12:13
@codecov

codecov bot commented Jul 20, 2021

Codecov Report

Merging #768 (4f83ec9) into main (15bb0b1) will increase coverage by 0.11%.
The diff coverage is 79.05%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #768      +/-   ##
==========================================
+ Coverage   74.64%   74.75%   +0.11%     
==========================================
  Files         143      154      +11     
  Lines        6633     7139     +506     
==========================================
+ Hits         4951     5337     +386     
- Misses       1435     1518      +83     
- Partials      247      284      +37     
Impacted Files Coverage Δ
...e/lowestordinalpriority/lowest_ordinal_priority.go 66.66% <66.66%> (ø)
...alpriority/remove_with_highest_ordinal_priority.go 66.66% <66.66%> (ø)
...uler/plugins/core/evenpodspread/even_pod_spread.go 70.90% <70.90%> (ø)
...labilitynodepriority/availability_node_priority.go 72.72% <72.72%> (ø)
...labilityzonepriority/availability_zone_priority.go 74.28% <74.28%> (ø)
pkg/common/scheduler/state/interface.go 75.75% <75.75%> (ø)
pkg/common/scheduler/statefulset/scheduler.go 76.80% <75.93%> (-6.11%) ⬇️
pkg/common/scheduler/state/state.go 81.94% <81.94%> (ø)
.../kafka/nomaxresourcecount/no_max_resource_count.go 82.35% <82.35%> (ø)
...adpriority/remove_with_even_pod_spread_priority.go 83.33% <83.33%> (ø)
... and 19 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 15bb0b1...4f83ec9. Read the comment docs.

@aavarghese aavarghese force-pushed the issue#593 branch 2 times, most recently from 3d0817b to b2bce5c Compare July 28, 2021 22:27
@aavarghese aavarghese force-pushed the issue#593 branch 11 times, most recently from eb05e8e to 2798557 Compare August 10, 2021 17:11
zoneMap := make(map[string]struct{})
for i := 0; i < len(nodes); i++ {
	node := nodes[i]
	if node.Spec.Unschedulable {
Contributor

I think this line requires some explanation. In theory the vpod scheduler shouldn't have to deal with nodes directly. A pod on an unschedulable node might still be viable, right?

Contributor Author

It's a thin line, I feel. A node is marked as Unschedulable in its spec after it has been (optionally) drained and cordoned. The Kubernetes scheduler won't place new pods onto that node, so the idea was that we should not place new vreplicas onto a pod on that node either, because the next step for the node might be rebooting/shutting down/upgrading etc. So this is a preventative step, to avoid adding more rebalancing work later. That was my thinking.
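
For reference, a minimal sketch of the check being discussed, assuming a standard client-go node lister as used in the surrounding code; the stateBuilder struct and the isPodSchedulable method name are placeholders, not the PR's actual code.

```go
// Sketch only: filtering out pods whose node is cordoned/unschedulable, as
// discussed in this thread. Receiver and method names are placeholders.
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	listersv1 "k8s.io/client-go/listers/core/v1"
)

type stateBuilder struct {
	nodeLister listersv1.NodeLister
}

// isPodSchedulable returns false when the pod's node is marked Unschedulable
// in its spec (e.g. after `kubectl cordon`), so no new vreplicas are placed
// on a pod that may soon be drained.
func (s *stateBuilder) isPodSchedulable(pod *corev1.Pod) (bool, error) {
	node, err := s.nodeLister.Get(pod.Spec.NodeName)
	if err != nil {
		return false, err
	}
	return !node.Spec.Unschedulable, nil
}
```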

}

nodeName := pod.Spec.NodeName
node, err := s.nodeLister.Get(nodeName) // Node could be marked as Unschedulable - cannot schedule vreplicas on a pod running on this node
Contributor

Same here. Maybe we just don't care about pod eviction. This has to be dealt with somewhere else anyway.

Contributor Author

Each plugin must be given some info to prevent it from selecting a pod that is in the process of eviction. Not sure yet how to do that without the state collector telling it so... will think about this a bit more.
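
One hypothetical shape for that, purely as a sketch of the open question above (none of these names come from the PR): the state collector could record which pods are marked for eviction, and a filter-style check could reject them.

```go
// Sketch only: one hypothetical way for the state collector to expose
// eviction info so plugins can skip pods being drained. Not the PR's design.
package sketch

// State carries, among other things, the set of pods marked for eviction
// (e.g. during compaction) as collected by the state accessor.
type State struct {
	EvictingPods map[string]bool
}

// noEvictingPods is a predicate-style check a filter plugin could run to
// reject pods that are in the process of being evicted.
func noEvictingPods(st *State, podName string) bool {
	return !st.EvictingPods[podName]
}
```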

Signed-off-by: aavarghese <avarghese@us.ibm.com>
@aavarghese aavarghese force-pushed the issue#593 branch 4 times, most recently from e7fffb7 to 29c0181 Compare September 8, 2021 19:55
Signed-off-by: aavarghese <avarghese@us.ibm.com>
@knative-metrics-robot

The following is the coverage report on the affected files.
Say /test pull-knative-sandbox-eventing-kafka-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/sources/v1beta1/kafka_scheduling.go 80.0% 81.8% 1.8
pkg/common/scheduler/plugins/core/availabilitynodepriority/availability_node_priority.go Do not exist 81.8%
pkg/common/scheduler/plugins/core/availabilityzonepriority/availability_zone_priority.go Do not exist 83.3%
pkg/common/scheduler/plugins/core/evenpodspread/even_pod_spread.go Do not exist 81.8%
pkg/common/scheduler/plugins/core/lowestordinalpriority/lowest_ordinal_priority.go Do not exist 66.7%
pkg/common/scheduler/plugins/core/podfitsresources/pod_fits_resources.go Do not exist 100.0%
pkg/common/scheduler/plugins/core/removewithavailabilitynodepriority/remove_with_availability_node_priority.go Do not exist 87.9%
pkg/common/scheduler/plugins/core/removewithavailabilityzonepriority/remove_with_availability_zone_priority.go Do not exist 88.9%
pkg/common/scheduler/plugins/core/removewithevenpodspreadpriority/remove_with_even_pod_spread_priority.go Do not exist 86.7%
pkg/common/scheduler/plugins/core/removewithhighestordinalpriority/remove_with_highest_ordinal_priority.go Do not exist 66.7%
pkg/common/scheduler/plugins/kafka/nomaxresourcecount/no_max_resource_count.go Do not exist 88.2%
pkg/common/scheduler/state/helpers.go Do not exist 92.6%
pkg/common/scheduler/state/interface.go Do not exist 86.2%
pkg/common/scheduler/state/state.go Do not exist 88.1%
pkg/common/scheduler/statefulset/autoscaler.go 80.3% 85.3% 5.0
pkg/common/scheduler/statefulset/scheduler.go 88.1% 83.4% -4.7

@lionelvillard
Contributor

/lgtm
/approve

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 8, 2021
@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aavarghese, lionelvillard

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 8, 2021
@knative-prow-robot knative-prow-robot merged commit 91e1bec into knative-extensions:main Sep 8, 2021
@aavarghese
Contributor Author

Tracking optimizations remaining to do:

  1. Plugin args should be in the same package, e.g. #768 (comment)
  2. Node and eviction details collected in State - should they be elsewhere?
  3. Generate example checksum #768 (comment)
  4. Convert JSON to YAML #768 (comment)
  5. Modify the scheduler adapter model from StatefulSet to Deployment?
  6. Bulk scheduling of vreplicas (instead of one at a time) to improve performance
  7. Rebalancing with minimal placement changes when the number of pods/nodes/zones changes
  8. Initial number of replicas from the StatefulSet (default: 1) is not a multiple of the scale factor for HA; fix in "[scheduler] Crash when creating state" #882

/cc @lionelvillard
