This repository has been archived by the owner on Sep 2, 2024. It is now read-only.

Next generation MT Scheduler and Descheduler with pluggable policies #768

Merged
merged 11 commits into main from issue#593 on Sep 8, 2021

Conversation

aavarghese
Contributor

@aavarghese aavarghese commented Jul 15, 2021

Fixes #593
🎁 Implements a next-generation MT scheduler and descheduler for Knative Eventing sources, with plugins that filter pods and score them to pick the best placement pod.

Proposed Changes

  • Controller reads and validates the policy profile (containing predicates and priorities) for both scheduling and descheduling from ConfigMaps.
  • Scheduler imports all the available plugins, which invokes their init() functions and registers all the core and source-specific (Kafka) plugins.
  • Scheduler runs all the predicates and priorities specified in the policy (ConfigMap): predicates eliminate the pods that do not meet them, and priorities then score the feasible pods to find the best placement given the current state of the cluster (see the sketch after this list).
  • The descheduler works the same way, with its own separate predicates and priorities.
  • Controls are also shared with the autoscaler and the state accessor.
  • Autoscaler runs autoscaling synchronously, holding a lock so that autoscaling does not run concurrently with the scheduler, and scales up the number of pods using a scale factor based on the number of resources in the HA domain.
  • State accessor computes the spread of existing and reserved vreplicas across pods, nodes, and zones, and stores a map from vpod key to a map of pod/node/zone name to vreplica count. It also stores the number of pods, nodes, and zones, and a list of "schedulable" pods; unschedulable pods are those marked for eviction during compaction, or pods running on unschedulable nodes (cordoned or unreachable).
  • Pending vreplicas that cannot be scheduled trigger the autoscaler to add new pods. This leads to a rebalancing of all the already placed vreplicas together with the pending ones.
  • When it is time to autoscale down (after the refresh period), the scheduler attempts to compact vreplicas onto lower-ordinal pods if space is available; higher-ordinal pods are evicted and removed from the "schedulable" pods list. This triggers another rebalancing of all the vreplicas.
  • Recovery is handled by ensuring that pods on failed nodes/zones are restarted on new nodes (via StatefulSet adapter settings) and that the state collector is aware of nodes/zones being down.
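
For a concrete picture of the filter-then-score flow described above, here is a minimal sketch in Go. The interface names (FilterPlugin, ScorePlugin), the Policy schema, and the State and schedule helpers are assumptions made for illustration only; they are not the actual eventing-kafka scheduler API introduced by this PR.

```go
// Sketch only: hypothetical plugin interfaces and scheduling loop, loosely
// modeled on the filter-then-score flow described above. Names and the
// policy schema are illustrative assumptions, not the PR's actual API.
package sketch

import "encoding/json"

// FilterPlugin (a "predicate") eliminates pods that cannot host a vreplica.
type FilterPlugin interface {
	Filter(state *State, podName string) bool
}

// ScorePlugin (a "priority") ranks the remaining feasible pods.
type ScorePlugin interface {
	Score(state *State, podName string) int64
}

// State stands in for what the state accessor collects (vreplica spread per
// pod/node/zone, schedulable pods, counts of pods/nodes/zones, ...).
type State struct {
	SchedulablePods []string
}

// Policy mirrors the profile read from a ConfigMap: which predicates and
// priorities to run, and a weight for each priority.
type Policy struct {
	Predicates []struct {
		Name string `json:"name"`
	} `json:"predicates"`
	Priorities []struct {
		Name   string `json:"name"`
		Weight int64  `json:"weight"`
	} `json:"priorities"`
}

// loadPolicy parses a JSON policy stored under a ConfigMap data key.
func loadPolicy(raw string) (*Policy, error) {
	p := &Policy{}
	if err := json.Unmarshal([]byte(raw), p); err != nil {
		return nil, err
	}
	return p, nil
}

// schedule runs every predicate to drop infeasible pods, then picks the pod
// with the highest weighted priority score. Returning ok=false is the case
// where pending vreplicas need the autoscaler to add pods.
func schedule(st *State, predicates []FilterPlugin, priorities map[ScorePlugin]int64) (best string, ok bool) {
	var bestScore int64 = -1
	for _, pod := range st.SchedulablePods {
		feasible := true
		for _, p := range predicates {
			if !p.Filter(st, pod) {
				feasible = false
				break
			}
		}
		if !feasible {
			continue
		}
		var score int64
		for sp, weight := range priorities {
			score += weight * sp.Score(st, pod)
		}
		if score > bestScore {
			best, bestScore, ok = pod, score, true
		}
	}
	return best, ok
}
```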

Release Note

Next generation Multi-Tenant Scheduler and Descheduler: uses a plugin interface to specify a Scheduler profile with predicates and priorities that run filtering and scoring of pods, respectively, to compute the best vreplica placements. When the autoscaler adds new pods, the scheduler rebalances the already placed vreplicas along with the new ones. A Descheduler profile must be installed when vreplicas need to be scaled down and placements removed.
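
As an illustration of the descheduling direction mentioned above, a removal priority in the spirit of the remove_with_highest_ordinal_priority plugin could score higher-ordinal StatefulSet pods higher for removal, so vreplicas drain from the top and the StatefulSet can be compacted and scaled down. This is a hedged sketch; the helper and its name-parsing are assumptions, not the plugin's actual implementation.

```go
// Sketch only: a descheduling priority in the spirit of the
// removewithhighestordinalpriority plugin touched by this PR. The helper
// ordinalFromPodName and the scoring formula are assumptions for illustration.
package sketch

import (
	"strconv"
	"strings"
)

// ordinalFromPodName extracts N from a StatefulSet pod name like "source-adapter-N".
func ordinalFromPodName(podName string) int64 {
	idx := strings.LastIndex(podName, "-")
	if idx < 0 {
		return 0
	}
	n, err := strconv.ParseInt(podName[idx+1:], 10, 64)
	if err != nil {
		return 0
	}
	return n
}

// removalScore gives higher scores to higher-ordinal pods, so the descheduler
// prefers removing vreplicas from them first and lower ordinals stay packed.
func removalScore(podName string) int64 {
	return ordinalFromPodName(podName)
}
```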

@google-cla google-cla bot added the cla: yes Indicates the PR's author has signed the CLA. label Jul 15, 2021
@knative-prow-robot knative-prow-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jul 15, 2021
@aavarghese aavarghese changed the title from "Next generation MT Scheduler with pluggable policies" to "[WIP] Next generation MT Scheduler with pluggable policies" Jul 15, 2021
@knative-prow-robot knative-prow-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 15, 2021
@aavarghese aavarghese force-pushed the issue#593 branch 2 times, most recently from 7bcb222 to 143d240 Compare July 15, 2021 16:24
@lionelvillard
Contributor

/test pull-knative-sandbox-eventing-kafka-unit-tests

@aavarghese aavarghese force-pushed the issue#593 branch 4 times, most recently from 1a409ed to ad47180 Compare July 20, 2021 12:13
@codecov

codecov bot commented Jul 20, 2021

Codecov Report

Merging #768 (4f83ec9) into main (15bb0b1) will increase coverage by 0.11%.
The diff coverage is 79.05%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #768      +/-   ##
==========================================
+ Coverage   74.64%   74.75%   +0.11%     
==========================================
  Files         143      154      +11     
  Lines        6633     7139     +506     
==========================================
+ Hits         4951     5337     +386     
- Misses       1435     1518      +83     
- Partials      247      284      +37     
Impacted Files Coverage Δ
...e/lowestordinalpriority/lowest_ordinal_priority.go 66.66% <66.66%> (ø)
...alpriority/remove_with_highest_ordinal_priority.go 66.66% <66.66%> (ø)
...uler/plugins/core/evenpodspread/even_pod_spread.go 70.90% <70.90%> (ø)
...labilitynodepriority/availability_node_priority.go 72.72% <72.72%> (ø)
...labilityzonepriority/availability_zone_priority.go 74.28% <74.28%> (ø)
pkg/common/scheduler/state/interface.go 75.75% <75.75%> (ø)
pkg/common/scheduler/statefulset/scheduler.go 76.80% <75.93%> (-6.11%) ⬇️
pkg/common/scheduler/state/state.go 81.94% <81.94%> (ø)
.../kafka/nomaxresourcecount/no_max_resource_count.go 82.35% <82.35%> (ø)
...adpriority/remove_with_even_pod_spread_priority.go 83.33% <83.33%> (ø)
... and 19 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 15bb0b1...4f83ec9. Read the comment docs.

@aavarghese aavarghese force-pushed the issue#593 branch 2 times, most recently from 3d0817b to b2bce5c Compare July 28, 2021 22:27
@aavarghese aavarghese force-pushed the issue#593 branch 11 times, most recently from eb05e8e to 2798557 Compare August 10, 2021 17:11
zoneMap := make(map[string]struct{})
for i := 0; i < len(nodes); i++ {
	node := nodes[i]
	if node.Spec.Unschedulable {
Contributor

I think this line requires some explanation. In theory the vpod scheduler shouldn't have to deal with nodes directly. A pod on an unschedulable node might still be viable, right?

Contributor Author

It's a thin line, I feel. A node is marked as Unschedulable in its spec after it has been (optionally) drained and cordoned. The Kubernetes scheduler won't place new pods onto that node, so the idea was that we should not place new vreplicas onto a pod on that node either, because the next step for the node might be rebooting/shutting down/upgrading etc. So this is a preventative step, to avoid adding more rebalancing work later. That was my thinking.
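
For reference, a minimal sketch of the check being discussed, assuming a standard client-go node lister as used in the surrounding code; the stateBuilder struct and the isPodSchedulable method name are placeholders, not the PR's actual code.

```go
// Sketch only: filtering out pods whose node is cordoned/unschedulable, as
// discussed in this thread. Receiver and method names are placeholders.
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	listersv1 "k8s.io/client-go/listers/core/v1"
)

type stateBuilder struct {
	nodeLister listersv1.NodeLister
}

// isPodSchedulable returns false when the pod's node is marked Unschedulable
// in its spec (e.g. after `kubectl cordon`), so no new vreplicas are placed
// on a pod that may soon be drained.
func (s *stateBuilder) isPodSchedulable(pod *corev1.Pod) (bool, error) {
	node, err := s.nodeLister.Get(pod.Spec.NodeName)
	if err != nil {
		return false, err
	}
	return !node.Spec.Unschedulable, nil
}
```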

}

nodeName := pod.Spec.NodeName
node, err := s.nodeLister.Get(nodeName) // Node could be marked as Unschedulable - cannot schedule vreplicas on a pod running on this node
Contributor

Same here. Maybe we just don't care about pod eviction. This has to be dealt with somewhere else anyway.

Contributor Author

Each plugin must be given some info to prevent it from selecting a pod that is in the process of eviction. Not sure yet how to do that without the state collector telling it so... will think about this a bit more.
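
One hypothetical shape for that, purely as a sketch of the open question above (none of these names come from the PR): the state collector could record which pods are marked for eviction, and a filter-style check could reject them.

```go
// Sketch only: one hypothetical way for the state collector to expose
// eviction info so plugins can skip pods being drained. Not the PR's design.
package sketch

// State carries, among other things, the set of pods marked for eviction
// (e.g. during compaction) as collected by the state accessor.
type State struct {
	EvictingPods map[string]bool
}

// noEvictingPods is a predicate-style check a filter plugin could run to
// reject pods that are in the process of being evicted.
func noEvictingPods(st *State, podName string) bool {
	return !st.EvictingPods[podName]
}
```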

Signed-off-by: aavarghese <avarghese@us.ibm.com>
@aavarghese aavarghese force-pushed the issue#593 branch 4 times, most recently from e7fffb7 to 29c0181 Compare September 8, 2021 19:55
Signed-off-by: aavarghese <avarghese@us.ibm.com>
@knative-metrics-robot

The following is the coverage report on the affected files.
Say /test pull-knative-sandbox-eventing-kafka-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/sources/v1beta1/kafka_scheduling.go 80.0% 81.8% 1.8
pkg/common/scheduler/plugins/core/availabilitynodepriority/availability_node_priority.go Do not exist 81.8%
pkg/common/scheduler/plugins/core/availabilityzonepriority/availability_zone_priority.go Do not exist 83.3%
pkg/common/scheduler/plugins/core/evenpodspread/even_pod_spread.go Do not exist 81.8%
pkg/common/scheduler/plugins/core/lowestordinalpriority/lowest_ordinal_priority.go Do not exist 66.7%
pkg/common/scheduler/plugins/core/podfitsresources/pod_fits_resources.go Do not exist 100.0%
pkg/common/scheduler/plugins/core/removewithavailabilitynodepriority/remove_with_availability_node_priority.go Do not exist 87.9%
pkg/common/scheduler/plugins/core/removewithavailabilityzonepriority/remove_with_availability_zone_priority.go Do not exist 88.9%
pkg/common/scheduler/plugins/core/removewithevenpodspreadpriority/remove_with_even_pod_spread_priority.go Do not exist 86.7%
pkg/common/scheduler/plugins/core/removewithhighestordinalpriority/remove_with_highest_ordinal_priority.go Do not exist 66.7%
pkg/common/scheduler/plugins/kafka/nomaxresourcecount/no_max_resource_count.go Do not exist 88.2%
pkg/common/scheduler/state/helpers.go Do not exist 92.6%
pkg/common/scheduler/state/interface.go Do not exist 86.2%
pkg/common/scheduler/state/state.go Do not exist 88.1%
pkg/common/scheduler/statefulset/autoscaler.go 80.3% 85.3% 5.0
pkg/common/scheduler/statefulset/scheduler.go 88.1% 83.4% -4.7

@lionelvillard
Contributor

/lgtm
/approve

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 8, 2021
@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aavarghese, lionelvillard

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 8, 2021
@knative-prow-robot knative-prow-robot merged commit 91e1bec into knative-extensions:main Sep 8, 2021
@aavarghese
Contributor Author

Tracking optimizations remaining to do:

  1. Plugin args should be in the same package, e.g. #768 (comment)
  2. Node and eviction details collected in State - should they be elsewhere?
  3. Generate example checksum #768 (comment)
  4. Convert JSON to YAML #768 (comment)
  5. Modify the scheduler adapter model from StatefulSet to Deployment?
  6. Bulk scheduling of vreplicas (instead of one at a time) to improve performance
  7. Rebalancing with minimal placement changes when the number of pods/nodes/zones changes
  8. Initial number of replicas from the StatefulSet (default: 1) is not a multiple of the scale factor for HA; fix in "[scheduler] Crash when creating state" #882

/cc @lionelvillard
