Workload
Besides the core workload spec for requested resources, is any other info annotated or set to help the scheduler make the right decisions? Pending.
Rate and tenant deployment and burst
Workload types: ratio of long-running services to short jobs, or other info, or just VMs? No hints for this info are available at scheduling time. APM can have the info.
Ratio of applications with affinity (app to app; app to data) -- current thinking is not to support region-level affinity and to favor achieving the perf goal of the global scheduler first.
Policies: consolidated vs. distributed workload; priority vs fairness, etc. What is the desired default?
Number of tenants, and avg and max of services/apps per tenant, i.e. how complex is the app topology of a given tenant's service? Same as bullet 5: not supported for now.
Resources
Up to 1M hosts. The global scheduler needs to have a view of the global state of each region's resources; how fine-grained the resources need to be defined is still open. HOST resource config (such as which VM flavors the host can have) will change when needed, so the view will need to be refreshed frequently. The IaaS-layer resource topology is actually a graph (a tree structure, but with some links among leaf nodes or branch nodes due to updated flavors supported on the node).
CPU, memory, network bandwidth, GPU/FPGA: any others to be considered as HOST resources? Resource slots for the multiple VM flavors to be supported on the node.
Is a HOST a physical machine and/or a VM? -- pool
HOST Heterogeneity complexity?
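To make the "resource slots per VM flavor" idea concrete, here is a minimal sketch of one possible host record. Everything in it is an assumption for illustration only: the names `Flavor`, `Host`, `set_flavors`, and `place`, the per-flavor slot counters, and the per-host `version` field are hypothetical and not part of any agreed design. It also ignores that placing one flavor would normally reduce the capacity left for other flavors.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Flavor:
    """A VM flavor (hypothetical shape; fields are illustrative)."""
    name: str
    vcpus: int
    memory_gb: int
    gpus: int = 0


@dataclass
class Host:
    """One HOST as the global scheduler might see it (assumed model)."""
    host_id: str
    region: str
    # flavor name -> number of free slots for that flavor on this host
    free_slots: Dict[str, int] = field(default_factory=dict)
    # bumped on every change so a cached view can tell it is stale
    version: int = 0

    def set_flavors(self, slots: Dict[str, int]) -> None:
        """Replace the supported flavors, e.g. after a host config change."""
        self.free_slots = dict(slots)
        self.version += 1

    def can_place(self, flavor: Flavor) -> bool:
        return self.free_slots.get(flavor.name, 0) > 0

    def place(self, flavor: Flavor) -> bool:
        """Consume one slot for the flavor; return False if none is free."""
        if not self.can_place(flavor):
            return False
        self.free_slots[flavor.name] -= 1
        self.version += 1
        return True
```

A version counter per host is one simple way to detect that a flavor config change has invalidated a cached view; a real design may need a finer or coarser granularity.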
System
System throughput 100k/sec
Global scheduler to coordinate with the region/cluster-level scheduler -- needs further design. Each instance needs the whole view of region-level resources for all regions. How to sync the refreshed views, etc. are the design challenges. Currently this is still based on the omega or parsync model to sync resource state views.
Distributed (multi-scheduler) partitioned or stateless
Single-instance perf and throughput: sub-millisecond scheduling time at the global scheduler layer.
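For the shared-state sync mentioned above, a rough sketch of an Omega-style optimistic commit may help the discussion. This is not the proposed implementation: `RegionState`, `snapshot`, `try_commit`, and `schedule` are hypothetical names, and a single version counter per region is a deliberate simplification (it rejects any concurrent change, even a non-conflicting one). Each scheduler instance decides against its own snapshot without holding a lock, and the commit fails if the region state changed in the meantime.

```python
import threading
from typing import Tuple


class RegionState:
    """Shared resource state for one region (assumed, simplified model)."""

    def __init__(self, free_slots: int) -> None:
        self._lock = threading.Lock()
        self._free = free_slots
        self._version = 0

    def snapshot(self) -> Tuple[int, int]:
        """Return (version, free_slots) as seen right now."""
        with self._lock:
            return self._version, self._free

    def try_commit(self, seen_version: int, slots: int) -> bool:
        """Apply a placement only if the state is unchanged since the snapshot."""
        with self._lock:
            if seen_version != self._version or slots > self._free:
                return False  # stale view or not enough capacity
            self._free -= slots
            self._version += 1
            return True


def schedule(state: RegionState, slots: int, max_attempts: int = 3) -> bool:
    """Optimistic loop: snapshot, decide without the lock, commit, retry on conflict."""
    for _ in range(max_attempts):
        version, free = state.snapshot()
        if free < slots:
            return False  # no capacity in this region
        if state.try_commit(version, slots):
            return True
    return False
```

The lock is held only for the short commit, so decision time stays outside the critical section; the cost is retries under contention, which matters for the failed-scheduling latency questions below.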
Requirements for handling failed workload scheduling:
- Latency tolerance of workload scheduling when scheduling fails
2nd scheduling flow for SLO: global scheduler assisted or region-to-region self-managed directly; current monitoring system latency, etc.