This charter adheres to the conventions described in the Kubernetes Charter README and uses the Roles and Organization Management outlined in wg-governance.
Discuss and enhance the support for Batch (eg. HPC, AI/ML, data analytics, CI) workloads in core Kubernetes. We want to unify the way users deploy batch workloads to improve portability and to simplify supportability for Kubernetes providers.
- To reduce fragmentation in the k8s batch ecosystem: congregate leads and users from different external and internal projects and user groups (CNCF TAGs, k8s sub-projects focused on batch-related features such as topology-aware scheduling) in the batch ecosystem to gather requirements, validate designs and encourage reutilization of core kubernetes APIs.
- The following recommendations for enhancements:
- Additions to the batch API group, currently including Job and CronJob resources that benefit batch use cases such as HPC, AI/ML, data analytics and CI.
- Primitives for job-level queueing, not limited to the k8s Job resource. Long-term, this could include multi-cluster support.
- Primitives to control and maximize utilization of resources in fixed-size clusters (on-prem) and elastic clusters (cloud).
- Runtime and scheduling support for specialized hardware (GPUs, NUMA, RDMA, etc.)
- Addition of new API kinds that serve a specialized type of workload. The focus should be on general APIs that specialized controllers can build on top of.
- Uses of the batch APIs as support for serving workloads (eg. backups, upgrades, migrations). These can be served by existing SIGs.
- Proposals that duplicate the functionality of core kubernetes components (job-controller, kube-scheduler, cluster-autoscaler).
- Job workflows or pipelines. Mature third party frameworks serve these use cases with the current kubernetes primitives. But additional primitives to support these frameworks could be in scope.
Stakeholders in this working group span multiple SIGs that own parts of the code in core kubernetes components and addons.
- Apps
- Autoscaling
- Node
- Scheduling
The list of deliverables include the following high level features:
- To SIG Apps:
- Updated Job API that fulfills the needs of a wider range of batch applications.
- A performant job controller that can scale to thousands of pods per minute.
- To SIG Scheduling and Autoscaling
- A set of APIs to support job queueing, a framework to support different queueing policies and a ready-to-use implementation as a subproject.
- Scheduling plugin(s) to support different batch needs.
- To SIG Autoscaling:
- Capabilities for job-level provisioning.
- To SIG Node:
- Runtime support for specialized hardware.
This wg adheres to the Roles and Organization Management outlined in wg-governance and opts-in to updates and modifications to wg-governance.
Additionally, the wg commits to:
- maintain a solid communication line between the Kubernetes groups and the wider CNCF community;
- submit a proposal to the KubeCon/CloudNativeCon maintainers track; if not selected, a video update will be recorded and listed below.
As a first mandate, the wg will define a roadmap in the first quarter of operation. We envision three timelines for the exit criteria, the focus will be on early exit, but a determination on whether or not to go beyond that is left until we reach that milestone.
- Early exit: define "recommendations" for the deliverables mentioned above, those recommendations would be left to the respective sigs to implement. The WG could start implementing those recommendations in the context of the owning sig to generate some momentum.
- Milestone 2, Late exit: The WG continues the implementation of the recommendations until they reach GA, and then disband.
- Convert to SIG: The WG observes a constant influx of requirements for the artifacts and there is the risk that the SIGs don't have enough capacity to maintain them. Then, the WG will propose the graduation into a SIG, taking ownership of the APIs, controllers and scheduling plugins.