- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- (R) Graduation criteria is in place
- (R) Production readiness review completed
- Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
The Jobs API allows users to create Pods using the Job controller with specific requirements such as Pod-level parallelism, completing a certain number of executions, and restart policies.
This KEP proposes an enhancement that will allow suspending and resuming Jobs. Suspending a Job will delete all active Pods owned by that Job and also instruct the controller manager to not create new Pods until the Job is resumed. Users will also be able to create Jobs in the suspended state, thereby delaying Pod creation indefinitely.
The Job controller tracks and manages all Jobs within Kubernetes. When a new Job is created by the user, the Job controller will immediately begin creating pods to satisfy the requirements of the Job until the Job is complete.
However, there are use-cases that require a Job to be suspended in the middle of its execution. Currently, this is impossible; as a workaround, users can delete and re-create the Job, but this will also delete Job metadata such as the list of successful/failed Pods and logs. Therefore, it is desirable to suspend Jobs and resume them later when convenient.
- Allow suspending and resuming Jobs.
- Allow indefinitely delaying the creation of Pods owned by a Job.
- Being able to suspend and resume Jobs makes Job preemption, all-or-nothing scheduling, and Job queueing possible. However, we don't propose to create such higher-level controllers as a part of this KEP.
- It might be useful to restrict who can set/modify the `suspend` flag. However, we don't propose to create the validating and/or mutating webhooks necessary to achieve that as a part of this KEP.
Consider a cloud provider where servers are cheaper at night. My Job takes several days to complete, so I'd like to suspend my Job early in the morning every day and resume it after dark to save money. However, I don't want to delete and re-create my Job every day as I don't want to lose track of completed Pods, logs, and other metadata.
Let's say I'm a system administrator and there are many users submitting Jobs to my cluster. All user-submitted Jobs are created with `suspend: true`. There's only a finite amount of resources, so I must resume these suspended Jobs in the right order at the right time.
I can write a higher-level Job queueing controller to do this based on external factors. For example, the controller could choose to simply resume Jobs in FIFO order. Alternatively, Jobs could be assigned priorities and, just like kube-scheduler, the controller can make a decision based on the suspended Job queue (it can even do Job preemption). Each Job could request a different amount of resources, so the higher-level controller may also want to resize the cluster to just fit the Job it's going to run. Regardless of what logic the controller uses to queue Jobs, being able to suspend Jobs indefinitely and then resume them later is important.
System administrators may want to restrict who can set/modify the `suspend` field when creating or updating Jobs. This can be achieved with validating and/or mutating webhooks.
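As a non-normative illustration only (this KEP does not propose such a webhook), a validating webhook along the following lines could compare the old and new Job objects and reject changes to the field; the HTTP path, allow-list, and certificate file names below are made-up examples.

```go
// Sketch of a validating admission webhook handler that rejects setting or
// changing .spec.suspend unless the request comes from an allowed user.
// Illustrative only: error handling, TypeMeta handling, and the TLS setup
// required for a real admission webhook are abbreviated.
package main

import (
	"encoding/json"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// allowedUsers is a hypothetical allow-list of principals permitted to
// set or modify the suspend field.
var allowedUsers = map[string]bool{
	"system:serviceaccount:queue-system:job-queue-controller": true,
}

func validateSuspend(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "malformed AdmissionReview", http.StatusBadRequest)
		return
	}

	// Decode the new object and, for UPDATE requests, the old one.
	var oldJob, newJob batchv1.Job
	_ = json.Unmarshal(review.Request.Object.Raw, &newJob)
	_ = json.Unmarshal(review.Request.OldObject.Raw, &oldJob)

	suspendChanged := boolValue(oldJob.Spec.Suspend) != boolValue(newJob.Spec.Suspend)
	allowed := !suspendChanged || allowedUsers[review.Request.UserInfo.Username]

	review.Response = &admissionv1.AdmissionResponse{UID: review.Request.UID, Allowed: allowed}
	if !allowed {
		review.Response.Result = &metav1.Status{Message: "only the job queue controller may change .spec.suspend"}
	}
	_ = json.NewEncoder(w).Encode(review)
}

func boolValue(b *bool) bool { return b != nil && *b }

func main() {
	http.HandleFunc("/validate-job-suspend", validateSuspend)
	// Admission webhooks must be served over HTTPS; certificate paths are placeholders.
	_ = http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", nil)
}
```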
When a Job is suspended, the time is recorded as part of the Job status. Users can infer how long a Job has been suspended for through this field. This can be useful when making decisions around which Job should be resumed.
A Job that is complete cannot be suspended.
Suspending an active Job deletes all active pods. Users must design their application to gracefully handle this.
We propose adding a `suspend` field to the `JobSpec` API:
```go
type JobSpec struct {
	// Suspend specifies whether the Job controller should create Pods or not. If
	// a Job is created with suspend set to true, no Pods are created by the Job
	// controller. If a Job is suspended after creation (i.e. the flag goes from
	// false to true), the Job controller will delete all active Pods associated
	// with this Job. Users must design their workload to gracefully handle this.
	// This is an alpha field and requires enabling the SuspendJob feature gate.
	// Defaults to false.
	// +optional
	Suspend *bool `json:"suspend,omitempty"`
	...
}
```
As described in the comment, when the boolean is set to true, the controller manager abstains from creating Pods even if there's work left to be done. If the Job is already active and is updated with `suspend: true`, the Job controller calls Delete on all of its active Pods. This causes the kubelet to send a SIGTERM signal and completely remove the Pod after its graceful termination period is honoured. Pods terminated this way are considered a failure and the controller does not count terminated Pods towards completions. This behaviour is similar to decreasing the Job's parallelism to zero.
Pods that completed before suspension will count towards completions after the Job is resumed. For example, for Jobs with `completionMode: Indexed`, successfully completed indexes will not run again.
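For illustration, the sketch below shows how a user might exercise the new field with client-go: the Job is created in the suspended state and later resumed by patching `.spec.suspend` to false. The Job name, namespace, container image, and kubeconfig path are placeholders, not part of this proposal.

```go
// Sketch: create a Job in the suspended state, then resume it by flipping
// .spec.suspend to false. Error handling is abbreviated.
package main

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	suspend := true
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "pi", Namespace: "default"},
		Spec: batchv1.JobSpec{
			Suspend: &suspend, // no Pods are created until this becomes false
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:    "pi",
						Image:   "perl",
						Command: []string{"perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"},
					}},
				},
			},
		},
	}
	if _, err := client.BatchV1().Jobs("default").Create(context.TODO(), job, metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	// Later, resume the Job; the controller then starts creating Pods and
	// resets .status.startTime when it observes the change.
	patch := []byte(`{"spec":{"suspend":false}}`)
	if _, err := client.BatchV1().Jobs("default").Patch(context.TODO(), "pi", types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}
}
```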
Similar to the existing `JobConditionType`s "Complete" and "Failed", we propose adding a new condition type called "Suspended" as a part of the Job's status as follows:
```go
// These are valid conditions of a job.
const (
	// JobSuspended means the job has been suspended.
	JobSuspended JobConditionType = "Suspended"
	// JobComplete means the job has completed its execution.
	JobComplete JobConditionType = "Complete"
	// JobFailed means the job has failed its execution.
	JobFailed JobConditionType = "Failed"
)
```
To determine if a Job has been suspended, users must look for the `JobCondition` with `Type` "Suspended". If such a `JobCondition` does not exist or if its `Status` field is false, the Job is not suspended. Otherwise, if the `Status` field is true, the Job is suspended. Note that when the `suspend` flag in the Job spec is flipped from true to false, the Job controller simply updates the existing suspend `JobCondition` status to false; it does not remove the condition or add a new one.

Inferring suspension status from the Job spec's `suspend` field is not recommended as the Job controller may not have seen the update yet. When a Job is suspended, the Job controller sets the `JobCondition` only after all active Pods owned by the Job are terminating or have been deleted.
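As an illustrative sketch only (not part of the proposed API changes), client code could encapsulate this check as follows; `JobSuspended` refers to the condition type constant proposed above.

```go
// Package name is a placeholder for wherever such a helper would live.
package jobutil

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// isJobSuspended reports whether the Job controller considers the Job
// suspended, based on the "Suspended" condition rather than .spec.suspend.
func isJobSuspended(job *batchv1.Job) bool {
	for _, c := range job.Status.Conditions {
		if c.Type == batchv1.JobSuspended {
			// The condition stays on the Job after a resume; only its
			// Status flips back to false.
			return c.Status == corev1.ConditionTrue
		}
	}
	// No "Suspended" condition: the Job has never been suspended, or the
	// controller has not yet observed the suspension.
	return false
}
```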
The suspend `JobCondition` also has a `LastTransitionTime` field. This can be used to infer how long a Job has been suspended for (if `Status` is true).

The `StartTime` field of the Job status is reset to the current time every time the Job is resumed from suspension. If a Job is created with `suspend: true`, the `StartTime` field of the Job status is set only when it is resumed for the first time.
If a Job is suspended (at creation or through an update), the `ActiveDeadlineSeconds` timer will effectively be stopped and reset when the Job is resumed again. That is, Jobs will never be terminated for exceeding `ActiveDeadlineSeconds` while suspended. Users must interpret `ActiveDeadlineSeconds` as the duration for which a Job can be continuously active before it is terminated.
When a Job is suspended or created in the suspended state, a "Suspended" event is recorded. Similarly, when a Job is resumed from its suspended state, a "Resumed" event is recorded.
Unit, integration, and end-to-end tests will be added to test that:
- Creating a Job with `suspend: true` should not create Pods (see the sketch after this list)
- Suspending a Job should delete active Pods
- Resuming a Job should re-create Pods
- Jobs should remember completions count after a suspend-resume cycle
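To make the first case concrete, here is a hedged e2e-style sketch; `newTestClient` and `ns` are hypothetical helpers assumed to come from the surrounding test harness, and the fixed sleep is for illustration only (a real test would poll).

```go
// Hypothetical test package; newTestClient and ns are assumed harness helpers.
package e2e

import (
	"context"
	"testing"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func TestSuspendedJobCreatesNoPods(t *testing.T) {
	client := newTestClient(t) // hypothetical helper returning a kubernetes.Interface
	ctx := context.TODO()

	suspend := true
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "suspended-job", Namespace: ns},
		Spec: batchv1.JobSpec{
			Suspend: &suspend,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers:    []corev1.Container{{Name: "c", Image: "busybox", Command: []string{"sleep", "30"}}},
				},
			},
		},
	}
	if _, err := client.BatchV1().Jobs(ns).Create(ctx, job, metav1.CreateOptions{}); err != nil {
		t.Fatal(err)
	}

	// Give the Job controller time to act, then assert that no Pods were
	// created for this Job (the controller labels owned Pods with job-name).
	time.Sleep(10 * time.Second)
	pods, err := client.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: "job-name=suspended-job"})
	if err != nil {
		t.Fatal(err)
	}
	if len(pods.Items) != 0 {
		t.Errorf("expected no Pods for a suspended Job, got %d", len(pods.Items))
	}
}
```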
- Metrics providing observability into the Job controller are available
- Implemented feedback from alpha testers
- We're confident that no further semantic changes will be needed to achieve the goals of the KEP
- All known functional bugs have been fixed
Upgrading from 1.20 and below will not change the behaviour of how Jobs work.
To make use of this feature, the `SuspendJob` feature gate must be explicitly enabled on the kube-apiserver and kube-controller-manager, and the `suspend` field must be explicitly set in the Job spec.
The change is entirely limited to the control plane. Version skew across control plane / kubelet does not change anything.
In HA clusters, version skew across different replicas in the control plane should also work seamlessly because only one controller manager will be active at any given time.
- How can this feature be enabled / disabled in a live cluster?
  - Feature gate (also fill in values in `kep.yaml`)
    - Feature gate name: SuspendJob
    - Components depending on the feature gate:
      - kube-apiserver
      - kube-controller-manager
- Does enabling the feature change any default behavior? No, using the feature requires explicitly opting in by setting the `suspend` field.
- Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? Yes. Turning off the feature gate will disable the feature. Once the feature gate is turned off, the default Job controller will ignore the `suspend` field in all Jobs, so existing suspended Jobs will be resumed indirectly when the controller manager is restarted.
- What happens if we reenable the feature if it was previously rolled back? Jobs that have the flag set will be suspended, and new Jobs or updates to the field in existing ones will be persisted.
- Are there any tests for feature enablement/disablement? Yes. Integration tests exhaustively test switching between different feature enablement states while using the feature at the same time. Unit tests and end-to-end tests cover feature enablement too.
- How can a rollout fail? Can it impact already running workloads? Impact to existing Jobs that previously didn't use this feature in alpha is impossible. In workloads using the feature in an older version, suspended Jobs may inadvertently be resumed (or Jobs may be inadvertently suspended) if there are storage-related issues arising from components crashing mid-rollout.
- What specific metrics should inform a rollback? `job_sync_duration_seconds` and `job_sync_total` should be observed. Unexpected spikes in these metrics with labels `result=error` and `action=pods_deleted` are potentially an indicator that:
  - Job suspension is producing errors in the Job controller,
  - Jobs are getting suspended when they shouldn't be, or
  - Job sync latency is high when Jobs are suspended.

  While the above list isn't exhaustive, these are signals in favour of a rollback.
- Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? Yes, manually tested successfully.
- Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? No.
- How can an operator determine if the feature is in use by workloads? The `.spec.suspend` field is set to true in Jobs using the feature. The status conditions of a Job can also be used to determine whether a Job is using the feature (look for a condition of type "Suspended").
- How can someone using this feature know that it is working for their instance?
  - Events
    - Event Reason: Suspended
    - The message includes the Job name.
  - API .status
    - Condition name: Suspended
- What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  - Metrics
    - Metric name: The metrics `job_sync_duration_seconds` and `job_sync_total` get a new label named `action` to allow operators to filter Job sync latency and error rate, respectively, by the action performed. There are four mutually-exclusive values possible for this label:
      - `reconciling`, when the Job's Pod creation/deletion expectations are unsatisfied and the controller is waiting for issued Pod creations/deletions to complete.
      - `tracking`, when the Job's Pod creation/deletion expectations are satisfied and the number of active Pods matches expectations (i.e. no Pod creations/deletions were issued in this sync). This is expected to be the action in most of the syncs.
      - `pods_created`, when the controller creates Pods. This can happen when the number of active Pods is less than the wanted Job parallelism.
      - `pods_deleted`, when the controller deletes Pods. This can happen if a Job is suspended or if the number of active Pods is more than parallelism.

      Each sample of the two metrics will have exactly one of the above values for the `action` label.
    - Components exposing the metric:
      - kube-controller-manager
- What are the reasonable SLOs (Service Level Objectives) for the above SLIs?
  - per-day percentage of `job_sync_total` with labels `result=error` and `action=pods_deleted` <= 1%
  - 99th percentile over day for `job_sync_duration_seconds` with label `action=pods_deleted` is <= 15s, assuming a client-side QPS limit of 50 calls per second
- Are there any missing metrics that would be useful to have to improve observability of this feature? No.
- Does this feature depend on any specific services running in the cluster? No; the feature is restricted to the kube-apiserver and kube-controller-manager.
- Will enabling / using this feature result in any new API calls? Yes. When a Job is suspended, a DELETE request will be issued for all of its active Pods; similarly, when a Job is resumed, Pods will be created again.
- Will enabling / using this feature result in introducing new API types? No.
- Will enabling / using this feature result in any new calls to the cloud provider? No.
- Will enabling / using this feature result in increasing size or count of the existing API objects? Each JobSpec object will increase by the size of a boolean. Each JobStatus may have an additional `JobCondition` entry.
- Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? No.
- Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? No.
The Troubleshooting section currently serves the `Playbook` role. We may consider splitting it into a dedicated `Playbook` document (potentially with some monitoring details). For now, we leave it here.
- How does this feature react if the API server and/or etcd is unavailable? Updates to suspend or resume a Job will not work. The controller will not be able to create or delete Pods. Events, logs, and status conditions for Jobs will not be updated to reflect their suspended status.
- What are other known failure modes? None. The API server, etcd, and the controller manager are the only possible points of failure.
- What steps should be taken if SLOs are not being met to determine the problem?
  - Verify that kube-apiserver and etcd are healthy. If not, the Job controller cannot operate, so you must fix those problems first.
  - Check whether `job_sync_total` is unexpectedly high for `result=error` and `action=pods_deleted` in comparison to other actions.
  - Check whether `job_sync_duration_seconds` is noticeably larger for `action=pods_deleted` in comparison to the other actions.
  - If control plane components are starved for CPU, which could be a potential reason behind Job sync latency spikes, consider increasing the control plane's resources.
- 2021-02-01: Initial KEP merged, alpha targeted for 1.21
- 2021-03-08: Implementation merged in 1.21 with feature gate disabled by default
- 2021-04-22: KEP updated for beta graduation in 1.22
- 2022-01-18: KEP updated for GA graduation in 1.24
Alternative strategies to achieve something similar were explored (see the KEP issue for design details); if one of those less-preferred options were chosen instead, this KEP should not be implemented.
Instead of making this a native Kubernetes feature, one could use an external controller to handle Jobs that need delayed Pod creation. This can be achieved with an `orchestratorName` field that can tell the default Job controller to ignore a Job entirely. While this approach is similar to the `schedulerName` field used in kube-scheduler, it adds unnecessary complexity and the need for additional control plane components to handle Jobs. In addition, this approach makes ownership of the Job hard to track.
None.