kubernetes · k8s-github-robot · Aug 28, 2017 · Apr 26, 2017 · Apr 26, 2017 · Apr 27, 2017
diff --git a/contributors/design-proposals/job.md b/contributors/design-proposals/job.md
@@ -18,6 +18,7 @@ Several existing issues and PRs were already created regarding that particular s
 1. Be able to get the job status.
 1. Be able to specify the number of instances performing a job at any one time.
 1. Be able to specify the number of successfully finished instances required to finish a job.
+1. Be able to specify a backoff policy for when a job is continuously failing.
 
 
 ## Motivation
@@ -26,6 +27,35 @@ Jobs are needed for executing multi-pod computation to completion; a good exampl
 here would be the ability to implement any type of batch oriented tasks.
 
 
+## Backoff policy and failed pod limit
+
+By design, Jobs do not have any notion of failure other than a pod's `restartPolicy`,
+which is mistakenly taken as the Job's restart policy ([#30243](https://github.com/kubernetes/kubernetes/issues/30243),
+[#43964](https://github.com/kubernetes/kubernetes/issues/43964)).  There are
+situations where one wants to fail a Job after some number of retries over a certain
+period of time, for example due to a logical error in configuration.  To do so we are going
+to introduce the following fields, which will control the exponential backoff when
+retrying a Job: the number of retries and the time to retry.  The two fields will allow
+fine-grained control over the backoff policy, limiting the number of retries over
+a specified period of time.  If only one of the two fields is supplied, an exponential
+backoff with an initial delay of ten seconds and a factor of two will be
+applied, such that either:
+* the number of retries will not exceed a specified count, if present, or
+* the maximum time elapsed will not exceed the specified duration, if present.
+
+Additionally, to help debug the issue with a Job, and limit the impact of having
+too many failed pods left around (as mentioned in [#30243](https://github.com/kubernetes/kubernetes/issues/30243)),
+we are going to introduce a field which will allow specifying the maximum number
+of failed pods to keep around.  This number will also take effect if none of the
+limits described above are set.
+
+All of the above fields will be optional and will apply no matter which `restartPolicy`
+is set on a `PodTemplate`.  The only difference applies to how failures are counted.
+For restart policy `Never` we count actual pod failures (reflected in the `.status.failed`
+field). With restart policy `OnFailure` we look at pod restarts (calculated from
+`.status.containerStatuses[*].restartCount`).
+
+
 ## Implementation
 
 Job controller is similar to replication controller in that they manage pods.
@@ -79,8 +109,21 @@ type JobSpec struct {
     // job should be run with. Defaults to 1.
     Completions *int
 
+    // Optional duration in seconds relative to the startTime that the job may be active
+    // before the system tries to terminate it; value must be a positive integer.
+    ActiveDeadlineSeconds *int
+
+    // Optional number of retries before marking this job failed.
+    BackoffLimit *int
+
+    // Optional time (in seconds) specifying how long a job should be retried before marking it failed.
+    BackoffDeadlineSeconds *int
+
+    // Optional number of failed pods to retain.
+    FailedPodsLimit *int
+
     // Selector is a label query over pods running a job.
-    Selector map[string]string
+    Selector LabelSelector
 
     // Template is the object that describes the pod that will be created when
     // executing a job.
@@ -109,12 +152,12 @@ type JobStatus struct {
     // Active is the number of actively running pods.
     Active int
 
-    // Successful is the number of pods successfully completed their job.
-    Successful int
+    // Succeeded is the number of pods that successfully completed their job.
+    Succeeded int
 
-    // Unsuccessful is the number of pods failures, this applies only to jobs
+    // Failed is the number of pod failures; this applies only to jobs
     // created with RestartPolicyNever, otherwise this value will always be 0.
-    Unsuccessful int
+    Failed int
 }
 
 type JobConditionType string