Backoff policy and failed pod limit #583

Merged · 7 commits · Aug 28, 2017
Changes from 4 commits
53 changes: 48 additions & 5 deletions contributors/design-proposals/job.md
@@ -18,6 +18,7 @@ Several existing issues and PRs were already created regarding that particular s
1. Be able to get the job status.
1. Be able to specify the number of instances performing a job at any one time.
1. Be able to specify the number of successfully finished instances required to finish a job.
1. Be able to specify a backoff policy when a job is continuously failing.


## Motivation
@@ -26,6 +27,35 @@ Jobs are needed for executing multi-pod computation to completion; a good example
here would be the ability to implement any type of batch oriented tasks.


## Backoff policy and failed pod limit

By design, Jobs do not have any notion of failure, other than a pod's `restartPolicy`
which is mistakenly taken as the Job's restart policy ([#30243](https://github.com/kubernetes/kubernetes/issues/30243),
[#43964](https://github.com/kubernetes/kubernetes/issues/43964)). There are
situations where one wants to fail a Job after some number of retries over a certain
period of time, due to a logical error in configuration etc. To do so we are going
to introduce the following fields, which will control the exponential backoff when
retrying a Job: number of retries and time to retry. The two fields will allow
fine-grained control over the backoff policy, limiting the number of retries over
a specified period of time. If only one of the two fields is supplied, an exponential
backoff with an intervening duration of ten seconds and a factor of two will be
applied, such that either:
* the number of retries will not exceed a specified count, if present, or
* the maximum time elapsed will not exceed the specified duration, if present.

> I think you are repeating "introduce{,d}". It should be something like: "To do so we are going to introduce the following fields"

> **Contributor:** It's not clear what happens if both fields are specified. Is only the number of retries limited, or do both of the fields limit each other?

> **Contributor:** Ok, it's clearer after rereading the bullet points below.

> **Member:** I don't think we should promise our users exponential backoff -- we should give ourselves the freedom to do other things in the future. Just say that we will retry at a time that the system chooses, and that in any case the number of retries won't exceed the number specified by the user.
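
To make the intended behaviour concrete, below is a minimal, non-normative sketch of how a controller might apply the two limits. The helper, its parameters, and the main loop are illustrative assumptions, and (per the review comment above) the 10s/factor-2 schedule is only the example from the text, not a guarantee.

```go
// Illustrative sketch only -- not the proposed API or the controller implementation.
// It applies the example schedule from the text (10s base, factor 2) and stops
// retrying once either of the proposed limits is exhausted.
package main

import (
	"fmt"
	"time"
)

// nextRetry reports whether another retry is allowed under the optional limits
// and, if so, the delay before it. retries is the number of retries already
// performed; sinceFirstFailure is the time spent retrying so far.
func nextRetry(backoffLimit, backoffDeadlineSeconds *int, retries int, sinceFirstFailure time.Duration) (time.Duration, bool) {
	if backoffLimit != nil && retries >= *backoffLimit {
		return 0, false // retry count exhausted
	}
	if backoffDeadlineSeconds != nil &&
		sinceFirstFailure >= time.Duration(*backoffDeadlineSeconds)*time.Second {
		return 0, false // retry window exhausted
	}
	// Example schedule only: 10s, 20s, 40s, ...
	return (10 * time.Second) << uint(retries), true
}

func main() {
	limit, deadline := 3, 300 // e.g. at most 3 retries within 5 minutes
	elapsed := time.Duration(0)
	for r := 0; ; r++ {
		delay, ok := nextRetry(&limit, &deadline, r, elapsed)
		if !ok {
			fmt.Printf("marking the Job failed after %d retries\n", r)
			return
		}
		fmt.Printf("retry %d after %v\n", r+1, delay)
		elapsed += delay
	}
}
```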

Additionally, to help debug the issue with a Job, and to limit the impact of having
too many failed pods left around (as mentioned in [#30243](https://github.com/kubernetes/kubernetes/issues/30243)),
we are going to introduce a field which will allow specifying the maximum number
of failed pods to keep around. This number will also take effect if none of the
limits described above are set.

> Why not also limit the number of successful pods, which may also take a lot of resources?
> **Member:** Is the intent that users will typically set this to 0 unless they expect to need to debug? Or do we typically want to keep around at least 1 failed pod so the user can get the logs?

> **Contributor:** I vote to have this defaulted to 1.

> **Contributor (author):** 1 seems a reasonable default.
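
As a rough illustration of the failed pod limit, the sketch below keeps only the most recent failures for debugging and deletes the rest; the `failedPod` type and `podsToDelete` helper are hypothetical stand-ins, not part of the proposed API.

```go
// Illustrative sketch only: prune failed pods down to FailedPodsLimit, keeping
// the most recent failures around for debugging. failedPod is a stand-in type,
// not the real Kubernetes Pod object.
package main

import (
	"fmt"
	"sort"
	"time"
)

type failedPod struct {
	Name     string
	Finished time.Time
}

// podsToDelete returns the failed pods exceeding the limit, oldest first,
// so that the newest failures are the ones kept for debugging.
func podsToDelete(failed []failedPod, limit int) []failedPod {
	if limit < 0 || len(failed) <= limit {
		return nil
	}
	sort.Slice(failed, func(i, j int) bool {
		return failed[i].Finished.Before(failed[j].Finished)
	})
	return failed[:len(failed)-limit]
}

func main() {
	now := time.Now()
	failed := []failedPod{
		{Name: "job-pod-a", Finished: now.Add(-3 * time.Minute)},
		{Name: "job-pod-b", Finished: now.Add(-2 * time.Minute)},
		{Name: "job-pod-c", Finished: now.Add(-1 * time.Minute)},
	}
	// With a limit of 1, the two oldest failed pods would be removed.
	for _, p := range podsToDelete(failed, 1) {
		fmt.Println("would delete", p.Name)
	}
}
```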


All of the above fields will be optional and will apply no matter which `restartPolicy`
is set on a `PodTemplate`. The only difference is how failures are counted.
For restart policy `Never` we count actual pod failures (reflected in the `.status.failed`
field). With restart policy `OnFailure` we look at pod restarts (calculated from
`.status.containerStatuses[*].restartCount`).
> 👍

> **Member:** This cannot be synchronous, though, so the limit has to be approximate. Document that it is precise with Never, and approximate with OnFailure?

> **Contributor (author):** Yup, makes sense.
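
A minimal sketch of the two counting modes described above; the types are stand-ins for the status fields named in the text (`.status.failed` and `.status.containerStatuses[*].restartCount`), not the real API objects.

```go
// Illustrative sketch of the two failure-counting modes described above:
// with restartPolicy=Never the Job's .status.failed counter is used directly,
// while with restartPolicy=OnFailure the pods' container restart counts are
// summed. The types below are stand-ins, not the real API objects.
package main

import "fmt"

type jobStatus struct {
	Failed int // mirrors .status.failed
}

type containerStatus struct {
	RestartCount int
}

type pod struct {
	ContainerStatuses []containerStatus // mirrors .status.containerStatuses
}

func failureCount(restartPolicy string, js jobStatus, pods []pod) int {
	if restartPolicy == "Never" {
		return js.Failed // failed pods are counted directly
	}
	// OnFailure: failed containers are restarted in place, so count restarts.
	total := 0
	for _, p := range pods {
		for _, cs := range p.ContainerStatuses {
			total += cs.RestartCount
		}
	}
	return total
}

func main() {
	pods := []pod{{ContainerStatuses: []containerStatus{{RestartCount: 2}, {RestartCount: 1}}}}
	fmt.Println(failureCount("Never", jobStatus{Failed: 4}, pods)) // 4
	fmt.Println(failureCount("OnFailure", jobStatus{}, pods))      // 3
}
```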



## Implementation

The Job controller is similar to the replication controller in that both manage pods.
@@ -79,8 +109,21 @@ type JobSpec struct {
// job should be run with. Defaults to 1.
Completions *int

// Optional duration in seconds relative to the startTime that the job may be active
// before the system tries to terminate it; value must be a positive integer.
ActiveDeadlineSeconds *int

> I think the comments for exported fields should start with the name of the field: "ActiveDeadlineSeconds is an optional..."

> **Contributor (author):** Not anymore, the def will have those in place.

> Can uint be used, also in the other fields?

> **Contributor (author):** We used ints for consistency, and then do appropriate validation so the value is non-negative and bigger than 0.

> **Contributor:** `*int32`

> **Contributor:** `*int32`
>
> Actually `*int64`

> **Contributor:** Can you also update the rest of the fields to their canonical types?

> **Member:** I think this is the deadline for the entire job to complete, regardless of whether there are restarts, right? Document that.

> **Contributor (author):** Correct.

> **Contributor (author):** It clearly states it's relative to the job's startTime.


// Optional number of retries before marking this job failed.
BackoffLimit *int
> **Member:** Why BackoffLimit instead of RetryLimit or TriesLimit?

> **Contributor (author):** I went with Backoff as a prefix for every field related to that, since we're talking about the Backoff policy.


// Optional time (in seconds) specifying how long a job should be retried before marking it failed.
BackoffDeadlineSeconds *int

> Obviously this duration shouldn't be negative (in which case I assume we'd ignore it, or reject it), but what about zero? Since it's a pointer, nil means that it's not set, but what if it's present but zero? Is that the same as it not being set, or is that a valid duration meaning that we won't wait any amount of time for a job to complete (that it must complete immediately)?

> **Contributor (author):** We usually require those to be >= 0; it'll be checked in validation.
> **Contributor:** ProgressDeadlineSeconds to match the field in Deployments? Seems like it's the same concept, although in Deployment we merely add a condition in the DeploymentStatus.

> **Member:** How would a user typically use ActiveDeadlineSeconds with BackoffDeadlineSeconds?

> **Contributor:** @erictune @soltysh if I understand correctly, the only check the system may run is: ActiveDeadlineSeconds must be strictly bigger than BackoffDeadlineSeconds. Am I missing something?

> **Contributor (author):** @Kargakis this is different from ProgressDeadlineSeconds: in the Deployments world, PDS is how long you are waiting for the deployment to make progress and then fail. With a Job, BackoffDeadlineSeconds applies at any point in time. This means that if a Job expects to reach 5 completions, 4 of them can succeed and only the 5th Pod might be having troubles, and BDS should still apply. That's why I'd rather not use the same name; it's intentional.

> **Contributor (author):** @erictune @sdminonne ADS is the overall time for a Job (max lifetime), whereas BDS is counted from the moment the first failure happens. But yes, usually BDS will be smaller than ADS; otherwise you'll never hit BDS in case of a failure.

> **Contributor:** Well, PDS is not that different. We start counting from the last time we saw progress, and once the deadline is exceeded we merely switch the Progressing condition to False. "Failing the deployment" should be something that a higher-level orchestrator decides based on the condition.
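
For clarity, a small non-normative sketch of what the two deadlines measure, following the author's explanation above; the `deadlineExceeded` helper and its parameters are hypothetical, not the proposed controller code.

```go
// Illustrative sketch: ActiveDeadlineSeconds is measured from the Job's
// startTime, while BackoffDeadlineSeconds (as described above) is measured
// from the first observed failure.
package main

import (
	"fmt"
	"time"
)

func deadlineExceeded(now, startTime, firstFailure time.Time, ads, bds *int64) bool {
	if ads != nil && now.Sub(startTime) > time.Duration(*ads)*time.Second {
		return true // overall Job deadline exceeded
	}
	if bds != nil && !firstFailure.IsZero() &&
		now.Sub(firstFailure) > time.Duration(*bds)*time.Second {
		return true // retry window since the first failure exceeded
	}
	return false
}

func main() {
	ads, bds := int64(600), int64(120)
	start := time.Now().Add(-5 * time.Minute)
	firstFailure := time.Now().Add(-3 * time.Minute)
	// Five minutes into the Job and three minutes after the first failure:
	// BDS (2 minutes) is already exceeded even though ADS (10 minutes) is not.
	fmt.Println(deadlineExceeded(time.Now(), start, firstFailure, &ads, &bds))
}
```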


// Optional number of failed pods to retain.
FailedPodsLimit *int
> **Contributor:** When are pods left around? I guess when pods have restartPolicy=Never? It feels weird to add a feature that is not related directly to the failurePolicy and is also only useful for a specific case in Jobs. Can we leave it out of this proposal?

> **Contributor (author):** This is for debugging purposes. Leaving pods around should allow you to figure out what went wrong. Without it, and with restart policy Never, you might end up with many failed pods if you set the limits high. That was one of the original requests wrt the backoff policy, so I think it's crucial to address it here.


// Selector is a label query over pods running a job.
-Selector map[string]string
+Selector LabelSelector

// Template is the object that describes the pod that will be created when
// executing a job.
@@ -109,12 +152,12 @@ type JobStatus struct {
// Active is the number of actively running pods.
Active int

-// Successful is the number of pods successfully completed their job.
-Successful int
+// Succeeded is the number of pods successfully completed their job.
+Succeeded int

-// Unsuccessful is the number of pods failures, this applies only to jobs
+// Failed is the number of pods failures, this applies only to jobs
// created with RestartPolicyNever, otherwise this value will always be 0.
-Unsuccessful int
+Failed int
}

type JobConditionType string