
Feature: Configurable Failure Policy API #262

Closed · Tracked by #239 ...
danielvegamyhre opened this issue Aug 22, 2023 · 22 comments · Fixed by #381 or #537
Labels: kind/feature Categorizes issue or PR as related to a new feature.

@danielvegamyhre
Contributor

danielvegamyhre commented Aug 22, 2023

What would you like to be added:
Right now, the JobSet FailurePolicy only allows specifying the max number of JobSet restarts to attempt before marking it as failed.

Users have requested that we also allow them to configure the failure conditions under which the JobSet will be restarted, or under which the JobSet should be failed immediately without going through the allowed number of restarts (see the "Example use cases / user story" section below for why this is needed).

As an initial thought, we could define a field in the JobSet FailurePolicy which allows the user to specify that the JobSet should respect the Job's podFailurePolicy (i.e., if a Job failed because its podFailurePolicy told it to fail immediately without restart under certain conditions, then the JobSet should also respect this and not restart).

For example, in this list of common exit codes used by containers, we can see container exit code 143 is used for containers killed by graceful termination (SIGTERM), which is the signal used by maintenance events via graceful node shutdown, as well as by workload pre-emption.

A user could configure a podFailurePolicy on their job to fail the job immediately on any exit code except 143. If the JobSet respected this, it would restart when a child job was killed by SIGTERM (maintenance event), but would not restart when a job failed due to an application code error.

Why is this needed:

Example use cases / user story:

As a user, when I enqueue and run many JobSet workloads using Kueue, I'd like to be able to configure the JobSet failure policy to restart the JobSet if containers exited with a SIGTERM exit code (e.g., maintenance event or pre-emption), but NOT restart if a container exited with an application code error, so that valuable resources are not consumed repeatedly attempting to run pods that are doomed to fail while other workloads sit idle in Kueue awaiting free resources to run.

Example API:

type FailurePolicy struct {
	// FollowPodFailurePolicy, if set, will cause the JobSet to fail immediately
	// without restart if any of its failed child jobs failed due to matching a podFailurePolicy.
	FollowPodFailurePolicy *bool `json:"followPodFailurePolicy,omitempty"`

	// MaxRestarts defines the limit on the number of JobSet restarts.
	// A restart is achieved by recreating all active child jobs.
	MaxRestarts int32 `json:"maxRestarts,omitempty"`
}

Example pod failure policy which the user could use in combination with .spec.failurePolicy.followPodFailurePolicy = true to allow JobSets to be restarted if jobs were killed by maintenance events, but not if jobs failed due to application code errors or other bugs (a fuller JobSet sketch follows the snippet):

  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: user-container
        operator: NotIn
        values: [143] # SIGTERM
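
For illustration, below is a rough sketch of how a full JobSet spec might combine the proposed followPodFailurePolicy field with the podFailurePolicy above (followPodFailurePolicy is only proposed here, and the job/container/image names are placeholders):

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: failure-policy-example
spec:
  failurePolicy:
    followPodFailurePolicy: true # proposed field, does not exist yet
    maxRestarts: 5
  replicatedJobs:
  - name: workers
    replicas: 1
    template:
      spec:
        parallelism: 4
        completions: 4
        backoffLimit: 0
        podFailurePolicy:
          rules:
          - action: FailJob
            onExitCodes:
              containerName: user-container
              operator: NotIn
              values: [143] # SIGTERM
        template:
          spec:
            restartPolicy: Never # required when using podFailurePolicy
            containers:
            - name: user-container
              image: my-training-image # placeholder
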
@kannon92
Contributor

With replicas we have some complications, as you would need an API that can understand that some replicas may fail while others don't.

I think generally, as a separation of concerns (SoC), if a Job is declared failed via PodFailurePolicy, we should respect that. I don't quite understand why we would want exit code information at the JobSet level.

@danielvegamyhre
Contributor Author

With replicas we have some complications, as you would need an API that can understand that some replicas may fail while others don't.

I think generally, as a separation of concerns (SoC), if a Job is declared failed via PodFailurePolicy, we should respect that. I don't quite understand why we would want exit code information at the JobSet level.

Say we have many large JobSet workloads enqueued in Kueue. JobSet 1 is scheduled and executed first, and has a failure policy of maxRestarts: 5. While it's running it consumes all of the resources in the cluster - no other JobSets can run. Unfortunately, there is a bug in the application code causing it to crash. The podFailurePolicy may be configured to immediately fail the job if any container exits with an exit code indicating an application code error, to avoid restarting the pods up to backoffLimit times - that's great.

However, the JobSet controller will now see a failed job and restart the entire massive JobSet running on thousands of nodes, rescheduling all of the pods with our computationally expensive exclusive affinity rules (scheduler throughput was 58 pods/sec in my tests at scale, so for thousands of pods this could take O(minutes)). It will go through this process repeatedly until it hits the failurePolicy.maxRestarts limit, at which point the JobSet will finally be marked as failed - only after, say, 30 minutes, during which time all the other JobSets enqueued in Kueue are blocked from running, since it's using the whole cluster.

This could be avoided if the JobSet controller could be aware that the Job failed due to the non-retriable exit code specified in podFailurePolicy, and simply fail the JobSet immediately rather than restarting N times.

@kannon92
Contributor

With replicas we have some complications, as you would need an API that can understand that some replicas may fail while others don't.
I think generally, as a separation of concerns (SoC), if a Job is declared failed via PodFailurePolicy, we should respect that. I don't quite understand why we would want exit code information at the JobSet level.

Say we have many large JobSet workloads enqueued in Kueue. JobSet 1 is scheduled and executed first, and has a failure policy of maxRestarts: 5. While it's running it consumes all of the resources in the cluster - no other JobSets can run. Unfortunately, there is a bug in the application code causing it to crash. The podFailurePolicy may be configured to immediately fail the job if any container exits with an exit code indicating an application code error, to avoid restarting the pods up to backoffLimit times - that's great. However, the JobSet controller will now see a failed job and restart the entire massive JobSet running on thousands of nodes, rescheduling all of the pods with our computationally expensive exclusive affinity rules (scheduler throughput was 58 pods/sec in my tests at scale, so for thousands of pods this could take O(minutes)). It will go through this process repeatedly until it hits the failurePolicy.maxRestarts limit, at which point the JobSet will finally be marked as failed - only after, say, 30 minutes, during which time all the other JobSets enqueued in Kueue are blocked from running, since it's using the whole cluster.

This could be avoided if the JobSet controller could be aware that the Job failed due to the non-retriable exit code specified in podFailurePolicy, and simply fail the JobSet immediately rather than restarting N times.

I see. I still don't think the API should be about exit codes/conditions of the container; rather, you want some kind of JobFailurePolicy at the JobSet level. In this case, it could be conditions or statuses from the jobs that determine whether we restart. I believe you can know that the jobs failed due to some kind of PodFailurePolicy match - IIRC we publish a condition on a PodFailurePolicy match, so that info would be in the Job.

What kind of API are you thinking of? I find it easier to work backwards from the API.

@danielvegamyhre
Contributor Author

danielvegamyhre commented Aug 23, 2023

Here are a couple of ideas for a possible API:

Option 1: Generic, but more configuration required

type FailurePolicy struct {
	// FollowPodFailurePolicy, if set, will cause the JobSet to fail immediately
	// without restart if any of its failed child jobs failed due to matching a podFailurePolicy.
	FollowPodFailurePolicy *bool `json:"followPodFailurePolicy,omitempty"`

	// MaxRestarts defines the limit on the number of JobSet restarts.
	// A restart is achieved by recreating all active child jobs.
	MaxRestarts int32 `json:"maxRestarts,omitempty"`
}

This would require the user to define podFailurePolicies themselves in their Job templates. The implementation would then involve checking if a job failure was due to a match on a podFailurePolicy and then failing the JobSet rather than restarting MaxRestarts times.

One downside is that failing the JobSet on a match against ANY podFailurePolicy is rather sweeping and inexact; however, I didn't see any obvious way to reference specific pod failure policies (e.g., by name or ID).

Option 2: Less generic, but less configuration required

type FailurePolicy struct {
	// FailNonRetriable, if set, will not restart the JobSet if one or more
	// of its child jobs failed due to a non-retriable container exit code.
	// Exit code 143 (SIGTERM) is treated as the only retriable exit code, as it
	// is used for maintenance events and workload pre-emption, which are scenarios
	// where JobSet recreation is desired.
	FailNonRetriable *bool `json:"failNonRetriable,omitempty"`

	// MaxRestarts defines the limit on the number of JobSet restarts.
	// A restart is achieved by recreating all active child jobs.
	MaxRestarts int32 `json:"maxRestarts,omitempty"`
}

In this option, the user does not have to define a podFailurePolicy in their Job template. If FailNonRetriable is set, the JobSet webhook would inject a podFailurePolicy that fails the job immediately for any container exit code not resulting from a SIGTERM. The JobSet controller can then check for this and fail the JobSet immediately if the job failure occurred due to matching this injected podFailurePolicy.

The podFailurePolicy we would inject into the Job template:

  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: user-container
        operator: NotIn
        values: [143] # SIGTERM

Personally I am leaning toward Option 1, since it is more generic/flexible and we don't lock ourselves into any specific podFailurePolicy. The downside is that it requires our users to have more k8s-specific knowledge and configure more "knobs," which I would like to minimize.

@kannon92
Contributor

Maybe you can draft a KEP with some examples of Job with PodFailurePolicies and JobSet.

I'd also like to ask @mimowo for his thoughts on this once he is back from vacation.

Generally, PodFailurePolicy has the concepts of ExitCodes and Conditions. PodDisruptionCondition is there to help with preemption, as I'm not sure exit codes are reliable for all kinds of disruptions.

One thing I don't get is why this is needed. If one wants to fail a JobSet with FailurePolicy, they could simply not specify anything, and it would be assumed that if a job failed then the JobSet failed.

@danielvegamyhre
Contributor Author

One thing I don't get is why this is needed. If one wants to fail a JobSet with FailurePolicy, they could simply not specify anything, and it would be assumed that if a job failed then the JobSet failed.

If a Job fails due to a maintenance event on the underlying VMs, or workload preemption, we DO want it to be restarted as soon as possible.

If a Job fails due to some software bug, we DON'T want it to restart at all, for the reasons described previously.

@kannon92
Contributor

kannon92 commented Aug 23, 2023

One thing I don't get is why this is needed. If one wants to fail a JobSet with FailurePolicy, they could simply not specify anything, and it would be assumed that if a job failed then the JobSet failed.

If a Job fails due to a maintenance event on the underlying VMs, or workload preemption, we DO want it to be restarted as soon as possible.

I agree, but that is the responsibility of PodFailurePolicy in a Job. PodFailurePolicy controls when a Job is considered failed. If a Job is considered failed via PodFailurePolicy, then you should consider the overall JobSet failed.

This would be the same behavior as not specifying FailurePolicy, I think?

If a Job fails due to some software bug, we DON'T want it to restart at all, for the reasons described previously.

@danielvegamyhre
Contributor Author

danielvegamyhre commented Aug 23, 2023

How would one configure a podFailurePolicy + JobSet failurePolicy to achieve this behavior?

If a maintenance event kills some pods on a node during graceful node shutdown, we want the job to fail AND for the entire JobSet to restart. Right now, the only way for the JobSet failurePolicy to execute is if 1+ child jobs have failed, so unless we change that, we want the maintenance event to fail the job.

If a software bug causes the pods to die, we also want the job to fail, but we want the JobSet failure policy to know NOT to restart the entire JobSet.

In both cases we want the Job to fail, but in only one case do we want the JobSet to restart. At the JobSet level, we have no way of distinguishing between job failures which require restarts and job failures which do not.

Basically, we need the JobSet failurePolicy to be selective about when it should use the restarts allowed by maxRestarts and when to just fail the JobSet immediately.

@kannon92
Contributor

I see. I think option 1 would be sufficient, but this seems to be more subtle than a GitHub issue can capture.

@danielvegamyhre
Contributor Author

@alculquicondor I'd be curious to get your thoughts on this as well. When a Job fails, is there a way to check if it failed due to triggering a podFailurePolicy with action FailJob?

To summarize the issue: when a JobSet's child job fails, we need the JobSet failurePolicy to be selective about when it should use the restarts allowed by jobSet.spec.failurePolicy.maxRestarts or when to just fail the JobSet immediately.

The use case requested by users: when pods are killed with SIGTERM due to maintenance events or workload preemption, we want the job to fail, triggering the JobSet failurePolicy and restarting the whole JobSet in accordance with its .spec.failurePolicy.maxRestarts. However, if pods crash due to an error in the application code, we want the JobSet to fail immediately and not restart at all.

See this comment for proposed API options so far.

Basically, an error in the application code means the workload is doomed to crash no matter what. Each restart requires rescheduling thousands of pods using the computationally expensive/slow exclusive affinities (only ~58 pods/sec scheduling throughput), so restarting N times will block other enqueued workloads from utilizing these resources the entire time, which could be 30+ minutes for large workloads.

@alculquicondor

I'd be curious to get your thoughts on this as well. When a Job fails, is there a way to check if it failed due to triggering a podFailurePolicy with action FailJob?

The Failed condition contains this reason: https://github.com/kubernetes/kubernetes/blob/714e77595c8b19b693925bda2a96ab80c307d38f/pkg/controller/job/job_controller.go#L62C21-L62C21

Give it a try and let us know if there is anything missing, so we can include it in the next iteration for the change (hopefully we don't need extra API fields).

The use case you mention is exactly why we developed PodFailurePolicy in Job.
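
For reference, a child Job that fails by matching a podFailurePolicy rule with action FailJob should carry that reason on its Failed condition, roughly like the illustrative status fragment below (only the relevant fields shown):

status:
  conditions:
  - type: Failed
    status: "True"
    reason: PodFailurePolicy # reason constant linked above

The JobSet controller could key off that reason when deciding whether to consume a restart or fail the JobSet outright.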

@ahg-g
Contributor

ahg-g commented Oct 18, 2023

I think the API can be a bit more generic and can mimic the podFailurePolicy one; think of the following cases, which can be qualified based on the child job's failure reason:

type JobFailurePolicyRule struct {
  Action JobFailurePolicyAction
  OnJobFailureReasons []string
}

type FailurePolicy struct {
  Rules []JobFailurePolicyRule
  MaxRestarts int
}

JobFailurePolicyAction could be (still needs careful thinking):

  1. FailJobSet: fail the JobSet regardless of maxRestarts.
  2. Ignore: restart the JobSet, but don't count it against maxRestarts.
  3. FailJob: indicates that the failed child job is permanently failed; active child jobs will continue to run to completion.
  4. RecreateAllActive: recreates all active jobs.
  5. RecreateJob: recreates the failed job only.
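
As a strawman, rendered in a JobSet spec these rules might look something like the sketch below (the field names simply mirror the Go types above, and PodFailurePolicy is the Job condition reason mentioned earlier; none of this is final):

  failurePolicy:
    maxRestarts: 5
    rules:
    - action: FailJobSet # fail immediately instead of consuming a restart
      onJobFailureReasons:
      - PodFailurePolicy # child job failed by matching its podFailurePolicy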

@danielvegamyhre
Contributor Author

danielvegamyhre commented Oct 19, 2023

I think the API can be a bit more generic and can mimic the podFailurePolicy one; think of the following cases, which can be qualified based on the child job's failure reason:

type JobFailurePolicyRule struct {
  Action JobFailurePolicyAction
  OnJobFailureReasons []string
}

type FailurePolicy struct {
  Rules []JobFailurePolicyRule
  MaxRestarts int
}

JobFailurePolicyAction could be (still needs careful thinking):

  1. FailJobSet: fail the JobSet regardless of maxRestarts.
  2. Ignore: restart the JobSet, but don't count it against maxRestarts.
  3. FailJob: indicates that the failed child job is permanently failed; active child jobs will continue to run to completion.
  4. RecreateAllActive: recreates all active jobs.
  5. RecreateJob: recreates the failed job only.

I like this - definitely more generic/flexible. Ignore and FailJob are both key actions required for the original use case outlined in the issue description. Are there any specific frameworks or use cases for the other actions (RecreateAllActive, RecreateJob, FailJobSet), or are they suggestions for making the API as flexible as possible?

@ahg-g
Contributor

ahg-g commented Oct 19, 2023

FailJobSet and RecreateAllActive are actually the ones we need. The first is what we are trying to solve with this enhancement; the latter is the current behavior. We can start with those only.

Think of Job = Pod in the following comparison between JobSet and Job: Ignore and FailJob are similar to the Ignore and FailIndex semantics in the Job API, while RecreateJob is similar to the Job's default behavior.

@danielvegamyhre
Contributor Author

FailJobSet and RecreateAllActive are actually the ones we need. The first is what we are trying to solve with this enhancement; the latter is the current behavior. We can start with those only.

Think of Job = Pod in the following comparison between JobSet and Job: Ignore and FailJob are similar to the Ignore and FailIndex semantics in the Job API, while RecreateJob is similar to the Job's default behavior.

Oh, now that I read the descriptions of the actions, Ignore and FailJob would not do what I was assuming they would do. Yes, per the action descriptions, FailJobSet would be needed for errors the user wants to classify as non-retriable. For RecreateAllActive, would that restart everything including the failed job (so the failed job counts as active, correct)?

@ahg-g
Contributor

ahg-g commented Oct 19, 2023

For RecreateAllActive, would that restart everything including the failed job (so the failed job counts as active, correct)?

The current behavior, so yes.

@danielvegamyhre
Contributor Author

I like the revised API proposed by @ahg-g. More users have been requesting this feature lately, so unless there are any major concerns I'm going to proceed with the implementation soon.

@danielvegamyhre
Contributor Author

/assign

@danielvegamyhre danielvegamyhre changed the title Feature request: restartPolicy should be configurable based on the type of failure experienced by the job's pods Feature: Configurable Failure Policy API Jan 27, 2024
@kannon92
Contributor

/reopen

@k8s-ci-robot
Contributor

@kannon92: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Feb 10, 2024
@danielvegamyhre danielvegamyhre added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 16, 2024
@jedwins1998
Contributor

/assign

@danielvegamyhre
Contributor Author

@jedwins1998 let's make the ETA for this feature around the end of the month; we'll include it in the next release. This is an important feature which will require careful review and testing, and I don't want to rush it just to squeeze it into v0.5.0.

In addition to the implementation we will need:

  • E2E test validating the most important use case: restarting the JobSet if a Job fails due to SIGTERM (host maintenance), but failing the JobSet immediately when any other container exit code occurs.
    • Alternatively, this could potentially be simulated in an integration test by failing the Job and setting the condition reason as PodFailurePolicy, but I would prefer E2E testing to make sure the information propagates as expected along the full path of "container exit code -> pod -> Job controller (triggering podFailurePolicy) -> JobSet controller executing your new failure policy code".
  • Manual testing of the same use case, by simulating host maintenance on some nodes in the cluster.
