---
layout: blog
title: "Kubernetes v1.34: Pod Replacement Policy for Jobs Goes GA"
date: 2025-0X-XX
draft: true
slug: kubernetes-v1-34-pod-replacement-policy-for-jobs-goes-ga
author: >
  [Dejan Zele Pejchev](https://github.com/dejanzele) (G-Research)
---

In Kubernetes v1.34, the _Pod Replacement Policy_ feature reaches general availability (GA).
This blog post describes the Pod Replacement Policy feature and how to use it in your Jobs.

## About Pod Replacement Policy

By default, the Job controller immediately recreates Pods as soon as they fail or begin terminating (when they have a deletion timestamp).

As a result, while some Pods are terminating, the total number of running Pods for a Job can temporarily exceed the specified parallelism.
For Indexed Jobs, this can even mean multiple Pods running for the same index at the same time.

This behavior works fine for many workloads, but it can cause problems in certain cases.

For example, popular machine learning frameworks like TensorFlow and
[JAX](https://jax.readthedocs.io/en/latest/) expect exactly one Pod per worker index.
If two Pods run at the same time, you might encounter errors such as:
```
/job:worker/task:4: Duplicate task registration with task_name=/job:worker/replica:0/task:4
```
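
For context, here is a minimal sketch of an Indexed Job for such a framework; under the default replacement policy, two Pods for the same index can briefly overlap while one is still terminating (the image name is hypothetical):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tf-workers
spec:
  completionMode: Indexed  # each Pod gets a stable worker index
  completions: 4
  parallelism: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: my-tf-worker-image  # hypothetical training image
```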

Another example is in clusters with limited or expensive resources.
Starting replacement Pods before the old ones fully terminate can lead to scheduling delays or even unnecessary cluster scale-ups.

The _Pod Replacement Policy_ feature gives you control over when Kubernetes replaces terminating Pods, helping you avoid these issues.

## How Pod Replacement Policy works

The feature introduces a new Job-level field, `podReplacementPolicy`, which controls when Kubernetes replaces terminating Pods.
You can choose one of two policies:

- `TerminatingOrFailed` (default): replaces Pods as soon as they start terminating.
- `Failed`: replaces Pods only after they fully terminate and transition to the `Failed` phase.

Setting the policy to `Failed` ensures that a new Pod is only created after the previous one has completely terminated.

For Jobs with a Pod Failure Policy, the default `podReplacementPolicy` is `Failed`, and no other value is allowed.
See [Pod Failure Policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy) to learn more about Pod Failure Policies for Jobs.
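
For instance, a Job that sets a `podFailurePolicy` implicitly runs with `podReplacementPolicy: Failed`; spelling it out explicitly looks like this (a sketch, with a hypothetical image name and exit code):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-with-pod-failure-policy
spec:
  podReplacementPolicy: Failed  # required (and the default) when podFailurePolicy is set
  backoffLimit: 3
  podFailurePolicy:
    rules:
    - action: FailJob           # fail the whole Job on a non-retriable exit code
      onExitCodes:
        operator: In
        values: [42]
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: your-image       # hypothetical image
```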

You can check how many Pods are currently terminating by inspecting the Job’s `.status.terminating` field:

```sh
kubectl get job myjob -o=jsonpath='{.status.terminating}'
```

For external queueing controllers such as [Kueue](https://github.com/kubernetes-sigs/kueue),
this distinction matters because resources aren’t considered “freed” until terminating Pods are fully cleaned up.

## Example

Here’s a simple Job spec that ensures Pods are replaced only after they terminate completely:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  podReplacementPolicy: Failed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: your-image
```

With this setting, Kubernetes won’t launch a replacement Pod while the previous Pod is still terminating.
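
You can observe this by watching the Job's Pods (assuming the Job above; the `job-name` label is set automatically by the Job controller):

```sh
# With podReplacementPolicy: Failed, no replacement Pod should appear
# until the terminating Pod reaches the Failed phase.
kubectl get pods -l job-name=example-job --watch
```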

## How can you learn more?

- Read the user-facing documentation for [Pod Replacement Policy](/docs/concepts/workloads/controllers/job/#pod-replacement-policy),
  [Backoff Limit per Index](/docs/concepts/workloads/controllers/job/#backoff-limit-per-index), and
  [Pod Failure Policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy).
- Read the KEPs for [Pod Replacement Policy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3939-allow-replacement-when-fully-terminated),
  [Backoff Limit per Index](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs), and
  [Pod Failure Policy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures).

## Acknowledgments

As with any Kubernetes feature, multiple people contributed to getting this
done, from testing and filing bugs to reviewing code.

As this feature moves to stable after two years, we would like to thank the following people:

* [Kevin Hannon](https://github.com/kannon92) - for writing the KEP and the initial implementation.
* [Michał Woźniak](https://github.com/mimowo) - for guidance, mentorship, and reviews.
* [Aldo Culquicondor](https://github.com/alculquicondor) - for guidance, mentorship, and reviews.
* [Maciej Szulik](https://github.com/soltysh) - for guidance, mentorship, and reviews.
* [Dejan Zele Pejchev](https://github.com/dejanzele) - for taking over the feature and promoting it from Alpha through Beta to GA.

## Get involved

This work was sponsored by the Kubernetes
[batch working group](https://github.com/kubernetes/community/tree/master/wg-batch)
in close collaboration with the
[SIG Apps](https://github.com/kubernetes/community/tree/master/sig-apps) community.

If you are interested in working on new features in this space, we recommend
subscribing to our [Slack](https://kubernetes.slack.com/messages/wg-batch)
channel and attending the regular community meetings.
