Validate follower pod owned by same Job as leader pod #433

Merged: 4 commits, Feb 28, 2024

Conversation

danielvegamyhre
Contributor

@danielvegamyhre danielvegamyhre commented Feb 22, 2024

This change validates that the leader pod has the same owner UID as the follower pod, to ensure both are part of the same Job.
This is necessary to handle a potential race condition between index updates and pod rescheduling during JobSet restarts:

  1. A job failure occurs and the JobSet is restarted (deleting and recreating all jobs).
  2. Leader pods may land on different node pools than the ones they were originally scheduled on.
  3. When the follower pods are recreated, we look up the leader pod using the index, which maps
    [pod name without random suffix] -> corev1.Pod object. If this lookup occurs before the index updates for the leader pod have been pushed to the controller, we may get a stale index entry and inject the wrong nodeSelector, using the topology the leader pod was scheduled on before the restart.
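
For readers without the diff at hand, the owner check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in this PR: `Pod` and `OwnerReference` are simplified stand-ins for the Kubernetes API types, and `controllerUID` and `validateSameJob` are hypothetical helper names.

```go
package main

import "fmt"

// OwnerReference is a simplified stand-in for the Kubernetes owner reference
// type (assumption: only the fields needed for this sketch are modeled).
type OwnerReference struct {
	Kind       string
	UID        string
	Controller bool
}

// Pod is a simplified stand-in for corev1.Pod.
type Pod struct {
	Name            string
	OwnerReferences []OwnerReference
}

// controllerUID returns the UID of the pod's controlling owner, if any.
// Hypothetical helper, not taken from the JobSet codebase.
func controllerUID(p *Pod) (string, bool) {
	for _, ref := range p.OwnerReferences {
		if ref.Controller {
			return ref.UID, true
		}
	}
	return "", false
}

// validateSameJob returns an error unless both pods are controlled by the
// same owner UID. A leader pod read from a stale index entry still carries
// the UID of the deleted Job, so it fails this check.
func validateSameJob(leader, follower *Pod) error {
	leaderUID, ok := controllerUID(leader)
	if !ok {
		return fmt.Errorf("leader pod %q has no controller owner", leader.Name)
	}
	followerUID, ok := controllerUID(follower)
	if !ok {
		return fmt.Errorf("follower pod %q has no controller owner", follower.Name)
	}
	if leaderUID != followerUID {
		return fmt.Errorf("leader pod %q (owner %s) and follower pod %q (owner %s) belong to different Jobs",
			leader.Name, leaderUID, follower.Name, followerUID)
	}
	return nil
}
```

In this sketch, a caller would reject the follower pod, or retry later, when `validateSameJob` returns an error, rather than copying the nodeSelector from a leader pod that belongs to the previous Job.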

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 22, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielvegamyhre


@k8s-ci-robot k8s-ci-robot requested a review from ahg-g February 22, 2024 21:20
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 22, 2024
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 22, 2024

netlify bot commented Feb 22, 2024

Deploy Preview for kubernetes-sigs-jobset canceled.

Name Link
🔨 Latest commit 4f5afd3
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-jobset/deploys/65de8a8c67c58a0008ba4169

@kannon92
Contributor

I am going to leave the LGTM to @ahg-g on this one. I don't have much context on this problem.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 28, 2024
@ahg-g
Contributor

ahg-g commented Feb 28, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 28, 2024
@k8s-ci-robot k8s-ci-robot merged commit 6b2e629 into kubernetes-sigs:main Feb 28, 2024
13 checks passed
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

4 participants