Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A KEP for StartupPolicy #244

Merged
merged 2 commits into from
Dec 21, 2023

Conversation

kannon92
Copy link
Contributor

@kannon92 kannon92 commented Aug 3, 2023

I wanted to draft a potential API for StartupPolicy as we have discussed it.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 3, 2023
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 3, 2023
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 3, 2023
@kannon92
Copy link
Contributor Author

kannon92 commented Aug 8, 2023

/retest

@kannon92 kannon92 changed the title WIP: A KEP for StartupPolicy A KEP for StartupPolicy Aug 8, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 8, 2023
@kannon92 kannon92 force-pushed the startup-policy-kpe branch from d5b2a6e to 55ecf7d Compare August 8, 2023 19:18
@kannon92
Copy link
Contributor Author

kannon92 commented Aug 8, 2023

/cc @vsoch @Gekko0114

Would this API fit your needs?

@k8s-ci-robot
Copy link
Contributor

@kannon92: GitHub didn't allow me to request PR reviews from the following users: vsoch, Gekko0114.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @vsoch @Gekko0114

Would this API fit your needs?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vsoch
Copy link
Contributor

vsoch commented Aug 8, 2023

This would definitely be useful for us - what I've had to do is typically start the workers and handle race conditions with sleeps / waiting, that sort of thing.

@kannon92
Copy link
Contributor Author

kannon92 commented Aug 9, 2023

This would definitely be useful for us - what I've had to do is typically start the workers and handle race conditions with sleeps / waiting, that sort of thing.

And an API like:

 startupPolicy:
  - name: driver
    status: Ready

Would be sufficent?

@Gekko0114
Copy link
Member

Thanks @kannon92 so much!

IMO, I prefer defining startUpPolicy under each replicatedJob field.
I think this implementation is easier to read.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: driver-ready-worker-start
spec:
  replicatedJobs:
  - name: driver
    replicas: 1
    template:
      spec:
        # Set backoff limit to 0 so job will immediately fail if any pod fails.
        backoffLimit: 0 
        completions: 1
        parallelism: 1
        template:
          spec:
            containers:
            - name: driver
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                sleep 10000
  - name: workers
    replicas: 1
    startupPolicy:
    - name: driver
       status: Ready
    template:
      spec:
        backoffLimit: 0 
        completions: 2
        parallelism: 2
        template:
          spec:
            containers:
            - name: worker
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                sleep 10

FYI
I prefer Tekton's design, especially runAfter.
https://tekton.dev/docs/pipelines/pipelines/

apiVersion: tekton.dev/v1 # or tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: pipeline
spec:
  workspaces:
    - name: source
  tasks:
    - name: gen-code
      taskRef:
        name: gen-code # gen-code expects a Workspace named "source"
      workspaces:
        - name: source # <- mapping workspace name
    - name: commit
      taskRef:
        name: commit # commit expects a Workspace named "source"
      workspaces:
        - name: source # <- mapping workspace name
      runAfter:
        - gen-code

@vsoch
Copy link
Contributor

vsoch commented Aug 9, 2023

IMO, I prefer defining startUpPolicy under each replicatedJob field. I think this implementation is easier to read.

I think he's mirroring SuccessPolicy-> TargetReplicatedJobs - the idea being that if there is more definition / attributes for such a policy, we have a place to put them and the ReplicatedJob itself doesn't become a single long list of loosely associated features.

TargetReplicatedJobs []string `json:"targetReplicatedJobs,omitempty"`

I think my preference is for that modular (and consistent, since we already have SuccessPolicy and FailurePolicy) approach, ensuring that each of those sections can be produced independently and then added to a replicatedJob.

And if we wanted to be more consistent it might look like:

type StartupPolicy struct {
	TargetReplicatedJobs []string `json:"targetReplicatedJobs,omitempty"`
	Status JobReadyStatus `json:"status"`
}

This makes the assumption that a JobSet will have one condition (set of jobs and a status) to indicate ready. I'm not sure we want to allow more complexity than that (the workflow complexity thing is a non goal).

@kannon92
Copy link
Contributor Author

Thanks @kannon92 so much!

IMO, I prefer defining startUpPolicy under each replicatedJob field. I think this implementation is easier to read.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: driver-ready-worker-start
spec:
  replicatedJobs:
  - name: driver
    replicas: 1
    template:
      spec:
        # Set backoff limit to 0 so job will immediately fail if any pod fails.
        backoffLimit: 0 
        completions: 1
        parallelism: 1
        template:
          spec:
            containers:
            - name: driver
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                sleep 10000
  - name: workers
    replicas: 1
    startupPolicy:
    - name: driver
       status: Ready
    template:
      spec:
        backoffLimit: 0 
        completions: 2
        parallelism: 2
        template:
          spec:
            containers:
            - name: worker
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                sleep 10

@vsoch is correct. I was aiming to replicate a similar behavior that we have for SuccessPolicy and RestartPolicy. Those APIs are defined outside of the replicated job spec. I think its cleaner to do that as its all in one place. Hiding them in the ReplicatedJob spec seems a bit confusing.

FYI I prefer Tekton's design, especially runAfter. https://tekton.dev/docs/pipelines/pipelines/

apiVersion: tekton.dev/v1 # or tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: pipeline
spec:
  workspaces:
    - name: source
  tasks:
    - name: gen-code
      taskRef:
        name: gen-code # gen-code expects a Workspace named "source"
      workspaces:
        - name: source # <- mapping workspace name
    - name: commit
      taskRef:
        name: commit # commit expects a Workspace named "source"
      workspaces:
        - name: source # <- mapping workspace name
      runAfter:
        - gen-code

Yea, this is called out as a non-goal as I don't think this API will ever be complete in the context of workflows.

@kannon92
Copy link
Contributor Author

IMO, I prefer defining startUpPolicy under each replicatedJob field. I think this implementation is easier to read.

I think he's mirroring SuccessPolicy-> TargetReplicatedJobs - the idea being that if there is more definition / attributes for such a policy, we have a place to put them and the ReplicatedJob itself doesn't become a single long list of loosely associated features.

Correct.

TargetReplicatedJobs []string `json:"targetReplicatedJobs,omitempty"`

I think my preference is for that modular (and consistent, since we already have SuccessPolicy and FailurePolicy) approach, ensuring that each of those sections can be produced independently and then added to a replicatedJob.

And if we wanted to be more consistent it might look like:

type StartupPolicy struct {
	TargetReplicatedJobs []string `json:"targetReplicatedJobs,omitempty"`
	Status JobReadyStatus `json:"status"`
}

This makes the assumption that a JobSet will have one condition (set of jobs and a status) to indicate ready. I'm not sure we want to allow more complexity than that (the workflow complexity thing is a non goal).

Yea I believe there was an interest in both ready and succeeded. Init containers run after one another and the thought was that we could have a similar idea for this.

I ended up going with the API I have because of the flexibility of assuming ready or succeeded for the replicated jobs.

I do acknowledge the irony of calling out that we don't want a workflow engine but then we at least have a poor person workflow engine with this.

@Gekko0114
Copy link
Member

I commented without understanding the context of jobset, thank you for your explanations

When a JobSet is started with `StartupPolicy` specified, we will create jobs in a suspended state (ie `Job.Spec.Suspend = true`) This avoids starting the underlying jobs.

If StartupPolicy is set, then we will suspend the other jobs until the StartupPolicy is considered succesful.
We will unsuspend only the jobs that match `TargetReplicatedJobStartup`. Once they are considered "started" by matching on targeted status, then we resume all the rest of the jobs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we unsuspend jobs matching TargetReplicatedJobStartup, then in the example where we want sequential execution of Job A -> Job B -> Job -> C, wouldn't they all run in parallel since there is a separate startup policy targeting each of them?

Or is this saying that we handle only 1 startup policy at a time, in the order in which they are specified in the spec (e.g. we first see a startup policy for Job A so we unsuspend it, wait for Job A to complete, then we check the next startup policy and it's for Job B, so we unsuspend that, etc.)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is this saying that we handle only 1 startup policy at a time, in the order in which they are specified in the spec (e.g. we first see a startup policy for Job A so we unsuspend it, wait for Job A to complete, then we check the next startup policy and it's for Job B, so we unsuspend that, etc.)?

This is where I find it difficult balancing design versus implementation details.

In the implementation, I apply StartupPolicy in sequential order.

I would verify if JobA is suspended and then unsuspend. I would return and trigger a reconcile.

I loop over the startupPolicy rules and check if JobA is successful via the rules (Ready/Succeeded) and if it is, I move on to JobB and resume if its suspended.

And repeat.

If all the startup polices items are successful then we unsuspend everything not in the startup list (because the startup ones are running or succeeded).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated with this info. PTAL

@kannon92 kannon92 force-pushed the startup-policy-kpe branch 3 times, most recently from cb80a39 to 9d855b9 Compare December 18, 2023 21:13
keps/104-StartupPolicy/README.md Outdated Show resolved Hide resolved
keps/104-StartupPolicy/README.md Outdated Show resolved Hide resolved
keps/104-StartupPolicy/README.md Outdated Show resolved Hide resolved
keps/104-StartupPolicy/README.md Outdated Show resolved Hide resolved
keps/104-StartupPolicy/README.md Outdated Show resolved Hide resolved
keps/104-StartupPolicy/README.md Outdated Show resolved Hide resolved
@kannon92
Copy link
Contributor Author

/retest

keps/104-StartupPolicy/README.md Outdated Show resolved Hide resolved
keps/104-StartupPolicy/README.md Outdated Show resolved Hide resolved
keps/104-StartupPolicy/README.md Outdated Show resolved Hide resolved
keps/104-StartupPolicy/README.md Outdated Show resolved Hide resolved
keps/104-StartupPolicy/README.md Outdated Show resolved Hide resolved
@ahg-g
Copy link
Contributor

ahg-g commented Dec 21, 2023

@kannon92 this is great, can you please resolve the remaining comments and get this merged?

@kannon92 kannon92 requested a review from ahg-g December 21, 2023 16:45
@ahg-g
Copy link
Contributor

ahg-g commented Dec 21, 2023

A couple final nits, otherwise looks good to me.

@ahg-g
Copy link
Contributor

ahg-g commented Dec 21, 2023

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 21, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, kannon92

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 21, 2023
@kannon92
Copy link
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot merged commit dbabc65 into kubernetes-sigs:main Dec 21, 2023
10 checks passed
@danielvegamyhre
Copy link
Contributor

Thanks for driving this @kannon92! 🥳

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants