Question: Do you plan to create a pipeline of Jobs? #240

Closed
Gekko0114 opened this issue Jul 29, 2023 · 7 comments

Comments

@Gekko0114
Member

Thank you for the development of this interesting OSS.
I have a question.

I think it would be very reasonable to develop JobSet to meet the needs of AI/ML workloads. On the other hand, in ML pipelines, it is often required to run multiple Jobs sequentially, such as running Job A, then Job B, and then Job C after that.
Can JobSet cover this functionality of executing Jobs in sequence, or should features like these be handled by another OSS like Argo?

@kannon92
Contributor

We have an issue tracking this feature: #104

@Gekko0114
Member Author

I thought this feature was only about supporting which one to start first, the worker or the driver.
Does it also cover running multiple Jobs sequentially, like Job A -> Job B -> Job C -> ....?

@danielvegamyhre
Contributor

danielvegamyhre commented Jul 30, 2023

I thought this feature was only about supporting which one to start first, the worker or the driver. Does it also cover running multiple Jobs sequentially, like Job A -> Job B -> Job C -> ....?

That's correct. The idea described in #104 would only address things like the leader/worker pattern, where we start the worker (Job B) only once the leader (Job A) is running. That is, we wait until Job A is ready, not until it has finished, before launching Job B.

However, I think the proposed design could easily be extended to include an extra knob to configure if we want to wait until the previous Job is ready or until it is finished before starting the next job.

Building off of @kannon92's comment, maybe something like this:

type StartupPolicy struct {
	// Operator determines whether All or Any of the selected jobs must be ready
	// for the replicatedJob to be considered ready.
	// +kubebuilder:validation:Enum=All;Any
	Operator Operator `json:"operator"`

	// TargetReplicatedJobs are the names of the replicated jobs the operator will apply to.
	// A null or empty list will apply to all replicatedJobs.
	TargetReplicatedJobs []string `json:"targetReplicatedJobs,omitempty"`

	// Condition defines what condition the previous job should be in before starting
	// the next job (ready or finished).
	// +kubebuilder:validation:Enum=Ready;Finished
	Condition JobCondition `json:"condition"`
}
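
To make the proposed Condition knob more concrete, here is a minimal sketch of how a controller might evaluate such a policy. Everything beyond the StartupPolicy fields is an assumption made for illustration: the Operator and JobCondition string types, the jobState struct, and the canStartNext helper are hypothetical and not part of the JobSet API.

```go
package main

import "fmt"

// Operator and JobCondition are hypothetical string enums that mirror the
// kubebuilder Enum markers in the proposal; they are assumptions, not JobSet types.
type Operator string
type JobCondition string

const (
	OperatorAll Operator = "All"
	OperatorAny Operator = "Any"

	ConditionReady    JobCondition = "Ready"
	ConditionFinished JobCondition = "Finished"
)

// StartupPolicy restates the proposed struct without JSON tags.
type StartupPolicy struct {
	Operator             Operator
	TargetReplicatedJobs []string
	Condition            JobCondition
}

// jobState is an illustrative stand-in for whatever status a controller tracks.
type jobState struct {
	Ready    bool
	Finished bool
}

// canStartNext reports whether the targeted replicated jobs satisfy the policy.
func canStartNext(p StartupPolicy, states map[string]jobState) bool {
	targets := append([]string(nil), p.TargetReplicatedJobs...)
	if len(targets) == 0 {
		// A null or empty list applies to all replicated jobs.
		for name := range states {
			targets = append(targets, name)
		}
	}
	met := 0
	for _, name := range targets {
		s := states[name] // an unknown job is treated as neither ready nor finished
		ok := s.Ready
		if p.Condition == ConditionFinished {
			ok = s.Finished
		}
		if ok {
			met++
		}
	}
	if p.Operator == OperatorAny {
		return met > 0
	}
	return met == len(targets)
}

func main() {
	states := map[string]jobState{"job-a": {Ready: true}}
	p := StartupPolicy{
		Operator:             OperatorAll,
		TargetReplicatedJobs: []string{"job-a"},
		Condition:            ConditionReady,
	}
	fmt.Println(canStartNext(p, states)) // job-a is ready
	p.Condition = ConditionFinished
	fmt.Println(canStartNext(p, states)) // job-a has not finished
}
```

With Condition set to Ready the check passes as soon as job-a is running and ready; with Finished it holds the next job back until job-a completes, which is the sequential A -> B -> C behavior asked about.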

@Gekko0114
Member Author

However, I think the proposed design could easily be extended to include an extra knob to configure if we want to wait until the previous Job is ready or until it is finished before starting the next job.

I see. Sounds useful. Thank you for your explanation!

@kannon92
Contributor

The only issue with ready is that we don't yet have a condition in Jobs to say when a job is ready (kubernetes/kubernetes#117758).
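
Without a ready condition on the Job itself, a controller would have to infer readiness some other way. The sketch below shows one possible heuristic, assuming the controller can see how many of a Job's pods report Ready and compares that count against the desired parallelism. The jobSnapshot type, its field names, and isJobReady are hypothetical and only illustrate the idea; they are not how JobSet or Kubernetes define readiness.

```go
package main

import "fmt"

// jobSnapshot is a hypothetical, trimmed-down view of a Job used only for
// this illustration.
type jobSnapshot struct {
	Parallelism int32  // desired number of concurrently running pods
	ReadyPods   *int32 // number of pods reporting Ready; nil if not reported
}

// isJobReady treats a Job as ready once every expected pod reports Ready.
// A missing count is treated as "not ready" rather than guessed at.
func isJobReady(j jobSnapshot) bool {
	if j.ReadyPods == nil {
		return false
	}
	return j.Parallelism > 0 && *j.ReadyPods >= j.Parallelism
}

func main() {
	three := int32(3)
	fmt.Println(isJobReady(jobSnapshot{Parallelism: 3, ReadyPods: &three})) // all pods ready
	fmt.Println(isJobReady(jobSnapshot{Parallelism: 3}))                    // count not reported
}
```

A heuristic like this has to be recomputed by every consumer, which is one reason a first-class condition on the Job would be preferable.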

@kannon92
Contributor

Based on a discussion with @ahg-g and @danielvegamyhre in the startup policy KEP, I don't know if we will cover a pipeline of jobs (i.e., sequential completed execution).

@danielvegamyhre
Contributor

As @kannon92 mentioned above, after some discussion we concluded that we want to keep the startup policy API simple and specific to sequential startup, to address the requirements of distributed training frameworks. For now, we do not want to expand it to cover other use cases, such as sequential completed execution, where the API can quickly explode into a full-blown workload DAG / workflow execution engine.
