Question: Do you plan to create a pipeline of Jobs? #240

Gekko0114 · 2023-07-29T15:19:44Z

Thank you for the development of this interesting OSS.
I have a question.

I think it would be very reasonable to develop JobSet to meet the needs of AI/ML workloads. On the other hand, in ML pipelines, it is often required to run multiple Jobs sequentially, such as running Job A, then Job B, and then Job C after that.
Can JobSet cover this functionality of executing Jobs in sequence, or should features like these be handled by another OSS like Argo?

kannon92 · 2023-07-29T15:25:43Z

We have an existing issue for this feature: #104

Gekko0114 · 2023-07-29T15:28:22Z

I thought this feature was only about supporting which one to start first, the worker or the driver.
Does it also cover running multiple Jobs sequentially, like Job A -> Job B -> Job C -> ....?

danielvegamyhre · 2023-07-30T00:05:04Z

> I thought this feature was only about supporting which one to start first, the worker or the driver. Does it also cover running multiple Jobs sequentially, like Job A -> Job B -> Job C -> ....?

That's correct, the idea described in #104 would only address things like the leader/worker pattern, where we wait until the leader (Job A) is running before starting the worker (Job B). That is, we wait until Job A is ready, not until Job A is finished, before launching Job B.

However, I think the proposed design could easily be extended to include an extra knob to configure if we want to wait until the previous Job is ready or until it is finished before starting the next job.

Building off of @kannon92's comment, maybe something like this:

type StartupPolicy struct {
	// Operator determines whether All or Any of the selected jobs must be ready
	// for the replicated job to be considered ready.
	// +kubebuilder:validation:Enum=All;Any
	Operator Operator `json:"operator"`

	// TargetReplicatedJobs are the names of the replicated jobs the operator will apply to.
	// A null or empty list will apply to all replicatedJobs.
	TargetReplicatedJobs []string `json:"targetReplicatedJobs,omitempty"`

	// Condition defines what condition the previous job should be in before starting
	// the next job (ready or finished).
	// +kubebuilder:validation:Enum=Ready;Finished
	Condition JobCondition `json:"condition"`
}
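
To make the proposed knob concrete, here is a minimal, self-contained sketch of how a controller might evaluate such a policy before starting the next job. It is not the JobSet implementation. The `Operator` and `JobCondition` definitions, the `jobStatus` type, and the `conditionMet` and `canStartNext` helpers are all hypothetical names introduced for illustration, and the JSON tags and kubebuilder markers are left out for brevity.

```go
package main

import "fmt"

// Operator and JobCondition are assumed to be simple string enums;
// the proposal above does not define them.
type Operator string
type JobCondition string

const (
	OperatorAll       Operator     = "All"
	OperatorAny       Operator     = "Any"
	ConditionReady    JobCondition = "Ready"
	ConditionFinished JobCondition = "Finished"
)

// StartupPolicy mirrors the fields of the proposed struct.
type StartupPolicy struct {
	Operator             Operator
	TargetReplicatedJobs []string
	Condition            JobCondition
}

// jobStatus is a hypothetical, simplified view of a replicated job's state.
type jobStatus struct {
	Ready    bool
	Finished bool
}

// conditionMet reports whether a single job satisfies the configured condition.
func conditionMet(s jobStatus, c JobCondition) bool {
	switch c {
	case ConditionReady:
		return s.Ready
	case ConditionFinished:
		return s.Finished
	default:
		return false
	}
}

// canStartNext reports whether the next job may start under the policy.
// An empty target list means "all known replicated jobs".
func canStartNext(p StartupPolicy, statuses map[string]jobStatus) bool {
	targets := p.TargetReplicatedJobs
	if len(targets) == 0 {
		for name := range statuses {
			targets = append(targets, name)
		}
	}
	met := 0
	for _, name := range targets {
		// A target missing from the map has a zero-value status and is not met.
		if conditionMet(statuses[name], p.Condition) {
			met++
		}
	}
	if p.Operator == OperatorAny {
		return met > 0
	}
	return met == len(targets)
}

func main() {
	policy := StartupPolicy{
		Operator:             OperatorAll,
		TargetReplicatedJobs: []string{"job-a"},
		Condition:            ConditionFinished,
	}
	statuses := map[string]jobStatus{"job-a": {Ready: true}}
	fmt.Println(canStartNext(policy, statuses)) // false: job-a is ready but not finished

	statuses["job-a"] = jobStatus{Finished: true}
	fmt.Println(canStartNext(policy, statuses)) // true: job-a has finished
}
```

With `Condition` set to `Ready`, the same check covers the leader/worker startup ordering. With `Finished`, it covers the Job A -> Job B sequencing asked about in this issue.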

Gekko0114 · 2023-07-30T01:25:55Z

> However, I think the proposed design could easily be extended to include an extra knob to configure if we want to wait until the previous Job is ready or until it is finished before starting the next job.

I see. Sounds useful. Thank you for your explanation!

kannon92 · 2023-07-30T11:41:09Z

The only issue with Ready is that Jobs don't yet have a condition that says when a Job is ready (kubernetes/kubernetes#117758).

kannon92 · 2023-10-19T21:27:44Z

Based on a discussion with @ahg-g and @danielvegamyhre in the startup policy KEP, I don't know if we will cover a pipeline of jobs (i.e., sequential completed execution).

danielvegamyhre · 2024-01-22T23:27:29Z

As @kannon92 mentioned above, after some discussion we concluded that we want to keep the startup policy API simple and specific to sequential startup, to address specific requirements of distributed training frameworks. For now, we do not want to expand it to include other use cases, like sequential completed execution, where the API can quickly explode into a full-blown workload DAG / workflow execution engine.

kannon92 mentioned this issue Aug 2, 2023

add kep template #242

Merged

Gekko0114 mentioned this issue Oct 2, 2023

REQUEST: New membership for Gekko0114 kubernetes/org#4496

Closed

9 tasks

danielvegamyhre closed this as completed Jan 22, 2024
