Support Kubeflow Jobs in MultiKueue #2552

Closed
1 of 3 tasks
Tracked by #4249
alculquicondor opened this issue Jul 8, 2024 · 18 comments · Fixed by #2880 or #4254
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@alculquicondor
Contributor

What would you like to be added:

Support for Kubeflow Jobs in MultiKueue, in particular, for TFJob and PyTorchJob.
Ideally, the implementation should be mostly common among all job types.

Kubeflow Jobs don't support managedBy yet, so for now we can only support the scenario where the manager cluster doesn't have the controller installed.
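
To illustrate why a managedBy field matters here, the following is a minimal sketch of the kind of guard a job controller could apply before reconciling. It is not code from the training-operator or Kueue: the helper names are hypothetical, and the field path `spec.runPolicy.managedBy` and both constant values are assumptions to verify against the training-operator and Kueue documentation.

```python
from typing import Any, Mapping, Optional

# Assumed identifiers; confirm the real values in the upstream docs.
TRAINING_OPERATOR = "kubeflow.org/training-operator"
MULTIKUEUE = "kueue.x-k8s.io/multikueue"


def managed_by(job: Mapping[str, Any]) -> Optional[str]:
    """Return spec.runPolicy.managedBy, or None when the field is unset."""
    return job.get("spec", {}).get("runPolicy", {}).get("managedBy")


def operator_should_reconcile(job: Mapping[str, Any]) -> bool:
    """Hypothetical guard: the local operator acts only on jobs it owns.

    An unset field is treated as "owned by the local operator", which is
    the only possible behavior when the API has no managedBy field at all.
    """
    owner = managed_by(job)
    return owner is None or owner == TRAINING_OPERATOR


local_job = {"spec": {"runPolicy": {}}}
delegated_job = {"spec": {"runPolicy": {"managedBy": MULTIKUEUE}}}
```

Without such a field, every job looks like `local_job`, so an operator installed on the manager cluster would run the job there as well. That is why the only supported setup for now is a manager cluster without the controller.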

Why is this needed:

To continue the incremental improvement of MultiKueue (MK) and satisfy the needs of early adopters.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@alculquicondor alculquicondor added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 8, 2024
@alculquicondor
Contributor Author

/assign @mszadkow

@kannon92
Contributor

The hope is to implement a new version of the training-operator, called TrainJob, that would use JobSet as its base.

I think if that is done, then one could use the managedBy field in JobSet for this.

@alculquicondor
Contributor Author

Right, but we have users requesting this feature today.

@kannon92
Contributor

Should there be some work done to add the managedBy field to the Kubeflow API? Or are we trying to avoid that in favor of a solution that does not depend on Kubeflow releases?

@alculquicondor
Contributor Author

We don't need it in the current version of MultiKueue. We can just recommend users not to install the operator in the dispatcher cluster.

@mimowo
Contributor

mimowo commented Sep 12, 2024

/reopen
Let's close it when the ongoing effort of supporting managedBy is complete for the training-operator and MPIJob.

@k8s-ci-robot k8s-ci-robot reopened this Sep 12, 2024
@k8s-ci-robot
Contributor

@mimowo: Reopened this issue.


@tenzen-y
Member

FYI: kubeflow/trainer#2203 was just merged.
We will include the feature in the next training-operator minor release.

RC.0 with the managedBy feature will be released on January 20th, 2025.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 17, 2024
@mimowo
Contributor

mimowo commented Dec 18, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 18, 2024
@mimowo
Contributor

mimowo commented Dec 18, 2024

RC.0 with the managedBy feature will be released on January 20th, 2025.

@tenzen-y does the plan for Jan 20th still hold?

cc @mszadkow

@tenzen-y
Member

/reopen Let's close it when the ongoing effort of supporting managedBy is complete for the training-operator and MPIJob.

We plan to publish the final Kubeflow v1 API release this year.

@mimowo
Contributor

mimowo commented Jan 23, 2025

FYI we already have the training-operator rc1 with the managedBy field: https://github.com/kubeflow/training-operator/releases/tag/v1.9.0-rc.0.

So, we can prepare the PR for the integration already. Depending on the timeline of the full release we can either merge the support using rc1 or wait for the full release, but starting the work early and discovering potential roadblocks would be great.

cc @mszadkow
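
For anyone preparing the integration, here is a rough sketch of what opting a Kubeflow job into MultiKueue might look like once the field is available. It only builds a manifest as a plain dict. The helper `for_multikueue` is hypothetical, and the field path `spec.runPolicy.managedBy`, the value `kueue.x-k8s.io/multikueue`, and the queue name `user-queue` are assumptions to check against the release notes and the Kueue docs.

```python
import copy
from typing import Any, Dict

MULTIKUEUE = "kueue.x-k8s.io/multikueue"   # assumed managedBy value
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"  # Kueue queue label


def for_multikueue(job: Dict[str, Any], queue_name: str) -> Dict[str, Any]:
    """Return a copy of the manifest labeled for a queue and delegated to MultiKueue."""
    out = copy.deepcopy(job)  # leave the caller's manifest untouched
    out.setdefault("metadata", {}).setdefault("labels", {})[QUEUE_LABEL] = queue_name
    out.setdefault("spec", {}).setdefault("runPolicy", {})["managedBy"] = MULTIKUEUE
    return out


# Skeleton manifest; replica specs are left empty for brevity.
base = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "demo"},
    "spec": {"pytorchReplicaSpecs": {}},
}
job = for_multikueue(base, "user-queue")
```

The copy keeps the original manifest reusable for clusters where the local operator should still run the job.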

@mszadkow
Contributor

FYI we already have the training-operator rc1 with the managedBy field: https://github.com/kubeflow/training-operator/releases/tag/v1.9.0-rc.0.

So, we can prepare the PR for the integration already. Depending on the timeline of the full release we can either merge the support using rc1 or wait for the full release, but starting the work early and discovering potential roadblocks would be great.

cc @mszadkow

ACK

@tenzen-y
Member

I guess that this has already been completed. @mszadkow @mimowo Do you have any remaining tasks?

@mszadkow
Contributor

I think only the documentation is left; I will push a PR today.

@mimowo
Contributor

mimowo commented Feb 13, 2025

Please also cross-reference the documentation in Kueue with the pending PR for the documentation in Kubeflow: kubeflow/website#3956

@tenzen-y
Member

I think only documentation left, I will push PR today.

Thank you!

7 participants