Support Kubeflow Jobs in MultiKueue #2552
Comments
/assign @mszadkow |
The hope is to implement a new version of the Training Operator, called TrainJob, that would use JobSet as the base. I think if that is done, then one could use the managedBy field in JobSet for this. |
Right, but we have users requesting this feature today. |
Should some work be done to add a managedBy field to the Kubeflow API? Or are we trying to avoid that in favor of a solution that doesn't depend on Kubeflow releases? |
We don't need it in the current version of MultiKueue. We can just recommend users not to install the operator in the dispatcher cluster. |
/reopen |
@mimowo: Reopened this issue. |
FYI: kubeflow/trainer#2203 was just merged. RC.0 with the managedBy feature will be released on January 20th, 2025. |
/lifecycle stale |
/remove-lifecycle stale |
We plan to publish the final Kubeflow v1 API release this year. |
FYI we already have the training-operator rc1 with the managedBy field: https://github.com/kubeflow/training-operator/releases/tag/v1.9.0-rc.0. So, we can prepare the PR for the integration already. Depending on the timeline of the full release we can either merge the support using rc1 or wait for the full release, but starting the work early and discovering potential roadblocks would be great. cc @mszadkow |
ACK |
I think only the documentation is left; I will push a PR today. |
Please also x-ref the documentation in Kueue with the pending PR for the documentation in kubeflow: kubeflow/website#3956 |
Thank you! |
What would you like to be added:
Support for Kubeflow Jobs in MultiKueue, in particular, for TFJob and PyTorchJob.
Ideally, the implementation should be mostly common among all job types.
Kubeflow Jobs don't support managedBy yet, so for now we can only support the scenario where the manager cluster doesn't have the controller installed.
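For illustration only, here is a rough Python sketch of the two scenarios as manifests for a PyTorchJob. It is not taken from the Kueue or Kubeflow code. The helper `build_pytorchjob`, the image, and the queue name are hypothetical; the field path `spec.runPolicy.managedBy` and the value `kueue.x-k8s.io/multikueue` are assumptions about how managedBy support would look once it lands in the Kubeflow API.

```python
import json

# Assumed controller name that MultiKueue would claim jobs under.
MULTIKUEUE_CONTROLLER = "kueue.x-k8s.io/multikueue"


def build_pytorchjob(name, queue_name, image, managed_by=None):
    """Build a minimal PyTorchJob manifest as a plain dict (hypothetical helper).

    When managed_by is None, nothing marks the job as externally managed, so
    the manager cluster must not run the Kubeflow controller. When it is set,
    the field is written to spec.runPolicy.managedBy (assumed path).
    """
    spec = {
        "pytorchReplicaSpecs": {
            "Master": {
                "replicas": 1,
                "restartPolicy": "OnFailure",
                "template": {
                    "spec": {
                        "containers": [{"name": "pytorch", "image": image}]
                    }
                },
            }
        }
    }
    if managed_by is not None:
        spec["runPolicy"] = {"managedBy": managed_by}
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {
            "name": name,
            # Label Kueue uses to pick the LocalQueue for the job.
            "labels": {"kueue.x-k8s.io/queue-name": queue_name},
        },
        "spec": spec,
    }


if __name__ == "__main__":
    # Scenario 1: no managedBy; the manager cluster runs without the operator.
    plain = build_pytorchjob("demo", "user-queue", "example.com/train:latest")
    # Scenario 2: managedBy set; a local operator could skip the job.
    managed = build_pytorchjob(
        "demo", "user-queue", "example.com/train:latest",
        managed_by=MULTIKUEUE_CONTROLLER,
    )
    print(json.dumps(managed, indent=2))
```

The same shape should apply to TFJob by swapping the kind and the replica spec key, which fits the goal of keeping the implementation mostly common among job types.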
Why is this needed:
To continue the incremental improvement of MK and satisfy the needs of early adopters.
Completion requirements:
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.