Support Kubeflow Jobs in MultiKueue #2552
Comments
/assign @mszadkow
The hope is to implement a new version of the Training Operator, called TrainJob, that would use JobSet as its base. I think once that is done, one could use the managedBy field in JobSet for this.
Right, but we have users requesting this feature today.
Should some work be done to add managedBy to the Kubeflow API? Or are we trying to avoid that so the solution does not depend on Kubeflow releases?
We don't need it in the current version of MultiKueue. We can just recommend that users not install the operator in the dispatcher cluster.
/reopen
@mimowo: Reopened this issue.
FYI: kubeflow/training-operator#2203 has just been merged. RC.0 with the managedBy feature will be released on January 20th, 2025.
What would you like to be added:
Support for Kubeflow Jobs in MultiKueue, in particular, for TFJob and PyTorchJob.
Ideally, the implementation should be mostly common among all job types.
Kubeflow Jobs don't support managedBy, so, for now, we can only support the scenario where the manager cluster doesn't have the controller installed.
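To illustrate why a managedBy field matters here, the following is a rough sketch of the check a locally installed operator could perform before reconciling a job. It is not code from Kueue or the Training Operator. The field path `spec.runPolicy.managedBy`, the two controller-name strings, and the helper names are all assumptions made for the example. Without such a field, every job looks like `plain_job` below, so an operator installed on the manager cluster would always reconcile it, which is why that cluster has to run without the controller.

```python
from typing import Any, Optional

# Assumed controller names, used only for illustration.
MULTIKUEUE_CONTROLLER = "kueue.x-k8s.io/multikueue"
LOCAL_OPERATOR = "kubeflow.org/training-operator"


def managed_by(job: dict) -> Optional[str]:
    """Return the job's managedBy value, or None if it is unset.

    The field path spec.runPolicy.managedBy is an assumption of this sketch.
    """
    return job.get("spec", {}).get("runPolicy", {}).get("managedBy")


def operator_should_reconcile(job: dict) -> bool:
    """Hypothetical rule: a local operator skips jobs that name another manager."""
    owner = managed_by(job)
    return owner is None or owner == LOCAL_OPERATOR


# A job with no managedBy set: the local operator would act on it.
plain_job: dict[str, Any] = {"kind": "PyTorchJob", "spec": {"runPolicy": {}}}

# A job delegated to MultiKueue: the local operator would leave it alone.
delegated_job: dict[str, Any] = {
    "kind": "PyTorchJob",
    "spec": {"runPolicy": {"managedBy": MULTIKUEUE_CONTROLLER}},
}

print(operator_should_reconcile(plain_job))
print(operator_should_reconcile(delegated_job))
```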
Why is this needed:
To continue the incremental improvement of MK and satisfy the needs of early adopters.
Completion requirements:
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.