Add support for the managedBy field #2193
Comments
/cc @andreyvelich @tenzen-y |
For cross-reference, the corresponding support for MPIJob: kubeflow/mpi-operator#646 |
Thank you for creating this @mimowo! |
I believe the implementation could follow the JobSet example. These are the key bits: |
In the JobSet managedBy implementations, the basic status transitions are guarded by the batch/v1 Job status validations. So, I would propose selecting either of the following 2 approaches:
TBH, I would propose selecting option 2, since I cannot find any use case in which an arbitrary external controller needs to reconcile/control the Kubeflow Jobs. |
I'm also in favor of (2.) to simplify the code. |
I agree on the 2nd approach given that we will migrate to @tenzen-y do we need to allow user to set |
I assumed that the training operator defaulter mutate it to the So, there is no impact on the existing users. |
I see, do we have the same logic for |
In JobSet and Job we don't default the value to the jobset-controller (we leave If we want consistency with Job, then, we reserve EDIT: I don't have a strong preference, but given the code is simple anyway, I would be leaning to follow the pattern of Job and reserve one value, it could be |
Uhm, that makes sense. If so, can we allow the training-operator to reconcile against the Jobs with an empty or |
/assign |
I saw that the Please can you add a new user guide on how to use |
Since for now we restrict the field to MultiKueue or the built-in controller, the documentation in Kubeflow should probably be short and focused on MultiKueue (maybe just briefly mention what the field does and reference the MultiKueue page). For the docs in Kueue, the only challenge is that they would need to reference a dev (from master) installation of the training-operator until the official release, but I would suggest it as a nice starting point. |
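To make the restriction discussed above concrete, here is a minimal sketch of the two checks it implies: validation that accepts only an unset field or one of two reserved values, and a reconcile guard for the built-in controller. This is an illustration only, not the training-operator's code. The constant values and function names are hypothetical placeholders, and treating `None` as "field unset" is an assumption.

```python
from typing import List, Optional

# Hypothetical placeholder values, NOT the identifiers used by the real projects.
BUILTIN_CONTROLLER = "example.com/builtin-controller"
MULTIKUEUE_CONTROLLER = "example.com/multikueue"

ALLOWED_MANAGED_BY = {BUILTIN_CONTROLLER, MULTIKUEUE_CONTROLLER}


def validate_managed_by(managed_by: Optional[str]) -> List[str]:
    """Return validation errors; an empty list means the value is accepted.

    Assumption: None models an unset managedBy field.
    """
    if managed_by is None or managed_by in ALLOWED_MANAGED_BY:
        return []
    return [f"managedBy: unsupported value {managed_by!r}"]


def should_reconcile(managed_by: Optional[str]) -> bool:
    """The built-in controller reconciles a Job only when the field is unset
    or explicitly names the built-in controller; otherwise it skips the Job."""
    return managed_by is None or managed_by == BUILTIN_CONTROLLER
```

With this shape, a Job delegated to MultiKueue passes validation but is skipped by the built-in controller, while any other value is rejected at admission time.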
I have opened the issue to capture the docs extension at the Kueue side: kubernetes-sigs/kueue#3121 |
I agree with you, but even the reference would be useful.
Let's create pending PR in the |
Sounds like a plan:
I synced with @mszadkow who will follow up on the steps. |
@mszadkow investigated this a bit more and it turns out that we cannot use the master version of the training-operator with Kueue (point 1 of the plan), because Kueue needs to see the field to operate properly, but we don't want to compile Kueue against an unreleased training-operator. So, for now, the docs would just be confusing to end users. Since the
I could open an issue in kubeflow to add the documentation page once the above conditions are met. WDYT @andreyvelich @tenzen-y? |
Sure, @mimowo please open a new issue in Training Operator repo: https://github.com/kubeflow/training-operator/issues. |
Sure, opened: #2279 |
I discussed this with @mimowo offline, and we agreed with this decision based on Kueue policies. We may aim to include the training-operator managedBy feature once a training-operator RC version is available in the future. But that is not guaranteed. |
What you would like to be added?
Support for the managedBy field, which can delegate reconciliation from the built-in controller to a custom one.
The semantics of the field are:
Why is this needed?
For context, there is an effort in Kueue (see Support Kubeflow Jobs in MultiKueue) to support kubeflow-training (it will include MPIJob), but it will not be complete without support for managedBy.
The complete support for the users of MultiKueue (multi-cluster Kueue) means:
For context, the efforts to support the field in: