support monitor podgroup #1688
Conversation
Signed-off-by: qiankunli <bert.li.qiankun@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: qiankunli
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Pull Request Test Coverage Report for Build 3391834184
💛 - Coveralls
@johnugeorge PTAL
```diff
@@ -64,6 +64,7 @@ func main() {
 		"Enabling this will ensure there is only one active controller manager.")
 	flag.Var(&enabledSchemes, "enable-scheme", "Enable scheme(s) as --enable-scheme=tfjob --enable-scheme=pytorchjob, case insensitive."+
 		" Now supporting TFJob, PyTorchJob, MXNetJob, XGBoostJob. By default, all supported schemes will be enabled.")
+	flag.Var(&config.Config.WatchedResources, "enable-watch-resources", "The list of resources that need be watched to trigger reconcile, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org)")
```
In the example, you could provide the PodGroup resource for easy reference.
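For illustration, here is a minimal sketch of what such a flag value type could look like, including the Volcano PodGroup example suggested above. The watchedResourceList type and its parsing are assumptions for this sketch, not the training-operator's actual implementation.

```go
package main

import (
	"flag"
	"fmt"
	"strings"

	"k8s.io/apimachinery/pkg/runtime/schema"
)

// watchedResourceList is a hypothetical flag.Value that accumulates
// GroupVersionKinds from repeated --enable-watch-resources flags.
type watchedResourceList []schema.GroupVersionKind

func (w *watchedResourceList) String() string {
	parts := make([]string, 0, len(*w))
	for _, gvk := range *w {
		parts = append(parts, fmt.Sprintf("%s.%s.%s", gvk.Kind, gvk.Version, gvk.Group))
	}
	return strings.Join(parts, ",")
}

// Set parses "Kind.version.group", e.g. "PodGroup.v1beta1.scheduling.volcano.sh"
// (the Volcano PodGroup example the reviewer asks for).
func (w *watchedResourceList) Set(value string) error {
	parts := strings.SplitN(value, ".", 3)
	if len(parts) != 3 {
		return fmt.Errorf("expected Kind.version.group, got %q", value)
	}
	*w = append(*w, schema.GroupVersionKind{Kind: parts[0], Version: parts[1], Group: parts[2]})
	return nil
}

func main() {
	var watched watchedResourceList
	flag.Var(&watched, "enable-watch-resources",
		"Resources to watch to trigger reconcile, e.g. --enable-watch-resources=PodGroup.v1beta1.scheduling.volcano.sh")
	flag.Parse()
	fmt.Println("watching:", watched.String())
}
```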
```diff
@@ -215,6 +218,51 @@ func (r *PyTorchJobReconciler) SetupWithManager(mgr ctrl.Manager) error {
 		return err
 	}

	// inject watching for job related service
```
Is this duplicate?
```go
	}

	// inject watching for job related objects, such as PodGroup
	if config.Config.WatchedResources != nil {
```
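A rough sketch of how such a dynamic watch could be wired up inside SetupWithManager. It assumes the struct-style source.Kind / handler.EnqueueRequestForOwner API of controller-runtime releases from around the time of this PR (pre-v0.15); the watchExtraResources helper and its signature are illustrative, not the PR's actual code.

```go
package controllers

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/source"
)

// watchExtraResources registers a watch for every configured GVK (for example
// Volcano's PodGroup) so that changes to those objects re-queue the owning job.
// Events are mapped back to the owner via the controller owner reference the
// operator sets when it creates the object.
func watchExtraResources(c controller.Controller, owner client.Object, gvks []schema.GroupVersionKind) error {
	for _, gvk := range gvks {
		// Using an unstructured object means the CRD's Go types do not have
		// to be imported, which is what makes the watch list configurable.
		obj := &unstructured.Unstructured{}
		obj.SetGroupVersionKind(gvk)
		if err := c.Watch(
			&source.Kind{Type: obj},
			&handler.EnqueueRequestForOwner{
				OwnerType:    owner,
				IsController: true,
			},
		); err != nil {
			return err
		}
	}
	return nil
}
```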
Can you rebase your code and add it for all frameworks? Also, the static Volcano watches in the current code should be removed in favor of this dynamic watcher. #1666
When I wrote this feature, I did not realize that it was already supported and merged. @_@
```diff
@@ -0,0 +1,30 @@
+package util
```
Is this file used?
@ggaaooppeenngg @D0m021ng @Crazybean-lwb This adds a dynamic watcher so that new resources can be watched without hardcoding.
I suggest adding PodGroup as the custom resource watched by default when enable-gang-scheduling is enabled.
With a dynamic watcher instead of the static watch in the current code, other schedulers can also be deployed instead of Volcano. Ref: #1677 (comment). We can still watch Volcano's PodGroup by default; e.g., if the CRD is not installed before the controller is deployed, the watch is simply skipped in that case. @zw0610 does this solution also address your concern about third-party CRD installation?
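A sketch of the "skip the watch when the CRD is not installed" idea, using the Kubernetes discovery API to check whether the group/version is served. The gvkInstalled helper is an assumption for illustration and is not taken from this PR or #1666.

```go
package controllers

import (
	"strings"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// gvkInstalled reports whether the API server serves the given GVK, e.g.
// PodGroup.v1beta1.scheduling.volcano.sh when Volcano is installed.
func gvkInstalled(cfg *rest.Config, gvk schema.GroupVersionKind) (bool, error) {
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return false, err
	}
	resources, err := dc.ServerResourcesForGroupVersion(gvk.GroupVersion().String())
	if err != nil {
		if apierrors.IsNotFound(err) {
			// The group/version is not served: the CRD is not installed,
			// so the controller can simply skip this watch.
			return false, nil
		}
		return false, err
	}
	for _, r := range resources.APIResources {
		// Ignore subresources such as "podgroups/status".
		if !strings.Contains(r.Name, "/") && r.Kind == gvk.Kind {
			return true, nil
		}
	}
	return false, nil
}
```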
@qiankunli Can you update the PR?
I think we can close this PR since the above issue was resolved by #1666. |
Yes. Closing this PR as #1666 is merged |
If we create a PyTorchJob, the operator will create a PodGroup for it (we use Volcano as the scheduler). If there are not enough resources in the cluster at that moment, the PodGroup's status will be Pending and the reconcile function stops there.
After a while, once enough resources become available, the PodGroup's status changes to Inqueue and reconciliation should continue. But the operator is not watching the PodGroup, so the PyTorchJob's status stays Created and no pods are ever created.
So we need to watch changes to the PodGroup's status, so that the reconcile function can continue and create the pods and services.
I added an argument, enable-watch-resources: you can list any resources you want to monitor, and the controller will watch their changes to trigger reconciliation. Related PR: #1677
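Since the goal is to re-run reconcile when the PodGroup's status changes, one optional refinement (an assumption, not part of this PR) would be a controller-runtime predicate that filters out updates where .status is unchanged, so that metadata- or spec-only updates do not re-queue the job.

```go
package controllers

import (
	"reflect"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// onStatusChange triggers reconciliation only when .status differs between
// the old and new object, filtering out spec/metadata-only updates.
func onStatusChange() predicate.Funcs {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			oldObj, okOld := e.ObjectOld.(*unstructured.Unstructured)
			newObj, okNew := e.ObjectNew.(*unstructured.Unstructured)
			if !okOld || !okNew {
				return true // not unstructured; fall back to reconciling
			}
			oldStatus, _, _ := unstructured.NestedMap(oldObj.Object, "status")
			newStatus, _, _ := unstructured.NestedMap(newObj.Object, "status")
			return !reflect.DeepEqual(oldStatus, newStatus)
		},
	}
}
```

Such a predicate could be passed as the optional third argument to the controller's Watch call for the dynamically watched resources.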