
support monitor podgroup #1688

Closed

Conversation

qiankunli
Contributor

@qiankunli qiankunli commented Nov 4, 2022

If we create a PyTorchJob, the operator will create a PodGroup later (we use Volcano as the scheduler). If there are not enough resources in the cluster at that moment, the status of the PodGroup will be Pending and the reconcile function will break.

After a while, once enough resources are available, the status of the PodGroup will become Inqueue, and the reconcile function should continue to run. But the operator is not listening to the PodGroup, so the status of the PyTorchJob will stay Created forever and no pod will be created.

So we may need to watch changes to the PodGroup's status, so that the reconcile function can continue to create pods and services.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    control-plane: kubeflow-training-operator
  name: training-operator
  namespace: kubeflow
spec:
  replicas: 1
  selector:
    matchLabels:
      control-plane: kubeflow-training-operator
  template:
    metadata:
      labels:
        control-plane: kubeflow-training-operator
    spec:
      containers:
      - command:
        - /manager
        args:
        - --enable-gang-scheduling=true
        - --enable-watch-resources=PodGroup.v1beta1.scheduling.volcano.sh

I added an argument, enable-watch-resources: you can list any resources you want the controller to monitor, and it will watch their changes to trigger the reconcile.
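
For illustration, here is a minimal sketch of the idea, not necessarily the exact code in this PR: parse a Kind.version.group entry and register a watch that re-enqueues the owning PyTorchJob whenever a watched object (e.g. a Volcano PodGroup) changes. The helper names parseGVK and addWatchedResources are hypothetical, and the Watches call assumes a controller-runtime version from around this time (v0.13).

package pytorch

import (
	"strings"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/source"

	kubeflowv1 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"
)

// parseGVK turns an entry such as "PodGroup.v1beta1.scheduling.volcano.sh"
// (Kind.version.group form) into a GroupVersionKind.
func parseGVK(entry string) (schema.GroupVersionKind, bool) {
	parts := strings.SplitN(entry, ".", 3)
	if len(parts) != 3 {
		return schema.GroupVersionKind{}, false
	}
	return schema.GroupVersionKind{Kind: parts[0], Version: parts[1], Group: parts[2]}, true
}

// addWatchedResources registers one watch per configured resource, so that a
// change to a watched object (e.g. a PodGroup turning Inqueue) re-enqueues the
// PyTorchJob that owns it.
func addWatchedResources(b *ctrl.Builder, watched []string) *ctrl.Builder {
	for _, entry := range watched {
		gvk, ok := parseGVK(entry)
		if !ok {
			continue // ignore malformed entries
		}
		obj := &unstructured.Unstructured{}
		obj.SetGroupVersionKind(gvk)
		b = b.Watches(&source.Kind{Type: obj}, &handler.EnqueueRequestForOwner{
			OwnerType:    &kubeflowv1.PyTorchJob{},
			IsController: true,
		})
	}
	return b
}

In SetupWithManager, the builder created by ctrl.NewControllerManagedBy(mgr).For(&kubeflowv1.PyTorchJob{}) could then be passed through addWatchedResources together with the configured list.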

Related PR: #1677

Signed-off-by: qiankunli <bert.li.qiankun@gmail.com>
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: qiankunli
Once this PR has been reviewed and has the lgtm label, please assign jeffwan for approval by writing /assign @jeffwan in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coveralls

coveralls commented Nov 4, 2022

Pull Request Test Coverage Report for Build 3391834184

  • 12 of 50 (24.0%) changed or added relevant lines in 2 files are covered.
  • 5 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.4%) to 39.286%

Changes Missing Coverage                             Covered Lines   Changed/Added Lines   %
pkg/common/util/flag_util.go                         0               13                    0.0%
pkg/controller.v1/pytorch/pytorchjob_controller.go   12              37                    32.43%

Files with Coverage Reduction                        New Missed Lines   %
pkg/controller.v1/mpi/mpijob_controller.go           2                  77.2%
pkg/controller.v1/pytorch/pytorchjob_controller.go   3                  57.25%

Totals Coverage Status
Change from base Build 3390627896: -0.4%
Covered Lines: 2345
Relevant Lines: 5969

💛 - Coveralls

@zw0610
Member

zw0610 commented Nov 4, 2022

@johnugeorge PTAL

@@ -64,6 +64,7 @@ func main() {
"Enabling this will ensure there is only one active controller manager.")
flag.Var(&enabledSchemes, "enable-scheme", "Enable scheme(s) as --enable-scheme=tfjob --enable-scheme=pytorchjob, case insensitive."+
" Now supporting TFJob, PyTorchJob, MXNetJob, XGBoostJob. By default, all supported schemes will be enabled.")
flag.Var(&config.Config.WatchedResources, "enable-watch-resources", "The list of resources that need be watched to trigger reconcile, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org)")
Member

In the example, you could use the PodGroup resource for easy reference.
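
For context, flag.Var needs a type implementing flag.Value. A minimal sketch of such a repeatable list flag, an assumption rather than the PR's actual flag_util.go, could look like this:

package util

import "strings"

// ResourceList accumulates repeated --enable-watch-resources values; it
// satisfies flag.Value via String and Set.
// Usage: flag.Var(&config.Config.WatchedResources, "enable-watch-resources", "...")
type ResourceList []string

func (r *ResourceList) String() string { return strings.Join(*r, ",") }

func (r *ResourceList) Set(value string) error {
	*r = append(*r, value)
	return nil
}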

@@ -215,6 +218,51 @@ func (r *PyTorchJobReconciler) SetupWithManager(mgr ctrl.Manager) error {
return err
}

// inject watching for job related service
Member

Is this duplicate?

}

// inject watching for job related objects, such as PodGroup
if config.Config.WatchedResources != nil {
Member
@johnugeorge johnugeorge Nov 4, 2022

Can you rebase your code and add it for all frameworks? Also, the Volcano static watches in the current code should be removed in favor of this dynamic watcher. #1666

Contributor Author

When I wrote this feature, I did not realize that it was already supported and merged. @_@

@@ -0,0 +1,30 @@
package util
Member
@johnugeorge johnugeorge Nov 4, 2022

Is this file used?

@johnugeorge
Member

johnugeorge commented Nov 4, 2022

@ggaaooppeenngg @D0m021ng @Crazybean-lwb

This adds a dynamic watcher so that new resources can be watched without hardcoding.

@ggaaooppeenngg
Contributor

I suggest adding PodGroup as the custom resource by default when enable-gang-scheduling is enabled.
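
A minimal sketch of that suggestion; the function and flag names here are assumptions about how it could be wired up, not the final code:

package config

// defaultVolcanoPodGroup is the Kind.version.group entry for Volcano's PodGroup.
const defaultVolcanoPodGroup = "PodGroup.v1beta1.scheduling.volcano.sh"

// defaultWatchedResources adds the Volcano PodGroup to the watch list when
// gang scheduling is enabled and nothing was configured explicitly.
func defaultWatchedResources(enableGangScheduling bool, watched []string) []string {
	if enableGangScheduling && len(watched) == 0 {
		return append(watched, defaultVolcanoPodGroup)
	}
	return watched
}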

@johnugeorge
Member

johnugeorge commented Nov 4, 2022

@ggaaooppeenngg

If it is a dynamic watcher instead of a static watch (as in the current code), other schedulers can also be deployed instead of Volcano. Ref: #1677 (comment)

We can still use Volcano as the default one to be watched, e.g.:
https://github.com/kubeflow/katib/blob/54b020b44e852bd2980c1bd9223f8c7a7ed67f2d/manifests/v1beta1/components/controller/controller.yaml#L31

If the CRD is not installed before the controller deployment, the controller skips the watch in that case.

@zw0610 Does this solution address your concern regarding the third-party CRD installation as well?
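
A sketch of how that skip could work, assuming a discovery-based check rather than whatever the merged code does: ask the API server whether the group/version actually serves the kind before registering the watch.

package util

import (
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// gvkServed reports whether the cluster currently serves the given kind,
// e.g. PodGroup in scheduling.volcano.sh/v1beta1.
func gvkServed(cfg *rest.Config, gvk schema.GroupVersionKind) bool {
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return false
	}
	resources, err := dc.ServerResourcesForGroupVersion(gvk.GroupVersion().String())
	if err != nil {
		// Typically a NotFound: the CRD is not installed, so skip the watch.
		return false
	}
	for _, r := range resources.APIResources {
		if r.Kind == gvk.Kind {
			return true
		}
	}
	return false
}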

@johnugeorge
Member

@qiankunli Can you update the PR?

@tenzen-y
Member

I think we can close this PR since the above issue was resolved by #1666.
@johnugeorge WDYT?

@johnugeorge
Member

Yes. Closing this PR as #1666 is merged
