
support monitor podgroup #1688

Closed

Conversation

qiankunli
Contributor

@qiankunli qiankunli commented Nov 4, 2022

If we create a PyTorchJob, the operator will create a PodGroup later (we use Volcano as the scheduler). If there are not enough resources in the cluster at that moment, the status of the PodGroup will be Pending and the reconcile function will break.

After a while, once enough resources are available, the status of the PodGroup will become Inqueue, and the reconcile function should continue to run. But the operator is not listening to the PodGroup, so the status of the PyTorchJob will stay Created forever and no pod will be created.

So we may need to watch changes to the PodGroup's status, so that the reconcile function can continue to create pods and services.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    control-plane: kubeflow-training-operator
  name: training-operator
  namespace: kubeflow
spec:
  replicas: 1
  selector:
    matchLabels:
      control-plane: kubeflow-training-operator
  template:
    metadata:
      labels:
        control-plane: kubeflow-training-operator
    spec:
      containers:
      - command:
        - /manager
        args:
        - --enable-gang-scheduling=true
        - --enable-watch-resources=PodGroup.v1beta1.scheduling.volcano.sh

I added an argument, enable-watch-resources: you can list any resources you want the controller to monitor, and it will watch their changes to trigger the reconcile.
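
For illustration, here is a minimal sketch of the idea, not necessarily the exact code in this PR: parse a Kind.version.group entry and register a watch that re-enqueues the owning PyTorchJob whenever a watched object (e.g. a Volcano PodGroup) changes. The helper names parseGVK and addWatchedResources are hypothetical, and the Watches call assumes a controller-runtime version from around this time (v0.13).

package pytorch

import (
	"strings"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/source"

	kubeflowv1 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"
)

// parseGVK turns an entry such as "PodGroup.v1beta1.scheduling.volcano.sh"
// (Kind.version.group form) into a GroupVersionKind.
func parseGVK(entry string) (schema.GroupVersionKind, bool) {
	parts := strings.SplitN(entry, ".", 3)
	if len(parts) != 3 {
		return schema.GroupVersionKind{}, false
	}
	return schema.GroupVersionKind{Kind: parts[0], Version: parts[1], Group: parts[2]}, true
}

// addWatchedResources registers one watch per configured resource, so that a
// change to a watched object (e.g. a PodGroup turning Inqueue) re-enqueues the
// PyTorchJob that owns it.
func addWatchedResources(b *ctrl.Builder, watched []string) *ctrl.Builder {
	for _, entry := range watched {
		gvk, ok := parseGVK(entry)
		if !ok {
			continue // ignore malformed entries
		}
		obj := &unstructured.Unstructured{}
		obj.SetGroupVersionKind(gvk)
		b = b.Watches(&source.Kind{Type: obj}, &handler.EnqueueRequestForOwner{
			OwnerType:    &kubeflowv1.PyTorchJob{},
			IsController: true,
		})
	}
	return b
}

In SetupWithManager, the builder created by ctrl.NewControllerManagedBy(mgr).For(&kubeflowv1.PyTorchJob{}) could then be passed through addWatchedResources together with the configured list.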

Related PR: #1677

Signed-off-by: qiankunli <bert.li.qiankun@gmail.com>
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: qiankunli
Once this PR has been reviewed and has the lgtm label, please assign jeffwan for approval by writing /assign @jeffwan in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coveralls

coveralls commented Nov 4, 2022

Pull Request Test Coverage Report for Build 3391834184

  • 12 of 50 (24.0%) changed or added relevant lines in 2 files are covered.
  • 5 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.4%) to 39.286%

Changes Missing Coverage                             Covered Lines   Changed/Added Lines   %
pkg/common/util/flag_util.go                         0               13                    0.0%
pkg/controller.v1/pytorch/pytorchjob_controller.go   12              37                    32.43%

Files with Coverage Reduction                        New Missed Lines   %
pkg/controller.v1/mpi/mpijob_controller.go           2                  77.2%
pkg/controller.v1/pytorch/pytorchjob_controller.go   3                  57.25%

Totals Coverage Status
Change from base Build 3390627896: -0.4%
Covered Lines: 2345
Relevant Lines: 5969

💛 - Coveralls

@zw0610
Member

zw0610 commented Nov 4, 2022

@johnugeorge PTAL

@@ -64,6 +64,7 @@ func main() {
"Enabling this will ensure there is only one active controller manager.")
flag.Var(&enabledSchemes, "enable-scheme", "Enable scheme(s) as --enable-scheme=tfjob --enable-scheme=pytorchjob, case insensitive."+
" Now supporting TFJob, PyTorchJob, MXNetJob, XGBoostJob. By default, all supported schemes will be enabled.")
flag.Var(&config.Config.WatchedResources, "enable-watch-resources", "The list of resources that need be watched to trigger reconcile, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org)")
Member

In the example, you could use the PodGroup resource for easy reference.
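
For context, flag.Var needs a type implementing flag.Value. A minimal sketch of such a repeatable list flag, an assumption rather than the PR's actual flag_util.go, could look like this:

package util

import "strings"

// ResourceList accumulates repeated --enable-watch-resources values; it
// satisfies flag.Value via String and Set.
// Usage: flag.Var(&config.Config.WatchedResources, "enable-watch-resources", "...")
type ResourceList []string

func (r *ResourceList) String() string { return strings.Join(*r, ",") }

func (r *ResourceList) Set(value string) error {
	*r = append(*r, value)
	return nil
}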

@@ -215,6 +218,51 @@ func (r *PyTorchJobReconciler) SetupWithManager(mgr ctrl.Manager) error {
return err
}

// inject watching for job related service
Member

Is this duplicate?

}

// inject watching for job related objects, such as PodGroup
if config.Config.WatchedResources != nil {
Member
@johnugeorge johnugeorge Nov 4, 2022

Can you rebase your code and add it for all frameworks? Also, the Volcano static watches in the current code should be removed in favor of this dynamic watcher. #1666

Contributor Author

When I wrote this feature, I did not realize that it was already supported and merged. @_@

@@ -0,0 +1,30 @@
package util
Member
@johnugeorge johnugeorge Nov 4, 2022

Is this file used?

@johnugeorge
Member

johnugeorge commented Nov 4, 2022

@ggaaooppeenngg @D0m021ng @Crazybean-lwb

This adds a dynamic watcher so that new resources can be watched without hardcoding.

@ggaaooppeenngg
Contributor

I suggest adding PodGroup as the custom resource by default when enable-gang-scheduling is enabled.
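
A minimal sketch of that suggestion; the function and flag names here are assumptions about how it could be wired up, not the final code:

package config

// defaultVolcanoPodGroup is the Kind.version.group entry for Volcano's PodGroup.
const defaultVolcanoPodGroup = "PodGroup.v1beta1.scheduling.volcano.sh"

// defaultWatchedResources adds the Volcano PodGroup to the watch list when
// gang scheduling is enabled and nothing was configured explicitly.
func defaultWatchedResources(enableGangScheduling bool, watched []string) []string {
	if enableGangScheduling && len(watched) == 0 {
		return append(watched, defaultVolcanoPodGroup)
	}
	return watched
}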

@johnugeorge
Member

johnugeorge commented Nov 4, 2022

@ggaaooppeenngg

If it is a dynamic watcher instead of a static watch (as in the current code), other schedulers can also be deployed instead of Volcano. Ref: #1677 (comment)

We can still use Volcano as the default one to be watched, e.g.:
https://github.com/kubeflow/katib/blob/54b020b44e852bd2980c1bd9223f8c7a7ed67f2d/manifests/v1beta1/components/controller/controller.yaml#L31

If the CRD is not installed before the controller deployment, the controller skips the watch in that case.

@zw0610 Does this solution address your concern regarding the third-party CRD installation as well?
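
A sketch of how that skip could work, assuming a discovery-based check rather than whatever the merged code does: ask the API server whether the group/version actually serves the kind before registering the watch.

package util

import (
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// gvkServed reports whether the cluster currently serves the given kind,
// e.g. PodGroup in scheduling.volcano.sh/v1beta1.
func gvkServed(cfg *rest.Config, gvk schema.GroupVersionKind) bool {
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return false
	}
	resources, err := dc.ServerResourcesForGroupVersion(gvk.GroupVersion().String())
	if err != nil {
		// Typically a NotFound: the CRD is not installed, so skip the watch.
		return false
	}
	for _, r := range resources.APIResources {
		if r.Kind == gvk.Kind {
			return true
		}
	}
	return false
}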

@johnugeorge
Member

@qiankunli Can you update the PR?

@tenzen-y
Member

I think we can close this PR since the above issue was resolved by #1666.
@johnugeorge WDYT?

@johnugeorge
Member

Yes. Closing this PR as #1666 is merged
