Bug 1888073: prevent no-op hotlooping on Operators #1816

Merged

Conversation

sjenning
Contributor

xref https://bugzilla.redhat.com/show_bug.cgi?id=1888073

The Operator Reconciler does not detect when a reconciliation is triggered by its own Update, so it hotloops until it happens to create the exact same rendering of the Operator twice in a row. For larger Operator resources this is unlikely, and the reconciler hotloops indefinitely.

This hotloop also puts a lot of strain on the kube-apiserver and etcd, since some Operator resources are very large. I observe about 500m (half a core) of additional CPU usage per master, on top of 1 core of CPU used directly by the hotlooping olm-operator. olm-operator also generates 1-2MB/s of network traffic in this state.
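To illustrate the failure mode, here is a minimal sketch of a no-op guard, not the code in this patch. The `Operator` type, `fakeAPI`, `render`, and `reconcile` are hypothetical stand-ins: the rendering is made deterministic by sorting, and the reconciler skips the write when the rendered state already matches what is stored, so its own Update cannot trigger another write.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// Operator is a hypothetical, heavily simplified stand-in for the real resource.
type Operator struct {
	Name            string
	ResourceVersion int
	Components      []string
}

// fakeAPI is a hypothetical in-memory API server that counts writes.
type fakeAPI struct {
	stored  Operator
	updates int
}

// Update stores the object and bumps its resource version. In a real cluster
// this would emit a watch event and trigger another reconciliation.
func (a *fakeAPI) Update(op Operator) {
	op.ResourceVersion = a.stored.ResourceVersion + 1
	a.stored = op
	a.updates++
}

// render builds the component list in sorted order so that two renderings of
// the same inputs always compare equal.
func render(discovered []string) []string {
	out := append([]string(nil), discovered...)
	sort.Strings(out)
	return out
}

// reconcile writes only when the rendered state differs from the stored one.
// It reports whether it issued an Update.
func reconcile(api *fakeAPI, discovered []string) bool {
	desired := render(discovered)
	if reflect.DeepEqual(api.stored.Components, desired) {
		return false // no-op: skip the write, so no new event is generated
	}
	next := api.stored
	next.Components = desired
	api.Update(next)
	return true
}

func main() {
	api := &fakeAPI{stored: Operator{Name: "acm"}}
	fmt.Println("wrote:", reconcile(api, []string{"b", "a"}))
	// Same components in a different discovery order: no second write.
	fmt.Println("wrote:", reconcile(api, []string{"a", "b"}))
	fmt.Println("total updates:", api.updates)
}
```

Without the sort in `render`, the two calls above would produce different slices for the same set of components, and every reconciliation would write again.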

Before patch with ACM Operator installed

olm-operator

After patch. The usage spike is during ACM installation, but it falls back down after installation is complete, as expected.

olm-operator-patched

@njhale @derekwaynecarr @dinhxuanvu

@openshift-ci-robot
Collaborator

Hi @sjenning. Thanks for your PR.

I'm waiting for an operator-framework member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 15, 2020
@sjenning sjenning changed the title prevent no-op hotlooping on Operators Bug 1888073: prevent no-op hotlooping on Operators Oct 15, 2020
@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. label Oct 15, 2020
@openshift-ci-robot
Collaborator

@sjenning: This pull request references Bugzilla bug 1888073, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1888073: prevent no-op hotlooping on Operators

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Oct 15, 2020
@dinhxuanvu
Member

/ok-to-test

@openshift-ci-robot openshift-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 15, 2020
@sjenning
Contributor Author

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot removed the bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. label Oct 15, 2020
@openshift-ci-robot
Collaborator

@sjenning: This pull request references Bugzilla bug 1888073, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. label Oct 15, 2020
@sjenning
Contributor Author

/hold

Still some things to work out. It isn't recreating the Operator when it is deleted manually. And the unit test failure I think is due to not requeuing on failure.
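The two open issues above (recreation after a manual delete, and requeuing on failure) can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: it assumes the reconciler remembers the resource version of its last successful write and skips events that carry that version. The `store`, `Reconciler`, and `lastWritten` names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("not found")

// Operator is a hypothetical, simplified stand-in for the real resource.
type Operator struct {
	Name            string
	ResourceVersion string
}

// store is a hypothetical in-memory stand-in for the API server.
type store struct {
	objects       map[string]Operator
	version       int
	failNextWrite bool
}

func (s *store) Get(name string) (Operator, error) {
	op, ok := s.objects[name]
	if !ok {
		return Operator{}, errNotFound
	}
	return op, nil
}

// Write creates or updates the object and assigns a new resource version.
func (s *store) Write(name string) (Operator, error) {
	if s.failNextWrite {
		s.failNextWrite = false
		return Operator{}, errors.New("simulated write failure")
	}
	s.version++
	op := Operator{Name: name, ResourceVersion: fmt.Sprint(s.version)}
	s.objects[name] = op
	return op, nil
}

// Reconciler remembers the resource version produced by its last write.
type Reconciler struct {
	api         *store
	lastWritten map[string]string
}

// Reconcile returns (requeue, err).
func (r *Reconciler) Reconcile(name string) (bool, error) {
	current, err := r.api.Get(name)
	switch {
	case errors.Is(err, errNotFound):
		// Deleted by hand: forget the cached version so the guard below
		// cannot suppress recreation.
		delete(r.lastWritten, name)
	case err != nil:
		return true, err
	case r.lastWritten[name] != "" && r.lastWritten[name] == current.ResourceVersion:
		// The event was caused by our own write: nothing to do.
		return false, nil
	}
	written, err := r.api.Write(name)
	if err != nil {
		// Drop the cached version and requeue so the work is retried.
		delete(r.lastWritten, name)
		return true, err
	}
	r.lastWritten[name] = written.ResourceVersion
	return false, nil
}

func main() {
	s := &store{objects: map[string]Operator{}}
	r := &Reconciler{api: s, lastWritten: map[string]string{}}
	r.Reconcile("acm")
	r.Reconcile("acm") // triggered by our own write: skipped
	fmt.Println("writes after two reconciles:", s.version)
	delete(s.objects, "acm")
	r.Reconcile("acm")
	_, err := s.Get("acm")
	fmt.Println("recreated after delete:", err == nil)
}
```

In this sketch the cached version is cleared both when the object is missing and when a write fails, so neither case can leave the guard permanently suppressing work.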

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 15, 2020
@of-deploy-bot

This PR failed 1 out of 1 times with 4 individual failed tests and 4 skipped tests. A test is considered flaky if failed on multiple commits.

totaltestcount: 1
failedtestcount: 1
flaketestcount: 4
skippedtestcount: 4
flaketests:

  • classname: End-to-end
    name: 'Operator when a subscription to a package exists should automatically
    adopt components
    '
    counts: 3
    details:
    • count: 3
      error: |4-

      /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/operator_test.go:291
      Timed out after 60.000s.
      Error: Unexpected non-nil/non-zero extra argument at index 1:
      	<*errors.StatusError>: &errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:"", RemainingItemCount:(*int64)(nil)}, Status:"Failure", Message:"operators.operators.coreos.com \"kiali.ns-g79gt\" not found", Reason:"NotFound", Details:(*v1.StatusDetails)(0xc000b3bc80), Code:404}}
      /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/operator_test.go:296
      
    meandurationsec: 103.048239
  • classname: End-to-end
    name: 'Install Plan with CSVs across multiple catalog sources'
    counts: 2
    details:
    • count: 2
      error: |4-

      /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/installplan_e2e_test.go:49
      Resource does not match expected value:   &v1alpha1.InstallPlan{
        	TypeMeta: {},
        	ObjectMeta: v1.ObjectMeta{
        		... // 3 identical fields
        		SelfLink:                   "/apis/operators.coreos.com/v1alpha1/namespaces/operators/install"...,
        		UID:                        "9ff10168-70ef-440c-be0d-c65af1f258e6",
      - 		ResourceVersion:            "17537",
      + 		ResourceVersion:            "17539",
        		Generation:                 1,
        		CreationTimestamp:          {Time: s"2020-10-15 19:14:55 +0000 UTC"},
        		DeletionTimestamp:          nil,
        		DeletionGracePeriodSeconds: nil,
        		Labels: map[string]string{
        			"operators.coreos.com/nginx-pmrfl.operators":    "",
      + 			"operators.coreos.com/nginxdep-hd27d.operators": "",
        		},
        		Annotations:     nil,
        		OwnerReferences: {{APIVersion: "operators.coreos.com/v1alpha1", Kind: "Subscription", Name: "sub-nginx-vkbmd", UID: "ac806836-abad-4b8f-8c14-1d5cbb3d8415", ...}},
        		... // 3 identical fields
        	},
        	Spec:   {ClusterServiceVersionNames: {"nginxdep-hd27d-stable", "nginx-pmrfl-stable"}, Approval: "Automatic", Approved: true, Generation: 1},
        	Status: {Phase: "Complete", Conditions: {{Type: "Installed", Status: "True", LastUpdateTime: s"2020-10-15 19:14:57 +0000 UTC", LastTransitionTime: s"2020-10-15 19:14:57 +0000 UTC", ...}}, CatalogSources: {"mock-ocs-dependent-hlzsh", "mock-ocs-main-lvws8"}, Plan: {&{Resolving: "nginxdep-hd27d-stable", Resource: {CatalogSource: "mock-ocs-dependent-hlzsh", CatalogSourceNamespace: "operators", Version: "v1alpha1", Kind: "ClusterServiceVersion", ...}, Status: "Present"}, &{Resolving: "nginx-pmrfl-stable", Resource: {CatalogSource: "mock-ocs-main-lvws8", CatalogSourceNamespace: "operators", Version: "v1alpha1", Kind: "ClusterServiceVersion", ...}, Status: "Present"}, &{Resolving: "nginxdep-hd27d-stable", Resource: {CatalogSource: "mock-ocs-dependent-hlzsh", CatalogSourceNamespace: "operators", Group: "apiextensions.k8s.io", Version: "v1beta1", ...}, Status: "Present"}, &{Resolving: "nginxdep-hd27d-stable", Resource: {CatalogSource: "mock-ocs-dependent-hlzsh", CatalogSourceNamespace: "operators", Group: "operators.coreos.com", Version: "v1alpha1", ...}, Status: "Present"}}, ...},
        }
      
      /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/installplan_e2e_test.go:2792
      
    meandurationsec: 184.2415225
  • classname: End-to-end
    name: 'Catalog gRPC address catalog source'
    counts: 1
    details:
    • count: 1
      error: |4-

      /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/catalog_e2e_test.go:34
      Timed out after 60.001s.
      failed to await deletion of test csvs
      Expected
          <[]v1alpha1.ClusterServiceVersion | len:1, cap:1>: [
              {
                  TypeMeta: {
                      Kind: "ClusterServiceVersion",
                      APIVersion: "operators.coreos.com/v1alpha1",
                  },
                  ObjectMeta: {
                      Name: "nginx-ssw2v-stable-replacement",
                      GenerateName: "",
                      Namespace: "operators",
                      SelfLink: "/apis/operators.coreos.com/v1alpha1/namespaces/operators/clusterserviceversions/nginx-ssw2v-stable-replacement",
                      UID: "9b42716c-ae56-4c3b-96df-4c0d848a4aa8",
                      ResourceVersion: "3507",
                      Generation: 1,
                      CreationTimestamp: {
                          Time: 2020-10-15T18:47:28Z,
                      },
                      DeletionTimestamp: nil,
                      DeletionGracePeriodSeconds: nil,
                      Labels: {
                          "olm.api.a05fc7f44c0232f": "required",
                      },
                      Annotations: {
                          "olm.operatorGroup": "global-operators",
                          "olm.operatorNamespace": "operators",
                          "olm.targetNamespaces": "",
                          "operatorframework.io/properties": "{\"properties\":[{\"type\":\"olm.package\",\"value\":{\"packageName\":\"nginx-ssw2v\",\"version\":\"0.2.0\"}}]}",
                      },
                      OwnerReferences: nil,
                      Finalizers: nil,
                      ClusterName: "",
                      ManagedFields: nil,
                  },
                  Spec: {
                      InstallStrategy: {
                          StrategyName: "deployment",
                          StrategySpec: {
                              DeploymentSpecs: [
                                  {
                                      Name: "dep9lq6d",
                                      Spec: {
                                          Replicas: 1,
                                          Selector: {
                                              MatchLabels: {...: ...},
                                              MatchExpressions: nil,
                                          },
                                          Template: {
                                              ObjectMeta: {
                                                  Name: ...,
                                                  GenerateName: ...,
                                                  Namespace: ...,
                                                  SelfLink: ...,
                                                  UID: ...,
                                                  ResourceVersion: ...,
                                                  Generation: ...,
                                                  CreationTimestamp: ...,
                                                  DeletionTimestamp: ...,
                                                  DeletionGracePeriodSeconds: ...,
                                                  Labels: ...,
                                                  Annotations: ...,
                                                  OwnerReferences: ...,
                                                  Finalizers: ...,
                                                  ClusterName: ...,
                                                  ManagedFields: ...,
                                              },
                                              Spec: {
                                                  Volumes: ...,
                                                  InitContainers: ...,
                                                  Containers: ...,
                                                  EphemeralContainers: ...,
                                                  RestartPolicy: ...,
                                                  TerminationGracePeriodSeconds: ...,
                                                  ActiveDeadlineSeconds: ...,
                                                  DNSPolicy: ...,
                                                  NodeSelector: ...,
                                                  ServiceAccountName: ...,
                                                  DeprecatedServiceAccount: ...,
                                                  AutomountServiceAccountToken: ...,
                                                  NodeName: ...,
                                                  HostNetwork: ...,
                                                  HostPID: ...,
                                                  HostIPC: ...,
                                                  ShareProcessNamespace: ...,
                                                  SecurityContext: ...,
                                                  ImagePullSecrets: ...,
                                                  Hostname: ...,
                                                  Subdomain: ...,
                                                  Affinity: ...,
                                                  SchedulerName: ...,
                                                  Tolerations: ...,
                                                  HostAliases: ...,
                                                  PriorityClassName: ...,
                                                  Priority: ...,
                                                  DNSConfig: ...,
                                                  ReadinessGates: ...,
                                                  RuntimeClassName: ...,
                                                  EnableServiceLinks: ...,
                                                  PreemptionPolicy: ...,
                                                  Overhead: ...,
                                                  TopologySpreadConstraints: ...,
                                              },
                                          },
                                          Strategy: {Type: "", RollingUpdate: nil},
                                          MinReadySeconds: 0,
                                          RevisionHistoryLimit: nil,
                                          Paused: false,
                                          ProgressDeadlineSeconds: nil,
                                      },
                                      Label: nil,
                                  },
                              ],
                              Permissions: nil,
                              ClusterPermissions: nil,
                          },
                      },
                      Version: {
                          Version: {Major: 0, Minor: 2, Patch: 0, Pre: nil, Build: nil},
                      },
                      Maturity: "",
                      CustomResourceDefinitions: {
                          Owned: nil,
                          Required: [
                              {
                                  Name: "ins-ltwf7.cluster.com",
                                  Version: "v1alpha1",
                                  Kind: "ins-ltwf7",
                                  DisplayName: "ins-ltwf7.cluster.com",
                                  Description: "ins-ltwf7.cluster.com",
                                  Resources: nil,
                                  StatusDescriptors: nil,
                                  SpecDescriptors: nil,
                                  ActionDescriptor: nil,
                              },
                          ],
                      },
                      APIServiceDefinitions: {Owned: nil, Required: nil},
                      WebhookDefinitions: nil,
                      NativeAPIs: nil,
                      MinKubeVersion: "0.0.0",
                      DisplayName: "",
                      Description: "",
                      Keywords: nil,
                      Maintainers: nil,
                      Provider: {Name: "", URL: ""},
                      Links: nil,
                      Icon: nil,
                      InstallModes: [
                          {Type: "OwnNamespace", Supported: true},
                          {
                              Type: "SingleNamespace",
                              Supported: true,
                          },
                          {
                              Type: "MultiNamespace",
                              Supported: true,
                          },
                          {
                              Type: "AllNamespaces",
                              Supported: true,
                          },
                      ],
                      Replaces: "nginx-ssw2v-stable",
                      Labels: nil,
                      Annotations: nil,
                      Selector: nil,
                  },
                  Status: {
                      Phase: "Pending",
                      Message: "one or more requirements couldn't be found",
                      Reason: "RequirementsNotMet",
                      LastUpdateTime: {
                          Time: 2020-10-15T18:47:29Z,
                      },
                      LastTransitionTime: {
                          Time: 2020-10-15T18:47:28Z,
                      },
                      Conditions: [
                          {
                              Phase: "Pending",
                              Message: "requirements not yet checked",
                              Reason: "RequirementsUnknown",
                              LastUpdateTime: {
                                  Time: 2020-10-15T18:47:28Z,
                              },
                              LastTransitionTime: {
                                  Time: 2020-10-15T18:47:28Z,
                              },
                          },
                          {
                              Phase: "Pending",
                              Message: "one or more requirements couldn't be found",
                              Reason: "RequirementsNotMet",
                              LastUpdateTime: {
                                  Time: 2020-10-15T18:47:29Z,
                              },
                              LastTransitionTime: {
                                  Time: 2020-10-15T18:47:28Z,
                              },
                          },
                      ],
                      RequirementStatus: [
                          {
                              Group: "operators.coreos.com",
                              Version: "v1alpha1",
                              Kind: "ClusterServiceVersion",
                              Name: "nginx-ssw2v-stable-replacement",
                              Status: "Present",
                              Message: "CSV minKubeVersion (0.0.0) less than server version (v1.17.0)",
                              UUID: "",
                              Dependents: nil,
                          },
                          {
                              Group: "apiextensions.k8s.io",
                              Version: "v1",
                              Kind: "CustomResourceDefinition",
                              Name: "ins-ltwf7.cluster.com",
                              Status: "NotPresent",
                              Message: "CRD is not present",
                              UUID: "",
                              Dependents: nil,
                          },
                      ],
                      CertsLastUpdated: nil,
                      CertsRotateAt: nil,
                  },
              },
          ]
      to be empty
      /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/util_test.go:496
      
    meandurationsec: 94.664752
  • classname: End-to-end
    name: 'Operator should surface components in its status'
    counts: 1
    details:
    • count: 1
      error: |4-

      /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/operator_test.go:69
      Unexpected error:
          <context.deadlineExceededError>: {}
          context deadline exceeded
      occurred
      /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/util_test.go:817
      
    meandurationsec: 61.297442
skippedtests:
  • classname: End-to-end
    name: 'Subscriptions create required objects from Catalogs Given a Namespace
    when a CatalogSource is created with a bundle that contains prometheus objects
    creating a subscription using the CatalogSource should install the operator
    successfully
    '
    counts: 1
    details: []
    meandurationsec: 23.122268
  • classname: End-to-end
    name: 'Subscriptions create required objects from Catalogs Given a Namespace
    when a CatalogSource is created with a bundle that contains prometheus objects
    creating a subscription using the CatalogSource should have created the expected
    prometheus objects
    '
    counts: 1
    details: []
    meandurationsec: 10.146207
  • classname: End-to-end
    name: 'Subscription updates existing install plan'
    counts: 1
    details: []
    meandurationsec: 0.064719
  • classname: End-to-end
    name: 'Catalog image update'
    counts: 1
    details: []
    meandurationsec: 0.147924

@sjenning
Contributor Author

Operator is now recreating on delete and I think the unit test should be fixed now as well
/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 15, 2020
Member

@njhale njhale left a comment

Wow, awesome work! Thanks again for finding this and patching it so quickly!

I have one last comment before I sign off. Let me know what you think.

@njhale
Member

njhale commented Oct 15, 2020

/approve

@openshift-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ecordell, njhale, sjenning

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ecordell
Member

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 19, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@sjenning
Contributor Author

/retest

2 similar comments

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

12 similar comments

@openshift-merge-robot openshift-merge-robot merged commit 8979865 into operator-framework:master Oct 20, 2020
@openshift-ci-robot
Copy link
Collaborator

@sjenning: All pull requests linked via external trackers have merged:

Bugzilla bug 1888073 has been moved to the MODIFIED state.

In response to this:

Bug 1888073: prevent no-op hotlooping on Operators

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sjenning
Contributor Author

/retest

@sjenning
Contributor Author

/cherry-pick release-4.6

@openshift-cherrypick-robot

@sjenning: only operator-framework org members may request cherry picks. You can still do the cherry-pick manually.

In response to this:

/cherry-pick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kevinrizza
Member

/cherry-pick release-4.6

@openshift-cherrypick-robot

@kevinrizza: new pull request created: #1822

In response to this:

/cherry-pick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
10 participants