Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Metrics are generated for OLM managed resources Given an OperatorGroup that supports all namespaces when a CSV is created when the OLM pod restarts CSV metric is preserved" #2441

Closed
Tracked by #2401
timflannagan opened this issue Nov 9, 2021 · 6 comments · Fixed by #2530
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. triaged Issue has been considered by a member of the OLM community

Comments

@timflannagan
Copy link
Contributor

timflannagan commented Nov 9, 2021

Metrics are generated for OLM managed resources
/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/metrics_e2e_test.go:32
  Given an OperatorGroup that supports all namespaces
  /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/metrics_e2e_test.go:45
    when a CSV is created
    /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/metrics_e2e_test.go:121
      when the OLM pod restarts
      /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/metrics_e2e_test.go:147
        CSV metric is preserved [It]
        /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/metrics_e2e_test.go:151

        Expected
            <[]e2e.Metric | len:2, cap:2>: [
                {
                    Family: "csv_upgrade_count",
                    Labels: nil,
                    Value: 0,
                },
                {Family: "csv_count", Labels: nil, Value: 0},
            ]
        to contain element matching
            <*e2e.LikeMetricMatcher | 0xc002120420>: {
                Predicates: [
                    {
                        f: 0x2be9f80,
                        name: "WithFamily(csv_succeeded)",
                    },
                    {
                        f: 0x2bea000,
                        name: "WithLabel(name=csv-test-g48g2-stable)",
                    },
                    {f: 0x2bea240, name: "WithValue(1)"},
                ],
            }

        /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/metrics_e2e_test.go:152

        Full Stack Trace
        github.com/operator-framework/operator-lifecycle-manager/test/e2e.glob..func12.2.3.4.2()
        	/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/metrics_e2e_test.go:152 +0x32b
        github.com/operator-framework/operator-lifecycle-manager/test/e2e.TestEndToEnd(0xc000381380)
        	/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/e2e_test.go:59 +0x296
        testing.tRunner(0xc000381380, 0x343c030)
        	/opt/hostedtoolcache/go/1.16.10/x64/src/testing/testing.go:1193 +0xef
        created by testing.(*T).Run
        	/opt/hostedtoolcache/go/1.16.10/x64/src/testing/testing.go:1238 +0x2b3
@timflannagan timflannagan mentioned this issue Nov 9, 2021
43 tasks
@timflannagan
Copy link
Contributor Author

@timflannagan timflannagan added the kind/bug Categorizes issue or PR as related to a bug. label Nov 9, 2021
@timflannagan
Copy link
Contributor Author

timflannagan commented Nov 9, 2021

Ran into this error in #2438.

@timflannagan
Copy link
Contributor Author

Note: last time we had debugged this failure, it appears to be a problem with the actual test assertions being made vs. functionality that isn't working correctly in OLM. We may need to update the test logic for checking when a deployment is considered "ready" after recycling the catalog-operator to test this behavior as there may be inadequate checks for readiness, leading to the test code checking for metric presence before the new catalog-operator pod has established a metrics endpoint, leading to false positives.

@timflannagan
Copy link
Contributor Author

It looks like there's already an existing issue for this that didn't come up when I had search for it (and I didn't link it properly in the overall CI tracking issue): #2390 (comment). Going to close out the pre-existing one, as this one now has more context available and action runs associated with the error.

@timflannagan
Copy link
Contributor Author

Copy-pasting that issue's description to consolidate issues:


#2216 had fixed a bug where the csv_succeeded was lost between deployment pod restarts. In those changes, a new metric e2e test was created that restarts (e.g. scales down/scales back up) to test whether the metric was retained after pod restarts:

			When("the OLM pod restarts", func() {
				BeforeEach(func() {
					restartDeploymentWithLabel(c, "app=olm-operator")
				})
				It("CSV metric is preserved", func() {
					Expect(getMetricsFromPod(c, getPodWithLabel(c, "app=olm-operator"))).To(
						ContainElement(LikeMetric(WithFamily("csv_succeeded"), WithName(csv.Name), WithValue(1))),
					)
				})
			})

It looks like the restartDeploymentWithLabel(...) doesn't have enough safeguards to verify that the restarted deployment is ready and available, leading to issues running this e2e test on more bloated clusters, as we're attempting to grab the metric before the metric endpoint has been setup and ready to serve traffic.

@timflannagan timflannagan added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. and removed kind/bug Categorizes issue or PR as related to a bug. labels Nov 9, 2021
@njhale njhale self-assigned this Nov 11, 2021
@njhale njhale added the triaged Issue has been considered by a member of the OLM community label Nov 11, 2021
@timflannagan timflannagan linked a pull request Dec 17, 2021 that will close this issue
5 tasks
@timflannagan
Copy link
Contributor Author

It looks like I can't assign @akihikokuroda to this ticket for whatever reason - #2530 is the PR attempting to fix this e2e flake.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. triaged Issue has been considered by a member of the OLM community
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants