wait for catalogsource status ready before creating subscription #2601

Merged

akihikokuroda
Member

Signed-off-by: akihikokuroda akihikokuroda2020@gmail.com

Description of the change:
The test doesn't wait for the catalogsource gRPC connection to reach the ready status before it creates the subscription for the catalogsource.

Motivation for the change:
Closes #2600
Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive

Signed-off-by: akihikokuroda <akihikokuroda2020@gmail.com>
@@ -95,6 +95,8 @@ var _ = Describe("Subscription", func() {
}

_, teardown = createInternalCatalogSource(ctx.Ctx().KubeClient(), ctx.Ctx().OperatorClient(), "test-catalog", generatedNamespace.GetName(), packages, crds, csvs)
_, err := fetchCatalogSourceOnStatus(ctx.Ctx().OperatorClient(), "test-catalog", generatedNamespace.GetName(), catalogSourceRegistryPodSynced)
Contributor

Nice catch - I was under the impression that we made a clean sweep of anywhere we instantiate a grpc-based CatalogSource, and then subsequently create a Subscription, but this one feels easy to catch given the setup isn't super readable. It would be nice to avoid having to hardcode the "test-catalog" in two places here, but I won't block the PR for this.

Contributor

Actually I'm somewhat second guessing this the more I think about it. I haven't played around with this locally, but looking at that test case failure output, it's not immediately clear to me why we need to simply wait for the CatalogSource to be reporting a "ready" state. Were you able to reproduce this test case failure locally?

Member Author

I didn't reproduce this error locally, but I saw the following in the catalog-operator log of the CI e2e failure:

2022-01-19T20:00:24.250526731Z stderr F time="2022-01-19T20:00:24Z" level=debug msg="syncing catsrc" id=Zfz5K source=test-catalog
2022-01-19T20:00:24.250530131Z stderr F time="2022-01-19T20:00:24Z" level=debug msg="checking catsrc configmap state" id=Zfz5K source=test-catalog
2022-01-19T20:00:24.251445279Z stderr F time="2022-01-19T20:00:24Z" level=debug msg="check registry server healthy: true" id=Zfz5K source=test-catalog
2022-01-19T20:00:24.25145768Z stderr F time="2022-01-19T20:00:24Z" level=debug msg="registry state good" id=Zfz5K source=test-catalog
2022-01-19T20:00:28.931802007Z stderr F time="2022-01-19T20:00:28Z" level=debug msg="Got source event: grpc.SourceState{Key:registry.CatalogKey{Name:\"test-catalog\", Namespace:\"subscription-e2e-gcqhv\"}, State:3}"
2022-01-19T20:00:28.931816007Z stderr F time="2022-01-19T20:00:28Z" level=info msg="state.Key.Namespace=subscription-e2e-gcqhv state.Key.Name=test-catalog state.State=TRANSIENT_FAILURE"
2022-01-19T20:00:28.931824208Z stderr F time="2022-01-19T20:00:28Z" level=debug msg="syncing catsrc" id=j7VvG source=test-catalog
2022-01-19T20:00:28.931827808Z stderr F time="2022-01-19T20:00:28Z" level=debug msg="checking catsrc configmap state" id=j7VvG source=test-catalog
2022-01-19T20:00:28.939247402Z stderr F time="2022-01-19T20:00:28Z" level=debug msg="check registry server healthy: true" id=j7VvG source=test-catalog
2022-01-19T20:00:28.939260203Z stderr F time="2022-01-19T20:00:28Z" level=debug msg="registry state good" id=j7VvG source=test-catalog
2022-01-19T20:00:28.956912641Z stderr F I0119 20:00:28.955396       1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"", Name:"subscription-e2e-gcqhv", UID:"dfe83254-ba15-438f-badf-dd3b79c12036", APIVersion:"v1", ResourceVersion:"815", FieldPath:""}): type: 'Warning' reason: 'ResolutionFailed' [error using catalog test-catalog (in namespace subscription-e2e-gcqhv): failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.96.185.140:50051: connect: connection refused", error using catalog operatorhubio-catalog (in namespace operator-lifecycle-manager): failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.96.27.145:50051: connect: connection refused"]
2022-01-19T20:00:28.956949443Z stderr F I0119 20:00:28.956796       1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"", Name:"subscription-e2e-gcqhv", UID:"dfe83254-ba15-438f-badf-dd3b79c12036", APIVersion:"v1", ResourceVersion:"815", FieldPath:""}): type: 'Warning' reason: 'ResolutionFailed' [error using catalog test-catalog (in namespace subscription-e2e-gcqhv): failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.96.185.140:50051: connect: connection refused", error using catalog operatorhubio-catalog (in namespace operator-lifecycle-manager): failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.96.27.145:50051: connect: connection refused"]

This shows that the latest gRPC connection state is TRANSIENT_FAILURE, yet the catalogsource sync still reports "check registry server healthy: true" and "registry state good".
The subscription is then created, issues the list bundles request, and fails with "connection refused".

The catalogsource sync checks whether the registry pod is up and whether the resources for the registry (service, service account, role, rolebinding, etc.) are OK.
The gRPC connection state is tracked separately, so the sync can report a healthy registry while the connection is not yet ready.

Contributor

Thanks - I think that explanation sounds reasonable to me. In any case, this change is harmless so we can always re-open this issue if we misdiagnosed the root cause.

/approve
/lgtm

openshift-ci bot commented Jan 27, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: akihikokuroda, timflannagan


@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 27, 2022
@openshift-merge-robot openshift-merge-robot merged commit 5593195 into operator-framework:master Jan 27, 2022
Development

Successfully merging this pull request may close these issues.

e2e - "should create a Subscription for the latest entry providing the required GVK" failure
3 participants