Run bundle and upgrade bundle does not work when the bundles is not added to default channel #5773

Closed
camilamacedo86 opened this issue May 18, 2022 · 20 comments · Fixed by #6042
Labels
kind/bug Categorizes issue or PR as related to a bug.

@camilamacedo86
Contributor

camilamacedo86 commented May 18, 2022

Bug Report

The run bundle and run bundle-upgrade commands fail when the given bundle is not configured to be in its default channel, that is, when the default channel is not one of the bundle's channels:

operators.operatorframework.io.bundle.channel.default.v1: alpha
operators.operatorframework.io.bundle.channels.v1: mce-2.0

By default, the commands create a new index and try to add the bundle to it. So, when the SDK calls OPM, it fails in:

https://github.com/operator-framework/operator-registry/blob/v1.22.0/pkg/sqlite/load.go#L426-L428

https://github.com/operator-framework/operator-registry/blob/fd85a98cd00fdd70e30ce6e7076ea37e2583e724/pkg/sqlite/loadprocs.go#L118-L131

What did you do?

I ran the following:

$ operator-sdk run bundle quay.io/operatorhubio/hive-operator:v2.5.3508-6cb94c6
INFO[0014] Successfully created registry pod: quay-io-operatorhubio-hive-operator-v2-5-3508-6cb94c6 
INFO[0014] Created CatalogSource: hive-operator-catalog 
INFO[0014] OperatorGroup "operator-sdk-og" created      
INFO[0014] Created Subscription: hive-operator-v2-5-3508-6cb94c6-sub 
FATA[0120] Failed to run bundle: install plan is not available for the subscription hive-operator-v2-5-3508-6cb94c6-sub: timed out waiting for the condition 

Then, checking the registry pod logs:

$ kubectl logs pod/quay-io-operatorhubio-hive-operator-v2-5-3508-6cb94c6 
time="2022-05-11T00:46:00Z" level=warning msg="\x1b[1;33mDEPRECATION NOTICE:\nSqlite-based catalogs and their related subcommands are deprecated. Support for\nthem will be removed in a future release. Please migrate your catalog workflows\nto the new file-based catalog format.\x1b[0m"
time="2022-05-11T00:46:00Z" level=info msg="adding to the registry" bundles="[quay.io/operatorhubio/hive-operator:v2.5.3508-6cb94c6]"
time="2022-05-11T00:46:01Z" level=info msg="Could not find optional dependencies file" file=bundle_tmp1603466453/metadata load=annotations with=./bundle_tmp1603466453
time="2022-05-11T00:46:01Z" level=info msg="Could not find optional properties file" file=bundle_tmp1603466453/metadata load=annotations with=./bundle_tmp1603466453
time="2022-05-11T00:46:01Z" level=info msg="Could not find optional dependencies file" file=bundle_tmp1603466453/metadata load=annotations with=./bundle_tmp1603466453
time="2022-05-11T00:46:01Z" level=info msg="Could not find optional properties file" file=bundle_tmp1603466453/metadata load=annotations with=./bundle_tmp1603466453
time="2022-05-11T00:46:01Z" level=error msg="permissive mode disabled" bundles="[quay.io/operatorhubio/hive-operator:v2.5.3508-6cb94c6]" error="error loading bundle into db: FOREIGN KEY constraint failed"
Error: error loading bundle into db: FOREIGN KEY constraint failed
Usage:
  opm registry add [flags]

We also hit the same issue when using operator-sdk run bundle-upgrade; see: https://github.com/k8s-operatorhub/community-operators/runs/6364587418?check_suite_focus=true#step:3:7120 (More info: k8s-operatorhub/community-operators#1195)

What did you expect to see?

The run bundle and run bundle-upgrade commands working.

What did you see instead? Under which circumstances?

The commands fail when the bundle is not shipped in its default channel. (The following issues were closed in favor of this one so we can try to centralize the info.)

Possible Solution

The SDK commands could replace the default channel provided by the bundle with `` when no index is provided, so that OPM will not try to update it. Unless a user provides an index to the commands, the purpose of the commands would not be affected:

  • The goal of run bundle is only to check whether the bundle can be deployed with OLM, so the default channel is not relevant.
  • The goal of run bundle-upgrade is to check whether an upgrade from the installed bundle to the new one is possible, so unless someone provides an index, the default channel is irrelevant.

Workarounds:

For SDK users that are using it to test the bundle locally

operators.operatorframework.io.bundle.channel.default.v1: alpha
operators.operatorframework.io.bundle.channels.v1: alpha, mce-2.0 # add the default channel to the bundle's channels

For CI/pipelines:

The workaround would be to generate a separate temporary bundle that adds the default channel to the channels list, so that this channel is created in the operator registry. Note that all channels are created or updated before the default channel is set: https://github.com/operator-framework/operator-registry/blob/v1.22.0/pkg/sqlite/load.go#L419-L429.
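
The CI workaround above boils down to making sure the default channel also appears in the channels annotation before the temporary bundle is built. A minimal Go sketch of that normalization step follows. ensureDefaultChannel is a hypothetical helper written for illustration, not part of operator-sdk or opm, and it only works on the two annotation values as plain strings.

```go
package main

import (
	"fmt"
	"strings"
)

// ensureDefaultChannel returns the comma-separated channels value with
// defaultChannel added when it is missing. Hypothetical helper, for
// illustration only.
func ensureDefaultChannel(channels, defaultChannel string) string {
	defaultChannel = strings.TrimSpace(defaultChannel)
	var out []string
	found := false
	for _, c := range strings.Split(channels, ",") {
		c = strings.TrimSpace(c)
		if c == "" {
			continue
		}
		if c == defaultChannel {
			found = true
		}
		out = append(out, c)
	}
	if !found && defaultChannel != "" {
		// Put the default channel first so it is easy to spot.
		out = append([]string{defaultChannel}, out...)
	}
	return strings.Join(out, ",")
}

func main() {
	// Values taken from the annotations shown in the bug report.
	fmt.Println(ensureDefaultChannel("mce-2.0", "alpha"))
}
```

A pipeline could apply the returned value to operators.operatorframework.io.bundle.channels.v1 in a copy of the bundle metadata, leaving the published bundle untouched.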

Additional context

@camilamacedo86
Contributor Author

camilamacedo86 commented May 18, 2022

Hi @rashmigottipati and @jmrodri, I tried to centralise everything into a single task after checking this scenario.

This shows that the changes required in the commands to support FBC might also be an option to solve this scenario, at least when a user does not pass an index in SQLite format as an argument.

c/c @VenkatRamaraju

@camilamacedo86
Contributor Author

@asmacdo @rashmigottipati this is a bug. Could we add the bug label here?
Also, could you please clarify why it needs discussion? What discussion is required about this?

@J0zi

J0zi commented May 26, 2022

Thanks @camilamacedo86 for steering it

@rashmigottipati
Member

rashmigottipati commented Jun 23, 2022

@J0zi @camilamacedo86 PR #5809 was merged into master. This adds support for FBC images and I believe this addition should resolve your issue as well.

@rashmigottipati
Member

I ran the bundle provided in the description against latest master and it successfully installed the CSV.

Below are the logs:
▶ ./build/operator-sdk run bundle quay.io/operatorhubio/hive-operator:v2.5.3508-6cb94c6
INFO[0007] Creating a File-Based Catalog of the bundle "quay.io/operatorhubio/hive-operator:v2.5.3508-6cb94c6"
INFO[0009] Generated a valid File-Based Catalog
INFO[0012] Created registry pod: quay-io-operatorhubio-hive-operator-v2-5-3508-6cb94c6
INFO[0012] Created CatalogSource: hive-operator-catalog
INFO[0012] OperatorGroup "operator-sdk-og" created
INFO[0012] Created Subscription: hive-operator-v2-5-3508-6cb94c6-sub
INFO[0014] Approved InstallPlan install-qzp6x for the Subscription: hive-operator-v2-5-3508-6cb94c6-sub
INFO[0014] Waiting for ClusterServiceVersion "default/hive-operator.v2.5.3508-6cb94c6" to reach 'Succeeded' phase
INFO[0014] Waiting for ClusterServiceVersion "default/hive-operator.v2.5.3508-6cb94c6" to appear
INFO[0025] Found ClusterServiceVersion "default/hive-operator.v2.5.3508-6cb94c6" phase: Pending
INFO[0026] Found ClusterServiceVersion "default/hive-operator.v2.5.3508-6cb94c6" phase: Installing
INFO[0059] Found ClusterServiceVersion "default/hive-operator.v2.5.3508-6cb94c6" phase: Succeeded
INFO[0060] OLM has successfully installed "hive-operator.v2.5.3508-6cb94c6"

@J0zi

J0zi commented Sep 12, 2022

@rashmigottipati #5616 was not fixed.
Still struggling to upgrade

operator-sdk run bundle quay.io/community-operators-pipeline/flux:v0.25.2
operator-sdk run bundle-upgrade quay.io/operator_testing/flux:testing0.25.3 

or

operator-sdk run bundle quay.io/community-operators-pipeline/apicurito:v1.0.2
operator-sdk run bundle-upgrade quay.io/operator_testing/apicurito:testing-apicurito.v1.0.3

or even production

operator-sdk run bundle quay.io/operatorhubio/aqua:v2022.4.3
operator-sdk run bundle-upgrade quay.io/operatorhubio/aqua:v2022.4.4

@everettraven
Contributor

/assign

@everettraven
Contributor

So I did some digging and here are my findings:

When using:

operator-sdk run bundle quay.io/community-operators-pipeline/flux:v0.25.2

It was able to successfully install the operator on the cluster.

When going to upgrade the bundle with:

operator-sdk run bundle-upgrade quay.io/operator_testing/flux:testing0.25.3 

The command stalls and never completes. The bundle is not properly upgraded.

With some further debugging, I determined that operator-sdk run bundle-upgrade ... stalls while rendering the refs during the FBC upgrade, here:

declcfg, err := fbcutil.RenderRefs(ctx, f.Refs, skipTLSVerify, useHTTP)

This filters down to the fbcutil.RenderRefs() function stalling when it calls containerdregistry.NewRegistry() here:

reg, err := containerdregistry.NewRegistry(
	containerdregistry.WithLog(NullLogger()),
	containerdregistry.SkipTLSVerify(skipTLSVerify),
	containerdregistry.WithPlainHTTP(useHTTP))

Debugging even further resulted in noticing that when using the operator-framework/operator-registry library to create the new registry it uses the package https://pkg.go.dev/go.etcd.io/bbolt and attempts to Open() a database and stalls here:
https://github.com/operator-framework/operator-registry/blob/a3c883e9beee343bd55fd73c1447ea5e98459951/pkg/image/containerdregistry/options.go#L71-L75

Debugging even deeper down the dependency tree I found that the point in which everything is stalling is due to the attempt to lock the .db file in the bbolt library here: https://github.com/etcd-io/bbolt/blob/fd5535f71f488dda0915f610b6ca8c77c9ca2c59/db.go#L223-L233

We can see that the flock() function accepts a timeout, and if one is not specified it loops indefinitely, retrying at an interval of 50ms, in an attempt to get a lock on the file. This can be seen in the flock() function's implementation: https://github.com/etcd-io/bbolt/blob/fd5535f71f488dda0915f610b6ca8c77c9ca2c59/bolt_unix.go#L15-L45
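
To make the stall easier to picture, here is a minimal Go sketch of a retry loop with an optional timeout, modeled on the behavior described above. It is not the bbolt source: acquire, errTimeout, and the tryLock callback are hypothetical names, and the callback stands in for the real file-lock system call.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout is returned when the lock could not be taken in time.
var errTimeout = errors.New("timeout")

// acquire calls tryLock every interval until it succeeds.
// A timeout of zero means "retry forever", which is the stall seen in
// run bundle-upgrade when another handle already holds the lock.
func acquire(tryLock func() bool, timeout, interval time.Duration) error {
	start := time.Now()
	for {
		if tryLock() {
			return nil
		}
		// Give up once the next retry would exceed the timeout.
		if timeout != 0 && time.Since(start) > timeout-interval {
			return errTimeout
		}
		time.Sleep(interval)
	}
}

func main() {
	// Simulate a lock that is never released by its current holder.
	alwaysBusy := func() bool { return false }

	// With a timeout the caller gets an error instead of hanging.
	err := acquire(alwaysBusy, 200*time.Millisecond, 50*time.Millisecond)
	fmt.Println("result:", err)
}
```

bbolt exposes this knob through the Timeout field of its Options struct, so a caller that can pass options through to Open() would get an error back instead of waiting indefinitely.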

As far as I could tell, the problem is resolved when the name of the .db file used to create the registry is different (right now it is always the same default value of cache/metadata.db). This makes me think that we will need to make some updates to operator-framework/operator-registry to allow for the option to modify:

  1. The timeout that is used when attempting to create a new registry
  2. The DB file path to be different than the default to prevent file locking issues

Once those are done we can attempt to make modifications to the FBC upgrade logic that takes advantage of the new functionality.
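
As a rough illustration of the second point, the sketch below gives every invocation its own database path by creating a fresh temporary directory. It uses only the Go standard library; newCacheDBPath and the run-bundle-cache- prefix are hypothetical, and how such a path would be handed to the registry is left open here.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// newCacheDBPath creates a fresh temporary directory and returns the path of
// a metadata.db file inside it, plus a cleanup function that removes the
// directory. Hypothetical helper, for illustration only.
func newCacheDBPath() (string, func(), error) {
	dir, err := os.MkdirTemp("", "run-bundle-cache-")
	if err != nil {
		return "", nil, err
	}
	cleanup := func() { _ = os.RemoveAll(dir) }
	return filepath.Join(dir, "metadata.db"), cleanup, nil
}

func main() {
	first, cleanupFirst, err := newCacheDBPath()
	if err != nil {
		panic(err)
	}
	defer cleanupFirst()

	second, cleanupSecond, err := newCacheDBPath()
	if err != nil {
		panic(err)
	}
	defer cleanupSecond()

	// Two registries created this way never contend for the same lock file.
	fmt.Println("distinct paths:", first != second)
}
```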

@jmrodri since you were also looking a bit into this, WDYT?

@everettraven
Contributor

So the problem described in my comment above is definitely an issue at the moment: with a fix applied, I am able to use operator-sdk run bundle and operator-sdk run bundle-upgrade to successfully run and upgrade a valid bundle.

@J0zi I suspect for the production operators you mentioned running:

operator-sdk run bundle quay.io/operatorhubio/aqua:v2022.4.3
operator-sdk run bundle-upgrade quay.io/operatorhubio/aqua:v2022.4.4

That you encountered the issue of the command just hanging. Is this correct?

After fixing the command it no longer stalls and is able to run and upgrade that production operator no problem.

As for the other commands, I noticed something else (after fixing the stalling problem): the images from quay.io/operator_testing/... use an entirely different default channel than the images from quay.io/community-operators-pipeline/.... Because of this, operator-sdk run bundle-upgrade successfully upgrades the bundle and registry pod, but it times out while waiting for an InstallPlan that it can approve. The original Subscription created by operator-sdk run bundle watches the stable channel, whereas the FBC generated by operator-sdk run bundle-upgrade puts the upgrade path in the optest channel. The Subscription is not updated and is still looking for upgrade paths in the stable channel, so nothing happens and the command times out.
I suspect that you aren't meant to be able to start with the community released version of the operator (i.e. quay.io/community-operators-pipeline/flux:v0.25.2) and upgrade to the version used for testing (i.e. quay.io/operator_testing/flux:testing0.25.3), because they deliberately use different default channels (stable and optest respectively).
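
The mismatch described above can be expressed as a simple pre-flight check: an upgrade is only visible to a Subscription when it is published in the channel that Subscription tracks. The Go sketch below is illustrative only; upgradeVisible is a hypothetical helper and the channel names are the ones mentioned in this comment.

```go
package main

import "fmt"

// upgradeVisible reports whether a Subscription tracking subChannel would see
// an upgrade path published in upgradeChannels. Hypothetical helper.
func upgradeVisible(subChannel string, upgradeChannels []string) bool {
	for _, c := range upgradeChannels {
		if c == subChannel {
			return true
		}
	}
	return false
}

func main() {
	// flux: installed from the stable channel, upgrade published in optest.
	fmt.Println("flux upgrade visible:", upgradeVisible("stable", []string{"optest"}))

	// Same channel on both sides, so the Subscription can pick up the upgrade.
	fmt.Println("matching channels:", upgradeVisible("stable", []string{"stable"}))
}
```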

I hope this makes sense and clears up any confusion - if not, please let me know and I can put together a more detailed response with examples. I will work on fixing the issue where the operator-sdk run bundle-upgrade command stalls indefinitely, but once that is fixed it should be working as expected.

@J0zi

J0zi commented Sep 29, 2022

So we are waiting for another release of operator-sdk to test it out and we could be unblocked then.

@J0zi

J0zi commented Oct 28, 2022

@everettraven @rashmigottipati please reopen

I have tested it and cannot say whether the upgrade issue was solved, because even run bundle is now broken. So we cannot test operator upgrades and this remains a blocker. We cannot run any bundle at all.

➜ operator-sdk run bundle quay.io/community-operators-pipeline/flux:v0.25.2   
INFO[0013] Creating a File-Based Catalog of the bundle "quay.io/community-operators-pipeline/flux:v0.25.2" 
INFO[0014] Generated a valid File-Based Catalog         
INFO[0032] Created registry pod: quay-io-community-operators-pipeline-flux-v0-25-2 
INFO[0032] Created CatalogSource: flux-catalog          
INFO[0032] OperatorGroup "operator-sdk-og" created      
INFO[0032] Created Subscription: flux-v0-25-2-sub       
INFO[0034] Approved InstallPlan install-92w77 for the Subscription: flux-v0-25-2-sub 
INFO[0034] Waiting for ClusterServiceVersion "default/flux.v0.25.2" to reach 'Succeeded' phase 
INFO[0034]   Waiting for ClusterServiceVersion "default/flux.v0.25.2" to appear 
INFO[0082]   Found ClusterServiceVersion "default/flux.v0.25.2" phase: Pending 
INFO[0085]   Found ClusterServiceVersion "default/flux.v0.25.2" phase: InstallReady 
INFO[0086]   Found ClusterServiceVersion "default/flux.v0.25.2" phase: Installing 
FATA[0120] Failed to run bundle: error waiting for CSV to install: timed out waiting for the condition 

➜  operator-sdk run bundle quay.io/operatorhubio/aqua:v2022.4.3
INFO[0013] Creating a File-Based Catalog of the bundle "quay.io/operatorhubio/aqua:v2022.4.3" 
INFO[0014] Generated a valid File-Based Catalog         
INFO[0021] Created registry pod: quay-io-operatorhubio-aqua-v2022-4-3 
INFO[0021] Created CatalogSource: aqua-catalog          
INFO[0021] Created Subscription: aqua-operator-v2022-4-3-sub 
INFO[0024] Approved InstallPlan install-25nf7 for the Subscription: aqua-operator-v2022-4-3-sub 
INFO[0024] Waiting for ClusterServiceVersion "default/aqua-operator.v2022.4.3" to reach 'Succeeded' phase 
INFO[0024]   Waiting for ClusterServiceVersion "default/aqua-operator.v2022.4.3" to appear 
INFO[0056]   Found ClusterServiceVersion "default/aqua-operator.v2022.4.3" phase: Failed 
FATA[0056] Failed to run bundle: error waiting for CSV to install: csv failed: reason: "UnsupportedOperatorGroup", message: "AllNamespaces InstallModeType not supported, cannot configure to watch all namespaces" 

➜  operator-sdk run bundle quay.io/community-operators-pipeline/apicurito:v1.0.2
INFO[0020] Creating a File-Based Catalog of the bundle "quay.io/community-operators-pipeline/apicurito:v1.0.2" 
INFO[0021] Generated a valid File-Based Catalog         
INFO[0028] Created registry pod: quay-io-community-operators-pipeline-apicurito-v1-0-2 
INFO[0028] Created CatalogSource: apicurito-catalog     
INFO[0028] Created Subscription: apicurito-v1-0-2-sub   
INFO[0031] Approved InstallPlan install-wkgf2 for the Subscription: apicurito-v1-0-2-sub 
INFO[0031] Waiting for ClusterServiceVersion "default/apicurito.v1.0.2" to reach 'Succeeded' phase 
INFO[0031]   Waiting for ClusterServiceVersion "default/apicurito.v1.0.2" to appear 
INFO[0063]   Found ClusterServiceVersion "default/apicurito.v1.0.2" phase: Failed 
FATA[0063] Failed to run bundle: error waiting for CSV to install: csv failed: reason: "UnsupportedOperatorGroup", message: "AllNamespaces InstallModeType not supported, cannot configure to watch all namespaces" 

operator-sdk version                                                         
operator-sdk version: "v1.25.0", commit: "3d4eb4b2de4b68519c8828f2289c2014979ccf2a", kubernetes version: "1.25.0", go version: "go1.19.2", GOOS: "linux", GOARCH: "amd64"

To fix all the issues, the following should work: #5773 (comment)

@everettraven
Contributor

Reopening as per @J0zi

@everettraven reopened this Oct 28, 2022
@everettraven
Contributor

@J0zi I will take another look at this and report back what I find

@everettraven
Contributor

@J0zi So I took a look at this and ran all the same commands you did and for the most part ran into the same errors. The only exception I found was that I was able to successfully install flux with operator-sdk run bundle quay.io/community-operators-pipeline/flux:v0.25.2 with a fresh KinD cluster.

My suspicion is that the subsequent operator-sdk run bundle ... commands are failing because a single OperatorGroup resource is created for running all operators in the same namespace. The error FATA[0056] Failed to run bundle: error waiting for CSV to install: csv failed: reason: "UnsupportedOperatorGroup", message: "AllNamespaces InstallModeType not supported, cannot configure to watch all namespaces" makes me think that, for whatever reason, using the same OperatorGroup every time is causing an issue.

I'm planning to investigate this further by changing the default behavior to create a new OperatorGroup for each operator and see if that resolves it.

To confirm that the aqua operator can run successfully in a namespace without that OperatorGroup, I created a new namespace and then used run bundle to install it:

$ kubectl create ns aqua-operator
namespace/aqua-operator created
$ operator-sdk run bundle quay.io/operatorhubio/aqua:v2022.4.3 -n aqua-operator
INFO[0003] Creating a File-Based Catalog of the bundle "quay.io/operatorhubio/aqua:v2022.4.3" 
INFO[0004] Generated a valid File-Based Catalog         
INFO[0006] Created registry pod: quay-io-operatorhubio-aqua-v2022-4-3 
INFO[0006] Created CatalogSource: aqua-catalog          
INFO[0006] OperatorGroup "operator-sdk-og" created      
INFO[0007] Created Subscription: aqua-operator-v2022-4-3-sub 
INFO[0010] Approved InstallPlan install-5khks for the Subscription: aqua-operator-v2022-4-3-sub 
INFO[0010] Waiting for ClusterServiceVersion "aqua-operator/aqua-operator.v2022.4.3" to reach 'Succeeded' phase 
INFO[0010]   Waiting for ClusterServiceVersion "aqua-operator/aqua-operator.v2022.4.3" to appear 
INFO[0019]   Found ClusterServiceVersion "aqua-operator/aqua-operator.v2022.4.3" phase: Pending 
INFO[0023]   Found ClusterServiceVersion "aqua-operator/aqua-operator.v2022.4.3" phase: Installing 
INFO[0029]   Found ClusterServiceVersion "aqua-operator/aqua-operator.v2022.4.3" phase: Succeeded 
INFO[0029] OLM has successfully installed "aqua-operator.v2022.4.3"

@everettraven
Contributor

@J0zi So doing another bit of investigation, it seems it is not possible to have multiple OperatorGroups in a single namespace or else all ClusterServiceVersions will enter a failed state as mentioned in the OLM OperatorGroup Documentation.

This makes me think that installing all of these particular operators in the same namespace would have failed anyway, because the flux operator supports the AllNamespaces install mode while both aqua and apicurito do not.

I think the solution in this case is to install both the aqua and apicurito operators into a different namespace so that the OperatorGroup created by operator-sdk run bundle is configured to work with their supported install modes.
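
The reasoning above can be sketched as a small compatibility check between an OperatorGroup and the install modes an operator supports. This is a simplified model written for illustration, not OLM's implementation: requiredInstallMode and canJoin are hypothetical, an empty target list stands for "all namespaces", and the install mode sets below are assumptions based on this discussion rather than values read from the real CSVs.

```go
package main

import "fmt"

// requiredInstallMode returns the install mode an operator needs in order to
// use an OperatorGroup that lives in ogNamespace and targets the given
// namespaces. Simplified model; an empty list means "all namespaces".
func requiredInstallMode(ogNamespace string, targets []string) string {
	switch {
	case len(targets) == 0:
		return "AllNamespaces"
	case len(targets) == 1 && targets[0] == ogNamespace:
		return "OwnNamespace"
	case len(targets) == 1:
		return "SingleNamespace"
	default:
		return "MultiNamespace"
	}
}

// canJoin reports whether an operator with the given supported install modes
// can be installed under the OperatorGroup.
func canJoin(supported map[string]bool, ogNamespace string, targets []string) bool {
	return supported[requiredInstallMode(ogNamespace, targets)]
}

func main() {
	// Assumed install mode sets, for illustration only.
	flux := map[string]bool{"AllNamespaces": true}
	aqua := map[string]bool{"OwnNamespace": true, "SingleNamespace": true}

	// A group in "default" that watches all namespaces, as left behind by flux.
	fmt.Println("flux, shared group:", canJoin(flux, "default", nil))
	fmt.Println("aqua, shared group:", canJoin(aqua, "default", nil))

	// A dedicated namespace whose group targets only itself.
	fmt.Println("aqua, own namespace:", canJoin(aqua, "aqua-operator", []string{"aqua-operator"}))
}
```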

I hope this helps!

@J0zi

J0zi commented Oct 31, 2022

@everettraven thank you very much for your investigation. I tested multiple operators and upgrade is working. So we can implement it to our pipelines :)
Thank you again, you can close the issue.

@everettraven
Contributor

Closing as per #5773 (comment)

@J0zi

J0zi commented Nov 4, 2022

@everettraven we encountered the following issue with upgrade:

operator-sdk run bundle quay.io/operatorhubio/strimzi-kafka-operator:v0.31.1 -n testupgrade --skip-tls-verify
...
INFO[0119] OLM has successfully installed "strimzi-cluster-operator.v0.31.1" 

operator-sdk run bundle-upgrade quay.io/operatorhubio/strimzi-kafka-operator:v0.32.0 -n testupgrade --skip-tls-verify
INFO[0001] Found existing subscription with name strimzi-cluster-operator-v0-31-1-sub and namespace testupgrade 
INFO[0001] Found existing catalog source with name strimzi-kafka-operator-catalog and namespace testupgrade 
INFO[0014] Generated a valid Upgraded File-Based Catalog 
FATA[0014] Failed to run bundle upgrade: update catalog error: error creating registry: error building registry pod definition: configMap error: error updating ConfigMap: ConfigMap "operator-sdk-run-bundle-config" is invalid: []: Too long: must have at most 1048576 bytes

@everettraven
Contributor

@J0zi This issue specifically seems to be related to the bundle being too large to fit into a ConfigMap. With the new FBC format, we made some changes to operator-sdk run bundle-upgrade so that it attempts to mount the FBC for the upgrade as a ConfigMap.

This is something that we have seen before. IIRC we did implement a change that helps alleviate this slightly but it isn't perfect and is still prone to this problem.
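
For readers who hit the same error, the sketch below shows one generic way to reason about the limit: check the payload against the 1048576 bytes quoted in the error message, and see how much gzip compression helps. This is an assumption-laden illustration, not a description of what operator-sdk does. gzipBytes and fitsInConfigMap are hypothetical helpers, the sample data is synthetic, and a real ConfigMap also spends part of the limit on metadata and encoding overhead.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"strings"
)

// maxConfigMapBytes is the limit quoted in the error message above.
const maxConfigMapBytes = 1048576

// gzipBytes compresses data with gzip at the default compression level.
func gzipBytes(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	// Close flushes the remaining compressed data into buf.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// fitsInConfigMap ignores metadata overhead and only compares payload size.
func fitsInConfigMap(data []byte) bool {
	return len(data) <= maxConfigMapBytes
}

func main() {
	// Synthetic stand-in for a large, repetitive FBC document.
	entry := `{"schema":"olm.bundle","name":"example"}` + "\n"
	fbc := []byte(strings.Repeat(entry, 40000))

	compressed, err := gzipBytes(fbc)
	if err != nil {
		panic(err)
	}
	fmt.Println("raw fits:", fitsInConfigMap(fbc), "size:", len(fbc))
	fmt.Println("gzip fits:", fitsInConfigMap(compressed), "size:", len(compressed))
}
```

Real catalogs are less repetitive than this sample, so the gain from compression will be smaller, and very large bundles may still need to be split across several ConfigMaps or served another way.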

Would you mind opening a new issue with this problem so we can track it separately and have it show up in our next issue triage meeting? This will help it get some more visibility and allow us to have some further discussion on how we can attempt to resolve this.

@J0zi

J0zi commented Nov 7, 2022

@everettraven we will continue here #6144
