This repository has been archived by the owner on Jul 11, 2023. It is now read-only.

FatalInitializationError osm upgrade #2491

Closed
ritazh opened this issue Feb 9, 2021 · 6 comments
Labels
kind/discussion Discussing a topic

Comments

@ritazh
Contributor

ritazh commented Feb 9, 2021

Bug description:

When upgrading osm from v0.6.1 to v0.7.0, I am getting the following errors on pod start:

2s          Warning   Unhealthy                  pod/osm-controller-55ff565d8c-r6sj9    Liveness probe failed: Get http://10.244.0.10:9091/health/alive: dial tcp 10.244.0.10:9091: connect: connection refused
5s          Warning   Unhealthy                  pod/osm-controller-55ff565d8c-r6sj9    Readiness probe failed: Get http://10.244.0.10:9091/health/ready: dial tcp 10.244.0.10:9091: connect: connection refused
2s          Normal    Killing                    pod/osm-controller-55ff565d8c-r6sj9    Container osm-controller failed liveness probe, will be restarted
2s          Warning   FatalInitializationError   pod/osm-controller-55ff565d8c-r6sj9    Error creating MeshSpec

You may also see errors like this from the osm controller pod:

Failed to list *v1alpha4.TCPRoute: the server could not find the requested resource (get tcproutes.specs.smi-spec.io)

Affected area (please mark with X where applicable):

  • Install [ ]
  • SMI Traffic Access Policy [ ]
  • SMI Traffic Specs Policy [ ]
  • SMI Traffic Split Policy [ ]
  • Permissive Traffic Policy [ ]
  • Ingress [ ]
  • Egress [ ]
  • Envoy Control Plane [ ]
  • CLI Tool [ ]
  • Metrics [ ]
  • Certificate Management [ ]
  • Sidecar Injection [ ]
  • Logging [ ]
  • Debugging [ ]
  • Tests [ ]
  • CI System [ ]

Expected behavior:
Successfully start osm deployment

Steps to reproduce the bug (as precisely as possible):
Helm install osm chart v0.6.1, then upgrade to osm chart v0.7.0.
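Roughly, something like the following (a sketch only: the chart repo URL, release name, and namespace below are assumed defaults, not copied from my setup):

# sketch only: repo URL, release name, and namespace are assumptions
helm repo add osm https://openservicemesh.github.io/osm
helm repo update
helm install osm osm/osm --version 0.6.1 --namespace osm-system --create-namespace
helm upgrade osm osm/osm --version 0.7.0 --namespace osm-system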

How was OSM installed?:
Helm

Anything else we need to know?:

Environment:

  • OSM version (use osm version):
  • Kubernetes version (use kubectl version):
  • Size of cluster (number of worker nodes in the cluster):
  • Others:
@ritazh ritazh added the kind/bug Something isn't working label Feb 9, 2021
@ritazh
Contributor Author

ritazh commented Feb 9, 2021

OSM version v0.7.0 supports the latest versions of SMI CRDs. Because Helm does not manage CRDs beyond the initial installation, special care needs to be taken during upgrades when CRDs are changed. To upgrade osm, refer to this upgrade guide. Specifically, to upgrade from an older version of SMI CRDs to the latest, delete outdated CRDs.
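For anyone hitting this, a minimal sketch of the "delete outdated CRDs" step. The CRD names below are the standard SMI ones; treat the list as illustrative and confirm it against the upgrade guide:

# illustrative only -- confirm the exact CRD list against the upgrade guide
# deleting a CRD also deletes its custom resources, so export any SMI policies you need first
kubectl delete crd traffictargets.access.smi-spec.io
kubectl delete crd httproutegroups.specs.smi-spec.io
kubectl delete crd tcproutes.specs.smi-spec.io
kubectl delete crd trafficsplits.split.smi-spec.io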

If you've already upgraded without deleting the CRDs, you can fix your deployment by following this troubleshooting guide.

@ritazh ritazh closed this as completed Feb 9, 2021
@ritazh ritazh reopened this Feb 9, 2021
@ritazh
Contributor Author

ritazh commented Feb 9, 2021

Reopening this issue to discuss what the desirable experience is when CRDs are out of date. Until the conversion webhook is in place, should the controller crash?

@shashankram
Member

Reopening this issue to discuss what the desirable experience is when CRDs are out of date. Until the conversion webhook is in place, should the controller crash?

The controller exits because the K8s API version for the resources it expects is not available in the cluster. This is not a crash, but a voluntary exit by the controller because it can't function without the necessary newer CRDs.

Currently, we do not have the capability to support multiple API versions for a resource, so having the controller silently keep running instead of exiting is not an option.

Documentation around CRD upgrades: https://github.com/openservicemesh/osm/blob/main/docs/content/docs/upgrade_guide.md#crd-upgrades
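For anyone debugging this, a quick way to check which SMI API versions the cluster actually serves (the tcproutes resource name comes from the error in the report; the jsonpath assumes the CRD lists its served versions under .spec.versions):

# resource names taken from the error above; jsonpath assumes apiextensions .spec.versions
kubectl api-resources --api-group=specs.smi-spec.io
kubectl get crd tcproutes.specs.smi-spec.io -o jsonpath='{.spec.versions[*].name}'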

@ritazh, what do you think?

@ritazh
Contributor Author

ritazh commented Mar 2, 2021

FWIW, I think deployment failure is the right experience when the new version of the CRD is not in the cluster. Otherwise the operator won't know there is an issue. The error in the controller pod log helped me understand I'm missing the right version. In addition to users fishing through the pod log, do we generate any K8s events for this type of error?

A separate question: if I'm not using the SMI policy mode feature, should this still fail?

@shashankram
Member

FWIW, I think deployment failure is the right experience when the new version of the CRD is not in the cluster. Otherwise the operator won't know there is an issue. The error in the controller pod log helped me understand I'm missing the right version. In addition to users fishing through the pod log, do we generate any K8s events for this type of error?

K8s events are generated for Fatal events from the controller, such as this one.
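For visibility without digging through pod logs, the event can be filtered by reason (the osm-system namespace is an assumption; use whatever namespace the controller runs in):

# namespace is an assumption for the default install
kubectl get events -n osm-system --field-selector reason=FatalInitializationError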

A separate question: if I'm not using the SMI policy mode feature, should this still fail?

Per current design, yes, because informer resource initialization happens irrespective of the traffic policy mode. Even if we deferred resource initialization based on the traffic policy mode, one could update the mode in the ConfigMap and cause the controller to exit. This approach is simple and applies to other K8s resources as well - Ingress, cert-manager.io, etc.
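For context, the mode referred to here lives in the osm ConfigMap. A hypothetical sketch of inspecting and flipping it; the ConfigMap name, namespace, and key are assumptions based on the v0.x docs, so verify them against your install:

# names below (osm-config, osm-system, permissive_traffic_policy_mode) are assumptions
kubectl get configmap osm-config -n osm-system -o yaml
kubectl patch configmap osm-config -n osm-system --type merge -p '{"data":{"permissive_traffic_policy_mode":"true"}}'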

If there is a use case to defer SMI resource initialization, we could consider it.

@shashankram shashankram added kind/discussion Discussing a topic and removed kind/bug Something isn't working labels Mar 2, 2021
@shashankram
Member

Closing based on #2491 (comment) and clarification provided in #2491 (comment).

With #2737, we will hopefully never run into this issue. But the desired behavior at the moment is to ensure all components within osm-controller can initialize correctly at startup; if they cannot, the controller will exit with a FatalInitializationError.
