This repository has been archived by the owner on Jul 11, 2023. It is now read-only.

FatalInitializationError osm upgrade #2491

Closed
ritazh opened this issue Feb 9, 2021 · 6 comments
Labels
kind/discussion Discussing a topic

Comments

@ritazh
Contributor

ritazh commented Feb 9, 2021

Bug description:

When upgrading osm from v0.6.1 to v0.7.0, I am getting the following errors on pod start:

2s          Warning   Unhealthy                  pod/osm-controller-55ff565d8c-r6sj9    Liveness probe failed: Get http://10.244.0.10:9091/health/alive: dial tcp 10.244.0.10:9091: connect: connection refused
5s          Warning   Unhealthy                  pod/osm-controller-55ff565d8c-r6sj9    Readiness probe failed: Get http://10.244.0.10:9091/health/ready: dial tcp 10.244.0.10:9091: connect: connection refused
2s          Normal    Killing                    pod/osm-controller-55ff565d8c-r6sj9    Container osm-controller failed liveness probe, will be restarted
2s          Warning   FatalInitializationError   pod/osm-controller-55ff565d8c-r6sj9    Error creating MeshSpec

You may also see errors like this from the osm controller pod:

Failed to list *v1alpha4.TCPRoute: the server could not find the requested resource (get tcproutes.specs.smi-spec.io)

Affected area (please mark with X where applicable):

  • Install [ ]
  • SMI Traffic Access Policy [ ]
  • SMI Traffic Specs Policy [ ]
  • SMI Traffic Split Policy [ ]
  • Permissive Traffic Policy [ ]
  • Ingress [ ]
  • Egress [ ]
  • Envoy Control Plane [ ]
  • CLI Tool [ ]
  • Metrics [ ]
  • Certificate Management [ ]
  • Sidecar Injection [ ]
  • Logging [ ]
  • Debugging [ ]
  • Tests [ ]
  • CI System [ ]

Expected behavior:
Successfully start osm deployment

Steps to reproduce the bug (as precisely as possible):
Helm install osm chart v0.6.1, then upgrade to osm chart v0.7.0.
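Roughly, something like the following (a sketch only: the chart repo URL, release name, and namespace below are assumed defaults, not copied from my setup):

# sketch only: repo URL, release name, and namespace are assumptions
helm repo add osm https://openservicemesh.github.io/osm
helm repo update
helm install osm osm/osm --version 0.6.1 --namespace osm-system --create-namespace
helm upgrade osm osm/osm --version 0.7.0 --namespace osm-system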

How was OSM installed?:
Helm

Anything else we need to know?:

Environment:

  • OSM version (use osm version):
  • Kubernetes version (use kubectl version):
  • Size of cluster (number of worker nodes in the cluster):
  • Others:
@ritazh ritazh added the kind/bug Something isn't working label Feb 9, 2021
@ritazh
Contributor Author

ritazh commented Feb 9, 2021

OSM version v0.7.0 supports the latest versions of SMI CRDs. Because Helm does not manage CRDs beyond the initial installation, special care needs to be taken during upgrades when CRDs are changed. To upgrade osm, refer to this upgrade guide. Specifically, to upgrade from an older version of SMI CRDs to the latest, delete outdated CRDs.
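For anyone hitting this, a minimal sketch of the "delete outdated CRDs" step. The CRD names below are the standard SMI ones; treat the list as illustrative and confirm it against the upgrade guide:

# illustrative only -- confirm the exact CRD list against the upgrade guide
# deleting a CRD also deletes its custom resources, so export any SMI policies you need first
kubectl delete crd traffictargets.access.smi-spec.io
kubectl delete crd httproutegroups.specs.smi-spec.io
kubectl delete crd tcproutes.specs.smi-spec.io
kubectl delete crd trafficsplits.split.smi-spec.io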

If you've already upgraded without deleting the CRDs, you can fix your deployment by following this troubleshooting guide.

@ritazh ritazh closed this as completed Feb 9, 2021
@ritazh ritazh reopened this Feb 9, 2021
@ritazh
Contributor Author

ritazh commented Feb 9, 2021

Reopening this issue to discuss what the desirable experience is when CRDs are out of date. Until the conversion webhook is in place, should the controller crash?

@shashankram
Member

Reopening this issue to discuss what the desirable experience is when CRDs are out of date. Until the conversion webhook is in place, should the controller crash?

The controller exits because the K8s API version for the resources it expects is not available in the cluster. This is not a crash, but a voluntary exit by the controller because it can't function without the necessary newer CRDs.

Currently, we do not have the capability to support multiple API versions for a resource, so having the controller silently keep running instead of exiting is not an option.

Documentation around CRD upgrades: https://github.com/openservicemesh/osm/blob/main/docs/content/docs/upgrade_guide.md#crd-upgrades
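For anyone debugging this, a quick way to check which SMI API versions the cluster actually serves (the tcproutes resource name comes from the error in the report; the jsonpath assumes the CRD lists its served versions under .spec.versions):

# resource names taken from the error above; jsonpath assumes apiextensions .spec.versions
kubectl api-resources --api-group=specs.smi-spec.io
kubectl get crd tcproutes.specs.smi-spec.io -o jsonpath='{.spec.versions[*].name}'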

@ritazh, what do you think?

@ritazh
Contributor Author

ritazh commented Mar 2, 2021

FWIW, I think deployment failure is the right experience when the new version of the CRD is not in the cluster. Otherwise the operator won't know there is an issue. The error in the controller pod log helped me understand I'm missing the right version. In addition to users fishing through the pod log, do we generate any K8s events for this type of error?

A separate question: if I'm not using the SMI policy mode feature, should this still fail?

@shashankram
Member

FWIW, I think deployment failure is the right experience when the new version of the CRD is not in the cluster. Otherwise the operator won't know there is an issue. The error in the controller pod log helped me understand I'm missing the right version. In addition to users fishing through the pod log, do we generate any K8s events for this type of error?

K8s events are generated for Fatal events from the controller, such as this one.
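For visibility without digging through pod logs, the event can be filtered by reason (the osm-system namespace is an assumption; use whatever namespace the controller runs in):

# namespace is an assumption for the default install
kubectl get events -n osm-system --field-selector reason=FatalInitializationError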

A separate question: if I'm not using the SMI policy mode feature, should this still fail?

Per current design, yes, because informer resource initialization happens irrespective of the traffic policy mode. Even if we deferred resource initialization based on the traffic policy mode, one could update the mode in the ConfigMap and cause the controller to exit. This approach is simple and applies to other K8s resources as well - Ingress, cert-manager.io, etc.
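For context, the mode referred to here lives in the osm ConfigMap. A hypothetical sketch of inspecting and flipping it; the ConfigMap name, namespace, and key are assumptions based on the v0.x docs, so verify them against your install:

# names below (osm-config, osm-system, permissive_traffic_policy_mode) are assumptions
kubectl get configmap osm-config -n osm-system -o yaml
kubectl patch configmap osm-config -n osm-system --type merge -p '{"data":{"permissive_traffic_policy_mode":"true"}}'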

If there is a use case to defer SMI resource initialization, we could consider it.

@shashankram shashankram added kind/discussion Discussing a topic and removed kind/bug Something isn't working labels Mar 2, 2021
@shashankram
Member

Closing based on #2491 (comment) and clarification provided in #2491 (comment).

With #2737, we will hopefully never run into this issue. But the desired behavior at the moment is to ensure all components within osm-controller can initialize correctly at startup; if they cannot, the controller will exit with a FatalInitializationError.
