
bootstrap slows down dramatically with Cilium installed by helm-operator #484

Closed
errordeveloper opened this issue Jan 25, 2021 · 3 comments


errordeveloper commented Jan 25, 2021

Describe the bug

When I install OpenShift with Cilium following the currently published instructions, I get a working cluster.

I have created an operator based on the quay.io/operator-framework/helm-operator:v1.2.0 image. It worked previously, but at some point it stopped working for me. I am not entirely clear what caused this, but something breaks bootstrap and it gets stuck; the cluster installation does eventually succeed.
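For reference, the operator image follows the standard operator-sdk helm scaffold, roughly like this (a minimal sketch; the layout shown is the scaffold default, not copied verbatim from my repository):

# Minimal sketch of a helm-based operator image (operator-sdk scaffold layout)
FROM quay.io/operator-framework/helm-operator:v1.2.0

ENV HOME=/opt/helm
# watches.yaml maps the Cilium CR's group/version/kind to the bundled chart
COPY watches.yaml ${HOME}/watches.yaml
# the Cilium helm chart that the operator reconciles
COPY helm-charts  ${HOME}/helm-charts
WORKDIR ${HOME}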

If I replace the helm template steps from the Cilium docs with copying the operator manifests instead, openshift-install create cluster times out (the errors appear to be slightly different each time).
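For context, the overall flow is roughly the following (the flags and file names below are illustrative, not the exact steps from the Cilium docs):

openshift-install create manifests --dir "${CLUSTER_DIR}"
# either render the chart directly, as the Cilium docs do...
helm repo add cilium https://helm.cilium.io/
helm template cilium cilium/cilium --namespace cilium --version 1.9.1 \
  > "${CLUSTER_DIR}/manifests/cluster-network-06-cilium.yaml"
# ...or, in my case, copy the helm-operator manifests into manifests/ instead,
# then run the installer:
openshift-install create cluster --dir "${CLUSTER_DIR}"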

Once I leave the cluster alone for 1 or 2 hours, it eventually stabilises. It appears that, due to the complexity of the bootstrap process, it's quite easy to end up in a situation where one of the operators is acting up, which can leave the whole system unstable for much longer than expected.

In one of the cases I worked through the following installer error:

ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthRouteCheckEndpointAccessibleController_SyncError::OAuthServiceCheckEndpointAccessibleController_SyncError::OAuthServiceEndpointsCheckEndpointAccessibleController_SyncError::WellKnownReadyController_SyncError: OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded: oauth service endpoints are not ready
OAuthServiceCheckEndpointAccessibleControllerDegraded: Get "https://172.30.234.48:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
OAuthRouteCheckEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.gcp-ocp46-oss191-fix2.ilya-openshift-test-2.cilium.rocks/healthz": dial tcp: lookup oauth-openshift.apps.gcp-ocp46-oss191-fix2.ilya-openshift-test-2.cilium.rocks on 172.30.0.10:53: no such host
WellKnownReadyControllerDegraded: kube-apiserver oauth endpoint https://10.0.0.6:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)

Having checked the DNS right away with dig, I got no answer for oauth-openshift.apps.gcp-ocp46-oss191-fix2.ilya-openshift-test-2.cilium.rocks.
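For completeness, the check was simply a lookup along these lines (the exact dig flags don't matter):

dig +short oauth-openshift.apps.gcp-ocp46-oss191-fix2.ilya-openshift-test-2.cilium.rocks
# no answer came back at this point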

Looking at the ingress operator logs, I confirmed there was a long gap before the record was created:

2021-01-25T15:14:51.914Z	INFO	operator.main	ingress-operator/start.go:52	using operator namespace	{"namespace": "openshift-ingress-operator"}
2021-01-25T15:14:51.915Z	INFO	operator.main	ingress-operator/start.go:52	watching file	{"filename": "/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem"}
I0125 15:14:52.966650       1 request.go:645] Throttling request took 1.028863736s, request: GET:https://172.30.0.1:443/apis/apiextensions.k8s.io/v1?timeout=32s
2021-01-25T15:14:53.670Z	INFO	operator.init.controller-runtime.metrics	manager/manager.go:327	metrics server is starting to listen	{"addr": ":60000"}
2021-01-25T15:14:53.772Z	INFO	operator.init.controller-runtime.manager	manager/manager.go:256	starting metrics server	{"path": "/metrics"}
...
2021-01-25T15:14:54.975Z	INFO	operator.ingress_controller	ingress/deployment.go:78	created router deployment	{"namespace": "openshift-ingress", "name": "router-default"}
...
2021-01-25T15:16:25.354Z	ERROR	operator.ingress_controller	controller/controller.go:235	got retryable error; requeueing	{"after": "1m0s", "error": "IngressController is degraded: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1), LoadBalancerReady=False (LoadBalancerPending: The LoadBalancer service is pending)"}
...

I also checked the DNSRecord with kubectl get DNSRecord -n openshift-ingress-operator default-wildcard -o yaml:

apiVersion: ingress.operator.openshift.io/v1
kind: DNSRecord
metadata:
  creationTimestamp: "2021-01-25T16:19:01Z"
...
status:
  observedGeneration: 1
  zones:
  - conditions:
    - lastTransitionTime: "2021-01-25T16:19:02Z"
      message: The DNS provider succeeded in ensuring the record
      reason: ProviderSuccess
      status: "False"
      type: Failed
    dnsZone:
      id: gcp-ocp46-oss191-fix2-l4kdk-private-zone
  - conditions:
    - lastTransitionTime: "2021-01-25T16:19:03Z"
      message: The DNS provider succeeded in ensuring the record
      reason: ProviderSuccess
      status: "False"
      type: Failed
    dnsZone:
      id: ilya-openshift-test-2-cilium-rocks

Another way to illustrate the problem is the age difference between the control-plane and worker nodes:

NAME                                                         STATUS   ROLES    AGE     VERSION
gcp-ocp46-oss191-fix2-l4kdk-master-0.c.cilium-dev.internal   Ready    master   5h12m   v1.19.0+9c69bdc
gcp-ocp46-oss191-fix2-l4kdk-master-1.c.cilium-dev.internal   Ready    master   5h12m   v1.19.0+9c69bdc
gcp-ocp46-oss191-fix2-l4kdk-master-2.c.cilium-dev.internal   Ready    master   5h12m   v1.19.0+9c69bdc
gcp-ocp46-oss191-fix2-l4kdk-worker-a-rhx2t                   Ready    worker   4h5m    v1.19.0+9c69bdc
gcp-ocp46-oss191-fix2-l4kdk-worker-b-qf5bh                   Ready    worker   4h5m    v1.19.0+9c69bdc
gcp-ocp46-oss191-fix2-l4kdk-worker-c-99khp                   Ready    worker   5h5m    v1.19.0+9c69bdc
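(The listing above is just the standard node listing, captured with something like the following; note the masters are roughly an hour older than two of the workers.)

oc get nodes   # or kubectl get nodes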

I am not sure where to look to determine the component in question. I had noticed the Cilium helm operator misbehaving and crash-looping, but that has been addressed already.

In theory, it's easy to say that with so many operators at work there is a good chance of destabilising bootstrap, but it's hard for me to tell what exactly is holding things up.

Cilium itself appears to be working properly, and installing it without the helm operator works just fine. I suppose the helm operator may create a lot of churn on the API, but that's an instinctive assumption based on the activity it logs, and I'm not sure it's significant enough to actually slow down the API. I also get a sense that the bootstrap API node has limited resources and could easily be overwhelmed, but that's another subjective interpretation.

Version

This is an IPI OCP 4.6.12 install; the same issue occurs on OKD 4.6 and 4.5. If needed, I can provide an OKD log bundle.

How reproducible

I'm able to reproduce this very consistently.

Log bundle

log-bundle-20210125154005.tar.gz


errordeveloper commented Jan 27, 2021

One issue that I came across while debugging this was that bootkube got stuck trying to apply a manifest; it kept producing errors like this:

Jan 27 09:51:30 gcp-ocp46-oss193-cc-94498-bootstrap.c.cilium-dev.internal bootkube.sh[2403]: "cluster-network-04-cilium-test-0001-echo-b-service.yaml": failed to create services.v1./echo-b -n cilium-test: Service "echo-b" is invalid: spec.ports[0].nodePort: Invalid value: 31313: provided port is already allocated

Here is log-bundle-20210127095129.tar.gz that shows what's going on.

The troublesome manifest requests a particular node port; it is part of the Cilium connectivity check suite that I was using in an attempt to rule out networking issues caused by Cilium. I think bootkube should be more robust in the face of errors like this. @smarterclayton is there an open issue for addressing this sort of failure mode? One question to ask is why this does not rely on kubectl apply. The other question is: how can the installer bubble this sort of issue up to the user, so it's clear right away that bootkube is stuck on manifests (that the user has provided)?
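For context, the offending manifest is just a Service that pins a specific nodePort, shaped roughly like this (a sketch; only the name, namespace, and nodePort value come from the error above, the rest is illustrative):

apiVersion: v1
kind: Service
metadata:
  name: echo-b
  namespace: cilium-test
spec:
  type: NodePort
  selector:
    name: echo-b
  ports:
  - port: 8080            # port/selector values are illustrative
    targetPort: 8080
    nodePort: 31313       # fixed nodePort; creation fails if 31313 is already allocated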

@errordeveloper

The issue with the connectivity check manifest I mentioned earlier is not the original issue I've been having. It looks like my original issue was down to a different manifest, but it's similar in nature: Cilium itself gets installed okay, but a non-critical CR manifest (a CSV in this case) fails to apply due to a validation error. Eventually bootkube's bootstrap CVO times out, so all is fine in the end, but it takes the extra hour or whatever the bootkube timeout for the bootstrap CVO is.

Jan 25 16:16:44 gcp-ocp46-oss191-fix2-l4kdk-bootstrap.c.cilium-dev.internal bootkube.sh[2389]: "cluster-network-06-cilium-00014-cilium.v1.9.1-clusterserviceversion.yaml": failed to create clusterserviceversions.v1alpha1.operators.coreos.com/cilium.v1.9.1 -n cilium: ClusterServiceVersion.operators.coreos.com "cilium.v1.9.1" is invalid: spec.maintainers: Invalid value: "string": spec.maintainers in body must be of type array: "string"
Jan 25 16:16:44 gcp-ocp46-oss191-fix2-l4kdk-bootstrap.c.cilium-dev.internal bootkube.sh[2389]: Failed to create "cluster-network-06-cilium-00014-cilium.v1.9.1-clusterserviceversion.yaml" clusterserviceversions.v1alpha1.operators.coreos.com/cilium.v1.9.1 -n cilium: ClusterServiceVersion.operators.coreos.com "cilium.v1.9.1" is invalid: spec.maintainers: Invalid value: "string": spec.maintainers in body must be of type array: "string"
Jan 25 16:16:44 gcp-ocp46-oss191-fix2-l4kdk-bootstrap.c.cilium-dev.internal bootkube.sh[2389]: [#16897] failed to create some manifests:
Jan 25 16:16:44 gcp-ocp46-oss191-fix2-l4kdk-bootstrap.c.cilium-dev.internal bootkube.sh[2389]: "cluster-network-06-cilium-00014-cilium.v1.9.1-clusterserviceversion.yaml": failed to create clusterserviceversions.v1alpha1.operators.coreos.com/cilium.v1.9.1 -n cilium: ClusterServiceVersion.operators.coreos.com "cilium.v1.9.1" is invalid: spec.maintainers: Invalid value: "string": spec.maintainers in body must be of type array: "string"
Jan 25 16:16:44 gcp-ocp46-oss191-fix2-l4kdk-bootstrap.c.cilium-dev.internal bootkube.sh[2389]: Sending bootstrap-finished event.[#16898] context deadline exceeded
Jan 25 16:16:44 gcp-ocp46-oss191-fix2-l4kdk-bootstrap.c.cilium-dev.internal bootkube.sh[2389]: Tearing down temporary bootstrap control plane...
Jan 25 16:16:45 gcp-ocp46-oss191-fix2-l4kdk-bootstrap.c.cilium-dev.internal bootkube.sh[2389]: Waiting for CEO to finish...
Jan 25 16:16:46 gcp-ocp46-oss191-fix2-l4kdk-bootstrap.c.cilium-dev.internal bootkube.sh[2389]: I0125 16:16:46.420832       1 waitforceo.go:64] Cluster etcd operator bootstrapped successfully
Jan 25 16:16:46 gcp-ocp46-oss191-fix2-l4kdk-bootstrap.c.cilium-dev.internal bootkube.sh[2389]: I0125 16:16:46.421956       1 waitforceo.go:58] cluster-etcd-operator bootstrap etcd
Jan 25 16:16:46 gcp-ocp46-oss191-fix2-l4kdk-bootstrap.c.cilium-dev.internal bootkube.sh[2389]: bootkube.service complete
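For reference, the failure above boils down to spec.maintainers being rendered as a plain string where the CSV schema expects an array of objects, i.e. roughly (values are illustrative):

# what the rendered CSV effectively contained -- a string:
spec:
  maintainers: "Cilium"
# what the schema expects -- an array of name/email objects:
spec:
  maintainers:
  - name: Cilium
    email: maintainers@cilium.io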

@errordeveloper

I've reported the issue with the service manifest here: https://bugzilla.redhat.com/show_bug.cgi?id=1933263

From talking to @vrutkovs, it sounds like errors in user-provided manifests cannot really be validated upfront, so I'll close this for now.
