bootstrap slows down dramatically with Cilium installed by helm-operator #484
Comments
One issue that I came across while debugging this was that bootkube was stuck trying to apply a manifest; it kept producing errors like this:
Here is log-bundle-20210127095129.tar.gz, which shows what's going on. The troublesome manifest requests a particular node port and is part of the Cilium connectivity check suite that I was using in an attempt to rule out networking issues due to Cilium. I think bootkube should be more robust in the face of errors like this. @smarterclayton is there an open issue for addressing this sort of failure mode? I suppose one question to ask is why this is not reliant on …
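To illustrate the kind of conflict a Service with a fixed node port can run into, one quick check is to list every nodePort already allocated in the cluster. This is a diagnostic sketch, not part of the original report; the jsonpath expression is an assumption about how one might present the data:

```sh
# List every allocated nodePort so a Service requesting a fixed port
# (as the Cilium connectivity-check manifest does) can be checked for clashes.
kubectl get svc --all-namespaces \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{": "}{.spec.ports[*].nodePort}{"\n"}{end}'
```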
The issue with the connectivity check manifest I mentioned earlier is not the original issue I've been having.
I've reported the issue with the service manifest here: https://bugzilla.redhat.com/show_bug.cgi?id=1933263. From talking to @vrutkovs, it sounds like user-provided manifest errors cannot really be validated, so I'll close this for now.
Describe the bug
When I install OpenShift with Cilium using the currently published instructions, I get a working cluster.
I have created an operator based on the `quay.io/operator-framework/helm-operator:v1.2.0` image, and it worked previously, but somehow it stopped working for me. I am not entirely clear what caused this, but something breaks bootstrap and it gets stuck, yet cluster installation eventually succeeds.
If I replace the `helm template` steps from the Cilium docs with copying of the operator manifests, `openshift-install create cluster` times out (the errors appear to be slightly different each time). Once I leave the cluster alone for 1 or 2 hours, it eventually stabilises. It appears that due to the complexity of the bootstrap process, it is quite easy to end up in a situation where one of the operators is acting up, which can leave the whole system unstable for much longer than expected.
In one of the cases I worked through the following installer error:
Having checked DNS right away with `dig`, I got no answer for `auth-openshift.apps.gcp-ocp46-oss191-fix2.ilya-openshift-test-2.cilium.rocks`.
Having looked at the ingress operator logs, I confirmed there was a long gap before the record was created:
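The DNS check itself is simple; a minimal sketch of the kind of query involved (the hostname comes from the report, everything else is assumed):

```sh
# Query the OAuth route under the wildcard apps domain; an empty answer
# means the ingress wildcard DNSRecord has not been published yet.
dig +short auth-openshift.apps.gcp-ocp46-oss191-fix2.ilya-openshift-test-2.cilium.rocks
```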
I also checked the DNSRecord with `kubectl get DNSRecord -n openshift-ingress-operator default-wildcard -o yaml`:
Another way to illustrate the problem is the age difference between the control-plane and worker nodes:
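The node-age comparison can be pulled straight from the API; a minimal sketch (the actual output from the cluster is not reproduced here):

```sh
# Sort nodes by creation time; a large gap between the control-plane nodes
# and the first worker indicates how long the bootstrap phase was stalled.
kubectl get nodes --sort-by=.metadata.creationTimestamp \
  -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp
```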
I am not sure where to look to determine the component in question. I noticed the Cilium helm operator misbehaving and crash-looping, but that has already been addressed.
In theory it is easy to say that with so many operators at work there is a good chance of destabilising bootstrap, but it is hard for me to tell what exactly is holding things up.
Cilium appears to be working properly, and installing without the helm operator works just fine. I do suppose the helm operator may create a lot of churn on the API, but that is a very instinctive assumption based on the activity it logs, and I am not sure whether it is significant enough to slow down the API in reality. I do get the sense that the bootstrap API node has limited resources and could be easily overwhelmed, but that is another subjective interpretation.
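One rough way to test the churn hypothesis would be to count API requests per user agent in the audit log captured from the bootstrap node. The log path and field name below are assumptions based on a typical log-bundle layout, not something verified against this bundle:

```sh
# Count apiserver requests per user agent; a disproportionately high count
# for the helm-operator would support the API-churn theory.
jq -r '.userAgent' /var/log/kube-apiserver/audit.log 2>/dev/null \
  | sort | uniq -c | sort -rn | head -20
```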
Version
This is an IPI OCP 4.6.12 install; the same issue occurs on OKD 4.6 and 4.5. If needed, I can provide an OKD log bundle.
How reproducible
I'm able to reproduce this very consistently.
Log bundle
log-bundle-20210125154005.tar.gz