Potential footgun: Very slow start up with many secrets in cluster #1447
Comments
To determine whether this is a real bug or a configuration issue, kindly provide more context.
@ccfishk you can find an example log where start up takes long here: https://gist.github.com/sandermvanvliet/1cc87865eb34c6a6b8c2cf55b7a51213
To answer your question: the issue in this case is why KIC 1.3.0 walks through all secrets before starting up.
By far, secrets are the largest group of resources in our cluster compared to, say, pods or endpoints, so my guess is that the sheer number of secrets is what makes start up slow. I haven't had much luck reproducing this, but I'll set up a cluster with a single pod running an API behind a Kong ingress, then add in a ton of secrets and see if it shows the same behaviour. At the very least that'll tell us whether our suspicion that this has to do with secrets is actually true.
Still, adding in that label filter makes it start up quickly, so I guess it's related. Unless you mean that it shouldn't even iterate over all those secrets in the first place.
@sandermvanvliet did you get a chance to add in a ton of secrets and see if it shows the same behavior? Sorry for only getting back to you today; I am also trying to figure out what the root cause of this scenario could be.
I dumped 5k secrets on a cluster using Go corev1 secret creation and it did not trigger the same behavior; the KIC pod reaches running status within a second. Unless this is reproducible in a certain configuration, we'll not treat this as a bug.
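For anyone who wants to try the same kind of reproduction, here is a minimal sketch of bulk-creating secrets with `client-go`. The exact script used above isn't shown in this issue, so the kubeconfig handling, namespace, secret names, and labels below are assumptions; only the core `CoreV1().Secrets(...).Create(...)` call reflects what was described.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: the test cluster is reachable via the default kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Create 5k dummy secrets, labelled like Helm 3 release secrets so they
	// resemble the cluster described in this issue.
	for i := 0; i < 5000; i++ {
		secret := &corev1.Secret{
			ObjectMeta: metav1.ObjectMeta{
				Name:      fmt.Sprintf("dummy-secret-%d", i),
				Namespace: "default",
				Labels:    map[string]string{"owner": "helm"},
			},
			Data: map[string][]byte{"payload": []byte("placeholder")},
		}
		if _, err := client.CoreV1().Secrets("default").Create(context.TODO(), secret, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}
```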
NOTE: GitHub issues are reserved for bug reports only.
For anything else, please join the conversation
in Kong Nation https://discuss.konghq.com/c/kubernetes.
Summary
In a cluster with a large number of secrets, the Kong ingress controller is very slow to start, which leads to many forced restarts as pods are killed for failing their readiness and liveness probes.
Kong Ingress controller version
1.3.1
Kong or Kong Enterprise version
2.4.1
Kubernetes version
Environment
uname -a:
Linux aks-agentpool-22749819-0 4.15.0-1103-azure #114~16.04.1-Ubuntu SMP Wed Dec 16 02:39:42 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
What happened
We've been tracking down an issue where the ingress controller pod would take many attempts to start up because it kept being killed by the kubelet as its readiness and liveness probes were failing.
The reason for the failure was that the `/healthz` endpoint wasn't up yet because the ingress controller was still busy collecting the necessary objects from the Kubernetes API. We were triggered to look in this direction because the metrics we collect through Prometheus showed the Read SLI on secrets spiking whenever the Kong ingress controller attempted to start:
In the plot you can see that this is in some cases taking minutes, which is quite bad to say the least.
Based on this metric we checked which secrets the ingress controller was requesting, and we found that it was retrieving all secrets from all namespaces (we don't have a `watchNamespace` configured, as our ingresses and plugins live in various namespaces).
We deploy our applications using Helm 3, which stores release information as secrets (and keeps the last 10 per app by default). This amounts to a large number of secrets in the cluster: a quick count gave us ~5000 secrets, of which an estimated ~90% are Helm secrets.
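For reference, Helm 3 labels its release secrets with `owner=helm`, which is what an estimate like the one above can be based on. A hypothetical helper (not part of this issue) that counts them with `client-go` could look like this:

```go
package helmsecrets

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// CountHelmSecrets returns how many secrets across all namespaces carry the
// owner=helm label that Helm 3 puts on its release secrets, plus the total
// number of secrets, so a ratio like the one mentioned above can be computed.
func CountHelmSecrets(ctx context.Context, client kubernetes.Interface) (helm, total int, err error) {
	all, err := client.CoreV1().Secrets(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return 0, 0, err
	}
	for _, s := range all.Items {
		if s.Labels["owner"] == "helm" {
			helm++
		}
	}
	return helm, len(all.Items), nil
}
```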
To confirm our suspicions we compiled a version of the ingress controller with the `client-go` package vendored in, where we applied a label filter on the `Secrets` informer. The label filter was defined as `owner!=helm`.
After building the Docker container and deploying it into our cluster the start up time was < 10s; previously even a 240 second grace period on the readiness/liveness probes wasn't enough.
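The patched controller itself isn't shown in this issue, but with plain `client-go` a similar filter can be expressed through a shared informer factory option. This is a minimal sketch rather than the actual change, and note that `WithTweakListOptions` applies the selector to every informer created from that factory, not only the `Secrets` one:

```go
package filteredinformers

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// NewFilteredFactory builds a SharedInformerFactory whose list/watch calls
// skip Helm 3 release secrets by adding the owner!=helm label selector.
func NewFilteredFactory(client kubernetes.Interface) informers.SharedInformerFactory {
	return informers.NewSharedInformerFactoryWithOptions(
		client,
		10*time.Minute, // resync period; value chosen arbitrarily for this sketch
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.LabelSelector = "owner!=helm"
		}),
	)
}

// SecretsInformer returns the secrets informer from the filtered factory;
// the controller would register its event handlers on this.
func SecretsInformer(factory informers.SharedInformerFactory) cache.SharedIndexInformer {
	return factory.Core().V1().Secrets().Informer()
}
```

Filtering at list/watch time means the Helm release secrets never reach the controller's cache at all, which is consistent with the start up improvement described above.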
With this we now have a stable Kong ingress again; all instances are working fine.
I'm not sure if this is a real bug, but it's something that can cause a lot of confusion (it sure did for us!).
As far as I can tell I haven't come across this in the documentation, but it's also very likely I didn't look in the right places.
With this issue we wanted to make sure people are aware of this; if needed we can provide additional info.