Flagger with larger number of canaries underperforms #638
I suspect this is due to Kubernetes API rate limits. I've been benchmarking Flagger with 100 canaries in parallel on GKE and I haven't seen delays of more than a couple of seconds, but it really depends on which Kubernetes provider you are using.
Is there somewhere that Flagger will log errors if it's being rate limited? Or will Flagger just hold onto the request until it finally succeeds? I'm asking because nothing is showing up in the debug logs for the Flagger operator 😔 I'm on Amazon EKS.
Seeing very similar behaviour, also on AWS EKS, v1.16. We are at ~45 canaries and get nothing from the logs, even in debug. After upgrading to 1.2.0, the leader election problem is gone, but the slowness persists. It happens even if there is only one canary being updated at a time. Increasing threadiness did not change the results for us. If you have automation for that benchmark, let me know and I am up for porting it to EKS if needed to try to replicate this for others. Thanks for all the things =)
Hey, just an update on the last comment: I believe I've been able to reproduce this in a test environment and identify the cause. It is indeed a rate limiter, but it doesn't seem to be on the AWS EKS side; it seems to be in the client config. When the clients are created here and here, the QPS and Burst values are the defaults, and those are throttling some of the responses. Bumping both to something crazy high will do the trick and lets Flagger breeze through at least 75 canaries with no noticeable impact on progression performance. I'm not sure what the preferred approach would be: individual clients per canary, configurable limits, or tweaking the limits based on the canary count, since that is what drives the API usage; there are multiple options. If you do have a preferred solution, let me know and I might be able to help out. Regards,
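For reference, here is a minimal sketch of what raising the client-go client-side rate limiter can look like, assuming the clients are built from a rest.Config via clientcmd. The values 100/200 and the function names are illustrative, not Flagger's actual code:

```go
// Sketch only: shows where client-go's default client-side rate limiter
// (QPS 5, Burst 10 when unset) can be raised before the clientsets are built.
// The chosen values and names are illustrative assumptions.
package main

import (
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func buildClient(masterURL, kubeconfig string) (*kubernetes.Clientset, error) {
	cfg, err := clientcmd.BuildConfigFromFlags(masterURL, kubeconfig)
	if err != nil {
		return nil, err
	}

	// Raise the client-side throttling limits so many canaries reconciling
	// at once do not queue behind the default limiter.
	cfg.QPS = 100
	cfg.Burst = 200

	return kubernetes.NewForConfig(cfg)
}

func main() {
	client, err := buildClient("", "")
	if err != nil {
		log.Fatalf("error building kubernetes client: %v", err)
	}
	_ = client
}
```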
I think those two options could be set with Flagger command args; we need to figure out a default that works well with 100 canaries. If you could open a PR for this, it would be great. Thank you!
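As a rough illustration of that suggestion, here is a hypothetical sketch of exposing the two limits as command args alongside the existing threadiness flag. The flag names, defaults, and wiring are assumptions, not the actual implementation:

```go
// Hypothetical sketch: exposing the client rate limits as command-line flags.
// Flag names and defaults are illustrative assumptions, not Flagger's real flags.
package main

import (
	"flag"
	"fmt"

	"k8s.io/client-go/rest"
)

var (
	clientQPS   float64
	clientBurst int
)

func init() {
	flag.Float64Var(&clientQPS, "kube-api-qps", 100, "Kubernetes API QPS limit for the client.")
	flag.IntVar(&clientBurst, "kube-api-burst", 200, "Kubernetes API burst limit for the client.")
}

func main() {
	flag.Parse()

	// Apply the parsed values to the rest.Config before the clientsets are built
	// (an empty config is used here purely for illustration).
	cfg := &rest.Config{}
	cfg.QPS = float32(clientQPS)
	cfg.Burst = clientBurst

	fmt.Printf("client rate limits: qps=%v burst=%v\n", cfg.QPS, cfg.Burst)
}
```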
When I run Flagger with ~20 canaries, everything seems fine. But as soon as I scale up to around 50 canaries, Flagger starts to have a bad time.
What we see when we scale up the number of canaries:
- Canaries show "New revision detected" but never progress beyond 0%.
- Flagger loses leadership and exits with {"level":"info","ts":"2020-06-29T16:05:51.926Z","caller":"flagger/main.go:302","msg":"Leadership lost"}. The other replica will pick up for a few minutes before dying with the same message.
- The above happens even though Flagger has plenty of resources; we've never seen it go above 2% CPU or 50MB of memory while any of these problems occur.
What we've tried:
- Increasing the worker concurrency via the threadiness flag (flag.IntVar(&threadiness, "threadiness", 2, "Worker concurrency.")).
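For context, here is a minimal sketch of the common client-go controller pattern behind a threadiness-style setting: it only adds worker goroutines draining the workqueue, so raising it does not help if every worker's API calls sit behind the same throttled client. The controller type and names are illustrative, not Flagger's code:

```go
// Sketch of the usual client-go worker pattern driven by a threadiness value.
package main

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

type controller struct{}

// runWorker would pop items off the controller's workqueue and reconcile them.
func (c *controller) runWorker() {
	// ... process the next workqueue item ...
}

// Run starts the requested number of workers and blocks until stopCh closes.
func (c *controller) Run(threadiness int, stopCh <-chan struct{}) {
	for i := 0; i < threadiness; i++ {
		go wait.Until(c.runWorker, time.Second, stopCh)
	}
	<-stopCh
}

func main() {
	stopCh := make(chan struct{})
	c := &controller{}
	go c.Run(2, stopCh) // default threadiness of 2, as in the flag above
	time.Sleep(100 * time.Millisecond)
	close(stopCh)
}
```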
Versions:
- Flagger v1.0.0
- Kubernetes v1.14
- NGINX Ingress Controller v0.26.1