Upjet providers can't consume workqueue fast enough. Causes huge time-to-readiness delay #116
Comments
Thank you for the detailed report @Kasama, we'll be taking a look at this in our next sprint starting next week.
Great to hear that! Feel free to reach out either here or on Crossplane's Slack (@roberto.alegro) if I can help with more details or reproduction steps.
Probably related to crossplane-contrib/provider-upjet-aws#86
Thanks a lot for your detailed analysis here @Kasama. I believe a low-hanging fruit here is to set some reasonable defaults for `maxConcurrentReconciles` and `pollInterval`.
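For context, in plain controller-runtime terms the concurrency knob is the controller's `MaxConcurrentReconciles` option. Below is a minimal sketch of where it plugs in; the `v1alpha1.Queue` type and its import path are hypothetical stand-ins, and upjet wires this through its own configuration rather than exactly like this:

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	// Hypothetical API package standing in for a provider's managed resource.
	v1alpha1 "example.com/provider/apis/queue/v1alpha1"
)

// buildController shows where the concurrency knob plugs in. It is a sketch,
// not upjet's actual wiring.
func buildController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1alpha1.Queue{}).
		WithOptions(controller.Options{
			// The default is 1: a single slow terraform call blocks
			// every other item in the controller's workqueue.
			MaxConcurrentReconciles: 10,
		}).
		Complete(r)
}
```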
FYI, #99 is another thing that may cause CPU saturation.
On a GKE cluster, there are definitely some improvements between Exp#1 and Exp#2 but, TBH, I am a bit surprised that Exp#3 is not much different from Exp#2. I am wondering if this could be related to the CPU being throttled in both cases. I am planning to repeat the two experiments on larger nodes so as not to get throttled.

**Experiment 1: `maxConcurrentReconciles=1` and `pollInterval=1m` (current defaults).** Provisioned 100 …

**Experiment 2: `maxConcurrentReconciles=10` and `pollInterval=1m` (community defaults).** Provisioned 100 …

**Experiment 3: `maxConcurrentReconciles=10` and `pollInterval=10m` (proposed defaults).** Provisioned 100 …
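A back-of-the-envelope queueing model (my framing, not part of the experiments above) helps read these numbers: with $N$ resources each re-enqueued every poll interval $T$, and $c$ workers each spending $s \approx 1\,\mathrm{s}$ per `terraform` call, the queue only drains when the service rate exceeds the arrival rate:

$$\frac{c}{s} > \frac{N}{T}$$

Exp#1 has $1$ item/s of capacity against $100/60 \approx 1.7$ arrivals/s, so its queue can never drain. Exp#2 and Exp#3 both have $10$ items/s of capacity against $1.7$ and $0.17$ arrivals/s respectively, so both should drain comfortably; that makes CPU throttling a plausible explanation for Exp#3 not looking better than Exp#2.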
Yeah, during my testing I've walked a similar path and changed those settings as well. Indeed there are some improvements when bumping the concurrency, but sadly the problem remains: the time it takes for new resources to become ready still depends heavily on the number of already existing resources.
I repeated the last experiment on a bigger node (e2-standard-32) to eliminate the effect of CPU throttling, and this time it looks much better (except for the resource consumption).

**Experiment 4: `maxConcurrentReconciles=10` and `pollInterval=10m` (proposed defaults), on e2-standard-32.** Provisioned 100 …

I believe improving resource usage is orthogonal to the settings here, and I feel good with the above defaults while still exposing them as configurable parameters. I'll open PRs with the proposed defaults.
I was finally able to do some more tests using a bigger instance. But when trying with ~5000 concurrent resources there was still a similar problem: the queue held ~700 resources at all times. That can again be mitigated by increasing the reconciliation interval, but it would be much better to have a way to scale these controllers horizontally.
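Plugging the ~5000-resource test into the same rough model (again my own estimate, assuming `maxConcurrentReconciles=10` and `pollInterval=10m`): $5000/600 \approx 8.3$ arrivals/s against $10$ items/s of capacity is a utilization of roughly $0.83$, close to saturation, so a standing backlog like the observed ~700 items is what queueing theory would predict. Lengthening the poll interval lowers the arrival rate; horizontal scaling would raise capacity instead.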
Cross-posting about crossplane/terrajet#300, because the exact same behavior happens with Upjet and, as far as I understood, `terrajet` will be deprecated in favor of `Upjet`, so it makes sense to keep this issue tracked here. This is especially relevant as Upjet now seems to be the "official" backend for provider implementations.
What happened?
The expected behaviour is that an Upjet resource's time-to-readiness wouldn't depend on the number of resources that already exist in the cluster.
In reality, since calling `terraform` takes a while (around 1 second in my tests), the provider controller is unable to clear the work queue. Because of that, any new event (such as creating a new resource) takes very long to complete when there are many other resources, since the controller adds new items to the end of the queue. There are more details in the original bug report.
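To make the queueing effect concrete, here is a tiny simulation (my own sketch using the numbers above, not provider code) of a single worker facing 100 resources that are each re-enqueued every minute:

```go
package main

import "fmt"

// Back-of-the-envelope simulation: every resource is re-enqueued once per
// poll interval, and `workers` reconcilers each spend ~1s per item (the
// terraform call). We track the backlog second by second.
func main() {
	const (
		resources     = 100  // managed resources in the cluster
		pollInterval  = 60.0 // seconds between re-enqueues of each resource
		workers       = 1.0  // maxConcurrentReconciles
		reconcileSecs = 1.0  // seconds per reconcile (one terraform call)
	)

	arrivalsPerSec := resources / pollInterval // ≈ 1.67 items/s
	drainPerSec := workers / reconcileSecs     // 1 item/s

	backlog := 0.0
	for t := 1; t <= 300; t++ {
		backlog += arrivalsPerSec - drainPerSec
		if backlog < 0 {
			backlog = 0
		}
		if t%60 == 0 {
			fmt.Printf("t=%3ds backlog≈%.0f items\n", t, backlog)
		}
	}
}
```

With these inputs the backlog grows by roughly 0.67 items/s, so after five minutes a newly created resource sits behind ~200 items, i.e. more than three minutes of queue latency before its first reconcile, and it only gets worse over time.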
How can we reproduce it?
The reproduction steps are basically the same as in the original issue, just changing the Terrajet provider for the Upjet provider (`provider-aws`). It will take some minutes, but a burst of resources is expected to take a bit, although it does take much longer than `provider-aws` for the same resource. The last step will take a long time, which is the problem this bug report is about.
Open collapsible for reproducible commands