Load tests with provider-aws #576
In the context of this issue, I wanted to provision Kubernetes clusters in GKE. I made two attempts: a zonal cluster and a regional one. I will share my observations on both.
This is an important point for CRD scaling. From the user's perspective, even with reasonably sized machines, it took more than one or two hours for the provider to set up and become usable. Also, even after the cluster was stable, the crossplane, crossplane-rbac, and AWS provider pods had been restarted many times.
In the AWS cluster the situation is noticeably better: stabilization takes about 5-7 minutes.
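For readers trying to reproduce this observation, a minimal way to watch those restarts over time is a loop like the sketch below. This is added for context only; the `crossplane-system` namespace and the 60-second interval are assumptions, not details from the tests above:

```bash
# Rough illustration: sample restart counts while the control plane stabilizes.
# The namespace (crossplane-system) is an assumption; adjust to your installation.
while true; do
  date
  kubectl get pods -n crossplane-system \
    -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount
  sleep 60
done | tee pod-restarts.log
```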
@sergenyalcin, for GCP clusters I remember Nic mentioning that it requires 11 nodes to become stable; see the relevant Slack thread.
Hi folks,

I did some load tests with provider-aws v0.30.0 in an EKS cluster.

Test Environment: EKS cluster, m5.2xlarge worker nodes (32 GB memory, 8 vCPUs)

Test Results: We observe that memory usage does not increase after the CPU is saturated. An increase in TTRs is recorded, as expected, but the provider continues to do its job. Although TTRs increase, they are still acceptable for 70 MRs; the provider did not stop working and provisioned all resources properly. So it can be said that the provider is working effectively. On the other hand, no zombie processes were found in the tests. Log files for all tests are attached: ps10.log ps20.log ps30.log ps50.log ps60.log ps70.log

Definitions:
- Experiment: deploying the mentioned number of MRs to the cluster and then deleting the related resources from the cluster (i.e., deleting the finalizers and the physical resources).
- Experiment Time: the time elapsed between deploying and removing the mentioned number of MRs.
- TTR (time to readiness): the time elapsed between the deployment of an MR to the cluster and its Ready condition becoming True.
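To make the experiment and TTR definitions above concrete, here is a minimal sketch of creating N MRs and timing their readiness. It is not the exact tooling used in these tests; the `Bucket` kind (`s3.aws.upbound.io/v1beta1`), the region, and the `default` ProviderConfig name are assumptions that may not match your provider version:

```bash
#!/usr/bin/env bash
# Sketch: create N managed resources (MRs) and time how long it takes for all of them to become Ready.
# The Bucket kind/apiVersion, region, and ProviderConfig name are assumptions; adjust to your setup.
N=30
START=$(date +%s)

for i in $(seq 1 "$N"); do
  cat <<EOF | kubectl apply -f -
apiVersion: s3.aws.upbound.io/v1beta1
kind: Bucket
metadata:
  name: loadtest-bucket-$i
spec:
  forProvider:
    region: us-east-1
  providerConfigRef:
    name: default
EOF
done

# A 10-minute timeout mirrors the acceptance window proposed in the issue description.
kubectl wait bucket.s3.aws.upbound.io --all --for=condition=Ready --timeout=10m
echo "All $N MRs became Ready within $(( $(date +%s) - START ))s"
```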
Hi @sergenyalcin,
I did some load tests with provider-aws v0.30.0 in a bigger (in terms of CPU) EKS cluster. In the previous test, because the CPU was saturated, we did not observe any memory increase after a point, so we switched to a bigger machine.

Test Environment: EKS cluster, m5.2xlarge worker nodes (32 GB memory, 16 vCPUs)

Test Results: We did not observe CPU saturation, but CPU usage still stopped increasing after a point. We think the reason is the provider's parallelism parameter: once we reach this parallelism constraint, we cannot saturate the CPU. The main observation is, again, that the provider continues to do its job. Although TTRs increase, they are still acceptable for 150 MRs; the provider did not stop working and provisioned all resources properly. So it can be said that the provider is working effectively. On the other hand, I still did not observe any zombie processes. Log files for all tests are attached.
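The parallelism parameter referred to above is the provider's `--max-reconcile-rate` flag (default 10, as noted in the issue description). A hedged sketch of overriding it through a `ControllerConfig` follows; the ControllerConfig/Provider names, the package reference, and the rate of 50 are illustrative assumptions:

```bash
# Sketch only: raise the provider's reconcile parallelism via a ControllerConfig.
# Names, package reference, and the chosen rate below are illustrative, not from the original tests.
cat <<EOF | kubectl apply -f -
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: provider-aws-config
spec:
  args:
    - --max-reconcile-rate=50
---
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: upbound-provider-aws
spec:
  package: xpkg.upbound.io/upbound/provider-aws:v0.30.0
  controllerConfigRef:
    name: provider-aws-config
EOF
```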
I did some tests using the gRPC-server-enabled image that was prepared by @ulucinar.

Test Environment: EKS cluster, m5.2xlarge worker nodes (32 GB memory, 16 vCPUs)

Test Results: Memory consumption is higher. I synced with @ulucinar offline, and he mentioned an upstream memory leak issue in the context of the gRPC-server-enabled setup. On the other hand, there is a significant improvement in the TTR values, and we successfully created 1000 MRs. I also ran the 1000-MR test with the v0.30.0 image.

Results: There is a significant difference in provider performance (please see the TTR values). So, from these results, I think switching to the gRPC-server-enabled implementation, after resolving the leak issue, will be the best solution in terms of performance metrics.
Hi @sergenyalcin, Here's the upstream issue that's probably related to the high memory consumption you observe with the shared server runtime: |
I want to make sure I'm understanding these results correctly. Am I correct that our most efficient (i.e. most optimized) build of
Test results from the latest image from @ulucinar (Shared Provider Scheduler). This image contains:
According to the latest results:
With these results we can say that the most successful image is this one. I am putting this image here as a reference:
Please note that these images are based on the v0.31.0 version of provider-aws.
Compared to baseline, some of the improvement rates in the final image were:
The test results for the Workspace Scheduler (for more context, please see crossplane/upjet#178). When we compare the results with the baseline: TTR: 61%
Test results from the latest image from crossplane/upjet#178: after the latest changes, the shared scheduler implementation shows the same performance results.
Many experiments have been done to determine the performance characteristics of provider-aws, both for the baseline and for the different schedulers. In addition, load tests were performed with a large number of MRs and the results were recorded. That's why I'm closing the issue.
In the context of #325, we would like to perform some load tests to better understand the scaling characteristics of the provider. The most recent experiments related to provider performance are here but they were for parameter optimization and not load test experiments. These tests can also help us to give the community sizing & scaling guidance.
We may do a set of experiments in which we gradually increase the # of MRs provisioned until we saturate the computing resources of upbound/provider-aws. I suggest we use a GCP regional cluster with a worker instance type of `e2-standard-32`, initially with the vanilla provider and with the default parameters (especially with the default `max-reconcile-rate` of 10, as suggested here), so that we can better relate our results with the results of those previous experiments, and also because the current default provider parameters were chosen using the results of those experiments. We can also make use of the existing tooling from here & here to conduct these tests. We should collect & report at least the following for each experiment:

- The number of MRs that reach the `Ready=True, Synced=True` state in 10 min: during an interval of 10 min, how many of the MRs could acquire these conditions and how many failed to do so?
- CPU & memory utilization of the provider pod. We can install the monitoring stack with `helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack -n prometheus --set namespaceOverride=prometheus --set grafana.namespaceOverride=prometheus --set kube-state-metrics.namespaceOverride=prometheus --set prometheus-node-exporter.namespaceOverride=prometheus --create-namespace` from the `prometheus-community` Helm repository (`helm repo add prometheus-community https://prometheus-community.github.io/helm-charts`). We may include the Grafana dashboard screenshots like here.
- `kubectl get managed -o yaml` output at the end of the experiment.
- `go run github.com/upbound/uptest/cmd/ttr@fix-69` output (related to the above item).
- `ps -o pid,ppid,etime,comm,args` output from the provider container. We can do this at the end of each experiment run or, better, we can have reporting during the course of the experiment with something like `while true; do date; k exec -it <provider pod> -- ps -o pid,ppid,etime,comm,args; done` and log the output to a file (see the sketch below). You can refer to our conversation with @mmclane here for more context on why we do this.

As long as we have not saturated the compute resources of the provider, we can iterate with a new experiment with more MRs in increments of 5 or 10. I think initially we can start with 30 (let's start with something with a 100% success rate, i.e., all MRs provisioned can become ready in the allocated time, i.e., in 10 min).
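A small sketch of that `ps` logging loop with the file redirection spelled out; the pod name, namespace, and sampling interval below are placeholders rather than values from the original experiments:

```bash
#!/usr/bin/env bash
# Sketch: periodically capture the provider container's process table and append it to a log file.
# POD, NAMESPACE, and the 30-second interval are placeholders; adjust to the actual provider pod.
POD="<provider pod>"
NAMESPACE="upbound-system"
while true; do
  date
  kubectl exec -n "$NAMESPACE" "$POD" -- ps -o pid,ppid,etime,comm,args
  sleep 30
done | tee -a provider-ps.log
```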