
Load tests with provider-gcp #255

Closed
ulucinar opened this issue Mar 14, 2023 · 6 comments

@ulucinar

We would like to perform some load tests to better understand the scaling characteristics of provider-gcp. The most recent experiments related to provider performance are here, but they were for parameter optimization rather than load testing. These tests can also help us give the community sizing & scaling guidance.

We may do a set of experiments (with the latest available version of provider-gcp) in which we gradually increase the number of MRs provisioned until we saturate the compute resources of upbound/provider-gcp. Initially we use an EKS cluster with a worker instance type of m5.2xlarge (32 GB memory, 8 vCPUs), the vanilla provider, and the default parameters (especially the default max-reconcile-rate of 10, as suggested here), so that we can better relate our results to those previous experiments; the current default provider parameters were also chosen based on their results.
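For illustration, the default concurrency could be pinned explicitly and the provider pointed at it through a ControllerConfig; a minimal sketch, assuming the upjet-style --max-reconcile-rate argument and an xpkg.upbound.io package reference (the names and tags below are placeholders, not taken from this issue):

```sh
# Sketch only: pin --max-reconcile-rate=10 (the default) via a ControllerConfig
# and reference it from the Provider package. Names and the package tag are
# placeholders for illustration.
cat <<'EOF' | kubectl apply -f -
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: provider-gcp-load-test
spec:
  args:
    - --max-reconcile-rate=10
---
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: upbound-provider-gcp
spec:
  package: xpkg.upbound.io/upbound/provider-gcp:v0.29.0
  controllerConfigRef:
    name: provider-gcp-load-test
EOF
```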

We can also make use of the existing tooling from here & here to conduct these tests. We should collect & report at least the following for each experiment:

  • The types and number of MRs provisioned during the test
  • Success rate for Ready=True, Synced=True state in 10 min: During an interval of 10 min, how many of the MRs could acquire these conditions and how many failed to do so?
  • Using the available Prometheus metrics from the provider, what was the peak & avg. memory/CPU utilization? You can install the Prometheus and Grafana stack using something like: helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack -n prometheus --set namespaceOverride=prometheus --set grafana.namespaceOverride=prometheus --set kube-state-metrics.namespaceOverride=prometheus --set prometheus-node-exporter.namespaceOverride=prometheus --create-namespace from the prometheus-community Helm repository. We may include the Grafana dashboard screenshots like here.
  • kubectl get managed -o yaml output at the end of the experiment.
  • Time-to-readiness metrics as defined here. Histograms like we have there would be great, but we can also derive them later.
  • go run github.com/upbound/uptest/cmd/ttr@fix-69 output (related to the above item)
  • ps -o pid,ppid,etime,comm,args output from the provider container. We can do this at the end of each experiment run or, better, we can have reporting during the course of the experiment with something like: while true; do date; k exec -it <provider pod> -- ps -o pid,ppid,etime,comm,args; done and log the output to a file (see the sketch after this list). You can refer to our conversation with @mmclane here for more context on why we do this.
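A minimal sketch of the logging variant of that loop (the pod name, namespace, and interval are placeholders):

```sh
# Sketch: periodically capture the provider's process table during an experiment
# and append it to a log file. <provider-pod> and the namespace are placeholders.
while true; do
  {
    date
    kubectl -n upbound-system exec <provider-pod> -- ps -o pid,ppid,etime,comm,args
  } >> ps-report.log 2>&1
  sleep 10
done
```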

As long as we have not saturated the provider's compute resources, we can iterate with a new experiment with more MRs, in increments of 5 or 10. I think we can initially start with 30 (let's start with a count that gives a 100% success rate, i.e., all provisioned MRs become ready within the allocated 10 minutes).
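For bulk provisioning, something along these lines could be used to burst-create the MRs; the Bucket group/version, location, and providerConfigRef below are assumptions based on upbound/provider-gcp and may need adjusting for the version under test:

```sh
# Sketch: burst-create N storage Bucket MRs for one load-test run.
# Bucket names must be globally unique in GCP, so a unique prefix may be needed.
N=30
for i in $(seq 1 "$N"); do
  cat <<EOF
---
apiVersion: storage.gcp.upbound.io/v1beta1
kind: Bucket
metadata:
  name: load-test-bucket-$i
spec:
  forProvider:
    location: US
  providerConfigRef:
    name: default
EOF
done | kubectl apply -f -
```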

@Piotr1215

First pass of provider GCP tests:

Test scenario

  • Bursting 1, 10, 50, and 100 storage buckets
  • provider GCP v0.29
  • EKS with 1 node m5.2xlarge (32 GB memory, 8 vCPUs)
  • Kubernetes version 1.25

Test results

Memory, CPU, and TTR (time to readiness) were recorded for each run. ps output consistently showed 2 processes, as expected.

Image

Image

Memory graph in Prometheus

Image

Discussion

TTR is higher than with Azure, probably due to the resource used (storage bucket). I wasn't able to use Container Registry; it wouldn't show up in the console for some reason. Interestingly, peak memory usage is significantly lower, with comparable CPU results.

@Piotr1215

Update to the above tests. The tests were run with the debug flag enabled in the ControllerConfig, which affects CPU utilization. To keep the results streamlined, further tests will be done without the debug setting. Below are results without the debug setting (first row) and with it (second row).

Image
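For context, the debug flag is typically toggled through the ControllerConfig's args; a minimal sketch of dropping it again (the ControllerConfig name is a placeholder, not from this issue):

```sh
# Sketch: clear the provider args to remove a previously set --debug flag.
# "provider-gcp-load-test" is a placeholder ControllerConfig name.
kubectl patch controllerconfig provider-gcp-load-test \
  --type merge -p '{"spec":{"args":[]}}'
```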

@Piotr1215

Piotr1215 commented Mar 17, 2023

Here are more results pushing the GCP provider to 500 MRs on the same setup. It is interesting how little memory was consumed; CPU was definitely the bottleneck.

Image

Image

@Piotr1215

Piotr1215 commented Mar 22, 2023

New set of tests with the improved provider image ulucinar/provider-gcp-amd64:v0.29.0-e45875a and the same test conditions.

Test scenario

  • Bursting 1, 10, 50, 100, and 500 storage buckets
  • provider GCP v0.29 modified image ulucinar/provider-gcp-amd64:v0.29.0-e45875a (plugged in as sketched below)
  • EKS with 1 node m5.2xlarge (32 GB memory, 8 vCPUs)
  • Kubernetes version 1.25
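For reference, a custom controller image like the one above is typically plugged in via the ControllerConfig image field; a minimal sketch (the ControllerConfig name is a placeholder):

```sh
# Sketch: point the provider at the modified image used in this run.
# "provider-gcp-load-test" is a placeholder ControllerConfig name.
kubectl patch controllerconfig provider-gcp-load-test \
  --type merge \
  -p '{"spec":{"image":"ulucinar/provider-gcp-amd64:v0.29.0-e45875a"}}'
```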

Significant improvements in CPU and memory utilization, but an interesting increase in the experiment duration. TTR remains largely the same.

Image

CPU metrics:

Image

| Provider Version | Runs | Experiment Duration | Average Time to Readiness in seconds | Peak Time to Readiness in seconds | Average Memory | Peak Memory | Average CPU % | Peak CPU % |
|---|---|---|---|---|---|---|---|---|
| v0.29.0-e45875a | 1 | 122.31 | 65 | 65 | 157.10 MB | 185.78 MB | 1.58 | 1.81 |
| v0.29.0-e45875a | 10 | 153.21 | 66 | 67 | 306.52 MB | 573.60 MB | 4.06 | 6.08 |
| v0.29.0-e45875a | 50 | 387.97 | 322.76 | 330 | 616.14 MB | 1.10 GB | 6.85 | 19.65 |
| v0.29.0-e45875a | 100 | 957.32 | 686.03 | 850 | 515.08 MB | 932.49 MB | 9.53 | 37.52 |
| v0.29.0-e45875a | 500 | 4468.2 | 3375.77 | 3993 | 597.22 MB | 1.21 GB | 16.72 | 88.38 |
| v0.29.0 | 1 | 124.83 | 67 | 67 | 122.80 MB | 171.25 MB | 2.62 | 3.05 |
| v0.29.0 | 10 | 102.72 | 72.4 | 74 | 443.88 MB | 802.28 MB | 4.77 | 10.66 |
| v0.29.0 | 50 | 417.79 | 337.32 | 350 | 728.35 MB | 1.04 GB | 15.06 | 40.3 |
| v0.29.0 | 100 | 825.02 | 661.47 | 689 | 757.08 MB | 1.09 GB | 24.36 | 71.11 |
| v0.29.0 | 500 | 3955.96 | 3240.79 | 3322 | 818.38 MB | 1.25 GB | 25.69 | 98.34 |

@Piotr1215

Here are the improvement percentages of the modified image over v0.29.0:

  • Peak CPU: 37.00%
  • Average CPU: 60.69%
  • Peak Memory: 8.13%
  • Average Memory: 26.87%
  • Peak Time to Readiness: 16.37%
  • Average Time to Readiness: 3.07%

@Piotr1215

A new sizing guide has been published based on the findings from the performance tests: https://github.com/upbound/upjet/blob/main/docs/sizing-guide.md
