Large LIST calls being made to Kube API server #10931

prateekgogia · 2023-04-17T17:24:18Z

Summary

While debugging load on the API server and etcd instances, I found that the Argo workflow controller makes List calls every minute, listing all the workflow objects in the cluster.

What change needs making?
Can this implementation be switched to a WATCH call instead of using a List call?

terrytangyuan · 2023-04-17T20:03:07Z

Could you paste one of the complete API endpoint URLs? Do you have user agent information for those list calls?

prateekgogia · 2023-04-17T21:37:08Z

RequestURI - /apis/argoproj.io/v1alpha1/workflows?labelSelector=%21workflows.argoproj.io%2Fphase%2C%21workflows.argoproj.io%2Fcontroller-instanceid&limit=200

user agent - workflow-controller/v0.0.0 (linux/amd64) kubernetes/$Format/argo-workflows/v3.4.6 argo-controller
username - system:serviceaccount:argo:argo
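
For readers decoding the request above: the query string is URL-encoded, so the selector is easier to read once unescaped. The following is a minimal sketch using only the Go standard library; `decodeListQuery` is a hypothetical helper written for illustration, not something from the Argo codebase.

```go
package main

import (
	"fmt"
	"net/url"
)

// decodeListQuery parses a request URI and returns the decoded
// labelSelector and limit query parameters.
// Hypothetical helper for illustration only.
func decodeListQuery(requestURI string) (selector, limit string, err error) {
	u, err := url.ParseRequestURI(requestURI)
	if err != nil {
		return "", "", err
	}
	q := u.Query() // percent-decoding happens here
	return q.Get("labelSelector"), q.Get("limit"), nil
}

func main() {
	// RequestURI copied verbatim from the audit log entry above.
	uri := "/apis/argoproj.io/v1alpha1/workflows?labelSelector=%21workflows.argoproj.io%2Fphase%2C%21workflows.argoproj.io%2Fcontroller-instanceid&limit=200"

	selector, limit, err := decodeListQuery(uri)
	if err != nil {
		panic(err)
	}
	fmt.Println("labelSelector:", selector)
	fmt.Println("limit:", limit)
}
```

Decoded, the selector reads `!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid`, which matches workflows that carry neither label. The path has no namespace segment, so the request is cluster-wide, paged at 200 objects.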

tooptoop4 · 2023-04-17T21:51:29Z

I guess it's the liveness probe: https://github.com/argoproj/argo-workflows/blob/v3.4.7/workflow/controller/healthz.go#L34

terrytangyuan · 2023-04-18T00:32:28Z

A workaround is to remove your liveness probe or reduce the limit via HEALTHZ_LIST_LIMIT. The list call should then be minimal.

prateekgogia · 2023-04-18T19:31:35Z

Thanks. So if I understand correctly, HEALTHZ_LIST_LIMIT limits the number of workflow objects requested in the List API call?

terrytangyuan · 2023-04-18T19:38:14Z

Yes. It should reduce the load.

prateekgogia · 2023-04-18T19:58:07Z

Thanks for confirming. I am double-checking with our etcd team, because there has been some discussion about how the limit param can also cause excessive load. I will get back once I have an answer from them.

andrewsykim · 2023-04-20T18:59:27Z

I experienced a similar issue with the workflow controller, except it was doing large LIST requests for Pods. It seems the workflow controller issues periodic list requests with a labelSelector and without setting resourceVersion. This requires the apiserver to fetch objects directly from etcd instead of using its in-memory cache, which generates heavy load on both the apiserver and etcd.

Ideally, this sort of request would use the controller list/watch pattern instead of periodic lists. Google has some documentation on this here: https://cloud.google.com/kubernetes-engine/docs/concepts/planning-scalability#use_list_and_watch_pattern_instead_of_periodic_listing

Here's an example LIST request that was generating a lot of load; I redacted some fields that aren't relevant.

"HTTP" verb="LIST" URI="/api/v1/namespaces/<namespace>/pods?labelSelector=workflows.argoproj.io%2Fworkflow%3D<workflow>" latency="5.232824206s" userAgent="workflow-controller/v0.0.0 (linux/amd64) kubernetes/$Format"
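
As a rough illustration of the list/watch pattern mentioned above, the sketch below simulates it with the Go standard library only: one initial list seeds a local cache, after which watch events keep it current and reads never go back to the server. `Event`, `Store`, and their methods are hypothetical names made up for this example; a real controller would typically rely on client-go informers rather than hand-rolled code like this.

```go
package main

import "fmt"

// Event mimics a watch notification: a change type plus the object's
// name and phase. Hypothetical type for illustration.
type Event struct {
	Type  string // "ADDED", "MODIFIED" or "DELETED"
	Name  string
	Phase string
}

// Store is a local cache keyed by object name.
type Store struct {
	phases map[string]string
}

// NewStore seeds the cache from the result of a single initial LIST.
func NewStore(initial map[string]string) *Store {
	s := &Store{phases: make(map[string]string, len(initial))}
	for name, phase := range initial {
		s.phases[name] = phase
	}
	return s
}

// Apply folds one watch event into the cache.
func (s *Store) Apply(ev Event) {
	switch ev.Type {
	case "ADDED", "MODIFIED":
		s.phases[ev.Name] = ev.Phase
	case "DELETED":
		delete(s.phases, ev.Name)
	}
}

// Run consumes events until the channel is closed.
func (s *Store) Run(events <-chan Event) {
	for ev := range events {
		s.Apply(ev)
	}
}

// Get reads from the local cache; no server round trip is involved.
func (s *Store) Get(name string) (string, bool) {
	phase, ok := s.phases[name]
	return phase, ok
}

func main() {
	// One LIST at startup (simulated).
	store := NewStore(map[string]string{"wf-a": "Running"})

	// Subsequent changes arrive as watch events (simulated).
	events := make(chan Event, 3)
	events <- Event{Type: "ADDED", Name: "wf-b", Phase: "Pending"}
	events <- Event{Type: "MODIFIED", Name: "wf-a", Phase: "Succeeded"}
	events <- Event{Type: "DELETED", Name: "wf-b"}
	close(events)
	store.Run(events)

	phase, ok := store.Get("wf-a")
	fmt.Println(phase, ok)
}
```

The point of the design is that the cost to the apiserver scales with the number of changes rather than with the number of objects multiplied by the polling frequency.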

andrewsykim · 2023-04-20T19:09:43Z

I dug around and see that this issue has already been fixed! #4024

I believe the version of the workflow controller used in this cluster did not include this performance improvement.

terrytangyuan · 2023-04-20T19:15:10Z

Would you like to try a potential fix in this new image tag, argoproj/workflow-controller:dev-fix-list-load? It will be ready once all builds finish: https://github.com/argoproj/argo-workflows/actions/runs/4758024718/jobs/8455548828

tooptoop4 · 2024-05-12T11:44:25Z

@prateekgogia did you retest?

terrytangyuan · 2024-05-12T13:22:52Z

I think this was fixed in one of these PRs: #11722, #9700, #12133, #11375

Feel free to re-open if you are still having issues.

prateekgogia added the type/feature label Apr 17, 2023

terrytangyuan closed this as completed May 12, 2024

agilgur5 added the type/bug, P2, and area/controller labels and removed the type/feature label May 12, 2024

argoproj locked as resolved and limited conversation to collaborators Jun 27, 2024
