Add automaxprocs package #1443

Closed
wants to merge 1 commit into from

Conversation

@hawwwdi commented Apr 11, 2024

Add the automaxprocs package to automatically set GOMAXPROCS to match the Linux container CPU quota on Kubernetes and prevent CPU throttling.
Performance results are here.

If applied, the automaxprocs package will make source-controller respect the CPU quota on Kubernetes clusters.

Signed-off-by: Hadi Abbasi <hawwwdi@gmail.com>
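
For reference, wiring automaxprocs into a Go binary is typically a single blank import in main.go (a minimal sketch of the library's standard usage; the exact main.go layout in source-controller may differ):

```go
package main

import (
	// The blank import runs the package's init(), which caps GOMAXPROCS at the
	// container's CPU quota (read from cgroups) when the process starts.
	_ "go.uber.org/automaxprocs"
)

func main() {
	// controller setup continues as usual
}
```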
@stefanprodan (Member)

@hawwwdi source-controller is I/O bound rather than CPU bound. Did you observe CPU throttling for source-controller in your clusters?

@hawwwdi (Author) commented Apr 11, 2024

> @hawwwdi source-controller is I/O bound rather than CPU bound. Did you observe CPU throttling for source-controller in your clusters?

Yes, I have set the CPU request and limit to 12, but when the source controller uses nearly 6 cores, throttling occurs. Pod monitoring screenshot:
[screenshot]

@stefanprodan (Member)

Are you using HTTP/S HelmRepos that contain a huge number of charts? That would explain the CPU usage. Can you please post the output of `flux stats -A` here?

@hawwwdi (Author) commented Apr 11, 2024

> Are you using HTTP/S HelmRepos that contain a huge number of charts? That would explain the CPU usage. Can you please post the output of `flux stats -A` here?

Yes, I've stored many of my Helm charts in a Git repository, and perhaps that's causing the high CPU usage. However, it shouldn't be throttled at 6 cores of CPU usage when the limit and request are set to 12 cores. The automaxprocs package should fix that.
For `flux stats -A`, I currently don't have access to my cluster, but I will post it here later.

@souleb (Member) commented Apr 11, 2024

Throttling should only happen on limits, not on requests, which should be guaranteed or your pod would be rescheduled. If you have requested 12 CPU units and they have been guaranteed by the scheduler, automaxprocs would set GOMAXPROCS accordingly, right? Are you sure using automaxprocs would fix this particular case?

@hawwwdi (Author) commented Apr 11, 2024

> Throttling should only happen on limits, not on requests, which should be guaranteed or your pod would be rescheduled. If you have requested 12 CPU units and they have been guaranteed by the scheduler, automaxprocs would set GOMAXPROCS accordingly, right? Are you sure using automaxprocs would fix this particular case?

The automaxprocs package sets GOMAXPROCS according to the CPU limits, not the requests. I've tested the automaxprocs package on my services, which were throttled before reaching the CPU limit. After adding it, the issue was resolved.
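
If it helps to verify this, the package's explicit API can log the value it picks (a sketch using the public go.uber.org/automaxprocs/maxprocs API for illustration; it is not necessarily how this PR wires it in):

```go
package main

import (
	"log"
	"runtime"

	"go.uber.org/automaxprocs/maxprocs"
)

func main() {
	// Set derives GOMAXPROCS from the container's CPU quota (i.e. the limit)
	// and returns an undo function; the logger reports the decision it made.
	undo, err := maxprocs.Set(maxprocs.Logger(log.Printf))
	defer undo()
	if err != nil {
		log.Printf("failed to set GOMAXPROCS: %v", err)
	}
	log.Printf("GOMAXPROCS=%d, node logical CPUs=%d", runtime.GOMAXPROCS(0), runtime.NumCPU())
}
```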

@hawwwdi (Author) commented Apr 11, 2024

> Are you using HTTP/S HelmRepos that contain a huge number of charts? That would explain the CPU usage. Can you please post the output of `flux stats -A` here?

Here is the output of `flux stats -A`:
[screenshot]

@stefanprodan (Member)

I suggest enabling Helm index caching to see if this reduces the resource usage: https://fluxcd.io/flux/installation/configuration/vertical-scaling/#enable-helm-repositories-caching

My guess is that those failing HelmCharts are one of the reasons for the high CPU usage; if they come from Git, then source-controller needs to compile them, and it does so continuously as they keep failing.

@hawwwdi (Author) commented Apr 11, 2024

> I suggest enabling Helm index caching to see if this reduces the resource usage: https://fluxcd.io/flux/installation/configuration/vertical-scaling/#enable-helm-repositories-caching
>
> My guess is that those failing HelmCharts are one of the reasons for the high CPU usage; if they come from Git, then source-controller needs to compile them, and it does so continuously as they keep failing.

Yes, perhaps enabling this feature or moving the Helm charts from Git to a Helm repository will reduce the CPU usage. However, this pull request is not intended to solve the high CPU usage problem; it aims to fix the throttling that occurs before the CPU limit is reached.

@stefanprodan (Member)

I fail to understand how this PR would help you in any way. In your case, you've set 12 CPUs as the limit and you've seen throttling at 6 cores. My understanding from the automaxprocs docs is that it defends against throttling when GOMAXPROCS is set above the limits, but in your case it's the opposite: it never got to use all 12 cores.

@stefanprodan (Member)

By the way, I see no issue with using automaxprocs in Flux, as it should help reduce throttling on nodes with lots of CPU cores when users don't bump the 1-core limit we set by default. Besides source-controller, we should add it to:

  • kustomize-controller
  • helm-controller

Could you please build source-controller from this PR and deploy it on your cluster to see if the throttling stops in your case?

@hawwwdi (Author) commented Apr 11, 2024

> I fail to understand how this PR would help you in any way. In your case, you've set 12 CPUs as the limit and you've seen throttling at 6 cores. My understanding from the automaxprocs docs is that it defends against throttling when GOMAXPROCS is set above the limits, but in your case it's the opposite: it never got to use all 12 cores.

According to the Go docs, the default value of GOMAXPROCS is the number of available logical CPUs. In a Kubernetes environment, that is the number of logical CPUs on the underlying node. In my case, the underlying node has 56 cores, which leads to GOMAXPROCS=56 in the source-controller process, well above the limit of 12 I had set. This causes CPU throttling before the CPU limit is reached.
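
A quick way to confirm this from inside the pod is to compare the runtime's view with the node's CPU count (a minimal sketch; the 56/12 figures above are specific to this cluster):

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Without automaxprocs or a GOMAXPROCS env var, the Go runtime sizes
	// GOMAXPROCS from the node's logical CPUs, not from the container's CPU limit.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // e.g. 56 on a 56-core node
	fmt.Println("NumCPU:    ", runtime.NumCPU())      // logical CPUs of the node
}
```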

@hawwwdi (Author) commented Apr 11, 2024

> By the way, I see no issue with using automaxprocs in Flux, as it should help reduce throttling on nodes with lots of CPU cores when users don't bump the 1-core limit we set by default. Besides source-controller, we should add it to:
>
>   • kustomize-controller
>   • helm-controller
>
> Could you please build source-controller from this PR and deploy it on your cluster to see if the throttling stops in your case?

Yes, I will report the result here.

@souleb (Member) commented Apr 11, 2024

I also think that this makes sense, thanks @hawwwdi.

We could also use the Downward API, e.g.:

```yaml
env:
  - name: GOMAXPROCS
    valueFrom:
      resourceFieldRef:
        resource: limits.cpu
```

We could save a dependency that way.

@stefanprodan (Member)

@souleb isn't GOMAXPROCS supposed to be set to an integer value? `limits.cpu` is in the format `1000m`.

@stefanprodan (Member) commented Apr 11, 2024

So it looks like Go handles the env var well and converts it: https://blog.howardjohn.info/posts/gomaxprocs/

We could set both GOMAXPROCS and GOMEMLIMIT in the flux2 manifests for all controllers. I will test if it works OK.

@hawwwdi (Author) commented Apr 11, 2024

> So it looks like Go handles the env var well and converts it: https://blog.howardjohn.info/posts/gomaxprocs/
>
> We could set both GOMAXPROCS and GOMEMLIMIT in the flux2 manifests for all controllers. I will test if it works OK.

That's a great approach. I will test this method and share the results here.

@souleb (Member) commented Apr 11, 2024

Yes, it converts it to an int32. automaxprocs does some conversion too, so I believe the behavior is the same.
See https://github.com/golang/go/blob/b6778c5230d554c1ba1a69b104513021467d32b2/src/runtime/proc.go#L825

Edit: I believe that if the conversion fails, it falls back to the number of cores.
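
The fallback is easy to observe with a small program started with different GOMAXPROCS values, e.g. once with GOMAXPROCS=12 and once with GOMAXPROCS=1000m (a hedged sketch of the runtime behavior described above, not part of this PR):

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

func main() {
	// An unparsable GOMAXPROCS value (such as "1000m") is ignored at startup,
	// and the runtime falls back to the number of logical CPUs.
	fmt.Println("GOMAXPROCS env:      ", os.Getenv("GOMAXPROCS"))
	fmt.Println("effective GOMAXPROCS:", runtime.GOMAXPROCS(0)) // 0 queries without changing it
	fmt.Println("logical CPUs:        ", runtime.NumCPU())
}
```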

@stefanprodan (Member) commented Apr 11, 2024

I have added the env vars and I've run the Flux benchmarks a couple of times. My conclusion is that it is safe to add the env vars to the main distribution, as I see no regression in terms of reconciliation speed.

@hawwwdi please open a PR in the flux2 repo and add this patch to manifests/install: https://github.com/fluxcd/flux-benchmark/pull/5/files. It would be great if you could post your test results for source-controller in the new PR.

@hawwwdi (Author) commented Apr 11, 2024

> I have added the env vars and I've run the Flux benchmarks a couple of times. My conclusion is that it is safe to add the env vars to the main distribution, as I see no regression in terms of reconciliation speed.
>
> @hawwwdi please open a PR in the flux2 repo and add this patch to manifests/install: https://github.com/fluxcd/flux-benchmark/pull/5/files. It would be great if you could post your test results for source-controller in the new PR.

Thank you for testing. Okay, I will open a PR in the flux2 repo with the test results.

@stefanprodan (Member)

Superseded by: fluxcd/flux2#4717
