
argocd-repo-server OOMKilled after Kubernetes version 1.30 upgrade #19740

Open
3 tasks done
KihyeokK opened this issue Aug 30, 2024 · 7 comments
Labels
bug Something isn't working component:kubernetes component:repo-server version:EOL Latest confirmed affected version has reached EOL

Comments

@KihyeokK

KihyeokK commented Aug 30, 2024

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

Recently, after a Kubernetes upgrade, all of our Argo CD components were reassigned to new nodes. Since the upgrade, the argocd-repo-server pod has been constantly getting OOMKilled. Metrics show no sign of sustained high memory usage, so memory appears to be spiking briefly. We are struggling to find the cause of these memory spikes, and why they started specifically after the Kubernetes upgrade.
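For anyone debugging similar symptoms, the kill reason can be confirmed from each container's last termination state. A sketch, assuming the default argocd namespace and the standard app.kubernetes.io/name label on the repo-server pod:

```shell
# Print each container's last termination reason (e.g. OOMKilled) and exit
# code for the repo-server pod. Namespace and label are assumptions from a
# default install; adjust to match your deployment.
kubectl -n argocd get pods -l app.kubernetes.io/name=argocd-repo-server \
  -o jsonpath='{range .items[*].status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{" exitCode="}{.lastState.terminated.exitCode}{"\n"}{end}'
```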

More info:

  • We run the argocd-repo-server pod with a custom CMP plugin sidecar container. That sidecar was also getting OOMKilled (with the same memory spikes) alongside the main argocd-repo-server container.
  • We suspected that restarting all Argo CD components caused the git repos to be re-fetched into the local cache, but we have restarted all components before without hitting this issue. Note that we mount two separate emptyDir volumes at /tmp, one for the argocd-repo-server container and one for the CMP sidecar container.

Is there any reason why argocd-repo-server might see memory spikes after all Argo CD components are restarted?
Any insights would be appreciated!

To Reproduce

  • Reassign all Argo CD components to different nodes in a Kubernetes cluster running version 1.30. After the original incident right after the upgrade to 1.30, we were able to reproduce the issue by cordoning nodes and deleting the Argo CD pods so that they would be rescheduled onto other nodes. To clarify, all worker nodes and control plane nodes were running version 1.30.
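The reproduction steps above can be sketched as follows (node names are placeholders; the app.kubernetes.io/part-of=argocd label is the standard one on a default install):

```shell
# Cordon the nodes currently hosting the Argo CD pods so replacements
# must schedule elsewhere, then delete the pods to force rescheduling.
kubectl cordon <node-name>   # repeat for each node running Argo CD pods
kubectl -n argocd delete pods -l app.kubernetes.io/part-of=argocd

# Watch the rescheduled pods for OOMKilled restarts.
kubectl -n argocd get pods -w
```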

Expected behavior

The argocd-repo-server pod's main container and CMP sidecar container should not be OOMKilled (no memory spikes).

Screenshots

Version

2.7.14

Logs
Logs from the argocd-repo-server container before it was OOMKilled:

time="2024-08-26T01:32:43Z" level=info msg="ArgoCD Repository Server is starting" built="2023-09-07T16:50:42Z" commit=<REDACTED> port=8081 version=v2.7.14+a40c95a.dirty
time="2024-08-26T01:32:43Z" level=info msg="Generating self-signed TLS certificate for this session"
time="2024-08-26T01:32:44Z" level=info msg="Initializing GnuPG keyring at /app/config/gpg/keys"
time="2024-08-26T01:32:44Z" level=info msg="gpg --no-permission-warning --logger-fd 1 --batch --gen-key /tmp/gpg-key-recipe2057347276" dir= execID=<REDACTED>
time="2024-08-26T01:32:44Z" level=info msg=Trace args="[gpg --no-permission-warning --logger-fd 1 --batch --gen-key /tmp/gpg-key-recipe2057347276]" dir= operation_name="exec gpg" time_ms=191.891694
time="2024-08-26T01:32:44Z" level=info msg="Populating GnuPG keyring with keys from /app/config/gpg/source"
time="2024-08-26T01:32:44Z" level=info msg="gpg --no-permission-warning --list-public-keys" dir= execID=<REDACTED>
time="2024-08-26T01:32:44Z" level=info msg=Trace args="[gpg --no-permission-warning --list-public-keys]" dir= operation_name="exec gpg" time_ms=7.826751999999999
time="2024-08-26T01:32:44Z" level=info msg="gpg --no-permission-warning -a --export <REDACTED>" dir= execID=<REDACTED>
time="2024-08-26T01:32:44Z" level=info msg=Trace args="[gpg --no-permission-warning -a --export <REDACTED>]" dir= operation_name="exec gpg" time_ms=4.347631
time="2024-08-26T01:32:44Z" level=info msg="gpg-wrapper.sh --no-permission-warning --list-secret-keys <REDACTED>" dir= execID=<REDACTED>
time="2024-08-26T01:32:44Z" level=info msg=Trace args="[gpg-wrapper.sh --no-permission-warning --list-secret-keys <REDACTED>]" dir= operation_name="exec gpg-wrapper.sh" time_ms=6.43944
time="2024-08-26T01:32:44Z" level=info msg="Loaded 0 (and removed 0) keys from keyring"
time="2024-08-26T01:32:44Z" level=info msg="argocd-repo-server is listening on [::]:8081"
time="2024-08-26T01:32:44Z" level=info msg="Starting GPG sync watcher on directory '/app/config/gpg/source'"
time="2024-08-26T01:32:55Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:32:55Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:32:55Z" grpc.time_ms=21.333 span.kind=server system=grpc
time="2024-08-26T01:32:55Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:32:55Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:32:55Z" grpc.time_ms=7.456 span.kind=server system=grpc
time="2024-08-26T01:32:56Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:32:56Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:32:56Z" grpc.time_ms=157.251 span.kind=server system=grpc
time="2024-08-26T01:32:57Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:32:57Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:32:57Z" grpc.time_ms=18.933 span.kind=server system=grpc
time="2024-08-26T01:33:12Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:33:12Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:33:12Z" grpc.time_ms=131.958 span.kind=server system=grpc
time="2024-08-26T01:33:12Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:33:12Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:33:12Z" grpc.time_ms=12.935 span.kind=server system=grpc
time="2024-08-26T01:33:13Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:33:13Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:33:13Z" grpc.time_ms=5.582 span.kind=server system=grpc
time="2024-08-26T01:33:13Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:33:13Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:33:13Z" grpc.time_ms=5.265 span.kind=server system=grpc
time="2024-08-26T01:33:13Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:33:13Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:33:13Z" grpc.time_ms=7.226 span.kind=server system=grpc
time="2024-08-26T01:33:13Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:33:13Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:33:13Z" grpc.time_ms=6.707 span.kind=server system=grpc
time="2024-08-26T01:33:13Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:33:13Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:33:13Z" grpc.time_ms=3.05 span.kind=server system=grpc
time="2024-08-26T01:33:13Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:33:13Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:33:13Z" grpc.time_ms=14.005 span.kind=server system=grpc
time="2024-08-26T01:33:13Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:33:13Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:33:13Z" grpc.time_ms=10.393 span.kind=server system=grpc
time="2024-08-26T01:33:13Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:33:13Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:33:13Z" grpc.time_ms=17.607 span.kind=server system=grpc
time="2024-08-26T01:33:13Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:33:13Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:33:13Z" grpc.time_ms=19.055 span.kind=server system=grpc
time="2024-08-26T01:33:13Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:33:13Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:33:13Z" grpc.time_ms=26.055 span.kind=server system=grpc
time="2024-08-26T01:33:13Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:33:13Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:33:13Z" grpc.time_ms=4.088 span.kind=server system=grpc
time="2024-08-26T01:33:13Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:33:13Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:33:13Z" grpc.time_ms=120.873 span.kind=server system=grpc
time="2024-08-26T01:33:13Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:33:13Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:33:13Z" grpc.time_ms=121.124 span.kind=server system=grpc
time="2024-08-26T01:33:13Z" level=error msg="finished unary call with code Unknown" error="Unable to resolve 'CCT-1232' to a commit SHA" grpc.code=Unknown grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:33:13Z" grpc.time_ms=124.567 span.kind=server system=grpc
time="2024-08-26T01:33:13Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:33:13Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:33:13Z" grpc.time_ms=128.193 span.kind=server system=grpc
time="2024-08-26T01:33:13Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:33:13Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:33:13Z" grpc.time_ms=131.709 span.kind=server system=grpc
time="2024-08-26T01:33:13Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:33:13Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:33:13Z" grpc.time_ms=135.036 span.kind=server system=grpc
time="2024-08-26T01:33:13Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:33:13Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:33:13Z" grpc.time_ms=134.952 span.kind=server system=grpc
time="2024-08-26T01:33:13Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
.....
..... omitted
.....
time="2024-08-26T01:33:20Z" level=info msg="manifest cache hit: &ApplicationSource{<REDACTED>
time="2024-08-26T01:33:20Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-08-26T01:33:20Z" grpc.time_ms=15.656 span.kind=server system=grpc
time="2024-08-26T01:33:24Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=Check grpc.service=grpc.health.v1.Health grpc.start_time="2024-08-26T01:33:24Z" grpc.time_ms=0.052 span.kind=server system=grpc
@KihyeokK KihyeokK added the bug Something isn't working label Aug 30, 2024
@Jack-R-lantern
Contributor

Can you confirm whether this issue reproduces on the latest version of Argo CD?

@tooptoop4

Can you try Argo CD 2.12.x?

@KihyeokK
Author

KihyeokK commented Sep 4, 2024

@Jack-R-lantern @tooptoop4 Unfortunately, we cannot upgrade to a more recent Argo CD version due to some blockers. Are there any possible reasons you can see for argocd-repo-server to be OOMKilled specifically after the Kubernetes 1.30 upgrade? We have not experienced similar issues in previous Kubernetes upgrades or in previous cases where all Argo CD pods were restarted.

@christianh814
Member

I know others have said this, but version 2.7.14 is really old and no longer supported.

See the testing matrix for supported versions https://argo-cd.readthedocs.io/en/stable/operator-manual/tested-kubernetes-versions/

@wanghong230
Member

My guess is that many API calls requested repo content at the same time, each with a cache miss.

You can try setting the parallelism limit to a smaller number to see if it helps, but it will also limit the repo server's throughput.

https://github.com/argoproj/argo-cd/blob/master/cmd/argocd-repo-server/commands/argocd_repo_server.go#L225
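For example, the limit the linked code binds can be set through the repo server's environment; a sketch, where `ARGOCD_REPO_SERVER_PARALLELISM_LIMIT` maps to the `--parallelismlimit` flag and 10 is an arbitrary illustrative value:

```shell
# Cap concurrent manifest generations to bound peak memory.
# 10 is an illustrative value; the default (0) means unlimited.
kubectl -n argocd set env deployment/argocd-repo-server \
  ARGOCD_REPO_SERVER_PARALLELISM_LIMIT=10
```

Lowering this trades burst throughput for a smaller worst-case memory footprint, since fewer manifest generations run at once.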

@rumstead
Member

rumstead commented Sep 5, 2024

To pile on, compression has also been enabled by default since v2.8.

#13458
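To illustrate why compression helps: generated manifests are highly repetitive YAML, so gzip typically shrinks cached payloads by an order of magnitude. A self-contained sketch with synthetic data (not Argo CD's actual cache path):

```shell
# Build a synthetic "manifest" of repeated YAML blocks, then compare raw
# vs gzip-compressed sizes. Real cached manifests are similarly
# repetitive, which is why compression cuts Redis payload sizes sharply.
manifest=$(for i in $(seq 1 200); do
  printf 'apiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: app-%d\n' "$i"
done)
raw=$(printf '%s' "$manifest" | wc -c | tr -d ' ')
gz=$(printf '%s' "$manifest" | gzip -c | wc -c | tr -d ' ')
echo "raw=${raw} bytes, gzip=${gz} bytes"
```

Smaller cached values mean less Redis memory and less data copied per cache hit, which lowers repo-server memory pressure during bursts.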


@KihyeokK
Author

KihyeokK commented Sep 9, 2024

@christianh814 @wanghong230 @rumstead Thank you for all the insights! Much appreciated 🙏 Theoretically, enabling Redis gzip compression even on 2.7.14 should help with argocd-repo-server memory usage, right? It's just not enabled by default in that version?

@andrii-korotkov-verkada andrii-korotkov-verkada added the version:EOL Latest confirmed affected version has reached EOL label Nov 11, 2024