ArgoCD 2.1.8 repo server filling up helm cache #8773
Comments
@vvoinea-gpsw there were some small changes around chart caching in recent 2.1.x versions. Are you able to upgrade to 2.1.12? I don't have much confidence that it will solve the problem, but I think for relatively little effort it's a good starting point.
@crenshaw-dev I just jumped ahead to 2.2.0 as I saw some cache changes in there too. I don't understand why nobody else is seeing this?
@vvoinea-gpsw I think the caching changes I'm thinking of were later in the 2.2 series. If you're able to go to the latest patch, that would be good, but I'm not really hopeful about that change. I'll need to come back to this next week because I'll need to read the caching code to get a good feel for how to reproduce the issue.
Thank you @crenshaw-dev for looking into this. I'll also be off next week but am looking forward to seeing what you find. I will try to upgrade to the latest version and look for any rotation of the caches.
Hi @crenshaw-dev, I upgraded to the latest release, 2.3.3, where I still see a large number of these cached files. Please let me know if you have had time to replicate this issue.
Hi, we experienced this issue last week. In our case, we were changing an expired private git token and had to go in and restart the repo server instances. After the restart, the emptyDir volume started filling with new cached files. We noticed that the repo server had error logs about some helm charts not being found (specifically redis, external-dns, etc. from Bitnami helm repositories). We were depending on old chart versions which seem to have already been removed from the (updated) index in the repos, so the repo server wasn't able to find them (while they were actually already deployed by Argo in the cluster). After bumping the applications to available helm chart versions, we stopped seeing these errors, the files stopped appearing, and the volume stopped filling excessively.
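For anyone hitting the same symptom, one quick way to confirm whether a pinned chart version is still published in the upstream index is a `helm search` against the repo. A minimal sketch; the chart name and version below are placeholders, not taken from this thread:

```bash
# Refresh the local copy of the repo index and look for the pinned version.
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm search repo bitnami/redis --versions | grep 16.8.0 || echo "version not in index"
```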
So it seems to be limited to when the Helm chart no longer exists. Sounds like there's some cleanup logic that gets missed in that case. Please 👍 this issue if it's affecting you. I don't have time to investigate at the moment, but more thumbs up will help the issue get attention. :-)
Encountered the same problem with a chart removed by Bitnami.
I believe I am hitting this issue as well; as this dir fills up, my memory usage gets really high too. Doing a rolling restart of the repo server helps for a while. Thanks to @crenshaw-dev for all your help along the way!!! CNCF Slack thread: https://cloud-native.slack.com/archives/C01TSERG0KZ/p1655221266746929 Running on v2.3.3.
@crenshaw-dev this really needs some TLC. It occurs when at least one helm app is in an unknown state. for my group, we are self service so this happens from time to time, especially as people learn. The below is my memory usage...Im having to do a |
@jmmclean yikes. I'm still short on time. If anyone has repo-server logs mentioning "-index.yaml" and "-charts.txt," that would really help me pinpoint the problem code and get a patch out sooner. Bonus points for logs with debug logging enabled.
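If you want to capture those logs, here is a minimal sketch, assuming the default `argocd` namespace and the `reposerver.log.level` key in `argocd-cmd-params-cm` (double-check both against your install):

```bash
# Turn on debug logging for the repo server (config key assumed; verify in your argocd-cmd-params-cm).
kubectl -n argocd patch configmap argocd-cmd-params-cm --type merge \
  -p '{"data":{"reposerver.log.level":"debug"}}'
kubectl -n argocd rollout restart deployment argocd-repo-server

# Pull the log lines that mention the cached index/chart files.
kubectl -n argocd logs deploy/argocd-repo-server | grep -E -- '-index\.yaml|-charts\.txt'
```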
Hello,
@crenshaw-dev I enabled debug logging and gathered repo-server logs (scrubbed of company PII).
A little more debugging: I exec'd into the repo server and dug around the helm dir. It looks like this dir just keeps filling up without any garbage cleanup.
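For anyone who wants to repeat that inspection, a rough sketch assuming the default `argocd` namespace and the stock `/helm-working-dir` mount path:

```bash
# Check total size and the most recently written cache files inside the repo-server pod.
kubectl -n argocd exec deploy/argocd-repo-server -- sh -c \
  'du -sh /helm-working-dir/repository && ls -lt /helm-working-dir/repository | head -n 20'
```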
I'm not sure why I didn't notice it before, but these files are being cached by Helm and then not cleaned up. So maybe a Helm bug? Or maybe Helm intentionally keeps those files around. A potential workaround would be to temporarily set the Helm working dir to some temp dir and then delete the temp dir after manifest generation. But then we might be missing out on some caching benefits of using a shared working dir. Gonna dig into Helm code and see if it's intentionally not deleting these files.
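Until there's a proper fix, a crude stop-gap is to prune stale cache files by hand. The sketch below is only an illustration; the file patterns and 120-minute threshold are assumptions, not something prescribed in this thread:

```bash
# Delete cached index/chart files older than ~2 hours inside the repo-server pod.
kubectl -n argocd exec deploy/argocd-repo-server -- \
  find /helm-working-dir/repository -type f \
  \( -name '*-index.yaml' -o -name '*-charts.txt' \) -mmin +120 -delete
```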
Found and tested the fix: helm/helm#11172. Evidence from running the custom helm build in my repo server:
The file was created and then quickly deleted. If this is urgent enough, we could build a Helm fork to bundle with Argo CD until they release the fix. Sounds like most folks have found workarounds though?
I do not have a workaround, but I can wait until a fix is released :) OOM killing basically resolves the issue.
Sweet! You could also build a custom Argo CD image and copy in the custom Helm binary. That's a lot of work, though, for something that can kinda fix itself with a restart.
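A sketch of that custom-image route, assuming a patched helm binary (built with the fix from helm/helm#11172) is available locally as `./helm`; the base tag, target registry, and in-image helm path are assumptions to verify against your setup:

```bash
# Build an Argo CD image that carries a patched helm binary.
cat > Dockerfile <<'EOF'
FROM quay.io/argoproj/argocd:v2.3.3
USER root
COPY ./helm /usr/local/bin/helm
USER 999
EOF
docker build -t registry.example.com/argocd:v2.3.3-helm-fix .
```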
Agreed... I can wait :) Y'all do really well with cutting releases for Argo, so I'm sure it won't take long! Thanks for digging in; I recognize you need your PR to be merged into the Helm repo, and then you'll likely need to upgrade the Helm version in Argo (chore).
Checklist:
argocd version
Describe the bug
After upgrading from ArgoCD 1.8 to 2.1.8, we are seeing the argocd-repo-server pods filling up the emptyDir helm-working-dir to around 100 GB in 2 days -> the node disk is being filled (emptyDir maps to host disk or RAM) and pods are getting evicted.
Tried to limit the cache using the obvious parameters:
`reposerver.default.cache.expiration=1h`
or
`reposerver.repo.cache.expiration=2h`
But neither can limit the creation of new `<some-hash>charts.yaml` and `<some-hash>index.yaml` files in the `/helm-working-dir/repository` directory on the pod every 3 minutes. Since emptyDir can also map to RAM, we have also seen higher memory usage, similar to #8698.
To Reproduce
Install 2.1.8 and monitor disk and memory usage
Expected behavior
There would be some parameter to limit the rotation of these cached charts.
Also, if the volume is known to fill up this quickly (due to a perfect storm of a large chart repo and a small node disk), could we limit the volume size by changing the out-of-the-box volume setup?
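Until the defaults change, one interim option is capping the volume with an `emptyDir.sizeLimit`. A minimal sketch, assuming the volume in the repo-server Deployment is named `helm-working-dir` and using an arbitrary 2Gi cap:

```bash
# Strategic-merge patch that adds a size limit to the helm working dir volume.
kubectl -n argocd patch deployment argocd-repo-server --type strategic -p \
  '{"spec":{"template":{"spec":{"volumes":[{"name":"helm-working-dir","emptyDir":{"sizeLimit":"2Gi"}}]}}}}'
```

Note that exceeding the limit evicts the pod, so this protects the node rather than fixing the cache growth itself.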
Screenshots
Version
Logs