Applications are stuck in refreshing #20785
Comments
/assign @alexmt @crenshaw-dev |
Try to upgrade to 2.13; there have been major performance improvements to refresh times, which in my case reduced refresh times for some applications from 30-60 min on a medium cluster to < 1 min. |
Please let us know the results in 2.13. |
I have found why the repo-server hangs: the reason is that git fetch hangs, and this goroutine holds the mutex.
|
Thanks, I will give 2.13 a try. |
There should be an exec timeout after which it should terminate. Sometimes waiting several minutes for a git fetch is unavoidable, though that should rarely happen. |
I agree with you. A parameter should be exposed for clients to configure. |
You can configure it using an env variable on the repo-server manifest, e.g.
The default is 1m30s. |
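For illustration, a minimal sketch of what that override might look like on the argocd-repo-server Deployment, assuming the variable in question is ARGOCD_EXEC_TIMEOUT (the repo-server's timeout for external commands such as git fetch); the names and values below are examples, not taken from this thread:

```yaml
# Sketch only: raise the repo-server's exec timeout via an env variable.
# ARGOCD_EXEC_TIMEOUT bounds how long external commands (e.g. git fetch)
# may run before being killed; the default is 1m30s.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
spec:
  template:
    spec:
      containers:
        - name: argocd-repo-server
          env:
            - name: ARGOCD_EXEC_TIMEOUT
              value: "3m"   # longer timeouts also keep the repo lock held longer
```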
The root cause is that the SSH command (started by git) gets stuck, causing the goroutine holding the lock in the repo-server to be stuck. Other goroutines in the repo-server cannot obtain the lock to continue generating manifests, which eventually causes application refreshes to get stuck.
There are several solutions:
@alexmt @crenshaw-dev @jessesuen Please share some suggestions. |
I'm observing very similar behaviour on our Argo instance, which started seemingly out of nowhere last week. For context, we're running a single instance (…). Ever since the issue started, our ArgoCD logs have had a steady stream of log entries for:
and occasionally you can catch this in the UI, as it shows up as an application reconciliation error. This often causes the apps to take much longer than normal to reach a Synced state after a change in Git. There's some very peculiar behaviour in the metrics for …, and I ran … . Similar to the comment above me, I also managed to catch the long-running process using …
I'm at a bit of a loss on how to debug this further 😞 The issue appeared early in the morning, when I know that no changes were made to either the underlying infrastructure or our GitOps repo (I double-checked this). |
I suggest increasing the exec timeout via the env variable to 3m or even 5m. |
Sounds like the key question is "why isn't the exec timeout killing the stuck git command?" |
We raised our timeout value to 3 minutes and it seemed to fix the problem. Thanks for the suggestion 🙏 I guess for me the key question would be: what makes the git fetch take so long all of a sudden? In our case the issue appeared out of nowhere, outside working hours, when no changes were made, and it was consistently broken for a few days. So it's not even a case of random slowness 🤔 |
I think there are two separate issues. I am not observing the repo-server hang, and I am not using the SSH protocol for Git. I am observing the app controllers hang, though, and /clusters stops reporting statuses. Instead of a Successful status next to the cluster there is nothing, and no errors. Nothing in the logs. Restarting the app controllers helps for a couple of hours. I think it has been happening for a while now, hard to tell when it started, but I am running v2.13.3 at the moment. When that happens, workqueue_depth on the affected app controller goes to 0, but probes are all passing. CPU usage drops to next to nothing, and app count/resource count/API resource count/events count also drop to next to nothing on the affected controller. Feels like a deadlock or something. |
Happened again. Digging through the logs, I've found one more clue: every time a controller enters this broken state, there are thousands of errors in the logs like the ones below:
The pod with the controller was not restarted/rescheduled and continues to run. It looks to me like the threads watching certain clusters fail and are not restarted. |
I think this is the actual issue I am running into: #15464. There are other (now "resolved") related issues that I tried to sum up in this comment: #15464 (comment) |
@ivan-cai, may I know how you got these git error details? I am going through exactly the same issue (one repo-server misbehaves on a goroutine, then everything comes to a standstill). The biggest concern is that there is nothing in the logs; CPU, memory, sync, and reconciliation all come down to almost zero. I am using v2.10.7 and have been struggling with this for some time, as there are no helpful debug logs. |
Hard to say how helpful this would be for everyone, but I wanted to share my experience with my current customer, who hit the applications-stuck-refreshing issue on ArgoCD v2.11.6+089247d. We did a lot of tuning on the application-controller and repo-server, but the most impactful change was moving the repo-server manifest cache to an ephemeral in-memory volume and adding an env variable so the repo-server consumes that volume (see the sketch below). This massively sped up their ArgoCD sync processes and we stopped having any bottlenecks on the repo-server side. Workqueue depth behaviour has improved massively after that change (no longer stuck and unprocessed); before, they had to constantly restart ArgoCD to unblock it, and now it is just restarted every 3h in case this bug (#14224) happens.
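A hedged sketch of that kind of setup (not the poster's actual manifest): mount an in-memory emptyDir and point the repo-server's working directory at it. This assumes the repo-server honours TMPDIR for where it checks out and caches repositories; the mount path and size below are placeholders.

```yaml
# Sketch only: in-memory scratch volume for the repo-server's repo cache.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
spec:
  template:
    spec:
      containers:
        - name: argocd-repo-server
          env:
            - name: TMPDIR            # assumption: repo-server uses the OS temp dir
              value: /repo-cache
          volumeMounts:
            - name: repo-cache
              mountPath: /repo-cache
      volumes:
        - name: repo-cache
          emptyDir:
            medium: Memory            # tmpfs; contents count against pod memory
            sizeLimit: 2Gi            # placeholder; size for your repositories
```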
|
@vkg23 You can see this. |
@ivan-cai - I am not using Git over SSH, only HTTP, for all 5000 apps. The ArgoCD issue existed in 2.10.x and exists in 2.13.2 as well (upgraded today). @jbartyze-rh - I attempted the ephemeral volume, however the issue reappears after a short duration. May I know how you got the attached stack trace? Besides the git client, is there a way to figure out which repo is probably contributing to it? I have already attempted tuning the git timeout, exec timeout, and other params. |
@vkg23 here is my whole ArgoCD CR. There is a lot of tuning in there, and it is able to handle around 850 apps at the moment on ArgoCD version v2.11.6+089247d. On the ArgoCD Application side we also use argocd.argoproj.io/manifest-generate-paths: . to lessen the burden on the repo-server in our monorepo setup (a minimal example is sketched below). We are not using any CPU limits, to avoid CPU throttling on ArgoCD components. Redis compression and jitter are enabled to lower the peaks, plus some timeout extensions as well. Sadly I am not able to help with the stack trace, but I think that question was directed to @ivan-cai
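For reference, a minimal example of where that annotation sits on an Application; the name, repo URL, and paths are placeholders. With manifest-generate-paths set, webhook-driven changes only trigger manifest regeneration when files under the listed path(s), relative to the source path, change, which eases repo-server load in a monorepo.

```yaml
# Placeholder Application showing the manifest-generate-paths annotation.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app
  annotations:
    argocd.argoproj.io/manifest-generate-paths: .
spec:
  project: default
  source:
    repoURL: https://example.com/org/monorepo.git
    path: apps/example-app
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: example
```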
|
Describe the bug
I have 3000-5000 Applications. Sometimes Applications get stuck in refreshing, and only restarting the application-controller or repo-server solves it; this happens about 2-3 times per day. Applications are triggered to sync by a GitLab webhook.
I have:
I have captured the repo-server goroutine profile, which looks like this:

My ArgoCD version is 2.12.4
Some of my config:
Logs
The Application Controller is comparing app state and cannot get generated manifests from the repo-server.