
Applications are stuck in refreshing #20785

Open
ivan-cai opened this issue Nov 13, 2024 · 22 comments
Labels: bug (Something isn't working), more-information-needed (Further information is requested), version:2.12 (Latest confirmed affected version is 2.12)

Comments

@ivan-cai
Contributor

ivan-cai commented Nov 13, 2024

Describe the bug
I have 3,000-5,000 Applications. Sometimes Applications get stuck in refreshing, and only restarting the application-controller or repo-server resolves it. This happens about 2-3 times per day. Applications are triggered to sync by a GitLab webhook.
My setup:

  • application-controller: a Deployment with 1 replica, not using Dynamic Cluster Distribution; the app_reconcile_queue depth is large
  • 5 repo-server replicas; one replica accumulates a very large number of goroutines, and restarting it resolves the hang
    [screenshots: repo-server metrics]

I captured a goroutine profile from the repo-server; it looks like this:
[screenshot: repo-server goroutine profile]

My ArgoCD version is 2.12.4

Some of my config:

  controller.operation.processors: "100"
  controller.repo.server.timeout.seconds: "120"
  controller.status.processors: "300"
  reposerver.git.attempts.count: "5"
  reposerver.parallelism.limit: "30"
  server.grpc.max.size.mb: "200"
  server.k8sclient.retry.base.backoff: "200"
  server.webhook.parallelism.limit: "50"
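
For reference, these look like argocd-cmd-params-cm keys; a minimal sketch of how such settings would be applied, assuming the standard ConfigMap:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: argocd-cmd-params-cm
      namespace: argocd
      labels:
        app.kubernetes.io/part-of: argocd
    data:
      # Processor counts for the application-controller's work queues.
      controller.status.processors: "300"
      controller.operation.processors: "100"
      # How long the controller waits on the repo-server before timing out.
      controller.repo.server.timeout.seconds: "120"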

Expected behavior

Applications should not get stuck in refreshing.

Version

v2.12.4 tag, with this commit: https://github.com/argoproj/argo-cd/commit/95be90b5f9f5acebca46e3dcc3df9355307f6285

Logs

The Application Controller is comparing app state and cannot get the generated manifests from the repo-server.
@ivan-cai added the bug label Nov 13, 2024
@ivan-cai
Contributor Author

ivan-cai commented Nov 13, 2024

/assign @alexmt @crenshaw-dev

@andrii-korotkov-verkada
Contributor

Try upgrading to 2.13; there have been major performance improvements to refresh times, which in my case reduced refresh times for some applications from 30-60 min on a medium cluster to < 1 min.

@andrii-korotkov-verkada added the version:2.12 label Nov 13, 2024
@andrii-korotkov-verkada
Contributor

Please, let us know the results in 2.13.

@andrii-korotkov-verkada added the more-information-needed label Nov 14, 2024
@ivan-cai
Contributor Author

I have found why the repo-server hangs: a git fetch hangs, and that goroutine holds the mutex.

goroutine 10514931 [chan receive, 17 minutes]:
github.com/argoproj/pkg/exec.RunCommandExt(0xc00170e420, {0x14f46b0400, 0x0, {0xf, 0x1}, 0x0, 0x0})
        /go/pkg/mod/github.com/argoproj/pkg@v0.13.7-0.20230626144333-d56162821bd1/exec/exec.go:139 +0xd5d
github.com/argoproj/argo-cd/v2/util/exec.RunWithExecRunOpts(0xc00170e420, {0x0?, {0x0?, 0x0?}, 0x40?, 0x1b?})
        /go/src/github.com/argoproj/argo-cd/util/exec/exec.go:59 +0x7d5
github.com/argoproj/argo-cd/v2/util/git.(*nativeGitClient).runCmdOutput(0xc000897ab0, 0xc00170e420, {0x8?, 0x8c?})
        /go/src/github.com/argoproj/argo-cd/util/git/client.go:887 +0x5f5
github.com/argoproj/argo-cd/v2/util/git.(*nativeGitClient).runCredentialedCmd(0xc000897ab0, {0xc001dd8c08, 0x5, 0x5})
        /go/src/github.com/argoproj/argo-cd/util/git/client.go:843 +0x413
github.com/argoproj/argo-cd/v2/util/git.(*nativeGitClient).fetch(0xc000cc6340?, {0x0?, 0x0?})
        /go/src/github.com/argoproj/argo-cd/util/git/client.go:356 +0x192
github.com/argoproj/argo-cd/v2/util/git.(*nativeGitClient).Fetch(0xc000897ab0, {0x0?, 0x0?})
        /go/src/github.com/argoproj/argo-cd/util/git/client.go:383 +0x9b
github.com/argoproj/argo-cd/v2/reposerver/repository.checkoutRevision({0x55af400, 0xc000897ab0}, {0xc001629d40, 0x28}, 0x1)
        /go/src/github.com/argoproj/argo-cd/reposerver/repository/repository.go:2440 +0x222
github.com/argoproj/argo-cd/v2/reposerver/repository.(*Service).checkoutRevision(0xc000b70b40, {0x55af400, 0xc000897ab0}, {0xc001629d40, 0x28}, 0x1)
        /go/src/github.com/argoproj/argo-cd/reposerver/repository/repository.go:2418 +0x75
github.com/argoproj/argo-cd/v2/reposerver/repository.(*Service).GetGitDirectories.func1()
        /go/src/github.com/argoproj/argo-cd/reposerver/repository/repository.go:2670 +0x3d
github.com/argoproj/argo-cd/v2/reposerver/repository.(*repositoryLock).Lock(0xc001045060, {0xc001336300, 0x36}, {0xc001629d40, 0x28}, 0x1, 0xc001dd9208)
        /go/src/github.com/argoproj/argo-cd/reposerver/repository/lock.go:55 +0x2e5
github.com/argoproj/argo-cd/v2/reposerver/repository.(*Service).GetGitDirectories(0xc000b70b40, {0x3612ec0?, 0x554f420?}, 0xc000b2d770)
        /go/src/github.com/argoproj/argo-cd/reposerver/repository/repository.go:2669 +0x40f
github.com/argoproj/argo-cd/v2/reposerver/apiclient._RepoServerService_GetGitDirectories_Handler.func1({0x558f168?, 0xc002381a40?}, {0x3dc1120?, 0xc000b2d770?})
        /go/src/github.com/argoproj/argo-cd/reposerver/apiclient/repository.pb.go:3085 +0xcb
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ErrorSanitizerUnaryServerInterceptor.func3({0x558f168, 0xc002381a10}, {0x3dc1120, 0xc000b2d770}, 0x0?, 0xc0022f76e0)
        /go/src/github.com/argoproj/argo-cd/util/grpc/sanitizer.go:24 +0x71
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5.1({0x558f168?, 0xc002381a10?}, {0x3dc1120?, 0xc000b2d770?})
        /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.4.0/chain.go:48 +0x45
github.com/argoproj/argo-cd/v2/reposerver.NewServer.PanicLoggerUnaryServerInterceptor.func2({0x558f168?, 0xc002381a10?}, {0x3dc1120?, 0xc000b2d770?}, 0x4009594?, 0x11?)
        /go/src/github.com/argoproj/argo-cd/util/grpc/grpc.go:33 +0x8c
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5.1({0x558f168?, 0xc002381a10?}, {0x3dc1120?, 0xc000b2d770?})
        /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.4.0/chain.go:48 +0x45
github.com/grpc-ecosystem/go-grpc-prometheus.init.(*ServerMetrics).UnaryServerInterceptor.func2({0x558f168, 0xc002381a10}, {0x3dc1120, 0xc000b2d770}, 0x0?, 0xc0023a0040)
        /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107 +0x7d
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5.1({0x558f168?, 0xc002381a10?}, {0x3dc1120?, 0xc000b2d770?})
        /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.4.0/chain.go:48 +0x45
github.com/grpc-ecosystem/go-grpc-middleware/logging/logrus.UnaryServerInterceptor.func1({0x558f168, 0xc002381920}, {0x3dc1120, 0xc000b2d770}, 0xc001c5e060, 0xc0023a0080)
        /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.4.0/logging/logrus/server_interceptors.go:31 +0xfe
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5.1({0x558f168?, 0xc002381920?}, {0x3dc1120?, 0xc000b2d770?})
        /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.4.0/chain.go:48 +0x45
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1({0x558f168, 0xc002381860}, {0x3dc1120, 0xc000b2d770}, 0xc001c5e060, 0xc0023a00c0)
        /go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.46.1/interceptor.go:326 +0x5a4
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5({0x558f168, 0xc002381860}, {0x3dc1120, 0xc000b2d770}, 0xc001c5e060, 0x78?)
        /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.4.0/chain.go:53 +0x123
github.com/argoproj/argo-cd/v2/reposerver/apiclient._RepoServerService_GetGitDirectories_Handler({0x3e19820, 0xc000b70b40}, {0x558f168, 0xc002381860}, 0xc00126b280, 0xc001297140)
        /go/src/github.com/argoproj/argo-cd/reposerver/apiclient/repository.pb.go:3087 +0x143
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0003bc5a0, {0x558f168, 0xc0023817a0}, {0x55a2260, 0xc000d83a00}, 0xc0005e2a20, 0xc001297440, 0x79837e8, 0x0)
        /go/pkg/mod/google.golang.org/grpc@v1.59.0/server.go:1343 +0xdd1
google.golang.org/grpc.(*Server).handleStream(0xc0003bc5a0, {0x55a2260, 0xc000d83a00}, 0xc0005e2a20)
        /go/pkg/mod/google.golang.org/grpc@v1.59.0/server.go:1737 +0xc47
google.golang.org/grpc.(*Server).serveStreams.func1.1()
        /go/pkg/mod/google.golang.org/grpc@v1.59.0/server.go:986 +0x86
created by google.golang.org/grpc.(*Server).serveStreams.func1 in goroutine 10514930
        /go/pkg/mod/google.golang.org/grpc@v1.59.0/server.go:997 +0x136

@ivan-cai
Contributor Author

Please, let us know the results in 2.13.

Thanks, I will give 2.13 a try.

@andrii-korotkov-verkada
Contributor

There should be an exec timeout after which it should terminate. Sometimes waiting several minutes for a git fetch is unavoidable, though that should rarely happen.

@ivan-cai
Contributor Author

There should be an exec timeout after which it should terminate. Sometimes waiting several minutes for a git fetch is unavoidable, though that should rarely happen.

I agree. A parameter should be exposed so users can configure this.

@andrii-korotkov-verkada
Contributor

You can configure it using an env variable on the repo-server manifest, e.g.

          env:
            - name: ARGOCD_EXEC_TIMEOUT
              value: "5m"

The default is 1m30s.
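
For example, one way to apply it (assuming the default argocd namespace):

    kubectl -n argocd set env deployment/argocd-repo-server ARGOCD_EXEC_TIMEOUT=5m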

@ivan-cai
Contributor Author

ivan-cai commented Nov 19, 2024

(Quoting my earlier comment: the repo-server hangs because a git fetch hangs while its goroutine holds the mutex; the full goroutine trace is above.)

The root cause is that the SSH command (started by git) is stuck, which leaves the goroutine holding the lock in the repo-server stuck as well. Other goroutines in the repo-server cannot obtain the lock to continue generating manifests, which eventually causes Application refreshes to get stuck.
The call chain is app-controller --> repo-server: GenerateManifest --> runRepoOperation --> checkoutRevision --> Fetch --> runCredentialedCmd --> runCmdOutput --> RunWithExecRunOpts --> RunCommandExt, stuck on:

		if timeoutBehavior.ShouldWait {
			// stuck here
			<-done
		}

There are several possible solutions:

  1. Optimize the git server.
  2. Use HTTP instead of SSH to connect to the git repo in the Argo CD Applications. But I have 3,000+ working Applications; converting them to HTTP may affect our online business deployments.
  3. Make Argo CD, as the git client, handle this scenario. Two options:
    3.1 The repo-server releases the lock after the timeout without waiting, but this may leak processes.
    3.2 After the timeout, the repo-server finds the corresponding SSH process and kills it (see the sketch below).
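
A minimal sketch of option 3.2 (my assumption, not Argo CD's actual implementation), assuming Linux: start the child in its own process group so that, on timeout, the whole group, including the ssh process git spawned, is killed instead of waiting on <-done forever.

    package gitexec

    import (
        "bytes"
        "fmt"
        "os/exec"
        "syscall"
        "time"
    )

    // runWithHardTimeout runs a command with a hard timeout. The child is
    // placed in its own process group, so descendants (e.g. the ssh that
    // git starts) are killed together with it when the timeout fires.
    func runWithHardTimeout(timeout time.Duration, name string, args ...string) ([]byte, error) {
        cmd := exec.Command(name, args...)
        cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}

        var out bytes.Buffer
        cmd.Stdout = &out
        cmd.Stderr = &out

        if err := cmd.Start(); err != nil {
            return nil, err
        }
        done := make(chan error, 1)
        go func() { done <- cmd.Wait() }()

        select {
        case err := <-done:
            return out.Bytes(), err
        case <-time.After(timeout):
            // A negative pid signals the whole process group (git and ssh).
            _ = syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
            <-done // reap the child so it does not linger as a zombie
            return out.Bytes(), fmt.Errorf("%s timed out after %s", name, timeout)
        }
    }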

@alexmt @crenshaw-dev @jessesuen Please share your suggestions.

  • ps -ef in the stuck repo-server:
    [screenshot: ps -ef output]

  • ssh command: ssh -i /dev/shm/2884236167 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o SendEnv=GIT_PROTOCOL git@git.xxxx.cn git-upload-pack 'abc/abc-gitops.git'

@ivan-cai changed the title from "Application stuck in refreshing" to "Applications are stuck in refreshing" Nov 20, 2024
@aleks-andr

aleks-andr commented Dec 11, 2024

I'm observing very similar behaviour on our Argo instance, which started seemingly out of nowhere last week. For context, we're running a single instance (v2.13.1) with around 1.5k applications, across ~10 target clusters.

Ever since the issue started our ArgoCD logs have had a steady stream of log entries for:

`git fetch origin --tags --force --prune` failed timeout after 1m30s
failed to acquire lock for referenced source <git_repo_url>

and occasionally you can catch this in the UI, as it shows up as an application reconciliation error. This often causes the apps to take much longer than normal to reach a Synced state after a change in Git.

There's some very peculiar behaviour in the metrics for repo-server Pods that correspond to the start of this issue (see picture):
CPU usage and network traffic fell off a cliff, but memory usage is suddenly much higher than usual. Another metric also shows that the average number of active Go threads went from ~30 to ~50-55.

[screenshot: repo-server metrics]

I ran repo-server with debug-level logs for a bit, and it looks like the average execution time for the git fetch is around 1 second (at least for us). Just to be safe I also doubled the resource allocation for repo-server Pods but it had zero effect on this issue. I also considered a potential network issue, but the NAT that handles outgoing connections for our ArgoCD instance appears to be behaving totally fine (and other services behind the same NAT are also working fine).

Similar to the comment above me, I also managed to catch the long-running process using ps -ef, but in my case it's git fetch itself, nothing to do with ssh. You can see the ps command was executed at 10:52, and git fetch was started at 10:50. The Git process disappeared a few seconds later, presumably hitting its timeout of 90s.

$ ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
argocd         1       0  0 10:42 ?        00:00:00 /usr/bin/tini -- /usr/local/bin/argocd-repo-server --port=8081 --metrics-port=8084
argocd         7       1  4 10:42 ?        00:00:28 /usr/local/bin/argocd-repo-server --port=8081 --metrics-port=8084
argocd        16       1  0 10:42 ?        00:00:00 gpg-agent --homedir /app/config/gpg/keys --use-standard-socket --daemon
argocd       788       7  1 10:50 ?        00:00:00 git fetch origin --tags --force --prune
argocd       789     788  0 10:50 ?        00:00:00 [git] <defunct>
argocd       808     788 21 10:50 ?        00:00:19 /usr/lib/git-core/git index-pack --stdin --fix-thin --keep=fetch-pack 788 on argocd-repo-server-7786b86649-z77lg --pack_header=2,8
argocd       816       0  0 10:51 pts/0    00:00:00 bash
argocd       828     816  0 10:52 pts/0    00:00:00 ps -ef

I'm at a bit of a loss on how to debug this further 😞 The issue appeared early in the morning, when I know that no changes were made to either the underlying infrastructure or our GitOps repo (I double-checked this).
Would a possible workaround be to decrease the timeout value to something relatively short (10 sec) and configure ArgoCD to retry the git fetch a couple of times on failure? I seem to remember there being env vars to achieve something like this (example below).
Curious to hear if anyone has ideas or possible other workarounds 🙏
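
For reference, the env vars I had in mind, on the repo-server Deployment (assuming they behave as I remember; both appear elsewhere in this thread):

    env:
      - name: ARGOCD_EXEC_TIMEOUT        # hard timeout for exec'd commands such as git fetch
        value: "30s"
      - name: ARGOCD_GIT_ATTEMPTS_COUNT  # retry count for failed git requests
        value: "3"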

@andrii-korotkov-verkada
Contributor

I suggest increasing the exec timeout via env variable to 3m or even 5m.

@crenshaw-dev
Member

Sounds like the key question is "why isn't the exec timeout killing the stuck git command?"

@aleks-andr

I suggest increasing the exec timeout via env variable to 3m or even 5m.

We raised our timeout value to 3 minutes and it seemed to fix the problem. Thanks for the suggestion 🙏

I guess for me the key question would be: what makes the git fetch take so long all of a sudden? In our case the issue appeared out of nowhere outside working hours when no changes were made, and it was consistently broken for a few days. So it's not even a case of random slowness 🤔

@dee-kryvenko

I think there are two separate issues. I am not observing the repo-server hang, and I am not using the SSH protocol for Git. I am observing app controllers hang, though, and /clusters stops reporting statuses. Instead of a Successful status next to the cluster there is nothing, no errors. Nothing in the logs. Restarting the app controllers helps for a couple of hours. I think it has been happening for a while now, hard to tell when it started, but I am running v2.13.3 at the moment. When that happens, workqueue_depth on the affected app controller goes to 0, but probes all pass. CPU usage drops to next to nothing; app count, resource count, API resource count, and events count also drop to next to nothing on the affected controller. Feels like a deadlock or something.

@dee-kryvenko

What's interesting is that multiple controllers fall out at the same time, yet not all of them, and not necessarily the most loaded, as seen in this chart:
[screenshot: per-controller load chart]

@dee-kryvenko

Happened again. Digging through the logs, I've found one more clue: every time a controller enters this broken state, there are thousands of errors in the logs like the one below:

E0113 13:52:22.160626       7 retrywatcher.go:131] "Watch failed" err="context canceled"

The pod with the controller was not restarted/rescheduled and continues to run. It looks to me like the threads with the watchers for certain clusters fail and are not restarted.

@dee-kryvenko

I think the actual issue I am running into is #15464; there are other (now "resolved") related issues that I tried to sum up in this comment: #15464 (comment)

@vkg23

vkg23 commented Jan 23, 2025

@ivan-cai, may I know how you got these git error details? I am going through the same issue (exactly the same: one repo-server misbehaves with its goroutines, then everything comes to a standstill).

The biggest concern is that there is nothing in the logs; CPU and memory usage, sync, and reconciliation all come down to almost zero.

I am using v2.10.7 and have been struggling with this for some time, as there are no helpful debug logs.
EXEC TIMEOUT was set to 120s, and Git is not over SSH.

@jbartyze-rh

jbartyze-rh commented Jan 23, 2025

Hard to say how helpful this will be for everyone, but I wanted to share the experience of my current customer, who ran into the applications-stuck-refreshing issue.

ArgoCD - v2.11.6+089247d
Operator - openshift-gitops-operator - 1.13.1

We did a lot of tuning on the application-controller and repo-server, but the most impactful change was moving the repo-server manifest cache to an ephemeral in-memory volume and adding an env variable for the repo-server to consume that volume. This massively sped up their ArgoCD sync processes, and we stopped having any bottlenecks on the repo-server side.

Workqueue depth behaviour improved massively after that change (no longer stuck and unprocessed). Previously they had to constantly restart their ArgoCD to unblock it; now it is just restarted every 3h in case this bug happens: #14224. The relevant repo-server snippet:

    env:
      - name: TMPDIR
        value: "/cache"
    volumeMounts:
      - mountPath: /cache
        name: cache
    volumes:
      - name: cache
        emptyDir:
          medium: "Memory"
          sizeLimit: 3Gi

@ivan-cai
Contributor Author

@vkg23 See my earlier comment above for the full goroutine trace and root-cause analysis.

@vkg23

vkg23 commented Jan 24, 2025

@ivan-cai, I am not using Git over SSH, only HTTP, for all 5,000 apps, but I believe the issue with the git client is more or less the same.

ArgoCD: the issue appeared in 2.10.x and still exists in 2.13.2, which I upgraded to today.

@jbartyze-rh, I tried the ephemeral volume, but the issue returned after a short while. May I know how you captured the stack trace you attached? Besides the git client, is there a way to figure out which repo is probably contributing to it?

I have already tried tuning the git timeout, EXEC timeout, and other params.

@jbartyze-rh

jbartyze-rh commented Feb 14, 2025

@vkg23 here is my whole ArgoCD CR. There is a lot of tuning in it, and it handles around 850 apps at the moment.

ArgoCD version v2.11.6+089247d

On the ArgoCD application side we also use argocd.argoproj.io/manifest-generate-paths: . to lessen the burden on the repo-server in our monorepo setup; a sketch follows below.
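
A minimal sketch of that annotation on an Application (the app name is hypothetical):

    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: example-app   # hypothetical
      annotations:
        # Only changes under the app's source path trigger manifest
        # regeneration on webhook events, easing repo-server load in a monorepo.
        argocd.argoproj.io/manifest-generate-paths: .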

We are not using any CPU limits, to avoid CPU throttling of the ArgoCD components. Redis compression and jitter are enabled to lower the peaks, plus some timeout extensions as well.

Sadly, I am not able to help with the stack trace; I think that question was directed at @ivan-cai.

apiVersion: argoproj.io/v1beta1
kind: ArgoCD
metadata:
  annotations:
    argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true
    argocd.argoproj.io/sync-wave: '10'
  name: acme-platform-argo
  namespace: argocd-mgmt
  finalizers:
    - argoproj.io/finalizer
spec:
  server:
    resources:
      limits:
        memory: 2Gi
      requests:
        cpu: 125m
        memory: 128Mi
    host: acme-platform-argo-server-argocd-mgmt.apps.c30.acme.acme.com
    route:
      annotations:
        haproxy.router.openshift.io/timeout: 60s
      enabled: true
      path: /
      tls:
        insecureEdgeTerminationPolicy: Redirect
        termination: reencrypt
      wildcardPolicy: None
    env:
      - name: ARGOCD_SERVER_ENABLE_GZIP
        value: 'true'
      - name: REDIS_COMPRESSION
        valueFrom:
          configMapKeyRef:
            key: redis.compression
            name: argocd-cm
            optional: true
    ingress:
      enabled: false
    service:
      type: ''
    extraCommandArgs:
      - '--enable-gzip'
    autoscale:
      enabled: false
    replicas: 3
    grpc:
      ingress:
        enabled: false
  grafana:
    enabled: false
    ingress:
      enabled: false
    route:
      enabled: false
  monitoring:
    enabled: false
  notifications:
    enabled: false
  prometheus:
    enabled: false
    ingress:
      enabled: false
    route:
      enabled: false
  initialSSHKnownHosts: {}
  sso:
    dex:
      openShiftOAuth: true
      resources:
        limits:
          memory: 1Gi
        requests:
          cpu: 250m
          memory: 128Mi
    provider: dex
  applicationSet:
    resources:
      limits:
        memory: 1Gi
      requests:
        cpu: 250m
        memory: 512Mi
    webhookServer:
      ingress:
        enabled: false
      route:
        enabled: false
  rbac:
    policy: |
      g, system:acme, role:admin
      g, acme, role:admin
      g, acme-grp, role:admin
    scopes: '[groups]'
  extraConfig:
    redis.compression: gzip
    resource.customizations.ignoreResourceUpdates.all: |
      jsonPointers:
      - /status
    resource.ignoreResourceUpdatesEnabled: 'true'
    resource.respectRBAC: strict
  repo:
    initContainers:
      - command:
          - sh
          - '-c'
          - cp /usr/local/bin/argocd-vault-plugin /custom-tools/
        image: 'acme.acme.com:18449/acme/acme/argocd-vault-plugin@sha256:e68bd003cf806342289e10821f41b7c1ff93ed3e66f5c7ea4f5b04e8e841b92b'
        name: download-tools
        resources: {}
        volumeMounts:
          - mountPath: /custom-tools
            name: custom-tools
    resources:
      limits:
        memory: 10Gi
      requests:
        cpu: 1200m
        memory: 1Gi
    extraRepoCommandArgs:
      - '--loglevel warn'
      - '--default-cache-expiration 1h'
      - '--repo-cache-expiration 20m'
    env:
      - name: ARGOCD_HELM_ALLOW_CONCURRENCY
        value: 'true'
      - name: ARGOCD_REPO_SERVER_PARALLELISM_LIMIT
        value: '50'
      - name: TMPDIR
        value: /cache
      - name: ARGOCD_GIT_ATTEMPTS_COUNT
        value: '3'
      - name: ARGOCD_EXEC_TIMEOUT
        value: 3m
      - name: REDIS_COMPRESSION
        valueFrom:
          configMapKeyRef:
            key: redis.compression
            name: argocd-cm
            optional: true
    sidecarContainers:
      - args:
          - '--loglevel'
          - warn
        command:
          - /var/run/argocd/argocd-cmp-server
        env:
          - name: ARGOCD_EXEC_TIMEOUT
            value: 3m
        image: 'acme.acme.com:18449/redhat/ubi9@sha256:204383c3d96c0e6c7154c91d07764f92035738dd67aa8896679f7feb73f66bfd'
        name: avp
        resources: {}
        securityContext:
          runAsNonRoot: true
        volumeMounts:
          - mountPath: /var/run/argocd
            name: var-files
          - mountPath: /home/argocd/cmp-server/plugins
            name: plugins
          - mountPath: /home/argocd/cmp-server/config/plugin.yaml
            name: cmp-plugin-configmap-vol
            subPath: avp.yaml
          - mountPath: /usr/local/bin/argocd-vault-plugin
            name: custom-tools
            subPath: argocd-vault-plugin
          - mountPath: /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
            name: cluster-root-ca-bundle
            subPath: ca-bundle.crt
    mountsatoken: true
    volumeMounts:
      - mountPath: /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
        name: cluster-root-ca-bundle
        subPath: ca-bundle.crt
      - mountPath: /cache
        name: cache
    serviceaccount: vplugin
    volumes:
      - configMap:
          name: cluster-root-ca-bundle
        name: cluster-root-ca-bundle
      - configMap:
          name: cmp-plugin-configmap
        name: cmp-plugin-configmap-vol
      - emptyDir: {}
        name: custom-tools
      - emptyDir:
          medium: Memory
          sizeLimit: 3Gi
        name: cache
    replicas: 3
  resourceExclusions: |
    - apiGroups:
      - tekton.dev
      clusters:
      - '*'
      kinds:
      - TaskRun
      - PipelineRun
  resourceHealthChecks:
    - check: |
        health_status = {}
        if obj.status ~= nil then
          if obj.status.conditions ~= nil then
            numDegraded = 0
            numPending = 0
            msg = ""
            for i, condition in pairs(obj.status.conditions) do
              msg = msg .. i .. ": " .. condition.type .. " | " .. condition.status .. "\n"
              if condition.type == "InstallPlanPending" and condition.status == "True" then
                numPending = numPending + 1
              elseif (condition.type == "InstallPlanMissing" and condition.reason ~= "ReferencedInstallPlanNotFound") then
                numDegraded = numDegraded + 1
              elseif (condition.type == "CatalogSourcesUnhealthy" or condition.type == "InstallPlanFailed") and condition.status == "True" then
                numDegraded = numDegraded + 1
              elseif (condition.type == "CatalogSourcesUnhealthy" and condition.status == "False") then
                break
              end
            end
            if numDegraded == 0 and numPending == 0 then
              health_status.status = "Healthy"
              health_status.message = msg
              return health_status
            elseif numPending > 0 and numDegraded == 0 then
              health_status.status = "Progressing"
              health_status.message = "An install plan for a subscription is pending installation"
              return health_status
            else
              health_status.status = "Degraded"
              health_status.message = msg
              return health_status
            end
          end
        end
        health_status.status = "Progressing"
        health_status.message = "An install plan for a subscription is pending installation"
        return health_status
      group: operators.coreos.com
      kind: Subscription
  ha:
    enabled: false
    resources:
      limits:
        memory: 256Mi
      requests:
        cpu: 250m
        memory: 128Mi
  tls:
    ca:
      configMapName: cluster-root-ca-bundle
  redis:
    resources:
      limits:
        memory: 2Gi
      requests:
        cpu: 250m
        memory: 128Mi
  controller:
    env:
      - name: ARGOCD_CONTROLLER_REPLICAS
        value: '5'
      - name: ARGOCD_RECONCILIATION_JITTER
        value: 3m
      - name: ARGOCD_APPLICATION_CONTROLLER_REPO_SERVER_TIMEOUT_SECONDS
        value: '120'
      - name: ARGOCD_K8S_CLIENT_QPS
        value: '150'
      - name: ARGOCD_K8S_CLIENT_BURST
        value: '300'
      - name: ARGOCD_K8S_TCP_TIMEOUT
        value: 60s
      - name: REDIS_COMPRESSION
        valueFrom:
          configMapKeyRef:
            key: redis.compression
            name: argocd-cm
            optional: true
    logLevel: warn
    processors:
      operation: 25
      status: 50
    resources:
      limits:
        memory: 20Gi
      requests:
        cpu: 900m
        memory: 2Gi
    sharding:
      enabled: true
      replicas: 5

