Failed to sync cluster #1006

Closed
adieu opened this issue Jan 11, 2019 · 12 comments

Comments


adieu commented Jan 11, 2019

I upgraded our Argo CD instance to 0.11.0 today. The upgrade was smooth overall, but we hit one issue that did not exist in 0.10.5.

Argo CD runs in one of our k8s clusters and manages applications in other k8s clusters.

One cluster failed to sync and here is the log:

time="2019-01-11T09:13:11Z" level=info msg="Start syncing cluster" server="https://10.0.0.1:6443"
time="2019-01-11T09:14:14Z" level=error msg="Failed to sync cluster https://10.0.0.1:6443: the server was unable to return a response in the time allotted, but may still be processing the request"

It looks like the sync timed out after about 1 minute, which is plausible because the two clusters are not in the same location.

We didn't have this issue before the latest release. I tried to identify the request that timed out, but couldn't.

I saw this log on the apiserver of the failed cluster:

I0111 09:38:04.687179       1 controller.go:105] OpenAPI AggregationController: Processing item v1alpha1.repositories.stash.appscode.com
I0111 09:38:04.724927       1 controller.go:116] OpenAPI AggregationController: action for item v1alpha1.repositories.stash.appscode.com: Requeue.

Not sure if it's related.

Member

jessesuen commented Jan 11, 2019

@adieu does the other cluster have any API resource that extends k8s using API server aggregation (instead of CRDs), where the deployment that is supposed to handle that API resource is down?

An example of such a resource is Service Catalog. We have seen cases where, if the deployment that backs Service Catalog is down, it causes problems for Argo CD. The same may be happening with v1alpha1.repositories.stash.appscode.com.

Member

jessesuen commented Jan 11, 2019

This was our previous issue with ServiceCatalog: #650

One way to test is to run kubectl get <problematic-custom-resource>. If this is slow or fails, then it could cause problems for Argo CD.

Member

jessesuen commented Jan 11, 2019

It will also help to enable Kubernetes-related logs in the argocd-application-controller by adding --gloglevel 6, to see which API request may be blocked.

Author

adieu commented Jan 13, 2019

@jessesuen Thank you for your reply. I tested the repositories resource from Stash and it returned normally. I'll enable verbose logging on argocd-application-controller to check whether any request is blocked.

Member

jessesuen commented Jan 13, 2019

OK, it may be a different resource. Can you share the output of:

kubectl get apiservice
kubectl get crd
kubectl api-resources

Running kubectl get for each item listed by kubectl api-resources should return in a reasonable time.
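
The check described above can be scripted. The following is a minimal sketch, not something posted in the thread: `check_all_resources` is a hypothetical helper name and the 30s default timeout is an arbitrary assumption. It lists every resource type that supports the list verb and tries a cluster-wide get on each one, printing whether the request succeeded and roughly how long it took.

```shell
# Sketch: try a cluster-wide "kubectl get" on every listable resource type.
# check_all_resources is a hypothetical name; the 30s default is an assumption.
check_all_resources() {
  local timeout="${1:-30s}" resource start elapsed
  kubectl api-resources --verbs=list -o name | while read -r resource; do
    start=$(date +%s)
    # </dev/null keeps kubectl from consuming the resource list on stdin.
    if kubectl get "$resource" --all-namespaces --request-timeout="$timeout" \
        >/dev/null 2>&1 </dev/null; then
      elapsed=$(( $(date +%s) - start ))
      echo "ok ${elapsed}s $resource"
    else
      elapsed=$(( $(date +%s) - start ))
      echo "FAILED ${elapsed}s $resource"
    fi
  done
}
```

Calling `check_all_resources 60s` against the affected cluster and looking for `FAILED` lines, or lines with a large elapsed time, should point at the resource type that blocks the sync. A failure can also have other causes (for example missing permissions), so a `FAILED` line is a hint rather than proof.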

Author

adieu commented Jan 15, 2019

I can confirm that some requests time out, but I don't yet know which resource causes the problem. I'll try the CRD resources one by one.

I0115 15:25:59.852076       1 round_trippers.go:408] Response Status: 504 Gateway Timeout in 60166 milliseconds
time="2019-01-15T15:25:59Z" level=error msg="Failed to sync cluster https://10.0.0.1:6443: the server was unable to return a response in the time allotted, but may still be processing the request"

EDITED:

I can confirm that API server aggregation caused the problem. The backend service is not down, but the request times out when we have lots of snapshots. Maybe this resource needs to be filtered out.

~$ kubectl get snapshot --all-namespaces
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get snapshots.repositories.stash.appscode.com)

@jessesuen
Member

#1018

This fix may help. We also want to add a way to exclude resource kinds from our watch.

@jessesuen
Member

@adieu you may want to try argoproj/argocd:latest, which includes the fix that makes API resource discovery best effort when it fails for a particular API group.

Author

adieu commented Jan 16, 2019

Sadly, the latest image does not help. I guess that's because the backend server is running and responds to discovery requests correctly. The get request is slow, but it does not time out when getting resources from a single namespace. The timeout only occurs when getting all snapshot resources at once. A temporary fix might be to delete some old snapshots to make the request faster.
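
Since per-namespace requests still succeed, they can be used to find out where the snapshots pile up before deleting anything. This is a minimal sketch under that assumption; `count_per_namespace` is a hypothetical helper name, not part of kubectl or Argo CD.

```shell
# Sketch: count objects of one resource type namespace by namespace,
# avoiding the cluster-wide list that times out.
# count_per_namespace is a hypothetical helper name.
count_per_namespace() {
  local resource="$1" ns count
  kubectl get namespaces -o name | while read -r ns; do
    ns="${ns#namespace/}"   # "namespace/default" -> "default"
    count=$(kubectl get "$resource" -n "$ns" -o name </dev/null 2>/dev/null \
      | wc -l | tr -d ' ')
    echo "$ns $count"
  done
}
```

For example, `count_per_namespace snapshots | sort -k2 -n -r | head` would show the namespaces holding the most snapshots, which are the likely candidates for cleanup.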

@jessesuen
Member

OK. Then we need to implement resource exclusion (issue #1010).
I assume that snapshot is not a managed resource that you need Argo CD to track/view in the UI or deploy from Git?

Author

adieu commented Jan 22, 2019

No, I don't need the snapshot resource. I managed to work around the problem by patching the code to exclude the snapshots resource.

Collaborator

alexmt commented Mar 15, 2019

The resource exclusion feature was implemented.
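
For readers landing here with the same problem, the sketch below shows roughly what an exclusion for the Stash snapshots could look like. It relies on several assumptions not stated in this thread: that your Argo CD release supports the `resource.exclusions` setting in the `argocd-cm` ConfigMap, that Argo CD is installed in the `argocd` namespace, and that the kind is named `Snapshot`. The function name `print_snapshot_exclusion` is hypothetical, and the cluster URL is the one from the logs above. Check the Argo CD documentation for your version before using it.

```shell
# Sketch: print an argocd-cm ConfigMap that excludes Stash snapshots from
# Argo CD's discovery and watch for one cluster.
# Assumptions: "argocd" namespace, kind name "Snapshot", and a release that
# supports resource.exclusions. print_snapshot_exclusion is a hypothetical name.
print_snapshot_exclusion() {
  cat <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.exclusions: |
    - apiGroups:
      - repositories.stash.appscode.com
      kinds:
      - Snapshot
      clusters:
      - https://10.0.0.1:6443
EOF
}
```

The function only prints the manifest. Review the output and merge the `resource.exclusions` entry into your existing `argocd-cm` rather than applying it blindly, since that ConfigMap usually carries other settings.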

alexmt closed this as completed Mar 15, 2019