Failed to sync cluster #1006

adieu · 2019-01-11T09:54:08Z

I upgraded our Argo CD instance to 0.11.0 today. The upgrade was smooth overall, but we hit one issue that did not exist in 0.10.5.

Argo CD runs in one of our k8s clusters and manages applications in other k8s clusters.

One cluster failed to sync and here is the log:

time="2019-01-11T09:13:11Z" level=info msg="Start syncing cluster" server="https://10.0.0.1:6443"
time="2019-01-11T09:14:14Z" level=error msg="Failed to sync cluster https://10.0.0.1:6443: the server was unable to return a response in the time allotted, but may still be processing the request"

It looks like the sync timed out after about one minute, which is plausible because the two clusters are not in the same location.

We didn't have this issue before the latest release. I tried to identify which request timed out but could not.

I saw this log on the apiserver of the failed cluster:

I0111 09:38:04.687179       1 controller.go:105] OpenAPI AggregationController: Processing item v1alpha1.repositories.stash.appscode.com
I0111 09:38:04.724927       1 controller.go:116] OpenAPI AggregationController: action for item v1alpha1.repositories.stash.appscode.com: Requeue.

Not sure if it's related.

jessesuen · 2019-01-11T19:27:00Z

@adieu does the other cluster have any API resource that extends k8s through API server aggregation (instead of CRDs), where the deployment that is supposed to serve that resource is down?

An example of such a resource is Service Catalog. We have seen that when the deployment backing Service Catalog is down, it causes problems for Argo CD. The same may be true for v1alpha1.repositories.stash.appscode.com.

jessesuen · 2019-01-11T19:31:01Z

This was our previous issue with ServiceCatalog: #650

One way to test is to run kubectl get <problematic-custom-resource>. If this is slow or fails, then it could cause problems for Argo CD.

jessesuen · 2019-01-11T19:34:10Z

Also it will help to enable kubernetes related logs in the argocd-application-controller by adding: --gloglevel 6 to see which API request may be blocked.

adieu · 2019-01-13T13:47:23Z

@jessesuen Thank you for your reply. I tested the repositories resource from Stash and it returned normally. I'll enable verbose logging on argocd-application-controller to check whether any request is blocked.

jessesuen · 2019-01-13T19:49:39Z

OK, it may be a different resource. Can you share:

kubectl get apiservice
kubectl get crd
kubectl api-resources

A kubectl get for each item listed by kubectl api-resources should return in a reasonable time.
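
As a rough way to run that check, the sketch below lists every listable resource cluster-wide and prints how long each request took. slow_resources is a hypothetical helper name, and the 30s default timeout is an arbitrary choice, not a value Argo CD uses.

```shell
#!/bin/sh
# Sketch: time a cluster-wide list of every listable API resource.
# slow_resources is a hypothetical helper; the default timeout is illustrative.
slow_resources() {
  timeout="${1:-30s}"   # per-request timeout handed to kubectl
  # --verbs=list skips resources that cannot be listed at all.
  kubectl api-resources --verbs=list -o name | while read -r resource; do
    start=$(date +%s)
    # </dev/null keeps kubectl from consuming the resource list on stdin.
    if kubectl get "$resource" --all-namespaces \
         --request-timeout="$timeout" -o name >/dev/null 2>&1 </dev/null; then
      status=ok
    else
      status=FAILED
    fi
    end=$(date +%s)
    # One line per resource: status, elapsed seconds, resource name.
    echo "$status $((end - start))s $resource"
  done
}
```

Run slow_resources 60s with the affected cluster's context selected. Lines marked FAILED, or with a duration close to the timeout, are the ones worth checking by hand.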

adieu · 2019-01-15T15:49:41Z

I can confirm that some requests time out, but I don't yet know which resource causes the problem. I'll try the CRD resources one by one.

I0115 15:25:59.852076       1 round_trippers.go:408] Response Status: 504 Gateway Timeout in 60166 milliseconds
time="2019-01-15T15:25:59Z" level=error msg="Failed to sync cluster https://10.0.0.1:6443: the server was unable to return a response in the time allotted, but may still be processing the request"

EDITED:

I can confirm that API server aggregation caused the problem. The backend service is not down, but it times out when we have lots of snapshots. Maybe this resource needs to be filtered out.

~$ kubectl get snapshot --all-namespaces
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get snapshots.repositories.stash.appscode.com)

jessesuen · 2019-01-15T18:26:18Z

#1018

This fix may help. We also want to add a way to exclude resource kinds from our watch.

jessesuen · 2019-01-16T02:06:48Z

@adieu you may want to try argoproj/argocd:latest, which includes a fix that makes API resource discovery best-effort when it fails for a particular API group.

adieu · 2019-01-16T04:47:36Z

Sadly, the latest image does not help. I guess that's because the backend server is up and responds to discovery requests correctly. The get request is slow, but it does not time out when getting resources from a single namespace. The timeout only occurs when getting all snapshot resources at once. A temporary fix might be to delete some old snapshots to make the request faster.

jessesuen · 2019-01-16T05:03:08Z

Ok. Then we need to implement resource exclusion (issue #1010).
I assume that snapshot is not a managed resource that you need Argo CD to track/view in UI or deploy from git?

adieu · 2019-01-22T06:50:22Z

No, I don't need the snapshot resource. I managed to work around the problem by patching the code to exclude the snapshots resource.

alexmt · 2019-03-15T04:40:48Z

The resource exclusion feature has been implemented.
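
For readers landing here later: as far as I know, the exclusion is configured through a resource.exclusions key in the argocd-cm ConfigMap. Treat the key name and field layout below as assumptions and check them against the documentation for your Argo CD version. exclusion_patch is a hypothetical helper that only prints a patch for one API group and kind; it does not touch the cluster.

```shell
#!/bin/sh
# Sketch: print a ConfigMap patch that excludes one API group/kind from Argo CD.
# Assumption: the setting lives under data."resource.exclusions" in argocd-cm
# and takes a YAML list of apiGroups / kinds / clusters entries.
exclusion_patch() {
  group="$1"   # e.g. repositories.stash.appscode.com
  kind="$2"    # e.g. Snapshot
  cat <<EOF
data:
  resource.exclusions: |
    - apiGroups:
      - "$group"
      kinds:
      - "$kind"
      clusters:
      - "*"
EOF
}
```

For the case in this thread the output could be applied with kubectl -n argocd patch configmap argocd-cm --type merge --patch "$(exclusion_patch repositories.stash.appscode.com Snapshot)", assuming Argo CD is installed in the argocd namespace. Note that a merge patch overwrites any existing resource.exclusions value rather than appending to it.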

alexmt mentioned this issue Jan 14, 2019

Support resource filtering in Application controller #1010

Closed

alexmt closed this as completed Mar 15, 2019
