Flux chokes on empty Custom Metrics GroupVersion #1951

Closed · dvelitchkov opened this issue Apr 17, 2019 · 8 comments · Fixed by #1957
@dvelitchkov commented Apr 17, 2019

Running into a strange problem.

I have a release that is managed by Flux+Helm. So far so good. I recently pushed an updated Docker image, and now Flux can't deploy it.

What's really concerning is that Flux pushes a commit to the GitOps repo saying the new release was rolled out, even though the operation clearly fails.

Running fluxctl sync:

fluxctl sync
Synchronizing with <redacted>
HEAD of master is ed3ba29
Waiting for ed3ba29 to be applied ...
Error: timeout

Some of the errors I see are "collating resources in cluster for sync: not found" and "couldn't get resource list for custom.metrics.k8s.io/v1beta1: <nil>". Not super helpful.

Log:

2019-04-17T16:18:07.202427131Z ts=2019-04-17T16:18:07.202281634Z caller=loop.go:111 component=sync-loop jobID=fb24233b-779c-3020-99db-647abc155bac state=in-progress
2019-04-17T16:18:09.15816782Z ts=2019-04-17T16:18:09.158030323Z caller=loop.go:123 component=sync-loop jobID=fb24233b-779c-3020-99db-647abc155bac state=done success=true
2019-04-17T16:18:11.084675671Z ts=2019-04-17T16:18:11.084497151Z caller=loop.go:103 component=sync-loop event=refreshed url=<redacted> branch=master HEAD=ed3ba298cdc026dbb635eaf731548395271bb233
2019-04-17T16:18:31.21213776Z ts=2019-04-17T16:18:31.211961076Z caller=images.go:18 component=sync-loop msg="polling images"
2019-04-17T16:18:31.335243305Z ts=2019-04-17T16:18:31.335087864Z caller=main.go:199 type="internal kubernetes error" err="couldn't get resource list for custom.metrics.k8s.io/v1beta1: <nil>"
2019-04-17T16:18:31.385604796Z ts=2019-04-17T16:18:31.385439156Z caller=images.go:109 component=sync-loop workload=staging:helmrelease/decompress container=chart-image repo=<redacted> pattern=glob:staging* current=<redacted>:staging-v1.3.1-2 info="added update to automation run" new=<redacted>:staging-v1.3.1-3 reason="latest staging-v1.3.1-3 (2019-04-16 16:47:58.788602027 +0000 UTC) > current staging-v1.3.1-2 (2019-04-12 23:52:13.588390498 +0000 UTC)"
2019-04-17T16:18:31.385627972Z ts=2019-04-17T16:18:31.385537557Z caller=loop.go:111 component=sync-loop jobID=cf4c96e4-54c6-0777-b868-9ef41c8180ee state=in-progress
2019-04-17T16:18:31.484304234Z ts=2019-04-17T16:18:31.484182549Z caller=releaser.go:58 component=sync-loop jobID=cf4c96e4-54c6-0777-b868-9ef41c8180ee type=release updates=0
2019-04-17T16:18:31.484331638Z ts=2019-04-17T16:18:31.484218832Z caller=releaser.go:60 component=sync-loop jobID=cf4c96e4-54c6-0777-b868-9ef41c8180ee type=release exit="no images to update for services given"
2019-04-17T16:18:31.523865111Z ts=2019-04-17T16:18:31.523740767Z caller=loop.go:121 component=sync-loop jobID=cf4c96e4-54c6-0777-b868-9ef41c8180ee state=done success=false err="no changes made in repo"
2019-04-17T16:18:33.065109033Z ts=2019-04-17T16:18:33.064978047Z caller=loop.go:103 component=sync-loop event=refreshed url=<redacted> branch=master HEAD=ed3ba298cdc026dbb635eaf731548395271bb233
2019-04-17T16:18:37.929537537Z ts=2019-04-17T16:18:37.929414905Z caller=loop.go:210 component=sync-loop err="collating resources in cluster for sync: not found"
2019-04-17T16:18:37.969697408Z ts=2019-04-17T16:18:37.969557523Z caller=loop.go:90 component=sync-loop err="collating resources in cluster for sync: not found"

The image I want rolled out is staging-v1.3.1-3; the currently running one is staging-v1.3.1-2.

The upgrade from v1.3.1-1 to v1.3.1-2, which was done with an older release of Flux, went off without a hitch.

I am now running the latest release of Flux: docker.io/weaveworks/flux:1.12.0 and docker.io/weaveworks/helm-operator:0.8.0.

Any ideas?

@2opremio (Contributor) commented Apr 18, 2019

Thanks for the detailed report and sorry for the inconvenience.

I think that the underlying issue is that the GroupVersion custom.metrics.k8s.io/v1beta1 doesn't exist in the cluster yet and some of your resources in Git are using that GroupVersion.

Was custom.metrics.k8s.io/v1beta1 going to be provided by a CRD (or implicitly by a Helm chart) defined in your repo? (I took a quick look at https://github.com/kubernetes-incubator/custom-metrics-apiserver, but I am no expert.) If that was the case, I think you may be experiencing #1941, which is fixed in master; you can give it a try with docker.io/weaveworks/flux-prerelease:master-74b133af.

> What's really concerning is that Flux pushes a commit to the GitOps repo saying the new release was rolled out, even though the operation clearly fails.

Pushing to git (release) and syncing (which is what fails) are separate operations. What was the exact message?

> Some of the errors I see are "collating resources in cluster for sync: not found" and "couldn't get resource list for custom.metrics.k8s.io/v1beta1: <nil>". Not super helpful.

I agree:

  1. "couldn't get resource list for custom.metrics.k8s.io/v1beta1: <nil>" is an internal error from Kubernetes and there is not much we can do about it (I don't really understand the Kubernetes approach of not exposing all errors through the API, but that's how it is). I can try to provide better context than caller=main.go:199 though (see the sketch after this list for reproducing what the client sees).

  2. "collating resources in cluster for sync: not found" should indeed be improved; it's pretty useless as is. I think it's caused by a cache miss due to custom.metrics.k8s.io/v1beta1 not existing. I'll try to improve it in a PR.
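For anyone debugging this: the raw discovery request the client makes can be reproduced outside Flux with client-go. A minimal sketch, assuming a local kubeconfig (the path and the GroupVersion are illustrative, not taken from Flux itself); it is the programmatic equivalent of kubectl get --raw:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (illustrative; in-cluster code would use
	// rest.InClusterConfig instead).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}

	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Equivalent of `kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1`:
	// fetch the raw APIResourceList that discovery parses for this GroupVersion.
	body, err := dc.RESTClient().Get().
		AbsPath("/apis/custom.metrics.k8s.io/v1beta1").
		DoRaw(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body))
}
```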

@2opremio (Contributor) commented Apr 18, 2019

> I think that the underlying issue is that the GroupVersion custom.metrics.k8s.io/v1beta1 doesn't exist in the cluster yet and some of your resources in Git are using that GroupVersion.

I stand corrected. After some extra information from @stefansedich (he was kind enough to reproduce the problem on master), we have better error reporting:

ts=2019-04-18T16:02:33.113902025Z caller=loop.go:90 component=sync-loop err="collating resources in cluster for sync: unable to retrieve the complete list of server APIs: external.metrics.k8s.io/v1beta1: Got empty response for: external.metrics.k8s.io/v1beta1"
$ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" 
{"kind":"APIResourceList","apiVersion":"v1","groupVersion":"external.metrics.k8s.io/v1beta1","resources":[]}

I think the problem is that the Kubernetes Go client doesn't like empty GroupVersions. I don't know whether that's illegal, but I will check.

@stefansedich was using Datadog.

@dvelitchkov was that the case with you too?

@2opremio (Contributor)

Related: argoproj/argo-cd#524

2opremio changed the title on Apr 18, 2019: Flux can't roll out new release - timeout from cli, "not found" errors in logs → Flux chokes on empty Custom Metrics API
@2opremio (Contributor)

I will work on a patch to tolerate groups with no API resources.

Note, however, that the Kubernetes client assumes the invariant that no group is empty. So either the metrics API is breaking the invariant or the client is being too strict.
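For what it's worth, client-go reports per-group discovery failures through a dedicated error type, so a caller can skip the broken GroupVersions and keep everything that worked. A minimal sketch of that tolerance pattern (this is not the actual Flux patch; the kubeconfig path is illustrative):

```go
package main

import (
	"fmt"
	"log"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// ServerGroupsAndResources walks every registered GroupVersion; a single
	// broken group (such as an empty custom.metrics.k8s.io/v1beta1) makes the
	// whole call return an error, alongside the partial results.
	_, resources, err := dc.ServerGroupsAndResources()
	if err != nil {
		// ErrGroupDiscoveryFailed carries the per-GroupVersion failures, so we
		// can log the broken groups and carry on with the rest.
		if discovery.IsGroupDiscoveryFailedError(err) {
			for gv, gvErr := range err.(*discovery.ErrGroupDiscoveryFailed).Groups {
				log.Printf("skipping %s: %v", gv, gvErr)
			}
		} else {
			log.Fatal(err)
		}
	}
	fmt.Printf("discovered %d usable resource lists\n", len(resources))
}
```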

2opremio changed the title on Apr 18, 2019: Flux chokes on empty Custom Metrics API → Flux chokes on empty Custom Metrics GroupVersion
@dvelitchkov (Author)

@2opremio - sorry, not sure what you mean by "was that the case with you too"?

Now that I know where to look more closely, I did update a very old version of https://github.com/DirectXMan12/k8s-prometheus-adapter that we were running. So, currently:

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1                      
{"kind":"APIResourceList","apiVersion":"v1","groupVersion":"custom.metrics.k8s.io/v1beta1","resources":[]}

So I'm getting an empty resources array in the response, which may or may not be a problem in my cluster, but it seems like this is what the k8s client is choking on here, right?

@dvelitchkov (Author)

I sorted this out.
I was using prometheus-adapter and it was misconfigured: it wasn't pointing at the correct Prometheus URL, hence the empty array.

As soon as I fixed that, Flux started working!

So it's really not a bug in Flux per se, just misconfiguration -> empty array -> k8s API reports an error -> Flux can't sync.

Thank you guys for the quick response and for leading me down the right path.

@2opremio (Contributor) commented Apr 18, 2019

Yes, that's where the problem was. I am working on a fix to tolerate empty GroupVersions, although technically it is a misconfiguration, as you stated.

@2opremio (Contributor)

> not sure what you mean by "was that the case with you too"?

I meant whether you were using Datadog as well.
