This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

flux 1.11.0 no longer syncs without ClusterRole #1830

Closed
zeeZ opened this issue Mar 14, 2019 · 20 comments · Fixed by #1840

@zeeZ

zeeZ commented Mar 14, 2019

I run flux with explicit permissions, as limited as possible and with only a single namespaced Role and --k8s-namespace-whitelist set. After upgrading to 1.11.0 it no longer syncs unless it is able to list virtually everything in the cluster.

This is the ClusterRole I created from sync-loop errors before it was able to sync again. You can tell where I gave up:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    name: flux
  name: flux
rules:
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - componentstatuses
  - configmaps
  - endpoints
  - events
  - limitranges
  - namespaces
  - nodes
  - persistentvolumeclaims
  - persistentvolumes
  - pods
  - podtemplates
  - replicationcontrollers
  - "*"
  verbs:
  - list
- apiGroups:
  - apiregistration.k8s.io
  resources:
  - apiservices
  verbs:
  - list
- apiGroups:
  - extensions
  resources:
  - daemonsets
  - deployments
  - ingresses
  - networkpolicies
  - podsecuritypolicies
  - "*"
  verbs:
  - list
- apiGroups:
  - apps
  - events.k8s.io
  - autoscaling
  - batch
  - "*"
  resources:
  - "*"
  verbs:
  - list

The FAQ answers "Can I restrict the namespaces that Flux can see" with "yes, experimental". Sadly, this is no longer the case.

Also name dropping #1217 and #1471

@squaremo
Member

Curses, I did not intend this to be the case with #1442, though I admit I wasn't very diligent about trying out this scenario.

Where exactly does it come to a halt, when it's not given a ClusterRole? (what do the logs say?)

@2opremio
Contributor

#1830 , which should fix this, is complete but pending review

@zeeZ
Author

zeeZ commented Mar 14, 2019

Hey, thanks for the responses.

> Where exactly does it come to a halt, when it's not given a ClusterRole? (what do the logs say?)

Without ClusterRole:

ts=2019-03-14T13:39:48.868422318Z caller=main.go:165 version=1.11.0
ERROR: logging before flag.Parse: E0314 13:39:49.929945       8 reflector.go:205] github.com/weaveworks/flux/cluster/kubernetes/cached_disco.go:100: Failed to list *v1beta1.CustomResourceDefinition: customresourcedefinitions.apiextensions.k8s.io is forbidden: User "system:serviceaccount:flux:flux" cannot list resource "customresourcedefinitions" in API group "apiextensions.k8s.io" at the cluster scope
ts=2019-03-14T13:39:49.947370986Z caller=main.go:295 component=cluster identity=/etc/fluxd/ssh/identity
ts=2019-03-14T13:39:49.947449236Z caller=main.go:296 component=cluster identity.pub="ssh-rsa ..."
ts=2019-03-14T13:39:49.947527827Z caller=main.go:297 component=cluster host=https://10.3.0.1:443 version=kubernetes-v1.12.5
ts=2019-03-14T13:39:49.947616546Z caller=main.go:309 component=cluster kubectl=/usr/local/bin/kubectl
ts=2019-03-14T13:39:49.949160458Z caller=main.go:319 component=cluster ping=true
ERROR: logging before flag.Parse: E0314 13:39:50.932939       8 reflector.go:205] github.com/weaveworks/flux/cluster/kubernetes/cached_disco.go:100: Failed to list *v1beta1.CustomResourceDefinition: customresourcedefinitions.apiextensions.k8s.io is forbidden: User "system:serviceaccount:flux:flux" cannot list resource "customresourcedefinitions" in API group "apiextensions.k8s.io" at the cluster scope

The last line is spammed forever after.

After adding the first set of permissions, I updated the repo and tried to fluxctl sync:

ts=2019-03-14T13:44:07.898249713Z caller=checkpoint.go:24 component=checkpoint msg="up to date" latest=1.11.0
ts=2019-03-14T13:44:31.133643198Z caller=loop.go:103 component=sync-loop event=refreshed url=... branch=... HEAD=beb4159a14847c5d0b0e5d4cbeccb7f3d4da2766
ts=2019-03-14T13:44:31.247826109Z caller=loop.go:210 component=sync-loop err="collating resources in cluster for sync: componentstatuses is forbidden: User \"system:serviceaccount:flux:flux\" cannot list resource \"componentstatuses\" in API group \"\" at the cluster scope"
ts=2019-03-14T13:44:31.250451239Z caller=loop.go:90 component=sync-loop err="collating resources in cluster for sync: componentstatuses is forbidden: User \"system:serviceaccount:flux:flux\" cannot list resource \"componentstatuses\" in API group \"\" at the cluster scope"
ts=2019-03-14T13:45:08.121177099Z caller=warming.go:268 component=warmer info="refreshing image" image=... tag_count=207 to_update=1 of_which_refresh=1 of_which_missing=0
ts=2019-03-14T13:45:08.139291505Z caller=warming.go:364 component=warmer updated=... successful=1 attempted=1
ts=2019-03-14T13:49:07.446622744Z caller=images.go:17 component=sync-loop msg="polling images"
ts=2019-03-14T13:49:38.983850606Z caller=loop.go:103 component=sync-loop event=refreshed url=ssh://git@....git branch=... HEAD=beb4159a14847c5d0b0e5d4cbeccb7f3d4da2766
ts=2019-03-14T13:54:07.629381015Z caller=images.go:17 component=sync-loop msg="polling images"
ts=2019-03-14T13:54:44.119740704Z caller=loop.go:103 component=sync-loop event=refreshed url=ssh://git@....git branch=... HEAD=84970b52031752ec2790c20802a0f2419f6b4c84
ts=2019-03-14T13:54:44.336051836Z caller=loop.go:210 component=sync-loop err="collating resources in cluster for sync: configmaps is forbidden: User \"system:serviceaccount:flux:flux\" cannot list resource \"configmaps\" in API group \"\" at the cluster scope"
ts=2019-03-14T13:54:44.338921916Z caller=loop.go:90 component=sync-loop err="collating resources in cluster for sync: configmaps is forbidden: User \"system:serviceaccount:flux:flux\" cannot list resource \"configmaps\" in API group \"\" at the cluster scope"
ts=2019-03-14T13:59:07.767724146Z caller=images.go:17 component=sync-loop msg="polling images"
ts=2019-03-14T13:59:49.26397648Z caller=loop.go:103 component=sync-loop event=refreshed url=ssh://git@....git branch=... HEAD=84970b52031752ec2790c20802a0f2419f6b4c84
ts=2019-03-14T14:04:07.889994656Z caller=images.go:17 component=sync-loop msg="polling images"
ts=2019-03-14T14:04:56.89208238Z caller=loop.go:103 component=sync-loop event=refreshed url=ssh://git@....git branch=... HEAD=84970b52031752ec2790c20802a0f2419f6b4c84
ts=2019-03-14T14:05:23.734827732Z caller=loop.go:111 component=sync-loop jobID=1d217122-5fbe-df8e-976f-05db5f03a6f0 state=in-progress
ts=2019-03-14T14:05:31.362681374Z caller=loop.go:123 component=sync-loop jobID=1d217122-5fbe-df8e-976f-05db5f03a6f0 state=done success=true
ts=2019-03-14T14:05:36.499539849Z caller=loop.go:103 component=sync-loop event=refreshed url=ssh://git@....git branch=... HEAD=84970b52031752ec2790c20802a0f2419f6b4c84
ts=2019-03-14T14:09:08.028520016Z caller=images.go:17 component=sync-loop msg="polling images"
ts=2019-03-14T14:10:04.550939503Z caller=loop.go:103 component=sync-loop event=refreshed url=ssh://git@....git branch=... HEAD=84970b52031752ec2790c20802a0f2419f6b4c84

After a restart with the tag behind HEAD, I always get the following, with varying resources:

caller=loop.go:210 component=sync-loop err="collating resources in cluster for sync: configmaps is forbidden: User \"system:serviceaccount:flux:flux\" cannot list resource \"configmaps\" in API group \"\" at the cluster scope"
caller=loop.go:90 component=sync-loop err="collating resources in cluster for sync: configmaps is forbidden: User \"system:serviceaccount:flux:flux\" cannot list resource \"configmaps\" in API group \"\" at the cluster scope"

The repo tag never moved and nothing was applied. I added the reported resource, killed the pod, and repeated until I finally added the * to the role. After that there were no errors, the changes were applied, and the tag moved.
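The add-one-resource-and-restart loop described above can be shortened by collecting every missing permission from the logs in one pass. The following is a minimal sketch, not part of Flux: the names `rule`, `forbiddenRe`, `missingRules` and `sampleLines` are made up for illustration, and the pattern is only assumed to fit the message format quoted in this thread.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule is one permission that a log line reports as missing.
type rule struct {
	Verb, Resource, Group string
}

// forbiddenRe matches the phrase seen in the logs above. The optional
// backslashes allow for the escaped quotes inside err="..." fields.
var forbiddenRe = regexp.MustCompile(
	`cannot (\w+) resource \\?"([^"\\]+)\\?" in API group \\?"([^"\\]*)\\?"`)

// missingRules returns each distinct (verb, resource, group) triple in
// order of first appearance; lines without a match are ignored.
func missingRules(lines []string) []rule {
	seen := make(map[rule]bool)
	var out []rule
	for _, line := range lines {
		m := forbiddenRe.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		r := rule{Verb: m[1], Resource: m[2], Group: m[3]}
		if !seen[r] {
			seen[r] = true
			out = append(out, r)
		}
	}
	return out
}

// sampleLines are fragments of messages quoted in this thread: two with
// escaped quotes (logfmt style) and one with plain quotes (glog style).
var sampleLines = []string{
	`componentstatuses is forbidden: User \"system:serviceaccount:flux:flux\" cannot list resource \"componentstatuses\" in API group \"\" at the cluster scope`,
	`componentstatuses is forbidden: User \"system:serviceaccount:flux:flux\" cannot list resource \"componentstatuses\" in API group \"\" at the cluster scope`,
	`customresourcedefinitions.apiextensions.k8s.io is forbidden: User "system:serviceaccount:flux:flux" cannot list resource "customresourcedefinitions" in API group "apiextensions.k8s.io" at the cluster scope`,
}

func main() {
	for _, r := range missingRules(sampleLines) {
		fmt.Printf("verb=%s resource=%s group=%q\n", r.Verb, r.Resource, r.Group)
	}
}
```

This only reports what has already been denied, so a resource that Flux has not yet tried to list will not show up until a later run.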

> #1830 , which should fix this, is complete but pending review

#1668 I assume?

@squaremo
Member

Brill, thanks for that @zeeZ, most helpful!

@squaremo
Member

You might have to stick to v1.10.1 for now @zeeZ -- sorry about that :-/

@2opremio
Contributor

2opremio commented Mar 14, 2019

> #1668 I assume?

Yeah, sorry

@2opremio
Contributor

2opremio commented Mar 14, 2019

Now I am thinking that #1668 by itself won't be enough since it doesn't prevent flux from attempting to list cluster-scoped resources.

We need to think about this.
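One way to picture the gap described here is a filter that drops cluster-scoped kinds whenever a namespace whitelist is set. This is a rough sketch under stated assumptions and not the fix that was eventually merged: `apiResource` is a hypothetical stand-in for the discovery metadata of a resource kind, and `resourcesToList` and `demo` are invented names.

```go
package main

import "fmt"

// apiResource is a stand-in for the discovery metadata of one resource
// kind; only the two fields used here are modelled.
type apiResource struct {
	Name       string
	Namespaced bool
}

// resourcesToList drops cluster-scoped kinds when a namespace whitelist
// is in effect, so they are never requested at the cluster scope.
// With an empty whitelist every kind is kept.
func resourcesToList(all []apiResource, whitelist []string) []apiResource {
	if len(whitelist) == 0 {
		return all
	}
	var out []apiResource
	for _, r := range all {
		if r.Namespaced {
			out = append(out, r)
		}
	}
	return out
}

// demo runs the filter over a few kinds named in this thread and
// returns the names that survive when one namespace is whitelisted.
func demo() []string {
	all := []apiResource{
		{Name: "componentstatuses", Namespaced: false},
		{Name: "configmaps", Namespaced: true},
		{Name: "nodes", Namespaced: false},
		{Name: "pods", Namespaced: true},
	}
	var names []string
	for _, r := range resourcesToList(all, []string{"flux"}) {
		names = append(names, r.Name)
	}
	return names
}

func main() {
	fmt.Println(demo())
}
```

The remaining namespaced kinds would still have to be listed per whitelisted namespace rather than across the cluster, which this sketch does not cover.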

@2opremio
Contributor

2opremio commented Mar 15, 2019

@zeeZ The fix will be included in the next fix release. For now, you can check whether your issue is fixed by using the image quay.io/weaveworks/flux:master-5f0e9292.

Please reopen this issue if it isn't fixed.

@zeeZ
Author

zeeZ commented Mar 15, 2019

@2opremio I actually checked out your branch earlier. With no config change between 1.10.1 and your build, sync worked as expected, thank you.

What remains is the following, but it had no impact for me, as there are no CRDs managed by flux:

ERROR: logging before flag.Parse: E0315 11:00:55.601512       9 reflector.go:205] github.com/weaveworks/flux/cluster/kubernetes/cached_disco.go:100: Failed to list *v1beta1.CustomResourceDefinition: customresourcedefinitions.apiextensions.k8s.io is forbidden: User "system:serviceaccount:flux:flux" cannot list resource "customresourcedefinitions" in API group "apiextensions.k8s.io" at the cluster scope

This is repeated every second.

@2opremio
Contributor

Fantastic! I will look into fixing that as well

2opremio reopened this Mar 15, 2019
2opremio added the bug label Mar 15, 2019
@2opremio
Contributor

@zeeZ Are you getting any other errors? (even if not repeated)

@zeeZ
Author

zeeZ commented Mar 15, 2019

No further errors after adding a watch/list CRD cluster role.

@2opremio
Contributor

Great, I will try to get a fix for that early next week

@zeeZ
Author

zeeZ commented Mar 16, 2019

I've created a sample repo of some of the things I did to lock down Flux, maybe it can be of some use:
https://github.com/zeeZ/locked-down-flux

I believe that's as far as I can go without Helm or GC enabled. Removing any of the rules defined will produce some kind of error during common operations, though I haven't played around with it enough to be able to tell where sync is actually affected and what is just noise.

@2opremio
Contributor

2opremio commented Mar 18, 2019

I've taken a look at the remaining recurring error. It's a tricky one because the client-go library swallows it and handles it internally (logging by default):

func (r *Reflector) Run(stopCh <-chan struct{}) {
	glog.V(3).Infof("Starting reflector %v (%s) from %s", r.expectedType, r.resyncPeriod, r.name)
	wait.Until(func() {
		if err := r.ListAndWatch(stopCh); err != nil {
			utilruntime.HandleError(err)
		}
	}, r.period, stopCh)
}

I see a bunch of options:

  1. Create a PR which passes an error-handling function to the controller and reflector (I can try, but I doubt it will succeed).
  2. Create and maintain our own implementation of the controller/reflector (which sounds awful)
  3. Modify runtime.ErrorHandlers to mute Forbidden/NotExist errors (probably a bad idea) or to do some smart error handling (probably another bad idea).

I dealt with a similar problem in Scope before, going for (2) but the error handling wasn't so deep down in the call stack.

@squaremo / @hiddeco thoughts?
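Option (3) can be sketched as wrapping each error callback in a filter. The code below is a standalone model using only the standard library: the handler slice stands in for a package-level list of `func(error)` callbacks, and `errForbidden`, `errOther`, `muteIf` and `demo` are hypothetical names. It assumes the errors can be recognised by the filter, which is the part that is hard in practice.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors used only to drive the example.
var (
	errForbidden = errors.New("forbidden")
	errOther     = errors.New("connection refused")
)

// muteIf returns a handler that silently drops errors for which skip
// reports true and forwards everything else to next.
func muteIf(skip func(error) bool, next func(error)) func(error) {
	return func(err error) {
		if skip(err) {
			return
		}
		next(err)
	}
}

// demo installs the filter in front of a recording handler, feeds it
// one muted and one ordinary error, and returns what was recorded.
func demo() []string {
	var logged []string
	record := func(err error) { logged = append(logged, err.Error()) }

	// Stand-in for a package-level handler list.
	handlers := []func(error){record}
	isForbidden := func(err error) bool { return errors.Is(err, errForbidden) }
	for i, h := range handlers {
		handlers[i] = muteIf(isForbidden, h)
	}

	for _, err := range []error{errForbidden, errOther} {
		for _, h := range handlers {
			h(err)
		}
	}
	return logged
}

func main() {
	fmt.Println(demo())
}
```

Because the filter sits in a global list, it affects every caller that reports errors through that list, which is presumably why muting there is described as a probably bad idea.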

@squaremo
Member

squaremo commented Mar 18, 2019

> 2. Create and maintain our own implementation of the controller/reflector (which sounds awful)

Yes; adapting parts of client-go is usually a quixotic enterprise. If it's much more complicated than the solution in weaveworks/scope, I'd say it's not worth it.

Can we mute glog by doing flag.Parse with some fake command-line options? I'm grasping at straws .. (it's probably better to do 3. instead)
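For reference, the "fake command-line options" idea can be sketched with the standard `flag` package alone. glog prints the `ERROR: logging before flag.Parse:` prefix while `flag.Parsed()` is false, so parsing a fixed argument list flips that state without reading the real command line. `parseFakeArgs` is an invented name, and this would only remove the prefix; the underlying error would still be logged.

```go
package main

import (
	"flag"
	"fmt"
)

// parseFakeArgs parses a fixed, empty argument list on the default flag
// set and reports whether the flag package now considers flags parsed.
// The process's real arguments are not consulted.
func parseFakeArgs() (bool, error) {
	if err := flag.CommandLine.Parse([]string{}); err != nil {
		return false, err
	}
	return flag.Parsed(), nil
}

func main() {
	parsed, err := parseFakeArgs()
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println("flag.Parsed() after fake parse:", parsed)
}
```

A program that defines its own flags on the default flag set would need to take care that this call does not interfere with its normal parsing.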

@2opremio
Contributor

I went for (3) in the end

@2opremio
Contributor

@zeeZ It should be fixed now. I would appreciate it if you could give it a try ( quay.io/weaveworks/flux:master-2d4cc4d )

@zeeZ
Author

zeeZ commented Mar 18, 2019

After removing the CRD role I still get a constant stream of

ts=2019-03-18T21:05:54.062786645Z caller=main.go:175 type="internal kubernetes error" err="github.com/weaveworks/flux/cluster/kubernetes/cached_disco.go:100: Failed to list *v1beta1.CustomResourceDefinition: customresourcedefinitions.apiextensions.k8s.io is forbidden: User \"system:serviceaccount:flux-system:flux\" cannot list resource \"customresourcedefinitions\" in API group \"apiextensions.k8s.io\" at the cluster scope"

I did some digging around the IsForbidden || IsNotFound workaround you added, but it seems ReasonForError returns StatusReasonUnknown. I'm not familiar with K8S source, but I believe what we're dealing with here is no metav1 error but a more generic one: https://github.com/kubernetes/client-go/blob/7d04d0e2a0a1a4d4a1cd6baa432a2301492e4e65/tools/cache/reflector.go#L251
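The effect described above can be reproduced without any Kubernetes dependency. In this sketch `statusError` and `reasonFor` are simplified stand-ins (assumptions, not the real apimachinery types), and the `fmt.Errorf` call imitates what the linked reflector line appears to do: format the original error into a new message with `%v`, which discards its concrete type. `looksForbidden` is a hypothetical text-based fallback, shown only to illustrate what remains to match on.

```go
package main

import (
	"fmt"
	"strings"
)

// statusError is a simplified stand-in for a typed API error that
// carries a machine-readable reason alongside its message.
type statusError struct {
	reason string
	msg    string
}

func (e *statusError) Error() string { return e.msg }

// reasonFor recognises only errors whose concrete type is *statusError;
// anything else is reported as "Unknown".
func reasonFor(err error) string {
	if se, ok := err.(*statusError); ok {
		return se.reason
	}
	return "Unknown"
}

// looksForbidden is a hypothetical fallback that inspects the text.
func looksForbidden(err error) bool {
	return strings.Contains(err.Error(), " is forbidden: ")
}

// demo compares the typed error with a copy that was flattened into a
// plain message before being reported.
func demo() (direct, flattened string, textMatch bool) {
	orig := &statusError{
		reason: "Forbidden",
		msg:    `customresourcedefinitions.apiextensions.k8s.io is forbidden: User "system:serviceaccount:flux:flux" cannot list resource "customresourcedefinitions" in API group "apiextensions.k8s.io" at the cluster scope`,
	}
	// %v keeps only the text of orig, so the result is a generic error.
	flat := fmt.Errorf("%s: Failed to list %v: %v",
		"cached_disco.go:100", "*v1beta1.CustomResourceDefinition", orig)
	return reasonFor(orig), reasonFor(flat), looksForbidden(flat)
}

func main() {
	direct, flattened, textMatch := demo()
	fmt.Println("typed error reason:    ", direct)
	fmt.Println("flattened error reason:", flattened)
	fmt.Println("text still matches:    ", textMatch)
}
```

Matching on message text is brittle, since the wording can change between Kubernetes versions, so it is at best a stopgap.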

While it stings a bit, I can live with allowing CRD listing. My initial issue was with list access to everything in the cluster, which has been resolved thanks to you.

Perhaps documentation could be added listing the minimum privileges Flux needs in order to operate properly, though I suspect that would be complicated by helm and GC. Maybe a more restricted minimal example next to deploy?

On a positive note, at least it is not silently firing a request every second that may add up for each instance you run ;)

@2opremio
Contributor

2opremio commented Mar 18, 2019 via email
