This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Flux does not work properly for cluster with large number of namespaces #1181

Closed
mwhittington21 opened this issue Jun 29, 2018 · 3 comments

Comments

@mwhittington21

The problem

Flux works great on all the clusters I manage until I get to a snowflake that has 30k+ namespaces. When that happens the daemon won't output any logs and maxes out its limit of 50 requests per second to the API server.

I assume this is because it is trying to list the deployments in each of those namespaces.
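To illustrate the scale, here is a minimal client-go sketch (not Flux's actual code) of the enumeration pattern that hurts here: one call to list namespaces, then one Deployments list per namespace, behind a client-side limit comparable to the 50 requests per second above. With 30k+ namespaces that is roughly ten minutes of nothing but list traffic per sync.

```go
// Minimal illustration of per-namespace enumeration cost; not Flux's code.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	// Client-side rate limiting comparable to the 50 req/s ceiling described above.
	cfg.QPS = 50
	cfg.Burst = 100

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	nss, err := client.CoreV1().Namespaces().List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	calls := 1 // the namespace list itself
	for _, ns := range nss.Items {
		// One request per namespace, every sync loop.
		if _, err := client.AppsV1().Deployments(ns.Name).List(ctx, metav1.ListOptions{}); err != nil {
			fmt.Printf("namespace %s: %v\n", ns.Name, err)
		}
		calls++
	}
	fmt.Printf("issued %d list calls for %d namespaces\n", calls, len(nss.Items))
}
```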

The solution

As clusters with this many namespaces are quite rare, I would like to propose the following solution.

I think that being able to whitelist the namespaces you care about for Flux management would be a good way to solve this. If no namespaces are whitelisted, it would default to its current behaviour. If any namespaces are provided, it would only look for objects within the specified namespaces. This means that current clusters are unaffected while large clusters can pick and choose.
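A minimal sketch of what I have in mind, with hypothetical names (not code from any PR), assuming current client-go: an empty whitelist keeps today's behaviour of enumerating every namespace, while a non-empty one avoids the cluster-wide namespace list entirely.

```go
// Hypothetical helper illustrating the proposed whitelist semantics.
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func namespacesToScan(ctx context.Context, client kubernetes.Interface, whitelist []string) ([]string, error) {
	if len(whitelist) > 0 {
		// Only the namespaces the operator asked for; no cluster-wide
		// namespace list is needed at all.
		return whitelist, nil
	}
	// Default: current behaviour, enumerate every namespace.
	nss, err := client.CoreV1().Namespaces().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(nss.Items))
	for _, ns := range nss.Items {
		names = append(names, ns.Name)
	}
	return names, nil
}
```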

Symptoms and debugging

Example logs from the cluster with 30k+ namespaces:

us-east-1_sync --docker-config=/docker-creds/config.json --registry-trace=true
ts=2018-06-29T05:28:24.9841875Z caller=main.go:138 version=1.4.1
ts=2018-06-29T05:28:25.006700363Z caller=main.go:227 component=cluster identity=/etc/fluxd/ssh/identity
ts=2018-06-29T05:28:25.006744553Z caller=main.go:228 component=cluster identity.pub="ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCZ4gypLgAlw/Ja5SWTpXP26HoAvaswz7S+0550I8AdZx5bwGw23StBgFZ4KrfqNlpbtNPx9dwRLu4SK9uFaP5ocJ/9ghNfv6VoBtvdvOGmVZxljMrFAQza3adZef67WkOaV23FwPJ0EYfEKRK5EHR2Ju1IJWwaqsLqjmEoRYQGVUaNgcA76UQfFmNE8yQmvXcyVoBZc2Go3ILMjnk4RD3CK63I+EUhOiYTuyAuWY05Dc9XFctalNcpHzli/Eas/79qRyAeg1zyaAI/a7trULE9SPXlKXwfch7hZQvYRtfLqPKu5B8XjvfRga3sFHDETgLxtjmyALaHJFYe7AKLTiQP"
ts=2018-06-29T05:28:25.006774386Z caller=main.go:229 component=cluster host=https://10.155.0.1:443 version=kubernetes-v1.10.3
ts=2018-06-29T05:28:25.006843526Z caller=main.go:241 component=cluster kubectl=/usr/local/bin/kubectl
ts=2018-06-29T05:28:25.007820146Z caller=main.go:249 component=cluster ping=true
ts=2018-06-29T05:28:25.00939803Z caller=main.go:368 url=ssh://git@stash.atlassian.com:7997/kube/goliath-flux user="Weave Flux" email=support@weave.works sync-tag=bbci/prod/us-east-1_sync notes-ref=flux set-author=false
ts=2018-06-29T05:28:25.009441423Z caller=main.go:423 upstream="no upstream URL given"
ts=2018-06-29T05:28:25.009678217Z caller=loop.go:89 component=sync-loop err="git repo not ready"
ts=2018-06-29T05:28:25.009843024Z caller=images.go:15 component=sync-loop msg="polling images"
ts=2018-06-29T05:28:25.010617777Z caller=images.go:21 component=sync-loop error="getting unlocked automated services: git repo not ready"
ts=2018-06-29T05:28:25.010434008Z caller=main.go:440 addr=:3030
ts=2018-06-29T05:28:25.307131203Z caller=checkpoint.go:24 component=checkpoint msg="up to date" latest=1.4.1
ts=2018-06-29T05:28:30.360228703Z caller=loop.go:102 component=sync-loop event=refreshed url=ssh://git@stash.atlassian.com:7997/kube/goliath-flux branch=bbci/prod/us-east-1 HEAD=7bb72f2c1dff479e90fcec2e8a96f09aba65d6a3

Graphs showing Flux smashing the API server with requests.


@squaremo
Member

Oooph, 30k namespaces, yeah that'll do it :-S

The flux daemon is constantly looking for all the workloads it can find, as you surmise. There will be a point where it saturates its rate limit and finds it difficult to answer queries. This would probably be improved by using a local cache of the Kubernetes model, kept up to date with watchers (see #471, #1039).

One thing you can do is narrow down the namespaces that fluxd can see, by giving it a service account with narrowed permissions using role-based access control (RBAC). In the absence of RBAC, a whitelist is a good idea (I can't think of any situations in which you wouldn't want the whitelist and RBAC to line up, but there may be some).
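For a sense of what the watch-based approach would look like, here is a minimal sketch using current client-go (not what fluxd does today): a shared informer keeps a local cache of Deployments fed by a single list-and-watch, optionally scoped to one namespace, so answering queries no longer costs a burst of list calls.

```go
// Sketch of a namespace-scoped, watch-backed local cache; not fluxd's code.
package sketch

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func watchDeployments(client kubernetes.Interface, namespace string, stop <-chan struct{}) (cache.SharedIndexInformer, error) {
	// Scope the informer to a single namespace; pass "" to watch the whole cluster.
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 30*time.Second, informers.WithNamespace(namespace))

	informer := factory.Apps().V1().Deployments().Informer()
	factory.Start(stop)

	// Block until the local cache has been populated by the initial list and watch.
	if !cache.WaitForCacheSync(stop, informer.HasSynced) {
		return nil, fmt.Errorf("deployment cache failed to sync")
	}
	return informer, nil
}
```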

@mwhittington21
Author

That sounds like a pretty good solution in the short term. I'll give that a go and report back.

@mwhittington21
Author

While the RBAC solution somewhat worked, Flux was absolutely spamming the logs in that cluster because it would still try to access every namespace. You cannot restrict the namespaces returned by a "list" operation, so Flux would search all 30k+ namespaces individually just to find out that it did not have access.

To reduce API server overhead I have added the whitelist in the above PR. I do believe that moving Flux to a watch-based workflow will alleviate most of the problems found here, so this is just a band-aid until then. It may even remain useful beyond that for limiting the scope of Flux in certain cluster scenarios.
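To illustrate the failure mode (a sketch of the general pattern, not Flux's actual code): the namespace list cannot be filtered by what the caller is allowed to read, so a scan under narrowed RBAC still costs one API round-trip, and one Forbidden error in the logs, per namespace.

```go
// Why narrowed RBAC alone doesn't reduce per-namespace traffic; not Flux's code.
package sketch

import (
	"context"
	"log"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func scanAll(ctx context.Context, client kubernetes.Interface) {
	nss, err := client.CoreV1().Namespaces().List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, ns := range nss.Items {
		_, err := client.AppsV1().Deployments(ns.Name).List(ctx, metav1.ListOptions{})
		if apierrors.IsForbidden(err) {
			// Still one API round-trip and one log line per inaccessible
			// namespace; with 30k+ namespaces both add up quickly.
			log.Printf("no access to namespace %q, skipping", ns.Name)
			continue
		}
		// ... reconcile workloads in the namespaces we can read ...
	}
}
```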

oliviabarrick pushed a commit to oliviabarrick/flux that referenced this issue Jul 3, 2018
…tch.

Fixes fluxcd#1181

Currently, Flux expects to have access to all namespaces: even if no manifest
in the repository references another namespace, it will check all namespaces
for controllers to update.

This change adds a --k8s-namespace-whitelist setting which, if set, will restrict
Flux to only watch the specified namespaces and ignore all others.

Intended for clusters with large numbers of namespaces or restrictive RBAC
policies. If provided, Flux will only monitor workloads in the given namespaces,
which significantly cuts the number of API calls made.

An empty list (i.e. not provided) yields the usual behaviour.