Flux does not work properly for cluster with large number of namespaces #1181
Comments
Oooph, 30k namespaces, yeah that'll do it :-S The flux daemon is constantly looking for all the workloads it can find, as you surmise. There will be a point where it saturates its rate limit, and finds it difficult to answer queries. This would probably be improved by using a local cache of the Kubernetes model and watchers (see #471, #1039). One thing you can do is narrow down the namespaces that fluxd can see, by giving it a service account with narrowed permissions, using role-based access control (RBAC). In the absence of RBAC, a whitelist is a good idea (I can't think of any situations in which you wouldn't want the whitelist and RBAC to line up, but there may be some).
That sounds like a pretty good solution in the short term. I'll give that a go and report back.
While the RBAC solution somewhat worked, Flux was absolutely spamming the logs in that cluster because it would still try to access the namespaces. You cannot restrict the namespaces returned by a "list" operation, so Flux would try to search in all 30k+ namespaces individually just to find out that it did not have access. To reduce API server overhead I have added the whitelist in the above PR. I do believe that moving Flux to a watch-based workflow will alleviate most of the problems found here, so this is just a bandaid until then. It may even be useful beyond then in terms of limiting the scope of Flux in certain cluster scenarios.
…tch. Fixes fluxcd#1181
Currently, Flux expects to have access to all namespaces: even if no manifests in the repository reference another namespace, it will check all namespaces for controllers to update.
This change adds a --k8s-namespace-whitelist setting which, if set, will restrict Flux to only watch the specified namespaces and ignore all others. Intended for clusters with large numbers of namespaces or restrictive RBAC policies.
If provided, Flux will only monitor workloads in the given namespaces. This significantly cuts the number of API calls made. An empty list (i.e. not provided) yields the usual behaviour.
The problem
Flux works great on all the clusters I manage until I get to a snowflake that has 30k+ namespaces. When that happens, the daemon won't output any logs and will max out on its limit of 50 requests per second to the API server.
I assume it is because it is trying to list any deployments happening in those namespaces.
The solution
As clusters with this many namespaces are quite rare, I would like to propose the following solution.
I think that being able to whitelist some namespaces that you care about for Flux management would be a good way to solve this. If no namespaces are whitelisted, then it would default to its current behaviour. If any namespaces are provided, then it will only look for objects within the specified namespaces. This means that current clusters are unaffected while large clusters can pick and choose.
Symptoms and debugging
Example logs from cluster that has 30k+ namespaces:
Graphs showing Flux smashing the API server with requests.