GatewayAPI topology (DAG 1.0) #530

Closed
9 of 12 tasks
eguzki opened this issue Apr 10, 2024 · 4 comments
Labels
kind/epic, size/large

Comments

@eguzki
Contributor

eguzki commented Apr 10, 2024

#447 introduced the GatewayAPI topology (DAG) implementation

This (epic) issue tracks the usage of the GatewayAPI topology (DAG) implementation by the different Kuadrant controllers

The GatewayAPI Topology

The controller builds what is called "The GatewayAPI Topology". It reads ALL the Gateways, HTTPRoutes and policies (only those the controller cares about) in the cluster. With all that info, the controller creates a new instance of the Gateway API topology as well as a set of indexes (another "view" of the topology) that can be queried. It is interesting to note here that field indexers can optionally be added to the manager to build internal index caches. One example being implemented is an index that answers queries like "give me all the HTTPRoutes whose parent is Gateway A".
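
A minimal sketch of such a field indexer with controller-runtime, for the "all the HTTPRoutes whose parent is Gateway A" example (the index key and helper names are hypothetical, not the actual kuadrant-operator code):

```go
package example

import (
	"context"
	"fmt"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	gatewayapiv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// hypothetical index key for the cache-backed field index
const routeParentGatewayKey = ".spec.parentRefs.gateway"

// addHTTPRouteParentIndex registers a field indexer on the manager so the cached
// client can answer "give me all the HTTPRoutes whose parent is Gateway <ns>/<name>".
func addHTTPRouteParentIndex(ctx context.Context, mgr manager.Manager) error {
	return mgr.GetFieldIndexer().IndexField(ctx, &gatewayapiv1.HTTPRoute{}, routeParentGatewayKey,
		func(obj client.Object) []string {
			route := obj.(*gatewayapiv1.HTTPRoute)
			keys := make([]string, 0, len(route.Spec.ParentRefs))
			for _, parentRef := range route.Spec.ParentRefs {
				// parentRef namespace defaults to the route's own namespace
				ns := route.Namespace
				if parentRef.Namespace != nil {
					ns = string(*parentRef.Namespace)
				}
				keys = append(keys, fmt.Sprintf("%s/%s", ns, parentRef.Name))
			}
			return keys
		})
}

// routesForGateway queries the index: it lists the HTTPRoutes parented by the given gateway.
func routesForGateway(ctx context.Context, cl client.Client, gatewayKey string) (*gatewayapiv1.HTTPRouteList, error) {
	routes := &gatewayapiv1.HTTPRouteList{}
	err := cl.List(ctx, routes, client.MatchingFields{routeParentGatewayKey: gatewayKey})
	return routes, err
}
```

The index is maintained by the cached client, so the query above does not hit the API server directly.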

The current implementation of the "GatewayAPI Topology" does not use a k8s client to query Gateway API resources and Kuadrant policies. Instead, fetching the resources must be done externally, and the GatewayAPI topology then receives the lists of resources. It could be done otherwise, with a generic function that uses a k8s client to build the topology for a given policy kind. The main reason for the current implementation is to decouple the GatewayAPI topology from the cluster, meaning that a GatewayAPI topology can be built from any set of resources filtered with some criteria.
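
For illustration, a rough sketch of that split, assuming a hypothetical NewGatewayAPITopology constructor and With* options (not the actual API): the caller does the fetching (and any filtering), and the topology only ever sees slices of resources.

```go
package example

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
	gatewayapiv1 "sigs.k8s.io/gateway-api/apis/v1"

	kuadrantv1beta2 "github.com/kuadrant/kuadrant-operator/api/v1beta2"
)

// buildTopology lists the resources (possibly with filters) and hands the slices to
// the topology constructor. NewGatewayAPITopology and its options are hypothetical
// names; the point is that the topology itself never queries the cluster.
func buildTopology(ctx context.Context, cl client.Client) (*GatewayAPITopology, error) {
	gateways := &gatewayapiv1.GatewayList{}
	if err := cl.List(ctx, gateways); err != nil {
		return nil, err
	}

	routes := &gatewayapiv1.HTTPRouteList{}
	if err := cl.List(ctx, routes); err != nil {
		return nil, err
	}

	policies := &kuadrantv1beta2.RateLimitPolicyList{}
	if err := cl.List(ctx, policies); err != nil {
		return nil, err
	}

	// fetching done: from here on, the topology is built purely in memory
	return NewGatewayAPITopology(
		WithGateways(gateways.Items),
		WithHTTPRoutes(routes.Items),
		WithPolicies(policies.Items),
	), nil
}
```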

The implementation assumes that reading ALL the gateways, routes and policies is a "cheap" operation. That assumption rests on the controller-runtime k8s client, which holds a cache, at least for list and get operations. That cache is kept up to date in the background when objects are updated or deleted, whether triggered by Kuadrant controllers, third-party controllers, or any other k8s API server user. Check this comment for a verification of the "cheap" operation assumption. Indeed, the resource lists are cached across reconciliation events and controllers.

The topology implementation is based on a pure Directed Acyclic Graph (DAG) data structure that is totally decoupled from the GatewayAPI relations. On top of that, there is an implementation of the GatewayAPI topology which builds a graph from Gateways and HTTPRoutes; in the future, it can be extended to additional Gateway API types. On top of the GatewayAPI topology, a number of indexes are created which are tailored to the needs of the Kuadrant controllers. Thus:

Kuadrant controller 
       |
       |
       V
Custom topology indexes (views)
       |
       |
       V
GatewayAPI topology (focused on one gateway)
       |
       |
       V
     DAG
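
To make the layering a bit more concrete, here is a rough sketch with illustrative type names (not the actual implementation): a generic DAG at the bottom that knows nothing about Gateway API, a GatewayAPI-aware layer that populates it, and indexes as query helpers for the controllers on top.

```go
package example

// Node is anything that can live in the generic DAG; it knows nothing about Gateway API.
type Node interface {
	ID() string
}

// DAG is the pure directed-acyclic-graph layer.
type DAG struct {
	nodes map[string]Node
	edges map[string][]string // parent node ID -> child node IDs
}

func NewDAG() *DAG {
	return &DAG{nodes: map[string]Node{}, edges: map[string][]string{}}
}

func (d *DAG) AddNode(n Node) { d.nodes[n.ID()] = n }

func (d *DAG) AddEdge(parent, child string) {
	d.edges[parent] = append(d.edges[parent], child)
}

func (d *DAG) Children(id string) []string { return d.edges[id] }

// GatewayAPITopology is the Gateway API-aware layer: it populates the DAG with
// Gateway and HTTPRoute nodes and gateway->route edges (population omitted here).
type GatewayAPITopology struct {
	graph *DAG
}

// RoutesForGateway is an example of an index ("view") tailored to a controller's
// needs: the HTTPRoutes hanging from a given Gateway node.
func (t *GatewayAPITopology) RoutesForGateway(gatewayID string) []string {
	return t.graph.Children(gatewayID)
}
```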
@eguzki eguzki moved this to In Progress in Kuadrant Apr 10, 2024
@eguzki eguzki added the kind/epic and size/large labels Apr 10, 2024
@guicassolato
Contributor

guicassolato commented Apr 11, 2024

I really like the direction this is going. In fact, I now wonder if the approach adopted for the WasmPlugin (#447) and (ongoing) for the Limitador CR (#527) shouldn't be a model for all internal resources managed by the Kuadrant Operator, and not only for the resources that map 1:1 to the gateway objects.

What I mean by that is no longer having any controller that reconciles a policy. Instead, policies would be treated as just another source of events that triggers reconciliation through mapping. All events would map to the affected gateways, and therefore the input to the reconciliation function would always be the gateway object.
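
A hedged sketch of that wiring with controller-runtime (assuming a recent controller-runtime release, since the Watches signature has changed over time; the mapping function and reconciler names are illustrative): policy events are mapped to Gateway requests, so the reconciler only ever receives Gateways.

```go
package example

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	gatewayapiv1 "sigs.k8s.io/gateway-api/apis/v1"

	kuadrantv1beta2 "github.com/kuadrant/kuadrant-operator/api/v1beta2"
)

// setupGatewayController wires policies as "just another source of events": their
// events are mapped to Gateway requests, so the reconcile input is always a Gateway.
func setupGatewayController(mgr ctrl.Manager, gatewayReconciler reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&gatewayapiv1.Gateway{}).
		Watches(&kuadrantv1beta2.RateLimitPolicy{},
			handler.EnqueueRequestsFromMapFunc(mapPolicyToGateways)).
		Complete(gatewayReconciler)
}

// mapPolicyToGateways resolves which Gateways are affected by a policy event, e.g.
// by following the policy's targetRef through the topology (resolution omitted).
func mapPolicyToGateways(ctx context.Context, obj client.Object) []reconcile.Request {
	// placeholder: in practice, query the topology/indexes for the affected gateways
	return []reconcile.Request{
		{NamespacedName: types.NamespacedName{Namespace: obj.GetNamespace(), Name: "gw-a"}},
	}
}
```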

The kinds of resources related to policies to actually reconcile are:

  • WasmPlugin
  • EnvoyFilter
  • Limitador
  • AuthConfig
  • AuthorizationPolicy
  • DNSRecord
  • Certificate

The reconciliation of each of these resources can happen as events independent from each other, by being handled by dedicated reconcilers. Within each reconciliation process, we could leverage concurrency in Go to compute the desired states of resources known upfront to be truly independent from each other (if there are any, given by the number of branches down from a gateway node until that kind of resource); see the sketch below.
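
For instance, a hedged sketch of that concurrency inside one Gateway reconciliation, using golang.org/x/sync/errgroup (the desiredWasmPlugin/desiredEnvoyFilter helpers and the topology type are hypothetical names):

```go
package example

import (
	"context"

	"golang.org/x/sync/errgroup"
	gatewayapiv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// computeDesiredStates computes, concurrently, desired states that are known upfront
// to be independent of each other. Helper names and the topology type are illustrative.
func computeDesiredStates(ctx context.Context, gw *gatewayapiv1.Gateway, topology *GatewayAPITopology) error {
	g, ctx := errgroup.WithContext(ctx)

	g.Go(func() error {
		_, err := desiredWasmPlugin(ctx, gw, topology) // independent branch
		return err
	})

	g.Go(func() error {
		_, err := desiredEnvoyFilter(ctx, gw, topology) // independent branch
		return err
	})

	// the first error cancels the shared context and is returned here
	return g.Wait()
}
```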

In the future, if we decide to grow the hierarchy upwards and, say, make the GatewayClass the new root (thus replacing the Gateway), it should be easy to adapt the mappers and wrap every reconciliation function within another loop. Arguably, this could be done in a generic way from day 1.

I realise this approach increases the computation load, compared to the current attempt to limit the blast radius by focusing on each individual policy object. On the other hand, I believe it's more consistent with the expectation of a "state of the world" kind of reconciliation. Plus, computing a bigger desired state does not necessarily mean more requests to the API server.

The bigger challenge IMO is status updating, but I believe even on this one we can win with this approach. The status of the reconciled objects (when available) is straightforward, as it can be set within each reconciliation function. The status of the policy objects, on the other hand, depends on gathering the final states of all reconciliation functions before it can be set. If every step (i.e. every internal resource kind, every reconciliation function) has a corresponding status condition added to the policy, computing the final one should be a no-brainer. Finally, the status of the gateway objects and routes can wait for the policy and/or be partially and incrementally set by the reconciliation functions.
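
A hedged sketch of the per-step condition idea, using apimachinery's condition helpers (the condition type name is illustrative, and it assumes the policy status exposes a []metav1.Condition list): each reconciliation function records its own condition, and the aggregate condition is later derived from all of them.

```go
package example

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	kuadrantv1beta2 "github.com/kuadrant/kuadrant-operator/api/v1beta2"
)

// setWasmPluginCondition records the outcome of one reconciliation step
// ("WasmPluginReady" is a hypothetical condition type) on the policy; the aggregate
// condition can later be computed from all per-step conditions.
func setWasmPluginCondition(policy *kuadrantv1beta2.RateLimitPolicy, reconcileErr error) {
	cond := metav1.Condition{
		Type:    "WasmPluginReady",
		Status:  metav1.ConditionTrue,
		Reason:  "WasmPluginReconciled",
		Message: "wasm plugin is up to date",
	}
	if reconcileErr != nil {
		cond.Status = metav1.ConditionFalse
		cond.Reason = "ReconciliationError"
		cond.Message = reconcileErr.Error()
	}
	// assumes the policy status exposes Conditions []metav1.Condition
	meta.SetStatusCondition(&policy.Status.Conditions, cond)
}
```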

At the level of the Kuadrant instance, the kinds of resource to reconcile are:

  • IstioOperator
  • (Istio) ConfigMap
  • ServiceMeshControlPlane

I'd like to propose modifying the list of controllers in the description to the list of resource kinds above. I.e., no more "AuthPolicy controller" or "RateLimitPolicy controller", but "AuthConfigReconciler", "AuthorizationPolicyReconciler", "WasmPluginReconciler", "LimitadorReconciler", etc.

@eguzki
Contributor Author

eguzki commented Apr 11, 2024

The main motivation for moving logic from a "policy controller" to specific resource controllers is that each controller can define its own mappers and predicates to filter events. This is a powerful capability that totally shapes the controller's logic. Additionally, the reconciliation of some elements is not coupled to that of others; responsibility is decoupled by design. But that also comes with a price, and that brings me to "policy controllers". They still need to exist, as far as I can see now. The policy controllers' main responsibility would basically be to maintain an updated status, plus other features centered on the policy itself, for instance the constraint that a network resource can only be targeted by a single Kuadrant policy (per type). That functionality is owned, IMO, by the "policy controller".
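
To illustrate the kind of per-controller filtering meant here, a hedged sketch of a resource-specific controller with its own mapper and predicates (the mapper and reconciler are hypothetical; predicate.GenerationChangedPredicate is a controller-runtime builtin that skips status-only updates):

```go
package example

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	gatewayapiv1 "sigs.k8s.io/gateway-api/apis/v1"

	istioclientgoextensionv1alpha1 "istio.io/client-go/pkg/apis/extensions/v1alpha1"
)

// setupWasmPluginController shows a controller that owns one internal resource kind
// and defines its own event filtering: HTTPRoute events only trigger reconciliation
// on spec (generation) changes, and are mapped to the affected WasmPlugins.
func setupWasmPluginController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&istioclientgoextensionv1alpha1.WasmPlugin{}).
		Watches(&gatewayapiv1.HTTPRoute{},
			handler.EnqueueRequestsFromMapFunc(mapRouteToWasmPlugins), // hypothetical mapper
			builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Complete(r)
}

// mapRouteToWasmPlugins is a hypothetical mapper; in practice it would use the
// topology indexes to find the WasmPlugins affected by the route change.
func mapRouteToWasmPlugins(ctx context.Context, obj client.Object) []reconcile.Request {
	return nil // placeholder
}
```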

The controller reconciliation function input cannot always be a Gateway (or a GatewayClass). One example is Limitador's limits controller: there is one Limitador object to be reconciled per Kuadrant instance (currently only one is allowed per cluster). So for this particular case, the controller reconciliation function input is the Kuadrant CR.
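
Expressed as code (a hedged sketch; the namespace and name are placeholders), this is just a different mapper: for the Limitador limits case, every relevant event funnels into the single Kuadrant CR request rather than into a Gateway request.

```go
package example

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// mapToKuadrantCR maps any relevant event to the single Kuadrant CR (only one per
// cluster is currently allowed), which becomes the reconciliation function input.
// The namespace and name below are placeholders.
func mapToKuadrantCR(ctx context.Context, _ client.Object) []reconcile.Request {
	return []reconcile.Request{
		{NamespacedName: types.NamespacedName{Namespace: "kuadrant-system", Name: "kuadrant"}},
	}
}
```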

I'd like to propose modifying the list of controllers in the description to the list of resource kinds above. I.e., no more "AuthPolicy controller" or "RateLimitPolicy controller", but "AuthConfigReconciler", "AuthorizationPolicyReconciler", "WasmPluginReconciler", "LimitadorReconciler", etc.

Sure 👍

@guicassolato
Contributor

The controller reconciliation function input cannot always be a Gateway (or a GatewayClass). One example is Limitador's limits controller: there is one Limitador object to be reconciled per Kuadrant instance (currently only one is allowed per cluster). So for this particular case, the controller reconciliation function input is the Kuadrant CR.

This makes sense and it's actually a more balanced view than mine from before. Depending on the kind of object, we should pick the correct root of the tree from which to reconcile. And I guess the "correct" one is the first common node that requires no jumping to sibling nodes to compute the desired state – as you exemplified for the Limitador CR, whose root cannot be the Gateway.

IOW, if you can create a goroutine that reduces the problem to the size of the subtree rooted at the object passed as input to the reconciliation function, then you're good. But you shouldn't have to query anything outside this subtree.

I could argue that today the Kuadrant CR is the tip and true root of the entire tree.


The main motivation for moving logic from a "policy controller" to specific resource controllers is that each controller can define its own mappers and predicates to filter events. This is a powerful capability that totally shapes the controller's logic. Additionally, the reconciliation of some elements is not coupled to that of others; responsibility is decoupled by design. But that also comes with a price, and that brings me to "policy controllers". They still need to exist, as far as I can see now. The policy controllers' main responsibility would basically be to maintain an updated status, plus other features centered on the policy itself, for instance the constraint that a network resource can only be targeted by a single Kuadrant policy (per type). That functionality is owned, IMO, by the "policy controller".

Here I actually see almost no conflict with what I said. What you call a "policy controller" is what I implied is needed for the status reconciliation. I would call those "xPolicyStatusReconciler" and stick with the pattern of a reconciler being named after the resource kind it reconciles, not the resource kind it reconciles from.

As for the validation, I argue it does not belong in this controller. If you deem a policy to be invalid at this stage, it's already too late, as the other reconcilers may have already created resources as if the policy were valid. All validation should be performed at the API level IMO, which I know is not the case today.

@eguzki eguzki mentioned this issue Jul 23, 2024
@guicassolato guicassolato changed the title GatewayAPI topology (DAG) GatewayAPI topology (DAG 1.0) Aug 16, 2024
@eguzki
Contributor Author

eguzki commented Sep 10, 2024

We can close this; already working on DAG 2.0.

Follow-up in Kuadrant/architecture#29 and Kuadrant/architecture#95.

@eguzki eguzki closed this as completed Sep 10, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Kuadrant Sep 10, 2024