
Policy engine cleanup #534

Merged: orozery merged 1 commit into clusterlink-net:main from loadbalancer on May 13, 2024

Conversation

@orozery orozery commented Apr 18, 2024

No description provided.

orozery commented Apr 18, 2024

@zivnevo just a heads-up: this is not yet working.

@orozery orozery force-pushed the loadbalancer branch 14 times, most recently from 63c8672 to 1fcd74e on May 1, 2024 at 13:02
@orozery orozery changed the title from "(WIP) Policy engine cleanup" to "Policy engine cleanup" on May 1, 2024
@orozery orozery linked an issue May 1, 2024 that may be closed by this pull request
@orozery orozery marked this pull request as ready for review May 1, 2024 13:14

zivnevo commented May 1, 2024

This is somewhat more than a cleanup - it changes the algorithm for selecting the destination peer for an egress request.
Previously we had:

1. Start with all source peers for the required Import
2. Filter out unreachable peers
3. Consult the PDP to filter out unauthorized peers
4. Consult the load balancer to select one of the remaining peers (if any)

Now we have:

1. Consult the load balancer to select a peer from the not-yet-failing source peers for the required Import.
2. Check the peer's status. If it is unreachable, mark it as failing and go to 1.
3. Consult the PDP to check whether the selected peer is authorized. If not, mark it as failing and go to 1.
4. Try to authorize the request at the selected peer. If unauthorized, mark it as failing and go to 1.
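
Roughly, the new flow looks like the following sketch (all names here are illustrative, not the actual controlplane API):

func selectDest(
	sources []string,
	lbPick func(candidates []string) string, // load-balancer choice among candidates
	isReachable func(peer string) bool,      // peer status from heartbeat monitoring
	pdpAllows func(peer string) bool,        // local PDP decision
	authorizeAt func(peer string) error,     // remote authorization at the selected peer
) (string, bool) {
	failing := map[string]bool{}
	for len(failing) < len(sources) {
		// 1. Load balancer selects a peer out of the not-yet-failing sources.
		candidates := make([]string, 0, len(sources))
		for _, s := range sources {
			if !failing[s] {
				candidates = append(candidates, s)
			}
		}
		peer := lbPick(candidates)
		switch {
		case !isReachable(peer): // 2. unreachable: mark as failing and go to 1
			failing[peer] = true
		case !pdpAllows(peer): // 3. PDP denies: mark as failing and go to 1
			failing[peer] = true
		case authorizeAt(peer) != nil: // 4. remote authorization fails: mark as failing and go to 1
			failing[peer] = true
		default:
			return peer, true
		}
	}
	return "", false // no reachable, authorized destination remains
}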

While the new algorithm is more coherent and has a simpler implementation, we may lose some performance in the (rare?) cases where imports have a large number of sources.

@elevran , @kfirtoledo , your thoughts?
@orozery , please correct me if I'm wrong somewhere.

orozery commented May 1, 2024

we may lose some performance in the (rare?) cases where imports have a large number of sources.

Where do you lose performance?
Actually, in most cases you gain performance since you query the PDP only for one destination, and not all possible destinations.

orozery commented May 1, 2024

@orozery , please correct me if I'm wrong somewhere.

One correction: in step 2, if a peer is unreachable, it is not marked as failing, but rather "delayed".

To be more precise, for each egress connection, the load-balancing scheme yields some ordering (or shuffling, if you will) of the list of import sources.
The controlplane tries them out one by one, except for those whose peer status is "unreachable" (which is determined periodically by monitoring heartbeats).
Those skipped (or "delayed") sources are tried out as a last resort, only if all other sources have failed.
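
In code, the idea is roughly this (illustrative names only, not the actual implementation):

func orderSources(
	sources []string,
	unreachable func(peer string) bool,   // peer status from periodic heartbeat monitoring
	shuffle func(list []string) []string, // ordering produced by the load-balancing scheme
) []string {
	ordered := shuffle(append([]string(nil), sources...))
	ready := make([]string, 0, len(ordered))
	delayed := make([]string, 0, len(ordered))
	for _, s := range ordered {
		if unreachable(s) {
			delayed = append(delayed, s) // "delayed": skipped for now
		} else {
			ready = append(ready, s)
		}
	}
	// Delayed sources go last, so they are tried only if all other sources fail.
	return append(ready, delayed...)
}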

kfirtoledo commented May 2, 2024

I am ok with the new algorithm, but I'm not really sure about its impact on performance, because most of the time is spent on opening a connection to the destination in another cluster anyway.

@orozery orozery force-pushed the loadbalancer branch 2 times, most recently from 1ca7657 to 79ebe1f on May 2, 2024 at 11:09
}
// Decide makes allow/deny decisions for the queried connection between src and dest.
// The decision, as well as the deciding policy, is recorded in the returned DestinationDecision struct.
func (pdp *PDP) Decide(src, dest WorkloadAttrs, ns string) (*DestinationDecision, error) {

q: why is ns not part of either source or destination attributes?

// Decide makes allow/deny decisions for the queried connection between src and dest.
// The decision, as well as the deciding policy, is recorded in the returned DestinationDecision struct.
func (pdp *PDP) Decide(src, dest WorkloadAttrs, ns string) (*DestinationDecision, error) {
decision := DestinationDecision{Destination: dest}

nit: are we ok with the default values of the other fields?
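
For context on the nit: fields not named in a Go composite literal get their zero values, so the question is whether those zero values are safe defaults. A minimal illustration with hypothetical field names (not the actual DestinationDecision fields):

type decisionSketch struct {
	Destination string
	Allowed     bool   // zero value: false
	MatchedBy   string // zero value: ""
}

func zeroValueExample() decisionSketch {
	d := decisionSketch{Destination: "svc-a"}
	// d.Allowed is false and d.MatchedBy is "" until explicitly set.
	return d
}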

elevran commented May 8, 2024

I am ok with the new algorithm, but I'm not really sure about its impact on performance, because most of the time is spent on opening a connection to the destination in another cluster anyway.

I think performance is a secondary concern.
The main motivation for the current algorithm was to ensure we only load balance on allowed destinations. This works if you do ACL before LB and filter out the denied destinations. It is obviously more work to check all potential destinations than just the specific one selected by LB.
If you do LB first, you could end up with a destination that fails ACL and then you need to redo LB, being careful not to select ACL-denied peers so you don't loop forever.
In either case, we may want to filter out unreachable peers before checking policy but that's an optimization.
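
For comparison, a minimal sketch of the ACL-before-LB ordering described above (illustrative names only):

func selectDestACLFirst(
	sources []string,
	isReachable func(peer string) bool,
	pdpAllows func(peer string) bool,
	lbPick func(candidates []string) (string, bool),
) (string, bool) {
	allowed := make([]string, 0, len(sources))
	for _, s := range sources {
		// Filter out unreachable and ACL-denied peers before load balancing.
		if isReachable(s) && pdpAllows(s) {
			allowed = append(allowed, s)
		}
	}
	// The load balancer only ever sees allowed destinations, so no re-selection loop is needed.
	return lbPick(allowed)
}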

A side issue is whether we want to mask failures by retrying internally in the gateway or let the client handle retries the way they see fit by failing to connect.

This commit moves code from the policyengine package to the authz package.
It also includes a rewrite of the load balancer.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
@orozery orozery merged commit df2a766 into clusterlink-net:main May 13, 2024
9 checks passed
@orozery orozery deleted the loadbalancer branch May 28, 2024 10:46

Successfully merging this pull request may close these issues.

Policy-engine code cleanup