
Automatic reachableServices #6551

Closed
lahabana opened this issue Apr 17, 2023 · 28 comments

@lahabana
Contributor

lahabana commented Apr 17, 2023

Description

Reachable services is an OK optimization, but it's very manual and annoying.

We rely on TrafficRoutes to populate all the clusters, so if you don't have a TrafficRoute there's nothing you can do (unfortunately, this makes it impossible to use only MeshHTTPRoute and MeshTCPRoute).

Could we use the existing policies to only add clusters for the services we want to talk to?

Open questions

  1. What do we want to happen when we don't have any MeshHTTPRoute nor MeshTCPRoute?
    To me, we should be able to route everywhere.
  2. Should we use MeshTrafficPermissions to figure out who receives the configuration for which service?
    This makes sense, as you would expect not to have any information about a service you don't have access to.
  3. What's the migration strategy?
  3. What's the migration strategy?

Implementation idea

Use MeshTrafficPermissions to build the reverse graph of which services can reach which services.
Only add Envoy clusters for the services you have permission to reach.

This has the following benefits:

  1. That's what customers expect this feature to look like
  2. You get added security: proxies that don't have access to a service don't even know it exists
  3. Deployments that go zero trust get the best performance out of the box
  4. You can always allow everything by having a targetRef: Mesh -- from: targetRef: Mesh -- Allow policy (which would likely be a good default; see the sketch after this list)
  5. No need to juggle two different policies, and no need to define a route that does all-to-all
  6. MeshTrafficPermission would do something even when mTLS is not enabled
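
As a sketch, such a default allow-all policy could look like this (the name is a placeholder, and this assumes the Kubernetes form of MeshTrafficPermission):

apiVersion: kuma.io/v1alpha1
kind: MeshTrafficPermission
metadata:
  name: allow-all        # placeholder name
  namespace: kuma-system
spec:
  targetRef:
    kind: Mesh           # configure every proxy in the mesh
  from:
    - targetRef:
        kind: Mesh       # allow traffic from any client in the mesh
      default:
        action: Allow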

Drawbacks:

  1. Using MeshTrafficPermission kind of breaks our rule “top-level targetRef selects proxy to configure”: changing Allow to ShadowDeny for service A affects the xDS configuration of all other services in the mesh.

Note: if we do this, you'd want ShadowDeny to be considered the same as Allow when it comes to advertising clusters to consumers of a service, because you want consumers to still have the opportunity to reach the service while the deny is only evaluated in shadow mode.

@lahabana lahabana added triage/pending This issue will be looked at on the next triage meeting kind/feature New feature kind/design Design doc or related labels Apr 17, 2023
@lahabana
Contributor Author

Triage: this can also be used for auditing who can talk to whom.

@lahabana lahabana added triage/accepted The issue was reviewed and is complete enough to start working on it and removed triage/pending This issue will be looked at on the next triage meeting labels Apr 17, 2023
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Jul 17, 2023
@lahabana lahabana removed the triage/stale Inactive for some time. It will be triaged again label Jul 17, 2023
@lahabana lahabana changed the title Replace reachableServices by Mesh*Route access Automatic reachableServices Sep 27, 2023
@kumahq kumahq deleted a comment from github-actions bot Sep 27, 2023
@lukidzi
Contributor

lukidzi commented Sep 28, 2023

I think just relying on MeshTrafficPermission might not be sufficient. We could combine MeshTrafficPermission and Mesh*Route: sometimes you might define allow-all but only define a subset of routes. Combining the two would cover more cases.

Also, a second idea is to use communication stats. It's less reliable, but we could do something more like suggestions:

  1. Scrape cluster metrics over some period of time
  2. Build a map of which services each service requested and persist it in the store
  3. After a defined time, present to the user: based on the observed traffic, your service communicates with service-a, service-b, and so on

@jakubdyszkiewicz
Contributor

You get added security: proxies that don't have access to a service don't even know it exists

Can you elaborate on this point?

@michaelbeaumont
Contributor

Sometimes you might define allow-all but only define a subset of routes. Combining the two would cover more cases.

I think this could simply be a new feature, likely a filter, added to MeshHTTPRoute, where you have a final "match all" rule and then configure a direct 401/403 response.
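
Purely to illustrate the idea, a hypothetical sketch (the DirectResponse filter shown here is an assumption, not an existing MeshHTTPRoute filter type; names are placeholders):

apiVersion: kuma.io/v1alpha1
kind: MeshHTTPRoute
metadata:
  name: catch-all-forbidden       # placeholder name
  namespace: kuma-system
spec:
  targetRef:
    kind: MeshService
    name: client
  to:
    - targetRef:
        kind: MeshService
        name: backend
      rules:
        - matches:
            - path:
                type: PathPrefix
                value: /              # final "match all" rule
          default:
            filters:
              - type: DirectResponse  # hypothetical filter type
                directResponse:
                  statusCode: 403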

@jakubdyszkiewicz
Contributor

Let's say I'm a mesh operator and have a Kuma deployment with 100 services and no MeshTrafficPermission nor reachable services. I start to see performance struggles because of the lack of reachable services.

Current option: reachable services.
I send an e-mail to service owners: "hey, please use reachable services. You just set an annotation with the services you connect to, which you most likely already have in your config/values file". Alternatively, someone from the platform team can go service by service and make PRs.

Proposed solution: graph from MeshTrafficPermission
I send an e-mail to service owners "hey, please set MeshTrafficPermission to restrict which services can connect to your service". Now, how would service owners do this? Do we provide any tools to figure this out? It's a much harder and more stressful change.

I'm not saying the proposed solution is bad, I like it. We just need to figure out how we can make it scalable inside the organization.

@lahabana
Contributor Author

1

Let's say I'm a mesh operator and have a Kuma deployment with 100 services and no MeshTrafficPermission nor reachable services. I start to see performance struggles because of the lack of reachable services.

Note: you wouldn't have no MeshTrafficPermission, as that would allow no one (zero trust being the default). What you'd have is one allow-all permission.

I send an e-mail to service owners "hey, please set MeshTrafficPermission to restrict which services can connect to your service". Now, how would service owners do this? Do we provide any tools to figure this out? It's a much harder and more stressful change.

I think it makes sense, though here are a couple of remarks:

Do we provide any tools to figure this out?

ShadowDeny is partly made for this, right?

Now, how would service owners do this?

Some service owners would go: "well, I can't do this because I run a public service; I'm not sure who consumes it".
Other service owners would go: "oh right, this is an internal service; I know who communicates with it".
Think about Konnect and our services: it'd be super easy to add MTP, right?

I think there are a few things to keep in mind:

a. We don't need a perfect subset; we just need to prune things down to make the config slimmer
b. Large organizations may start with zero trust and slowly add extra accesses, so they'd get this for free
c. We can still keep reachableServices if a service owner would like to trim things further.

2

You get added security: proxies that don't have access to a service don't even know it exists

Can you elaborate on this point?

At the moment, if you have a Deny MTP, the Envoy configuration will still have a cluster, so someone could learn which services exist just by looking at their Envoy dump. This might not be great security-wise (you shouldn't be able to know about things you can't talk to).

3

Sometimes you might define allow-all but only define a subset of routes. Combining the two would cover more cases.

I think this could simply be a new feature, likely a filter, added to MeshHTTPRoute, where you have a final "match all" rule and then configure a direct 401/403 response.

Yeah, I agree this is a different feature altogether. Here I'd quite like to stick with each cluster being a distinct kuma.io/service value; any subset of endpoints smaller than this should be managed differently (probably the discussion in #6331).

4

I think just relying on MeshTrafficPermission might not be sufficient. We could combine MeshTrafficPermission and Mesh*Route: sometimes you might define allow-all but only define a subset of routes. Combining the two would cover more cases.

Also, a second idea is to use communication stats. It's less reliable, but we could do something more like suggestions:

  1. Scrape cluster metrics over some period of time
  2. Build a map of which services each service requested and persist it in the store
  3. After a defined time, present to the user: based on the observed traffic, your service communicates with service-a, service-b, and so on

This second idea is something we've talked about, but I also think it's a whole different feature from this one.

@jakubdyszkiewicz
Contributor

Note: you wouldn't have no MeshTrafficPermission, as that would allow no one (zero trust being the default). What you'd have is one allow-all permission.

Yes, ok. That does not change my point.

ShadowDeny is partly made for this, right?

I don't think it is. Correct me if I'm wrong, but ShadowDeny can only show you that there is still traffic being denied (a metric); it does not show exactly what this traffic is.

Service Map is potentially such a tool, but

  1. I'd avoid relying on it, because it's not straightforward to deploy IMHO.
  2. Not sure how fine the "time resolution" is. Let's say there is a job that runs every week; will it be included?
  3. I'd like to have a tool that comes out of the box with Kuma.

Ideally, the GUI would have a tab that can generate MeshTrafficPermission based on the traffic history.
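
For instance, such a tab could emit something like this for a service that only saw traffic from frontend over the observed period (names are illustrative):

apiVersion: kuma.io/v1alpha1
kind: MeshTrafficPermission
metadata:
  name: backend-observed-clients   # illustrative name
  namespace: kuma-system
spec:
  targetRef:
    kind: MeshService
    name: backend
  from:
    - targetRef:
        kind: MeshService
        name: frontend             # the only client seen in the traffic history
      default:
        action: Allow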

this is an internal service; I know who communicates with it

IMHO this is very rarely the case unless there is strict security already in place or there is tooling to check it.

We can still keep reachableServices if a service owner would like to trim things further.

👍

@lobkovilya
Contributor

lobkovilya commented Sep 28, 2023

I think just relying on MeshTrafficPermission might not be sufficient. We could combine MeshTrafficPermission and Mesh*Route: sometimes you might define allow-all but only define a subset of routes. Combining the two would cover more cases.

IMO we can rely solely on Mesh*Route and avoid making decisions based on MeshTrafficPermission at all.

Deciding outbound clusters based on MeshTrafficPermission has a few downsides:

  1. wrong errors on the client side: instead of a 403, the client will get "can't resolve backend.mesh hostname" or something like it
  2. wrong RBAC stats on the server side: you never know if some application tried to consume your service
  3. it requires mTLS; otherwise it makes no sense to ask users to create MeshTrafficPermission

@michaelbeaumont
Contributor

IMO we can rely solely on Mesh*Route and avoid making decisions based on MeshTrafficPermission at all.

Can you clarify what you're proposing with "rely solely on Mesh*Route"? That is, how do we "rely solely on Mesh*Route" "to figure out who receives the configuration for which service"?

@lobkovilya
Contributor

We configure the outbound clusters only if a MeshTCPRoute (or MeshHTTPRoute) matches your service. If there are no MeshTCPRoutes or MeshHTTPRoutes matching your service, your service has 0 outbound clusters and can't consume other services. Basically, the Mesh*Route policy is the single source of information on which outbounds your service requires.

So if you create:

apiVersion: kuma.io/v1alpha1
kind: MeshTCPRoute
metadata:
  name: route-2
  namespace: kuma-system
spec:
  targetRef:
    kind: MeshService
    name: client
  to:
    - targetRef:
        kind: MeshService
        name: backend
    - targetRef:
        kind: MeshService
        name: web

then the client's DPP will receive backend and web clusters.

Note: we have to adjust the Mesh*Route policy to not require explicitly listing services when the client consumes all services in the mesh, i.e.:

apiVersion: kuma.io/v1alpha1
kind: MeshTCPRoute
metadata:
  name: route-2
  namespace: kuma-system
spec:
  targetRef:
    kind: MeshService
    name: client
  to:
    - targetRef:
        kind: Mesh

@michaelbeaumont
Contributor

Deciding outbound clusters based on MeshTrafficPermission has a few downsides:

Aren't these all things that can be solved by just generating clusters differently? The principal question IMO is more what are the semantics of MeshHTTPRoute and MeshTrafficPermission in Kuma?

  1. wrong errors on the client side: instead of a 403, the client will get "can't resolve backend.mesh hostname" or something like it

IMO neither is better. It's about the particular semantics of "deny traffic". Why should service A, which service B shouldn't ever call or even know about, return a 403 error? Why not just cut traffic off completely?
If you want a 403 error for some set of requests, just create an HTTPRoute that returns 403.

  2. wrong RBAC stats on the server side: you never know if some application tried to consume your service

Not sure about the capabilities of RBAC filters, but again, why can't we just create the RBAC filters based on more than one type of resource?

  3. it requires mTLS; otherwise it makes no sense to ask users to create MeshTrafficPermission

I don't get why this makes no sense. If I want to deny traffic, I create a MeshTrafficPermission to block it. If the only way to guarantee this, because mTLS is off, is to generate Envoy config without clusters, then we just do that.


The trend for meshes, AFAICT, and IMO the more intuitive semantics of an "HTTP route", is that routes modify HTTP requests/responses. The idea of needing a "do nothing" route is IMO counterintuitive. Routes do not directly correspond to Envoy resources like Routes or Clusters, and IMO the average user won't expect them to. We can just use both MeshTrafficPermission and Mesh*Routes to generate Envoy resources.

Needing a default resource is fraught and clunky IMO, and should be avoided.
This plus the default of a k8s cluster being "every service reaches every other" is a very good reason to:
a) not by default block traffic just from installing a mesh, obviously
b) not get around this behavior by creating a default "allow all" resource

If we want to optimize performance and reduce the number of clusters we create, IMO this optimization maps much more cleanly to the idea of "permission to contact a service", not "what happens to my HTTP request and response as it moves through my mesh".

But I think I've said all that before so I'll leave this comment as my opinion on default deny or default allow with route resources.

@lobkovilya
Contributor

I don't get why this makes no sense. If I want to deny traffic, I create a MeshTrafficPermission to block it. If the only way to guarantee this, because mTLS is off, is to generate Envoy config without clusters, then we just do that.

It's a false sense of security: clients can still consume the service using the ClusterIP (probably even using Kubernetes DNS) unless you disable the passthrough cluster mesh-wide.
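
For reference, disabling the passthrough cluster mesh-wide is a setting on the Mesh resource; a minimal sketch:

apiVersion: kuma.io/v1alpha1
kind: Mesh
metadata:
  name: default
spec:
  networking:
    outbound:
      passthrough: false   # traffic to destinations outside the mesh is no longer passed through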

@lobkovilya
Contributor

Aren't these all things that can be solved by just generating clusters differently?

I was thinking we don't want to generate clusters on the client side at all (to reduce the size of the envoy config). Do you have something else in mind?

@lahabana
Contributor Author

Deciding outbound clusters based on MeshTrafficPermission has a few downsides:

Aren't these all things that can be solved by just generating clusters differently? The principal question IMO is more what are the semantics of MeshHTTPRoute and MeshTrafficPermission in Kuma?

  1. wrong errors on the client side: instead of a 403, the client will get "can't resolve backend.mesh hostname" or something like it

100%. This is exactly like returning a 404 when you don't have access to a repo: you shouldn't be able to tell that something you don't have access to exists.

IMO neither is better. It's about the particular semantics of "deny traffic". Why should service A, which service B shouldn't ever call or even know about, return a 403 error? Why not just cut traffic off completely? If you want a 403 error for some set of requests, just create an HTTPRoute that returns 403.

  2. wrong RBAC stats on the server side: you never know if some application tried to consume your service

Not sure about the capabilities of RBAC filters, but again, why can't we just create the RBAC filters based on more than one type of resource?

I don't think that's dramatic. It's the consumer's problem that they are trying to call you.

  3. it requires mTLS; otherwise it makes no sense to ask users to create MeshTrafficPermission

I don't get why this makes no sense. If I want to deny traffic, I create a MeshTrafficPermission to block it. If the only way to guarantee this, because mTLS is off, is to generate Envoy config without clusters, then we just do that.

Yes, I see this as a feature: you don't need mTLS anymore to use MTP, with the caveat that it's less secure than when you are using mTLS.

The trend for meshes, AFAICT, and IMO the more intuitive semantics of an "HTTP route", is that routes modify HTTP requests/responses. The idea of needing a "do nothing" route is IMO counterintuitive. Routes do not directly correspond to Envoy resources like Routes or Clusters, and IMO the average user won't expect them to. We can just use both MeshTrafficPermission and Mesh*Routes to generate Envoy resources.

Needing a default resource is fraught and clunky IMO, and should be avoided. This plus the default of a k8s cluster being "every service reaches every other" is a very good reason to: a) not by default block traffic just from installing a mesh, obviously b) not get around this behavior by creating a default "allow all" resource

If we want to optimize performance and reduce the number of clusters we create, IMO this optimization maps much more cleanly to the idea of "permission to contact a service", not "what happens to my HTTP request and response as it moves through my mesh".

But I think I've said all that before so I'll leave this comment as my opinion on default deny or default allow with route resources.

Yes. Since we're keeping routes as close as possible to Gateway API, and there's no such thing as a route without a destination in Gateway API, I'd stay away from this.

@lahabana
Contributor Author

I don't get why this makes no sense. If I want to deny traffic, I create a MeshTrafficPermission to block it. If the only way to guarantee this, because mTLS is off, is to generate Envoy config without clusters, then we just do that.

It's a false sense of security: clients can still consume the service using the ClusterIP (probably even using Kubernetes DNS) unless you disable the passthrough cluster mesh-wide.

Anything using plaintext is a false sense of security :)

@lobkovilya
Contributor

100%. This is exactly like returning a 404 when you don't have access to a repo: you shouldn't be able to tell that something you don't have access to exists.

I think this applies to the "edge" use case, where clients are external. In our case it's internal infrastructure; there is no need for one Envoy proxy to hide information from another Envoy proxy. Also, think about troubleshooting: if users misconfigure MeshTrafficPermission and there is no traffic between A and B, the only thing they see in traces/metrics is 404 errors.

Anything using plaintext is a false sense of security :)

True, but if you have clients that use k8s hostnames like foo.my-namespace.svc.cluster.local (instead of .mesh), then when you create a MeshTrafficPermission with Deny for such a client, not only does it keep consuming your service, it starts bypassing its Envoy proxy. That's definitely not what users expect, and it makes the whole MeshTrafficPermission feature more complex to use.

@lukidzi
Contributor

lukidzi commented Sep 29, 2023

What about ExternalServices? I think you cannot define MeshTrafficPermission for them because they are outside of the cluster, but you still might not need to track them. In this case, using Mesh*Route might make more sense than MTP.

@lahabana
Contributor Author

lahabana commented Oct 1, 2023

I think this applies to the "edge" use case, where clients are external. In our case it's internal infrastructure; there is no need for one Envoy proxy to hide information from another Envoy proxy. Also, think about troubleshooting: if users misconfigure MeshTrafficPermission and there is no traffic between A and B, the only thing they see in traces/metrics is 404 errors.

I'm not convinced by this argument. In large enough organisations, even "internal infrastructure" has boundaries of knowledge. For the second part, I think this is easily fixable by looking at the inspect APIs and educating people. Anyway, that's probably a longer discussion, but it's not uncommon that things show as non-existent when you don't have access to them, and for good reason.

True, but if you have clients that use k8s hostnames like foo.my-namespace.svc.cluster.local (instead of .mesh), then when you create a MeshTrafficPermission with Deny for such a client, not only does it keep consuming your service, it starts bypassing its Envoy proxy. That's definitely not what users expect, and it makes the whole MeshTrafficPermission feature more complex to use.

Hmm, isn't this already the behaviour if the destination service is not in the mesh? And wouldn't it be quite easy for the destination service, in HTTP, to check that the source traffic comes from Envoy to minimize this issue (I think we even mentioned we could make MTP work on non-mTLS traffic by leveraging headers)?
Feels like there are workarounds for this issue (albeit not strongly secure ones, but it's plaintext anyway, so we're already not secure).

What about ExternalServices? I think you cannot define MeshTrafficPermission for them because they are outside of the cluster, but you still might not need to track them. In this case, using Mesh*Route might make more sense than MTP.

AFAIK MTP works for ExternalServices; if that's not the case, that's a bug.

@lahabana
Contributor Author

lahabana commented Oct 3, 2023

@jakubdyszkiewicz suggests having a "to" section in MeshTrafficPermission to specify the list of services you can communicate with.

@johnharris85
Contributor

@jakubdyszkiewicz suggests having a "to" section in MeshTrafficPermission to specify the list of services you can communicate with.

How is this different to reachable services?

@lahabana
Contributor Author

lahabana commented Oct 5, 2023

@jakubdyszkiewicz suggests having a "to" section in MeshTrafficPermission to specify the list of services you can communicate with.

How is this different to reachable services?

By using MeshSubset as a targetRef, you can more easily attach a list of services to a whole group of proxies (see the sketch below). But I agree it's very similar to the existing reachableServices.
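
To make the suggestion concrete, a purely hypothetical sketch (MeshTrafficPermission has no "to" section today; the name and tag are placeholders):

apiVersion: kuma.io/v1alpha1
kind: MeshTrafficPermission
metadata:
  name: payments-outbounds        # placeholder name
  namespace: kuma-system
spec:
  targetRef:
    kind: MeshSubset
    tags:
      team: payments              # hypothetical tag selecting a group of proxies
  to:                             # hypothetical section listing the outbounds these proxies may reach
    - targetRef:
        kind: MeshService
        name: backend
    - targetRef:
        kind: MeshService
        name: web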

@slonka
Contributor

slonka commented Oct 13, 2023

b) not get around this behavior by creating a default "allow all" resource

but we do need that route on the Envoy side for the traffic to flow. So you're for an implicit allow-all that can't be deleted by the user, right?

True, but if you have clients that use k8s hostnames like foo.my-namespace.svc.cluster.local (instead of .mesh), then when you create a MeshTrafficPermission with Deny for such a client, not only does it keep consuming your service, it starts bypassing its Envoy proxy. That's definitely not what users expect, and it makes the whole MeshTrafficPermission feature more complex to use.

re: "it starts bypassing its Envoy proxy" - how is it bypassing it? It's bypassing on egress, right? If we're allowing MTP without mTLS, we could check x-kuma-tags and block it on the ingress. I know they can be forged, but so can everything without TLS.

the only thing they see in traces/metrics is 404 errors

we will know if this is a real 404 or a 404 caused by a missing cluster. I think we're fine with that as long as we surface the actual reason to the user.

@slonka
Contributor

slonka commented Oct 13, 2023

IMO we can rely solely on Mesh*Route and avoid making decisions based on MeshTrafficPermission at all.

I think this is double the work (it's exactly what I did at my previous company). I assume here that you need a default to be able to route everywhere (so that when onboarding the mesh you don't need to start with everything configured from the get-go); then you define which services you want to talk to, and then you define who can talk to you. Basing this on MeshTrafficPermission only means we implicitly define the outbounds, and we cut out one step. It's usually pretty easy to know which services your service calls, so this might be the right approach. But tying this to MeshHTTPRoute changes the meaning of the policy a bit.

@slonka
Contributor

slonka commented Oct 16, 2023

We just need to figure out how we can make it scalable inside the organization.

How do large orgs migrate to Kuma? Can we enable only ingress to gather stats, and then, based on those stats, enable egress with reachable services?

@slonka slonka self-assigned this Oct 16, 2023
@slonka
Contributor

slonka commented Oct 16, 2023

I'm gonna start working on a MADR for this.

@michaelbeaumont
Contributor

michaelbeaumont commented Oct 16, 2023

b) not get around this behavior by creating a default "allow all" resource

but we do need that route on the Envoy side for the traffic to flow. So you're for an implicit allow-all that can't be deleted by the user, right?

I'm for: when I install my mesh, no resources are created by default and traffic continues to flow as it did before. I suppose you can call this an implicit allow-all, but "it can't be deleted by the user" is kind of nonsensical because there's nothing to delete in the first place. It's the same idea as NetworkPolicy.
There exists some resource with which I can restrict traffic. I can use this resource to explicitly create a "deny all" policy, which I then selectively loosen.
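
For comparison, a minimal sketch of such an explicit deny-all policy, which you would then selectively loosen with more specific Allow policies:

apiVersion: kuma.io/v1alpha1
kind: MeshTrafficPermission
metadata:
  name: deny-all         # placeholder name
  namespace: kuma-system
spec:
  targetRef:
    kind: Mesh           # configure every proxy in the mesh
  from:
    - targetRef:
        kind: Mesh       # deny traffic from any client in the mesh
      default:
        action: Deny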

@slonka
Contributor

slonka commented Oct 17, 2023

Next points from the meeting with Charly:

  • look into implementing reverse graph from MTP
  • how to migrate (graph it out / flush out concerns)

@slonka
Contributor

slonka commented Nov 1, 2023

closed via #8084 and #8125
