
Determine if RBAC updates should drain #311

Closed
howardjohn opened this issue Jan 4, 2023 · 42 comments
Assignees
Labels
Ambient Beta Must have for OSS Beta of Ambient Mesh

Comments

@howardjohn
Member

howardjohn commented Jan 4, 2023

Consider a long lived connection. After it is established, we create a rule that would have denied it. Should the connection be terminated, or should only new connections start being terminated?

Prior art:

  • Istio: Drains
  • Cilium: operates on many layers, so needs more research
  • Ztunnel currently: does not drain
@keithmattix
Contributor

Definitely feels like it should drain IMO

@GregHanson
Member

GregHanson commented Jun 9, 2023

This also impacts waypoints. If a long-lived connection exists and a waypoint is deployed, ztunnel will not direct traffic through the waypoint until the original connection is closed and re-established.

Steps to reproduce using the Online Boutique microservices demo:

istioctl install --set hub=istio --set tag=1.18.0 -y --set profile=ambient --set meshConfig.accessLogFile=/dev/stdout
kubectl label namespace default istio.io/dataplane-mode=ambient --overwrite
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/main/release/kubernetes-manifests.yaml

# deploy namespace wide waypoint
istioctl x waypoint apply -n default

# no access logs
kubectl logs deploy/namespace-istio-waypoint -f

kubectl delete pod -lapp=loadgenerator

# access logs generated per request as expected
kubectl logs deploy/namespace-istio-waypoint -f

@kyessenov

kyessenov commented Jun 9, 2023

I don't understand what "drain" here means. There's no general drain strategy for TCP connections. You can only abruptly terminate them. Is that what's proposed?

@GregHanson That's a good point in favor of abrupt termination since authorization depends on Waypoints for HTTP rules, so it'd be incorrect not to route to Waypoint indefinitely.

@howardjohn
Member Author

howardjohn commented Jun 9, 2023 via email

@kyessenov

kyessenov commented Jun 9, 2023

I think never terminating would be dangerous in an eventually consistent system. When workloads and policies are deployed concurrently (enabling a waypoint is a policy), the order in which things open connections should not matter. So it seems necessary to terminate eventually, and the challenge is defining when to terminate connections and limiting the scope to only those that should be terminated.

@linsun
Member

linsun commented Jun 10, 2023

Agreed, I'd expect updated RBAC to become effective even with long-lived connections.

cc @louiscryan for input

@howardjohn
Member Author

I agree with the concerns about eventual consistency. For RBAC it makes more sense, I think - we would explicitly close the connection if it were new. For a waypoint being added, though, closing the connection seems unnecessarily aggressive.

@bleggett
Contributor

bleggett commented Jun 12, 2023

I agree with the concerns about eventual consistency. For RBAC it makes more sense, I think - we would explicitly close the connection if it were new. For a waypoint being added, though, closing the connection seems unnecessarily aggressive.

Dropping a connection that would previously have been allowed but no longer is, as soon as we are aware of that, regardless of its age, is a non-negotiable obligation on our end, IMO.

Scoping that to "pure RBAC" so we don't have to do the same when L7 concerns are added with Waypoints is both a complex behavior to actually model, and creates behavioral exceptions in ztunnel around connection handling that make it harder to reason about - I think it costs more than it saves to parse the behavior differently here.

I think we just drop in both cases, because that's the simplest, safest, and most obvious behavior.

@howardjohn
Member Author

I don't agree it's simpler or safer. It's not simpler, as we need to go implement it, which is strictly added complexity. It's not safer, because it abruptly tears down users' connections.

I don't think most users will be happy when they deploy a waypoint and suddenly get a massive surge of 503s. Lin gave a talk about dynamically adding a waypoint to do a canary rollout - interestingly, it's broken in both cases:

  • If we close the connections, users will get 503s, which has a negative impact and may affect the canary analysis (it looks like the new version is broken, when in reality just the waypoint enablement broke things).
  • If we don't close the connections, clients may not respect the canary rules.

Both are pretty unfortunate

@kyessenov

I think we're over-indexing on TCP handling in ztunnel. If it were operating on packets, it would be completely natural to drop them as soon as routing + RBAC imply so. Since ztunnel has to support UDP and L3 eventually, dropping is the necessary thing to do. The fact that errors are opaque is a general problem with non-application-level protocols.

@howardjohn
Member Author

howardjohn commented Jun 12, 2023 via email

@bleggett
Contributor

worth surveying the landscape

Fair enough.

In general, delaying the application of authZ rules once you have them is a security hole.

It's unclear to me what is gained, ultimately, by choosing to carry that security hole.

@hzxuzhonghu
Member

I am not sure how we can be aware of the RBAC failure after an HBONE request was sent; IIUC the authz is only done once per TCP connection.

@istio-policy-bot istio-policy-bot added the lifecycle/automatically-closed Indicates a PR or issue that has been closed automatically. label Dec 14, 2023
@keithmattix keithmattix reopened this Dec 14, 2023
@linsun linsun added Ambient Beta Must have for OSS Beta of Ambient Mesh and removed lifecycle/automatically-closed Indicates a PR or issue that has been closed automatically. labels Dec 14, 2023
@ilrudie
Contributor

ilrudie commented Dec 14, 2023

It seems there are (potentially) two different scenarios here:

  1. A change to traffic flow is requested, i.e. they added a waypoint. In this scenario, traffic which is bypassing the waypoint may be something L4 is obliged to stop.
  2. A change to existing L4 policy. In this case we should probably try to evaluate whether or not existing connections violate the new policy. If they do, we're likely obliged to halt them; if they are still allowed, we don't need to do anything.

I think we're also able to say that item 1 isn't actually a real thing. The existence of a waypoint doesn't imply in any way that traffic strictly must use it. Users should express this via the auth API, and thus item 2 is really the only problem we're concerned with. It could potentially be a property exposed via the SuperService whether or not we manage these policies for the user.

@howardjohn
Member Author

A change to existing L4 policy. In this case we should probably try to evaluate whether or not existing connections violate the new policy. If they do, we're likely obliged to halt them; if they are still allowed, we don't need to do anything.

I definitely agree that if we change something, we should not do anything if the new policies are allowed (only if they deny, then we close). So we would not want a traditional Envoy "drain".

We still need to decide if we want to apply them to existing connections. I think we should.

@linsun
Member

linsun commented Dec 15, 2023

@ilrudie ok to assign this to you? Can we identify an action item for this in terms of beta for L4 first? It seems everyone agrees that we should do something when RBAC updates, even for long-lived connections.

@ilrudie
Contributor

ilrudie commented Dec 15, 2023

@linsun, Sure! Happy to work on this.

@linsun linsun moved this to In Progress in Ambient tracking Dec 15, 2023
@ilrudie
Contributor

ilrudie commented Jan 9, 2024

Added a topic to the ambient WG to try to finalize consensus here before diving into implementation.

@hzxuzhonghu
Member

My opinion is that for L7 we can deny future requests, but for L4 there is no need - and it is also hard, because the authz is applied per connection, while it is per request for L7.

@ilrudie
Contributor

ilrudie commented Jan 10, 2024

@hzxuzhonghu To be clear: you're saying you do want to close/drain existing connections, yeah?

@hzxuzhonghu
Member

Not exactly. As to L4, the RBAC is enforced once, after the connection is established; we do not have a chance to close the connection afterwards.

For L7, as the connection is shared and the authz is per request, there is no reason to close the connection.

@ilrudie
Contributor

ilrudie commented Jan 11, 2024

It should be possible to do, though. I think we can pass an rx channel through to our proxy functions, and then instead of awaiting proxy_to we can select on proxy_to and query the channel.
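A rough std-only sketch of that select idea (all names here are hypothetical - proxy_with_close stands in for the real async proxy path, and chunk sizes stand in for byte copies; ztunnel itself would use tokio::select! over the copy future and a close-notification future rather than polling):

```rust
use std::sync::mpsc;

// Hypothetical synchronous stand-in: the copy loop polls a close channel
// between chunks; a policy update can send on that channel to tear the
// connection down mid-transfer. Returns (bytes copied, closed early?).
fn proxy_with_close(chunk_sizes: &[usize], close_rx: &mpsc::Receiver<()>) -> (usize, bool) {
    let mut copied = 0;
    for &n in chunk_sizes {
        // In the async version this check is a select! branch racing the copy.
        if close_rx.try_recv().is_ok() {
            return (copied, true); // closed early by a policy update
        }
        copied += n; // stand-in for copying n bytes between sockets
    }
    (copied, false) // transfer completed normally
}

fn main() {
    // No close signal pending: the whole transfer completes.
    let (_tx, rx) = mpsc::channel::<()>();
    assert_eq!(proxy_with_close(&[3, 4], &rx), (7, false));

    // Close signal already pending: the connection is dropped at once.
    let (tx, rx) = mpsc::channel::<()>();
    tx.send(()).unwrap(); // simulate a policy update arriving
    assert_eq!(proxy_with_close(&[3, 4], &rx), (0, true));
    println!("ok");
}
```

The key property is that a pending close signal interrupts a transfer already in flight, rather than only gating connection establishment.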

@ilrudie
Contributor

ilrudie commented Jan 11, 2024

High-level sketch of what I'm looking at right now: having a proxy instance be able to handle its own connections. We'll either poll for auth updates or have a channel where something like a policy update can notify the routine that will close these connections (undecided). Each connection will have a channel it can check which will cause it to close, and those close channels will likely be keyed by rbac::Connection in a shared hashmap, so the proxy has a relatively easy time determining which connections must be closed. It's pretty WIP, but I hope to have some code to peek at soon.
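A minimal sketch of that bookkeeping (hypothetical names throughout - a (source, destination) string tuple stands in for rbac::Connection, and std mpsc channels stand in for whatever async notification primitive ztunnel would actually use):

```rust
use std::collections::HashMap;
use std::sync::mpsc;

// Hypothetical stand-in for rbac::Connection: (source identity, destination).
type ConnKey = (String, String);

// Shared map from connection key to that connection's close channel.
struct ConnectionTracker {
    close_senders: HashMap<ConnKey, mpsc::Sender<()>>,
}

impl ConnectionTracker {
    fn new() -> Self {
        Self { close_senders: HashMap::new() }
    }

    // Called at connection establishment; the proxy task keeps the receiver.
    fn register(&mut self, key: ConnKey) -> mpsc::Receiver<()> {
        let (tx, rx) = mpsc::channel();
        self.close_senders.insert(key, tx);
        rx
    }

    // `still_allowed` stands in for re-running assert_rbac under the updated
    // policy set. Only newly denied connections are signalled and dropped
    // from the map; still-allowed connections are left untouched.
    fn apply_policy_update(&mut self, still_allowed: impl Fn(&ConnKey) -> bool) -> usize {
        let mut closed = 0;
        self.close_senders.retain(|key, tx| {
            if still_allowed(key) {
                true
            } else {
                let _ = tx.send(()); // tell the proxy task to close
                closed += 1;
                false
            }
        });
        closed
    }
}

fn main() {
    let mut tracker = ConnectionTracker::new();
    let rx_a = tracker.register(("spiffe://ns/a".into(), "10.0.0.1:80".into()));
    let rx_b = tracker.register(("spiffe://ns/b".into(), "10.0.0.1:80".into()));

    // New policy denies source "spiffe://ns/b" only.
    let closed = tracker.apply_policy_update(|(src, _)| src != "spiffe://ns/b");
    assert_eq!(closed, 1);
    assert!(rx_a.try_recv().is_err()); // still allowed: no close signal
    assert!(rx_b.try_recv().is_ok()); // newly denied: told to close
    println!("closed {closed} connection(s)");
}
```

Signalling only the connections the new policy denies matches the behavior discussed above: no traditional drain, just targeted closes.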

@hzxuzhonghu
Member

But currently we do assert_rbac only once per connection.

@ilrudie
Contributor

ilrudie commented Jan 12, 2024

You're correct. At present we only assert_rbac once. The proposal is to have a way for the zt to assert_rbac on existing connections and then close/drain them via a channel. It's net new code/functionality for zt.

@ilrudie
Contributor

ilrudie commented Jan 16, 2024

Sorry, it'll be a bit tougher to read since it's built on top of ztunnel code which hasn't merged yet, but #772 has a partial implementation of what I was thinking about in terms of being able to assert policy against connections after they were established.

@mikemorris
Member

mikemorris commented Jan 17, 2024

I've seen the behavior of not re-authorizing long-lived connections when AuthZ policy changes confuse users, who wonder why their rules don't seem to be effective. This is something we had wanted to implement in Consul, but it seemed prohibitively difficult for our architecture without Envoy changes. The implementation @ilrudie is describing sounds substantially aligned with what I think the optimal user experience should be - excited to see progress on this!

@kyessenov

kyessenov commented Jan 17, 2024

I think you need to support draining at the tunnel protocol level to make enforcement tolerable in terms of UX. What I mean is that a connection that is destined to be terminated should signal any application-aware peers so they can terminate it gracefully. Ztunnel cannot do this for HTTP, but if a waypoint is in use, an HTTP connection to ztunnel should not be abruptly closed by ztunnel without asking the waypoint to send a GOAWAY frame. Otherwise, this is a classical 503 scenario and not acceptable in production.

@ilrudie
Contributor

ilrudie commented Jan 17, 2024

It seems to me that layer targeted policy would be helpful here. If you target L4 then it's explicitly not application enforced and DROP seems like suitable behavior. If you want some sort of gentle application-aware handling then target L7 layer with your policy.

@bleggett
Contributor

bleggett commented Jan 18, 2024

It seems to me that layer targeted policy would be helpful here. If you target L4 then it's explicitly not application enforced and DROP seems like suitable behavior. If you want some sort of gentle application-aware handling then target L7 layer with your policy.

+1, and we should already get that in most cases IIUC - if a policy is applied that should terminate an L7 connection, the waypoint should terminate it, and ZT is none the wiser.

If you nuke connections via policy at the L4 layer, regardless of whether the connection might be carrying a higher-level protocol, there's little we can do but drop it.

I guess we could set up a backchannel to the waypoint to inform it and give it time to do something, but that adds complexity and I don't see it as a beta blocker.

@ilrudie
Contributor

ilrudie commented Jan 18, 2024

I guess we could set up a backchannel to the waypoint to inform it and give it time to do something, but that adds complexity and I don't see it as a beta blocker.

Yeah, it seems plausible to build, but it would likely wind up as a special behavior of the reference implementation, requiring Istio's ztunnel and waypoint. I'd agree re: beta, and add that having a good methodology for achieving this type of behavior without tightly coupled components is probably a strong start.

@linsun
Member

linsun commented Jan 18, 2024

It seems to me that layer targeted policy would be helpful here. If you target L4 then it's explicitly not application enforced and DROP seems like suitable behavior. If you want some sort of gentle application-aware handling then target L7 layer with your policy.

+1 on this as well. I've observed DROP for L4 policy for CNI behaviors as well.

@kyessenov

Yeah, I think it's pretty clear that you can't drain L4 RBAC. So the answer to the issue title is "no", and the choice is how often to enforce policies to drop (at start or periodically).

@ilrudie
Contributor

ilrudie commented Jan 18, 2024

Yeah, I think it's pretty clear that you can't drain L4 RBAC. So the answer to the issue title is "no", and the choice is how often to enforce policies to drop (at start or periodically).

Indeed. "drain" and "rbac" here are probably not that ideal... #naming-hard

@ilrudie
Contributor

ilrudie commented Jan 18, 2024

If anyone was holding off on taking a look because the diff was huge: I've rebased against master now that Yuval's inpod work has merged, and it's much more manageable.

@zengyuxing007

Istio's L4 AuthorizationPolicy functionality essentially overlaps with Kubernetes NetworkPolicy.
NetworkPolicy usually does not affect existing connections; it only takes effect for new connections.

@zengyuxing007

istio/istio#48876 is a similar issue, seen at the beginning of an Ambient deployment.

@bleggett
Contributor

bleggett commented Feb 5, 2024

NetworkPolicy usually does not process existing connections, but only takes effect for new connections.

it is implementation defined as to whether the change will take effect for that existing connection or not. Example: A policy is created that leads to denying a previously allowed connection, the underlying network plugin implementation is responsible for defining if that new policy will close the existing connections or not.

Since it's implementation-defined and not part of NetPol to dictate one or the other, we can do what makes sense for us.

NetPol and kubernetes do not make the same security or policy guarantees or have the same scoping as Istio AuthPolicy.

NetworkPolicy is fine for basics. Istio's L4 AuthPolicy is an optional extension that composes on top of k8s NetworkPolicy, providing both more policy controls and more security guarantees.

It is entirely fine for Istio's AuthPolicy impl to be more strict than either the K8S NetworkPolicy standard, or a specific implementation of the K8S NetworkPolicy standard.

@alpha-baby

The decision to forcibly terminate existing connections has both advantages and disadvantages. It ultimately depends on which aspects users value more, and giving them the choice to decide would likely be a better approach.

Perhaps an option to enforce connection termination could be added to the L4 RBAC policy.

Example in my PR: istio/istio#49106

@ilrudie
Contributor

ilrudie commented Feb 22, 2024

It seems fundamentally different to me. In your PR, the intent is to allow not severing existing connections when adding a workload to ambient. One could say the intent is clear there, and there are plenty of controls to manage it, BUT adding to ambient isn't an explicitly "destructive" operation in the way denying connectivity is. In the case of an auth policy change, it would be allowing a connection the user explicitly denied to continue, which seems odd since they have taken an action specifically to deny this connectivity. User intent seems way less ambiguous when they've said "deny this connection".

Louis's comments about disconnecting cause from effect also hold a lot of sway here, IMO. If the user made a mistake, it's better they find out when they make it vs. at some indeterminate point in the future. We do them no favors by letting them accidentally create a ticking time bomb.

@linsun
Member

linsun commented Apr 4, 2024

Hi @ilrudie - should we close this? I recall you implemented this already?

@ilrudie
Contributor

ilrudie commented Apr 4, 2024

@linsun - yeah, closed by #772

@ilrudie ilrudie closed this as completed Apr 4, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Ambient tracking Apr 4, 2024