
Determine if RBAC updates should drain #311

Closed
howardjohn opened this issue Jan 4, 2023 · 42 comments
Assignees
Labels
Ambient Beta Must have for OSS Beta of Ambient Mesh

Comments

@howardjohn
Member

howardjohn commented Jan 4, 2023

Consider a long lived connection. After it is established, we create a rule that would have denied it. Should the connection be terminated, or should only new connections start being terminated?

Prior art:

  • Istio: Drains
  • Cilium: operates on many layers, so needs more research
  • Ztunnel currently: does not drain
@keithmattix
Contributor

Definitely feels like it should drain IMO

@GregHanson
Member

GregHanson commented Jun 9, 2023

This also impacts waypoints. If a long-lived connection exists and a waypoint is deployed, ztunnel will not direct traffic through the waypoint until the original connection is closed and re-established.

Steps to reproduce using the Online Boutique microservices demo:

istioctl install --set hub=istio --set tag=1.18.0 -y --set profile=ambient --set meshConfig.accessLogFile=/dev/stdout
kubectl label namespace default istio.io/dataplane-mode=ambient --overwrite
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/main/release/kubernetes-manifests.yaml

# deploy namespace wide waypoint
istioctl x waypoint apply -n default

# no access logs
kubectl logs deploy/namespace-istio-waypoint -f

kubectl delete pod -lapp=loadgenerator

# access logs generated per request as expected
kubectl logs deploy/namespace-istio-waypoint -f

@kyessenov

kyessenov commented Jun 9, 2023

I don't understand what "drain" here means. There's no general drain strategy for TCP connections. You can only abruptly terminate them. Is that what's proposed?

@GregHanson That's a good point in favor of abrupt termination since authorization depends on Waypoints for HTTP rules, so it'd be incorrect not to route to Waypoint indefinitely.

@howardjohn
Member Author

howardjohn commented Jun 9, 2023 via email

@kyessenov

kyessenov commented Jun 9, 2023

I think never terminating would be dangerous in an eventually consistent system. When workloads and policies are deployed concurrently (enabling a waypoint is a policy), the order in which things open connections should not matter. So it seems necessary to terminate eventually, and the challenge is defining when to terminate connections and limiting the scope to only those that should be terminated.

@linsun
Member

linsun commented Jun 10, 2023

Agreed, I'd expect updated RBAC to become effective even with long-lived connections.

cc @louiscryan for input

@howardjohn
Member Author

I agree with the concerns about eventual consistency. For RBAC it makes more sense, I think - we would explicitly close the connection if it were new. For a waypoint being added, though, closing the connection seems unnecessarily aggressive.

@bleggett
Contributor

bleggett commented Jun 12, 2023

I agree with the concerns about eventual consistency. For RBAC it makes more sense, I think - we would explicitly close the connection if it were new. For a waypoint being added, though, closing the connection seems unnecessarily aggressive.

Dropping a connection that would previously have been allowed but no longer is, as soon as we are aware of that, regardless of its age, is a non-negotiable obligation on our end, IMO.

Scoping that to "pure RBAC" so we don't have to do the same when L7 concerns are added with Waypoints is both a complex behavior to actually model, and creates behavioral exceptions in ztunnel around connection handling that make it harder to reason about - I think it costs more than it saves to parse the behavior differently here.

I think we just drop in both cases, because that's the simplest, safest, and most obvious behavior.

@howardjohn
Member Author

I don't agree it's simpler or safer. It's not simpler, as we need to go implement it, which is strictly added complexity. It's not safer, because it abruptly tears down users' connections.

I don't think most users will be happy when they deploy a waypoint and suddenly get a massive surge of 503s. Lin gave a talk about dynamically adding a waypoint to do a canary rollout - interestingly, it's broken in both cases:

  • If we close the connections, users will get 503s, which has a negative impact and may affect the canary analysis (it looks like the new version is broken, when in reality just the waypoint enablement broke things).
  • If we don't close the connections, clients may not respect the canary rules.

Both are pretty unfortunate

@kyessenov

I think we're over-indexing on TCP handling in ztunnel. If it were operating on packets, it would be completely natural to drop them as soon as routing + RBAC imply so. Since ztunnel has to support UDP and L3 eventually, dropping is the necessary thing to do. The fact that errors are opaque is a general problem with non-application-level protocols.

@howardjohn
Member Author

howardjohn commented Jun 12, 2023 via email

@bleggett
Contributor

worth surveying the landscape

Fair enough.

In general, delaying the application of authZ rules once you have them is a security hole.

It's unclear to me what is gained, ultimately, by choosing to carry that security hole.

@hzxuzhonghu
Member

I am not sure how we can be aware of the RBAC failure after an HBONE request was sent; IIUC the authz is only done once per TCP connection.

@istio-policy-bot istio-policy-bot added the lifecycle/automatically-closed Indicates a PR or issue that has been closed automatically. label Dec 14, 2023
@keithmattix keithmattix reopened this Dec 14, 2023
@linsun linsun added Ambient Beta Must have for OSS Beta of Ambient Mesh and removed lifecycle/automatically-closed Indicates a PR or issue that has been closed automatically. labels Dec 14, 2023
@ilrudie
Contributor

ilrudie commented Dec 14, 2023

It seems there are (potentially) two different scenarios here:

  1. A change to traffic flow is requested, i.e. they added a waypoint. In this scenario, traffic which is bypassing the waypoint may be something L4 is obliged to stop.
  2. A change to existing L4 policy. In this case we should probably try to evaluate whether or not existing connections violate the new policy. If they do, we're likely obliged to halt them; if they are still allowed, we don't need to do anything.

I think we're also able to say that item 1 isn't actually a real thing. The existence of a waypoint doesn't imply in any way that traffic strictly must use it. Users should express this via the auth API, and thus item 2 is really the only problem we're concerned with. It could potentially be a property exposed via the SuperService whether or not we manage these policies for the user.

@howardjohn
Member Author

A change to existing L4 policy. In this case we should probably try to evaluate whether or not existing connections violate the new policy. If they do, we're likely obliged to halt them; if they are still allowed, we don't need to do anything.

I definitely agree that if we change something, we should not do anything if the new policies are allowed (only if they deny, then we close). So we would not want a traditional Envoy "drain".

We still need to decide if we want to apply them to existing connections. I think we should.

@linsun
Member

linsun commented Dec 15, 2023

@ilrudie ok to assign this to you? Can we identify an action item for this in terms of beta for L4 first? It seems everyone agrees that we should do something when RBAC updates, even for long-lived connections.

@ilrudie
Contributor

ilrudie commented Dec 15, 2023

@linsun, Sure! Happy to work on this.

@linsun linsun moved this to In Progress in Ambient tracking Dec 15, 2023
@ilrudie
Contributor

ilrudie commented Jan 9, 2024

Added a topic to the ambient WG to try to finalize consensus here before diving into implementation.

@hzxuzhonghu
Member

My opinion is that for L7 we can deny future requests, but for L4 there is no need - and it is also hard, because the authz is applied per connection, while it is per request for L7.

@ilrudie
Contributor

ilrudie commented Jan 10, 2024

@hzxuzhonghu To be clear: you're saying you do want to close/drain existing connections, yeah?

@hzxuzhonghu
Member

Not exactly. As to L4, the RBAC is enforced once, after the connection is established; we do not have a chance to close the connection afterwards.

For L7, as the connection is shared and the authz is per request, there is no reason to close the connection.

@ilrudie
Contributor

ilrudie commented Jan 11, 2024

It should be possible to do, though. I think we can pass an rx channel through to our proxy functions, and then instead of awaiting proxy_to we can select on proxy_to and query the channel.
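A rough std-only sketch of that select idea (all names here are hypothetical - proxy_with_close stands in for the real async proxy path, and chunk sizes stand in for byte copies; ztunnel itself would use tokio::select! over the copy future and a close-notification future rather than polling):

```rust
use std::sync::mpsc;

// Hypothetical synchronous stand-in: the copy loop polls a close channel
// between chunks; a policy update can send on that channel to tear the
// connection down mid-transfer. Returns (bytes copied, closed early?).
fn proxy_with_close(chunk_sizes: &[usize], close_rx: &mpsc::Receiver<()>) -> (usize, bool) {
    let mut copied = 0;
    for &n in chunk_sizes {
        // In the async version this check is a select! branch racing the copy.
        if close_rx.try_recv().is_ok() {
            return (copied, true); // closed early by a policy update
        }
        copied += n; // stand-in for copying n bytes between sockets
    }
    (copied, false) // transfer completed normally
}

fn main() {
    // No close signal pending: the whole transfer completes.
    let (_tx, rx) = mpsc::channel::<()>();
    assert_eq!(proxy_with_close(&[3, 4], &rx), (7, false));

    // Close signal already pending: the connection is dropped at once.
    let (tx, rx) = mpsc::channel::<()>();
    tx.send(()).unwrap(); // simulate a policy update arriving
    assert_eq!(proxy_with_close(&[3, 4], &rx), (0, true));
    println!("ok");
}
```

The key property is that a pending close signal interrupts a transfer already in flight, rather than only gating connection establishment.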

@ilrudie
Contributor

ilrudie commented Jan 11, 2024

High-level sketch of what I'm looking at right now: having a proxy instance be able to handle its own connections. We'll either poll for auth updates or have a channel where something like a policy update can notify the routine that will close these connections (undecided). Each connection will have a channel it can check which will cause it to close, and those close channels will likely be keyed by rbac::Connection in a shared hashmap, so the proxy has a relatively easy time determining which connections must be closed. It's pretty WIP, but I hope to have some code to peek at soon.
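A minimal sketch of that bookkeeping (hypothetical names throughout - a (source, destination) string tuple stands in for rbac::Connection, and std mpsc channels stand in for whatever async notification primitive ztunnel would actually use):

```rust
use std::collections::HashMap;
use std::sync::mpsc;

// Hypothetical stand-in for rbac::Connection: (source identity, destination).
type ConnKey = (String, String);

// Shared map from connection key to that connection's close channel.
struct ConnectionTracker {
    close_senders: HashMap<ConnKey, mpsc::Sender<()>>,
}

impl ConnectionTracker {
    fn new() -> Self {
        Self { close_senders: HashMap::new() }
    }

    // Called at connection establishment; the proxy task keeps the receiver.
    fn register(&mut self, key: ConnKey) -> mpsc::Receiver<()> {
        let (tx, rx) = mpsc::channel();
        self.close_senders.insert(key, tx);
        rx
    }

    // `still_allowed` stands in for re-running assert_rbac under the updated
    // policy set. Only newly denied connections are signalled and dropped
    // from the map; still-allowed connections are left untouched.
    fn apply_policy_update(&mut self, still_allowed: impl Fn(&ConnKey) -> bool) -> usize {
        let mut closed = 0;
        self.close_senders.retain(|key, tx| {
            if still_allowed(key) {
                true
            } else {
                let _ = tx.send(()); // tell the proxy task to close
                closed += 1;
                false
            }
        });
        closed
    }
}

fn main() {
    let mut tracker = ConnectionTracker::new();
    let rx_a = tracker.register(("spiffe://ns/a".into(), "10.0.0.1:80".into()));
    let rx_b = tracker.register(("spiffe://ns/b".into(), "10.0.0.1:80".into()));

    // New policy denies source "spiffe://ns/b" only.
    let closed = tracker.apply_policy_update(|(src, _)| src != "spiffe://ns/b");
    assert_eq!(closed, 1);
    assert!(rx_a.try_recv().is_err()); // still allowed: no close signal
    assert!(rx_b.try_recv().is_ok()); // newly denied: told to close
    println!("closed {closed} connection(s)");
}
```

Signalling only the connections the new policy denies matches the behavior discussed above: no traditional drain, just targeted closes.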

@hzxuzhonghu
Member

But currently we do assert_rbac only once per connection.

@ilrudie
Contributor

ilrudie commented Jan 12, 2024

You're correct. At present we only assert_rbac once. The proposal is to have a way for the zt to assert_rbac on existing connections and then close/drain them via a channel. It's net new code/functionality for zt.

@ilrudie
Contributor

ilrudie commented Jan 16, 2024

Sorry, it'll be a bit tougher to read since it's built on top of ztunnel code which hasn't merged yet, but #772 has a partial implementation of what I was thinking about in terms of being able to assert policy against connections after they were established.

@mikemorris
Member

mikemorris commented Jan 17, 2024

I've seen the behavior of not re-authorizing long-lived connections when AuthZ policy changes confuse users, who wonder why their rules don't seem to be effective. This is something we had wanted to implement in Consul, but it seemed prohibitively difficult for our architecture without Envoy changes. The implementation @ilrudie is describing sounds substantially aligned with what I think the optimal user experience should be - excited to see progress on this!

@kyessenov

kyessenov commented Jan 17, 2024

I think you need to support draining at the tunnel protocol level to make enforcement tolerable in terms of UX. What I mean is that a connection that is destined to be terminated should signal any application-aware peers so they can terminate it gracefully. Ztunnel cannot do this for HTTP, but if a waypoint is in use, an HTTP connection to ztunnel should not be abruptly closed by ztunnel without asking the waypoint to send a GOAWAY frame. Otherwise, this is a classical 503 scenario and not acceptable in production.

@ilrudie
Contributor

ilrudie commented Jan 17, 2024

It seems to me that layer targeted policy would be helpful here. If you target L4 then it's explicitly not application enforced and DROP seems like suitable behavior. If you want some sort of gentle application-aware handling then target L7 layer with your policy.

@bleggett
Contributor

bleggett commented Jan 18, 2024

It seems to me that layer targeted policy would be helpful here. If you target L4 then it's explicitly not application enforced and DROP seems like suitable behavior. If you want some sort of gentle application-aware handling then target L7 layer with your policy.

+1, and we should already get that in most cases IIUC - if a policy is applied that should terminate an L7 connection, the waypoint should terminate it, and ZT is none the wiser.

If you nuke connections via policy at the L4 layer, regardless of whether the connection might be carrying a higher-level protocol, there's little we can do but drop it.

I guess we could set up a backchannel to the waypoint to inform it and give it time to do something, but that adds complexity and I don't see it as a beta blocker.

@ilrudie
Contributor

ilrudie commented Jan 18, 2024

I guess we could set up a backchannel to the waypoint to inform it and give it time to do something, but that adds complexity and I don't see it as a beta blocker.

Yeah, it seems plausible to build, but it would likely wind up as a special behavior of the reference implementation, requiring Istio's ztunnel and waypoint. I'd agree re: beta, and add that having a good methodology for achieving this type of behavior without tightly coupled components is probably a strong start.

@linsun
Member

linsun commented Jan 18, 2024

It seems to me that layer targeted policy would be helpful here. If you target L4 then it's explicitly not application enforced and DROP seems like suitable behavior. If you want some sort of gentle application-aware handling then target L7 layer with your policy.

+1 on this as well. I've observed DROP for L4 policy for CNI behaviors as well.

@kyessenov

Yeah, I think it's pretty clear that you can't drain L4 RBAC. So the answer to the issue title is "no", and the choice is how often to enforce policies to drop (at start or periodically).

@ilrudie
Contributor

ilrudie commented Jan 18, 2024

Yeah, I think it's pretty clear that you can't drain L4 RBAC. So the answer to the issue title is "no", and the choice is how often to enforce policies to drop (at start or periodically).

Indeed. "drain" and "rbac" here are probably not that ideal... #naming-hard

@ilrudie
Contributor

ilrudie commented Jan 18, 2024

If anyone was holding off on taking a look because the diff was huge: I've rebased against master now that Yuval's inpod work has merged, and it's much more manageable.

@zengyuxing007

Istio's L4 AuthorizationPolicy functionality essentially overlaps with Kubernetes NetworkPolicy.
NetworkPolicy usually does not affect existing connections; it only takes effect for new connections.

@zengyuxing007

istio/istio#48876 is a similar issue, seen at the beginning of an Ambient deployment.

@bleggett
Contributor

bleggett commented Feb 5, 2024

NetworkPolicy usually does not process existing connections, but only takes effect for new connections.

it is implementation defined as to whether the change will take effect for that existing connection or not. Example: A policy is created that leads to denying a previously allowed connection, the underlying network plugin implementation is responsible for defining if that new policy will close the existing connections or not.

Since it's implementation-defined and not part of NetPol to dictate one or the other, we can do what makes sense for us.

NetPol and kubernetes do not make the same security or policy guarantees or have the same scoping as Istio AuthPolicy.

NetworkPolicy is fine for basics. Istio's L4 AuthPolicy is an optional extension that composes on top of k8s NetworkPolicy, providing both more policy controls and more security guarantees.

It is entirely fine for Istio's AuthPolicy impl to be more strict than either the K8S NetworkPolicy standard, or a specific implementation of the K8S NetworkPolicy standard.

@alpha-baby

The decision to forcibly terminate existing connections has both advantages and disadvantages. It ultimately depends on which aspects users value more, and giving them the choice to decide would likely be a better approach.

Perhaps an option to enforce connection termination could be added to the L4 RBAC policy.

Example in my PR: istio/istio#49106

@ilrudie
Contributor

ilrudie commented Feb 22, 2024

It seems fundamentally different to me. In your PR, the intent is to allow not severing existing connections when adding a workload to ambient. One could say the intent is clear there, and there are plenty of controls to manage it, BUT adding to ambient isn't an explicitly "destructive" operation in the way denying connectivity is. In the case of an auth policy change, it would be allowing a connection the user explicitly denied to continue, which seems odd since they have taken an action specifically to deny this connectivity. User intent seems way less ambiguous when they've said "deny this connection".

Louis's comments about disconnecting cause from effect also hold a lot of sway here, IMO. If the user made a mistake, it's better they find out when they make it vs. at some indeterminate point in the future. We do them no favors by letting them accidentally create a ticking time bomb.

@linsun
Member

linsun commented Apr 4, 2024

Hi @ilrudie - should we close this? I recall you implemented this already?

@ilrudie
Contributor

ilrudie commented Apr 4, 2024

@linsun - yeah, closed by #772

@ilrudie ilrudie closed this as completed Apr 4, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Ambient tracking Apr 4, 2024