Avoid race between operation and events listener #12885

masnax · 2024-02-13T17:51:41Z

If we implicitly create the events listener after the operation has fired, then there's a chance the operation will have completed and emitted the corresponding event before we even begin listening. In many cases, the opposite happens too where the operation finishes so quickly after we start creating the events listener that we receive it on the first HTTP request and then never actually use the events listener.

This leads to an obscure race condition where a client which is authenticated with a temporary key tries to open an events listener, which only supports TLS authentication. Most of the time, the operation returns promptly enough that we never actually end up connecting to the events API which would return an unauthorized error if we hit it.

To avoid having to keep track of all of these pathways and avoid running into obscure races, we shouldn't try to create an events listener automatically when setting up the wait process as we have no way of determining the current state of an operation at this point.

If we want an events listener with our operation, we should make sure to set it up before making the operation request. To do this, we need to make sure the client can actually talk to the events API. We can determine this by checking whether or not the client has any TLS certificates. If it does, then we may be able to authenticate with the /1.0/events API, but if not then we definitely can't.

Adds a supportsAuthentication field to protocolLXD which determines whether the client should also try to target the /1.0/events API automatically to set up an event websocket, in the special cases (for image creation, for example) where we re-implement the logic from queryOperation.
Ensures we always try to create the /1.0/events listener before sending the request for the operation to be started.
Removes the now redundant skipListener field, as the operation wait process will not spawn its own listener if the one from the client is nil.
Ensures all calls to AddHandler are on an operation that was created with a client that has TLS authentication support.

masnax · 2024-02-13T17:59:51Z

Hm, too optimistic I guess. Drafting for now.

masnax · 2024-02-14T02:20:39Z

Okay, so the test failures were due to remotes that were unauthorized to interact with the events API. Previously they all set up their events listeners after the operation events had all been emitted already, so there was no error, but now that the events listener is set up beforehand in every case, they're hitting an unauthorized error on GET /1.0/events.

In particular the remote_usage tests were failing to copy an image to a remote, because that remote wasn't authorized to talk back to the events API. I've worked around it by just removing the events listener from the CreateImage client function.

I can make all of the client functions skip the event listener and that should solve the problem, but to be honest I'm not too familiar with the events API so I don't exactly know why it's used with these operations in the first place. If there's a good reason to keep them around, then we can try to do something fancy and detect in the client if the action involves a remote, and if so check if that remote is also trusted by us, and only then include the events listener.

markylaing · 2024-02-14T09:05:24Z

The /1.0/events API is being used as a catch-all in the client so that only one websocket is used at a time. I've encountered issues with it when working with authorization as well. For example, a user with permission to exec into an instance but without permission to view all events in a project is unable to do so via the client. The addition of the skipListener argument was intended to fix this by instead using /1.0/operations/wait but unfortunately we have found that long-polling can be very inconsistent (especially when there is an intermediate load-balancer).

Tom and I came up with the idea of adding websocket support to the operation wait endpoint to counteract this (https://warthogs.atlassian.net/browse/LXD-726?atlOrigin=eyJpIjoiMTM1ZjY4NTgyY2MxNDAzOGIwMjk2MTQzODYzZGZkODgiLCJwIjoiaiJ9). It will need more discussion before we implement anything though.

masnax · 2024-02-14T18:57:11Z

It sounds to me like these connections should actually be trusted by the remote then. Maybe we need an access handler on GET /1.0/events that allows untrusted requests to collect only operation type events, since GET /1.0/operations/id/wait is already untrusted.

masnax · 2024-02-15T02:56:33Z

So untrusted operation requests instead require a secret passed as a query parameter here:

lxd/lxd/operations.go

Lines 907 to 924 in 0487e51

 secret := r.FormValue("secret") 

 trusted, _, _, _, _ := d.Authenticate(nil, r) 

 if !trusted && secret == "" { 

 return response.Forbidden(nil) 

 } 

 timeoutSecs, err := shared.AtoiEmptyDefault(r.FormValue("timeout"), -1) 

 if err != nil { 

 return response.InternalError(err) 

 } 

 // First check if the query is for a local operation from this node 

 op, err := operations.OperationGetInternal(id) 

 if err == nil { 

 if secret != "" && op.Metadata()["secret"] != secret { 

 return response.Forbidden(nil) 

 }

Making GET /1.0/events untrusted by default then would require us to pass in the secret as well (likely as a query parameter) and throw out non-operation type events and any operation events that don't also have the secret in their metadata.

So this would mean that the events API would only serve exactly one operation to an untrusted remote.

tomponline · 2024-02-20T12:42:32Z

@masnax are you going to land the cherry-pick from Incus separately?

tomponline · 2024-02-21T10:11:03Z

@masnax shall we close this for now?

This field can be used to determine whether the client is capable of making potentially trusted connections to its target server. This will be useful when determining whether the client should automatically spawn an events websocket listener for operations and other events which require authentication with the `/1.0/events` endpoint. Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>

The /1.0/events API only allows authenticated connections. When making a request that spawns an operation, we try to automatically connect to the events API, but the client may not necessarily support authentication if it is using a temporary key to establish trust with a particular endpoint. This leads to an obscure Forbidden error even though the actual request we made was trusted. Instead, this ensures we don't try to incorrectly connect to the events API if the underlying ProtocolLXD does not support authentication. Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>

To ensure we have begun listening for events before the operation has completed, we should explicitly create the events listener for the operation before actually initiating the request that creates and starts the operation. Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>

…ration If we implicitly create the events listener after the operation has fired, then there's a chance the operation will have completed and emitted the corresponding event before we even begin listening. Waiting for an operation can thus potentially block indefinitely, if we first encounter the operation in a running state, but then it completes before we set up the listener. Instead, we shouldn't try to create an events listener automatically when setting up the wait process as we have no way of determining the current state of an operation yet. Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>

As the `setupListener` method will no longer be responsible for spawning a listener, we can simply check if the operation has a non-nil events listener to determine whether we should use /1.0/operations/wait or /1.0/events to track the operation. Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>

masnax · 2024-03-18T19:17:51Z

@tomponline As we had discussed a few weeks ago, I've also tested to make sure this doesn't interfere with the VM agent. It seems to work fine for launching / execing into VMs.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>

In many cases, we call `AddHandler` directly after creating an operation, expecting it to support connection to an events websocket. This won't be the case if the client is unable to authenticate with that endpoint. Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>

tomponline · 2024-03-19T09:05:33Z

client/connection.go

+ ctxConnectedCancel: ctxConnectedCancel,
+ eventConns: make(map[string]*websocket.Conn),
+ eventListeners: make(map[string][]*EventListener),
+ supportsAuthentication: t.TLSClientConfig != nil && len(t.TLSClientConfig.Certificates) > 0,


@markylaing @masnax will this work with OIDC authenticated clients?

At the moment, I'm guessing no since they won't have TLS certs. @markylaing is there something we can check to see if the client intends to use OIDC auth?

There are two indicators from ConnectionArgs: AuthType and OIDCTokens. I don't think this will work reliably though, AuthType may or may not be set (we have a default preference for OIDC over TLS), and OIDCTokens will not be set if the client has not already authenticated.

ProtocolLXD has an oidcClient field that will be non-nil if the client has an access token for that remote, so you could potentially use that.

The tests are passing suggesting that the tests arent covering this scenario with the mini-oidc service, can they be updated to check for the regression now and then we can see the fix?

I don't know much about the mini-oidc tests, but the related change from this PR just determines whether we use long polling or a websocket to track an operation. So we would need a test that explicitly tries to connect to the events websocket to track an operation, and fail if we hit the /1.0/operations/wait API.

@markylaing Is there a case where the client should be able to create and then track long-lived operations without having can_view_events & can_view_projects? Would it be valid and expected to work if a user tries to create/edit/manage instances/images/storage without also having can_view_events and can_view_projects?

My expectation was no, that you would need a set of entitlements inclusive of those two for those types of actions, and in that case checking for an intent to authenticate should be enough, we don't necessarily need to know if the connection will be authorized for /1.0/events. We already don't for TLS auth, since we just check if certs exist, not if they're actually going to be accepted by the server. So we could let the endpoint just handle the error and correctly return 403 if a client is authorized to create or manage images/instances/storage, but not authorized to use the events API.

Yes, ideally we would like to decouple this and have the entitlements work in isolation. For example, a client with can_exec on an instance shouldn't be required to have can_view_events for the whole project in order to perform the exec.

I haven't been able to dive into decoupling these yet but my thinking is that adding websocket functionality to /1.0/operations/{uuid}/wait OR adding a /1.0/operations/{uuid}/events would be the way to go. Both can be untrusted and require the websocket secret to function. We would then stop using events in the client except for when specifically using lxc monitor for example.

Please keep in mind the client and lxc command still need to be able to work with older servers

Yes, ideally we would like to decouple this and have the entitlements work in isolation. For example, a client with can_exec on an instance shouldn't be required to have can_view_events for the whole project in order to perform the exec.

I'm not sure I follow, that would still behave the same if we infer authentication:

The client intends to use OIDC auth, but does not have can_exec, so the request fails with 403 anyway, so it wouldn't matter that we set up a bad websocket connection.

The client does not intend to use OIDC auth, but tries to exec anyway, so the request fails as it's untrusted. In this case we would have skipped the websocket too, but it's ineffectual.

The client intends to use OIDC auth, so we set up the websocket listener. The client has all the necessary entitlements so the request succeeds.

Seems to me that's true especially if we want to have entitlements work in isolation. That would mean that whenever a client is authorized for a particular action that can involve a websocket, it must necessarily be authorized to start a websocket on that action by the fact that its authorized for the action itself. As you said, we don't need can_view_events to view events in that case.

Can you give an example of a situation where we might incorrectly infer that we should start a websocket for a particular action, that action is in fact authorized, but the websocket connection is not?

tomponline · 2024-03-19T09:12:10Z

client/lxd_containers.go

@@ -207,8 +207,10 @@ func (r *ProtocolLXD) tryCreateContainer(req api.ContainersPost, urls []string)

 rop.targetOp = op

- for _, handler := range rop.handlers {
- _, _ = rop.targetOp.AddHandler(handler)
+ if r.supportsAuthentication {


Can you use SupportsAuthentication everywhere in client so that when we search for usage of this function we get a clear picture of the usage.

tomponline · 2024-03-19T09:15:13Z

client/lxd_containers.go

- _, _ = rop.targetOp.AddHandler(handler)
+ if r.supportsAuthentication {
+ for _, handler := range rop.handlers {
+ _, _ = rop.targetOp.AddHandler(handler)


Should we be ignoring the error returned from AddHandler?

I've seen this sort of thing all over the place wherever we try to set up an events listener with getEvents or AddHandler and I didn't understand the reason.

tomponline · 2024-03-28T10:00:53Z

@masnax should this be moved back into draft status?

tomponline · 2024-04-05T07:18:56Z

Moved back to draft status as not active.

tomponline · 2024-04-11T09:06:58Z

@masnax shall we close this until you are able to work on it again? To get the open PR list down.

masnax requested a review from tomponline as a code owner February 13, 2024 17:51

masnax marked this pull request as draft February 13, 2024 18:00

masnax force-pushed the event-op-race branch from af07b25 to 34ca0ba Compare February 14, 2024 01:51

masnax force-pushed the event-op-race branch from 34ca0ba to 4afce0a Compare February 15, 2024 18:06

masnax mentioned this pull request Feb 29, 2024

Race between untrusted operations and events websocket listener #12994

Open

masnax force-pushed the event-op-race branch from 4afce0a to c14750c Compare March 1, 2024 21:32

masnax force-pushed the event-op-race branch 2 times, most recently from 12da673 to 7f17365 Compare March 18, 2024 18:54

masnax added 5 commits March 18, 2024 19:02

masnax force-pushed the event-op-race branch from 7f17365 to 463feac Compare March 18, 2024 19:02

masnax marked this pull request as ready for review March 18, 2024 19:14

masnax added 2 commits March 18, 2024 20:37

client/lxd: Expose SupportsAuthentication field

861f757

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>

masnax force-pushed the event-op-race branch from 463feac to c3a3f44 Compare March 18, 2024 20:37

tomponline reviewed Mar 19, 2024

View reviewed changes

tomponline marked this pull request as draft April 5, 2024 07:18

masnax closed this Apr 11, 2024

markylaing mentioned this pull request Aug 22, 2024

Effective project handling in the events API #13962

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid race between operation and events listener #12885

Avoid race between operation and events listener #12885

masnax commented Feb 13, 2024 •

edited

Loading

masnax commented Feb 13, 2024

masnax commented Feb 14, 2024

markylaing commented Feb 14, 2024

masnax commented Feb 14, 2024

masnax commented Feb 15, 2024

tomponline commented Feb 20, 2024

tomponline commented Feb 21, 2024

masnax commented Mar 18, 2024

tomponline Mar 19, 2024

masnax Mar 19, 2024

markylaing Mar 20, 2024

tomponline Mar 20, 2024

masnax Mar 20, 2024

masnax Mar 21, 2024

markylaing Mar 21, 2024

markylaing Mar 21, 2024

tomponline Mar 21, 2024

masnax Mar 21, 2024 •

edited

Loading

tomponline Mar 19, 2024

tomponline Mar 19, 2024

masnax Mar 19, 2024

tomponline commented Mar 28, 2024

tomponline commented Apr 5, 2024

tomponline commented Apr 11, 2024

Avoid race between operation and events listener #12885

Avoid race between operation and events listener #12885

Conversation

masnax commented Feb 13, 2024 • edited Loading

masnax commented Feb 13, 2024

masnax commented Feb 14, 2024

markylaing commented Feb 14, 2024

masnax commented Feb 14, 2024

masnax commented Feb 15, 2024

tomponline commented Feb 20, 2024

tomponline commented Feb 21, 2024

masnax commented Mar 18, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

masnax Mar 21, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tomponline commented Mar 28, 2024

tomponline commented Apr 5, 2024

tomponline commented Apr 11, 2024

masnax commented Feb 13, 2024 •

edited

Loading

masnax Mar 21, 2024 •

edited

Loading