Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid race between operation and events listener #12885

Closed
wants to merge 7 commits into from

Conversation

masnax
Copy link
Contributor

@masnax masnax commented Feb 13, 2024

Closes #12994

If we implicitly create the events listener after the operation has fired, then there's a chance the operation will have completed and emitted the corresponding event before we even begin listening. In many cases, the opposite happens too where the operation finishes so quickly after we start creating the events listener that we receive it on the first HTTP request and then never actually use the events listener.

This leads to an obscure race condition where a client which is authenticated with a temporary key tries to open an events listener, which only supports TLS authentication. Most of the time, the operation returns promptly enough that we never actually end up connecting to the events API which would return an unauthorized error if we hit it.

To avoid having to keep track of all of these pathways and avoid running into obscure races, we shouldn't try to create an events listener automatically when setting up the wait process as we have no way of determining the current state of an operation at this point.

If we want an events listener with our operation, we should make sure to set it up before making the operation request. To do this, we need to make sure the client can actually talk to the events API. We can determine this by checking whether or not the client has any TLS certificates. If it does, then we may be able to authenticate with the /1.0/events API, but if not then we definitely can't.

  • Adds a supportsAuthentication field to protocolLXD which determines whether the client should also try to target the /1.0/events API automatically to set up an event websocket, in the special cases (for image creation, for example) where we re-implement the logic from queryOperation.

  • Ensures we always try to create the /1.0/events listener before sending the request for the operation to be started.

  • Removes the now redundant skipListener field, as the operation wait process will not spawn its own listener if the one from the client is nil.

  • Ensures all calls to AddHandler are on an operation that was created with a client that has TLS authentication support.

@masnax
Copy link
Contributor Author

masnax commented Feb 13, 2024

Hm, too optimistic I guess. Drafting for now.

@masnax
Copy link
Contributor Author

masnax commented Feb 14, 2024

Okay, so the test failures were due to remotes that were unauthorized to interact with the events API. Previously they all set up their events listeners after the operation events had all been emitted already, so there was no error, but now that the events listener is set up beforehand in every case, they're hitting an unauthorized error on GET /1.0/events.

In particular the remote_usage tests were failing to copy an image to a remote, because that remote wasn't authorized to talk back to the events API. I've worked around it by just removing the events listener from the CreateImage client function.

I can make all of the client functions skip the event listener and that should solve the problem, but to be honest I'm not too familiar with the events API so I don't exactly know why it's used with these operations in the first place. If there's a good reason to keep them around, then we can try to do something fancy and detect in the client if the action involves a remote, and if so check if that remote is also trusted by us, and only then include the events listener.

@markylaing
Copy link
Contributor

The /1.0/events API is being used as a catch-all in the client so that only one websocket is used at a time. I've encountered issues with it when working with authorization as well. For example, a user with permission to exec into an instance but without permission to view all events in a project is unable to do so via the client. The addition of the skipListener argument was intended to fix this by instead using /1.0/operations/wait but unfortunately we have found that long-polling can be very inconsistent (especially when there is an intermediate load-balancer).

Tom and I came up with the idea of adding websocket support to the operation wait endpoint to counteract this (https://warthogs.atlassian.net/browse/LXD-726?atlOrigin=eyJpIjoiMTM1ZjY4NTgyY2MxNDAzOGIwMjk2MTQzODYzZGZkODgiLCJwIjoiaiJ9). It will need more discussion before we implement anything though.

@masnax
Copy link
Contributor Author

masnax commented Feb 14, 2024

It sounds to me like these connections should actually be trusted by the remote then. Maybe we need an access handler on GET /1.0/events that allows untrusted requests to collect only operation type events, since GET /1.0/operations/id/wait is already untrusted.

@masnax
Copy link
Contributor Author

masnax commented Feb 15, 2024

So untrusted operation requests instead require a secret passed as a query parameter here:

lxd/lxd/operations.go

Lines 907 to 924 in 0487e51

secret := r.FormValue("secret")
trusted, _, _, _, _ := d.Authenticate(nil, r)
if !trusted && secret == "" {
return response.Forbidden(nil)
}
timeoutSecs, err := shared.AtoiEmptyDefault(r.FormValue("timeout"), -1)
if err != nil {
return response.InternalError(err)
}
// First check if the query is for a local operation from this node
op, err := operations.OperationGetInternal(id)
if err == nil {
if secret != "" && op.Metadata()["secret"] != secret {
return response.Forbidden(nil)
}

Making GET /1.0/events untrusted by default then would require us to pass in the secret as well (likely as a query parameter) and throw out non-operation type events and any operation events that don't also have the secret in their metadata.

So this would mean that the events API would only serve exactly one operation to an untrusted remote.

@tomponline
Copy link
Member

@masnax are you going to land the cherry-pick from Incus separately?

@tomponline
Copy link
Member

@masnax shall we close this for now?

This field can be used to determine whether the client is capable of
making potentially trusted connections to its target server.

This will be useful when determining whether the client should
automatically spawn an events websocket listener for operations and
other events which require authentication with the
`/1.0/events` endpoint.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
The /1.0/events API only allows authenticated connections.

When making a request that spawns an operation, we try to
automatically connect to the events API, but the client may
not necessarily support authentication if it is using a
temporary key to establish trust with a particular endpoint.

This leads to an obscure Forbidden error even though the actual
request we made was trusted. Instead, this ensures we don't try to
incorrectly connect to the events API if the underlying ProtocolLXD
does not support authentication.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
To ensure we have begun listening for events before the operation has
completed, we should explicitly create the events listener for the
operation before actually initiating the request that creates and starts
the operation.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
…ration

If we implicitly create the events listener after the operation has
fired, then there's a chance the operation will have completed and
emitted the corresponding event before we even begin listening.

Waiting for an operation can thus potentially block indefinitely,
if we first encounter the operation in a running state, but then
it completes before we set up the listener.

Instead, we shouldn't try to create an events listener automatically
when setting up the wait process as we have no way of determining the
current state of an operation yet.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
As the `setupListener` method will no longer be responsible for spawning
a listener, we can simply check if the operation has a non-nil events
listener to determine whether we should use /1.0/operations/wait or
/1.0/events to track the operation.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
@masnax masnax marked this pull request as ready for review March 18, 2024 19:14
@masnax
Copy link
Contributor Author

masnax commented Mar 18, 2024

@tomponline As we had discussed a few weeks ago, I've also tested to make sure this doesn't interfere with the VM agent. It seems to work fine for launching / execing into VMs.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
In many cases, we call `AddHandler` directly after creating an
operation, expecting it to support connection to an events websocket.

This won't be the case if the client is unable to authenticate with that
endpoint.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
ctxConnectedCancel: ctxConnectedCancel,
eventConns: make(map[string]*websocket.Conn),
eventListeners: make(map[string][]*EventListener),
supportsAuthentication: t.TLSClientConfig != nil && len(t.TLSClientConfig.Certificates) > 0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markylaing @masnax will this work with OIDC authenticated clients?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, I'm guessing no since they won't have TLS certs. @markylaing is there something we can check to see if the client intends to use OIDC auth?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two indicators from ConnectionArgs: AuthType and OIDCTokens. I don't think this will work reliably though, AuthType may or may not be set (we have a default preference for OIDC over TLS), and OIDCTokens will not be set if the client has not already authenticated.

ProtocolLXD has an oidcClient field that will be non-nil if the client has an access token for that remote, so you could potentially use that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests are passing suggesting that the tests arent covering this scenario with the mini-oidc service, can they be updated to check for the regression now and then we can see the fix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know much about the mini-oidc tests, but the related change from this PR just determines whether we use long polling or a websocket to track an operation. So we would need a test that explicitly tries to connect to the events websocket to track an operation, and fail if we hit the /1.0/operations/wait API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markylaing Is there a case where the client should be able to create and then track long-lived operations without having can_view_events & can_view_projects? Would it be valid and expected to work if a user tries to create/edit/manage instances/images/storage without also having can_view_events and can_view_projects?

My expectation was no, that you would need a set of entitlements inclusive of those two for those types of actions, and in that case checking for an intent to authenticate should be enough, we don't necessarily need to know if the connection will be authorized for /1.0/events. We already don't for TLS auth, since we just check if certs exist, not if they're actually going to be accepted by the server. So we could let the endpoint just handle the error and correctly return 403 if a client is authorized to create or manage images/instances/storage, but not authorized to use the events API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, ideally we would like to decouple this and have the entitlements work in isolation. For example, a client with can_exec on an instance shouldn't be required to have can_view_events for the whole project in order to perform the exec.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't been able to dive into decoupling these yet but my thinking is that adding websocket functionality to /1.0/operations/{uuid}/wait OR adding a /1.0/operations/{uuid}/events would be the way to go. Both can be untrusted and require the websocket secret to function. We would then stop using events in the client except for when specifically using lxc monitor for example.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep in mind the client and lxc command still need to be able to work with older servers

Copy link
Contributor Author

@masnax masnax Mar 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, ideally we would like to decouple this and have the entitlements work in isolation. For example, a client with can_exec on an instance shouldn't be required to have can_view_events for the whole project in order to perform the exec.

I'm not sure I follow, that would still behave the same if we infer authentication:

  • The client intends to use OIDC auth, but does not have can_exec, so the request fails with 403 anyway, so it wouldn't matter that we set up a bad websocket connection.
  • The client does not intend to use OIDC auth, but tries to exec anyway, so the request fails as it's untrusted. In this case we would have skipped the websocket too, but it's ineffectual.
  • The client intends to use OIDC auth, so we set up the websocket listener. The client has all the necessary entitlements so the request succeeds.

Seems to me that's true especially if we want to have entitlements work in isolation. That would mean that whenever a client is authorized for a particular action that can involve a websocket, it must necessarily be authorized to start a websocket on that action by the fact that its authorized for the action itself. As you said, we don't need can_view_events to view events in that case.

Can you give an example of a situation where we might incorrectly infer that we should start a websocket for a particular action, that action is in fact authorized, but the websocket connection is not?

@@ -207,8 +207,10 @@ func (r *ProtocolLXD) tryCreateContainer(req api.ContainersPost, urls []string)

rop.targetOp = op

for _, handler := range rop.handlers {
_, _ = rop.targetOp.AddHandler(handler)
if r.supportsAuthentication {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use SupportsAuthentication everywhere in client so that when we search for usage of this function we get a clear picture of the usage.

_, _ = rop.targetOp.AddHandler(handler)
if r.supportsAuthentication {
for _, handler := range rop.handlers {
_, _ = rop.targetOp.AddHandler(handler)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be ignoring the error returned from AddHandler?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen this sort of thing all over the place wherever we try to set up an events listener with getEvents or AddHandler and I didn't understand the reason.

@tomponline
Copy link
Member

@masnax should this be moved back into draft status?

@tomponline tomponline marked this pull request as draft April 5, 2024 07:18
@tomponline
Copy link
Member

Moved back to draft status as not active.

@tomponline
Copy link
Member

@masnax shall we close this until you are able to work on it again? To get the open PR list down.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race between untrusted operations and events websocket listener
3 participants