Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race between untrusted operations and events websocket listener #12994

Open
masnax opened this issue Feb 29, 2024 · 7 comments
Open

Race between untrusted operations and events websocket listener #12994

masnax opened this issue Feb 29, 2024 · 7 comments
Assignees
Labels
Bug Confirmed to be a bug
Milestone

Comments

@masnax
Copy link
Contributor

masnax commented Feb 29, 2024

An issue causing intermittent test failures in the clustering image refresh tests was fixed by e7b3c92

The problem was a race where an operation would complete after setting up the events websocket listener, but before the listener is added to the event handler, and so we would block while waiting for the operation complete event. The fix works by reducing the time between the creation of the listener and adding it to the handler.

While this works, I think it is a bit obscure and overly couples the client with the order of operations on the server. It made more sense to me to just ensure we set up the client and all its listeners before we make any request to create an operation, which those listeners are meant to track. While testing this over here: #12885 I found a separate race that is connected to the one solved above, resulting in this error: https://github.com/canonical/lxd/actions/runs/7920353180/job/21623203927?pr=12885#step:13:15409

This happens if we run lxc image copy localhost:${fingerprint} lxd2: --mode=pushto copy an image across 2 remotes.

Internally, we make the following POST /1.0/images/{fingerprint}/export to localhost here. From this endpoint we make a CreateImage request to lxd2 without a client keypair. The request is untrusted since it's run from the server, but when we add lxd2 as a remote, it only trusts the client, not the server.

The race comes in here when CreateImage performs the request which spawns an operation on lxd2. Around this time, it also sets up an event listener which requires trust. If the operation completes before we set up the listener then everything works fine, but if the listener manages to start up first, then the request will fail with an unauthorized error.

I think a simple solution in this case would be to just ensure that we don't spawn an events listener if the client hasn't been assigned a keypair. We can infer from this that the request we are trying to make is going to be untrusted, so we shouldn't try to separately connect to a trusted endpoint.

@tomponline
Copy link
Member

I think a simple solution in this case would be to just ensure that we don't spawn an events listener if the client hasn't been assigned a keypair. We can infer from this that the request we are trying to make is going to be untrusted, so we shouldn't try to separately connect to a trusted endpoint.

This makes sense. Although I am wondering why the image copy succeeds if the two servers dont trust each other and there is no temporary key?

@tomponline tomponline added the Bug Confirmed to be a bug label Feb 29, 2024
@tomponline tomponline added this to the lxd-6.1 milestone Feb 29, 2024
@masnax
Copy link
Contributor Author

masnax commented Feb 29, 2024

I think a simple solution in this case would be to just ensure that we don't spawn an events listener if the client hasn't been assigned a keypair. We can infer from this that the request we are trying to make is going to be untrusted, so we shouldn't try to separately connect to a trusted endpoint.

This makes sense. Although I am wondering why the image copy succeeds if the two servers dont trust each other and there is no temporary key?

There is a secret set in the request header or operation metadata which is manually verified in the corresponding images and operations endpoints, if that's what you mean by using a temporary key.

I looked into handling the secret in the events endpoints as well, but I couldn't figure out an elegant solution for verifying it.
Best I thought of is to intercept any events going through the websocket and only propagate the events which contain an operation that has the secret in its metadata. That seems overly complex to me.

@tomponline
Copy link
Member

@masnax ok in that case im not following what you mean by:

I think a simple solution in this case would be to just ensure that we don't spawn an events listener if the client hasn't been assigned a keypair

@masnax
Copy link
Contributor Author

masnax commented Feb 29, 2024

@masnax ok in that case im not following what you mean by:

I think a simple solution in this case would be to just ensure that we don't spawn an events listener if the client hasn't been assigned a keypair

The client function which spawns the racy event listener is CreateImage. It sets up an http.Request with the secret in its header. Performing the request spawns an operation.

The request that spawns the operation bypasses the the normal auth process of comparing the request peer certs to the locally trusted certificates, and instead verifies the secret, which was sent to that server beforehand.

CreateImage returns an operation struct. When we call operation.Wait, we create an events websocket listener which will listen for any type of event (logging, operations, etc). This listener hits the /1.0/events endpoint which only supports the traditional auth method.

@masnax
Copy link
Contributor Author

masnax commented Feb 29, 2024

So what I'm suggesting is to not leave spawning the listener until after the operation is created and we call operation.Wait, but instead create it before we make the request that spawns the operation. We do this under the constraint that the underlying ProtocolLXD actually has a client keypair assigned to it, and otherwise set skipListener to true.

@tomponline
Copy link
Member

@masnax ok lets have a look at a PR that does this when you get a chance. Thanks

@tomponline
Copy link
Member

Filtering events by what the user can see, ala lxc/incus#292 would be helpful here.

@markylaing says

Ultimately, filtering events by what the user is able to view would solve the PR. Perhaps we can add an operation_secret query parameter and allow untrusted requests on the events endpoint.

@tomponline tomponline modified the milestones: lxd-6.1, lxd-6.2 May 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Confirmed to be a bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants